# Video 12. Linear Regression Analysis
***

- Example: a venture capital firm deciding which companies to invest in
- For simplicity, it considers only a single variable (R&D spending) in this case
- When the sample data is plotted:

![](img/200319/1.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- Companies that spend more on R&D tend to make higher profits, so the venture capital firm wants to invest in them


### Independent & Dependent Variables
- Example: Based on the amount of rainfall, how much would be the crop yield?

**INDEPENDENT VARIABLE**
- a variable whose value is not affected by the other variables and that is used to explain or predict the dependent variable
- often denoted as **X**
- for our case it would be rainfall

**DEPENDENT VARIABLE**
- a variable whose value changes when the values of the independent variables change
- often denoted as **Y**
- for our case it would be crop yield


### Numerical & Categorical Values

![](img/200319/2.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>


### ML Algorithms for Linear Regression
- LR falls in the category of Supervised ML algorithms
- The whole group of Regression algorithms contains:
    - Simple Linear Regression
    - Multiple Linear Regression
    - Polynomial Linear Regression
    
    
### Applications of Linear Regression
- Economic Growth
    - used to determine economic growth of a country or a state in the coming quarter
    - can also be used to predict the GDP of a country
- Product Price
    - can be used to predict what would be the price of a product in the future
- Housing Sales
    - to estimate the number of houses a builder would sell and at what price in the coming months
- Score Prediction
    - to predict the number of runs a player would score in the coming matches based on previous performance
    
    
### Understanding Linear Regression
- Linear Regression is a statistical model used to describe the relationship between independent and dependent variables by examining two questions:
    - Which variables in particular are significant predictors of the outcome variable?
    - How well does the regression line fit the data, so that predictions are as accurate as possible?
    
- The simple linear regression equation, with one dependent and one independent variable, is:

$$ \large y = mx + c $$

![](img/200319/3.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

Where:
- y = dependent variable
- x = independent variable
- c = intercept of the line (the value of y when x = 0)
- m = slope of the line

$$ \large m = \frac{y_2 - y_1}{x_2 - x_1} $$
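
As a quick check of the slope formula, the sketch below computes $ m $ from two points and then recovers $ c $ by rearranging $ y = mx + c $. The two points are made-up values chosen for illustration, not data from the video.

```python
def slope(x1, y1, x2, y2):
    """Slope of the line through (x1, y1) and (x2, y2)."""
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)

# Hypothetical points, chosen only for illustration
x1, y1 = 1.0, 3.0
x2, y2 = 3.0, 7.0

m = slope(x1, y1, x2, y2)   # (7 - 3) / (3 - 1)
c = y1 - m * x1             # rearranged from y = m*x + c

print(m, c)                 # 2.0 1.0
```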

Let's go back to the rainfall and crop yield example. If we plot rainfall on the x-axis and crop yield on the y-axis, we can draw a regression line that lets us predict the crop yield from the amount of rainfall:

![](img/200319/4.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>


#### Intuition behind the Regression line
- Let's consider a sample dataset with 5 rows and find out how to draw the regression line:

![](img/200319/5.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- If we go ahead and plot those points on a graph, we can see how a line would fit through the middle:

![](img/200319/6.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- The next thing we want to know is the *mean* of each variable:
    - **3** for the X variable
    - **4** for the Y variable
- If we plot the means on the graph it makes a nice line through the middle:

![](img/200319/7.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- From that point we can calculate the relevant features in the dataset: $ X^2 $, $ Y^2 $ and $ XY $, along with their sums:

![](img/200319/8.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- And we can get the formulas for $ m $ and $ c $:

![](img/200319/9.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- Now we can find the predicted values of Y for the corresponding values of X using the linear equation with **m = 0.6** and **c = 2.2**
- The blue points represent the **actual Y values** and the brown points represent the **predicted Y values**
- The distances between the actual and predicted values are known as **residuals or errors**
- The best-fit line should have the least sum of squares of these errors, also known as **e square**
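
The values of $ m $ and $ c $ can be reproduced in a few lines. This is a sketch under two assumptions: the five data points are not given as text in these notes, so the ones below are chosen to be consistent with the means (3 and 4) and with m = 0.6, c = 2.2; and the formulas are the standard closed-form least-squares expressions built from the sums, which may be written differently in the image.

```python
# Assumed sample data (the real table is only available as an image)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Standard least-squares formulas for the slope and the intercept
denominator = n * sum_x2 - sum_x ** 2
m = (n * sum_xy - sum_x * sum_y) / denominator
c = (sum_y * sum_x2 - sum_x * sum_xy) / denominator

mean_x = sum_x / n
mean_y = sum_y / n

print(f"m = {m:.1f}, c = {c:.1f}")      # m = 0.6, c = 2.2
print(f"means: ({mean_x}, {mean_y})")   # means: (3.0, 4.0)
```

The fitted line passes through the point of means: 0.6 * 3 + 2.2 = 4.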

![](img/200319/10.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- The sum of squared errors for this regression line is 2.4
- We check this error for each candidate line and pick as the best-fit line the one with the least e square value
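
The residuals and their squared sum can be checked with a short calculation. This is a minimal sketch that assumes five data points consistent with the coefficients and the error total quoted in these notes; the real table is only available as an image.

```python
# Assumed sample data, consistent with m = 0.6, c = 2.2
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = 0.6, 2.2

predicted = [m * x + c for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
sse = sum(r ** 2 for r in residuals)

for x, y, y_hat, r in zip(xs, ys, predicted, residuals):
    print(f"x={x}  actual={y}  predicted={y_hat:.1f}  residual={r:+.1f}")
print(f"sum of squared errors = {sse:.1f}")   # sum of squared errors = 2.4
```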

![](img/200319/11.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- We keep moving the line through the data points until the squared distance between the data points and the regression line is as small as possible; that line is the best-fit line
- There are several ways to measure the distance between the line and the data points, such as the Sum of Squared Errors, the Sum of Absolute Errors and the Root Mean Square Error; the best-fit line is the one that minimizes the chosen measure
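
To make the comparison between candidate lines concrete, the sketch below scores two lines with each of the three measures. The data points and the second candidate line (m = 0.5, c = 2.5) are assumptions picked for illustration.

```python
import math

# Assumed sample data (the real table is only available as an image)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def error_measures(m, c):
    """Return (SSE, SAE, RMSE) for the line y = m*x + c."""
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)
    sae = sum(abs(r) for r in residuals)
    rmse = math.sqrt(sse / len(residuals))
    return sse, sae, rmse

candidates = {
    "least-squares line": (0.6, 2.2),
    "hypothetical alternative": (0.5, 2.5),
}
for name, (m, c) in candidates.items():
    sse, sae, rmse = error_measures(m, c)
    print(f"{name}: SSE={sse:.2f}  SAE={sae:.2f}  RMSE={rmse:.3f}")
```

On this data the first line has the lower SSE and RMSE, while the alternative has the lower SAE, so the choice of error measure can change which line counts as best.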


<br><br>
## Multiple Linear Regression

![](img/200319/12.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- Now there are multiple independent variables
- Instead of $ x $ we have $ x_1, x_2, x_3 $ etc
- Instead of having one slope, each variable has its own slope attached to it - $ m_1, m_2, m_3 $ etc
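
With several inputs the prediction becomes $ y = m_1 x_1 + m_2 x_2 + m_3 x_3 + c $. A minimal sketch of that calculation, where the slopes, the intercept and the input values are all invented for illustration:

```python
def predict(features, slopes, intercept):
    """Evaluate y = m1*x1 + m2*x2 + ... + c for one observation."""
    if len(features) != len(slopes):
        raise ValueError("need exactly one slope per feature")
    return sum(m * x for m, x in zip(slopes, features)) + intercept

# Hypothetical values, not taken from the video
slopes = [2.0, 0.5, -1.0]      # m1, m2, m3
intercept = 10.0               # c
features = [3.0, 4.0, 1.0]     # x1, x2, x3

y = predict(features, slopes, intercept)
print(y)   # 2*3 + 0.5*4 - 1*1 + 10 = 17.0
```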


<br><br>
## Multiple Linear Regression Implementation
- Instead of looking just at R&D we'll look at multiple features:

![](img/200319/13.png)
<br>**Source:** *https://youtu.be/NUXdtN1W1FE?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy*<br>

- From there we'll try to predict what the profit would be

The dataset from the video is not available online so I'll only write the code without executing it.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    # IPython/Jupyter magic: only works inside a notebook
    %matplotlib inline

    companies = pd.read_csv("Companies.csv")
    X = companies.iloc[:, :-1].values   # every column except the last one (features)
    y = companies.iloc[:, 4].values     # fifth column: the target (profit)

    companies.head()


    # Data visualization - correlation matrix (numeric columns only)
    sns.heatmap(companies.select_dtypes(include="number").corr())


    # Encoding categorical data
        # the fourth column (index 3) in the dataset is categorical
        # linear regression cannot process categorical data directly
        # we need to change the values to numbers
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # One Hot Encoder
        # binarizes the categories
        # creates a new column for every category
            # puts 1 if the row is in the category
            # 0 if it isn't
    # OneHotEncoder(categorical_features=[3]) was removed in scikit-learn 0.22;
    # ColumnTransformer now selects the column to encode. LabelEncoder is no
    # longer needed because OneHotEncoder accepts string categories directly.
    encoder = ColumnTransformer(
        [("onehot", OneHotEncoder(), [3])],
        remainder="passthrough",   # keep the numeric columns unchanged
        sparse_threshold=0,        # always return a dense array
    )

    # turns the data into an all-numeric array, with the one-hot columns first
    X = encoder.fit_transform(X).astype(float)


    # Avoiding the dummy variable trap: drop the first one-hot column
    X = X[:, 1:]


    # Splitting the data into train and test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


    # Fitting Multiple Linear Regression Model to Training set:
    from sklearn.linear_model import LinearRegression
    model_fit = LinearRegression()
    model_fit.fit(X_train, y_train)


    # Making predictions
    y_pred = model_fit.predict(X_test)


    # Printing the coefficients and the intercept
    print(model_fit.coef_)
    print(model_fit.intercept_)


    # Calculating the R squared value
        # tells us how good the prediction is
        # not a percentage: 1.0 is a perfect fit, closer to 1 is better
        # above 0.9 is used here as a rule of thumb for a good fit
    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)
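
Because the original CSV is not available, the same steps can still be exercised on synthetic data. The sketch below is an assumption-heavy stand-in: the column names, the three state labels and the coefficients used to generate the profit are all invented, and it uses `ColumnTransformer` for the one-hot step, which is how recent scikit-learn versions select the column to encode.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for Companies.csv (all names and numbers are hypothetical)
companies = pd.DataFrame({
    "rd_spend": rng.uniform(0, 150_000, n),
    "administration": rng.uniform(50_000, 180_000, n),
    "marketing_spend": rng.uniform(0, 400_000, n),
    "state": rng.choice(["A", "B", "C"], n),
})
companies["profit"] = (
    50_000
    + 0.8 * companies["rd_spend"]
    + 0.03 * companies["marketing_spend"]
    + rng.normal(0, 5_000, n)          # random noise
)

X = companies.iloc[:, :-1].values      # features
y = companies.iloc[:, -1].values       # target (profit)

# One-hot encode the categorical column (index 3); the one-hot columns
# come first in the output, followed by the numeric columns
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,                # always return a dense array
)
X = encoder.fit_transform(X).astype(float)
X = X[:, 1:]                           # drop one dummy column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R squared:", r2)
```

Since the synthetic profit is generated from a linear formula plus noise, the fitted coefficient for `rd_spend` should land close to the 0.8 used to create it, which gives a simple sanity check on the pipeline.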