In [1]:
# Multiple Linear Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
# set theme
sns.set()
sns.set(rc={'figure.figsize':(10,8)})

# Multiple Linear Regression

## $y = b_0 + b_1*x_1 + b_2*x_2 + \dots + b_n*x_n$

$y$ - __Dependent variable__    
$x_1, x_2, \dots, x_n$ - __Independent variables__  

Remember to check the assumptions of linear regression before trusting the model.

The dataset below covers 50 start-ups. We want to see how profit relates to the money spent on R&D, administration and marketing, and to the state in which each company operates.

So, profit is our dependent variable and the rest are our independent variables.


#### References:

1. [Dealing with categorical variables in python](https://www.datacamp.com/community/tutorials/categorical-data)
2. [Analytics Vidhya - Working with Categorical variables](https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/)
3. [Dummy variable trap in regression](https://analyticstraining.com/understanding-dummy-variable-traps-regression/)

In [4]:
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [5]:
# Unique values in State column
dataset.State.unique()

array(['New York', 'California', 'Florida'], dtype=object)

## creating dummy variables for State column

__Dummy variables__, also called indicator variables, take the value 1 or 0 to mark the presence or absence of a particular category. A regression model can only use numeric variables, so a categorical (text) variable has to be converted into a quantitative one. Simply relabelling the categories as numbers (for example 1, 2, 3) does not create dummy variables; a dummy is a separate 0/1 indicator column for a category.

__Number of dummy variables to be created in dataset = n - 1__  

where, n = number of unique values in the considered categorical column

In [6]:
# drop_first=True drops the first category's dummy column (State_California)
# to avoid the dummy variable trap

dataset = pd.get_dummies(dataset, columns=['State'], drop_first=True)
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,State_Florida,State_New York
0,165349.2,136897.8,471784.1,192261.83,0,1
1,162597.7,151377.59,443898.53,191792.06,0,0
2,153441.51,101145.55,407934.54,191050.39,1,0
3,144372.41,118671.85,383199.62,182901.99,0,1
4,142107.34,91391.77,366168.42,166187.94,1,0


## $y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + b_4*D_2 + b_5*D_3$

$D_2$ - State_Florida  
$D_3$ - State_New York  

We dropped the dummy variable $D_1$ for _State_California_ 

This is because one of the states can act as the baseline, so its dummy does not provide any incremental information to the model. 
The obvious question is which variable to drop. The answer is: any of them. For a continuous independent variable in `Y = alpha + beta * X`, we interpret the beta coefficient as follows: a one-unit change in the independent variable X brings about a change of beta in the dependent variable Y.

However, how do you interpret a categorical independent variable? If gender is your independent variable, it makes no sense to talk about a one-unit change in male!

The correct approach in this case is to interpret the coefficient with respect to the baseline dummy, that is, the dummy that we left out of the model. 

Let's assume that the variables $D_2$ and $D_3$ stay in the model (that is, they are significant), and that $D_2$ gets a positive coefficient and $D_3$ gets a negative coefficient. 

A positive coefficient for State_Florida means that, compared to State_California and with the other variables held constant, profit is higher in State_Florida. A negative coefficient for State_New York means that, compared to State_California, profit is lower in State_New York.
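The baseline interpretation can be checked on a tiny example. This is a minimal sketch, assuming made-up profit figures (they are not taken from `50_Startups.csv`) and a model that contains only the state dummies, so the constant should come out as the mean profit of the baseline state and each dummy coefficient as the difference from that baseline.

```python
import numpy as np

# Hypothetical toy data: two companies per state, profits invented for illustration.
states = np.array(["California", "California", "Florida", "Florida", "New York", "New York"])
profit = np.array([10.0, 12.0, 15.0, 17.0, 7.0, 9.0])

# Design matrix: constant, D_2 (Florida), D_3 (New York).
# California has no column of its own, so it is the baseline.
X = np.column_stack([
    np.ones(len(states)),
    (states == "Florida").astype(float),
    (states == "New York").astype(float),
])

# Ordinary least squares fit.
b0, b_florida, b_new_york = np.linalg.lstsq(X, profit, rcond=None)[0]

print("baseline (California) mean profit:", round(b0, 6))
print("Florida relative to California:   ", round(b_florida, 6))
print("New York relative to California:  ", round(b_new_york, 6))
```

With these numbers the Florida coefficient is positive and the New York coefficient is negative, and both are read as differences from California rather than as effects of a "one-unit change".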

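If we use all three states in the model together with the constant, the dummy columns are perfectly collinear with the constant: the fit either fails with an error or returns unreliable coefficients. Dropping one state loses nothing, because the dropped state is still accounted for as the baseline; adding all three states does not provide any incremental information to the model.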

__Dummy variable trap__ is also known as __a case of perfect multicollinearity__. 

[2 ways to avoid multicollinearity](https://www.youtube.com/watchreload=9&v=qrWx3OjZL3o):
- Create n - 1 dummy variables, where n is the number of categories in that variable
- Or, leave the constant (intercept) out of the linear regression
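Both remedies can be verified by looking at the rank of the design matrix. The sketch below assumes a small hypothetical `State` column (not the real dataset) and compares the number of columns with the matrix rank: when the rank is lower than the column count, the columns are perfectly collinear.

```python
import numpy as np
import pandas as pd

# Hypothetical State column containing all three categories.
states = pd.Series(["New York", "California", "Florida", "New York", "Florida"], name="State")

all_dummies = pd.get_dummies(states, dtype=float)                    # n = 3 dummy columns
n_minus_1 = pd.get_dummies(states, drop_first=True, dtype=float)     # n - 1 = 2 dummy columns
constant = np.ones((len(states), 1))

# Trap: constant + all n dummies (the dummies sum to the constant column).
X_trap = np.hstack([constant, all_dummies.to_numpy()])
# Remedy 1: constant + n - 1 dummies.
X_drop_first = np.hstack([constant, n_minus_1.to_numpy()])
# Remedy 2: all n dummies, no constant.
X_no_constant = all_dummies.to_numpy()

for name, X in [("trap", X_trap), ("drop_first", X_drop_first), ("no constant", X_no_constant)]:
    print(f"{name}: columns = {X.shape[1]}, rank = {np.linalg.matrix_rank(X)}")
```

Only the first matrix has fewer independent columns than it has columns, which is the dummy variable trap.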

***
***

## p-value


Before I talk about what the p-value is, let’s talk about what __it isn’t__.

- __The p-value is NOT the probability the claim is true__. Of course, this would be an amazing thing to know! Think of a statement like “there is a 10% chance that this medicine works”. Unfortunately, a p-value does not tell you this. Actually determining this probability would be really tough, if not impossible!
- __The p-value is NOT the probability the null hypothesis is true__. Another one that seems so logical it has to be right! This one is closer to reality, but it is still too strong a statement: the p-value is computed assuming the null hypothesis is true, so it cannot be the probability of that assumption.

A __small p-value__ indicates that, if the null hypothesis is true, it would be unlikely to get a sample like the one we have by pure luck alone. If it is small enough, we start thinking that maybe we aren’t super lucky and that our assumption about the null being true is wrong instead. That’s why we reject the null hypothesis when we get a small p-value.

A __large p-value__ indicates that it would be pretty normal to get a sample like ours if the null hypothesis is true. So you can see, there is no reason here to change our minds like we did with a small p-value.

A `p-value` tells us how likely it is to get a result at least as extreme as this one if the null hypothesis is true.
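A coin-flip example makes the definition concrete. This is a minimal sketch, assuming a hypothetical experiment where the null hypothesis is "the coin is fair" and the p-value is the one-sided probability of seeing at least the observed number of heads. The function name `binomial_p_value` is made up for this illustration.

```python
from math import comb

def binomial_p_value(heads, flips, p_null=0.5):
    """P(at least `heads` heads in `flips` flips), assuming the null hypothesis is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
        for k in range(heads, flips + 1)
    )

p_small = binomial_p_value(9, 10)  # 9 heads out of 10: surprising for a fair coin
p_large = binomial_p_value(6, 10)  # 6 heads out of 10: quite ordinary for a fair coin

print(f"9/10 heads: p = {p_small:.4f}")
print(f"6/10 heads: p = {p_large:.4f}")
```

At a 0.05 significance level the first result would lead us to reject the "fair coin" null hypothesis, while the second would not.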

### [How to Interpret P-values and Coefficients in Regression Analysis](https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/)

__P-values__ and __coefficients__ in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The __coefficients__ describe the mathematical relationship between each independent variable and the dependent variable. The __p-values__ for the coefficients indicate whether these relationships are statistically significant.

After fitting a regression model, [check the residual plots](https://statisticsbyjim.com/regression/check-residual-plots-regression-analysis/) first to be sure that you have unbiased estimates. After that, it’s time to interpret the statistical output.

#### Interpreting P-Values for Variables in a Regression Model

Regression analysis is a form of inferential statistics. __The p-values help determine whether the relationships that you observe in your sample also exist in the larger population.__

The __p-value__ for each independent variable tests the __null hypothesis__ that the variable has no correlation with the dependent variable (that is, its coefficient is zero). If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.

#### __Ho__: The variable x has no correlation with the dependent variable y.

#### __Ha__: The variable x has a correlation with the dependent variable y.


If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there is a non-zero correlation. Changes in the independent variable are associated with changes in the response at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.


On the other hand, a p-value that is greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

Example:

The regression output example below shows that the South and North predictor variables are statistically significant because their p-values equal 0.000. On the other hand, East is not statistically significant because its p-value (0.092) is greater than the usual significance level of 0.05.

![](reference_p_val.PNG)

It is standard practice to use the coefficient p-values to decide whether to include variables in the final model. For the results above, we would consider removing East. Keeping variables that are not statistically significant can reduce the model’s precision.
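The idea of comparing a predictor's p-value with a significance level can be sketched with NumPy alone. Note the assumptions: the data are simulated, the names `x_signal` and `x_noise` are hypothetical, each predictor is tested in a simple one-variable regression, and the p-value comes from a permutation test rather than from the t-test that regression software usually reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x_signal = rng.normal(size=n)            # hypothetical predictor that drives y
x_noise = rng.normal(size=n)             # hypothetical predictor unrelated to y
y = 2.0 * x_signal + rng.normal(size=n)

def slope(x, y):
    """Least-squares slope of y on x (the fit includes an intercept)."""
    return np.polyfit(x, y, 1)[0]

def permutation_p_value(x, y, n_perm=2000):
    """Share of shuffled datasets whose |slope| is at least the observed |slope|.

    Shuffling y breaks any link with x, which simulates the null hypothesis.
    """
    observed = abs(slope(x, y))
    hits = sum(abs(slope(x, rng.permutation(y))) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

alpha = 0.05  # significance level
p_signal = permutation_p_value(x_signal, y)
p_noise = permutation_p_value(x_noise, y)

print(f"x_signal: p = {p_signal:.4f}, significant: {p_signal < alpha}")
print(f"x_noise:  p = {p_noise:.4f}, significant: {p_noise < alpha}")
```

A predictor whose p-value stays above `alpha` would be a candidate for removal, in the same way East is in the example above.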

### Interpreting Regression Coefficients for Linear Relationships

The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.

Example: $\text{Weight (kg)} = -114.3 + 106.5*\text{Height (m)}$   

The height coefficient in the regression equation is 106.5. This coefficient represents the mean increase of weight in kilograms for every additional one meter in height. If your height increases by 1 meter, the average weight increases by 106.5 kilograms.
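The same reading of the coefficient can be reproduced in a few lines. This sketch only evaluates the example equation above; the function name `predicted_weight_kg` is made up for illustration.

```python
def predicted_weight_kg(height_m):
    """Mean weight predicted by the example equation for a given height in meters."""
    return -114.3 + 106.5 * height_m

# A one-unit (1 m) increase in height changes the prediction by the coefficient.
diff_1m = predicted_weight_kg(2.0) - predicted_weight_kg(1.0)

# A more realistic 10 cm increase changes it by one tenth of the coefficient.
diff_10cm = predicted_weight_kg(1.7) - predicted_weight_kg(1.6)

print(round(diff_1m, 2), round(diff_10cm, 2))
```

The intercept drops out of both differences, which is why the coefficient alone describes the change per unit of height.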

### Use Polynomial Terms to Model Curvature in Linear Models

In linear regression, you can use polynomial terms to model curves in your data. It is important to keep in mind that we’re still using linear regression to model curvature rather than nonlinear regression: the model stays linear in its coefficients. 

[Difference between Linear and Nonlinear Regression Models](https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/)
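A short sketch of why a polynomial term still counts as linear regression: the squared term is just another column in the design matrix, and the coefficients are found by ordinary least squares. The data here are hypothetical and noise-free, generated from $y = 1 + 2x + 3x^2$, so the fit should recover those three coefficients.

```python
import numpy as np

# Hypothetical curved data generated from a known quadratic.
x = np.linspace(-3, 3, 25)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with a constant, x, and the polynomial term x^2.
# The model is curved in x but linear in the coefficients b_0, b_1, b_2.
X = np.column_stack([np.ones_like(x), x, x**2])

coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(coef, 6))
```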

***

Reference:

- [p-value](https://www.mathbootcamps.com/what-is-a-p-value/)
- [WikiHow calculate p-value](https://www.wikihow.com/Calculate-P-Value)