# Multiple Linear Regression

Adjusted R-squared:

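(1) R-squared measures how much of the total variability of the dependent variable is explained by the model.

(2) In terms of R-squared, a multiple regression never looks worse than a simple one: with each additional variable you add, the in-sample explanatory power can only increase or stay the same, even if the variable is irrelevant.

(3) Adjusted R-squared is never greater than R-squared, and it is strictly smaller whenever the model has at least one predictor and R-squared is below 1. Adjusted R-squared penalizes excessive use of variables.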
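Points (2) and (3) can be illustrated with a minimal sketch. It uses synthetic data (not the SAT/GPA file loaded below) and plain NumPy least squares instead of sklearn so that it runs on its own; the names `r_squared`, `x_signal`, and `x_noise` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])       # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least squares coefficients
    residuals = y - X1 @ beta
    ss_res = residuals @ residuals                   # unexplained variability
    ss_tot = ((y - y.mean()) ** 2).sum()             # total variability
    return 1 - ss_res / ss_tot

# Synthetic data: y depends on x_signal only; x_noise is irrelevant by construction.
rng = np.random.default_rng(42)
n = 84
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 2.0 * x_signal + rng.normal(size=n)

r2_simple = r_squared(x_signal.reshape(-1, 1), y)
r2_multiple = r_squared(np.column_stack([x_signal, x_noise]), y)

# Adjusted R-squared with p predictors: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_simple = 1 - (1 - r2_simple) * (n - 1) / (n - 1 - 1)
adj_multiple = 1 - (1 - r2_multiple) * (n - 1) / (n - 2 - 1)

print("R-squared:         ", r2_simple, "->", r2_multiple)
print("Adjusted R-squared:", adj_simple, "->", adj_multiple)
```

R-squared cannot go down when `x_noise` is added, whereas the adjusted value is free to fall if the extra predictor does not pull its weight.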


## Import the relevant libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load data

In [4]:
data = pd.read_csv('1.02. Multiple linear regression.csv')

In [5]:
data.describe()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


In [6]:
data.head()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |


## Create the multiple linear regression 

Declare the dependent and independent variables

In [8]:
x = data[['SAT', 'Rand 1,2,3']]
y = data['GPA']

In [9]:
reg = LinearRegression()
reg.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
reg.coef_

array([ 0.00165354, -0.00826982])

In [11]:
reg.intercept_

0.29603261264909353
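The coefficients and the intercept together define the prediction equation, GPA = intercept + coef[0] * SAT + coef[1] * "Rand 1,2,3". As a small sketch, the arithmetic can be reproduced by hand with the values printed above and the first two rows shown by `data.head()`; the array names `coef`, `rows`, and `predicted_gpa` are illustrative.

```python
import numpy as np

# Values copied from the reg.coef_ and reg.intercept_ outputs above
coef = np.array([0.00165354, -0.00826982])   # order: [SAT, Rand 1,2,3]
intercept = 0.29603261264909353

# First two observations from data.head(): columns are (SAT, Rand 1,2,3)
rows = np.array([[1714, 1],
                 [1664, 3]])

# Linear prediction: intercept plus the dot product of each row with the coefficients
predicted_gpa = intercept + rows @ coef
print(predicted_gpa.round(3))
```

With the fitted model in memory, `reg.predict` applied to the same two rows should give matching values up to the rounding of the copied coefficients.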

### Calculate the R-squared
R-squared is a universal measure of how well a linear regression fits, and it allows models to be compared. <br>
Adjusted R-squared is more appropriate for multiple linear regression.

In [15]:
R_squared_value = reg.score(x,y)
R_squared_value

0.40668119528142815

### Formula for Adjusted R^2
$R^2_{adj.} = 1 - (1-R^2)\cdot\frac{n-1}{n-p-1}$

In [21]:
x.shape     # (n, p): n is the number of observations, p is the number of predictors

(84, 2)

In [24]:
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
r2_adj = 1 - (1 - r2)*(n-1)/(n-p-1)
r2_adj

0.39203134825134
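The size of the penalty follows directly from the formula. As a short worked derivation, subtracting the adjusted value from the plain one gives:

$R^2 - R^2_{adj.} = (1-R^2)\cdot\left(\frac{n-1}{n-p-1} - 1\right) = (1-R^2)\cdot\frac{p}{n-p-1}$

The gap is therefore never negative, and it grows with the number of predictors $p$. With the values from this notebook ($R^2 \approx 0.4067$, $n = 84$, $p = 2$):

$R^2 - R^2_{adj.} \approx 0.5933\cdot\frac{2}{81} \approx 0.0146$

This matches the difference between the two outputs above, 0.4067 and 0.3920.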

### Regression assumptions:
(1) Linearity <br />
(2) No endogeneity - e.g., omitted variables <br />
(3) Normality and homoscedasticity <br />
(4) No autocorrelation <br />
(5) No multicollinearity <br />


## Feature selection (F-regression) 
`f_regression` fits a separate simple linear regression of the dependent variable on each feature. <br>
Note that for a simple linear regression, the p-value of the F-statistic coincides with the p-value of the only independent variable. <br>
Features with a p-value > 0.05 (the conventional 5% significance level) can be discarded.

In [25]:
from sklearn.feature_selection import f_regression

In [28]:
f_regression(x, y)    # first is f-stats; p-values are in the second array.

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

In [30]:
p_values = f_regression(x,y)[1]
p_values

array([7.19951844e-11, 6.76291372e-01])

In [31]:
p_values.round(3)

array([0.   , 0.676])

The p-values above indicate that SAT is useful, while "Rand 1,2,3" is useless.

F-regression is simplistic: it does not take into account the interrelation of the features, so it should be used with caution.
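The remark that, in a simple linear regression, the F-test and the test on the single coefficient coincide can be checked by hand. The sketch below runs on synthetic stand-ins for the two features (not the actual CSV) and uses only NumPy. It computes the univariate F-statistic from the Pearson correlation, which to my understanding is the same quantity `f_regression` reports with its default settings, and compares it with the squared t-statistic of the slope. The function names are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-ins: one informative feature and one random feature
rng = np.random.default_rng(0)
n = 84
sat = rng.normal(1850, 100, size=n)
rand_123 = rng.integers(1, 4, size=n).astype(float)   # random values in {1, 2, 3}
gpa = 0.3 + 0.0016 * sat + rng.normal(0, 0.2, size=n)

def f_from_correlation(feature, target):
    """Univariate F-statistic: r^2 / (1 - r^2) * (n - 2)."""
    r = np.corrcoef(feature, target)[0, 1]
    return r**2 / (1 - r**2) * (len(target) - 2)

def slope_t_statistic(feature, target):
    """t-statistic of the slope in a simple regression with an intercept."""
    x_centered = feature - feature.mean()
    s_xx = x_centered @ x_centered
    slope = x_centered @ (target - target.mean()) / s_xx
    intercept = target.mean() - slope * feature.mean()
    residuals = target - (intercept + slope * feature)
    sigma2 = residuals @ residuals / (len(target) - 2)   # residual variance
    return slope / np.sqrt(sigma2 / s_xx)                # slope / standard error

for name, feature in [("SAT", sat), ("Rand 1,2,3", rand_123)]:
    f_stat = f_from_correlation(feature, gpa)
    t_stat = slope_t_statistic(feature, gpa)
    print(f"{name}: F = {f_stat:.4f}, t^2 = {t_stat**2:.4f}")
```

Because F equals t squared for each feature taken alone, the two tests yield the same p-value. This is also why the method cannot see interactions between features: every feature is tested in isolation.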

## Creating a summary table

In [37]:
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])

reg_summary

| | Features |
|---|---|
| 0 | SAT |
| 1 | Rand 1,2,3 |


In [40]:
reg_summary['coefficient'] = reg.coef_
reg_summary['p-value'] = p_values.round(3)
reg_summary

| | Features | coefficient | p-value |
|---|---|---|---|
| 0 | SAT | 0.001654 | 0.0 |
| 1 | Rand 1,2,3 | -0.00827 | 0.676 |


P-values are one of the best ways to determine if a variable is redundant, <br>
but they provide no information whatsoever about HOW USEFUL a variable is.