# Multiple linear regression

## Import the relevant libraries

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [3]:
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()

| | SAT | Rand 1,2,3 | GPA |
|---|---|---|---|
| 0 | 1714 | 1 | 2.4 |
| 1 | 1664 | 3 | 2.52 |
| 2 | 1760 | 3 | 2.54 |
| 3 | 1685 | 3 | 2.74 |
| 4 | 1693 | 2 | 2.83 |


In [4]:
data.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SAT | 84.0 | 1845.27381 | 104.530661 | 1634.0 | 1772.0 | 1846.0 | 1934.0 | 2050.0 |
| Rand 1,2,3 | 84.0 | 2.059524 | 0.855192 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| GPA | 84.0 | 3.330238 | 0.271617 | 2.4 | 3.19 | 3.38 | 3.5025 | 3.81 |


## Create the multiple linear regression

### Creating with StatsModels

In [11]:
x1_sm = data[['SAT','Rand 1,2,3']]
y_sm = data['GPA']

In [12]:
x_sm = sm.add_constant(x1_sm)
ols_result = sm.OLS(y_sm,x_sm).fit()
ols_result.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | GPA | R-squared: | 0.407 |
| Model: | OLS | Adj. R-squared: | 0.392 |
| Method: | Least Squares | F-statistic: | 27.76 |
| Date: | Fri, 01 Dec 2023 | Prob (F-statistic): | 6.58e-10 |
| Time: | 15:47:23 | Log-Likelihood: | 12.72 |
| No. Observations: | 84 | AIC: | -19.44 |
| Df Residuals: | 81 | BIC: | -12.15 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2960 | 0.417 | 0.710 | 0.480 | -0.533 | 1.125 |
| SAT | 0.0017 | 0.000 | 7.432 | 0.000 | 0.001 | 0.002 |
| Rand 1,2,3 | -0.0083 | 0.027 | -0.304 | 0.762 | -0.062 | 0.046 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 12.992 | Durbin-Watson: | 0.948 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 16.364 |
| Skew: | -0.731 | Prob(JB): | 0.00028 |
| Kurtosis: | 4.594 | Cond. No. | 33300.0 |
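
The F-statistic in the summary can be cross-checked from R-squared alone. The sketch below assumes the standard identity for an OLS model with an intercept, F = (R² / df_model) / ((1 - R²) / df_resid). The variable names are illustrative, and the R-squared value is the unrounded one that scikit-learn reports for the same fit further down in this notebook.

```python
# Values taken from this notebook's output
r2 = 0.4066811952814282   # unrounded R-squared (shown as 0.407 in the summary)
df_model = 2              # "Df Model" in the summary
df_resid = 81             # "Df Residuals" in the summary

# F = explained variance per model df / unexplained variance per residual df
f_stat = (r2 / df_model) / ((1 - r2) / df_resid)
print(round(f_stat, 2))
```

The result should agree with the `F-statistic` entry of the summary to two decimals.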


### Declare the dependent and independent variables

In [5]:
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']

### Regression itself

In [6]:
reg = LinearRegression()
reg.fit(x,y)

In [7]:
reg.coef_

array([ 0.00165354, -0.00826982])

In [8]:
reg.intercept_

0.29603261264909486

In [13]:
reg.score(x,y)

0.4066811952814282

#### Adjusted R-Squared

Formula:

$
R^2_{adj.} = 1 - (1 - R^2) \times \frac{n-1}{n-p-1}
$

Where,
- n = 84 (number of observations)
- p = 2 (number of predictors)
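
The formula can be wrapped in a small helper. This is a minimal sketch: the function name `adjusted_r2` and the guard against too few observations are assumptions for illustration, not part of the original notebook. The loop shows how the penalty grows with the number of predictors when R-squared and n are held fixed.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with an intercept and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R-squared and n as this notebook, with a varying predictor count
for p in (1, 2, 5, 10):
    print(p, adjusted_r2(0.4066811952814282, n=84, p=p))
```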

In [14]:
x.shape

(84, 2)

In [16]:
n, p = x.shape  # 84 observations, 2 predictors
adj_r2 = 1 - (1 - reg.score(x, y)) * (n - 1) / (n - p - 1)
adj_r2

0.39203134825134

In [17]:
adj_r2 - reg.score(x,y)

-0.014649847030088203

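As shown above, the adjusted R-squared (0.392) is slightly lower than the R-squared (0.407). Adjusted R-squared penalizes every added predictor, so it is always at or below R-squared. The drop here, together with the high p-value of `Rand 1,2,3` (0.762) in the StatsModels summary, indicates that this predictor has little explanatory power.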
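
The effect of an uninformative predictor can be illustrated on synthetic data. The sketch below does not use the notebook's CSV: the data, the seed, and the helper `r2_and_adj` are all assumptions made for illustration. It fits OLS with NumPy's least-squares solver, once with a single informative predictor and once with an added random column of 1, 2 or 3.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data, not the notebook's CSV
n = 84
signal = rng.normal(size=n)                # predictor that actually drives y
noise_col = rng.integers(1, 4, size=n)     # random 1, 2 or 3, unrelated to y
y = 3.0 + 0.2 * signal + rng.normal(scale=0.2, size=n)

def r2_and_adj(X, y):
    """Fit OLS with an intercept via least squares; return (R2, adjusted R2)."""
    n_obs, p = X.shape
    A = np.column_stack([np.ones(n_obs), X])       # prepend the constant column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
    return r2, adj

r2_one, adj_one = r2_and_adj(signal.reshape(-1, 1), y)
r2_two, adj_two = r2_and_adj(np.column_stack([signal, noise_col]), y)
print(f"one predictor : R2={r2_one:.4f}, adj={adj_one:.4f}")
print(f"two predictors: R2={r2_two:.4f}, adj={adj_two:.4f}")
```

R-squared cannot decrease when a column is added to an OLS model with an intercept, whereas the adjusted value may rise or fall depending on whether the gain outweighs the penalty.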

### Regression Model

Thus, the multiple linear regression model is:

$
GPA = 0.29603261264909486 + 0.00165354 \times SAT - 0.00826982 \times \text{Rand 1,2,3}
$
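
The fitted equation can be evaluated directly for a new observation. This is a minimal sketch: the function name `predict_gpa` and the example inputs are assumptions for illustration, and the coefficients are copied from the `reg.intercept_` and `reg.coef_` outputs above.

```python
# Coefficients copied from the scikit-learn outputs in this notebook
INTERCEPT = 0.29603261264909486
COEF_SAT = 0.00165354
COEF_RAND = -0.00826982

def predict_gpa(sat: float, rand: int) -> float:
    """Evaluate the fitted regression equation for one student."""
    return INTERCEPT + COEF_SAT * sat + COEF_RAND * rand

# Example input chosen inside the observed SAT range (1634 to 2050)
example = predict_gpa(sat=1700, rand=2)
print(round(example, 2))
```

With the fitted `reg` object available, `reg.predict` on a DataFrame with the same two columns should give essentially the same number, up to the rounding of the copied coefficients.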