# Multiple Linear Regression

## Import relevant libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set()

## Load the data

In [3]:
raw_data = pd.read_csv('1.03.+Dummies.csv')
raw_data

Unnamed: 0,SAT,GPA,Attendance
0,1714,2.40,No
1,1664,2.52,No
2,1760,2.54,No
3,1685,2.74,No
4,1693,2.83,No
...,...,...,...
79,1936,3.71,Yes
80,1810,3.71,Yes
81,1987,3.73,No
82,1962,3.76,Yes


## Create a dummy variable for 'Attendance'

In [4]:
data = raw_data.copy()
data['Attendance'] = data['Attendance'].map({'Yes':1, 'No':0})
data.head()

Unnamed: 0,SAT,GPA,Attendance
0,1714,2.4,0
1,1664,2.52,0
2,1760,2.54,0
3,1685,2.74,0
4,1693,2.83,0
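
One caveat about `map`: any value missing from the dictionary becomes `NaN` without raising an error, so it is worth checking the mapped column for missing values. The sketch below uses a small made-up sample (not the course data set) with a stray lowercase `'yes'` to show the check.

```python
import pandas as pd

# Hypothetical sample, not the course data; note the stray lowercase 'yes'
sample = pd.DataFrame({'Attendance': ['Yes', 'No', 'yes', 'No']})

mapped = sample['Attendance'].map({'Yes': 1, 'No': 0})

# Values that are not keys of the dictionary become NaN instead of raising an error
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is not zero, the raw labels need cleaning before the regression is fitted.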


## Create the regression

### Declare the dependent and independent variables

In [6]:
y = data['GPA']
x1 = data[['SAT', 'Attendance']]

### Regression

In [7]:
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()

In [8]:
results.summary()

Dep. Variable:,GPA,R-squared:,0.565
Model:,OLS,Adj. R-squared:,0.555
Method:,Least Squares,F-statistic:,52.7
Date:,"Wed, 17 Jul 2024",Prob (F-statistic):,2.19e-15
Time:,11:06:11,Log-Likelihood:,25.798
No. Observations:,84,AIC:,-45.6
Df Residuals:,81,BIC:,-38.3
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.6439,0.358,1.797,0.076,-0.069,1.357
SAT,0.0014,0.000,7.141,0.000,0.001,0.002
Attendance,0.2226,0.041,5.451,0.000,0.141,0.304

Omnibus:,19.56,Durbin-Watson:,1.009
Prob(Omnibus):,0.0,Jarque-Bera (JB):,27.189
Skew:,-1.028,Prob(JB):,1.25e-06
Kurtosis:,4.881,Cond. No.,33500.0
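
Reading the `coef` column of the summary, the fitted model can be written out as an equation. The coefficients are the rounded values printed in the table, so any figure computed from them is approximate.

$$
\widehat{GPA} = 0.6439 + 0.0014 \cdot SAT + 0.2226 \cdot Attendance
$$

Because `Attendance` is a dummy variable, this amounts to two parallel lines with the same slope and different intercepts:

$$
\begin{aligned}
Attendance = 0: \quad & \widehat{GPA} = 0.6439 + 0.0014 \cdot SAT \\
Attendance = 1: \quad & \widehat{GPA} = 0.8665 + 0.0014 \cdot SAT
\end{aligned}
$$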


## How to make predictions based on the regressions we create

In [9]:
x

Unnamed: 0,const,SAT,Attendance
0,1.0,1714,0
1,1.0,1664,0
2,1.0,1760,0
3,1.0,1685,0
4,1.0,1693,0
...,...,...,...
79,1.0,1936,1
80,1.0,1810,1
81,1.0,1987,0
82,1.0,1962,1


Suppose we want to predict the GPA of two students, Bob and Lisa. Bob's SAT score is 1720 and Lisa's is 1683. From the attendance data, we know that Bob's attendance did not reach 75% while Lisa's was above 75%, so Bob's `Attendance` is No (0) and Lisa's is Yes (1).

In [11]:
new_data = pd.DataFrame({'const':1, 'SAT':[1720, 1683], 'Attendance':[0, 1]})

In [12]:
new_data

Unnamed: 0,const,SAT,Attendance
0,1,1720,0
1,1,1683,1


In [13]:
new_data = new_data[['const', 'SAT', 'Attendance']]  # keep the same column order as x

In [14]:
new_data

Unnamed: 0,const,SAT,Attendance
0,1,1720,0
1,1,1683,1


In [16]:
predictions = results.predict(new_data)
predictions

0    3.051509
1    3.222361
dtype: float64
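
As a sanity check, the same numbers can be approximated by hand: each prediction is the dot product of a row of `new_data` with the coefficient vector. The sketch below hard-codes the rounded coefficients from the summary table instead of reading them from `results`, so it matches the values above only to about two decimal places. The names `coefs`, `X_new`, and `manual` are illustrative.

```python
import numpy as np

# Coefficients as printed in the summary table (rounded, so results are approximate)
coefs = np.array([0.6439, 0.0014, 0.2226])  # const, SAT, Attendance

# Rows follow the same column order: const, SAT, Attendance
X_new = np.array([
    [1, 1720, 0],  # Bob
    [1, 1683, 1],  # Lisa
])

# Each prediction is const + coef_SAT * SAT + coef_Attendance * Attendance
manual = X_new @ coefs
print(manual.round(2))
```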

In [19]:
predictionsdf = pd.DataFrame({'predictions': predictions})  # turn the predictions into a DataFrame
predictionsdf

Unnamed: 0,predictions
0,3.051509
1,3.222361


In [25]:
joined = new_data.join(predictionsdf)  # join the predictions to the new_data table
joined = joined.rename(index={0: 'Bob', 1: 'Lisa'})  # rename returns a new frame; map old label -> new label
joined

Unnamed: 0,const,SAT,Attendance,predictions
Bob,1,1720,0,3.051509
Lisa,1,1683,1,3.222361
