# Multiple linear regression and adjusted R-squared

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn
seaborn.set()

## Load the data

In [2]:
# Load the data from a .csv in the same folder

data = pd.read_csv('1.02. Multiple linear regression.csv')

In [3]:
# Let's check what's inside this data frame
data

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
0,1714,2.40,1
1,1664,2.52,3
2,1760,2.54,3
3,1685,2.74,3
4,1693,2.83,2
...,...,...,...
79,1936,3.71,3
80,1810,3.71,1
81,1987,3.73,3
82,1962,3.76,1


In [4]:
# describe() returns descriptive statistics (count, mean, std, quartiles) for each numeric column.
# We don't need them right now, but will later on.
data.describe() 

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
count,84.0,84.0,84.0
mean,1845.27381,3.330238,2.059524
std,104.530661,0.271617,0.855192
min,1634.0,2.4,1.0
25%,1772.0,3.19,1.0
50%,1846.0,3.38,2.0
75%,1934.0,3.5025,3.0
max,2050.0,3.81,3.0


## Create your first multiple regression

In [5]:
data['SAT']

0     1714
1     1664
2     1760
3     1685
4     1693
      ... 
79    1936
80    1810
81    1987
82    1962
83    2050
Name: SAT, Length: 84, dtype: int64

In [6]:
# Following the regression equation, our dependent variable (y) is the GPA
y = data['GPA']

# The independent variables: this time there are two explanatory variables,
# so x1 is a DataFrame with two columns (SAT and 'Rand 1,2,3')
x1 = data[['SAT', 'Rand 1,2,3']]

In [7]:
# Add a constant. Essentially, we are adding a new column (equal in length to x1), which consists only of 1s
x = sm.add_constant(x1)

# Fit the model using the OLS (ordinary least squares) method, with dependent variable y
# and independent variables x. Calling fit() applies the estimation technique
# and returns the fitted results.
results = sm.OLS(y,x).fit()

# Print a summary of the regression
results.summary()

0,1,2,3
Dep. Variable:,GPA,R-squared:,0.407
Model:,OLS,Adj. R-squared:,0.392
Method:,Least Squares,F-statistic:,27.76
Date:,"Mon, 15 Jul 2024",Prob (F-statistic):,6.58e-10
Time:,14:06:21,Log-Likelihood:,12.72
No. Observations:,84,AIC:,-19.44
Df Residuals:,81,BIC:,-12.15
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.2960,0.417,0.710,0.480,-0.533,1.125
SAT,0.0017,0.000,7.432,0.000,0.001,0.002
"Rand 1,2,3",-0.0083,0.027,-0.304,0.762,-0.062,0.046

0,1,2,3
Omnibus:,12.992,Durbin-Watson:,0.948
Prob(Omnibus):,0.002,Jarque-Bera (JB):,16.364
Skew:,-0.731,Prob(JB):,0.00028
Kurtosis:,4.594,Cond. No.,33300.0


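Notice how the new R-squared is slightly higher than in the previous simple linear regression, but the adjusted R-squared is lower in this model than in the previous one. Adjusted R-squared penalizes every additional predictor, so it only rises when a new variable explains enough to justify its inclusion. **You have added information (the Rand 1,2,3 column) but lost value. Be selective about your variables and exclude unnecessary ones.**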
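The relationship between the R-squared and adjusted R-squared values in the summary follows from the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors excluding the constant. A minimal sketch that plugs in the values from the summary table (the helper name `adjusted_r2` is illustrative, not part of statsmodels):

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the summary table: R-squared 0.407, 84 observations, 2 predictors
adj = adjusted_r2(0.407, 84, 2)
print(round(adj, 3))  # 0.392, matching "Adj. R-squared" in the summary
```

On a fitted model, statsmodels exposes the same quantity as `results.rsquared_adj`.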

The model actually points out the useless variable in the **coefficient table, specifically in the P>|t| column**. The Rand 1,2,3 variable has a p-value of 0.762, so we cannot reject the null hypothesis that its coefficient is zero at any conventional significance level. This is **extremely high, considering that for a coefficient to be statistically significant you want a p-value of less than 0.05**. The conclusion is that Rand 1,2,3 does not add explanatory power once the penalty for the extra variable is applied, as reflected in the lower adjusted R-squared, and is also statistically insignificant. Therefore, it should be dropped.
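The same check can be written as a small filter over the p-values. The sketch below hard-codes the values from the coefficient table so that it runs on its own; on the fitted model the same numbers are available as `results.pvalues`. The 0.05 threshold and the names `alpha` and `insignificant` are choices made for illustration.

```python
# p-values copied from the P>|t| column of the coefficient table
p_values = {"const": 0.480, "SAT": 0.000, "Rand 1,2,3": 0.762}
alpha = 0.05  # conventional significance level

# Flag explanatory variables whose coefficient is not significant.
# Assumption of this sketch: the constant is skipped, since the
# intercept is usually kept regardless of its p-value.
insignificant = [name for name, p in p_values.items()
                 if name != "const" and p >= alpha]
print(insignificant)  # ['Rand 1,2,3']
```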

**Old Model:** yhat = 0.275 + 0.0017x1
    
**New Model:** yhat = 0.296 + 0.0017x1 - 0.0083x2
    
Notice how adding Rand 1,2,3 changed the intercept (from 0.275 to 0.296). Whenever one variable is ruining the model, you should not use the model at all, because the bias from this variable is reflected in the coefficients of the other variables. The correct approach is to remove it from the regression and run a new one.
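To see how the extra variable moves a prediction, the two equations above can be evaluated side by side. This is only a sketch using the rounded coefficients, so the results are approximate; the SAT score of 1850 and the Rand value of 2 are made-up inputs, and the function names are illustrative.

```python
def predict_old(sat):
    # Old model: yhat = 0.275 + 0.0017 * SAT
    return 0.275 + 0.0017 * sat

def predict_new(sat, rand):
    # New model: yhat = 0.296 + 0.0017 * SAT - 0.0083 * Rand
    return 0.296 + 0.0017 * sat - 0.0083 * rand

sat, rand = 1850, 2  # hypothetical student, not a row from the dataset
print(round(predict_old(sat), 3))        # 3.42
print(round(predict_new(sat, rand), 3))  # 3.424
```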

You can add 100 variables to the regression and get what looks like amazing predictive power, but this makes the regression analysis futile. You want to use only a few independent variables. **Simplicity is better rewarded than higher explanatory power.**
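This effect can be reproduced on synthetic data. The sketch below is not based on the SAT/GPA dataset: it generates one genuine predictor plus up to 40 pure-noise columns (the seed, sample size, and all names are assumptions made for illustration) and fits OLS with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 84
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # synthetic data: one real predictor plus noise

def r2_scores(X, y):
    """Return (R-squared, adjusted R-squared) of an OLS fit with an intercept."""
    n_obs, p = X.shape
    A = np.column_stack([np.ones(n_obs), X])      # add the constant column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return r2, 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)

noise = rng.normal(size=(n, 40))  # 40 columns unrelated to y
history = []
for k in (0, 10, 20, 40):
    X = np.column_stack([x, noise[:, :k]])  # real predictor + first k noise columns
    history.append(r2_scores(X, y))
    print(k, [round(float(v), 3) for v in history[-1]])
```

Because each model nests the previous one, R-squared can only stay the same or rise as noise columns are added, while adjusted R-squared applies a penalty per predictor and need not follow.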

The adjusted R-squared is the **basis for comparing regression models.** Once again, it **only makes sense to compare two models that have the same dependent variable and use the same dataset.**