# Multiple linear regression

When there are two or more independent variables, the regression is called Multiple Linear Regression.

In [29]:
# For these lessons we will need NumPy, pandas and matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# and of course the actual regression (machine learning) module
from sklearn.linear_model import LinearRegression

## Load the data

In [31]:
# Load the data from a .csv in the ./data subfolder
df = pd.read_csv('./data/1.02. Multiple linear regression.csv')

# Let's explore the top 5 rows of the df
df.head()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
0,1714,1,2.4
1,1664,3,2.52
2,1760,3,2.54
3,1685,3,2.74
4,1693,2,2.83


In [32]:
# This method gives us very nice descriptive statistics. We don't need this for now, but will later on!
df.describe()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
count,84.0,84.0,84.0
mean,1845.27381,2.059524,3.330238
std,104.530661,0.855192,0.271617
min,1634.0,1.0,2.4
25%,1772.0,1.0,3.19
50%,1846.0,2.0,3.38
75%,1934.0,3.0,3.5025
max,2050.0,3.0,3.81


## Create the multiple linear regression

### Declare the dependent and independent variables

In [33]:
# There are two independent variables: 'SAT' and 'Rand 1,2,3'
x = df[['SAT','Rand 1,2,3']]

# and a single dependent variable: 'GPA'
y = df['GPA']

### Create the regression model

In [36]:
x

Unnamed: 0,SAT,"Rand 1,2,3"
0,1714,1
1,1664,3
2,1760,3
3,1685,3
4,1693,2
...,...,...
79,1936,3
80,1810,1
81,1987,3
82,1962,1


$$y = ax + b$$
$$GPA = a \cdot SAT + b$$
$$GPA = a_1 \cdot SAT + a_2 \cdot RandNumb + b$$

In [37]:
# We start by creating a linear regression object
reg = LinearRegression()
# The whole learning process boils down to fitting the regression
reg.fit(x, y)

In [38]:
# Getting the coefficients of the regression
reg.coef_
# Note that the output is an array with one coefficient per feature,
# in the column order of x ('SAT' first, then 'Rand 1,2,3')

array([ 0.00165354, -0.00826982])

In [23]:
# Getting the intercept of the regression
reg.intercept_
# Note that the result is a single float, because there is only one dependent variable

0.29603261264909486
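
With the coefficients and the intercept in hand, a prediction is just the regression equation evaluated at given inputs. The following is a minimal sketch that uses the rounded values printed above. The SAT score of 1700 and the random value of 2 are made-up inputs chosen only for illustration.

```python
# Coefficients and intercept copied (rounded) from the outputs above
a1, a2 = 0.00165354, -0.00826982   # coefficients for 'SAT' and 'Rand 1,2,3'
b = 0.29603261                     # intercept

# Hypothetical inputs, not taken from the dataset
sat, rand_numb = 1700, 2

# GPA = a1 * SAT + a2 * RandNumb + b
predicted_gpa = a1 * sat + a2 * rand_numb + b
print(round(predicted_gpa, 2))
```

Calling `reg.predict` on a DataFrame that has the same two columns as `x` should give the same number, up to the rounding of the coefficients.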

### Model Evaluation - Calculating the R-squared

The `R-squared` value evaluates how well the model fits the data. It measures the proportion of the variance in the dependent variable that the independent variables explain, so on the same data a higher R-squared indicates a closer fit.

In [42]:
# Get the R-squared of the regression
reg.score(x,y)

0.40668119528142843
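
To make the definition concrete, R-squared can also be computed by hand as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the actual values from their mean. The sketch below uses plain Python and toy numbers that are not taken from the dataset. The helper name `r_squared` is only illustrative.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Variation the model fails to explain
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total variation of the actual values around their mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Toy values for illustration only
y_true = [2.0, 3.0, 4.0, 5.0]
y_pred = [2.5, 2.5, 4.5, 4.5]

print(r_squared(y_true, y_pred))
```

Applied to `y` and `reg.predict(x)`, the same calculation should agree with `reg.score(x, y)`, which also returns the coefficient of determination.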

### Model Evaluation - Formula for Adjusted $R^2$

Adding independent variables never lowers the R-squared measured on the data used for fitting, so the R-squared can rise even when a new variable does not meaningfully contribute to the model's predictive power. The Adjusted $R^2$ addresses this issue by penalizing the addition of irrelevant variables.

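$$R^2_{adj.} = 1 - (1-R^2) \cdot \frac{n-1}{n-p-1}$$

Here $n$ is the number of observations and $p$ is the number of independent variables (features).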

In [40]:
# Get the shape of x, to facilitate the creation of the Adjusted R^2 metric
x.shape

(84, 2)

In [41]:
# The Adjusted R-squared needs three inputs: the R-squared, the number of observations and the number of features
r2 = reg.score(x,y)
# Number of observations is the shape along axis 0
n = x.shape[0]
# Number of features (predictors, p) is the shape along axis 1
p = x.shape[1]

# We find the Adjusted R-squared using the formula
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2

0.39203134825134023
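
The size of the penalty is easier to see when the number of predictors changes while everything else stays fixed. The sketch below wraps the formula in a helper (the name `adjusted_r2_score` is only illustrative) and reuses the R-squared and the number of observations reported above. The values `p = 5` and `p = 10` are hypothetical and do not correspond to any model fitted here.

```python
def adjusted_r2_score(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.40668119528142843  # R-squared reported above
n = 84                    # number of observations, from x.shape

# p = 2 is the model in this notebook; 5 and 10 are hypothetical
for p in (2, 5, 10):
    print(p, adjusted_r2_score(r2, n, p))
```

With the R-squared held constant, the adjusted value falls as `p` grows, so an extra variable only raises the Adjusted R-squared if it raises the R-squared by enough to offset the penalty.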

### Model Evaluation - Mean Squared Error (MSE)

The **Mean Squared Error (MSE)** measures the average squared difference between the predicted values and the actual values. A lower MSE indicates a better-fitting model.

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$


Where:  
- $y_i$: Actual value for observation $i$  
- $\hat{y}_i$: Predicted value for observation $i$  
- $n$: Total number of observations 

**Key Points:**  
- MSE penalizes larger errors more heavily due to the squaring of differences.  
- It is sensitive to outliers, so it should be interpreted with caution in the presence of extreme values.  
- Often used for regression model evaluation to quantify prediction accuracy.
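
The first key point can be illustrated with a small sketch in plain Python. Both sets of predictions below miss the actual values by the same total absolute amount (1.0), but one spreads the error over two observations and the other concentrates it in a single one. All values are toy numbers for illustration, and the helper name `mse_by_hand` is an assumption, not part of scikit-learn.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Toy values for illustration only
y_true = [3.0, 2.5, 4.0]
two_small_errors = [2.5, 2.5, 3.5]   # errors of 0.5, 0.0 and 0.5
one_large_error = [3.0, 2.5, 3.0]    # errors of 0.0, 0.0 and 1.0

print(mse_by_hand(y_true, two_small_errors))
print(mse_by_hand(y_true, one_large_error))
```

The single error of 1.0 yields a larger MSE than the two errors of 0.5, because squaring makes one large miss count for more than several small ones.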


In [44]:
from sklearn.metrics import mean_squared_error

y_true = y
y_pred = reg.predict(x)

mse = mean_squared_error(y_true, y_pred) 
print("MSE = ", mse)

MSE =  0.04325149456531023
