# **Lab 4**: Multiple Linear Regression (Prediction + Inference)

In this lab, we will predict **house prices** using two predictors: **SquareFootage** and **AgeOfHouse**.

We will do two things:
1. **Prediction**: See how well our model predicts using train/test split, MSE, and R².
2. **Inference**: Look at coefficients, t-stats, p-values, and the F-statistic to understand which predictors are statistically significant.

## Step 1: Import Libraries

First, we import all of the libraries that we need. Here we have the usual ones plus one new addition: `statsmodels`.

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Import everything we need!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

## Step 2: Create Dataset

We are using a small synthetic dataset of houses from the last lab on Simple Linear Regression. It includes:
- SquareFootage
- AgeOfHouse
- NumBedrooms
- Price (target variable)

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Let's create it and view the first few rows.

In [None]:
data = {
    'SquareFootage': [3532, 3407, 2453, 1635, 1563, 2531, 1833, 1077, 2578, 2628,
                      3447, 3942, 3448, 3162, 1505, 3399, 2935, 3022, 3697, 2501],
    'NumBedrooms': [2, 1, 2, 5, 4, 1, 4, 1, 3, 4,
                    1, 2, 4, 4, 4, 1, 2, 2, 2, 1],
    'AgeOfHouse': [31, 10, 23, 35, 11, 28, 34, 0, 0, 36,
                   5, 38, 40, 17, 15, 4, 41, 42, 31, 1],
    'Price': [838560.919253, 804484.724846, 563445.633404, 432226.399244, 408372.745913,
              548757.706247, 440905.015175, 248148.490314, 625590.329047, 644081.556976,
              817637.919091, 953896.908492, 846710.103200, 799152.331779, 385981.010762,
              806350.318599, 659057.167194, 687884.192489, 886352.154312, 563846.859079]
}

df = pd.DataFrame(data)
df.XXX

|   | SquareFootage | NumBedrooms | AgeOfHouse | Price |
|---|---|---|---|---|
| 0 | 3532 | 2 | 31 | 838560.919253 |
| 1 | 3407 | 1 | 10 | 804484.724846 |
| 2 | 2453 | 2 | 23 | 563445.633404 |
| 3 | 1635 | 5 | 35 | 432226.399244 |
| 4 | 1563 | 4 | 11 | 408372.745913 |


## Step 3: Define Features and Target

We select our features (**X**) and target (**y**). Multiple Linear Regression requires at least two predictors. Let's use `SquareFootage` (from the last lab) and `AgeOfHouse`.

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Parse apart your predictors and your target variables so that we can do our train/test split.

In [None]:
X = XXX
y = XXX

## Step 4: Split Data into Training and Test Sets

We split the data to evaluate how well our model predicts **new, unseen data**.

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Perform the train/test split using 80% for training and 20% for testing.

In [None]:
XXX, XXX, XXX, XXX = XXX

## Step 5: Train Sklearn Model (Prediction)

We use **sklearn**'s `LinearRegression` to **fit a model for prediction**.
This is fast and convenient for calculating metrics like MSE and R².

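Note: sklearn's `LinearRegression` is built for **prediction**. It does not report standard errors, t-statistics, or p-values, so it is not the tool for inference/statistical testing.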

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Fit a LinearRegression model to your data. It works exactly the same as when we did this with a single predictor.

In [None]:
model_skl = XXX()
model_skl.XXX(XXX, XXX)

y_pred_train = model_skl.XXX(XXX)
y_pred_test = model_skl.XXX(XXX)

print(f'Coefficients: {model_skl.coef_}')
print(f'Intercept: {model_skl.intercept_}')

Coefficients: [233.34879591 -27.2070154 ]
Intercept: 11053.874506592401
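
To read these numbers, it can help to write the fitted model as Price ≈ intercept + b1 × SquareFootage + b2 × AgeOfHouse. The sketch below plugs the printed values into that formula by hand. The house (2,000 sq ft, 10 years old) is made up for illustration, and the coefficients are copied from the output above, so they will change if your train/test split differs.

```python
import numpy as np

# Values copied from the printed output above (they depend on the train/test split).
intercept = 11053.874506592401
coefs = np.array([233.34879591, -27.2070154])  # order: [SquareFootage, AgeOfHouse]

# Hypothetical house, made up for illustration: 2000 sq ft, 10 years old.
house = np.array([2000, 10])

# A linear model's prediction is the intercept plus the dot product
# of the coefficients with the feature values.
predicted_price = intercept + coefs @ house
print(f'Predicted price: {predicted_price:.2f}')
```

Each coefficient is the expected change in price for a one-unit change in that predictor while the other predictor is held fixed.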


## Step 6: Evaluate Prediction Performance

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Calculate **MSE** (mean squared error) and **R²** on both training and test sets to see how well our model fits and generalizes.

In [None]:
train_mse = XXX(XXX, XXX)
train_r2 = XXX(XXX, XXX)

test_mse = XXX(XXX, XXX)
test_r2 = XXX(XXX, XXX)

print(f'Train MSE: {train_mse:.2f}, R²: {train_r2:.4f}')
print(f'Test MSE: {test_mse:.2f}, R²: {test_r2:.4f}')

Train MSE: 788283182.72, R²: 0.9776
Test MSE: 491526134.40, R²: 0.9877
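
For reference, both metrics can be computed by hand. The sketch below uses plain NumPy on toy numbers that are made up for illustration (they are not from the housing data), and the variable names are placeholders rather than the ones used in this lab.

```python
import numpy as np

# Toy values, made up for illustration (not from the housing data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: average of the squared prediction errors.
mse = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MSE: {mse:.3f}, R²: {r2:.3f}')
```

Because MSE squares the errors, it is measured in squared units of the target (here, dollars squared), which is why the values printed above are so large.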


## Step 7: Fit Statsmodels Model (Inference)

We use **statsmodels** to get statistical information about the model:
- Coefficients for each predictor
- t-statistics and p-values (to test if each predictor is significant)
- F-statistic (to test if the model as a whole is significant)
- R² and adjusted R²

**In other words:** It gives us the results of the statistical tests needed for inference, which sklearn does not provide.

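**A note on `sm.OLS()`**: By default it does not include an intercept term. If you pass just `X_train`, the fitted model is forced through the origin (a predicted price of 0 when every predictor is 0), which is usually not what we want. `sm.add_constant` adds a column of ones to the predictors so that an intercept is estimated.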
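
The effect of a missing intercept can be sketched with plain NumPy. By default `sm.add_constant` prepends a column of ones; the example below builds that column by hand and compares the two least-squares fits. The data are toy values made up for illustration, and `np.linalg.lstsq` stands in for the OLS fit.

```python
import numpy as np

# Toy data, made up for illustration: y = 10 + 2*x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10 + 2 * x

# Without a constant column, least squares is forced through the origin.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a leading column of ones, the first coefficient is the intercept.
X_const = np.column_stack([np.ones(len(x)), x])
with_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print('No constant  :', slope_only)   # a single slope, no intercept
print('With constant:', with_const)   # [intercept, slope]
```

Without the constant, the slope has to absorb the missing intercept and ends up far from the true value of 2.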

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Run the code below to see the information about the model and each predictor. Is the overall model fitting well? Are both predictors important? How can you tell? Should any be removed?

In [None]:
X_sm = sm.add_constant(X_train)  # add intercept
model_sm = sm.OLS(y_train, X_sm).fit()
model_sm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.978 |
| Model: | OLS | Adj. R-squared: | 0.974 |
| Method: | Least Squares | F-statistic: | 283.2 |
| Date: | Wed, 11 Feb 2026 | Prob (F-statistic): | 1.91e-11 |
| Time: | 17:22:30 | Log-Likelihood: | -186.59 |
| No. Observations: | 16 | AIC: | 379.2 |
| Df Residuals: | 13 | BIC: | 381.5 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.105e+04 | 2.86e+04 | 0.386 | 0.706 | -5.08e+04 | 7.29e+04 |
| SquareFootage | 233.3488 | 9.913 | 23.539 | 0.000 | 211.932 | 254.765 |
| AgeOfHouse | -27.2070 | 534.747 | -0.051 | 0.960 | -1182.457 | 1128.043 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.291 | Durbin-Watson: | 1.859 |
| Prob(Omnibus): | 0.865 | Jarque-Bera (JB): | 0.45 |
| Skew: | 0.025 | Prob(JB): | 0.799 |
| Kurtosis: | 2.18 | Cond. No. | 10300.0 |
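
The `t` and confidence-interval columns of the summary follow directly from `coef` and `std err`. The sketch below recomputes them from the rounded values printed above. The critical value 2.160 is an assumption taken from a standard t-table (two-sided 95%, 13 residual degrees of freedom), so results match the summary only up to rounding.

```python
# Values copied from the coefficient table above: (coef, std err).
rows = {
    'const': (1.105e+04, 2.86e+04),
    'SquareFootage': (233.3488, 9.913),
    'AgeOfHouse': (-27.2070, 534.747),
}

# Assumption: 2.160 is the two-sided 95% critical value of a t distribution
# with 13 degrees of freedom (Df Residuals), taken from a standard t-table.
T_CRIT = 2.160

results = {}
for name, (coef, se) in rows.items():
    t_stat = coef / se  # t = coefficient / standard error
    ci = (coef - T_CRIT * se, coef + T_CRIT * se)
    results[name] = (t_stat, ci)
    print(f'{name:>13}: t = {t_stat:7.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})')
```

A 95% interval that contains 0 goes hand in hand with a p-value above 0.05 for that coefficient.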


### Notes on F-statistic
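- The F-statistic tests whether *at least one predictor is significantly related* to the target. The null hypothesis is that all slope coefficients are zero.
- If the F-statistic p-value is below 0.05, the model as a whole is statistically significant.
- When using ML for prediction, we usually don't need the F-statistic, but it matters for inference: it tells us whether the predictors, taken together, explain the target. To see **which individual predictors are important**, look at their t-statistics and p-values.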
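
For an OLS model with an intercept, the overall F-statistic can be recovered from R², the number of observations n, and the number of predictors p. The sketch below is a hand calculation, not what statsmodels runs internally; the function name is made up for illustration, and the inputs are the rounded values reported earlier in this lab.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic for an OLS model with an intercept."""
    explained = r2 / n_predictors                         # explained variance per predictor
    unexplained = (1 - r2) / (n_obs - n_predictors - 1)   # unexplained variance per residual df
    return explained / unexplained

# R² rounded to four decimals from Step 6; n = 16 training rows, p = 2 predictors.
f_value = f_statistic(0.9776, n_obs=16, n_predictors=2)
print(f'F ≈ {f_value:.1f}')
```

The result lands close to the 283.2 reported in the summary; the small gap comes from using a rounded R².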

### Notes on R-squared

**R²** tells us the proportion of variance in the target variable explained by the predictors.

- Problem with R²: When you add more predictors, R² on the training data never decreases, even if the new predictor is useless. This can make your model look better than it really is.

**Adjusted R²** fixes this by penalizing for extra predictors.

- Adjusted R² increases only if a new predictor actually improves the model more than expected by chance.
- If a predictor is useless, adjusted R² can decrease, signaling that adding it did not help.
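
Both points can be checked numerically. The sketch below first applies the adjusted R² formula to the values from the summary (R² = 0.9776, n = 16, p = 2), then fits a toy model with and without a pure-noise predictor. The toy data and helper names are made up for illustration, and the fit uses plain NumPy least squares rather than sklearn or statsmodels.

```python
import numpy as np

def r_squared(X, y):
    """Training R² of an OLS fit with an intercept (plain NumPy)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R²: penalizes R² for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Check against the summary: R² = 0.9776, n = 16, p = 2.
adj_from_summary = adjusted_r2(0.9776, 16, 2)
print(f'Adjusted R² from summary values: {adj_from_summary:.3f}')

# Toy data, made up for illustration: y depends on x only; noise is unrelated.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
noise = rng.normal(size=30)
y = 3 * x + rng.normal(size=30)

r2_one = r_squared(x.reshape(-1, 1), y)
r2_two = r_squared(np.column_stack([x, noise]), y)
print(f'R²:          {r2_one:.4f} -> {r2_two:.4f}')
print(f'Adjusted R²: {adjusted_r2(r2_one, 30, 1):.4f} -> {adjusted_r2(r2_two, 30, 2):.4f}')
```

The training R² of the larger model is never below that of the smaller one, while adjusted R² may move in either direction depending on how much the extra column happens to help.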

### Summary
- **Sklearn**: fast, for prediction (MSE, R²)
- **Statsmodels**: provides inference (coefficients, t-stats, p-values, F-statistic)
- **F-statistic**: tests whether the predictors, taken together, matter; t-stats and p-values show which individual predictors matter
- Always check train vs test metrics to see if the model generalizes well