# Homework: Multiple Linear Regression in Python

### Learning Objectives
After completing this homework, you will be able to:
- Fit and interpret multiple regression models using **statsmodels** and **scikit-learn**.
- Compute and interpret common goodness-of-fit measures.
- Compare model outputs across different implementations.


## 1. Dataset

We will use the **California Housing Dataset** from `sklearn.datasets`.  
It contains data on housing prices and district-level demographics in California.


In [2]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


The variable **`MedHouseVal`** is the **median house value** in $100,000s.  
We will predict it using a few numeric predictors.


## 2. Select a Subset of Variables

Use only the following features:

| Feature | Description |
|----------|--------------|
| MedInc | Median income in block group |
| AveRooms | Average number of rooms per household |
| AveBedrms | Average number of bedrooms per household |
| Population | Block group population |
| HouseAge | Median age of houses in the block group |


In [3]:
cols = ['MedHouseVal', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']
df_sub = df[cols]
df_sub.head()

| | MedHouseVal | MedInc | AveRooms | AveBedrms | Population | HouseAge |
|---|---|---|---|---|---|---|
| 0 | 4.526 | 8.3252 | 6.984127 | 1.02381 | 322.0 | 41.0 |
| 1 | 3.585 | 8.3014 | 6.238137 | 0.97188 | 2401.0 | 21.0 |
| 2 | 3.521 | 7.2574 | 8.288136 | 1.073446 | 496.0 | 52.0 |
| 3 | 3.413 | 5.6431 | 5.817352 | 1.073059 | 558.0 | 52.0 |
| 4 | 3.422 | 3.8462 | 6.281853 | 1.081081 | 565.0 | 52.0 |


## 3. Exploratory Analysis (10 points)
- Display summary statistics (`df_sub.describe()`).
- Compute the correlation matrix (`df_sub.corr()`).
- Briefly discuss which predictors you expect to have the strongest positive or negative relationship with house value.


> I think the correlation matrix is the most useful output here. The closer a coefficient is to 1 or -1, the stronger the positive or negative linear relationship between the two variables.
>
> The strongest positive correlation with `MedHouseVal` is:
> - MedInc (0.688075)
>
> Followed by:
> - AveRooms (0.151948)
> - HouseAge (0.105623)
>
> The negative correlations, AveBedrms (-0.046701) and Population (-0.02465), are weaker than the three mentioned above.

In [4]:
# Your code here
df_sub.describe()

| | MedHouseVal | MedInc | AveRooms | AveBedrms | Population | HouseAge |
|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 2.068558 | 3.870671 | 5.429 | 1.096675 | 1425.476744 | 28.639486 |
| std | 1.153956 | 1.899822 | 2.474173 | 0.473911 | 1132.462122 | 12.585558 |
| min | 0.14999 | 0.4999 | 0.846154 | 0.333333 | 3.0 | 1.0 |
| 25% | 1.196 | 2.5634 | 4.440716 | 1.006079 | 787.0 | 18.0 |
| 50% | 1.797 | 3.5348 | 5.229129 | 1.04878 | 1166.0 | 29.0 |
| 75% | 2.64725 | 4.74325 | 6.052381 | 1.099526 | 1725.0 | 37.0 |
| max | 5.00001 | 15.0001 | 141.909091 | 34.066667 | 35682.0 | 52.0 |


In [5]:
# Correlation matrix
df_sub.corr()

| | MedHouseVal | MedInc | AveRooms | AveBedrms | Population | HouseAge |
|---|---|---|---|---|---|---|
| MedHouseVal | 1.0 | 0.688075 | 0.151948 | -0.046701 | -0.02465 | 0.105623 |
| MedInc | 0.688075 | 1.0 | 0.326895 | -0.06204 | 0.004834 | -0.119034 |
| AveRooms | 0.151948 | 0.326895 | 1.0 | 0.847621 | -0.072213 | -0.153277 |
| AveBedrms | -0.046701 | -0.06204 | 0.847621 | 1.0 | -0.066197 | -0.077747 |
| Population | -0.02465 | 0.004834 | -0.072213 | -0.066197 | 1.0 | -0.296244 |
| HouseAge | 0.105623 | -0.119034 | -0.153277 | -0.077747 | -0.296244 | 1.0 |
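
To rank the predictors by the strength of their linear relationship with `MedHouseVal`, the target column of the correlation matrix can be sorted by absolute value. The sketch below hard-codes the values from the `MedHouseVal` row so that it runs on its own; in the notebook, the same Series would presumably come from `df_sub.corr()['MedHouseVal'].drop('MedHouseVal')`.

```python
import pandas as pd

# Correlations with MedHouseVal, copied from the correlation matrix output.
corr_with_target = pd.Series({
    "MedInc": 0.688075,
    "AveRooms": 0.151948,
    "AveBedrms": -0.046701,
    "Population": -0.02465,
    "HouseAge": 0.105623,
})

# Sort by absolute value so that a strong negative correlation is not buried.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.loc[order]
print(ranked)
```

The matrix also shows a correlation of 0.847621 between `AveRooms` and `AveBedrms`. Predictors that are this strongly correlated with each other can make individual regression coefficients harder to interpret.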


## 4. Multiple Regression using **statsmodels** (30 points)
Use `statsmodels.api.OLS` to fit a multiple regression model.


In [10]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

X = sm.add_constant(X)
model_sm = sm.OLS(y, X).fit()
print(model_sm.summary())

                            OLS Regression Results                            
Dep. Variable:            MedHouseVal   R-squared:                       0.538
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     4801.
Date:                Wed, 19 Nov 2025   Prob (F-statistic):               0.00
Time:                        10:25:51   Log-Likelihood:                -24278.
No. Observations:               20640   AIC:                         4.857e+04
Df Residuals:                   20634   BIC:                         4.862e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4407      0.028    -15.945      0.0

**Tasks:**
1. **Report the estimated regression equation.**
2. **Interpret the coefficients of **`MedInc`** and **`HouseAge`** in words.**

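>**`MedInc`**
> Holding the other predictors constant, a 1-unit increase in a block group's median income is associated with an increase of 0.5360 units in the predicted median house value. Since `MedHouseVal` is measured in units of 100,000 dollars, that is roughly 53,600 dollars.
>
>**`HouseAge`**
> Holding the other predictors constant, a 1-year increase in the median house age of a block group is associated with an increase of 0.0163 units in the predicted median house value (roughly 1,630 dollars).
>
>**Takeaway:**
> We cannot compare the *importance* of variables just by looking at the size of their coefficients, because they are measured on different **scales**.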

3. **What is the R-squared value?**
   
> R² = 0.538: together, the five predictors explain 53.8% of the variance in the target variable.
   
4. **Which predictor is statistically most significant?**
   
> `MedInc` is the most statistically significant predictor.
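
For Task 1, the estimated equation can be assembled from the fitted coefficients. The sketch below is an illustration under two assumptions: it takes the intercept (-0.4407) from the `const` row of the summary table, and it takes the slopes from the scikit-learn coefficient output in Section 5, which that section reports as matching statsmodels. In the notebook, `model_sm.params` should hold the same numbers directly.

```python
# Intercept from the statsmodels summary; slopes from the coefficient output
# in Section 5 (assumed identical across both libraries, as reported there).
intercept = -0.4407
slopes = {
    "MedInc": 5.36014757e-01,
    "AveRooms": -2.11185756e-01,
    "AveBedrms": 9.90813314e-01,
    "Population": 1.84789639e-05,
    "HouseAge": 1.63455751e-02,
}

# Print the estimated regression equation in readable form.
terms = " ".join(f"{coef:+.4g}*{name}" for name, coef in slopes.items())
print(f"MedHouseVal_hat = {intercept:.4f} {terms}")

# Apply the equation to the first row of the dataset (values from df.head()).
first_row = {
    "MedInc": 8.3252,
    "AveRooms": 6.984127,
    "AveBedrms": 1.02381,
    "Population": 322.0,
    "HouseAge": 41.0,
}
prediction = intercept + sum(slopes[name] * first_row[name] for name in slopes)
print(f"Predicted: {prediction:.3f}   Actual: 4.526")
```

Comparing the prediction with the observed value for a few rows is a quick sanity check that the equation was transcribed correctly.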


## 5. Multiple Regression using **scikit-learn** (30 points)
Now fit the same model using `LinearRegression`.


In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_sk = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

# Instantiate and fit the model
model_sk = LinearRegression()
model_sk.fit(X_sk, y)

# Generate predictions
y_pred = model_sk.predict(X_sk)

# Evaluate
print("R² Score: ", r2_score(y, y_pred))



R² Score:  0.5377839208402416
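
The adjusted R² that statsmodels reports can be recovered from this score with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch using the numbers from this notebook (n = 20640 and p = 5):

```python
r2 = 0.5377839208402416  # R² score printed by r2_score
n = 20640                # number of observations (count row of describe())
p = 5                    # number of predictors, excluding the intercept

# Adjusted R² penalizes R² for the number of predictors in the model.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R²: {adj_r2:.3f}")
```

With this many observations relative to predictors the adjustment is tiny, which is consistent with the statsmodels summary showing the same rounded value for R-squared and Adj. R-squared.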


In [14]:
print("Coefficients: ",model_sk.coef_)

Coefficients:  [ 5.36014757e-01 -2.11185756e-01  9.90813314e-01  1.84789639e-05
  1.63455751e-02]


**Tasks:**
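1. Compare coefficients and R² with the `statsmodels` output.
> The coefficient values are the same; only the notation differs, because scikit-learn prints them in scientific notation (for example, 5.36014757e-01 is 0.5360). The R² values also agree (0.5378 vs. 0.538 after rounding).

2. Are the results identical? Why or why not?
> Yes, up to rounding. Both libraries solve the same ordinary least-squares problem, so once I convert the scientific notation by moving the decimal point, the coefficients and R² match.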


## 6. Model Diagnostics (20 points)

1. Plot predicted vs. actual values.
2. Plot residuals vs. fitted values and discuss any visible patterns.
3. What would a “good” residual plot look like?
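
For the first task, a predicted-vs-actual plot could be sketched as follows. The helper name `plot_predicted_vs_actual` and the synthetic demo arrays are illustrative assumptions; in the notebook, one would pass `y` and `y_pred` from Section 5 instead.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_predicted_vs_actual(y_true, y_hat):
    """Scatter predictions against observed values with a 45-degree reference line."""
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_hat, alpha=0.3)
    # Points on the dashed line are predicted exactly.
    lo = min(np.min(y_true), np.min(y_hat))
    hi = max(np.max(y_true), np.max(y_hat))
    ax.plot([lo, hi], [lo, hi], color="red", linestyle="--")
    ax.set_xlabel("Actual Values")
    ax.set_ylabel("Predicted Values")
    ax.set_title("Predicted vs. Actual Values")
    return fig, ax

# Synthetic stand-ins (assumption); replace with y and y_pred in the notebook.
rng = np.random.default_rng(0)
y_demo = rng.uniform(0.5, 5.0, size=300)
y_hat_demo = y_demo + rng.normal(scale=0.5, size=300)
fig, ax = plot_predicted_vs_actual(y_demo, y_hat_demo)
```

The closer the points cluster around the dashed line, the better the predictions track the observed values.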


In [None]:
import matplotlib.pyplot as plt



In [None]:
residuals = y - y_pred
plt.scatter(y_pred, residuals, alpha=0.3)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Fitted Values")
plt.show()
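
As a numeric companion to the residual plot: in a well-behaved fit, residuals scatter around zero with roughly constant spread across the range of fitted values. The sketch below bins fitted values into quantile groups and reports the mean and standard deviation of the residuals in each bin. The helper name `residual_summary_by_bin` and the synthetic data are illustrative assumptions; in the notebook, one would pass `y_pred` and `residuals`.

```python
import numpy as np
import pandas as pd

def residual_summary_by_bin(fitted, resid, n_bins=5):
    """Mean and std of residuals within quantile bins of the fitted values."""
    frame = pd.DataFrame({"fitted": np.asarray(fitted), "resid": np.asarray(resid)})
    frame["bin"] = pd.qcut(frame["fitted"], q=n_bins, labels=False)
    return frame.groupby("bin")["resid"].agg(["mean", "std"])

# Synthetic residuals with constant spread by construction (assumption).
rng = np.random.default_rng(0)
fitted_demo = rng.uniform(0.5, 5.0, size=2000)
resid_demo = rng.normal(scale=0.7, size=2000)
summary = residual_summary_by_bin(fitted_demo, resid_demo)
print(summary)
```

Bin means that drift away from zero suggest a missed nonlinearity, and bin standard deviations that grow or shrink suggest non-constant variance. The maximum of `MedHouseVal` in the summary statistics is 5.00001, which suggests the target may be capped; if so, a straight diagonal edge in the residual plot would be expected.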

## 7. Interpretation & Discussion (10 points)
- Which predictors seem most influential on house prices?
- Do any results surprise you?
- What limitations might this model have?
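
One way to approach the first question is to put the coefficients on a common scale. A standardized coefficient multiplies each slope by the ratio of the predictor's standard deviation to the target's standard deviation, giving the change in `MedHouseVal` (in standard deviations) per one-standard-deviation change in the predictor. The sketch below hard-codes the slopes from the scikit-learn output in Section 5 and the standard deviations from `df_sub.describe()` in Section 3; it is an illustration, not the only way to judge influence.

```python
import numpy as np

names = ["MedInc", "AveRooms", "AveBedrms", "Population", "HouseAge"]
# Slopes from the scikit-learn coefficient output in Section 5.
coefs = np.array([5.36014757e-01, -2.11185756e-01, 9.90813314e-01,
                  1.84789639e-05, 1.63455751e-02])
# Standard deviations from the std row of df_sub.describe() in Section 3.
x_std = np.array([1.899822, 2.474173, 0.473911, 1132.462122, 12.585558])
y_std = 1.153956

# Standardized coefficient: change in y (in SDs) per one-SD change in x.
beta_std = coefs * x_std / y_std
for name, b in sorted(zip(names, beta_std), key=lambda t: -abs(t[1])):
    print(f"{name:>10}: {b:+.3f}")
```

On this scale the raw coefficient of `AveBedrms`, which is the largest in absolute size, no longer dominates, because `AveBedrms` varies much less than `MedInc`.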


### 🔍 Bonus (optional)
Try removing one or two predictors and observe how the coefficients and R² change.  
Does omitting a variable change the interpretation of others?
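
The effect of omitting a correlated predictor can be illustrated on synthetic data before trying it on the housing data. The sketch below is a NumPy-only illustration with made-up variables `x1` and `x2` (assumptions, correlated at about 0.85, similar to `AveRooms` and `AveBedrms` in the correlation matrix); it fits the model with and without `x2` and compares the coefficient of `x1`.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients with the intercept in position 0."""
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
# x2 is built to correlate with x1 at roughly 0.85.
x2 = 0.85 * x1 + np.sqrt(1 - 0.85**2) * rng.normal(size=n)
y = 1.0 - 0.5 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

full = ols_coefs(np.column_stack([x1, x2]), y)
reduced = ols_coefs(x1.reshape(-1, 1), y)  # x2 omitted

print("x1 coefficient, full model:", round(full[1], 3))
print("x1 coefficient, x2 omitted:", round(reduced[1], 3))
```

When `x2` is dropped, the coefficient of `x1` absorbs part of the effect of `x2`, so its size and even its sign can change. The same mechanism may apply when removing `AveBedrms` or `AveRooms` from the housing model.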


### 🧾 Deliverables
- A **Jupyter Notebook** with code, output, and short written explanations.
- Figures should have labeled axes and titles.
- Write answers in markdown cells below your code.
