# Homework: Multiple Linear Regression in Python

### Learning Objectives
After completing this homework, you will be able to:
- Fit and interpret multiple regression models using **statsmodels** and **scikit-learn**.
- Compute and interpret common goodness-of-fit measures.
- Compare model outputs across different implementations.


## 1. Dataset

We will use the **California Housing Dataset** from `sklearn.datasets`.  
It contains housing prices and demographics for census block groups in California.


In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

The target variable **`MedHouseVal`** is the **median house value**, expressed in units of $100,000.  
We will predict it using a few numeric predictors.


## 2. Select a Subset of Variables

Use only the following features:

| Feature | Description |
|----------|--------------|
| MedInc | Median income in block group |
| AveRooms | Average number of rooms per household |
| AveBedrms | Average number of bedrooms per household |
| Population | Block group population |
| HouseAge | Median age of houses in the block group |


In [None]:
cols = ['MedHouseVal', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']
df_sub = df[cols]
df_sub.head()

## 3. Exploratory Analysis (10 points)
- Display summary statistics (`df_sub.describe()`).
- Compute the correlation matrix (`df_sub.corr()`).
- Briefly discuss which predictors you expect to have the strongest positive or negative relationship with house value.


In [None]:
# Your code here
df_sub.describe()

In [None]:
# Correlation matrix
df_sub.corr()
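
A full correlation matrix can be hard to scan. The sketch below shows one way to pull out only the correlations with the target and sort them. It uses a small simulated table, not the housing data; the names `demo`, `a`, `b`, and `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Simulated data (illustrative only): "target" depends positively on "a"
# and negatively on "b".
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
demo["target"] = 3.0 * demo["a"] - 1.0 * demo["b"] + rng.normal(size=n)

# Take the "target" column of the correlation matrix, drop the trivial
# self-correlation, and sort from most positive to most negative.
corr_with_target = (
    demo.corr()["target"].drop("target").sort_values(ascending=False)
)
print(corr_with_target)
```

The same pattern should carry over to `df_sub` with `'MedHouseVal'` as the target column. Keep in mind that pairwise correlations ignore the other predictors, so they may not match the signs or sizes of the regression coefficients later on.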

## 4. Multiple Regression using **statsmodels** (30 points)
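Use `statsmodels.api.OLS` to fit a multiple regression model of `MedHouseVal` on the five predictors. Note that `OLS` does not add an intercept automatically; include one with `sm.add_constant`.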


In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf



**Tasks:**
1. Report the estimated regression equation.
2. Interpret the coefficients of **`MedInc`** and **`HouseAge`** in words.
3. What is the R-squared value?
4. Which predictor is statistically most significant?


## 5. Multiple Regression using **scikit-learn** (30 points)
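Now fit the same model using `LinearRegression`. Store the predictors in `X`, the target in `y`, and the fitted model in `model_sk`; the R² cell below relies on these names.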


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score



In [None]:
# Compute R²
y_pred = model_sk.predict(X)
r2_score(y, y_pred)

**Tasks:**
1. Compare coefficients and R² with the `statsmodels` output.
2. Are the results identical? Why or why not?


## 6. Model Diagnostics (20 points)
1. Plot predicted vs. actual values.
2. Plot residuals vs. fitted values and discuss any visible patterns.
3. What would a “good” residual plot look like?
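
To build intuition for reading a residual plot, the sketch below fits a straight line to two simulated datasets: one where the true relationship really is linear and one with an added quadratic term. Everything here (the data, the helper `fit_line`, the panel titles) is an illustrative assumption, not part of the housing analysis.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
noise = rng.normal(scale=1.0, size=500)

def fit_line(x, y):
    """Fit a least-squares straight line; return fitted values and residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    return fitted, y - fitted

y_linear = 1.0 + 2.0 * x + noise                # a line is the right model
y_curved = 1.0 + 2.0 * x + 1.5 * x**2 + noise   # a line misses the curvature

fitted_ok, resid_ok = fit_line(x, y_linear)
fitted_bad, resid_bad = fit_line(x, y_curved)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [
    (axes[0], fitted_ok, resid_ok, "Linear truth"),
    (axes[1], fitted_bad, resid_bad, "Quadratic truth, linear fit"),
]
for ax, fitted, resid, title in panels:
    ax.scatter(fitted, resid, alpha=0.3)
    ax.axhline(0, color="red", linestyle="--")
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title(title)
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. Compare the shape of the point cloud in the two panels, then look for similar features in the residual plot for the housing model.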


In [None]:
import matplotlib.pyplot as plt



In [None]:
residuals = y - y_pred
plt.scatter(y_pred, residuals, alpha=0.3)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Fitted Values")
plt.show()

## 7. Interpretation & Discussion (10 points)
- Which predictors seem most influential on house prices?
- Do any results surprise you?
- What limitations might this model have?


### 🔍 Bonus (optional)
Try removing one or two predictors and observe how the coefficients and R² change.  
Does omitting a variable change the interpretation of others?
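
The effect of dropping a predictor can be previewed on simulated data where the true coefficients are known. In the sketch below, `x2` is constructed to be correlated with `x1`; the helper `ols` and all variable names are hypothetical, and plain NumPy least squares stands in for either library.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
y_sim = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(columns, y):
    """Least-squares coefficients, with the intercept in position 0."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

full = ols([x1, x2], y_sim)   # intercept, x1, x2
reduced = ols([x1], y_sim)    # intercept, x1 (x2 omitted)

print("x1 coefficient, full model:", full[1])
print("x1 coefficient, x2 omitted:", reduced[1])
```

When `x2` is left out, the `x1` coefficient also picks up the part of `x2`'s effect that moves together with `x1`. Predictors in the housing data such as `AveRooms` and `AveBedrms` may behave similarly if they are strongly correlated.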


### 🧾 Deliverables
- A **Jupyter Notebook** with code, output, and short written explanations.
- Figures should have labeled axes and titles.
- Write answers in markdown cells below your code.
