# Chapter 2.2: Linear Regression

Goal: Fit linear regression models and interpret coefficients in plain English.

### Topics:
- Fitting simple (1 feature) and multiple regression
- Visualizing the regression line
- Extracting and interpreting coefficients
- Writing interpretations in plain language

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

## Quick Recap

- Linear regression minimizes **MSE** (mean squared error) to find the "best" line
- The **coefficient** (slope) tells you how much the predicted Y changes for every 1-unit increase in X
- The **intercept** is the predicted Y when all features are 0
- With multiple features, interpret each coefficient as "holding other features constant"
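To make the recap concrete, here is a minimal sketch on made-up, noise-free data. The arrays `x_toy` and `y_toy`, and the true slope of 3 and intercept of 2, are assumptions for illustration and have nothing to do with the housing data. Because the points lie exactly on a line, the fitted coefficient and intercept should recover those values and the MSE should be essentially zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 3x + 2 with no noise
x_toy = np.arange(10, dtype=float).reshape(-1, 1)  # 2D: 10 rows, 1 feature
y_toy = 3.0 * x_toy.ravel() + 2.0

toy_model = LinearRegression().fit(x_toy, y_toy)

toy_slope = toy_model.coef_[0]         # change in predicted y per 1-unit increase in x
toy_intercept = toy_model.intercept_   # predicted y when x is 0
toy_mse = np.mean((toy_model.predict(x_toy) - y_toy) ** 2)

print(f"slope={toy_slope:.2f}, intercept={toy_intercept:.2f}, MSE={toy_mse:.2e}")
```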

In [None]:
# Load the California Housing data
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()

In [None]:
# Quick look at the feature descriptions
print(housing.DESCR[:1500])

## Practice

### 1. Fit simple regression: MedInc → MedHouseVal

Predict median house value using only median income.
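The cell below asks for a 2D `X`. As a hedged illustration of why the double brackets matter, this sketch uses a small made-up DataFrame (`toy`, with hypothetical columns `income` and `value`) rather than the housing data. Single brackets return a 1D Series, while double brackets return a one-column DataFrame, which is the shape scikit-learn estimators expect for features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the California Housing dataset
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "value": [1.5, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0, 9.2, 9.9],
})

toy_X_1d = toy["income"]     # Series, 1D
toy_X_2d = toy[["income"]]   # DataFrame, 2D with a single column
toy_y = toy["value"]

# 80/20 split; random_state is fixed only so the sketch is reproducible
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X_2d, toy_y, test_size=0.2, random_state=42
)

print(toy_X_1d.shape, toy_X_2d.shape, toy_X_tr.shape, toy_X_te.shape)
```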

In [None]:
# Step 1: Prepare X (just MedInc) and y (MedHouseVal)
# Note: X needs to be 2D, so use df[['MedInc']] with double brackets


# Step 2: Split into train/test (80/20)


# Step 3: Create and fit LinearRegression


### 2. Plot the data with the regression line overlaid

In [None]:
# Step 1: Create scatter plot of MedInc vs MedHouseVal (use test data)
# Step 2: Create a line using model predictions
# Step 3: Add labels and title

plt.figure(figsize=(10, 6))

# Scatter plot of actual data


# Regression line - predict on a range of MedInc values
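# Keep the MedInc column name so it matches the DataFrame the model was fit on
X_line = pd.DataFrame({'MedInc': np.linspace(X_test['MedInc'].min(), X_test['MedInc'].max(), 100)})
y_line = model.predict(X_line)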


plt.xlabel('Median Income (in $10,000s)')
plt.ylabel('Median House Value (in $100,000s)')
plt.title('House Value vs. Median Income')
plt.legend()
plt.show()

### 3. What does the coefficient mean in plain English?

Extract the coefficient and intercept, then write an interpretation.

In [None]:
# Extract coefficient and intercept
coef = model.coef_[0]
intercept = model.intercept_

print(f"Coefficient: {coef:.4f}")
print(f"Intercept: {intercept:.4f}")

**Your interpretation:** Fill in the blanks:

"For every $10,000 increase in median income (a 1-unit increase in MedInc), the predicted median house value increases by approximately ______ units. Since MedHouseVal is measured in $100,000s, that is about $______ (multiply the coefficient by 100,000)."

(Write your full interpretation here)
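The unit conversion in the interpretation above is plain arithmetic and can be sketched as follows. The value 0.5 is a hypothetical placeholder, not the fitted result; substitute your own `coef`.

```python
# Hypothetical coefficient for illustration only; replace with your fitted coef
example_coef = 0.5

# MedInc is in $10,000s and MedHouseVal is in $100,000s,
# so one unit of the coefficient corresponds to $100,000 of house value
dollars_per_10k_income = example_coef * 100_000

print(
    "Each extra $10,000 of median income is associated with about "
    f"${dollars_per_10k_income:,.0f} more in predicted median house value."
)
```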

### 4. Fit multiple regression with 4 features of your choice

In [None]:
# Step 1: Choose 4 features (look at df.columns for options)
features = ['MedInc', 'HouseAge', 'AveRooms', 'Population']

# Step 2: Prepare X and y


# Step 3: Split into train/test


# Step 4: Fit the model


### 5. Which feature has the largest coefficient? Smallest?

In [None]:
# Create a DataFrame showing feature names and their coefficients
coef_df = pd.DataFrame({
    'Feature': features,
    'Coefficient': model_multi.coef_
})

# Sort by absolute value of coefficient
coef_df['Abs_Coef'] = coef_df['Coefficient'].abs()
coef_df.sort_values('Abs_Coef', ascending=False)

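**Note:** Be careful comparing coefficients directly! The features are on different scales, so a larger coefficient does not necessarily mean a more important feature. A coefficient of 0.5 for MedInc (measured in $10,000s) means something very different from a coefficient of 0.5 for Population (measured in people).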
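One way to see the scale issue is to fit the same relationship twice with the feature expressed in different units. This is a minimal sketch on made-up, noise-free data; the `people` array and the slope of 0.002 per person are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y rises by 0.002 per person
people = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
y_scale = 0.002 * people + 1.0

# Same information, two different units for the feature
coef_per_person = LinearRegression().fit(people.reshape(-1, 1), y_scale).coef_[0]
coef_per_thousand = LinearRegression().fit((people / 1000).reshape(-1, 1), y_scale).coef_[0]

print(f"per person: {coef_per_person:.4f}, per 1,000 people: {coef_per_thousand:.4f}")
```

Both fits describe the same line and give the same predictions. Only the unit of the feature changes, and the size of the coefficient changes with it.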

### 6. Write interpretation: "For every unit increase in X, Y changes by..."

Pick TWO of your features and write full interpretations.

**Feature 1 interpretation:**

"Holding all other features constant, for every 1-unit increase in _______, the median house value changes by approximately _______ (in $100,000s)."

(Write your interpretation here)

**Feature 2 interpretation:**

(Write your interpretation here)

## Discussion Question

Why do we say "holding other features constant" when interpreting multiple regression coefficients? What could go wrong if we didn't add that caveat?

(Discuss with a neighbor)
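As a possible starting point for the discussion, here is a hedged sketch on synthetic data (the variables `x1`, `x2`, and `y_syn` and all the numbers are made up). The two features are correlated, and the true effect of `x1` with `x2` held fixed is 1.0. Compare the coefficient on `x1` when `x2` is in the model with the coefficient when `x2` is left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)  # x2 tends to rise with x1
y_syn = 1.0 * x1 + 2.0 * x2                 # effect of x1 is 1.0, holding x2 fixed

# Multiple regression: x1's coefficient is its effect with x2 held constant
both = LinearRegression().fit(np.column_stack([x1, x2]), y_syn)

# Simple regression: x1's coefficient also absorbs the effect of the omitted x2
alone = LinearRegression().fit(x1.reshape(-1, 1), y_syn)

print(f"x1 coefficient with x2 in the model: {both.coef_[0]:.2f}")
print(f"x1 coefficient with x2 left out:     {alone.coef_[0]:.2f}")
```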