# Chapter 2.4: Residual Analysis

Goal: Create residual plots and diagnose assumption violations.

### Topics:
- Calculating and plotting residuals
- Identifying patterns in residual plots
- Diagnosing which assumptions are violated
- Checking normality of residuals

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

## Quick Recap

- **Residuals** = Actual - Predicted (how wrong was the model?)
- Good residual plot: random scatter around 0, no patterns
- **Funnel shape** → heteroscedasticity (variance not constant)
- **Curved pattern** → non-linearity (need polynomial terms or transformation)
- **Clusters** → possibly a missing categorical variable
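
The patterns in the list above can also be reproduced numerically. The sketch below builds three synthetic residual sets (not from the housing data) and summarizes each with two rough indicators. The helper names `spread_trend` and `curvature_trend` are hypothetical, introduced only for this illustration; they are not standard diagnostics from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
fitted = rng.uniform(1.0, 10.0, size=n)  # stand-in for model predictions

# Three synthetic residual patterns (illustrative only)
random_scatter = rng.normal(0.0, 1.0, size=n)           # constant spread
funnel = rng.normal(0.0, 1.0, size=n) * fitted / 5.0    # spread grows with the fitted value
curved = 0.3 * (fitted - fitted.mean()) ** 2 + rng.normal(0.0, 1.0, size=n)
curved = curved - curved.mean()                         # U-shape centred on zero

def spread_trend(fitted, resid):
    """Correlation between fitted values and |residuals| (rough funnel indicator)."""
    return np.corrcoef(fitted, np.abs(resid))[0, 1]

def curvature_trend(fitted, resid):
    """Correlation between residuals and squared, centred fitted values (rough U-shape indicator)."""
    centred_sq = (fitted - fitted.mean()) ** 2
    return np.corrcoef(centred_sq, resid)[0, 1]

for name, resid in [("random", random_scatter), ("funnel", funnel), ("curved", curved)]:
    print(f"{name:>7}: spread trend = {spread_trend(fitted, resid):+.2f}, "
          f"curvature trend = {curvature_trend(fitted, resid):+.2f}")
```

A spread trend near zero together with a curvature trend near zero is what a healthy residual plot would tend to produce; a plot is still the primary tool, and these numbers are only a complement to it.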

In [None]:
# Load and prepare data
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Use features 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population' to predict 'MedHouseVal'


# 80/20 train/test split


# Fit the model


# Get predictions on test set


# Print out test R²


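The split, fit, predict, and score steps can be sketched on a small synthetic dataset. This is not the California housing data, and the variable names (`X_demo`, `demo_model`, and so on) are placeholders chosen so they do not clash with the `y_test` and `y_pred` used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the housing data): y depends linearly on two features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(0.0, 0.5, size=500)

# 80/20 train/test split with a fixed seed for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit on the training portion only
demo_model = LinearRegression().fit(X_tr, y_tr)

# Predict on the held-out portion and report test R²
y_hat = demo_model.predict(X_te)
print(f"Test R²: {demo_model.score(X_te, y_te):.3f}")
```

For the exercise itself, the features would be the five listed columns of `df` and the target would be `df['MedHouseVal']`.
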
## Practice

### 1. Fit regression, calculate residuals: `y_test - y_pred`

In [None]:
# Calculate residuals
residuals = y_test - y_pred

# Calculate and interpret mean of residuals

**What do we learn from looking at the mean residual?**

(Write your answer here)
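
One fact worth keeping in mind when interpreting the mean: for ordinary least squares with an intercept, residuals on the training data average to zero by construction, so only the mean on held-out data (as in this exercise) or in a model without an intercept carries information about systematic over- or under-prediction. A minimal sketch on synthetic data, using plain NumPy least squares rather than the notebook's model:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y = 4.0 + 1.5 * x + rng.normal(0.0, 1.0, size=200)   # synthetic data, true intercept of 4

# Least-squares fit WITH an intercept column
A_with = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(A_with, y, rcond=None)
resid_with = y - A_with @ coef_with

# Least-squares fit WITHOUT an intercept (line forced through the origin)
A_without = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(A_without, y, rcond=None)
resid_without = y - A_without @ coef_without

print(f"mean residual, with intercept:    {resid_with.mean():+.6f}")
print(f"mean residual, without intercept: {resid_without.mean():+.6f}")
```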

### 2. Create scatter plot: fitted values (x) vs residuals (y)

In [None]:
# Step 1: Scatter plot with fitted values on x-axis, residuals on y-axis


# Step 2: Add labels and title


**What do we learn from this graph?**

(Write your answer here)

### 3. Add a horizontal line at y=0 for reference

In [None]:
# Re-create the scatter plot from step 2, then add a horizontal line at y=0 (e.g. plt.axhline(0))


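Steps 2 and 3 combined might look like the sketch below. It uses hypothetical arrays `fitted_demo` and `resid_demo` in place of the notebook's `y_pred` and `residuals`, and draws the scatter and the reference line in the same figure, since a notebook cell normally starts a fresh figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
fitted_demo = rng.uniform(0.5, 5.0, size=300)   # hypothetical fitted values
resid_demo = rng.normal(0.0, 0.6, size=300)     # hypothetical residuals

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(fitted_demo, resid_demo, alpha=0.4, s=12)
ax.axhline(0, color="red", linestyle="--", linewidth=1)  # reference line at zero
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals (actual - predicted)")
ax.set_title("Residuals vs fitted values")
fig.tight_layout()
```

Low `alpha` and small markers are a design choice for dense data: with thousands of test points, overplotting can hide the very patterns the plot is meant to reveal.
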
### 4. Create histogram of residuals - does it look normal?

In [None]:
# Create histogram of residuals


**Question:** Does the histogram look approximately normal (bell-shaped)? Or is it skewed? What does this tell us?

(Write your answer here)
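
A visual judgement of the histogram can be complemented with two summary numbers. The sketch below computes sample skewness and excess kurtosis with NumPy on two synthetic samples; `skew_and_excess_kurtosis` is a hypothetical helper written for this illustration, and both statistics are 0 for an exactly normal distribution.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Sample skewness and excess kurtosis (both are 0 for a normal distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(3)
normal_like = rng.normal(0.0, 1.0, size=5000)           # symmetric, bell-shaped
right_skewed = rng.exponential(1.0, size=5000) - 1.0    # long right tail, mean near 0

for name, sample in [("normal-like", normal_like), ("right-skewed", right_skewed)]:
    skew, kurt = skew_and_excess_kurtosis(sample)
    print(f"{name:>12}: skewness = {skew:+.2f}, excess kurtosis = {kurt:+.2f}")
```

Applied to the notebook's `residuals`, a clearly positive skewness would point to a long right tail, meaning the model under-predicts some observations by a large margin.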

### 5. Describe any patterns you see in the residual plot

Look back at your residuals vs fitted values plot. Answer these questions:

**a) Is there a funnel shape (wider spread on one side)?**

(Write your observation here)

**b) Is there a curved pattern (U-shape or arch)?**

(Write your observation here)

**c) Are there any obvious clusters or groups?**

(Write your observation here)

**d) Is there a straight band or edge of points (horizontal or diagonal) that stands out?**

(Write your observation here)

### 6. Based on patterns, which assumption might be violated?

Match the patterns to assumptions:
- Funnel shape → **Homoscedasticity** violated (variance not constant)
- Curved pattern → **Linearity** violated (relationship is non-linear)
- Non-normal histogram → **Normality** violated
- Clusters → Possible **independence** issues or missing variables

**Your diagnosis:** Based on your plots, which assumption(s) appear to be violated?

(Write your diagnosis here)

**What would you recommend doing to fix this?**

(Write your recommendation here)

## Discussion Question

You fit a model and the R² is 0.85, which seems great. But the residual plot shows a clear curved pattern. Should you trust this model? Why or why not?

(Discuss with a neighbor)
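
To ground the discussion, the sketch below shows that a high R² and a curved residual pattern can coexist. It fits a straight line and then a quadratic to synthetic data generated from `y = x²` plus noise; the data, the helper `fit_and_score`, and the thirds-based summary are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 400)
y = x ** 2 + rng.normal(0.0, 3.0, size=x.size)   # truly quadratic relationship (synthetic)

def fit_and_score(design, y):
    """Least-squares fit with an intercept column; returns residuals and R²."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return resid, r2

ones = np.ones_like(x)
resid_lin, r2_lin = fit_and_score(np.column_stack([ones, x]), y)
resid_quad, r2_quad = fit_and_score(np.column_stack([ones, x, x ** 2]), y)

# Mean residual in the outer thirds vs the middle third of x exposes a U-shape
outer = (x < 10 / 3) | (x > 20 / 3)
middle = ~outer
for name, resid, r2 in [("linear", resid_lin, r2_lin), ("quadratic", resid_quad, r2_quad)]:
    print(f"{name:>9}: R² = {r2:.3f}, mean residual outer = {resid[outer].mean():+.2f}, "
          f"middle = {resid[middle].mean():+.2f}")
```

In this setup the straight-line fit scores an R² above 0.9 yet over-predicts in the middle of the range and under-predicts at both ends, which is exactly the kind of systematic error a residual plot reveals and R² alone does not.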