# Chapter 2.4: Residual Analysis

Goal: Create residual plots and diagnose assumption violations.

### Topics:
- Calculating and plotting residuals
- Identifying patterns in residual plots
- Diagnosing which assumptions are violated
- Checking normality of residuals

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

## Quick Recap

- **Residuals** = Actual - Predicted (how wrong was the model?)
- Good residual plot: random scatter around 0, no patterns
- **Funnel shape** → heteroscedasticity (variance not constant)
- **Curved pattern** → non-linearity (need polynomial terms or transformation)
- **Clusters** → possibly a missing categorical variable
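
To make the funnel-shape idea concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the housing dataset). It builds data whose noise grows with `x`, fits a straight line, and compares the residual spread in the lower and upper halves of the fitted values. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noise grows with x
x = rng.uniform(1, 10, size=2000)
y = 2.0 * x + rng.normal(0, 0.5 * x)  # noise standard deviation proportional to x

# Fit a straight line and compute residuals = actual - predicted
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# Compare residual spread in the lower and upper halves of the fitted values
cutoff = np.median(fitted)
spread_low = residuals[fitted <= cutoff].std()
spread_high = residuals[fitted > cutoff].std()

print(f"Residual std, low fitted values:  {spread_low:.3f}")
print(f"Residual std, high fitted values: {spread_high:.3f}")
print(f"Ratio (high / low): {spread_high / spread_low:.2f}")
```

A ratio well above 1 is the numeric counterpart of a funnel that opens to the right. A ratio near 1 is what constant variance would look like.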

In [None]:
# Load and prepare data
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Use multiple features
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population']
X = df[features]
y = df['MedHouseVal']

# Split and fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Get predictions on test set
y_pred = model.predict(X_test)

print(f"Model R² on test: {model.score(X_test, y_test):.4f}")

## Practice

### 1. Fit regression, calculate residuals: `y_test - y_pred`

In [None]:
# Calculate residuals
residuals = y_test - y_pred

# Quick summary
print(f"Residual mean: {residuals.mean():.6f}")  # Near 0 if predictions are unbiased; exactly 0 only on the training data
print(f"Residual std: {residuals.std():.4f}")
print(f"Residual min: {residuals.min():.4f}")
print(f"Residual max: {residuals.max():.4f}")
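
A note on the residual mean: for ordinary least squares with an intercept, the training residuals average to zero by construction, while test residuals only land near zero. The sketch below illustrates this with synthetic data and plain NumPy least squares (both are assumptions for illustration; the cell above uses scikit-learn on the housing data).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical, for illustration only)
X = rng.normal(size=(500, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

# Simple 80/20 split
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Ordinary least squares with an explicit intercept column
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

A_test = np.column_stack([np.ones(len(X_test)), X_test])
train_resid = y_train - A_train @ coef
test_resid = y_test - A_test @ coef

print(f"Train residual mean: {train_resid.mean():.2e}")
print(f"Test residual mean:  {test_resid.mean():.4f}")
```

The training mean should be zero up to floating-point error. The test mean is usually small but nonzero, so a test residual mean far from zero hints at systematic over- or under-prediction.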

### 2. Create scatter plot: fitted values (x) vs residuals (y)

In [None]:
# Step 1: Create figure
plt.figure(figsize=(10, 6))

# Step 2: Scatter plot with fitted values on x-axis, residuals on y-axis
# Use plt.scatter(y_pred, residuals, alpha=0.5)


# Step 3: Add labels and title
plt.xlabel('Fitted Values (Predicted)')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()

### 3. Add a horizontal line at y=0 for reference

In [None]:
# Recreate the plot with a reference line
plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(y_pred, residuals, alpha=0.5)

# Add horizontal line at y=0
# Use plt.axhline(y=0, color='red', linestyle='--')


plt.xlabel('Fitted Values (Predicted)')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()
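
For comparison, here is one possible shape for the finished plot, sketched on synthetic stand-in values so it runs on its own. The names `fitted_demo` and `residuals_demo` are hypothetical; in the notebook you would pass `y_pred` and `residuals` instead.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Synthetic stand-ins for y_pred and residuals (assumed values, not the housing data)
fitted_demo = rng.uniform(0, 5, size=500)
residuals_demo = rng.normal(0, 0.7, size=500)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(fitted_demo, residuals_demo, alpha=0.5)          # fitted on x, residuals on y
zero_line = ax.axhline(y=0, color='red', linestyle='--')    # reference line at zero
ax.set_xlabel('Fitted Values (Predicted)')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs Fitted Values')
plt.show()
```

Because these stand-in residuals are pure noise, the plot shows the "good" case: an even band of points centered on the dashed line.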

### 4. Create histogram of residuals - does it look normal?

In [None]:
# Step 1: Create histogram of residuals
plt.figure(figsize=(10, 6))

# Use sns.histplot(residuals, kde=True) or plt.hist(residuals, bins=50)


plt.xlabel('Residual Value')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()

**Question:** Does the histogram look approximately normal (bell-shaped)? Or is it skewed?

(Write your answer here)
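
If judging the shape by eye feels uncertain, skewness and excess kurtosis give a rough numeric check. The sketch below uses `scipy.stats.skew` and `scipy.stats.kurtosis` on two synthetic samples (assumptions for illustration); in the notebook you could pass `residuals` instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic samples (assumptions for illustration)
symmetric = rng.normal(0, 1, size=5000)                 # bell-shaped
right_skewed = rng.exponential(1.0, size=5000) - 1.0    # long right tail, mean near 0

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    skew = stats.skew(sample)
    excess_kurt = stats.kurtosis(sample)  # Fisher definition: 0 for a normal distribution
    print(f"{name:>12}: skewness = {skew:.2f}, excess kurtosis = {excess_kurt:.2f}")
```

Values near zero on both measures are consistent with a normal shape. Clearly positive skewness points to a long right tail, and clearly positive excess kurtosis points to heavier tails than a normal distribution.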

### 5. Describe any patterns you see in the residual plot

Look back at your residuals vs fitted values plot. Answer these questions:

**a) Is there a funnel shape (wider spread on one side)?**

(Write your observation here)

**b) Is there a curved pattern (U-shape or arch)?**

(Write your observation here)

**c) Are there any obvious clusters or groups?**

(Write your observation here)

**d) Is there a straight edge or line of points (for example, a diagonal band, which can appear when the target variable is capped at a maximum value)?**

(Write your observation here)

### 6. Based on patterns, which assumption might be violated?

Match the patterns to assumptions:
- Funnel shape → **Homoscedasticity** violated (variance not constant)
- Curved pattern → **Linearity** violated (relationship is non-linear)
- Non-normal histogram → **Normality** violated
- Clusters → Possible **independence** issues or missing variables
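
The cluster case can be hard to picture, so here is a minimal sketch on synthetic data (hypothetical, not the housing dataset). Two groups share the same slope but have different baselines. Leaving the group indicator out of the model splits the residuals into two clusters, and including it removes the split.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Synthetic data (hypothetical): two groups with different baselines
x = rng.uniform(0, 10, size=n)
group = rng.integers(0, 2, size=n)          # 0 or 1, the "missing" categorical variable
y = 1.0 + 0.8 * x + 4.0 * group + rng.normal(0, 0.5, size=n)

def ols_residuals(design, target):
    """Fit least squares on the design matrix and return actual - predicted."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

ones = np.ones(n)
resid_without = ols_residuals(np.column_stack([ones, x]), y)
resid_with = ols_residuals(np.column_stack([ones, x, group]), y)

# Without the group indicator, residuals separate into two clusters
gap = resid_without[group == 1].mean() - resid_without[group == 0].mean()
print(f"Gap between cluster means (group omitted): {gap:.2f}")
print(f"Residual std, group omitted:  {resid_without.std():.2f}")
print(f"Residual std, group included: {resid_with.std():.2f}")
```

In real data the grouping variable is often not recorded, so clusters in a residual plot are a prompt to look for one, not proof that one exists.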

**Your diagnosis:** Based on your plots, which assumption(s) appear to be violated?

(Write your diagnosis here)

**What would you recommend doing to fix this?**

(Write your recommendation here)

## Bonus: Additional Diagnostic Plots

In [None]:
# Q-Q plot to check normality more precisely
from scipy import stats

fig, ax = plt.subplots(figsize=(8, 6))
stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title('Q-Q Plot of Residuals')
plt.show()

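**Interpreting the Q-Q plot:** If the residuals are normally distributed, the points fall close to the diagonal reference line. Ends that bend away from the line in opposite directions (below on the left, above on the right) suggest heavy tails or outliers. Ends that bend away on the same side of the line suggest skewness.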
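
The same function can also summarize the Q-Q plot as a number. Called without `plot=...`, `stats.probplot` returns the quantile pairs plus the slope, intercept, and correlation `r` of the fitted line. The sketch below compares `r` for a normal sample and a heavy-tailed sample; both samples are synthetic assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic samples (assumptions for illustration)
normal_sample = rng.normal(0, 1, size=2000)
heavy_tailed = rng.standard_t(df=3, size=2000)   # heavier tails than a normal

# Without plot=..., probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_heavy) = stats.probplot(heavy_tailed, dist="norm")

print(f"Q-Q correlation, normal sample:       {r_normal:.4f}")
print(f"Q-Q correlation, heavy-tailed sample: {r_heavy:.4f}")
```

A correlation very close to 1 means the points hug the line. A noticeably lower value means the tails pull away from it, which is worth confirming on the plot itself.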

## Discussion Question

You fit a model and the R² is 0.85, which seems great. But the residual plot shows a clear curved pattern. Should you trust this model? Why or why not?

(Discuss with a neighbor)
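
One way to explore this question after the discussion is a small experiment on synthetic data (hypothetical, for illustration). The true relationship below is quadratic, yet a straight-line fit still earns a high R². The mean residual in the low, middle, and high ranges of `x` shows the curve that R² hides.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data (hypothetical): the true relationship is quadratic
x = np.linspace(0, 10, 500)
y = x**2 + rng.normal(0, 3.0, size=x.size)

# Fit a straight line anyway
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# R^2 = 1 - SS_res / SS_tot
r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Mean residual in the lower, middle, and upper third of x (x is already sorted)
low, mid, high = (chunk.mean() for chunk in np.array_split(residuals, 3))

print(f"R^2 of the straight-line fit: {r2:.3f}")
print(f"Mean residual (low, middle, high x): {low:.2f}, {mid:.2f}, {high:.2f}")
```

In this setup the line under-predicts at both ends and over-predicts in the middle, so its errors are systematic even though R² is high.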