# Chapter 2.6: Multicollinearity

Goal: Detect multicollinearity using correlation matrices and VIF.

### Topics:
- Creating correlation matrices and heatmaps
- Identifying highly correlated feature pairs
- Calculating and interpreting VIF
- Deciding which features to remove

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

## Quick Recap

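- **Multicollinearity** = two or more features are highly correlated with each other, or one feature is close to a linear combination of the others
- Problem: coefficient estimates become unstable (small changes in the data can change them a lot) and hard to interpret
- Detection (common rules of thumb, not hard cutoffs):
  - Correlation matrix: look for pairs with |r| > 0.7
  - VIF > 5 indicates problematic multicollinearity
  - VIF > 10 is severe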
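
To make "unstable coefficients" concrete, here is a minimal sketch on synthetic data (not the housing data). It refits an ordinary least squares model on bootstrap resamples and measures how much the coefficient on one feature moves around. The helper name `coef_spread` and the noise scales are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def coef_spread(noise_scale, n_resamples=200):
    """Std. dev. of the fitted x1 coefficient across bootstrap resamples."""
    x1 = rng.normal(size=n)
    # x2 is x1 plus noise: a small noise_scale makes the two features nearly identical
    x2 = x1 + rng.normal(scale=noise_scale, size=n)
    y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])  # intercept column + two features

    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # bootstrap sample of row indices
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs.append(beta[1])  # coefficient on x1
    return float(np.std(coefs))

spread_collinear = coef_spread(noise_scale=0.05)  # corr(x1, x2) close to 1
spread_weak = coef_spread(noise_scale=5.0)        # corr(x1, x2) is weak

print(f"std of x1 coefficient, nearly collinear: {spread_collinear:.3f}")
print(f"std of x1 coefficient, weakly correlated: {spread_weak:.3f}")
```

The data-generating process is the same in both runs; only the correlation between `x1` and `x2` differs, so the difference in spread is attributable to collinearity.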

In [None]:
# Load California Housing data
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()

In [None]:
# Select all features for X except for the target variable MedHouseVal
...

## Practice

### 1. Create correlation matrix with `df[features].corr()`

In [None]:
# Calculate and show the correlation matrix between all features in X
...

### 2. Create heatmap with `sns.heatmap(corr_matrix, annot=True)`

In [None]:
# Turn this into a heatmap. After making your initial heatmap, consider ways you could improve it visually, then implement those changes.
...

### 3. List all pairs with correlation > 0.7 (or < -0.7)
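
One possible pattern for listing pairs, sketched on a small synthetic frame so the exercise below stays open: keep only the upper triangle of the correlation matrix (so each pair appears once and the diagonal is dropped), reshape it to long form, and filter on the absolute value. The frame `toy` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=n),   # strongly positively correlated with a
    "c": -a + rng.normal(scale=0.3, size=n),  # strongly negatively correlated with a
    "d": rng.normal(size=n),                  # unrelated to the others
})
corr = toy.corr()

# Boolean mask for the strict upper triangle (k=1 excludes the diagonal)
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Long format: one row per feature pair
pairs = upper.stack().dropna().reset_index()
pairs.columns = ["feature_1", "feature_2", "r"]

# Filter on |r| so strong negative correlations are caught as well
high_pairs = pairs[pairs["r"].abs() > 0.7].sort_values(
    "r", key=np.abs, ascending=False
)
print(high_pairs)
```

Filtering on `abs()` matters: a check of `r > 0.7` alone would miss the `a`/`c` pair, which is just as collinear as `a`/`b`.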

In [None]:
# Find all pairs of variables with |correlation| above 0.7 (0.7 is a rule of thumb for "high", not a special value)
...

**Your observation:** Which features are highly correlated? Does this make sense intuitively?

(Write your answer here)

### 4. Calculate VIF for each feature

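VIF (variance inflation factor) measures how much the variance of a feature's estimated coefficient is inflated by that feature's correlation with the other features. For feature $j$, $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the R² from regressing feature $j$ on all the other features. A VIF of 1 means no inflation.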
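
A minimal from-scratch sketch of that calculation, using only NumPy on synthetic data: regress each feature on the others (with an intercept) and convert the resulting R² into a VIF. The function name `vif_manual` is hypothetical, and this is meant to show the idea rather than replace the statsmodels function imported above.

```python
import numpy as np

def vif_manual(X):
    """VIF for each column of a 2-D array, via VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress feature j on an intercept plus all other features
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # near-copy of x1
x3 = rng.normal(size=n)                  # independent of x1 and x2

vifs = vif_manual(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(2))))
```

Note that VIF is computed once per feature, not once per pair: `x1` and `x2` each get a large value because each is well explained by the other, while `x3` stays close to 1.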

In [None]:
# Calculate VIF for each feature in X. Turn the output into a DataFrame.
...

**What do you learn from these VIF values?**

(Write your answer here)

### 5. Remove one problematic feature, recalculate VIF - did it improve?

In [None]:
# Step 1: Choose a feature with high VIF to remove, then remove it from X
...

# Step 2: Recalculate VIF. What changes?
...

**Your analysis:** Did removing the feature improve the VIF values? If there are still features with high VIF, what would you do next?

(Write your answer here)

### 6. See the effect on coefficients

Fit one linear regression model using all variables in X, and a second using all variables in X except the one you removed. How do the coefficients differ between the two models?
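
A side-by-side table makes this comparison easier to read. The sketch below shows one way to build it with scikit-learn on synthetic data, where the true coefficients are known (2 for `x1`, 0 for `x2`, 1 for `x3`). All names here (`toy_X`, `toy_y`, `comparison`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
toy_X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=n),  # near-duplicate of x1
    "x3": rng.normal(size=n),
})
# The target depends on x1 and x3 only
toy_y = 2.0 * toy_X["x1"] + 1.0 * toy_X["x3"] + rng.normal(size=n)

full = LinearRegression().fit(toy_X, toy_y)
reduced_X = toy_X.drop(columns=["x2"])
reduced = LinearRegression().fit(reduced_X, toy_y)

# Align coefficients by feature name; the dropped feature shows NaN under "reduced"
comparison = pd.DataFrame({
    "full": pd.Series(full.coef_, index=toy_X.columns),
    "reduced": pd.Series(reduced.coef_, index=reduced_X.columns),
})
comparison["change"] = comparison["reduced"] - comparison["full"]
print(comparison.round(3))
```

In the full model the effect of `x1` is split unpredictably between `x1` and `x2`, although their sum stays close to 2. The coefficient on the uncorrelated feature `x3` should barely move between the two fits.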

In [None]:
# Fit models with and without the problematic feature
...

# Compare coefficients from this model with coefficients from full model (all features in X)
...