# Chapter 2.6: Multicollinearity

Goal: Detect multicollinearity using correlation matrices and the variance inflation factor (VIF).

### Topics:
- Creating correlation matrices and heatmaps
- Identifying highly correlated feature pairs
- Calculating and interpreting VIF
- Deciding which features to remove

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

## Quick Recap

- **Multicollinearity** = features that are highly correlated with each other, so one feature can be predicted well from the others
- Problem: coefficients become unstable and hard to interpret
- Detection (common rules of thumb, not hard cutoffs):
  - Correlation matrix: look for |r| > 0.7
  - VIF > 5 suggests problematic multicollinearity
  - VIF > 10 is usually considered severe
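
To make "unstable coefficients" concrete, here is a minimal sketch on synthetic data (not the housing data). The helper `coef_spread` and its parameters are hypothetical names chosen for illustration. It refits an ordinary least-squares model on many fresh noisy samples and reports how much the fitted coefficient of `x1` varies from fit to fit.

```python
import numpy as np

def coef_spread(make_x2, n=200, n_trials=500, seed=0):
    """Std. dev. of the fitted x1 coefficient across repeated noisy samples."""
    rng = np.random.default_rng(seed)
    coefs = []
    for _ in range(n_trials):
        x1 = rng.normal(size=n)
        x2 = make_x2(x1, rng)
        # True coefficients are both 1; only the relationship between x1 and x2 changes
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        A = np.column_stack([np.ones(n), x1, x2])  # intercept + two features
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        coefs.append(beta[1])  # fitted coefficient of x1
    return float(np.std(coefs))

# Case 1: x2 is independent of x1
spread_independent = coef_spread(lambda x1, rng: rng.normal(size=len(x1)))
# Case 2: x2 is almost a copy of x1 (assumed noise scale of 0.1)
spread_collinear = coef_spread(lambda x1, rng: x1 + 0.1 * rng.normal(size=len(x1)))

print(f"Spread of x1 coefficient, independent features: {spread_independent:.3f}")
print(f"Spread of x1 coefficient, collinear features:   {spread_collinear:.3f}")
```

In the collinear case the model cannot tell how to split the shared signal between `x1` and `x2`, so the individual coefficients swing widely between samples even though their sum stays close to 2.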

In [None]:
# Load California Housing data
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()

In [None]:
# Select features (excluding target)
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
X = df[features]

print(f"Features: {features}")
print(f"Shape: {X.shape}")

## Practice

### 1. Create correlation matrix with `df[features].corr()`

In [None]:
# Step 1: Calculate correlation matrix
corr_matrix = X.corr()

# Display the correlation matrix
corr_matrix.round(2)

### 2. Create heatmap with `sns.heatmap(corr_matrix, annot=True)`

In [None]:
# Step 1: Create figure of appropriate size
plt.figure(figsize=(10, 8))

# Step 2: Create heatmap with annotations
# Use cmap='coolwarm' for a diverging colormap (red for positive, blue for negative)
# Set vmin=-1, vmax=1 so the colormap is centered at 0
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)

plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

### 3. List all pairs with correlation > 0.7 (or < -0.7)

In [None]:
# Find highly correlated pairs
# Step 1: Get the upper triangle of the correlation matrix (to avoid duplicates)
# Step 2: Find pairs where |correlation| > 0.7

high_corr_pairs = []

for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        corr_value = corr_matrix.iloc[i, j]
        if abs(corr_value) > 0.7:
            high_corr_pairs.append({
                'Feature 1': corr_matrix.columns[i],
                'Feature 2': corr_matrix.columns[j],
                'Correlation': corr_value
            })

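# A bare expression inside an if-block is not displayed by the notebook, so print it
if high_corr_pairs:
    print(pd.DataFrame(high_corr_pairs).round(2).to_string(index=False))
else:
    print("No pairs with |correlation| > 0.7")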
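
The nested loop above can also be written without explicit loops. The following is one possible sketch, using `np.triu_indices` to pick out the upper triangle. The function name `high_corr_pairs_vectorized` is hypothetical and the data is a small synthetic stand-in, not the housing features.

```python
import numpy as np
import pandas as pd

def high_corr_pairs_vectorized(corr, threshold=0.7):
    """Return feature pairs whose |correlation| exceeds `threshold`."""
    # Index pairs (i, j) with i < j, i.e. the upper triangle without the diagonal
    rows, cols = np.triu_indices(len(corr.columns), k=1)
    values = corr.to_numpy()[rows, cols]
    keep = np.abs(values) > threshold
    return pd.DataFrame({
        'Feature 1': corr.columns[rows[keep]],
        'Feature 2': corr.columns[cols[keep]],
        'Correlation': values[keep],
    })

# Synthetic stand-in data: b is built from a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.2 * rng.normal(size=500),
    'c': rng.normal(size=500),
})

pairs = high_corr_pairs_vectorized(demo.corr())
print(pairs)
```

With the notebook's own data, the call would be `high_corr_pairs_vectorized(corr_matrix)`; it should report the same pairs as the loop version.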

**Your observation:** Which features are highly correlated? Does this make sense intuitively?

(Write your answer here)

### 4. Calculate VIF for each feature

VIF measures how much the variance of a coefficient is inflated due to correlation with other features.
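
For feature j, the standard definition is VIF_j = 1 / (1 - R_j²), where R_j² is the R² from regressing feature j on all the other features (with an intercept). The sketch below computes that directly with NumPy on synthetic data, as an illustration of the definition rather than a replacement for the statsmodels function. The name `vif_from_definition` is hypothetical.

```python
import numpy as np

def vif_from_definition(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus every feature except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: x2 is strongly related to x1, x3 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.3 * rng.normal(size=1000)
x3 = rng.normal(size=1000)

vifs = vif_from_definition(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

A VIF close to 1 means the other features explain almost none of that feature's variation; the VIF grows without bound as R_j² approaches 1.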

In [None]:
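# Helper function to calculate VIF for all features
from statsmodels.tools.tools import add_constant

def calculate_vif(X):
    # variance_inflation_factor does not add an intercept itself. Without a
    # constant column, VIFs can be badly inflated for features whose mean is
    # far from zero (e.g. Latitude, Longitude).
    X_const = add_constant(X)
    vif_data = pd.DataFrame()
    vif_data['Feature'] = X.columns
    # Column 0 is the constant, so feature i sits at position i + 1
    vif_data['VIF'] = [variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])]
    return vif_data.sort_values('VIF', ascending=False)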

# Calculate VIF
vif_df = calculate_vif(X)
vif_df

### 5. Which features have VIF > 5?

In [None]:
# Filter to features with VIF > 5
high_vif = vif_df[vif_df['VIF'] > 5]

print("Features with VIF > 5 (potentially problematic):")
high_vif

**Your observation:** Which features have high VIF? How does this relate to the correlation matrix?

(Write your answer here)

### 6. Remove one problematic feature, recalculate VIF - did it improve?

In [None]:
# Step 1: Choose a feature to remove (pick one with high VIF)
feature_to_remove = 'AveBedrms'  # Change this based on your analysis

# Step 2: Create new feature set without that feature
remaining_features = [f for f in features if f != feature_to_remove]
X_reduced = df[remaining_features]

print(f"Removed: {feature_to_remove}")
print(f"Remaining features: {remaining_features}")

In [None]:
# Step 3: Recalculate VIF
vif_reduced = calculate_vif(X_reduced)

print("VIF after removing feature:")
vif_reduced

In [None]:
# Compare: how many features still have VIF > 5?
print(f"\nBefore removal: {len(vif_df[vif_df['VIF'] > 5])} features with VIF > 5")
print(f"After removal: {len(vif_reduced[vif_reduced['VIF'] > 5])} features with VIF > 5")

**Your analysis:** Did removing the feature improve the VIF values? If there are still features with high VIF, what would you do next?

(Write your answer here)

## Bonus: See the effect on coefficients

In [None]:
# Fit models with and without the problematic feature
y = df['MedHouseVal']

# Model with all features
model_full = LinearRegression()
model_full.fit(X, y)

# Model with reduced features
model_reduced = LinearRegression()
model_reduced.fit(X_reduced, y)

# Compare coefficients for overlapping features
print("Coefficient comparison:")
for i, feat in enumerate(remaining_features):
    full_idx = features.index(feat)
    print(f"{feat}: Full={model_full.coef_[full_idx]:.4f}, Reduced={model_reduced.coef_[i]:.4f}")

## Discussion Question

If two features are highly correlated (like AveRooms and AveBedrms), how do you decide which one to keep? What factors would you consider?

(Discuss with a neighbor)