# Chapter 2.7: Ridge and Lasso Regularization

Goal: Compare Ridge and Lasso regression, and understand how each handles coefficient shrinkage and feature selection.

### Topics:
- Scaling features before regularization
- Fitting Ridge and Lasso with different alpha values
- Understanding coefficient shrinkage (Ridge) vs elimination (Lasso)
- Using cross-validation to find optimal alpha

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, LassoCV
from sklearn.metrics import r2_score

## Quick Recap

- **Ridge (L2)**: Shrinks all coefficients toward zero, but in practice never to exactly zero
- **Lasso (L1)**: Can shrink coefficients to exactly zero → automatic feature selection
- **Alpha**: Controls regularization strength (higher = more shrinkage)
- **Important**: Always scale features before regularization! The penalty acts on coefficient magnitudes, and those depend on each feature's units.
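The difference between shrinkage and elimination in the recap above can be seen in a one-coefficient simplification. The sketch below assumes a single coefficient with OLS estimate `beta_ols` and a penalty weight `lam`; the function names and the exact form of the two objectives are illustrative assumptions, and scikit-learn scales `alpha` differently in its own objectives.

```python
def ridge_shrink(beta_ols: float, lam: float) -> float:
    """Minimizer of 0.5 * (b - beta_ols)**2 + 0.5 * lam * b**2."""
    # Ridge rescales the estimate; the result is zero only if beta_ols is zero
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols: float, lam: float) -> float:
    """Minimizer of 0.5 * (b - beta_ols)**2 + lam * abs(b) (soft-thresholding)."""
    # Lasso subtracts a fixed amount and clips anything inside [-lam, lam] to zero
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0


for beta in (3.0, 0.5, -2.0):
    print(f"OLS={beta:+.1f}  ridge={ridge_shrink(beta, 1.0):+.2f}  lasso={lasso_shrink(beta, 1.0):+.2f}")
```

In this simplified setting, Ridge divides every estimate by the same factor, while Lasso sets any estimate with magnitude at most `lam` to exactly zero.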

In [None]:
# Load Auto MPG dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']
df = pd.read_csv(url, sep=r'\s+', names=columns, na_values='?')

# Drop rows with missing values and car_name (not numeric)
df = df.dropna()
df = df.drop('car_name', axis=1)

df.head()

In [None]:
# Prepare features and target
X = df.drop('mpg', axis=1)
y = df['mpg']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Features: {list(X.columns)}")
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

## Practice

### 1. Scale features using StandardScaler (fit on train, transform both)

In [None]:
# Step 1: Create StandardScaler
scaler = StandardScaler()

# Step 2: Fit on training data ONLY, then transform both train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Note: transform only, not fit_transform!

# Verify scaling worked (mean should be ~0, std should be ~1 for train)
print(f"Train mean: {X_train_scaled.mean(axis=0).round(2)}")
print(f"Train std: {X_train_scaled.std(axis=0).round(2)}")
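One way to make the "fit on train only" rule harder to break is to wrap the scaler and the model in a pipeline. The sketch below uses synthetic stand-in data rather than the Auto MPG frame; the variable names (`X_demo`, `pipe`, and so on) and the data-generating coefficients are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 3 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.03, 50.0]) + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse those statistics on new data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(Xd_train, yd_train)

# make_pipeline names each step after its lowercased class name
fitted_scaler = pipe.named_steps["standardscaler"]
print("Scaler means match train means:",
      np.allclose(fitted_scaler.mean_, Xd_train.mean(axis=0)))
print(f"Test R²: {pipe.score(Xd_test, yd_test):.4f}")
```

A pipeline like this can also be passed to cross-validation utilities, so the scaler is refit inside each fold rather than once on all the training data.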

### 2. Fit LinearRegression, note the coefficients

In [None]:
# Step 1: Fit LinearRegression on scaled data
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Step 2: Display coefficients
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'OLS Coefficient': lr.coef_
})

print(f"OLS R² on test: {lr.score(X_test_scaled, y_test):.4f}")
coef_df

### 3. Fit Ridge(alpha=1), compare coefficients - are they smaller?

In [None]:
# Step 1: Create and fit Ridge with alpha=1
ridge = Ridge(alpha=1)
ridge.fit(X_train_scaled, y_train)

# Step 2: Add Ridge coefficients to the comparison DataFrame
coef_df['Ridge (alpha=1)'] = ridge.coef_

print(f"Ridge R² on test: {ridge.score(X_test_scaled, y_test):.4f}")
coef_df

**Your observation:** Are the Ridge coefficients smaller (closer to 0) than OLS? Which features shrank the most?

(Write your answer here)
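To see why Ridge shrinks coefficients without zeroing them, it can help to solve the Ridge problem directly. The sketch below assumes standardized synthetic features and a centered target (centering stands in for the intercept); `ridge_closed_form` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Synthetic stand-in data (assumption), standardized column by column
rng = np.random.default_rng(0)
n_samples, n_features = 100, 4
X_demo = rng.normal(size=(n_samples, n_features))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(size=n_samples)
y_centered = y_demo - y_demo.mean()


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)


norms = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w = ridge_closed_form(X_demo, y_centered, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>6}: coefficients={w.round(3)}, L2 norm={norms[-1]:.3f}")
```

The L2 norm of the coefficient vector falls as alpha grows, but each individual coefficient stays nonzero even at large alpha.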

### 4. Fit Lasso(alpha=1), which coefficients are exactly 0?

In [None]:
# Step 1: Create and fit Lasso with alpha=1
lasso = Lasso(alpha=1)
lasso.fit(X_train_scaled, y_train)

# Step 2: Add Lasso coefficients to the comparison DataFrame
coef_df['Lasso (alpha=1)'] = lasso.coef_

print(f"Lasso R² on test: {lasso.score(X_test_scaled, y_test):.4f}")
coef_df

In [None]:
# Count how many coefficients are exactly 0
zero_coefs = (coef_df['Lasso (alpha=1)'] == 0).sum()
print(f"\nLasso eliminated {zero_coefs} features (set coefficient to 0)")
print(f"Features eliminated: {list(coef_df[coef_df['Lasso (alpha=1)'] == 0]['Feature'])}")

### 5. Fit Lasso with alpha=0.01, 0.1, 0.5, 1, 2 - how do zero coefficients change?

In [None]:
# Compare Lasso with different alpha values
alphas = [0.01, 0.1, 0.5, 1.0, 2.0]

for alpha in alphas:
    lasso_temp = Lasso(alpha=alpha)
    lasso_temp.fit(X_train_scaled, y_train)
    n_zeros = (lasso_temp.coef_ == 0).sum()
    r2 = lasso_temp.score(X_test_scaled, y_test)
    print(f"Alpha={alpha}: {n_zeros} zero coefficients, R²={r2:.4f}")

**Your observation:** As alpha increases, what happens to the number of zero coefficients? What happens to R²?

(Write your answer here)
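There is a threshold beyond which Lasso zeros every coefficient. For scikit-learn's Lasso objective, `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, that threshold is the largest absolute value of `X.T @ y` divided by the number of samples, computed on centered data. The sketch below checks this on synthetic stand-in data; the data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): 5 standardized features, 2 of them irrelevant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
y_demo = X_demo @ np.array([4.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(size=200)

# Smallest alpha at which all coefficients are zero (centered X and y)
y_centered = y_demo - y_demo.mean()
alpha_max = np.abs(X_demo.T @ y_centered).max() / len(y_demo)

fractions = [0.01, 0.1, 0.5, 1.01]
zero_counts = []
for frac in fractions:
    model = Lasso(alpha=frac * alpha_max).fit(X_demo, y_demo)
    zero_counts.append(int((model.coef_ == 0).sum()))
    print(f"alpha = {frac:.2f} * alpha_max: {zero_counts[-1]} zero coefficients")
```

Expressing alpha as a fraction of `alpha_max` gives a scale-aware way to choose candidate values, since anything above the threshold produces the same all-zero model.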

### 6. Use `RidgeCV` to find optimal alpha

In [None]:
# Step 1: Create RidgeCV with a range of alphas
alphas_to_try = [0.001, 0.01, 0.1, 1, 10, 100]
ridge_cv = RidgeCV(alphas=alphas_to_try, cv=5)

# Step 2: Fit on training data
ridge_cv.fit(X_train_scaled, y_train)

# Step 3: Get the best alpha
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
print(f"Ridge CV test R²: {ridge_cv.score(X_test_scaled, y_test):.4f}")
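A short hand-picked list of alphas can miss the best value or place it on the edge of the range. A minimal sketch of a denser, log-spaced grid with an edge check follows; the grid bounds, the synthetic data, and the names `alpha_grid` and `ridge_demo` are assumptions, not values tuned for the Auto MPG data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 6))
true_coefs = np.array([3.0, -1.5, 0.0, 0.0, 2.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=2.0, size=150)

# 30 log-spaced candidates between 1e-3 and 1e3
alpha_grid = np.logspace(-3, 3, 30)
ridge_demo = RidgeCV(alphas=alpha_grid, cv=5).fit(X_demo, y_demo)

best = ridge_demo.alpha_
on_edge = np.isclose(best, alpha_grid[0]) or np.isclose(best, alpha_grid[-1])
print(f"Best alpha: {best:.4g}")
if on_edge:
    print("Best alpha sits on the grid boundary; consider widening the grid.")
```

If the selected alpha lands on either end of the grid, the true optimum may lie outside the range that was searched.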

### 7. Use `LassoCV` to find optimal alpha

In [None]:
# Step 1: Create LassoCV (it automatically chooses alphas)
lasso_cv = LassoCV(cv=5, random_state=42)

# Step 2: Fit on training data
lasso_cv.fit(X_train_scaled, y_train)

# Step 3: Get the best alpha and see which features remain
print(f"Best Lasso alpha: {lasso_cv.alpha_:.4f}")
print(f"Lasso CV test R²: {lasso_cv.score(X_test_scaled, y_test):.4f}")
print("\nFeatures with non-zero coefficients:")
for feat, coef in zip(X.columns, lasso_cv.coef_):
    if coef != 0:
        print(f"  {feat}: {coef:.4f}")
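After fitting, `LassoCV` exposes the grid it generated and the per-fold errors, which helps when checking how the selected alpha was chosen. The sketch below inspects the `alphas_` and `mse_path_` attributes on synthetic stand-in data; the data and the name `lasso_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (assumption): 2 of the 5 features are irrelevant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(size=200)

lasso_demo = LassoCV(cv=5, random_state=42).fit(X_demo, y_demo)

# mse_path_ has one row per candidate alpha and one column per CV fold
mean_mse = lasso_demo.mse_path_.mean(axis=1)
best_index = int(np.argmin(mean_mse))

print(f"Grid size: {len(lasso_demo.alphas_)}")
print(f"Grid range: {lasso_demo.alphas_.min():.4f} to {lasso_demo.alphas_.max():.4f}")
print(f"Alpha with lowest mean CV error: {lasso_demo.alphas_[best_index]:.4f}")
print(f"Selected alpha_: {lasso_demo.alpha_:.4f}")
```

Plotting `mean_mse` against `alphas_` on a log axis is a common follow-up for seeing how flat or sharp the error curve is around the chosen value.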

### 8. Which regularization method would you choose for this data and why?

In [None]:
# Final comparison
print("Model Comparison:")
print(f"  OLS:      R² = {lr.score(X_test_scaled, y_test):.4f}")
print(f"  Ridge CV: R² = {ridge_cv.score(X_test_scaled, y_test):.4f} (alpha={ridge_cv.alpha_})")
print(f"  Lasso CV: R² = {lasso_cv.score(X_test_scaled, y_test):.4f} (alpha={lasso_cv.alpha_:.4f})")

**Your recommendation:** Based on the results, which method would you choose? Consider:
- Performance (R² scores)
- Interpretability (how many features?)
- Simplicity

(Write your recommendation and reasoning here)

## Discussion Question

When would you prefer Lasso over Ridge? When would you prefer Ridge over Lasso?

(Discuss with a neighbor)