
# DX799 — Week 2 Jupyter Notebook: Linear Regression 2  
**Topic:** Lasso, Ridge, Elastic Net on your Capstone dataset  
**Author:** <Your Name>  
**Date:** <Auto/Today>  

> Use this notebook to run regularized linear models and capture results for **Milestone One**. Replace placeholders with your dataset and context.



## 📋 Milestone One Alignment (Quick Map)
- **Breadth (Weeks 1–6):** This notebook contributes **Week 2** coverage: lasso, ridge, elastic net.  
- **Depth (choose 1–2 weeks):** Use **Week 2** as one deep-dive.  
- **Overfitting prevention:** Cross-validation, regularization, learning curves, and holdout set.  
- **Metrics & tuning:** RMSE/MAE/R²; GridSearchCV across alphas; ElasticNet `l1_ratio`.  
- **Expected vs unexpected:** Capture observations in the Summary section.  
- **EDA support:** Basic EDA, correlation heatmap, VIF for multicollinearity.  
- **External sources:** Cite at least one high-quality source in Yellowdig.


In [None]:

# --- Setup ---
import numpy as np
import pandas as pd

from pathlib import Path
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

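# Silence library warnings for cleaner output.
# Caution: this also hides ConvergenceWarning from Lasso/ElasticNet; comment it out while debugging.
import warnings
warnings.filterwarnings('ignore')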
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')



## 1) Load Data  
Replace the path below and identify your **target** and **feature** columns.  
If you already engineered features in earlier weeks, import that cleaned dataset here.


In [None]:

# --- Load your dataset ---
# Example: df = pd.read_csv('data/your_clean_dataset.csv')
df = pd.DataFrame()  # placeholder; replace with your actual load

# Quick sanity check
display(df.head())
display(df.describe(include='all').T.head(20))
print("Shape:", df.shape)



### Select Target and Features
Set your `TARGET` and pick feature columns. Also drop any leakage columns, meaning features derived from the target or only known after it is observed.


In [None]:

# --- Configure columns ---
TARGET = 'your_target'  # <-- replace
feature_cols = [c for c in df.columns if c != TARGET]

X = df[feature_cols].copy()
y = df[TARGET].copy()
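
A minimal sketch of dropping leakage columns before building `X`. The toy frame and every column name in it (`listing_id`, `sqft`, `price_per_sqft`, `sale_price`) are hypothetical stand-ins, not part of the course dataset; substitute your own.

```python
import pandas as pd

# Toy frame standing in for your capstone data (all names are hypothetical).
toy = pd.DataFrame({
    "listing_id": [101, 102, 103, 104],               # pure identifier
    "sqft": [850, 1200, 1500, 2000],
    "price_per_sqft": [200.0, 210.0, 190.0, 205.0],   # derived from the target
    "sale_price": [170000, 252000, 285000, 410000],
})

TOY_TARGET = "sale_price"
DROP_COLS = ["price_per_sqft", "listing_id"]  # leakage and identifier columns

# errors="ignore" keeps the cell from failing if a listed column is absent.
toy_features = toy.drop(columns=[TOY_TARGET] + DROP_COLS, errors="ignore")
toy_target = toy[TOY_TARGET]

print(toy_features.columns.tolist())
```

Keeping the drop list in one named variable makes it easy to report which columns were excluded and why in Milestone One.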



## 2) Split Data
Use a holdout set for honest evaluation after model selection.


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, X_test.shape



## 3) Preprocessing  
- Numeric: Standardize (L1/L2 penalties are scale-sensitive, so features should share a common scale).  
- Categorical: One-hot encode.  
- (Optional) Polynomial/interaction terms.


In [None]:

# --- Identify column types ---
num_cols = X_train.select_dtypes(include=np.number).columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ],
    remainder='drop'
)

# Optional: add polynomial features for numeric only (comment in if useful)
add_poly = False  # set True if you want polynomial terms
degree = 2

if add_poly:
    preprocess = ColumnTransformer(
        transformers=[
            ('num', Pipeline([
                ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
                ('scaler', StandardScaler())
            ]), num_cols),
            ('cat', categorical_transformer, cat_cols)
        ],
        remainder='drop'
    )



## 4) Baseline: Ordinary Least Squares (no penalty)  
Use as a baseline to compare against regularized models.


In [None]:

ols = Pipeline(steps=[
    ('prep', preprocess),
    ('model', LinearRegression())
])

ols.fit(X_train, y_train)
y_pred_ols = ols.predict(X_test)

def regression_report(y_true, y_pred, label="Model"):
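    # Take the square root explicitly: the `squared=False` argument was removed in newer scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))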
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return pd.Series({'RMSE': rmse, 'MAE': mae, 'R2': r2}, name=label)

report_ols = regression_report(y_test, y_pred_ols, 'OLS')
report_ols



## 5) Ridge Regression (L2)  
Grid-search over `alpha` with CV. Ridge shrinks all coefficients toward zero but rarely sets any of them exactly to zero.


In [None]:

ridge = Pipeline(steps=[
    ('prep', preprocess),
    ('model', Ridge())
])

ridge_param_grid = {
    'model__alpha': np.logspace(-3, 3, 13)  # 0.001 to 1000
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
ridge_gs = GridSearchCV(ridge, ridge_param_grid, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
ridge_gs.fit(X_train, y_train)

y_pred_ridge = ridge_gs.predict(X_test)
report_ridge = regression_report(y_test, y_pred_ridge, 'Ridge')
pd.DataFrame({
    'best_params': [ridge_gs.best_params_],
    'cv_score_rmse_neg': [ridge_gs.best_score_]
}), report_ridge



## 6) Lasso Regression (L1)  
Grid-search over `alpha` with CV. Lasso performs feature selection by driving some coefficients to zero.


In [None]:

lasso = Pipeline(steps=[
    ('prep', preprocess),
    ('model', Lasso(max_iter=10000))
])

lasso_param_grid = {
    'model__alpha': np.logspace(-3, 1, 9)  # 0.001 to 10
}

lasso_gs = GridSearchCV(lasso, lasso_param_grid, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
lasso_gs.fit(X_train, y_train)

y_pred_lasso = lasso_gs.predict(X_test)
report_lasso = regression_report(y_test, y_pred_lasso, 'Lasso')
pd.DataFrame({
    'best_params': [lasso_gs.best_params_],
    'cv_score_rmse_neg': [lasso_gs.best_score_]
}), report_lasso
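
To make the Ridge versus Lasso contrast concrete, here is a small sketch on synthetic data, not your capstone data. The `alpha=1.0` values and the `make_regression` settings are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0, max_iter=10000).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print(f"Exactly-zero coefficients: Ridge={n_zero_ridge}, Lasso={n_zero_lasso}")
```

The same count on `lasso_gs.best_estimator_.named_steps['model'].coef_` tells you how many features your tuned Lasso kept.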



## 7) Elastic Net (blend of L1 & L2)  
Tune both `alpha` (regularization strength) and `l1_ratio` (mix of L1 vs L2).
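
For reference, the objective that scikit-learn documents for `ElasticNet` shows how the two parameters interact. Here $n$ is the number of training rows, $\alpha$ is `alpha`, and $\rho$ is `l1_ratio`; the symbol names are shorthand chosen for this note.

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 term (Lasso), and $\rho = 0$ leaves only the L2 term. As a worked case, `alpha=0.1` with `l1_ratio=0.5` puts a weight of $0.05$ on $\lVert w\rVert_1$ and $0.025$ on $\lVert w\rVert_2^2$.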


In [None]:

enet = Pipeline(steps=[
    ('prep', preprocess),
    ('model', ElasticNet(max_iter=10000))
])

enet_param_grid = {
    'model__alpha': np.logspace(-3, 1, 9),
    'model__l1_ratio': np.linspace(0.1, 0.9, 9)
}

enet_gs = GridSearchCV(enet, enet_param_grid, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
enet_gs.fit(X_train, y_train)

y_pred_enet = enet_gs.predict(X_test)
report_enet = regression_report(y_test, y_pred_enet, 'ElasticNet')
pd.DataFrame({
    'best_params': [enet_gs.best_params_],
    'cv_score_rmse_neg': [enet_gs.best_score_]
}), report_enet



## 8) Compare Models  
Summarize holdout metrics and pick a winner. Discuss trade-offs.


In [None]:

comparison = pd.concat([report_ols, report_ridge, report_lasso, report_enet], axis=1).T.sort_values('RMSE')
comparison



## 9) Coefficient Inspection  
Inspect learned coefficients to interpret model behavior.  
> Note: After preprocessing, use the fitted pipeline to extract feature names.


In [None]:

def get_feature_names(preprocessor):
    names = []
    if hasattr(preprocessor, 'transformers_'):
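        for name, trans, cols in preprocessor.transformers_:
            # Skip empty column groups (e.g., no categorical columns): their transformers stay unfitted
            if len(cols) == 0:
                continue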
            if name == 'num':
                # PolynomialFeatures may change feature set
                if hasattr(trans, 'named_steps') and 'poly' in trans.named_steps:
                    poly = trans.named_steps['poly']
                    base_names = cols
                    names += poly.get_feature_names_out(base_names).tolist()
                else:
                    names += list(cols)
            elif name == 'cat':
                ohe = trans.named_steps['onehot']
                ohe_names = ohe.get_feature_names_out(cols).tolist()
                names += ohe_names
    return names

# Example: Ridge coefficients
best_ridge = ridge_gs.best_estimator_
feat_names = get_feature_names(best_ridge.named_steps['prep'])

coefs = best_ridge.named_steps['model'].coef_
coef_df = pd.DataFrame({'feature': feat_names, 'coef': coefs}).sort_values('coef', key=lambda s: s.abs(), ascending=False)
coef_df.head(20)



## 10) Overfitting Diagnostics  
- Cross-validation scores distribution.  
- Learning curve: how training and validation error change as the training set grows (optional).  


In [None]:

def cv_rmse(model, X, y, cv):
    scores = -cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=cv, n_jobs=-1)
    return scores

scores_ridge = cv_rmse(ridge_gs.best_estimator_, X_train, y_train, cv)
scores_lasso = cv_rmse(lasso_gs.best_estimator_, X_train, y_train, cv)
scores_enet  = cv_rmse(enet_gs.best_estimator_,  X_train, y_train, cv)

print("CV RMSE (mean ± std)")
print(f"Ridge: {scores_ridge.mean():.4f} ± {scores_ridge.std():.4f}")
print(f"Lasso: {scores_lasso.mean():.4f} ± {scores_lasso.std():.4f}")
print(f"ENet : {scores_enet.mean():.4f} ± {scores_enet.std():.4f}")
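
The optional learning-curve check can be sketched with scikit-learn's `learning_curve`. This example runs on synthetic data with an arbitrary `Ridge(alpha=1.0)`; to use it on your own data, pass a tuned pipeline such as `ridge_gs.best_estimator_` with `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, learning_curve

# Synthetic stand-in for a training set.
X_lc, y_lc = make_regression(n_samples=300, n_features=30, noise=20.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X_lc, y_lc,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE; flip the sign and average across folds.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_rmse, val_rmse):
    print(f"n_train={int(n):4d}  train RMSE={tr:8.3f}  validation RMSE={va:8.3f}")
```

A large, persistent gap between training and validation RMSE suggests overfitting; two curves that converge at a high error suggest underfitting.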



### CV Score Distributions


In [None]:

plt.figure()
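plt.boxplot([scores_ridge, scores_lasso, scores_enet])
# Set tick labels separately: the boxplot `labels` argument was renamed in newer Matplotlib releases.
plt.xticks([1, 2, 3], ['Ridge', 'Lasso', 'ElasticNet'])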
plt.title('CV RMSE Distribution')
plt.ylabel('RMSE')
plt.show()



## 11) Multicollinearity Snapshot (Numeric Only)  
- Correlation matrix and a quick VIF calculation to gauge redundancy among numeric features.


In [None]:

# Compute correlation on numeric columns
if len(num_cols) > 1:
    corr = X_train[num_cols].corr()
    display(corr)

# Quick VIF (numeric only). Requires no missing values.
def compute_vif(df_num):
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    import statsmodels.api as sm
    # drop NA and constant-only cols
    z = df_num.dropna().copy()
    z = z.loc[:, z.std() > 0]
    Xc = sm.add_constant(z)
    vifs = []
    for i in range(1, Xc.shape[1]):  # skip constant
        vifs.append(variance_inflation_factor(Xc.values, i))
    return pd.DataFrame({'feature': z.columns, 'VIF': vifs})

try:
    if len(num_cols) > 1:
        vif_df = compute_vif(X_train[num_cols])
        display(vif_df.sort_values('VIF', ascending=False).head(15))
except Exception as e:
    print("VIF computation skipped:", e)
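
A correlation matrix is easier to scan as a heatmap. The following is a minimal Matplotlib-only sketch on a toy frame with hypothetical columns (`x1`, `x2`, `x3`); for your data, replace `toy_num` with `X_train[num_cols]`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy numeric frame (hypothetical columns); x2 is built to correlate with x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
toy_num = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.9 + rng.normal(scale=0.3, size=100),
    "x3": rng.normal(size=100),
})
toy_corr = toy_num.corr()

fig, ax = plt.subplots(figsize=(4, 3.5))
# Fix the color scale to [-1, 1] so colors are comparable across datasets.
im = ax.imshow(toy_corr.values, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(toy_corr.columns)))
ax.set_xticklabels(toy_corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(toy_corr.index)))
ax.set_yticklabels(toy_corr.index)
fig.colorbar(im, ax=ax, label="Pearson r")
ax.set_title("Correlation heatmap")
fig.tight_layout()
plt.show()
```

Strongly colored off-diagonal cells flag feature pairs worth checking against the VIF table.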



## 12) Summary & Discussion (Copy to Milestone One)
- **Best model** and why (metrics + interpretability).  
- **Overfitting controls** used and their effect.  
- **Expected vs unexpected** patterns (e.g., selected features, sign of coefficients, stability).  
- **Role of EDA** in feature choices and model interpretation.  
- **Next steps** for Weeks 3–6.



## 13) Yellowdig Helper (External Source to Share)
- Paste a high-quality source link you used to clarify a concept (e.g., scikit-learn docs on Lasso/Ridge/ElasticNet).  
- Write 3–5 lines on *why* it is high quality (clarity, rigor, examples, reproducibility).  
- Reply to 2 peers with one insight each.
