# Elastic Net Evaluation

This notebook trains one **ElasticNet regression model per drug** on PCA features derived from gene expression.  
We use `ElasticNetCV` to tune the hyperparameters (`alpha` and `l1_ratio`) with 10-fold cross-validation.

Elastic Net combines two penalties:
- **L1 regularization** (Lasso) → promotes sparsity
- **L2 regularization** (Ridge) → stabilizes coefficients

This lets it shrink coefficients and select features at the same time, which makes it a simple but strong linear baseline.
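
For reference, the objective that scikit-learn's `ElasticNet` minimizes can be written as below. Here $n$ is the number of samples, $\alpha$ is the overall penalty strength (`alpha`), and $\rho$ is the L1/L2 mixing weight (`l1_ratio`). The symbols $w$ (coefficient vector) and $\rho$ are notation introduced for this note, and the intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty reduces to the Lasso penalty, and with $\rho = 0$ only the Ridge-style squared penalty remains.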

We evaluate performance using:
- **RMSE (Root Mean Squared Error)**
- **R² (Coefficient of Determination)**

and visualize their **distribution across all drugs**.
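
To make the two metrics concrete, here is a minimal sketch that computes them directly from their definitions with NumPy. The helper names `rmse` and `r2` and the toy arrays are illustrative assumptions, not part of the notebook. For single-output regression they should agree with scikit-learn's `root_mean_squared_error` and `r2_score`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Toy values (hypothetical, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

toy_rmse = rmse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
print(f"RMSE = {toy_rmse:.4f}, R² = {toy_r2:.4f}")
```

RMSE is in the same units as the target (here `LN_IC50`), while R² is unitless and can be negative when a model does worse than predicting the mean.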


In [1]:
import os

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score, root_mean_squared_error  # root_mean_squared_error needs scikit-learn >= 1.4
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler


In [2]:
# Parameters
NUM_PCS = 30
DATA_PATH = "../../../data/processed/gdsc_pancancer_pseudobulk_30_pcs_tissue_growth.parquet"
data = pd.read_parquet(DATA_PATH)
print("✅ Dataset loaded:", data.shape)

SAVE_DIR = "results/elastic_net_selected"
os.makedirs(SAVE_DIR, exist_ok=True)

TARGET_DRUGS = [179, 1089, 2156]   # 5-FU, Oxaliplatin, 5-Azacytidine

✅ Dataset loaded: (83624, 45)


## 1. Compute Metrics per Drug

For each drug in `TARGET_DRUGS`:
- Select only rows where that drug was tested and `LN_IC50` is not missing
- Standardize the feature columns with `StandardScaler`
- Fit an **Elastic Net** regressor, choosing `alpha` and `l1_ratio` by 10-fold cross-validation
- Save the fitted scaler and model so they can be scored later with **RMSE** and **R²**

We skip drugs with **fewer than 50 samples** to ensure meaningful evaluation.
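
A held-out evaluation for a single drug can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own code: the arrays `X` and `y`, the 80/20 split, and the inner `cv=5` are all assumptions chosen for the example. Wrapping the scaler and the model in one pipeline means the scaler is fit on the training split only, so the test split never influences preprocessing.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one drug's rows (assumption: not the GDSC table).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
true_coef = np.zeros(30)
true_coef[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]   # only five informative features
y = X @ true_coef + rng.normal(scale=0.5, size=120)

# Hold out 20% of the samples for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler + ElasticNetCV in one estimator; the grid mirrors the notebook's.
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(
        alphas=[0.01, 0.1, 1.0, 10.0],
        l1_ratio=[0.1, 0.5, 0.9],
        cv=5,
        max_iter=5000,
    ),
)
pipe.fit(X_train, y_train)

# Score on the held-out samples only.
y_pred = pipe.predict(X_test)
test_rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
test_r2 = float(r2_score(y_test, y_pred))
print(f"Test RMSE = {test_rmse:.3f}, Test R² = {test_r2:.3f}")
```

With roughly 100 to 140 samples per drug, a single split gives a noisy estimate; repeating the split or using out-of-fold predictions from `cross_val_predict` would be a reasonable alternative.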


In [3]:
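# Identifier and target columns; every other column is used as a model input.
non_features = ["SANGER_MODEL_ID", "DRUG_ID", "LN_IC50"]
# Despite the name, gene_cols holds all input columns of the processed table (not necessarily raw gene columns).
gene_cols = [c for c in data.columns if c not in non_features]
# 10-fold splitter that ElasticNetCV uses to select alpha and l1_ratio.
kf = KFold(n_splits=10, shuffle=True, random_state=42)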

for drug_id in TARGET_DRUGS:
    df = data[data["DRUG_ID"] == drug_id].dropna(subset=["LN_IC50"])
    if df.shape[0] < 50:
        print(f"⚠️ Skipping drug {drug_id} (only {df.shape[0]} samples)")
        continue

    X = df[gene_cols].values.astype("float64")
    y = df["LN_IC50"].values.astype("float64")

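    # Standardize features so the penalty treats them on a common scale.
    # Note: the scaler is fit on all samples before ElasticNetCV's internal folds are formed.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)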

    model = ElasticNetCV(
        alphas=[0.01, 0.1, 1.0, 10.0],
        l1_ratio=[0.1, 0.5, 0.9],
        cv=kf,
        max_iter=5000,
        random_state=42,
        n_jobs=-1
    )

    model.fit(X_scaled, y)
    print(f"🎯 Trained ElasticNet for drug {drug_id} | Samples: {len(df)}")

    # save model + scaler
    fname = f"EN_drug{drug_id}.joblib"
    joblib.dump((scaler, model), os.path.join(SAVE_DIR, fname))
    print(f"💾 Saved model → {SAVE_DIR}/{fname}")

🎯 Trained ElasticNet for drug 179 | Samples: 135
💾 Saved model → results/elastic_net_selected/EN_drug179.joblib
🎯 Trained ElasticNet for drug 1089 | Samples: 138
💾 Saved model → results/elastic_net_selected/EN_drug1089.joblib
🎯 Trained ElasticNet for drug 2156 | Samples: 112
💾 Saved model → results/elastic_net_selected/EN_drug2156.joblib
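
Each saved file holds a `(scaler, model)` tuple, so both objects are needed to predict on new data. Below is a minimal sketch of reloading such a file. The helper `predict_ln_ic50`, the file name `EN_demo.joblib`, and the synthetic data are hypothetical and only serve to keep the example self-contained. New inputs are assumed to have the same columns, in the same order, as `gene_cols`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler


def predict_ln_ic50(model_path, X_new):
    """Load a saved (scaler, model) tuple and predict from raw, unscaled features."""
    scaler, model = joblib.load(model_path)
    return model.predict(scaler.transform(X_new))


# Throwaway model on synthetic data, saved in the same tuple layout (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

scaler = StandardScaler().fit(X)
model = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "EN_demo.joblib")
    joblib.dump((scaler, model), path)
    # Predict for the first three rows using only the file on disk.
    preds = predict_ln_ic50(path, X[:3])

print(preds)
```

Applying the stored scaler before `model.predict` matters: the coefficients were learned on standardized inputs, so passing raw features directly would give wrong predictions.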
