# Week 2 — Linear Regression with Regularization (CKD)
**Target:** hemoglobin (column `hemo`)

Run the notebook top to bottom.

## 0) Setup

## Dataset Choice

For this project, I considered three related datasets: Acute Kidney Injury, Diabetic Nephropathy, and Chronic Kidney Disease (CKD).  
I selected the CKD dataset (`ckd_dataset_v2.csv`) as my primary focus because:

1. **Continuity**: I began analyzing this dataset in Module B, so continuing ensures consistency and builds on prior cleaning and EDA.  
2. **Alignment with project goals**: My Capstone question is about predicting kidney disease progression using eGFR, which is directly available in the CKD dataset.  
3. **Analytical richness**: The dataset has a wide mix of numeric and categorical predictors, with noticeable multicollinearity. This makes it ideal for Week 2 methods:  
   - **Lasso** → feature selection  
   - **Ridge** → stability under correlation  
   - **Elastic Net** → balance between the two  

The AKI and Diabetic Nephropathy datasets remain supplementary, but the CKD dataset offers the clearest path to modeling progression.
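To make the Lasso/Ridge distinction above concrete, here is a minimal sketch on synthetic data (not the CKD dataset). The variable names, the penalty strengths, and the data-generating process are all assumptions chosen for illustration: two nearly collinear predictors plus three pure-noise predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative assumption, not the CKD dataset)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # nearly collinear with x1
noise_feats = rng.normal(size=(n, 3))    # unrelated to the target
X_demo = np.column_stack([x1, x2, noise_feats])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.2).fit(X_demo, y_demo)

# Lasso sets some coefficients exactly to zero; Ridge only shrinks them
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print('Ridge coefficients:', np.round(ridge_demo.coef_, 3))
print('Lasso coefficients:', np.round(lasso_demo.coef_, 3))
print('Exact zeros -> Lasso:', n_zero_lasso, '| Ridge:', n_zero_ridge)
```

Ridge tends to split weight across the two correlated columns, while Lasso typically keeps one and zeroes the rest; Elastic Net sits between these behaviors depending on `l1_ratio`.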


In [14]:
import re, numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

## 1) Load data and normalize column names

In [15]:
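# If running in Codespaces, put this notebook next to ckd_dataset_v2.csv.
# Note: the first two data rows appear to be metadata (column types and roles),
# not patient records; they drop out in step 2 when the target is coerced to numeric.
df = pd.read_csv('ckd_dataset_v2.csv')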
df.columns = (df.columns.str.strip()
              .str.replace(r'\s+','_',regex=True)
              .str.replace(r'[^0-9a-zA-Z_]','',regex=True)
              .str.lower())
print('Columns:', df.columns.tolist()[:40])
df.head(3)

Columns: ['bp_diastolic', 'bp_limit', 'sg', 'al', 'class', 'rbc', 'su', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sod', 'sc', 'pot', 'hemo', 'pcv', 'rbcc', 'wbcc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'grf', 'stage', 'affected', 'age']


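|  | bp_diastolic | bp_limit | sg | al | class | rbc | su | pc | pcc | ba | ... | htn | dm | cad | appet | pe | ane | grf | stage | affected | age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | discrete | discrete | discrete | discrete | discrete | discrete | discrete | discrete | discrete | discrete | ... | discrete | discrete | discrete | discrete | discrete | discrete | discrete | discrete | discrete | discrete |
| 1 |  |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  | class | meta |
| 2 | 0 | 0 | 1.019 - 1.021 | 1 - 1 | ckd | 0 | < 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | ≥ 227.944 | s1 | 1 | < 12 |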
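As a small illustration of what the column-name normalization in this step does, the sketch below applies the same chain of string operations to a few made-up headers. The raw header strings are hypothetical and are not taken from `ckd_dataset_v2.csv`.

```python
import pandas as pd

# Hypothetical raw headers, chosen only to exercise each cleaning step
raw_cols = pd.Index([' BP (Diastolic) ', 'Blood  Glucose-Random', 'sg'])

normalized = (raw_cols.str.strip()                            # trim outer whitespace
              .str.replace(r'\s+', '_', regex=True)           # whitespace runs -> underscore
              .str.replace(r'[^0-9a-zA-Z_]', '', regex=True)  # drop punctuation
              .str.lower()
              .tolist())
print(normalized)
```

Note that punctuation is removed rather than replaced, so a hyphenated header collapses into a single token.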


## 2) Target = hemoglobin → numeric + align X/y

In [16]:
TARGET = 'hemo'
assert TARGET in df.columns, f'Missing target {TARGET}'
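# Entries are stored as strings; the regex keeps the first number found in each entry,
# and rows with no number (e.g. the metadata rows) become NaN and are dropped below.
y_raw = df[TARGET].astype(str).str.extract(r'([-+]?\d*\.?\d+)')[0]
y = pd.to_numeric(y_raw, errors='coerce')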
mask = y.notna()
y = y.loc[mask].astype(float)
X = df.drop(columns=[TARGET]).loc[mask].copy()
X = X.replace([np.inf,-np.inf], np.nan)
low_missing_cols = [c for c in X.columns if X[c].isna().mean() < 0.95]
X = X[low_missing_cols].copy()
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
print('Shapes:', X.shape, y.shape)

Shapes: (200, 28) (200,)
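To show how the extraction regex behaves on the kinds of entries visible in `df.head(3)`, here is a short sketch. The sample strings are copied from other columns in the preview above; the actual `hemo` values are not shown in this notebook, so treating them as similarly formatted is an assumption.

```python
import pandas as pd

# Entry formats seen in the df.head(3) preview, plus a missing value
samples = pd.Series(['1.019 - 1.021', '< 12', '≥ 227.944', 'discrete', None])

extracted = samples.astype(str).str.extract(r'([-+]?\d*\.?\d+)')[0]
numeric = pd.to_numeric(extracted, errors='coerce')
print(pd.DataFrame({'raw': samples, 'numeric': numeric}))
```

For a range, only the first number (the lower bound) is kept, and the `<` / `≥` qualifiers are discarded, so the resulting target is better read as a bin edge than as an exact measurement.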


## 3) Preprocess pipeline (impute + encode)

In [17]:
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
                                   ('scaler', StandardScaler())])
categorical_transformer = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                                    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preprocess = ColumnTransformer([('num', numeric_transformer, selector(dtype_include=np.number)),
                                ('cat', categorical_transformer, selector(dtype_exclude=np.number))])
def regression_report(y_true, y_pred, label='model'):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # avoids `squared=False`, which newer scikit-learn versions no longer accept
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return pd.Series({'RMSE': rmse, 'MAE': mae, 'R2': r2}, name=label)

## 4) Train/Test split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((160, 28), (40, 28))

## 5) Baseline OLS

In [19]:
from sklearn.metrics import mean_squared_error
import numpy as np

def regression_report(y_true, y_pred, label="model"):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # instead of squared=False
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return pd.Series({"RMSE": rmse, "MAE": mae, "R2": r2}, name=label)


In [20]:
ols = Pipeline([('prep', preprocess), ('model', LinearRegression())])
ols.fit(X_train, y_train)
y_pred_ols = ols.predict(X_test)
report_ols = regression_report(y_test, y_pred_ols, 'OLS')
report_ols

RMSE    3.0494
MAE     2.0298
R2     -0.0896
Name: OLS, dtype: float64
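
A negative R² like the one above means the model's holdout predictions have a larger squared error than simply predicting the mean of the test targets. The sketch below illustrates this with invented numbers (not values from the CKD data), using $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented values for illustration only
y_true_demo = np.array([10.0, 12.0, 14.0, 16.0])
mean_pred = np.full_like(y_true_demo, y_true_demo.mean())  # always predict the mean
bad_pred = np.array([16.0, 14.0, 12.0, 10.0])              # reversed ordering

r2_mean = r2_score(y_true_demo, mean_pred)  # SS_res equals SS_tot
r2_bad = r2_score(y_true_demo, bad_pred)    # SS_res = 80, SS_tot = 20
print('mean predictor R2:', r2_mean)
print('bad predictor R2:', r2_bad)
```

One plausible reading of the OLS result is that one-hot encoding many categorical columns with only 160 training rows leaves the unregularized fit poorly conditioned, which is the situation regularization is meant to address.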

## 6) Ridge / 7) Lasso / 8) Elastic Net

In [21]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)
ridge = Pipeline([('prep', preprocess), ('model', Ridge())])
ridge_cv = GridSearchCV(ridge, {'model__alpha':[0.001,0.01,0.1,1.0,10.0]}, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
ridge_cv.fit(X_train, y_train)
y_pred_ridge = ridge_cv.predict(X_test)
report_ridge = regression_report(y_test, y_pred_ridge, 'Ridge')

lasso = Pipeline([('prep', preprocess), ('model', Lasso(max_iter=10000))])
lasso_cv = GridSearchCV(lasso, {'model__alpha':[0.001,0.01,0.1,1.0]}, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
lasso_cv.fit(X_train, y_train)
y_pred_lasso = lasso_cv.predict(X_test)
report_lasso = regression_report(y_test, y_pred_lasso, 'Lasso')

enet = Pipeline([('prep', preprocess), ('model', ElasticNet(max_iter=10000))])
enet_cv = GridSearchCV(enet, {'model__alpha':[0.001,0.01,0.1,1.0], 'model__l1_ratio':[0.2,0.5,0.8]}, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=-1)
enet_cv.fit(X_train, y_train)
y_pred_enet = enet_cv.predict(X_test)
report_enet = regression_report(y_test, y_pred_enet, 'ElasticNet')

pd.concat([report_ols, report_ridge, report_lasso, report_enet], axis=1).T.sort_values('RMSE')

| Model | RMSE | MAE | R2 |
| --- | --- | --- | --- |
| Ridge | 1.3973 | 1.1577 | 0.7712 |
| ElasticNet | 1.4851 | 1.2222 | 0.7416 |
| Lasso | 1.6554 | 1.3509 | 0.6789 |
| OLS | 3.0494 | 2.0298 | -0.0896 |
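
The table reports holdout metrics but not which hyperparameters the grid searches selected. As a hedged sketch of how that could be inspected, the example below runs the same kind of search on synthetic data from `make_regression`; the data, the `scale` step name, and the variable names are assumptions and do not reproduce the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X_syn, y_syn = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
pipe = Pipeline([('scale', StandardScaler()), ('model', Ridge())])
search = GridSearchCV(pipe, {'model__alpha': alphas},
                      cv=KFold(n_splits=5, shuffle=True, random_state=42),
                      scoring='neg_root_mean_squared_error')
search.fit(X_syn, y_syn)

best_alpha = search.best_params_['model__alpha']
cv_rmse = -search.best_score_  # scorer is negated RMSE, so flip the sign
cv_table = pd.DataFrame(search.cv_results_)[
    ['param_model__alpha', 'mean_test_score', 'std_test_score']]
print('best alpha:', best_alpha, '| CV RMSE:', round(cv_rmse, 4))
print(cv_table)
```

In the notebook itself, the analogous attributes would be `ridge_cv.best_params_`, `lasso_cv.best_params_`, and `enet_cv.best_params_`. If a selected `alpha` sits at the edge of its grid, widening the grid is usually worth trying.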


## 9) Reflection

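- Hemoglobin (`hemo`) was converted to a numeric target by extracting the first number from each entry.
- On the holdout set, Ridge performed best (RMSE 1.3973, R² 0.7712), followed by Elastic Net (RMSE 1.4851, R² 0.7416) and Lasso (RMSE 1.6554, R² 0.6789); unregularized OLS was worst (RMSE 3.0494, R² -0.0896).
- In general, Ridge stabilizes coefficients under correlated predictors, Lasso selects features, and Elastic Net balances the two.
- All models were compared on RMSE, MAE, and R² on the same 20% holdout (40 rows).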
