# Week 4 Notebook: Logistic Regression and Feature Scaling

**Author**: James N. Hardison II

**Date**: 2025-09-25

**Course**: DX799 O1 Data Science Capstone, Mod C, Semester 1

**Goal**: Apply logistic regression with proper feature scaling to the Integrated Capstone Project dataset. Demonstrate overfitting control, metric selection, and hyperparameter tuning. This notebook mirrors the structure you used in Week 2 for continuity.

> Note: Imported cues from Week 2 headers for consistent structure:  
- # Week 2 — Linear Regression with Regularization (CKD)
- ## 0) Setup
- ## Dataset Choice
- ## 1) Load data and normalize column names
- ## 2) Target = hemoglobin → numeric + align X/y
- ## 3) Preprocess pipeline (impute + encode)
- ## 4) Train/Test split
- ## 5) Baseline OLS
- ## 6) Ridge / 7) Lasso / 8) Elastic Net
- ## 9) Reflection

## 1 Project Context

State the project in two or three sentences. State the outcome to predict as a binary variable.  
Example: predict kidney disease progression within a fixed horizon (yes or no).

Briefly note the client or stakeholder perspective and why this prediction is useful.

## 2 Data Loading and Setup

Datasets available in your workspace:  
- `ckd_dataset_v2.csv`  
- `acute_kidney_injury.csv`  
- `Diabetic_Nephropathy_v1.xlsx`

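If you prefer a single dataset, set `USE_COMBINED = False` and set `SINGLE_DATASET_PATH` to the file you want.  
Otherwise, the helper stacks the rows of all loaded datasets, keeping only the columns they share, and falls back to the largest dataset when fewer than five columns are shared.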
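
The shared-column combination can be sketched on toy data. The two frames and their column names below are hypothetical stand-ins, not the real datasets. The sketch shows one thing worth checking before relying on the combined frame: columns that are not present in every dataset, including differently named target columns, are dropped.

```python
import pandas as pd

# Hypothetical toy frames standing in for two of the real datasets
a = pd.DataFrame({"age": [50, 61], "bp": [80, 90], "ckd": [1, 0]})
b = pd.DataFrame({"age": [45], "bp": [70], "aki": [1]})

# Keep only columns present in both frames; sorting makes the column order reproducible
shared = sorted(set(a.columns) & set(b.columns))

# Stack the rows of both frames over the shared columns
combined = pd.concat([a[shared], b[shared]], ignore_index=True)

print(shared)          # the 'ckd' and 'aki' columns are not shared, so they are dropped
print(combined.shape)
```

If the target column does not survive the intersection, the single-dataset path is probably the safer choice.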

In [None]:
# Imports
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
                             confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay, DetCurveDisplay,
                             ConfusionMatrixDisplay, classification_report, auc, roc_curve, precision_recall_curve)
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
from sklearn.utils.class_weight import compute_class_weight

import matplotlib.pyplot as plt

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Versions: numpy=", np.__version__, "pandas=", pd.__version__)

In [None]:
# Configuration
USE_COMBINED = True  # if False, set SINGLE_DATASET_PATH below
SINGLE_DATASET_PATH = "/mnt/data/ckd_dataset_v2.csv"  # ignored if USE_COMBINED is True

# Attempt to load provided datasets
paths = {
    'ckd': "/mnt/data/ckd_dataset_v2.csv",
    'aki': "/mnt/data/acute_kidney_injury.csv",
    'dn' : "/mnt/data/Diabetic_Nephropathy_v1.xlsx",
}

dfs = {}
for k, p in paths.items():
    if os.path.exists(p):
        try:
            if p.endswith('.csv'):
                dfs[k] = pd.read_csv(p)
            elif p.endswith('.xlsx') or p.endswith('.xls'):
                dfs[k] = pd.read_excel(p)
            else:
                print(f"Skipping unknown format: {p}")
        except Exception as e:
            print(f"Failed to load {p}: {e}")
    else:
        print(f"Not found: {p}")

for name, df in dfs.items():
    print(f"{name} shape: {df.shape}")

In [None]:
# Light cleaning  standardize column names
def clean_cols(df):
    df = df.copy()
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
    return df

dfs = {k: clean_cols(v) for k, v in dfs.items()}

# Combine by shared columns if requested
if USE_COMBINED and len(dfs) > 0:
    shared_cols = None
    for df in dfs.values():
        shared_cols = set(df.columns) if shared_cols is None else shared_cols.intersection(df.columns)
    shared_cols = list(shared_cols) if shared_cols else None
    if shared_cols and len(shared_cols) >= 5:
        combined = pd.concat([df[shared_cols] for df in dfs.values()], ignore_index=True)
        data = combined
        source = "combined shared columns"
    else:
        # fallback: take the dataset with the most rows
        largest_key = max(dfs, key=lambda k: dfs[k].shape[0])
        data = dfs[largest_key]
        source = f"largest dataset: {largest_key}"
else:
    if os.path.exists(SINGLE_DATASET_PATH):
        if SINGLE_DATASET_PATH.endswith('.csv'):
            data = pd.read_csv(SINGLE_DATASET_PATH)
        else:
            data = pd.read_excel(SINGLE_DATASET_PATH)
        data = clean_cols(data)
        source = "single dataset path"
    else:
        raise FileNotFoundError("No dataset found. Please set SINGLE_DATASET_PATH correctly or upload a dataset.")

print("Active data source:", source, "| shape:", data.shape)
display(data.head(3))

### 2.1 Target Selection

Set your binary target variable name. The helper will try to guess likely targets based on common names. 
If not found, set it manually. Also configure positive class if needed.

In [None]:
# Try to guess a binary target column
candidate_targets = [
    'label','target','outcome','progression','disease','ckd','aki','dn','event','y','class','has_ckd','has_aki'
]

target_col = None
for c in candidate_targets:
    if c in data.columns:
        # check if binary-like
        nunique = data[c].nunique(dropna=True)
        if nunique <= 3:
            target_col = c
            break

print("Guessed target:", target_col)
print("Value counts preview if target exists")
if target_col is not None:
    print(data[target_col].value_counts(dropna=False).head())
else:
    print("No suitable target guessed. Please set target_col manually.")

In [None]:
# Manual override  set your target if needed
# target_col = 'progression'  # Example

if target_col is None:
    raise ValueError("Please set target_col above to a binary column present in your dataset.")

## 3 Exploratory Data Analysis that Informs Modeling

Keep this short and focused on insights that affect logistic regression. Use value counts for the target and check for missingness.
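
If the target turns out to be imbalanced, class weights are one common way to account for it in logistic regression. The sketch below uses a hypothetical toy target rather than the project data, and calls `compute_class_weight`, which the imports cell already brings in. Whether to pass the result to `LogisticRegression(class_weight=...)` is a judgment call and is not part of the original workflow.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 8 negatives and 2 positives
y_toy = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_toy)

# 'balanced' weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

The minority class receives the larger weight, so its errors count more during fitting.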

In [None]:
# Target distribution
vc = data[target_col].value_counts(dropna=False)
print("Target distribution\n", vc)

# Basic missingness
missing = data.isna().mean().sort_values(ascending=False)
print("\nTop 15 missingness\n", missing.head(15))

# Identify numeric and categorical features
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
if target_col in numeric_cols:
    numeric_cols.remove(target_col)
categorical_cols = [c for c in data.columns if c not in numeric_cols + [target_col]]

print("\nNumeric cols:", len(numeric_cols))
print("Categorical cols:", len(categorical_cols))

## 4 Train Test Split

In [None]:
# Clean target to 0/1
y_raw = data[target_col]
y = y_raw.copy()

# If strings (for example yes/no), convert to 0/1
# Caution: any label not listed in the map below, including missing values, becomes 0
if y.dtype == 'O' or y.dtype.name == 'category':
    y = y.astype(str).str.strip().str.lower().map({
        '1':1,'true':1,'yes':1,'y':1,'positive':1,'pos':1,'disease':1,'progression':1
    }).fillna(0).astype(int)
elif set(np.unique(y)) - {0,1}:
    # If numeric but not 0/1, binarize at the median (values above the median become 1)
    thresh = np.median(pd.to_numeric(y, errors='coerce').dropna())
    y = (pd.to_numeric(y, errors='coerce') > thresh).astype(int)

X = data.drop(columns=[target_col]).copy()

# Simple impute strategy per type
numeric_cols = [c for c in X.columns if pd.api.types.is_numeric_dtype(X[c])]
categorical_cols = [c for c in X.columns if c not in numeric_cols]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

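# Note: categorical columns are only imputed here, not encoded. String-valued
# columns must be encoded (for example with OneHotEncoder) before scaling and fitting.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
])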

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

X_train.shape, X_test.shape

## 5 Scaling Comparison

Compare StandardScaler, MinMaxScaler, and RobustScaler in a consistent evaluation setup. Use logistic regression with L2 penalty first.
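
To build intuition for why the scaler choice can matter, here is a minimal sketch on a single hypothetical feature with one extreme value. It is illustrative only and uses made-up numbers, not the project data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(x).ravel(),  # centers on mean, divides by std
    "minmax": MinMaxScaler().fit_transform(x).ravel(),      # maps min to 0 and max to 1
    "robust": RobustScaler().fit_transform(x).ravel(),      # centers on median, divides by IQR
}

for name, values in scaled.items():
    print(name, np.round(values, 3))
```

With MinMaxScaler the four typical values are squeezed close to 0, while RobustScaler keeps them spread out and leaves the extreme value far away. Which behavior helps depends on whether the extreme values are errors or real signal.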

In [None]:
scalers = {
    'standard': StandardScaler(),
    'minmax': MinMaxScaler(),
    'robust': RobustScaler()
}

results_scalers = []

for name, scaler in scalers.items():
    pipe = Pipeline(steps=[
        ('pre', preprocess),
        ('scale', scaler),
        ('clf', LogisticRegression(max_iter=2000, penalty='l2', solver='lbfgs', random_state=RANDOM_STATE))
    ])
    
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    y_proba = pipe.predict_proba(X_test)[:,1]
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc_ = roc_auc_score(y_test, y_proba)
    
    results_scalers.append((name, acc, prec, rec, f1, auc_))

pd.DataFrame(results_scalers, columns=['scaler','accuracy','precision','recall','f1','auc']).sort_values('auc', ascending=False)

## 6 Hyperparameter Tuning

Tune C and penalty. Use stratified CV. Note that L1 requires a solver that supports it (here `liblinear` or `saga`; `lbfgs` does not support L1). We will compare L1 and L2.
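
The practical difference between the two penalties is that L1 can set coefficients exactly to zero, while L2 only shrinks them. The sketch below illustrates this on synthetic data from `make_classification`. The helper name `count_nonzero_coefs` and the value `C=0.05` are assumptions chosen for the illustration, not settings from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 4 of which carry signal
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

def count_nonzero_coefs(penalty, C):
    """Fit a scaled logistic regression and count coefficients that are not exactly zero."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, C=C, solver="liblinear", random_state=0),
    )
    model.fit(X_syn, y_syn)
    return int(np.sum(model[-1].coef_ != 0))

n_l1 = count_nonzero_coefs("l1", 0.05)
n_l2 = count_nonzero_coefs("l2", 0.05)
print("nonzero coefficients  L1:", n_l1, " L2:", n_l2)
```

A sparse L1 solution can make the interpretation step shorter, at the cost of possibly dropping correlated features somewhat arbitrarily.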

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

pipe = Pipeline(steps=[
    ('pre', preprocess),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=3000, random_state=RANDOM_STATE))
])

param_grid = [
    {'clf__penalty': ['l2'], 'clf__solver': ['lbfgs','liblinear'], 'clf__C': [0.01, 0.1, 1, 10, 100]},
    {'clf__penalty': ['l1'], 'clf__solver': ['liblinear','saga'], 'clf__C': [0.01, 0.1, 1, 10, 100]}
]

grid = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=cv, n_jobs=-1, verbose=0)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV AUC:", grid.best_score_)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:,1]

print("\nTest metrics")
print("Accuracy", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred, zero_division=0))
print("Recall", recall_score(y_test, y_pred, zero_division=0))
print("F1", f1_score(y_test, y_pred, zero_division=0))
print("ROC AUC", roc_auc_score(y_test, y_proba))

## 7 Diagnostic Plots

ROC, PR curve, and confusion matrix. Add calibration to assess probability quality.
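
A single-number companion to the calibration curve is the Brier score, the mean squared difference between predicted probabilities and observed outcomes (lower is better). It is not used elsewhere in this notebook. The sketch below uses hypothetical labels and probabilities to show the calculation.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical labels and predicted probabilities of the positive class
y_true_demo = np.array([0, 1, 1, 0])
p_demo = np.array([0.1, 0.9, 0.8, 0.3])

# Mean of (p - y)^2 over all samples
brier = brier_score_loss(y_true_demo, p_demo)
print(round(brier, 4))
```

On the real model the same call would take `y_test` and `y_proba`.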

In [None]:
# Pass ax explicitly so each display draws on the sized figure instead of opening a new one
fig, ax = plt.subplots(figsize=(6,5))
RocCurveDisplay.from_estimator(best_model, X_test, y_test, ax=ax)
ax.set_title("ROC curve: best model")
plt.show()

fig, ax = plt.subplots(figsize=(6,5))
PrecisionRecallDisplay.from_estimator(best_model, X_test, y_test, ax=ax)
ax.set_title("Precision-Recall curve: best model")
plt.show()

fig, ax = plt.subplots(figsize=(5,5))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax)
ax.set_title("Confusion matrix: best model")
plt.show()

# Calibration
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10, strategy='quantile')
plt.figure(figsize=(6,5))
plt.plot(prob_pred, prob_true, marker='o')
plt.plot([0,1], [0,1], linestyle='--')
plt.xlabel('Predicted probability')
plt.ylabel('Observed frequency')
plt.title('Calibration curve: best model')
plt.show()

## 8 Threshold Tuning

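Use the Youden J statistic (J = sensitivity + specificity - 1, equivalently TPR - FPR) to choose a threshold that balances sensitivity and specificity. Report updated metrics. Because the threshold here is selected on the test set, treat the updated metrics as optimistic.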
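
As a worked illustration of the Youden J calculation, the sketch below uses ten hypothetical labels and scores. It also masks out the first threshold returned by `roc_curve`, which is a sentinel value (infinite in recent scikit-learn releases) rather than a usable cut-off.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted scores (4 positives, 6 negatives)
y_true_demo = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.65, 0.8, 0.9])

fpr, tpr, thr = roc_curve(y_true_demo, scores_demo)
j = tpr - fpr  # Youden J at each candidate threshold

# Ignore non-finite sentinel thresholds when picking the maximum
finite = np.isfinite(thr)
ix = int(np.argmax(np.where(finite, j, -np.inf)))
best_thr_demo = thr[ix]

print("best threshold:", best_thr_demo, " J:", round(float(j[ix]), 3))
```

At the selected threshold all four positives are caught and one of six negatives is flagged. To avoid tuning on the test set, the same calculation could be run on out-of-fold training predictions instead.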

In [None]:
fpr, tpr, thr = roc_curve(y_test, y_proba)
youden = tpr - fpr
ix = np.argmax(youden)
best_thr = thr[ix]
best_thr

In [None]:
y_pred_thr = (y_proba >= best_thr).astype(int)
print("Threshold", best_thr)
print("Accuracy", accuracy_score(y_test, y_pred_thr))
print("Precision", precision_score(y_test, y_pred_thr, zero_division=0))
print("Recall", recall_score(y_test, y_pred_thr, zero_division=0))
print("F1", f1_score(y_test, y_pred_thr, zero_division=0))
print("ROC AUC", roc_auc_score(y_test, y_proba))

## 9 Interpretation

Show feature coefficients for the best model. Map coefficients back to original column names after preprocessing.
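
One common way to read logistic regression coefficients is as odds ratios, obtained by exponentiating them. The sketch below uses hypothetical feature names and coefficient values, not results from this model. Because the features are scaled before fitting, each odds ratio refers to a change of one scaled unit (one standard deviation under StandardScaler).

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on standardized features (illustrative values only)
coefs_demo = pd.Series({"serum_creatinine": 0.69, "hemoglobin": -0.41, "age": 0.10})

# exp(coef) is the multiplicative change in the odds per one-unit increase in the scaled feature
odds_ratios = np.exp(coefs_demo).sort_values(ascending=False)
print(odds_ratios.round(2))
```

Values above 1 raise the odds of the positive class, and values below 1 lower them.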

In [None]:
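# Extract feature names after preprocessing
def get_feature_names(preprocessor, numeric_cols, categorical_cols):
    # Prefer the names reported by the fitted preprocessor so that dropped or
    # expanded columns (for example after one-hot encoding) stay aligned with coef_
    try:
        return list(preprocessor.get_feature_names_out())
    except Exception:
        # Fallback: assumes no OneHot and no dropped columns
        return list(numeric_cols) + list(categorical_cols)

# Use the preprocessor fitted inside the best model
feature_names = get_feature_names(best_model.named_steps['pre'], numeric_cols, categorical_cols)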

clf = best_model.named_steps['clf']
if hasattr(clf, 'coef_'):
    coefs = pd.Series(clf.coef_.ravel(), index=feature_names)
    coefs.sort_values(key=np.abs, ascending=False, inplace=True)
    display(coefs.head(20))
else:
    print("No coefficients available on this classifier.")

## 10 Overfitting Control

Briefly discuss what you did to reduce overfitting. Mention CV, regularization, data split, and threshold tuning results.
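
One way to back this discussion with numbers is to compare training and cross-validated scores at different regularization strengths. The sketch below does this on synthetic data from `make_classification`. The sample sizes and `C` values are arbitrary assumptions for illustration, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with many features relative to the number of samples
X_syn, y_syn = make_classification(
    n_samples=200, n_features=40, n_informative=5, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

gaps = {}
for C in [0.01, 1, 100]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
    scores = cross_validate(
        model, X_syn, y_syn, cv=cv, scoring="roc_auc", return_train_score=True
    )
    # Gap between mean training AUC and mean validation AUC; larger gaps suggest overfitting
    gaps[C] = scores["train_score"].mean() - scores["test_score"].mean()
    print(f"C={C}: train-validation AUC gap = {gaps[C]:.3f}")
```

The same comparison on the real pipeline would replace the synthetic arrays with `X_train` and `y_train`.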

## 11 Yellowdig Source Snippet

Paste the following in Yellowdig with your own short note on quality.

**Citation suggestion (APA)**  
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research, 12*, 2825–2830.

**Why high quality**  
Peer reviewed venue and foundational library paper. Clear methods and implementation details. Widely cited in machine learning research and practice.

## 12 Appendix Notes for Milestone One

Flag figures and sections you plan to reuse in the 8 to 10 page summary. Mention which parts map to breadth and which week you plan to go deep on for depth.