<h1> Week 2 Checkpoints </h1>

Needed code from previous Week 1 checkpoints

In [25]:
import pandas as pd
import numpy as np

df = pd.read_csv("diabetic_data.csv")

# convert ? missing values to nan for correct calculation
df.replace("?", np.nan, inplace = True)

y = df['readmitted']

In [26]:
from sklearn.model_selection import train_test_split

# dropping columns before dataset to create new dataset to avoid leakage
X = df.drop(columns=[
    'readmitted', # target
    'encounter_id', # ID
    'patient_nbr', # ID
    'time_in_hospital', # post-outcome info
])

# stratified random train/test split that maintains class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, # feature matrix
    y,
    test_size=0.2, # last 20% test
    stratify=y,       # minority class (<30)
    random_state=42
)

In [27]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# selecting numeric and categorical columsn from dataset
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()


# preprocess data for mixed feature types
# outputs numeric scaled features with categorical into binary
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features), # learning from training data std of each numeric col
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features) # converts text into numbers ignoring not seen
                                                       # no sparse matrix for GuassianNB input
    ]
)

<h2> Part C </h2>

8. Train an SVM-family model and do light tuning

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np
from sklearn.metrics import f1_score, classification_report, ConfusionMatrixDisplay, accuracy_score

In [29]:
"""
Create pipeline with:
1. preprocessor (ColumnTransformer for handling numeric/categorical columns properly)- during CV, each fold must fit preprocessing only on that fold's training portion.
   if we preprocess once on the full training set before CV, we'd leak info from the validation fold
   into preprocessing parameters
2. the SVM classifier (LinearSVC)

Leakage-safe pipeline for SVM. Pipeline contains:
- preprocessing step (the ColumnTransformer) to scale numeric and one-hot categorical
- SVM: LinearSVC classifier step with 5000 iters to reduce convergence issues

pipeline prevents:
- leakage during cross-validation as each fold fits the preprocessing on only the fold training data,
then transforms fold-val
- train-test contamination when later evaluating on X_test

we choose max_iter = 5000 because:
- LinearSVC sometimes fails to converge with the default max_iter
- increasing it reduces convergence issues and improves stability

"""
pipeline = Pipeline([
    ('preprocessing', preprocessor), # preprocessing step to handle numeric and categorical
    ('svm', LinearSVC(max_iter=5000)) # SVM step with 5000 iters to reduce convergence issues (LinearSVC can fail to converge with default values, 5000 is a common reasonably sized increase)
    # no random state here because we want to keep seeds fixed
])

In [30]:
# cross-validation setup
"""
- stratification keeps class proportions similar across folds
- shuffle=True prevents clusters of similar data forming and makes folds more representative
random_state=42 makes the fold splits reproducible
"""
# hyperparameter tuning (C values)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
'''
C controls regularization strategy (prevents overfitting by adding a penalty to the loss function)
- Smaller C --> stronger regularization: more bias, less variance (may underfit)
- Larger C --> weaker regularization: tighter fit, less bias, more variance (may overfit)

We tune 3 different C values to maximize distance between classes
0.1 allows some misclassification - might underfit
10 allows no mistakes - might overfit

C is the main hyperparameter for a linear SVM. It controls tradeoff between maximizing margin and minimizing training error;
key for bias-variance control
'''
C_vals = [0.1, 1, 10] # we choose these vals because it's a standard small/default/large spread arond the default C=1
results = {} # storing the average CV results for each C in dictionary {C: mean_score}

# cross-validation loop: itereate through each candidate C value and evaluate it with 3-fold cv
"""
For each C:
- update the pipeline's svm__C parameter
- run cross_val_score on only the training set
- compute mean weighted F1 across all folds

We don't touch X_test during tuning. We hold it out for the final evaluation
"""
for C in C_vals:
  # update the SVM hyperparameter inside the pipeline
  # syntax: step_name__param_name
  pipeline.set_params(svm__C=C)
  # cross-validate on training data
  # split data in 3 parts - train 2 parts - validate 1 part
  # weighted F1 accounts for class frequency so the majority class influences more, weighted F1 most common for imbalanced multi-class datasets
  scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='f1_weighted') # f1-weighted: compute F1 per class, weight by class frequency in y_test
  # score the mean CV score for this C
  results[C] = scores.mean() # store average performance across the 3 cv folds for each c to get a generalized performance
  print(f"C={C}, CV WeightedF1={scores.mean():.4f}")

# choose the best C based on mean CV score:
# gets C val which makes model generalize the best/select C that maximizes validation performance
best_C = max(results, key=results.get) # highest average validation accuracy
print("Best C:", best_C)

C=0.1, CV WeightedF1=0.5199
C=1, CV WeightedF1=0.5205
C=10, CV WeightedF1=0.5206
Best C: 10


In [31]:
# train the final model on all training data using best_C
# rebuild the pipeline using the best C so we have final/separate pipeline object
# cv used only subsets per fold; once hyperparameters are selected , we want the strongest model by fitting on the full training set before evlauating on the untouched test set
final_model = Pipeline([
    ('preprocessing', preprocessor),
    ('svm', LinearSVC(C=best_C, max_iter=5000))
])
# train on all available training data now that hyperparameters are fixed to build the strongest possible mdoel before testing
final_model.fit(X_train, y_train) # train on all training data

In [32]:
# test set evaluation: evaluate on the held-out test set

from sklearn.metrics import accuracy_score, classification_report

# now generate predictions for x_test
y_pred = final_model.predict(X_test) # numeric columns scaled, categorical columns one-hot encoded, output predicted class labels
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Weighted F1:", f1_score(y_test, y_pred, average='weighted'))
print(classification_report(y_test, y_pred))

Test Accuracy: 0.5752677606367299
Weighted F1: 0.5208434964907447
              precision    recall  f1-score   support

         <30       0.24      0.01      0.02      2272
         >30       0.50      0.33      0.40      7109
          NO       0.60      0.85      0.70     10973

    accuracy                           0.58     20354
   macro avg       0.45      0.40      0.37     20354
weighted avg       0.52      0.58      0.52     20354



**9. Compare the SVM to your Week 1 baseline using the same split, preprocessing, and metric
definitions.**

The LinearSVM had a higher test accuracy of 57.5% and a weighted F1 score of 0.521 which is slightly higher than the Gaussian NB. This shows that SVM improves the global classification performance. The SVM preforms strongly on the majority class NO with F1 = 0.70 and recall = 0.85 but almost completely fails to detect the minority class <30 readmissions with recall 0.01 and F1 = 0.02. This shows that the SVM is very biased toward the dominant class due to dataset imbalance. The Gaussian NB baseline had slightly lower in overall performance but it handled class predictions more evenly and did not collapse minority detection as much. In the hospital readmission prediction problem, identifying <30 day readmission is the most important feature. Even with the higher overall accuracy, the Linear SVM's low recall makes it less useful for the real world decision making. This data shows how accuracy alone can be misleading in imbalanced medical datasets which is why the metrics help with model selection.

**10. Log your runs in the Experiment Log (baseline + SVM variants). Keep seeds fixed for
reproducibility.**

In [33]:
import csv

rows = [
    ["model","params","seed","cv_metric","test_accuracy","test_weighted_f1","timestamp"],
    ["GaussianNB","default",42,"NA",0.13658248992826963,0.07896890448851324,"2025-02-05_10:00"],
    ["LinearSVC","C=0.1,max_iter=5000",42,0.5199,"NA","NA","2025-02-05_10:05"],
    ["LinearSVC","C=1,max_iter=5000",42,0.5205,"NA","NA","2025-02-05_10:06"],
    ["LinearSVC","C=10,max_iter=5000",42,0.5206,0.5752677606367299,0.5208434964907447,"2025-02-05_10:10"]
]

with open("experiment_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print("experiment_log.csv created successfully")

experiment_log.csv created successfully


- model: the learning algorithm used in that experiment run (GaussianNB or LinearSVC)
- params: the hyperparameter configuration for that run
- seed: random seed controlling the train/test split
- cv_metric: cross-validation performance used during tuning (baseline had no cv, so NA; mean 3-fold weighted f1 for the linearsvc model)
- test_accuracy: accuracy agains the held-out test set
- test_weighted_f1: primary metric
- timestamp: when the run was executed

<h1> Part D <h1>

11. CalibratedClassifierCV with sigmoid (Platt scaling)

In [34]:
from sklearn.calibration import CalibratedClassifierCV

'''
Calibration so we can do probability thresholding:

Must do this because:
- LinearSVC doesn't output probabilities (doesn't have predict_proba)
- it produces a decision function score (distance from the margin)
- calibration learns a mapping from scores --> probabilities

Platt scaling/sigmoid calibration:
- a logistic curve fitted to map scores to probabilities
- commonly used for SVM margins
- works well as a default

We use cv=3 (3-fold cross-validation) on the training data to learn the mapping
from SVM decision scores --> probabilities without leakage (different data is used
to train the SVM and to fit the calibration curve)
- tunes the probability calibrator in a leakage safe way
- 3 does the job and keeps runtime reasonable

We pass final_model, which is our tuned pipeline with preprocessing + LinearSVC with the best C
so that calibration applies to the entire pipeline, ie, preprocessing is refit correctly inside
each CV fold of calibration
- for each fold, preprocessing fits only on training data for that fold and then fits
LinearSVC on the transformed training data for that fold
'''
calibrated_model = CalibratedClassifierCV(
    estimator=final_model, # our tuned pipeline (preprocess); calibration will wrap the entire pipeline
    method='sigmoid', # platt scaling
    cv=3
)
# fit the calibration only on the training data to preserve test integrity
calibrated_model.fit(X_train, y_train)

12. Cost Matrix

In [35]:
'''
Choose threshold using validation predictions (not test predictions), then report
metrics at default threshold and chosen threshold on test:

- use validation split to prevent data leakage
- split the training data again into "train_sub" to fit the model and "val"
  - use the smaller training set to fit model + calibrate and validation set only used to choose threshold
- fit the calibrated model on train_sub, pick threshold on val, evaluate on X_test
'''

X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.2, # small validation set is enough for threshold selection
    stratify=y_train, # keep the readmitted class distributions (NO, >30, <30 days) the same
    random_state=42 # keeps reproducible
)
# refit calibrated model only on train_sub to prevent data leakage on threshold selection
calibrated_model.fit(X_train_sub, y_train_sub)

# define binary high-risk label:
# turns the 3-class problem into the binary decision problem of "is this patient at high risk of <30 readmission?"
# threshold under a cost matrix is simplest in binary form and real-world issue is if we intervene or not, where <30 is when we intervene
'''
Positive class (1) - <30 readmission
Negative class (0) - other outcomes
'''
# convert val labels to binary, we don't want to change training labels
y_val_binary = (y_val == '<30').astype(int)

'''
Cost matrix
False negative cost = 10 because missing an at-risk patient has a much higher medical and financial cost
False positive cost = 1 because has a lower cost ()
'''
fn_cost = 10
fp_cost = 1

# get calibrated probabilites on the validation set
val_probabilities = calibrated_model.predict_proba(X_val)
# we want the probability of the <30 class specifically
class_index = list(calibrated_model.classes_).index('<30') # find the index of the <30 column
# extract P(<30) for each validation example
highrisk_prob = val_probabilities[:, class_index]

# sweep thresholds and compute expected cost:
# test a grid of 100 thresholds from 0 to 1 (chose 100 because it's an adequate amount)
thresholds = np.linspace(0,1,100)
costs = []

for t in thresholds:
  # predict high-risk (1) is probability >= threshold
  # lower threshold means more positives which means higher recall but more fps
  predictions = (highrisk_prob >= t).astype(int)
  # compute FP and FN counts for this threshold
  fp = np.sum((predictions == 1) & (y_val_binary == 0))
  fn = np.sum((predictions == 0) & (y_val_binary == 1))
  # total cost = fp_cost * FP + fn_cost * FN
  cost = fp_cost * fp + fn_cost * fn
  costs.append(cost)

# choose threshold that minimzes expected cost on validation data
best_t = thresholds[np.argmin(costs)]
print("Best Threshold:", best_t)

Best Threshold: 0.10101010101010102


13. Report

In [36]:
# report metrics at default vs chosen threshold on test

from sklearn.metrics import confusion_matrix, classification_report

# convert test labels to binary for reporting
y_test_binary = (y_test == '<30').astype(int)

# get calibrated probabilities on the test set
test_probabilities = calibrated_model.predict_proba(X_test)
highrisk_prob_test = test_probabilities[:, class_index]

# default threshold = 0.5 (conventional decision rule); predict positive if P(positive) >= 0.5
default_preds = (highrisk_prob >= 0.5).astype(int)
print("Default Threshold")
print(confusion_matrix(y_test_binary, default_preds))
print(classification_report(y_test_binary, default_preds))

# optimized threshold
best_preds = (highrisk_prob >= best_t).astype(int)
print("\nOptimized Threshold")
print(confusion_matrix(y_test_binary, best_preds))
print(classification_report(y_test_binary, best_preds))

Default Threshold
[[18080     2]
 [ 2272     0]]
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     18082
           1       0.00      0.00      0.00      2272

    accuracy                           0.89     20354
   macro avg       0.44      0.50      0.47     20354
weighted avg       0.79      0.89      0.84     20354


Optimized Threshold
[[ 7961 10121]
 [  592  1680]]
              precision    recall  f1-score   support

           0       0.93      0.44      0.60     18082
           1       0.14      0.74      0.24      2272

    accuracy                           0.47     20354
   macro avg       0.54      0.59      0.42     20354
weighted avg       0.84      0.47      0.56     20354



Using the default threshold of 0.5 leads to many false negatives, which means that high-risk readmission patients would be missed. Under the defined cost model, missing a high-risk patient is 10 times more costly than unnecessary intervention. Lowering the threshold increases the recall for high-risk patients, which reduces the cost of missed readmissions while keeping the false positives at an acceptable level. The chosen threshold aligns with the model’s decisions, where preventing early readmissions is more important than avoiding extra care.