# 🏥 Notebook 3: Machine Learning with Tabular Data

In this notebook, we’ll shift from images to **tabular clinical data**.  
You’ll predict whether a patient is likely to have diabetes using routine measurements from the **Pima Indians Diabetes** dataset.

---

## 🎯 Learning Objectives
By the end, you will be able to:
1. Load and explore a real clinical tabular dataset.
2. Clean data (handle missing/invalid values) and engineer features.
3. Build a robust ML pipeline (imputation → scaling → model).
4. Train, tune, and evaluate a classifier with appropriate metrics.

---

## 🧠 Clinical Context
In many clinical settings, **tabular data**—vitals, lab values, demographics—is the primary source of information.  
A well-built model can support **risk screening** and trigger further clinical assessment.

> ⚠️ This is an *educational* exercise. Do **not** use these models for real clinical decisions.


In [None]:
#@title Install/Import (Colab) { display-mode: "form" }
# Minimal installs; most are preinstalled on Colab
!pip -q install scikit-learn pandas numpy matplotlib seaborn

import io, os, sys, math, json, time, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from typing import Tuple

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, classification_report,
    confusion_matrix, roc_auc_score, RocCurveDisplay
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

print("✅ Environment ready")


## 3.1 Data Exploration, Cleaning, and Preprocessing

We’ll use the Pima Indians Diabetes dataset (binary label: `outcome`).  
Several physiological measurements in the raw file contain **invalid zeros** (e.g., `glucose == 0`) that stand in for missing values; we’ll mark them as missing and impute them.

**Steps**
1. Load dataset from URL or upload your own CSV.
2. Inspect schema and summary stats.
3. Mark impossible zeros as missing and **impute**.
4. (Optional) Scale features for linear/SVM models.
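
To see why step 3 matters, here is a minimal sketch on a made-up toy column (the values are illustrative and not taken from the dataset). It compares the mean with and without the placeholder zeros, then fills the gaps with the median of the valid readings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy values (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({"glucose": [0, 100, 120, 0, 140]})

# Treating zeros as real measurements drags the mean down
mean_with_zeros = toy["glucose"].mean()            # (0+100+120+0+140) / 5

# Mark zeros as missing; the mean now reflects only valid readings
toy_clean = toy.replace(0, np.nan)
mean_without_zeros = toy_clean["glucose"].mean()   # (100+120+140) / 3

# Median imputation fills each gap with the median of the valid readings
imputed = SimpleImputer(strategy="median").fit_transform(toy_clean)

print(mean_with_zeros, mean_without_zeros)  # 72.0 120.0
print(imputed.ravel())                      # zeros replaced by 120.0
```

The same idea is applied to the real columns in the cells that follow.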


In [None]:
#@title Load Dataset (URL or Upload) { run: "auto" }
use_url = True  #@param {type:"boolean"}
data_url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv" #@param {type:"string"}
# Column names from the dataset description
column_names = [
    "pregnancies","glucose","blood_pressure","skin_thickness",
    "insulin","bmi","diabetes_pedigree","age","outcome"
]

if use_url:
    df = pd.read_csv(data_url, names=column_names)
else:
    # Assumes the uploaded CSV has no header row and the same nine columns as the URL file
    from google.colab import files
    uploaded = files.upload()
    fname = list(uploaded.keys())[0]
    df = pd.read_csv(io.BytesIO(uploaded[fname]), names=column_names)

print("✅ Loaded data with shape:", df.shape)
display(df.head())


In [None]:
#@title Quick Audit (Info & Stats)
print("Data Types / Non-null counts")
df.info()  # info() prints its report directly and returns None, so don't wrap it in print()
print("\nSummary statistics")
display(df.describe(include='all'))
print("\nClass balance (outcome):")
print(df['outcome'].value_counts(normalize=True).rename('proportion'))


In [None]:
#@title Mark Impossible Zeros as Missing & Impute { run: "auto" }
# Columns where zero is physiologically implausible in this dataset representation
zero_as_missing = ["glucose","blood_pressure","skin_thickness","insulin","bmi"]

df_clean = df.copy()
for c in zero_as_missing:
    zeros = (df_clean[c] == 0).sum()
    if zeros > 0:
        print(f"⚠️ Found {zeros} zero(s) in {c}; setting to NaN")
        df_clean.loc[df_clean[c] == 0, c] = np.nan

print("\nMissing values per column (after marking zeros):")
print(df_clean.isna().sum())

# We'll impute later inside a pipeline; preview a simple mean-impute copy for EDA:
df_preview = df_clean.copy()
for c in zero_as_missing:
    df_preview[c] = df_preview[c].fillna(df_preview[c].mean())

display(df_preview.head())


In [None]:
#@title Visual EDA (Hist, Correlation) { run: "auto" }
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
axes = axes.ravel()
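feature_cols = df.columns[:-1]  # exclude outcome
for i, col in enumerate(feature_cols):
    axes[i].hist(df_preview[col].dropna(), bins=30)
    axes[i].set_title(col)
# The 3x3 grid has nine axes but there are only eight features; hide the spare one
for ax in axes[len(feature_cols):]:
    ax.set_visible(False)
fig.tight_layout()
plt.show()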

plt.figure(figsize=(8,6))
sns.heatmap(df_preview.corr(numeric_only=True), annot=False)
plt.title("Correlation Heatmap")
plt.show()


## 3.2 Model Selection, Training, and Tuning

We’ll assemble a **Pipeline** to ensure leakage-free preprocessing and reproducibility:

- **Imputer**: replace missing values with the column median
- **Scaler**: (optional) standardize features
- **Estimator**: choose among Logistic Regression / Random Forest / SVM

Use the form to pick model and hyperparameters.
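
As a rough illustration of what "leakage-free" means here, the sketch below fits a small pipeline on made-up toy data (the arrays and the `toy_pipe` name are assumptions for this example only). The imputer's median is learned from the training rows alone, so an extreme value in the test rows cannot influence it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one feature, with a missing value in each split
X_train_toy = np.array([[1.0], [2.0], [np.nan], [3.0], [10.0], [11.0]])
y_train_toy = np.array([0, 0, 0, 0, 1, 1])
X_test_toy = np.array([[np.nan], [100.0]])

toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
toy_pipe.fit(X_train_toy, y_train_toy)

# The median comes from the training rows only: median(1, 2, 3, 10, 11) = 3
learned_median = toy_pipe.named_steps["imputer"].statistics_[0]
print("Median learned from training data:", learned_median)

# At predict time the test NaN is filled with that training median;
# the extreme test value (100) never contributed to it
test_predictions = toy_pipe.predict(X_test_toy)
print(test_predictions)
```

Had the imputer been fitted on the full dataset before splitting, the test rows would have shifted the learned statistics, which is the leakage the pipeline is meant to avoid.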


In [None]:
#@title Split + Pipeline + Model Choice (Form) { display-mode: "form" }
test_size = 0.2            #@param {type:"number"}
random_state = 42          #@param {type:"number"}
scale_features = True      #@param {type:"boolean"}

model_choice = "RandomForest"  #@param ["RandomForest","LogisticRegression","SVM"]

# RandomForest params
rf_n_estimators = 300      #@param {type:"integer"}
rf_max_depth = 6           #@param {type:"integer"}
rf_min_samples_leaf = 2    #@param {type:"integer"}

# LogisticRegression params
lr_C = 1.0                 #@param {type:"number"}
lr_penalty = "l2"          #@param ["l2","l1"]
lr_solver = "liblinear"    #@param ["liblinear","lbfgs","saga"]

# SVM params
svm_C = 1.0                #@param {type:"number"}
svm_kernel = "rbf"         #@param ["linear","rbf","poly"]

# Features/Target
X = df_clean.drop(columns=["outcome"])
y = df_clean["outcome"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=float(test_size), stratify=y, random_state=int(random_state)
)

# Numeric columns
num_features = X.columns.tolist()

num_transformers = []
num_transformers.append(("imputer", SimpleImputer(strategy="median")))
if scale_features:
    num_transformers.append(("scaler", StandardScaler()))

preprocess = Pipeline(num_transformers)

if model_choice == "RandomForest":
    clf = RandomForestClassifier(
        n_estimators=int(rf_n_estimators),
        max_depth=int(rf_max_depth),
        min_samples_leaf=int(rf_min_samples_leaf),
        random_state=int(random_state),
        class_weight="balanced"  # helps with class imbalance
    )
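elif model_choice == "LogisticRegression":
    # lbfgs supports only the l2 penalty, so fall back to liblinear when l1 is selected
    solver = "liblinear" if (lr_penalty == "l1" and lr_solver == "lbfgs") else lr_solver
    clf = LogisticRegression(
        C=float(lr_C),
        penalty=lr_penalty,
        solver=solver,
        random_state=int(random_state),
        max_iter=300,
        class_weight="balanced"
    )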
else:
    clf = SVC(
        C=float(svm_C),
        kernel=svm_kernel,
        probability=True,
        random_state=int(random_state),
        class_weight="balanced"
    )

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", clf)
])

print("✅ Pipeline ready:", pipe)


In [None]:
#@title Train Model
pipe.fit(X_train, y_train)
print("✅ Model training complete!")

## 3.3 Model Testing & Inference

We’ll compute:
- Accuracy, Precision, Recall, F1
- Confusion Matrix
- ROC-AUC & ROC curve

> In screening contexts, **Recall (Sensitivity)** is especially important.
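
The sketch below works through recall and precision by hand on made-up labels and probabilities (not model output from this notebook). It also shows one common way to trade precision for recall: lowering the decision threshold below the default of 0.5. The helper name `metrics_at` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predicted probabilities (made up for illustration)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba_toy = np.array([0.9, 0.7, 0.4, 0.3, 0.6, 0.35, 0.32, 0.2, 0.1, 0.1])

def metrics_at(threshold):
    """Return (recall, precision) when predicting positive at proba >= threshold."""
    y_hat = (proba_toy >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat).ravel()
    recall = tp / (tp + fn)      # share of true cases that were caught
    precision = tp / (tp + fp)   # share of flagged cases that were real
    return recall, precision

for t in (0.5, 0.3):
    r, p = metrics_at(t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

On these toy numbers, moving the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision falls from 0.67 to 0.57. Whether that trade is acceptable depends on the cost of a missed case versus an unnecessary follow-up.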


In [None]:
#@title Evaluate Model
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:,1] if hasattr(pipe.named_steps["model"], "predict_proba") else None

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")

print(f"Accuracy:  {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall:    {rec:.3f}")
print(f"F1-score:  {f1:.3f}")

print("\nClassification Report")
print(classification_report(y_test, y_pred, digits=3))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4,3))
sns.heatmap(cm, annot=True, fmt="d", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

if y_proba is not None:
    auc = roc_auc_score(y_test, y_proba)
    print(f"ROC-AUC: {auc:.3f}")
    RocCurveDisplay.from_predictions(y_test, y_proba)
    plt.show()
else:
    print("Model does not expose predict_proba; ROC-AUC not computed.")


## 🧪 Exercise: Try a Different Model

- Switch `model_choice` (e.g., Logistic Regression or SVM).
- Toggle `scale_features`.
- Adjust hyperparameters.
- Observe: Which model **improves Recall** without destroying Precision?

> 💡 Bonus: Add polynomial features or try `class_weight=None` and compare.
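
If you would rather search settings systematically than adjust the form by hand, one possible approach is a cross-validated grid search. The sketch below is an illustration under stated assumptions: it uses synthetic stand-in data from `make_classification` rather than the notebook's `X_train`, and the step names, grid values, and `demo_pipe` are choices made for this example. It also folds in the bonus ideas (polynomial features, `class_weight=None` versus `"balanced"`).

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): 8 features, imbalanced classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

demo_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate settings, addressed as <step name>__<parameter>
param_grid = {
    "poly__degree": [1, 2],
    "model__C": [0.1, 1.0, 10.0],
    "model__class_weight": [None, "balanced"],
}

# Score each candidate by mean recall over stratified 5-fold cross-validation
search = GridSearchCV(
    demo_pipe, param_grid, scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best mean CV recall: {search.best_score_:.3f}")
```

To try this on the notebook's data, you could pass `X_train, y_train` to `search.fit` instead. Optimizing recall alone can favor models that flag almost everyone, so check precision on the selected model as well.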


In [None]:
#@title (Optional) Feature Importance for Trees
is_rf = isinstance(pipe.named_steps["model"], RandomForestClassifier)
if is_rf:
    importances = pipe.named_steps["model"].feature_importances_
    idx = np.argsort(importances)[::-1]
    plt.figure(figsize=(6,4))
    sns.barplot(x=importances[idx], y=np.array(X.columns)[idx])
    plt.title("Random Forest Feature Importance")
    plt.show()
else:
    print("Feature importance plot only shown for RandomForest.")

## ✅ Summary and Reflection

### What you accomplished
- Cleaned and imputed clinical tabular data.
- Built a robust sklearn pipeline.
- Trained, tuned, and evaluated multiple classifiers.

### Reflection prompts
1. In screening tasks, is **Recall** or **Precision** more important? Why?
2. What are potential sources of **bias** in this dataset?
3. How would you update the pipeline for **temporal validation** (simulating deployment in the future)?

---

### Next steps
- Notebook 4: **AI Engineering** — deploy as an API and monitor in production.