# Early Breast Cancer Prediction — AT‑Style Comprehensive Notebook (Procedural)

**Updated:** 2025-08-08 21:22

This notebook mirrors the length and step‑by‑step detail of the AT AML reference. It is **procedural** (no custom `def` functions) and runs on the 10% sample now; re‑run on full data later.


## Table of Contents
1. Project Context & Objectives  
2. Data Provenance & Governance  
3. Setup & Environment Check  
4. Configuration (Paths, Target, Random Seed)  
5. Reproducibility: Version Snapshot  
6. Load Data & Quick Sanity Checks  
7. Schema Review & Data Dictionary Stub  
8. Target Definition & Encoding Policy  
9. Class Balance & Baseline Naive Metrics  
10. Missing Values (Global & Column-Level)  
11. Numeric Feature Exploration (Distributions, Outliers)  
12. Categorical Feature Exploration (Top Levels)  
13. Correlation & Collinearity Review  
14. Train/Test Split (Stratified)  
15. Preprocessing Plan (Impute → Encode → Scale)  
16. Fit Preprocessor & Post-Transform NaN/Shape Checks  
17. Leakage Guardrails (Target Leakage Scan)  
18. Optional Procedural Feature Engineering (Bins, Ratios, Re-Preprocess & Shape Checks)  
19. Model Suite (LogReg, RandomForest, XGBoost if installed; SMOTE inside CV)  
20. Cross-Validation: Run (Stratified K-Fold, Per-Model Means)  
21. Cross-Validation: Per-Fold AUCs (Top Model)  
22. Threshold Analysis (validation folds)  
23. Fit Best Model on Train (with FE)  
24. Hold-Out Test Evaluation (Confusion, ROC, PR, Metrics)  
25. Calibration Check (Reliability Curve)  
26. Feature Importance / Permutation Intuition  
27. Hyperparameter Tuning (RandomizedSearchCV)  
28. Compare Tuned vs. Baseline on Test (tuned metrics are printed at the end of the Section 27 cell)  
29. Fairness & Subgroup Slices  
30. Error Analysis: Inspect FP/FN  
31. Model Persistence (Optional)  
32. Scaling to Full Dataset  
33. Risks, Limitations, Ethics  
34. Next Steps & Checklist


## 1) Project Context & Objectives
- **Goal:** Predict breast cancer risk from non‑invasive features (demographics, lifestyle, symptoms).
- **Primary metric:** ROC‑AUC; also Recall/F1 and PR‑AUC.
- **Scale:** Develop on 10% sample; re‑run on ~1.5M rows later.


## 2) Data Provenance & Governance
- Institutional source; assumed lawful use.
- Avoid PII; persist only models/aggregates.
- Document transformations and access controls.


## 3) Setup & Environment Check

In [None]:
import sys, platform, warnings
warnings.filterwarnings("ignore")
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())

## 4) Configuration (Paths, Target, Random Seed)

In [None]:
DATA_PATH = "/mnt/data/sample_10percent.csv"   # change to full dataset later
TARGET_COL = "cancer"                          # set actual label column
ID_COLUMNS = []
RANDOM_STATE = 42
TEST_SIZE = 0.2
CV_FOLDS = 5
N_JOBS = -1
MAX_PLOT_ROWS = 60000

import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
                             roc_curve, precision_recall_curve, average_precision_score, ConfusionMatrixDisplay, brier_score_loss)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

try:
    from xgboost import XGBClassifier
    XGB_AVAILABLE = True
except Exception:
    XGB_AVAILABLE = False

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

pd.set_option("display.max_columns", 160)
pd.set_option("display.width", 200)

print("Config loaded. TARGET_COL:", TARGET_COL)

## 5) Reproducibility: Version Snapshot

In [None]:
import importlib
pkgs = ["numpy","pandas","matplotlib","scikit-learn","imblearn"] + (["xgboost"] if XGB_AVAILABLE else [])
versions = {}
for p in pkgs:
    try:
        m = importlib.import_module(p if p!="scikit-learn" else "sklearn")
        versions[p] = getattr(m, "__version__", "n/a")
    except Exception:
        versions[p] = "not installed"
versions

## 6) Load Data & Quick Sanity Checks

In [None]:
df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
display(df.head(5)); display(df.tail(3))
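df.info(); display(pd.DataFrame(df.dtypes, columns=["dtype"]).T)
# describe() raises ValueError when no column matches `include`, so guard both calls.
if not df.select_dtypes(include="number").empty:
    display(df.describe(include="number").T.head(25))
if not df.select_dtypes(include="object").empty:
    display(df.describe(include="object").T.head(25))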
assert TARGET_COL in df.columns, f"TARGET_COL '{TARGET_COL}' not found."
print("TARGET:", TARGET_COL)

## 7) Schema Review & Data Dictionary Stub

In [None]:
schema_tbl = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "example": [df[c].dropna().iloc[0] if df[c].notna().any() else None for c in df.columns]
})
display(schema_tbl.head(40))

## 8) Target Definition & Encoding Policy

In [None]:
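if df[TARGET_COL].dtype == 'O':
    mapping = {"yes":1,"no":0,"true":1,"false":0,"1":1,"0":0}
    # Strip whitespace before mapping; values outside the mapping are kept unchanged and reported.
    mapped = df[TARGET_COL].astype(str).str.strip().str.lower().map(mapping)
    unmapped = mapped.isna() & df[TARGET_COL].notna()
    if unmapped.any():
        print("Unmapped target values:", df.loc[unmapped, TARGET_COL].unique()[:10])
    df[TARGET_COL] = mapped.fillna(df[TARGET_COL])
print("Target unique:", df[TARGET_COL].dropna().unique()[:10])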

## 9) Class Balance & Baseline Naive Metrics

In [None]:
vc = df[TARGET_COL].value_counts()
display(vc); display((vc/len(df)*100).round(2))
majority = vc.idxmax(); print("Naive baseline accuracy:", round((df[TARGET_COL]==majority).mean(), 4))
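
The naive accuracy above can look strong on imbalanced data while missing every positive case. The sketch below illustrates this on synthetic labels; the 5% prevalence and the names `y_demo` and `X_demo` are assumptions for illustration, not values from the dataset. It also shows that a constant score gives a PR‑AUC equal to the prevalence, which is the floor to compare PR‑AUC results against.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

# Synthetic, illustrative labels: roughly 5% positives (assumed prevalence, not from the data).
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.05).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for dummy baselines

# Majority-class baseline: high accuracy, zero recall on the positive class.
majority_clf = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred_majority = majority_clf.predict(X_demo)
acc_majority = accuracy_score(y_demo, pred_majority)
rec_majority = recall_score(y_demo, pred_majority, zero_division=0)

# A constant score yields average precision equal to the positive-class prevalence.
prevalence = y_demo.mean()
ap_constant = average_precision_score(y_demo, np.full(len(y_demo), 0.5))

print(f"prevalence={prevalence:.4f} acc={acc_majority:.4f} "
      f"recall={rec_majority:.4f} AP={ap_constant:.4f}")
```

In the notebook itself, `y_demo` would be replaced by `df[TARGET_COL]` once the 0/1 mapping is confirmed.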

## 10) Missing Values (Global & Column-Level)

In [None]:
missing_counts = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing_counts/len(df)*100).round(2)
display(pd.DataFrame({"missing_count": missing_counts, "missing_pct": missing_pct}).head(40))
print("Total NaNs:", int(df.isna().sum().sum()))
print("Any fully empty columns?:", bool((missing_counts==len(df)).any()))
print("Any entirely empty rows?:", bool(df.isna().all(axis=1).any()))

## 11) Numeric Feature Exploration (Distributions & Outliers)

In [None]:
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
plot_df = df if len(df)<=MAX_PLOT_ROWS else df.sample(MAX_PLOT_ROWS, random_state=RANDOM_STATE)
for col in num_cols[:10]:
    plt.figure(); plot_df[col].hist(bins=40)
    plt.title(f"{col} — distribution"); plt.xlabel(col); plt.ylabel("count"); plt.show()
# Share of values more than 4 population standard deviations from the mean, per column.
for col in num_cols[:10]:
    s = df[col]; mu, sd = s.mean(), s.std(ddof=0)
    if sd and sd>0:
        z = (s-mu)/sd
        print(f"{col}: {(np.abs(z)>4).mean()*100:.2f}% > |z|=4")

## 12) Categorical Feature Exploration (Top Levels)

In [None]:
cat_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
for col in cat_cols[:10]:
    plt.figure(); df[col].astype(str).value_counts().head(20).plot(kind='bar')
    plt.title(f"{col} — top 20"); plt.xticks(rotation=45, ha='right'); plt.tight_layout(); plt.show()

## 13) Correlation & Collinearity Review

In [None]:
subset = num_cols[:15]
if len(subset)>=2:
    corr = df[subset].corr()
    plt.figure(figsize=(8,6)); im = plt.imshow(corr, aspect='auto'); plt.colorbar(im)
    plt.title("Correlation heatmap (subset)")
    plt.xticks(range(len(subset)), subset, rotation=90); plt.yticks(range(len(subset)), subset)
    plt.tight_layout(); plt.show()
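
A heatmap shows the overall pattern, but a ranked list of highly correlated pairs is easier to act on for the collinearity review. The sketch below is one possible way to build that list; the demo frame, its column names, and the 0.9 cut‑off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame with one deliberately near-duplicate column (hypothetical names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
demo["a_copy"] = demo["a"] * 2.0 + rng.normal(scale=0.01, size=500)

CORR_THRESHOLD = 0.9  # assumed cut-off; adjust to the project's tolerance

corr_abs = demo.corr().abs()
# Keep only the strict upper triangle so each pair is listed once.
upper_mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
pairs = corr_abs.where(upper_mask).stack().dropna().sort_values(ascending=False)
high_pairs = pairs[pairs >= CORR_THRESHOLD]
print(high_pairs)
```

Applied to the notebook, `demo` would become `df[num_cols]`; one column of each flagged pair is a candidate for removal before fitting the logistic regression.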

## 14) Train/Test Split (Stratified)

In [None]:
X = df.drop(columns=[TARGET_COL]+[c for c in ID_COLUMNS if c in df.columns], errors='ignore')
y = df[TARGET_COL]
print("NaNs in X pre-split:", int(X.isna().sum().sum()), "| NaNs in y:", int(y.isna().sum()))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)
print("Train:", X_train.shape, "Test:", X_test.shape)
display((y_train.value_counts(normalize=True)*100).round(2))

## 15) Preprocessing Plan (Impute → Encode → Scale)

In [None]:
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))])
preprocessor = ColumnTransformer([("num", numeric_pipe, num_cols), ("cat", categorical_pipe, cat_cols)])

## 16) Fit Preprocessor & Post‑Transform Checks

In [None]:
Xt_train = preprocessor.fit_transform(X_train)
Xt_test = preprocessor.transform(X_test)
print("Transformed shapes:", Xt_train.shape, Xt_test.shape)
print("NaNs after preprocess (train)?", bool(np.isnan(Xt_train).any()))
print("NaNs after preprocess (test)?", bool(np.isnan(Xt_test).any()))

## 17) Leakage Guardrails (Quick Scan)

In [None]:
corr_with_target = {}
for c in X_train.select_dtypes(include=[np.number]).columns[:50]:
    try:
        corr_with_target[c] = abs(pd.concat([X_train[c], y_train], axis=1).corr().iloc[0,1])
    except Exception:
        pass
suspicious = [k for k,v in corr_with_target.items() if v>=0.98]
print("Near-perfect correlations with target:", suspicious[:10])
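
As a complement to the Pearson scan, a single‑feature ROC‑AUC is rank‑based and therefore unaffected by monotone transformations of a column. The sketch below is a hedged illustration on synthetic data; the columns `noise` and `leaky` and the 0.98 flag level are assumptions chosen to mirror the threshold used above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_demo = pd.Series(rng.integers(0, 2, size=2000), name="target")
X_demo = pd.DataFrame({
    "noise": rng.normal(size=2000),
    # Hypothetical leaky column: a monotone function of the label plus small noise.
    "leaky": np.exp(3.0 * y_demo.to_numpy() + rng.normal(scale=0.1, size=2000)),
})

AUC_FLAG = 0.98  # assumed level for "suspiciously separable"
single_auc = {}
for col in X_demo.columns:
    auc = roc_auc_score(y_demo, X_demo[col])
    single_auc[col] = max(auc, 1.0 - auc)  # direction-agnostic separability
flagged = [c for c, a in single_auc.items() if a >= AUC_FLAG]
print(single_auc, flagged)
```

Any flagged column deserves a manual check of how and when it is recorded relative to the diagnosis.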

## 18) Optional Procedural Feature Engineering

In [None]:
X_train_fe = X_train.copy(); X_test_fe = X_test.copy()
# right=False makes bins left-closed ([30, 40) -> "30-39"), so the edges match the labels.
age_bins, age_labels = [0,30,40,50,60,70,200], ["<30","30-39","40-49","50-59","60-69","70+"]
# Substring match on numeric columns only; names such as "stage" also match, so review the list.
age_cols = [c for c in X_train_fe.select_dtypes(include=[np.number]).columns if "age" in c.lower()]
for c in age_cols:
    try:
        X_train_fe[c+"_bin"] = pd.cut(X_train_fe[c], age_bins, labels=age_labels, right=False)
        X_test_fe[c+"_bin"] = pd.cut(X_test_fe[c], age_bins, labels=age_labels, right=False)
    except Exception as e:
        print("Age binning skipped:", e)
# Same convention for BMI: [18.5, 25) -> "normal", [25, 30) -> "overweight", [30, 100) -> "obese".
bmi_bins, bmi_labels = [0,18.5,25,30,100], ["underweight","normal","overweight","obese"]
bmi_cols = [c for c in X_train_fe.select_dtypes(include=[np.number]).columns if "bmi" in c.lower()]
for c in bmi_cols:
    try:
        X_train_fe[c+"_class"] = pd.cut(X_train_fe[c], bmi_bins, labels=bmi_labels, right=False)
        X_test_fe[c+"_class"] = pd.cut(X_test_fe[c], bmi_bins, labels=bmi_labels, right=False)
    except Exception as e:
        print("BMI binning skipped:", e)
num_cols_fe = X_train_fe.select_dtypes(include=[np.number]).columns.tolist()
pairs = [(num_cols_fe[i], num_cols_fe[j]) for i in range(min(8,len(num_cols_fe))) for j in range(i+1, min(8,len(num_cols_fe)))]
for a,b in pairs[:10]:
    try:
        X_train_fe[f"ratio_{a}_over_{b}"] = X_train_fe[a]/(X_train_fe[b].abs()+1e-6)
        X_test_fe[f"ratio_{a}_over_{b}"] = X_test_fe[a]/(X_test_fe[b].abs()+1e-6)
    except Exception:
        pass
print("NaNs after FE — train:", int(X_train_fe.isna().sum().sum()), "test:", int(X_test_fe.isna().sum().sum()))
num_cols2 = X_train_fe.select_dtypes(include=[np.number]).columns.tolist()
cat_cols2 = X_train_fe.select_dtypes(exclude=[np.number]).columns.tolist()
preprocessor_fe = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]), num_cols2),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))]), cat_cols2)
])
Xt_train_fe = preprocessor_fe.fit_transform(X_train_fe)
Xt_test_fe = preprocessor_fe.transform(X_test_fe)
print("Shapes after FE+preprocess:", Xt_train_fe.shape, Xt_test_fe.shape)

## 19) Model Suite (Pipelines with SMOTE inside CV)

In [None]:
scorer = {"accuracy":"accuracy","precision":"precision","recall":"recall","f1":"f1","roc_auc":"roc_auc","average_precision":"average_precision"}
cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)
pipe_logreg = ImbPipeline([("prep", preprocessor_fe), ("smote", SMOTE(random_state=RANDOM_STATE)), ("clf", LogisticRegression(max_iter=1000, n_jobs=N_JOBS))])
pipe_rf = ImbPipeline([("prep", preprocessor_fe), ("smote", SMOTE(random_state=RANDOM_STATE)), ("clf", RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=N_JOBS))])
models = {"LogReg": pipe_logreg, "RF": pipe_rf}
if XGB_AVAILABLE:
    pipe_xgb = ImbPipeline([("prep", preprocessor_fe), ("smote", SMOTE(random_state=RANDOM_STATE)),
                            ("clf", XGBClassifier(n_estimators=500, learning_rate=0.05, subsample=0.8, colsample_bytree=0.8, max_depth=6, random_state=RANDOM_STATE, n_jobs=N_JOBS, eval_metric="logloss"))])
    models["XGB"] = pipe_xgb
list(models.keys())

## 20) Cross‑Validation — Run

In [None]:
cv_results = {}; per_fold = {}
for name, pipe in models.items():
    scores = cross_validate(pipe, X_train_fe, y_train, cv=cv, scoring=scorer, n_jobs=N_JOBS)
    cv_results[name] = {k: float(np.mean(v)) for k, v in scores.items()}
    per_fold[name] = scores
    print(name, "CV means:", cv_results[name])
# Rank models by mean cross-validated ROC-AUC.
cv_df = pd.DataFrame(cv_results).T.sort_values("test_roc_auc", ascending=False)
display(cv_df)
best_name = cv_df.index[0]; best_name

## 21) Cross‑Validation — Per Fold AUCs (Top Model)

In [None]:
# Per-fold ROC-AUC for the top-ranked model; np.std below is the population standard deviation.
auc_vec = per_fold[best_name]["test_roc_auc"]
fold_tbl = pd.DataFrame({"fold": np.arange(1, len(auc_vec)+1), "roc_auc": auc_vec})
display(fold_tbl); print("Mean:", float(np.mean(auc_vec)), "Std:", float(np.std(auc_vec)))

## 22) Threshold Analysis (validation folds)

In [None]:
print("Default threshold 0.5 used here. Adjust in deployment to meet recall/precision targets.")
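
One way to carry out the threshold analysis on validation folds is to collect out‑of‑fold probabilities and sweep the precision/recall trade‑off, leaving the hold‑out test set untouched. The sketch below uses synthetic data and a plain logistic regression as stand‑ins; `TARGET_RECALL = 0.80` is a hypothetical requirement, not a project decision.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data stands in for X_train_fe / y_train (illustrative only).
X_demo, y_demo = make_classification(n_samples=3000, n_features=10,
                                     weights=[0.9, 0.1], random_state=42)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Out-of-fold probabilities: each row is scored by a model that never saw it.
oof_proba = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo,
                              cv=cv_demo, method="predict_proba")[:, 1]

precision, recall, thresholds = precision_recall_curve(y_demo, oof_proba)
# precision/recall have one more entry than thresholds; drop the final (recall = 0) point.
thr_tbl = pd.DataFrame({"threshold": thresholds,
                        "precision": precision[:-1],
                        "recall": recall[:-1]})

TARGET_RECALL = 0.80  # hypothetical requirement
# Thresholds are sorted ascending, so the last eligible row is the highest
# threshold that still meets the recall target.
eligible = thr_tbl[thr_tbl["recall"] >= TARGET_RECALL]
chosen = eligible.iloc[-1]
print(chosen)
```

In the notebook, the estimator would be `models[best_name]` and the data `X_train_fe`, `y_train`, with `cv` reused from Section 19; the chosen threshold would then replace the fixed 0.5 in the hold‑out evaluation.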

## 23) Fit Best Model on Train (with FE)

In [None]:
best_pipe = models[best_name]; best_pipe.fit(X_train_fe, y_train)

## 24) Hold‑Out Test Evaluation

In [None]:
proba = best_pipe.predict_proba(X_test_fe)[:,1]; pred = (proba>=0.5).astype(int)
print("Acc:", accuracy_score(y_test, pred))
print("Prec:", precision_score(y_test, pred, zero_division=0))
print("Rec:", recall_score(y_test, pred, zero_division=0))
print("F1:", f1_score(y_test, pred, zero_division=0))
print("ROC‑AUC:", roc_auc_score(y_test, proba))
print("PR‑AUC:", average_precision_score(y_test, proba))
# from_predictions creates its own figure, so no plt.figure() call is needed here.
ConfusionMatrixDisplay.from_predictions(y_test, pred); plt.title("Confusion Matrix"); plt.show()
fpr, tpr, _ = roc_curve(y_test, proba); plt.figure(); plt.plot(fpr,tpr); plt.plot([0,1],[0,1],'--'); plt.title("ROC"); plt.show()
prec, rec, _ = precision_recall_curve(y_test, proba); plt.figure(); plt.plot(rec,prec); plt.title("PR Curve"); plt.show()

## 25) Calibration Check (Reliability Curve)

In [None]:
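bins = np.linspace(0.0, 1.0, 11)
# Clip so that proba == 1.0 lands in the last bin instead of being dropped.
digitized = np.clip(np.digitize(proba, bins) - 1, 0, len(bins) - 2)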
rows = []
for b in range(len(bins)-1):
    mask = digitized==b
    if mask.any():
        rows.append((float(np.mean(proba[mask])), float(np.mean(y_test.iloc[mask]))))
cal_tbl = pd.DataFrame(rows, columns=["avg_pred","frac_positive"])
display(cal_tbl)
plt.figure(); plt.plot(cal_tbl["avg_pred"], cal_tbl["frac_positive"], marker="o"); plt.plot([0,1],[0,1],'--'); plt.title("Calibration"); plt.show()
print("Brier:", brier_score_loss(y_test, proba))

## 26) Feature Importance / Interpretation

In [None]:
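clf = best_pipe.named_steps["clf"]
if hasattr(clf, "feature_importances_"):
    imp = clf.feature_importances_
    # Recover post-encoding feature names from the fitted preprocessor; fall back to indices.
    try:
        feat_names = np.asarray(best_pipe.named_steps["prep"].get_feature_names_out())
    except Exception:
        feat_names = np.array([f"f_{i}" for i in range(len(imp))])
    topk = min(20, len(imp)); idx = np.argsort(imp)[-topk:][::-1]
    plt.figure(figsize=(7,6)); plt.barh(range(topk), imp[idx][::-1])
    plt.yticks(range(topk), feat_names[idx][::-1]); plt.title("Top Importances"); plt.tight_layout(); plt.show()
else:
    print("No native importances for this classifier.")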
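
Native importances are only available for tree models and are computed on the training data. Permutation importance is a model‑agnostic alternative: it measures how much a held‑out score drops when one column is shuffled. The sketch below is a minimal illustration on synthetic data; the column names `feat_0` to `feat_5` and the random forest settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in the first two columns.
X_arr, y_demo = make_classification(n_samples=1500, n_features=6, n_informative=2,
                                    n_redundant=0, n_clusters_per_class=1,
                                    shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean drop in held-out ROC-AUC when each column is shuffled, over 5 repeats.
perm = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                              n_repeats=5, random_state=0)
perm_tbl = (pd.DataFrame({"feature": X_demo.columns,
                          "mean_drop": perm.importances_mean,
                          "std": perm.importances_std})
            .sort_values("mean_drop", ascending=False)
            .reset_index(drop=True))
print(perm_tbl)
```

Passing the whole fitted pipeline (`best_pipe`) with `X_test_fe` and `y_test` would report importances per raw input column rather than per one‑hot level, which is usually easier to interpret.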

## 27) Hyperparameter Tuning (RandomizedSearchCV)

In [None]:
RUN_TUNING = True
if RUN_TUNING:
    import scipy.stats as st
    if best_name=="RF":
        dist = {"clf__n_estimators": st.randint(300,900),
                "clf__max_depth": st.randint(3,20),
                "clf__min_samples_split": st.randint(2,20),
                "clf__min_samples_leaf": st.randint(1,20),
                "clf__max_features": ["sqrt","log2",None]}
    elif best_name=="LogReg":
        dist = {"clf__C": st.loguniform(1e-3,1e2),
                "clf__solver": ["lbfgs","liblinear"],
                "clf__penalty": ["l2"]}
    elif best_name=="XGB" and XGB_AVAILABLE:
        dist = {"clf__n_estimators": st.randint(300,900),
                "clf__max_depth": st.randint(3,12),
                "clf__learning_rate": st.uniform(0.01,0.2),
                "clf__subsample": st.uniform(0.6,0.4),
                "clf__colsample_bytree": st.uniform(0.6,0.4),
                "clf__gamma": st.uniform(0.0,5.0),
                "clf__reg_alpha": st.uniform(0.0,1.0),
                "clf__reg_lambda": st.uniform(0.5,1.5)}
    else:
        dist = None
    if dist:
        tuner = RandomizedSearchCV(models[best_name], param_distributions=dist, n_iter=30,
                                   scoring="roc_auc",
                                   cv=StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE),
                                   random_state=RANDOM_STATE, n_jobs=N_JOBS, verbose=1)
        tuner.fit(X_train_fe, y_train)
        print("Best params:", tuner.best_params_); print("Best CV ROC‑AUC:", tuner.best_score_)
        tuned = tuner.best_estimator_; proba_t = tuned.predict_proba(X_test_fe)[:,1]; pred_t = (proba_t>=0.5).astype(int)
        print("\n— Test (Tuned) —")
        print("Acc:", accuracy_score(y_test, pred_t))
        print("Prec:", precision_score(y_test, pred_t, zero_division=0))
        print("Rec:", recall_score(y_test, pred_t, zero_division=0))
        print("F1:", f1_score(y_test, pred_t, zero_division=0))
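        print("ROC‑AUC:", roc_auc_score(y_test, proba_t), "| baseline:", roc_auc_score(y_test, proba))
        print("PR‑AUC:", average_precision_score(y_test, proba_t), "| baseline:", average_precision_score(y_test, proba))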

## 29) Fairness & Subgroup Slices

In [None]:
slice_cols = [c for c in df.columns if any(k in c.lower() for k in ["age","sex","ethnicity","bmi","region"])][:3]
print("Slice candidates:", slice_cols)
if slice_cols:
    for sc in slice_cols:
        if sc in X_test.columns:
            vals = X_test[sc]
            # Continuous columns (e.g., raw age or BMI) are cut into quartiles;
            # grouping on raw values would leave mostly tiny groups that get skipped.
            if pd.api.types.is_numeric_dtype(vals) and vals.nunique() > 10:
                vals = pd.qcut(vals, 4, duplicates="drop")
            tmp = pd.DataFrame({sc: vals, "y_true": y_test, "y_prob": proba})
            for lvl, sub in tmp.groupby(sc, observed=True):
                if len(sub) < 30:
                    continue
                yhat = (sub["y_prob"]>=0.5).astype(int)
                print(f"{sc}={lvl}: n={len(sub)}, Acc={accuracy_score(sub['y_true'], yhat):.3f}, Rec={recall_score(sub['y_true'], yhat, zero_division=0):.3f}")
else:
    print("No obvious demographic columns detected.")

## 30) Error Analysis: Inspect FP/FN

In [None]:
errs = pd.DataFrame({"y_true": y_test.values, "y_prob": proba, "y_pred": (proba>=0.5).astype(int)}, index=X_test.index)
fp = errs[(errs.y_true==0) & (errs.y_pred==1)].sort_values("y_prob", ascending=False).head(10)
fn = errs[(errs.y_true==1) & (errs.y_pred==0)].sort_values("y_prob", ascending=True).head(10)
print("False Positives (10 highest predicted probabilities):"); display(fp)
print("False Negatives (10 lowest predicted probabilities):"); display(fn)

## 31) Model Persistence (Optional)

In [None]:
SAVE_MODEL = False
if SAVE_MODEL:
    import joblib, os
    os.makedirs("artifacts", exist_ok=True)
    joblib.dump(best_pipe, "artifacts/best_model_pipeline.joblib")
    print("Saved → artifacts/best_model_pipeline.joblib")
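
After saving, it is worth confirming that the reloaded pipeline reproduces the original predictions before relying on the artifact. The sketch below shows a round‑trip check on a small stand‑in pipeline; the file name `demo_pipeline.joblib` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline (the notebook would use best_pipe instead).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
pipe_demo = Pipeline([("scaler", StandardScaler()),
                      ("clf", LogisticRegression(max_iter=1000))]).fit(X_demo, y_demo)

# Save to a temporary location, then load it back.
model_path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.joblib")
joblib.dump(pipe_demo, model_path)
reloaded = joblib.load(model_path)

# The reloaded pipeline should give the same probabilities as the original.
same = np.allclose(pipe_demo.predict_proba(X_demo), reloaded.predict_proba(X_demo))
print("Round-trip predictions match:", same)
```

Joblib artifacts are generally tied to the library versions used to create them, so storing the Section 5 version snapshot alongside the file is a sensible habit.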

## 32) Scaling to Full Dataset
- Increase CV folds and tuning iterations if time permits.
- Consider distributed compute or chunked reading.
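- Monitor memory; the one-hot encoders above use `sparse_output=False`, which yields dense matrices, so consider sparse output for high-cardinality categoricals (the `np.isnan` checks in Section 16 assume dense arrays).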
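
Chunked reading, mentioned above, can be sketched as follows: the file is streamed in fixed‑size pieces and only running aggregates are kept. The synthetic CSV, the column names, and `CHUNK_ROWS` are assumptions so that the example is runnable on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small synthetic CSV so the sketch runs (stand-in for the full dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.integers(20, 90, size=10_000),
                     "cancer": (rng.random(10_000) < 0.05).astype(int)})
tmp_path = os.path.join(tempfile.mkdtemp(), "demo_full.csv")
demo.to_csv(tmp_path, index=False)

CHUNK_ROWS = 2_500  # assumed chunk size; choose based on available memory
n_rows = 0
n_missing = 0
class_counts = {}

# Stream the file and accumulate aggregates without holding every row in memory.
for chunk in pd.read_csv(tmp_path, chunksize=CHUNK_ROWS):
    n_rows += len(chunk)
    n_missing += int(chunk.isna().sum().sum())
    for label, count in chunk["cancer"].value_counts().items():
        class_counts[label] = class_counts.get(label, 0) + int(count)

print("rows:", n_rows, "| missing cells:", n_missing, "| class counts:", class_counts)
```

The same loop works for the sanity checks in Sections 9 and 10; model fitting on the full data still needs either enough memory or an estimator that supports incremental learning.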


## 33) Risks, Limitations, Ethics
- Validate across subgroups; avoid disparate impact.
- Calibrate probabilities for clinical thresholds.
- Document assumptions and data limitations.
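
Probability calibration, noted above, can be sketched with scikit‑learn's `CalibratedClassifierCV`. This is a minimal example on synthetic data, not the notebook's final procedure; the random forest settings and the isotonic method are assumptions. Resampling steps such as SMOTE change the class ratio seen during training, which tends to inflate predicted probabilities for the minority class, so a check of this kind is especially relevant here.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustrative only).
X_demo, y_demo = make_classification(n_samples=4000, n_features=10,
                                     weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          test_size=0.25, random_state=42)

# Uncalibrated reference model.
raw_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Calibrated variant: isotonic regression fitted on internal CV folds of the training set.
calibrated = CalibratedClassifierCV(RandomForestClassifier(n_estimators=50, random_state=42),
                                    method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

raw_proba = raw_model.predict_proba(X_te)[:, 1]
cal_proba = calibrated.predict_proba(X_te)[:, 1]
brier_raw = brier_score_loss(y_te, raw_proba)
brier_cal = brier_score_loss(y_te, cal_proba)
print(f"Brier raw={brier_raw:.4f} calibrated={brier_cal:.4f}")
```

A lower Brier score after calibration suggests the adjustment helps, but the reliability curve from Section 25 should be re‑checked on the calibrated model before fixing any clinical threshold.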


## 34) Next Steps & Checklist
- [ ] Set `DATA_PATH` to full dataset
- [ ] Confirm `TARGET_COL` and its 0/1 mapping
- [ ] Choose operating threshold to meet recall requirements
- [ ] Persist final model and calibration
