
# Early Breast Cancer Prediction — AT‑Style **Simple & Visual** (Procedural)

**Updated:** 2025-08-08 21:28

A long, step‑by‑step, **simple** notebook. No custom `def` functions, minimal logic,
and **visualizations in every section**.



## Table of Contents
1. Project Aim & Metrics  
2. Environment Snapshot (with visual)  
3. Configuration (Paths, Target)  
4. Load Data (first look)  
5. Schema Overview (data dictionary preview)  
6. Target Check & Distribution (visual)  
7. Missing Values Overview (visual)  
8. Numeric Features — Distributions (visual)  
9. Categorical Features — Top Levels (visual)  
10. Correlation Heatmap (visual)  
11. Train/Test Split (visual class balance)  
12. Preprocessing Plan (impute/encode/scale)  
13. Fit Preprocessor & Post‑Transform Checks (visual)  
14. SMOTE Demonstration on Training (visual)  
15. Baseline Models (LogReg, RandomForest) + Cross‑Validation (visual)  
16. Test‑Set Evaluation: Confusion, ROC, PR (visual)  
17. Threshold Sweep (visual)  
18. Calibration Curve & Brier Score (visual)  
19. Feature Importance (Random Forest) (visual)  
20. Notes for Scaling & Next Steps



## 1) Project Aim & Metrics
- Predict breast cancer risk from **non‑invasive** features only.  
- Primary metric: **ROC‑AUC**; also track **Recall**, **F1**, and **PR‑AUC**.
- Run on a 10% development sample, then scale to the full dataset.
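
As a quick illustration of the metrics listed above, the sketch below computes them on a tiny hand-made example. The labels and scores are invented for illustration and are not project data; the 0.5 cutoff used for Recall and F1 is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, recall_score, f1_score

# Hypothetical labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

# Threshold-free ranking metrics work on the scores directly
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

# Recall and F1 need hard predictions; 0.5 is an assumed cutoff
y_pred = (y_score >= 0.5).astype(int)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}  Recall={recall:.3f}  F1={f1:.3f}")
```

ROC‑AUC and PR‑AUC summarize ranking quality across all thresholds, while Recall and F1 depend on the cutoff chosen, which is why the notebook reports both kinds.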


## 2) Environment Snapshot (with visual)

In [None]:

import sys, platform, importlib
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())

packages = ["numpy","pandas","matplotlib","scikit-learn","imblearn"]
versions = []
for p in packages:
    try:
        m = importlib.import_module(p if p!="scikit-learn" else "sklearn")
        v = getattr(m, "__version__", "n/a")
    except Exception:
        v = "not installed"
    versions.append(v)

import matplotlib.pyplot as plt
plt.figure()
plt.bar(packages, [len(s) for s in versions])
plt.title("Environment snapshot — version string lengths (quick visual)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

for p,v in zip(packages, versions):
    print(f"{p}: {v}")


## 3) Configuration (Paths, Target)

In [None]:

DATA_PATH = "/mnt/data/sample_10percent.csv"   # change to full dataset later
TARGET_COL = "cancer"                          # set your actual label column
TEST_SIZE = 0.2
CV_FOLDS = 5
RANDOM_STATE = 42
N_JOBS = -1
MAX_PLOT_ROWS = 60000

import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
                             roc_curve, precision_recall_curve, average_precision_score, ConfusionMatrixDisplay)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

pd.set_option("display.max_columns", 160)
pd.set_option("display.width", 200)

print("Configuration ready. TARGET_COL:", TARGET_COL)


## 4) Load Data (first look)

In [None]:

df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
display(df.head(5))
display(df.tail(3))
_ = df.info()
display(pd.DataFrame(df.dtypes, columns=["dtype"]).T)


## 5) Schema Overview (data dictionary preview)

In [None]:

schema_tbl = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "non_null": [int(df[c].notna().sum()) for c in df.columns]
})
display(schema_tbl.head(30))

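# Sort first so the plot really shows the top 15 columns, not just the first 15
top_cols = schema_tbl.sort_values("non_null", ascending=False).head(15)
plt.figure()
plt.barh(top_cols["column"], top_cols["non_null"])
plt.gca().invert_yaxis()  # largest count at the top
plt.xlabel("Non-null count"); plt.title("Top 15 columns by non-null count")
plt.tight_layout(); plt.show()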


## 6) Target Check & Distribution (visual)

In [None]:

assert TARGET_COL in df.columns, f"TARGET_COL '{TARGET_COL}' not found — set it in Config."
vc = df[TARGET_COL].value_counts().sort_index()
display(vc)

plt.figure()
vc.plot(kind="bar")
plt.title("Target distribution")
plt.xlabel(TARGET_COL); plt.ylabel("Count")
plt.show()

print("Target distribution (%):")
display((vc/len(df)*100).round(2))


## 7) Missing Values Overview (visual)

In [None]:

missing = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing/len(df)*100).round(2)
miss_tbl = pd.DataFrame({"missing_count": missing, "missing_pct": missing_pct})
display(miss_tbl.head(30))

plt.figure()
miss_tbl.head(20)["missing_pct"].plot(kind="bar")
plt.title("Top 20 columns by missing %"); plt.ylabel("% missing"); plt.xticks(rotation=45, ha="right")
plt.tight_layout(); plt.show()

print("Total NaNs in dataset:", int(df.isna().sum().sum()))


## 8) Numeric Features — Distributions (visual)

In [None]:

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
plot_df = df if len(df)<=MAX_PLOT_ROWS else df.sample(MAX_PLOT_ROWS, random_state=RANDOM_STATE)

for col in num_cols[:8]:
    plt.figure()
    plot_df[col].hist(bins=40)
    plt.title(f"Distribution: {col}"); plt.xlabel(col); plt.ylabel("Count")
    plt.show()


## 9) Categorical Features — Top Levels (visual)

In [None]:

cat_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
for col in cat_cols[:8]:
    plt.figure()
    df[col].astype(str).value_counts().head(15).plot(kind="bar")
    plt.title(f"Top levels: {col}"); plt.xlabel(col); plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right"); plt.tight_layout()
    plt.show()


## 10) Correlation Heatmap (visual)

In [None]:

subset = num_cols[:15]
if len(subset) >= 2:
    corr = df[subset].corr()
    plt.figure(figsize=(8,6))
    im = plt.imshow(corr, aspect="auto")
    plt.colorbar(im)
    plt.title("Correlation heatmap (subset)")
    plt.xticks(range(len(subset)), subset, rotation=90)
    plt.yticks(range(len(subset)), subset)
    plt.tight_layout(); plt.show()
else:
    print("Not enough numeric columns for a heatmap.")


## 11) Train/Test Split (visual class balance)

In [None]:

ID_COLUMNS = []   # list identifier columns here so they are dropped from the features
X = df.drop(columns=[TARGET_COL]+[c for c in ID_COLUMNS if c in df.columns], errors="ignore")
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)
print("Train:", X_train.shape, "| Test:", X_test.shape)

bal = pd.DataFrame({
    "set": ["train","test"],
    "positive_pct": [float((y_train==1).mean()*100), float((y_test==1).mean()*100)]
})

plt.figure()
plt.bar(bal["set"], bal["positive_pct"])
plt.title("Positive class % — train vs test")
plt.ylabel("% positive"); plt.ylim(0,100)
plt.show()


## 12) Preprocessing Plan (impute/encode/scale)

In [None]:

num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))])

preprocessor = ColumnTransformer([("num", numeric_pipe, num_cols),
                                  ("cat", categorical_pipe, cat_cols)])

print("Numerical features:", len(num_cols), "| Categorical features:", len(cat_cols))
plt.figure()
plt.bar(["numeric","categorical"], [len(num_cols), len(cat_cols)])
plt.title("Feature types count")
plt.show()


## 13) Fit Preprocessor & Post‑Transform Checks (visual)

In [None]:

Xt_train = preprocessor.fit_transform(X_train)
Xt_test = preprocessor.transform(X_test)

print("Transformed shapes:", Xt_train.shape, Xt_test.shape)
print("Any NaNs after preprocess (train)?", bool(np.isnan(Xt_train).any()))
print("Any NaNs after preprocess (test)?", bool(np.isnan(Xt_test).any()))

sizes = [Xt_train.shape[1], Xt_test.shape[1]]
plt.figure()
plt.bar(["train features","test features"], sizes)
plt.title("Transformed feature counts")
plt.show()


## 14) SMOTE Demonstration on Training (visual)

In [None]:

sm = SMOTE(random_state=RANDOM_STATE)
Xt_train_demo, y_train_demo = sm.fit_resample(Xt_train, y_train)

before_pct = float((y_train==1).mean()*100)
after_pct = float((y_train_demo==1).mean()*100)

plt.figure()
plt.bar(["before SMOTE","after SMOTE"], [before_pct, after_pct])
plt.ylabel("% positive"); plt.title("Training class balance — SMOTE demonstration")
plt.ylim(0,100); plt.show()

print("Before SMOTE positive %:", round(before_pct,2), "| After SMOTE positive %:", round(after_pct,2))


## 15) Baseline Models (LogReg, RandomForest) + Cross‑Validation (visual)

In [None]:

scorer = {"accuracy":"accuracy","precision":"precision","recall":"recall","f1":"f1","roc_auc":"roc_auc","average_precision":"average_precision"}
cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)

pipe_logreg = ImbPipeline([("prep", preprocessor),
                           ("smote", SMOTE(random_state=RANDOM_STATE)),
                           ("clf", LogisticRegression(max_iter=1000, n_jobs=N_JOBS))])

pipe_rf = ImbPipeline([("prep", preprocessor),
                       ("smote", SMOTE(random_state=RANDOM_STATE)),
                       ("clf", RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=N_JOBS))])

models = {"LogReg": pipe_logreg, "RF": pipe_rf}

cv_means = {}
for name, pipe in models.items():
    scores = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scorer, n_jobs=N_JOBS)
    cv_means[name] = {m: float(np.mean(scores[m])) for m in scores if m.startswith("test_")}

cv_df = pd.DataFrame(cv_means).T
display(cv_df)

plt.figure()
plt.bar(cv_df.index, cv_df["test_roc_auc"])
plt.title("CV ROC‑AUC by model")
plt.ylabel("ROC‑AUC")
plt.ylim(0,1)
plt.show()

best_name = cv_df["test_roc_auc"].idxmax()
print("Best model by CV ROC‑AUC:", best_name)


## 16) Test‑Set Evaluation: Confusion, ROC, PR (visual)

In [None]:

best_pipe = models[best_name]
best_pipe.fit(X_train, y_train)

proba = best_pipe.predict_proba(X_test)[:,1]
pred = (proba >= 0.5).astype(int)

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, zero_division=0))
print("Recall:", recall_score(y_test, pred, zero_division=0))
print("F1:", f1_score(y_test, pred, zero_division=0))
print("ROC‑AUC:", roc_auc_score(y_test, proba))
print("PR‑AUC:", average_precision_score(y_test, proba))

# from_predictions creates its own figure, so a preceding plt.figure() would only show a blank one
ConfusionMatrixDisplay.from_predictions(y_test, pred)
plt.title(f"Confusion Matrix — {best_name}")
plt.show()

fpr, tpr, _ = roc_curve(y_test, proba)
plt.figure()
plt.plot(fpr, tpr); plt.plot([0,1],[0,1],'--')
plt.title("ROC Curve"); plt.xlabel("FPR"); plt.ylabel("TPR")
plt.show()

prec, rec, _ = precision_recall_curve(y_test, proba)
plt.figure()
plt.plot(rec, prec)
plt.title("Precision‑Recall Curve"); plt.xlabel("Recall"); plt.ylabel("Precision")
plt.show()


## 17) Threshold Sweep (visual)

In [None]:

thresholds = np.linspace(0.1, 0.9, 9)
accs, recs, precs, f1s = [], [], [], []
for t in thresholds:
    yhat = (proba >= t).astype(int)
    accs.append(accuracy_score(y_test, yhat))
    recs.append(recall_score(y_test, yhat, zero_division=0))
    precs.append(precision_score(y_test, yhat, zero_division=0))
    f1s.append(f1_score(y_test, yhat, zero_division=0))

plt.figure()
plt.plot(thresholds, accs, label="Accuracy")
plt.plot(thresholds, recs, label="Recall")
plt.plot(thresholds, precs, label="Precision")
plt.plot(thresholds, f1s, label="F1")
plt.title("Metric vs Threshold")
plt.xlabel("Threshold"); plt.ylabel("Score"); plt.legend()
plt.show()


## 18) Calibration Curve & Brier Score (visual)

In [None]:

bins = np.linspace(0.0, 1.0, 11)
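# Clip so that predictions of exactly 1.0 land in the last bin instead of being dropped
digitized = np.clip(np.digitize(proba, bins) - 1, 0, len(bins) - 2)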
xs, ys = [], []
for i in range(len(bins)-1):
    mask = digitized == i
    if mask.any():
        xs.append(float(np.mean(proba[mask])))
        ys.append(float(np.mean(y_test.iloc[mask])))

plt.figure()
plt.plot(xs, ys, marker="o")
plt.plot([0,1],[0,1],'--')
plt.title("Calibration curve"); plt.xlabel("Avg predicted"); plt.ylabel("Observed positive rate")
plt.show()

from sklearn.metrics import brier_score_loss
print("Brier score:", brier_score_loss(y_test, proba))
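
The manual binning above can be cross-checked against scikit-learn's `calibration_curve`. The sketch below runs on synthetic scores so that it is self-contained; the names `rng`, `p`, and `y` are illustrative and not part of the notebook, where `y_test` and `proba` would be passed instead.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Synthetic, well-calibrated scores: each label is drawn with probability p (illustration only)
rng = np.random.default_rng(42)
p = rng.uniform(0.0, 1.0, size=5000)
y = (rng.uniform(0.0, 1.0, size=5000) < p).astype(int)

# 10 equal-width bins on [0, 1]; empty bins are dropped from the output
frac_pos, mean_pred = calibration_curve(y, p, n_bins=10, strategy="uniform")

# Brier score: mean squared difference between predicted probability and outcome
brier = brier_score_loss(y, p)

for mp, fp in zip(mean_pred, frac_pos):
    print(f"mean predicted={mp:.2f}  observed positive rate={fp:.2f}")
print("Brier score:", round(brier, 4))
```

If the two approaches disagree noticeably on the same inputs, the bin edges or the handling of boundary values are the first things to check.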


## 19) Feature Importance (Random Forest) (visual)

In [None]:

rf_pipe = ImbPipeline([("prep", preprocessor),
                       ("smote", SMOTE(random_state=RANDOM_STATE)),
                       ("clf", RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=N_JOBS))])
rf_pipe.fit(X_train, y_train)
rf = rf_pipe.named_steps["clf"]

if hasattr(rf, "feature_importances_"):
    importances = rf.feature_importances_
    topk = min(20, len(importances))
    idx = np.argsort(importances)[-topk:][::-1]
    plt.figure(figsize=(8,6))
    plt.barh(range(topk), importances[idx][::-1])
    plt.yticks(range(topk), [f"feature_{i}" for i in idx][::-1])
    plt.title("Random Forest — Top feature importances (indices)")
    plt.tight_layout(); plt.show()
else:
    print("RandomForestClassifier importances not available.")


## 20) Notes for Scaling & Next Steps


- Switch `DATA_PATH` to the **full dataset** and re‑run end‑to‑end.  
- Keep the model set small for clarity (LogReg + RF).  
- If runtime allows, add tuning later; save the final pipeline with `joblib`.  
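- Decide on a clinical operating threshold using the threshold sweep, ideally computed on a validation split or out-of-fold predictions so that the test set stays an unbiased final check.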
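
As one possible way to turn a threshold sweep into a decision, the sketch below picks the highest threshold that still meets a minimum recall. The 0.90 recall floor (`MIN_RECALL`) and the synthetic validation scores are assumptions for illustration only; the real target should come from clinical requirements.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic validation labels and scores (illustration only)
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=2000)
# Positives tend to score higher than negatives
scores = np.clip(rng.normal(loc=0.35 + 0.3 * y_val, scale=0.15), 0.0, 1.0)

MIN_RECALL = 0.90  # assumed requirement, not taken from the notebook

precision, recall, thresholds = precision_recall_curve(y_val, scores)
# precision/recall have one more entry than thresholds; drop the final (recall=0) point
precision, recall = precision[:-1], recall[:-1]

# Recall does not increase as the threshold rises, so take the largest qualifying threshold
ok = recall >= MIN_RECALL
chosen = thresholds[ok].max()
chosen_precision = precision[ok][np.argmax(thresholds[ok])]

# Verify the recall actually achieved at the chosen threshold
y_hat = (scores >= chosen).astype(int)
achieved_recall = (y_hat[y_val == 1] == 1).mean()
print(f"threshold={chosen:.3f}  recall={achieved_recall:.3f}  precision={chosen_precision:.3f}")
```

Fixing recall first and then reading off the resulting precision reflects a screening setting where missed cases are the costlier error; a different cost trade-off would call for a different selection rule.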
