
# Early Breast Cancer Prediction — AT‑Style **Simple & Visual PLUS** (Procedural)

**Updated:** 2025-08-08 21:37

Long, simple, and highly visual. No custom `def` functions, minimal branching.  
**Includes a consolidated results table + charts for each model.**



## Table of Contents
1. Aim & Metrics (visuals)  
2. Environment Snapshot (visuals)  
3. Configuration (Paths, Target)  
4. Load Data (first look)  
5. Schema Overview (visuals)  
6. Target Check & Distribution (visuals)  
7. Missing Values Overview (visuals)  
8. Numeric Features — Distributions & Pairplots (visuals)  
9. Categorical Features — Top Levels & Proportions (visuals)  
10. Correlation Heatmap & Scatter Matrix (visuals)  
11. Train/Test Split & Class Balance (visuals)  
12. Preprocessing Plan (impute/encode/scale) + Feature Type Counts (visuals)  
13. Fit Preprocessor & Post‑Transform Checks (visuals)  
14. SMOTE Demonstration on Training (visuals)  
15. Baseline Models (LogReg, RandomForest) + Cross‑Validation (table + visuals)  
16. Test‑Set Evaluation: Confusion, ROC, PR (visuals)  
17. Threshold Sweep (visual)  
18. Calibration Curve & Brier Score (visual)  
19. Feature Importance (Random Forest) (visual)  
20. Consolidated Results DataFrame + Metric Charts (table + visuals)  
21. Notes for Scaling & Next Steps



## 1) Aim & Metrics (visuals)
Predict breast cancer risk from **non‑invasive** features.  
Primary metric: **ROC‑AUC**. Secondary metrics: **Recall**, **F1**, **PR‑AUC**.  
The chart below is a placeholder that gives the four metrics equal weight; it shows no measured results.


In [None]:

import numpy as np, matplotlib.pyplot as plt
metrics = ["ROC-AUC","Recall","F1","PR-AUC"]
plt.figure()
plt.bar(metrics, [1,1,1,1])
plt.ylim(0,1.1); plt.title("Key metrics focus"); plt.ylabel("Relative emphasis")
plt.show()
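
For reference, the four metrics can be computed with scikit-learn as in the sketch below. The labels and probabilities are made-up toy values, not taken from the dataset, and the 0.5 cut-off is only an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             recall_score, roc_auc_score)

# Hypothetical toy labels and predicted probabilities (not from the dataset)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Threshold-free metrics work on the probabilities directly
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

# Threshold-based metrics need hard predictions (0.5 cut-off assumed here)
y_pred = (y_score >= 0.5).astype(int)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}  Recall={recall:.3f}  F1={f1:.3f}")
```

ROC‑AUC and PR‑AUC summarise ranking quality over all thresholds, while Recall and F1 change with the chosen cut-off.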


## 2) Environment Snapshot (visuals)

In [None]:

import sys, platform, importlib
print("Python:", sys.version.split()[0]); print("Platform:", platform.platform())
packages = ["numpy","pandas","matplotlib","scikit-learn","imblearn"]
vers = []
for p in packages:
    try:
        m = importlib.import_module(p if p!="scikit-learn" else "sklearn")
        vers.append(getattr(m,"__version__","n/a"))
    except Exception:
        vers.append("not installed")

import matplotlib.pyplot as plt
plt.figure()
plt.bar(packages, [len(v) for v in vers])
plt.title("Env snapshot — version string lengths"); plt.xticks(rotation=45, ha="right"); plt.tight_layout(); plt.show()

for p,v in zip(packages, vers): print(f"{p}: {v}")


## 3) Configuration (Paths, Target)

In [None]:

DATA_PATH = "/mnt/data/sample_10percent.csv"   # change to full dataset later
TARGET_COL = "cancer"                          # set your actual label column
TEST_SIZE = 0.2
CV_FOLDS = 5
RANDOM_STATE = 42
N_JOBS = -1
MAX_PLOT_ROWS = 60000

import pandas as pd, numpy as np, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
                             roc_curve, precision_recall_curve, average_precision_score, ConfusionMatrixDisplay, brier_score_loss)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

pd.set_option("display.max_columns", 160)
pd.set_option("display.width", 200)

print("Configuration ready. TARGET_COL:", TARGET_COL)


## 4) Load Data (first look)

In [None]:

df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
display(df.head(5)); display(df.tail(3))
_ = df.info(); display(pd.DataFrame(df.dtypes, columns=["dtype"]).T)


## 5) Schema Overview (visuals)

In [None]:

schema_tbl = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "non_null": [int(df[c].notna().sum()) for c in df.columns],
    "nulls": [int(df[c].isna().sum()) for c in df.columns]
})
display(schema_tbl.head(40))

# Sort first so the charts really show the top 15 (not just the first 15 columns)
top_non_null = schema_tbl.sort_values("non_null", ascending=False).head(15)
top_nulls = schema_tbl.sort_values("nulls", ascending=False).head(15)
plt.figure(); plt.barh(top_non_null["column"], top_non_null["non_null"]); plt.title("Top 15 by non-null"); plt.tight_layout(); plt.show()


## 6) Target Check & Distribution (visuals)

In [None]:

assert TARGET_COL in df.columns, f"TARGET_COL '{TARGET_COL}' not found — set it in Config."
vc = df[TARGET_COL].value_counts().sort_index()
display(vc)
plt.figure(); vc.plot(kind="bar"); plt.title("Target distribution"); plt.xlabel(TARGET_COL); plt.ylabel("Count"); plt.show()
display((vc/len(df)*100).round(2))


## 7) Missing Values Overview (visuals)

In [None]:

miss = df.isna().sum().sort_values(ascending=False)
miss_pct = (miss/len(df)*100).round(2)
miss_tbl = pd.DataFrame({"missing_count": miss, "missing_pct": miss_pct})
display(miss_tbl.head(40))

plt.figure(); miss_tbl.head(20)["missing_pct"].plot(kind="bar"); plt.title("Top 20 missing %"); plt.ylabel("%"); plt.xticks(rotation=45, ha="right"); plt.tight_layout(); plt.show()

# Simple binary missingness heatmap for the first 100 rows and 30 columns (matplotlib only)
r = min(100, len(df)); c = min(30, df.shape[1])
plt.figure(figsize=(8,4))
plt.imshow(df.iloc[:r, :c].isna(), aspect="auto")
plt.title("Missingness heatmap (first 100 rows, 30 cols)"); plt.xlabel("Columns"); plt.ylabel("Rows")
plt.show()

print("Total NaNs:", int(df.isna().sum().sum()))


## 8) Numeric Features — Distributions & Pairplots (visuals)

In [None]:

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
plot_df = df if len(df)<=MAX_PLOT_ROWS else df.sample(MAX_PLOT_ROWS, random_state=RANDOM_STATE)

for col in num_cols[:10]:
    plt.figure(); plot_df[col].hist(bins=40); plt.title(f"Distribution: {col}"); plt.xlabel(col); plt.ylabel("Count"); plt.show()

# Pairwise scatter for first 3 numeric columns (quick view)
if len(num_cols) >= 3 and TARGET_COL in df.columns:
    first3 = num_cols[:3]
    for i in range(3):
        for j in range(i+1, 3):
            plt.figure()
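            # Colour points by class when the target is binary 0/1 (taken from plot_df so lengths match)
            colors = plot_df[TARGET_COL].map({0:0.2,1:0.8}) if plot_df[TARGET_COL].dropna().isin([0,1]).all() else None
            plt.scatter(plot_df[first3[i]], plot_df[first3[j]], c=colors, s=5, alpha=0.5)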
            plt.xlabel(first3[i]); plt.ylabel(first3[j]); plt.title(f"Scatter: {first3[i]} vs {first3[j]}")
            plt.show()


## 9) Categorical Features — Top Levels & Proportions (visuals)

In [None]:

cat_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
for col in cat_cols[:8]:
    counts = df[col].astype(str).value_counts().head(12)
    plt.figure(); counts.plot(kind="bar"); plt.title(f"Top 12 levels: {col}"); plt.xticks(rotation=45, ha="right"); plt.tight_layout(); plt.show()
    plt.figure(); (counts/counts.sum()).plot(kind="bar"); plt.title(f"Top 12 proportions: {col}"); plt.xticks(rotation=45, ha="right"); plt.tight_layout(); plt.show()


## 10) Correlation Heatmap & Scatter Matrix (visuals)

In [None]:

subset = num_cols[:12]
if len(subset)>=2:
    corr = df[subset].corr()
    plt.figure(figsize=(8,6)); im = plt.imshow(corr, aspect="auto"); plt.colorbar(im); plt.title("Correlation heatmap (subset)")
    plt.xticks(range(len(subset)), subset, rotation=90); plt.yticks(range(len(subset)), subset); plt.tight_layout(); plt.show()

    # Simple scatter matrix (first 4 numeric columns)
    sm_cols = num_cols[:4]
    for i in range(len(sm_cols)):
        for j in range(i+1, len(sm_cols)):
            plt.figure(); plt.scatter(plot_df[sm_cols[i]], plot_df[sm_cols[j]], s=4, alpha=0.4)  # plot_df respects MAX_PLOT_ROWS
            plt.xlabel(sm_cols[i]); plt.ylabel(sm_cols[j]); plt.title(f"Scatter: {sm_cols[i]} vs {sm_cols[j]}"); plt.show()


## 11) Train/Test Split & Class Balance (visuals)

In [None]:

ID_COLUMNS = []
X = df.drop(columns=[TARGET_COL]+[c for c in ID_COLUMNS if c in df.columns], errors="ignore")
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)
print("Train:", X_train.shape, "| Test:", X_test.shape)

bal = pd.DataFrame({"set":["train","test"],
                    "positive_pct":[float((y_train==1).mean()*100), float((y_test==1).mean()*100)]})
plt.figure(); plt.bar(bal["set"], bal["positive_pct"]); plt.title("Positive % — train vs test"); plt.ylabel("%"); plt.ylim(0,100); plt.show()
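
The effect of `stratify=y` can be checked on a small example. The sketch below uses hypothetical labels with 10% positives (not the notebook's data) and shows that both splits keep the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives out of 100 rows
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification both rates match the overall 10%
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, a small test set drawn from a rare-positive dataset can end up with very few positives, which makes Recall and PR‑AUC unstable.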


## 12) Preprocessing Plan + Feature Type Counts (visuals)

In [None]:

num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))])

preprocessor = ColumnTransformer([("num", numeric_pipe, num_cols), ("cat", categorical_pipe, cat_cols)])

plt.figure(); plt.bar(["numeric","categorical"], [len(num_cols), len(cat_cols)]); plt.title("Feature type counts"); plt.show()


## 13) Fit Preprocessor & Post‑Transform Checks (visuals)

In [None]:

Xt_train = preprocessor.fit_transform(X_train)
Xt_test = preprocessor.transform(X_test)
print("Transformed shapes:", Xt_train.shape, Xt_test.shape)
print("NaNs after preprocess (train)?", bool(np.isnan(Xt_train).any()))
print("NaNs after preprocess (test)?", bool(np.isnan(Xt_test).any()))
plt.figure(); plt.bar(["train features","test features"], [Xt_train.shape[1], Xt_test.shape[1]]); plt.title("Transformed feature counts"); plt.show()


## 14) SMOTE Demonstration on Training (visuals)

In [None]:

sm = SMOTE(random_state=RANDOM_STATE)
Xt_train_sm, y_train_sm = sm.fit_resample(Xt_train, y_train)

before_pct = float((y_train==1).mean()*100); after_pct = float((y_train_sm==1).mean()*100)
plt.figure(); plt.bar(["before SMOTE","after SMOTE"], [before_pct, after_pct]); plt.ylabel("% positive"); plt.title("Class balance (train)"); plt.ylim(0,100); plt.show()
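
To make the resampling step less of a black box, here is a minimal NumPy sketch of the interpolation idea behind SMOTE. It is an illustration under simplifying assumptions (toy points, `k = 2`), not the imblearn implementation, whose default is `k_neighbors=5`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points with two features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
k = 2            # neighbours considered per point (illustrative choice)
n_new = 6        # number of synthetic rows to create

synthetic = []
for _ in range(n_new):
    i = rng.integers(len(minority))                   # pick a minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]           # skip the sample itself
    j = rng.choice(neighbours)                        # pick one neighbour
    gap = rng.random()                                # position along the segment
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)

print(synthetic.round(2))
```

Each synthetic row lies on the line segment between two real minority rows, which is why SMOTE is applied after imputation and scaling and only to training data.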


## 15) Baseline Models + Cross‑Validation (table + visuals)

In [None]:

scorer = {"accuracy":"accuracy","precision":"precision","recall":"recall","f1":"f1","roc_auc":"roc_auc","average_precision":"average_precision"}
cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)

pipe_logreg = ImbPipeline([("prep", preprocessor), ("smote", SMOTE(random_state=RANDOM_STATE)), ("clf", LogisticRegression(max_iter=1000, n_jobs=N_JOBS))])
pipe_rf = ImbPipeline([("prep", preprocessor), ("smote", SMOTE(random_state=RANDOM_STATE)), ("clf", RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=N_JOBS))])

models = {"LogisticRegression": pipe_logreg, "RandomForest": pipe_rf}

cv_means = {}; cv_folds = {}
for name, pipe in models.items():
    scores = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scorer, n_jobs=N_JOBS, return_estimator=False)
    cv_means[name] = {m.replace("test_",""): float(np.mean(scores[m])) for m in scores if m.startswith("test_")}
    cv_folds[name] = {m.replace("test_",""): scores[m] for m in scores if m.startswith("test_")}
cv_df = pd.DataFrame(cv_means).T
display(cv_df)

# Per-metric bar chart (CV means)
for metric in ["roc_auc","recall","precision","f1","average_precision","accuracy"]:
    if metric in cv_df.columns:
        plt.figure(); plt.bar(cv_df.index, cv_df[metric]); plt.title(f"CV mean — {metric}"); plt.ylim(0,1); plt.show()

# ROC-AUC boxplot across folds
if "roc_auc" in cv_folds["LogisticRegression"]:
    data = [cv_folds[m]["roc_auc"] for m in models.keys()]
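    # Set tick labels via xticks: the boxplot `labels` argument is deprecated in Matplotlib >= 3.9
    plt.figure(); plt.boxplot(data); plt.xticks(range(1, len(models)+1), list(models.keys())); plt.title("CV ROC-AUC per fold"); plt.ylim(0,1); plt.show()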

best_name = cv_df["roc_auc"].idxmax()
print("Best model by CV ROC‑AUC:", best_name)


## 16) Test‑Set Evaluation: Confusion, ROC, PR (visuals)

In [None]:

best_pipe = models[best_name]
best_pipe.fit(X_train, y_train)
proba = best_pipe.predict_proba(X_test)[:,1]
pred = (proba>=0.5).astype(int)

test_results = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred, zero_division=0),
    "f1": f1_score(y_test, pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, proba),
    "average_precision": average_precision_score(y_test, proba),
}
print("Test results:", test_results)

# from_predictions creates its own figure, so a preceding plt.figure() would leave an empty plot
ConfusionMatrixDisplay.from_predictions(y_test, pred); plt.title(f"Confusion Matrix: {best_name}"); plt.show()
fpr,tpr,_ = roc_curve(y_test, proba); plt.figure(); plt.plot(fpr,tpr); plt.plot([0,1],[0,1],'--'); plt.title("ROC Curve"); plt.show()
prec,rec,_ = precision_recall_curve(y_test, proba); plt.figure(); plt.plot(rec,prec); plt.title("Precision‑Recall Curve"); plt.show()


## 17) Threshold Sweep (visual)

In [None]:

thresholds = np.linspace(0.1,0.9,9)
accs=[]; recs=[]; precs=[]; f1s=[]
for t in thresholds:
    yhat = (proba>=t).astype(int)
    accs.append(accuracy_score(y_test, yhat))
    recs.append(recall_score(y_test, yhat, zero_division=0))
    precs.append(precision_score(y_test, yhat, zero_division=0))
    f1s.append(f1_score(y_test, yhat, zero_division=0))

plt.figure(); plt.plot(thresholds,recs,label="Recall"); plt.plot(thresholds,precs,label="Precision"); plt.plot(thresholds,f1s,label="F1"); plt.plot(thresholds,accs,label="Accuracy")
plt.title("Metric vs Threshold"); plt.xlabel("Threshold"); plt.ylabel("Score"); plt.legend(); plt.show()
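
Instead of reading a threshold off the plot, it can be picked programmatically. The sketch below assumes a hypothetical recall target of 0.75 and toy scores; it selects the highest threshold that still meets the target using `precision_recall_curve`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy labels and scores (not from the dataset)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.9])
TARGET_RECALL = 0.75  # assumed target, set from clinical requirements

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision/recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= TARGET_RECALL
chosen = thresholds[meets_target].max()   # highest threshold that keeps recall >= target

y_hat = (scores >= chosen).astype(int)
achieved_recall = (y_hat[y_true == 1]).mean()
print("chosen threshold:", chosen, "| recall:", achieved_recall)
```

Choosing the threshold on the test set gives an optimistic estimate, so in practice it is safer to pick it on a validation split or on out-of-fold predictions.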


## 18) Calibration Curve & Brier Score (visual)

In [None]:

bins = np.linspace(0.0, 1.0, 11)
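# Clip so that proba == 1.0 falls into the last bin instead of being dropped
digitized = np.clip(np.digitize(proba, bins)-1, 0, len(bins)-2)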
xs=[]; ys=[]
for i in range(len(bins)-1):
    mask = digitized==i
    if mask.any():
        xs.append(float(np.mean(proba[mask])))
        ys.append(float(np.mean(y_test.iloc[mask])))
plt.figure(); plt.plot(xs,ys,marker="o"); plt.plot([0,1],[0,1],'--'); plt.title("Calibration curve"); plt.xlabel("Avg predicted"); plt.ylabel("Observed pos rate"); plt.show()
print("Brier score:", brier_score_loss(y_test, proba))
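
As a cross-check for the manual binning, scikit-learn provides `calibration_curve`. The sketch below uses toy probabilities (an assumption, not model output) and also confirms that the Brier score is the mean squared difference between probability and outcome.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Hypothetical toy labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

# Two equal-width bins: [0, 0.5) and [0.5, 1]
frac_pos, mean_pred = calibration_curve(y_true, prob, n_bins=2, strategy="uniform")

brier = brier_score_loss(y_true, prob)
brier_manual = np.mean((prob - y_true) ** 2)   # same quantity computed by hand

print("observed positive rate per bin:", frac_pos)
print("mean predicted prob per bin:   ", mean_pred)
print("Brier score:", brier)
```

Points on the diagonal (observed rate equal to mean prediction) indicate good calibration; a lower Brier score is better.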


## 19) Feature Importance (Random Forest) (visual)

In [None]:

rf_pipe = ImbPipeline([("prep", preprocessor), ("smote", SMOTE(random_state=RANDOM_STATE)), ("clf", RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=N_JOBS))])
rf_pipe.fit(X_train, y_train); rf = rf_pipe.named_steps["clf"]
if hasattr(rf,"feature_importances_"):
    imp = rf.feature_importances_; topk=min(20,len(imp)); idx=np.argsort(imp)[-topk:][::-1]
    plt.figure(figsize=(8,6)); plt.barh(range(topk), imp[idx][::-1]); plt.yticks(range(topk), [f"feature_{i}" for i in idx][::-1]); plt.title("RF top importances (indices)"); plt.tight_layout(); plt.show()
else:
    print("No importances available.")


## 20) Consolidated Results DataFrame + Metric Charts

In [None]:

# Build a single results DataFrame combining CV means and Test metrics for each model
results_rows = []
for name, pipe in models.items():
    # Reuse the CV means computed in Section 15
    row = {"model": name}
    for m in ["accuracy","precision","recall","f1","roc_auc","average_precision"]:
        row[f"cv_{m}"] = float(cv_means[name][m])
    results_rows.append(row)

# Add test metrics for the chosen best model as ONE row (reusing the cv_ column names so the charts line up)
test_row = {"model": f"{best_name}_TEST"}
for k,v in test_results.items():
    test_row[f"cv_{k}"] = float(v)
results_rows.append(test_row)

results_df = pd.DataFrame(results_rows).set_index("model")
display(results_df)

# Visualize each metric across models
for m in ["cv_accuracy","cv_precision","cv_recall","cv_f1","cv_roc_auc","cv_average_precision"]:
    if m in results_df.columns:
        plt.figure(); results_df[m].plot(kind="bar")
        plt.title(f"Results comparison — {m}"); plt.ylim(0,1); plt.xticks(rotation=45, ha="right"); plt.tight_layout(); plt.show()


## 21) Notes for Scaling & Next Steps


- Swap `DATA_PATH` to the full dataset; keep this structure.  
- Choose the decision threshold from the sweep in Section 17 so that recall meets the clinical target.  
- (Optional) Add tuning later and persist the final pipeline with `joblib`.
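
Persisting a fitted pipeline with `joblib` could look like the sketch below. The toy data, the small pipeline, and the file name `final_pipeline.joblib` are illustrative assumptions; in the notebook the object to save would be the fitted `best_pipe`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data and pipeline standing in for the final model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))]).fit(X_toy, y_toy)

# Save to a temporary location (replace with a project path as needed)
path = os.path.join(tempfile.mkdtemp(), "final_pipeline.joblib")
joblib.dump(pipe, path)

# Reload and confirm the restored pipeline gives identical probabilities
restored = joblib.load(path)
same = np.allclose(pipe.predict_proba(X_toy), restored.predict_proba(X_toy))
print("restored pipeline matches:", same)
```

Saving the whole pipeline keeps preprocessing and model together, so new data can be passed in raw form. The file should be loaded with the same library versions that produced it.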
