<a href="https://colab.research.google.com/github/nkeseeyo/datasets-nk/blob/main/C92550.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Environment & Folder Setup

This cell creates three folders (data, results, and models) in the current working directory to organise the project files. Passing `exist_ok=True` makes the cell safe to re-run, and later cells can write datasets, output results, and trained models without first checking that the target folder exists.

In [3]:
import os
os.makedirs("data", exist_ok=True)
os.makedirs("results", exist_ok=True)
os.makedirs("models", exist_ok=True)

In [26]:
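# Load the BUSI dataset
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# BUSI has three class folders (benign, malignant, normal), so class_mode='binary'
# still yields integer labels 0, 1 and 2 rather than a true binary target.
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)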

train_ultra = datagen.flow_from_directory(
    "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound/Dataset_BUSI_with_GT",
    target_size=(128,128),
    batch_size=32,
    class_mode='binary',
    subset='training'
)

val_ultra = datagen.flow_from_directory(
    "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound/Dataset_BUSI_with_GT",
    target_size=(128,128),
    batch_size=32,
    class_mode='binary',
    subset='validation'
)

# Display the first batch of training data
images, labels = next(train_ultra)
print("Shape of the first batch of images:", images.shape)
print("Shape of the first batch of labels:", labels.shape)

Found 1263 images belonging to 3 classes.
Found 315 images belonging to 3 classes.
Shape of the first batch of images: (32, 128, 128, 3)
Shape of the first batch of labels: (32,)


In [5]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/"

ls: cannot access '/content/drive/MyDrive/datasets/DATASETSWORK/': No such file or directory


In [6]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/code"

ls: cannot access '/content/drive/MyDrive/datasets/DATASETSWORK/code': No such file or directory


In [7]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [8]:
!ls "/content/drive/MyDrive/datasets"


DATASETSWORK


In [9]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK"


Code


In [10]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/Code"


C92550.ipynb  data  dataset  models  requirements.txt  results


In [11]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset"


'Breast Cancer YasserH'  'MIAS Mammography Dataset'
'Breast Ultrasound'	 'Wisconsin Diagnostic'


In [12]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound"


Dataset_BUSI_with_GT


In [13]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound/Dataset_BUSI_with_GT"


benign	malignant  normal


## Install Dependencies

This cell displays the active Python interpreter path and installs all essential libraries—joblib, pandas, numpy, scikit-learn, imbalanced-learn, shap, matplotlib, seaborn, kaggle, and tensorflow—ensuring every dependency required for data processing and model training is available.

In [14]:
import sys
print(sys.executable)
!{sys.executable} -m pip install joblib pandas numpy scikit-learn imbalanced-learn shap matplotlib seaborn kaggle tensorflow

/usr/bin/python3
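
Some of these packages are imported under a different name than the one pip uses (for example, `scikit-learn` is imported as `sklearn` and `imbalanced-learn` as `imblearn`), so it can help to check what is already importable before reinstalling. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the notebook.

```python
from importlib.util import find_spec

# Hypothetical mapping of pip package name -> import name (an assumption for this sketch).
REQUIRED = {
    "joblib": "joblib",
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "shap": "shap",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "kaggle": "kaggle",
    "tensorflow": "tensorflow",
}

def missing_packages(required: dict) -> list:
    """Return the pip names whose import name cannot be found in this interpreter."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]

print("Not importable yet:", missing_packages(REQUIRED))
```

Only the names in the returned list would then need to be passed to `pip install`.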


## Imports & Helper Functions

This cell imports essential libraries, defines preprocessing, evaluation, and explainability functions, and sets a random seed. It prepares reusable utilities for model training, data cleaning, performance measurement, and SHAP-based feature importance visualisation in breast cancer classification tasks.

In [15]:
import json
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from typing import Tuple, Dict, Any
from pathlib import Path

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, roc_curve, auc, classification_report
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

try:
    import shap
    shap_installed = True
except ImportError:
    # SHAP is optional; explainability plots are skipped when it is missing
    shap_installed = False

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

def ensure_binary(y: pd.Series) -> pd.Series:
    """Map common breast-cancer labels to 0/1 (0=benign, 1=malignant)."""
    y2 = y.copy()
    # Common label names
    mapping_candidates = [
        {'M':1, 'B':0, 'benign':0, 'malignant':1, 'Benign':0, 'Malignant':1},
        {'1':1, '0':0}, {1:1, 0:0}, {'yes':1, 'no':0}, {'Y':1, 'N':0}
    ]
    if y2.dtype == 'O':
        y2 = y2.str.strip()
    for mp in mapping_candidates:
        try:
            return y2.map(mp).astype(int)
        except Exception:
            pass
    if pd.api.types.is_numeric_dtype(y2):
        return (y2.astype(float) > 0).astype(int)
    classes = {cls:i for i, cls in enumerate(sorted(y2.unique()))}
    return y2.map(classes).astype(int)

def basic_numeric_categoricals(X: pd.DataFrame) -> Tuple[list, list]:
    """Split columns into numeric/categorical for preprocessing."""
    num_cols = X.select_dtypes(include=np.number).columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]
    return num_cols, cat_cols

def build_preprocessor(X: pd.DataFrame, scale: bool=True) -> ColumnTransformer:
    num_cols, cat_cols = basic_numeric_categoricals(X)
    num_tf = []
    if len(num_cols):
        num_tf = [('num_impute', SimpleImputer(strategy='median'), num_cols)]
        if scale:
            num_tf.append(('num_scale', StandardScaler(), num_cols))
    cat_tf = []
    if len(cat_cols):
        cat_tf = [('cat_impute', SimpleImputer(strategy='most_frequent'), cat_cols)]

    transformers = []
    transformers.extend([t for t in num_tf])
    transformers.extend([t for t in cat_tf])
    if not transformers:
        raise ValueError("No columns found to preprocess.")
    return ColumnTransformer(transformers=transformers, remainder='drop')

def evaluate_and_plot(y_true, y_prob, y_pred, model_name, dataset_name, out_dir="results"):
    """Compute metrics, save CM & ROC plots, return metrics dict."""
    os.makedirs(out_dir, exist_ok=True)
    metrics = {
        "dataset": dataset_name,
        "model": model_name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    try:
        roc_auc = roc_auc_score(y_true, y_prob)
    except ValueError:
        # Raised, for example, when y_true contains a single class
        roc_auc = np.nan
    metrics["roc_auc"] = roc_auc

    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    fig_cm, ax = plt.subplots()
    sns.heatmap(cm, annot=True, fmt='d', ax=ax)
    ax.set_title(f"Confusion Matrix: {model_name} on {dataset_name}")
    ax.set_xlabel("Predicted"); ax.set_ylabel("True")
    cm_path = f"{out_dir}/CM_{dataset_name}_{model_name}.png".replace(" ", "_")
    fig_cm.savefig(cm_path, bbox_inches="tight", dpi=150)
    plt.close(fig_cm)

    # ROC curve
    if not np.isnan(roc_auc):
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        fig_roc, ax = plt.subplots()
        ax.plot(fpr, tpr, label=f"AUC={auc(fpr,tpr):.3f}")
        ax.plot([0,1], [0,1], linestyle='--')
        ax.set_title(f"ROC: {model_name} on {dataset_name}")
        ax.set_xlabel("False Positive Rate"); ax.set_ylabel("True Positive Rate")
        ax.legend()
        roc_path = f"{out_dir}/ROC_{dataset_name}_{model_name}.png".replace(" ", "_")
        fig_roc.savefig(roc_path, bbox_inches="tight", dpi=150)
        plt.close(fig_roc)

    # Classification report text
    report_txt = classification_report(y_true, y_pred, digits=3)
    with open(f"{out_dir}/REPORT_{dataset_name}_{model_name}.txt".replace(" ", "_"), "w") as f:
        f.write(report_txt)

    return metrics

def maybe_shap_summary(fitted_pipeline, X_tr, dataset_name, model_name, out_dir="results"):
    """Optional SHAP summary for RF/LR if shap is installed."""
    if not shap_installed:
        return
    # Extract the final estimator (sklearn and imblearn pipelines both expose named_steps)
    final_est = fitted_pipeline.named_steps.get('clf')
    if final_est is None:
        return
    if isinstance(final_est, (RandomForestClassifier, LogisticRegression)):
        # Get processed X for SHAP
        try:
            preproc = fitted_pipeline.named_steps['preproc']
            X_proc = preproc.transform(X_tr)
        except Exception:
            # Fall back to the raw values if the pipeline has no usable 'preproc' step
            X_proc = X_tr.values
        explainer = shap.Explainer(final_est, X_proc)
        shap_values = explainer(X_proc[:100])  # subsample for speed
        plt.figure()
        shap.plots.beeswarm(shap_values, show=False, max_display=15)
        p = f"{out_dir}/SHAP_{dataset_name}_{model_name}.png".replace(" ", "_")
        plt.title(f"SHAP Summary: {model_name} on {dataset_name}")
        plt.savefig(p, bbox_inches='tight', dpi=150)
        plt.close()
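
One detail of `build_preprocessor` is worth checking. `ColumnTransformer` applies each listed transformer to the original columns independently and concatenates the outputs, so listing an imputer and a scaler for the same columns does not chain them. The sketch below contrasts that layout with a nested `Pipeline` that imputes first and then scales. The frame `X_demo` and its column names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small made-up frame with one missing value per column (an assumption for this sketch).
X_demo = pd.DataFrame({"radius": [1.0, np.nan, 3.0], "texture": [10.0, 20.0, np.nan]})
cols = list(X_demo.columns)

# Layout used in build_preprocessor: two independent branches over the same columns.
parallel = ColumnTransformer([
    ("num_impute", SimpleImputer(strategy="median"), cols),
    ("num_scale", StandardScaler(), cols),
])
out_parallel = parallel.fit_transform(X_demo)

# Alternative: a single branch that imputes and then scales the imputed values.
chained = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), cols),
])
out_chained = chained.fit_transform(X_demo)

print("parallel:", out_parallel.shape, "has NaN:", bool(np.isnan(out_parallel).any()))
print("chained: ", out_chained.shape, "has NaN:", bool(np.isnan(out_chained).any()))
```

With the parallel layout the scaled copy of each column still carries its missing values, which is one plausible source of an `Input X contains NaN` error at the SMOTE step.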

## Dataset Loaders
This cell defines dataset loader functions for various breast cancer datasets. It handles tabular data (Kaggle and Wisconsin) and image data (BUSI and MIAS) using preprocessing, directory-based loading, and Keras ImageDataGenerator for model-ready input.

In [16]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer as sk_breast
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split

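# Utility
# Note: this redefines ensure_binary from the imports cell and also maps 'normal' to 0.
def ensure_binary(y: pd.Series) -> pd.Series:
    """Convert string labels like 'Benign'/'Malignant' into 0/1 numeric."""
    mapping = {'B':0, 'M':1, 'benign':0, 'malignant':1, 'normal':0, 'Benign':0, 'Malignant':1, 'Normal':0}
    if y.dtype == 'O':
        mapped = y.map(mapping)
        if mapped.isna().any():
            # Fail with a readable message instead of a cast error on NaN
            raise ValueError(f"Unmapped labels: {list(y[mapped.isna()].unique())}")
        return mapped.astype(int)
    return y.astype(int)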


# 1. Kaggle Breast Cancer YasserH
def load_dataset_yasserh(path="dataset/Breast Cancer YasserH/breast-cancer.csv"):
    df = pd.read_csv(path)
    target_col = None
    for col in ['diagnosis','Class','target','Outcome']:
        if col in df.columns:
            target_col = col
            break
    if target_col is None:
        target_col = df.columns[-1]
    y = ensure_binary(df[target_col])
    X = df.drop(columns=[target_col])
    for c in ['id','ID','Unnamed: 32']:
        if c in X.columns:
            X = X.drop(columns=[c])
    return X, y


# 2. Wisconsin Diagnostic Dataset
def load_dataset_wisconsin(path="dataset/Wisconsin Diagnostic/data.csv"):
    df = pd.read_csv(path)
    # Usually columns: ID, diagnosis, 30 features
    if 'diagnosis' in df.columns:
        y = ensure_binary(df['diagnosis'])
        X = df.drop(columns=['diagnosis'])
    else:
        y = ensure_binary(df.iloc[:, 1])
        X = df.drop(df.columns[[0, 1]], axis=1)
    return X, y


# 3. Breast Ultrasound Dataset (BUSI)
def load_dataset_ultrasound(path="dataset/Breast Ultrasound/Dataset_BUSI_with_GT", img_size=(128,128), batch_size=32):
    """
    Loads BUSI ultrasound images from three folders: benign, malignant, normal.
    Returns ImageDataGenerator flow for training/testing.
    """
    datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

    train_gen = datagen.flow_from_directory(
        path,
        target_size=img_size,
        batch_size=batch_size,
        class_mode='binary',
        subset='training'
    )
    val_gen = datagen.flow_from_directory(
        path,
        target_size=img_size,
        batch_size=batch_size,
        class_mode='binary',
        subset='validation'
    )
    return train_gen, val_gen


# 4. MIAS Mammography Dataset
def load_dataset_mias(path="dataset/MIAS Mammography Dataset/all-mias", img_size=(128,128), batch_size=32):
    """
    Loads MIAS images (converted to PNG/JPG/PGM) using ImageDataGenerator.
    Expects folders containing labeled subdirectories or filenames with labels.
    If only one folder exists, you can create subfolders manually by label.
    """
    datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

    # If the dataset has subfolders (e.g., benign/malignant)
    if any(os.path.isdir(os.path.join(path, d)) for d in os.listdir(path)):
        train_gen = datagen.flow_from_directory(
            path,
            target_size=img_size,
            batch_size=batch_size,
            class_mode='binary',
            subset='training'
        )
        val_gen = datagen.flow_from_directory(
            path,
            target_size=img_size,
            batch_size=batch_size,
            class_mode='binary',
            subset='validation'
        )
    else:
        raise ValueError("MIAS dataset needs subfolders (benign/malignant) or labeled filenames.")
    return train_gen, val_gen


# 5. Sklearn Built-In Dataset (optional for comparison)
def load_dataset_sklearn():
    data = sk_breast()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    return X, y
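
The MIAS loader expects one subfolder per label, and its docstring suggests creating those folders by hand. If the download ships as a flat folder with a plain-text metadata file, the labels could instead be derived from that file. The sketch below assumes each row has the layout `reference background abnormality [severity x y radius]`, with `NORM` marking normal images and `B`/`M` marking severity. That layout, the sample rows, and the `parse_mias_line` helper are assumptions to verify against the actual file.

```python
from typing import Optional, Tuple

def parse_mias_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (image_id, label) for one metadata row, or None if the row is not usable."""
    parts = line.split()
    if len(parts) < 3 or not parts[0].startswith("mdb"):
        return None  # header, blank line, or unrelated text
    image_id, abnormality = parts[0], parts[2]
    if abnormality == "NORM":
        return image_id, "normal"
    if len(parts) >= 4 and parts[3] in ("B", "M"):
        return image_id, "benign" if parts[3] == "B" else "malignant"
    return None  # abnormality listed without a severity code

# Illustrative rows in the assumed layout.
rows = ["mdb001 G CIRC B 535 425 197", "mdb003 D NORM", "not a data row"]
labels = dict(filter(None, map(parse_mias_line, rows)))
print(labels)
```

Each image could then be copied into a folder named after its label (for example with `shutil.copy`) so that `flow_from_directory` finds the class subfolders it needs.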

## Define the 4 Algorithms
This cell defines a function that returns four machine learning models: Random Forest, SVM, KNN, and Logistic Regression. Random Forest, SVM, and Logistic Regression use `class_weight='balanced'`; KNN has no class-weight option and uses `n_neighbors=7`. All hyperparameters are fixed values rather than the result of a tuning search.

In [17]:
def get_models() -> Dict[str, Any]:
    models = {
        "RandomForest": RandomForestClassifier(
            n_estimators=300, random_state=RANDOM_STATE, class_weight='balanced'
        ),
        "SVM": SVC(
            kernel='rbf', probability=True, random_state=RANDOM_STATE, class_weight='balanced'
        ),
        "KNN": KNeighborsClassifier(n_neighbors=7),
        "LogisticRegression": LogisticRegression(
            max_iter=2000, random_state=RANDOM_STATE, class_weight='balanced'
        )
    }
    return models

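## Unified Train/Eval Loop (4 Models per Tabular Dataset)
This cell defines a function that trains all four models on one tabular dataset. It makes a single stratified 80/20 split, then fits one pipeline per model: preprocessing, SMOTE oversampling, and the classifier. SMOTE runs only during fitting, so the test split is never resampled. The function computes accuracy, precision, recall, F1, and ROC AUC, saves the plots and the fitted pipeline, and writes per-dataset and cumulative CSV summaries. Only the two tabular datasets go through this loop, so a full run yields 8 result rows rather than 16.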

In [18]:
def train_one_dataset(X, y, dataset_name: str, models: Dict[str, Any], out_dir="results") -> pd.DataFrame:
    results = []
    os.makedirs(out_dir, exist_ok=True)

    # Split once for fair comparison
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
    )

    # Preprocessor (scale beneficial for SVM/KNN/LR; ok for RF too)
    preproc = build_preprocessor(X_tr, scale=True)

    for model_name, clf in models.items():
        # Imbalanced pipeline: Preprocess -> SMOTE -> Classifier
        pipe = ImbPipeline(steps=[
            ('preproc', preproc),
            ('smote', SMOTE(random_state=RANDOM_STATE)),
            ('clf', clf)
        ])

        # Fit
        pipe.fit(X_tr, y_tr)

        # Predict proba and labels
        try:
            y_prob = pipe.predict_proba(X_te)[:,1]
        except AttributeError:
            # For models without predict_proba
            y_prob = pipe.decision_function(X_te)
            # Min-max scale the scores to 0-1 (ranking, and so ROC AUC, is unchanged)
            y_prob = (y_prob - y_prob.min()) / (y_prob.max() - y_prob.min() + 1e-8)
        y_pred = pipe.predict(X_te)

        # Evaluate + plots
        metrics = evaluate_and_plot(y_te, y_prob, y_pred, model_name, dataset_name, out_dir=out_dir)
        results.append(metrics)

        maybe_shap_summary(pipe, X_tr, dataset_name, model_name, out_dir=out_dir)

        # Save model
        joblib.dump(pipe, f"models/{dataset_name}_{model_name}.joblib".replace(" ", "_"))

    df_res = pd.DataFrame(results)
    csv_path = f"{out_dir}/ALL_METRICS.csv"
    if os.path.exists(csv_path):
        existing = pd.read_csv(csv_path)
        pd.concat([existing, df_res], ignore_index=True).to_csv(csv_path, index=False)
    else:
        df_res.to_csv(csv_path, index=False)

    df_res.to_csv(f"{out_dir}/METRICS_{dataset_name}.csv".replace(" ", "_"), index=False)
    return df_res
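
Because `ALL_METRICS.csv` is extended by appending, re-running the function for the same dataset adds a second set of rows for each dataset and model pair. One possible way to keep only the newest row per pair is sketched below. `merge_metrics` is a hypothetical helper, and the numbers are placeholders, not results from this notebook.

```python
import pandas as pd

def merge_metrics(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append new rows, keeping only the latest row per (dataset, model) pair."""
    combined = pd.concat([existing, new], ignore_index=True)
    deduped = combined.drop_duplicates(subset=["dataset", "model"], keep="last")
    return deduped.reset_index(drop=True)

# Placeholder values purely for illustration.
old = pd.DataFrame({"dataset": ["demo", "demo"], "model": ["SVM", "KNN"], "accuracy": [0.5, 0.6]})
rerun = pd.DataFrame({"dataset": ["demo"], "model": ["SVM"], "accuracy": [0.7]})

merged = merge_metrics(old, rerun)
print(merged)
```

The same call could replace the plain `pd.concat` in the branch that handles an existing CSV.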

## Run All 4 Datasets × 4 Models
This cell sequentially trains models on all datasets. It runs tabular datasets through machine learning pipelines, loads image datasets for CNN preparation, merges tabular results into one summary file, and confirms readiness for deep learning training.

In [19]:
all_results = []

models = get_models()

# Kaggle Breast Cancer (YasserH - tabular)
try:
    X1, y1 = load_dataset_yasserh("/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Cancer YasserH/breast-cancer.csv")
    res1 = train_one_dataset(X1, y1, "Kaggle_YasserH", models)
    all_results.append(res1)
except Exception as e:
    print("Kaggle_YasserH dataset skipped:", e)


# Wisconsin Diagnostic Dataset (tabular)
try:
    X2, y2 = load_dataset_wisconsin("/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Wisconsin Diagnostic/data.csv")
    res2 = train_one_dataset(X2, y2, "Wisconsin_Diagnostic", models)
    all_results.append(res2)
except Exception as e:
    print("Wisconsin_Diagnostic dataset skipped:", e)


# Breast Ultrasound Dataset (BUSI - image dataset)
try:
    train_ultra, val_ultra = load_dataset_ultrasound("/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound/Dataset_BUSI_with_GT")
    print("Ultrasound dataset loaded successfully — ready for CNN training.")
except Exception as e:
    print("Breast Ultrasound dataset skipped:", e)


# MIAS Mammography Dataset (image dataset)
try:
    train_mias, val_mias = load_dataset_mias("/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/MIAS Mammography Dataset/all-mias")
    print("MIAS dataset loaded successfully — ready for CNN training.")
except Exception as e:
    print("MIAS dataset skipped:", e)


# Combine tabular model results (first two datasets)
if len(all_results):
    final_table = pd.concat(all_results, ignore_index=True)
    final_table.sort_values(["dataset", "roc_auc"], ascending=[True, False], inplace=True)
    display(final_table)
    final_table.to_csv("results/SUMMARY_TABULAR_RESULTS.csv", index=False)
    print("Tabular model results saved to results/SUMMARY_TABULAR_RESULTS.csv")
else:
    print("No tabular datasets ran successfully. Please check file paths.")

print("\n Tabular datasets processed. Image datasets (Ultrasound, MIAS) ready for CNN model training.")

Kaggle_YasserH dataset skipped: The beeswarm plot does not support plotting explanations with instances that have more than one dimension!
Wisconsin_Diagnostic dataset skipped: Input X contains NaN.
SMOTE does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
Found 1263 images belonging to 3 classes.


  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count


Found 315 images belonging to 3 classes.
Ultrasound dataset loaded successfully — ready for CNN training.
MIAS dataset skipped: MIAS dataset needs subfolders (benign/malignant) or labeled filenames.
No tabular datasets ran successfully. Please check file paths.

 Tabular datasets processed. Image datasets (Ultrasound, MIAS) ready for CNN model training.


<Figure size 640x480 with 0 Axes>
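
The beeswarm message in the log above is consistent with SHAP returning one set of values per class for a classifier such as a random forest. That gives a three-dimensional result (samples, features, classes), while `shap.plots.beeswarm` expects two dimensions. A common workaround is to select one class before plotting. The sketch below shows that selection on a plain NumPy array standing in for the SHAP output. `select_class` is a hypothetical helper, and whether the real output is three-dimensional depends on the SHAP version and the model.

```python
import numpy as np

def select_class(values, class_index: int = 1):
    """Reduce (samples, features, classes) to (samples, features); pass 2-D input through."""
    if len(values.shape) == 3:
        return values[:, :, class_index]
    return values

# Stand-in for SHAP output: 100 samples, 30 features, 2 classes (shapes are assumptions).
fake_values = np.zeros((100, 30, 2))
print(select_class(fake_values).shape)
print(select_class(np.zeros((100, 30))).shape)
```

With a real `shap.Explanation`, the same slice (`shap_values[:, :, 1]`) should be usable as the argument to `shap.plots.beeswarm`.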

In [20]:
from sklearn.impute import SimpleImputer
import pandas as pd

# Load Wisconsin dataset
wisconsin_path = "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Wisconsin Diagnostic/data.csv"  # adjust if needed
df_wisconsin = pd.read_csv(wisconsin_path)

# Drop irrelevant column
df_wisconsin = df_wisconsin.drop(columns=["Unnamed: 32"], errors='ignore')

# Separate features and labels
X = df_wisconsin.drop(columns=['diagnosis'])
y = df_wisconsin['diagnosis']

# Replace NaN values
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)

print("✅ Wisconsin dataset cleaned successfully — no NaN values remaining.")


✅ Wisconsin dataset cleaned successfully — no NaN values remaining.


In [21]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset"


'Breast Cancer YasserH'  'MIAS Mammography Dataset'
'Breast Ultrasound'	 'Wisconsin Diagnostic'


In [22]:
!ls "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Cancer YasserH"


breast-cancer.csv


## CNN Model
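This cell builds and trains a Convolutional Neural Network (CNN) for breast ultrasound images. Two convolution and max-pooling stages extract features, a dense layer and a single sigmoid unit produce the output, and the model is compiled with Adam and binary cross-entropy, trained for ten epochs, and saved as `models/CNN_BUSI.h5`. Because BUSI has three classes (benign, malignant, normal) while the output layer and loss assume two, the labels include the value 2, which is why the logged loss turns negative.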

In [23]:
# STEP 1 — Install required packages
!pip install tensorflow

# STEP 2 — Import libraries
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models
import os

# STEP 3 — Load the BUSI dataset
# (Make sure you have this folder structure in your Colab files:)
# dataset/Breast Ultrasound/Dataset_BUSI_with_GT/benign
# dataset/Breast Ultrasound/Dataset_BUSI_with_GT/malignant
# dataset/Breast Ultrasound/Dataset_BUSI_with_GT/normal

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_ultra = datagen.flow_from_directory(
    "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound/Dataset_BUSI_with_GT",
    target_size=(128,128),
    batch_size=32,
    class_mode='binary',
    subset='training'
)

val_ultra = datagen.flow_from_directory(
    "/content/drive/MyDrive/datasets/DATASETSWORK/Code/dataset/Breast Ultrasound/Dataset_BUSI_with_GT",
    target_size=(128,128),
    batch_size=32,
    class_mode='binary',
    subset='validation'
)

# STEP 4 — Build the CNN model
def build_cnn_model(input_shape=(128,128,3)):
    model = models.Sequential([
        layers.Conv2D(32, (3,3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(2,2),
        layers.Conv2D(64, (3,3), activation='relu'),
        layers.MaxPooling2D(2,2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# STEP 5 — Train the model
cnn = build_cnn_model()
history = cnn.fit(train_ultra, validation_data=val_ultra, epochs=10)

# STEP 6 — Save the model
os.makedirs("models", exist_ok=True)
cnn.save("models/CNN_BUSI.h5")

Found 1263 images belonging to 3 classes.
Found 315 images belonging to 3 classes.


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
  self._warn_if_super_not_called()


Epoch 1/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 720s 18s/step - accuracy: 0.4104 - loss: 0.8884 - val_accuracy: 0.3651 - val_loss: 0.6704
Epoch 2/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 49s 1s/step - accuracy: 0.4829 - loss: 0.5258 - val_accuracy: 0.3524 - val_loss: 0.5396
Epoch 3/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 49s 1s/step - accuracy: 0.5583 - loss: -0.1224 - val_accuracy: 0.3873 - val_loss: 0.3697
Epoch 4/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 53s 1s/step - accuracy: 0.5187 - loss: -2.0184 - val_accuracy: 0.3524 - val_loss: -1.0487
Epoch 5/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 47s 1s/step - accuracy: 0.5577 - loss: -14.6068 - val_accuracy: 0.3524 - val_loss: -15.8573
Epoch 6/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 47s 1s/step - accuracy: 0.5732 - loss: -86.2320 - val_accuracy: 0.4508 - val_loss: -93.6498
Epoch 7/10
40/40
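
Binary cross-entropy, `-(y*log(p) + (1 - y)*log(1 - p))`, is non-negative only when the label `y` is 0 or 1. With three class folders one class is encoded as 2, and for that label the loss becomes increasingly negative as the predicted probability `p` approaches 1. The short calculation below illustrates this with the standard library. `bce` is a hypothetical helper written for the example, not the Keras implementation.

```python
import math

def bce(y: float, p: float) -> float:
    """Binary cross-entropy for one sample with label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Compare a valid label (1) with the out-of-range label (2) as p grows towards 1.
for p in (0.5, 0.9, 0.999):
    print(f"p={p}: y=1 -> {bce(1, p):.3f}, y=2 -> {bce(2, p):.3f}")
```

Two options that would avoid this, depending on the goal, are a three-unit softmax output trained with sparse categorical cross-entropy on `class_mode='sparse'` labels, or restricting the generator to two folders through the `classes` argument of `flow_from_directory`.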



## TRAINING
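This cell attempts end-to-end training for all datasets. It redefines `train_one_dataset` with an extra median-imputation step before SMOTE, trains the four ML models on the tabular data (Kaggle, Wisconsin) and a CNN on each image dataset (BUSI, MIAS), then saves the models and a summary CSV. In the recorded run every dataset was skipped because the loaders were given paths relative to the working directory instead of the mounted Drive folder.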

In [24]:
import os
import pandas as pd
from sklearn.impute import SimpleImputer
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from tensorflow.keras import layers, models

# Helper: CNN Model
def build_cnn_model(input_shape=(128,128,3)):
    model = models.Sequential([
        layers.Conv2D(32, (3,3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(2,2),
        layers.Conv2D(64, (3,3), activation='relu'),
        layers.MaxPooling2D(2,2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


# Tabular Model Training
def train_one_dataset(X, y, dataset_name: str, models: Dict[str, Any], out_dir="results") -> pd.DataFrame:
    results = []
    os.makedirs(out_dir, exist_ok=True)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)
    preproc = build_preprocessor(X_tr, scale=True)

    for model_name, clf in models.items():
        pipe = ImbPipeline(steps=[
            ('preproc', preproc),
            ('imputer', SimpleImputer(strategy='median')),
            ('smote', SMOTE(random_state=RANDOM_STATE)),
            ('clf', clf)
        ])

        pipe.fit(X_tr, y_tr)

        try:
            y_prob = pipe.predict_proba(X_te)[:, 1]
        except AttributeError:
            # For models without predict_proba: min-max scale decision scores to 0-1
            y_prob = pipe.decision_function(X_te)
            y_prob = (y_prob - y_prob.min()) / (y_prob.max() - y_prob.min() + 1e-8)
        y_pred = pipe.predict(X_te)

        metrics = evaluate_and_plot(y_te, y_prob, y_pred, model_name, dataset_name, out_dir=out_dir)
        results.append(metrics)

        try:
            maybe_shap_summary(pipe, X_tr, dataset_name, model_name, out_dir=out_dir)
        except Exception as e:
            print(f"SHAP skipped for {dataset_name}-{model_name}: {e}")

        joblib.dump(pipe, f"models/{dataset_name}_{model_name}.joblib".replace(" ", "_"))

    df_res = pd.DataFrame(results)
    df_res.to_csv(f"{out_dir}/METRICS_{dataset_name}.csv".replace(" ", "_"), index=False)
    return df_res


# Run All Datasets
all_results = []
models = get_models()

# Kaggle (YasserH)
try:
    X1, y1 = load_dataset_yasserh("dataset/Breast Cancer YasserH/breast-cancer.csv")
    print(f"Training on Kaggle_YasserH dataset with {X1.shape[0]} samples...")
    res1 = train_one_dataset(X1, y1, "Kaggle_YasserH", models)
    all_results.append(res1)
    print("Kaggle_YasserH complete.")
except Exception as e:
    print("Kaggle_YasserH dataset skipped:", e)


# Wisconsin Diagnostic
try:
    X2, y2 = load_dataset_wisconsin("dataset/Wisconsin Diagnostic/data.csv")
    print(f"Training on Wisconsin_Diagnostic dataset with {X2.shape[0]} samples...")
    res2 = train_one_dataset(X2, y2, "Wisconsin_Diagnostic", models)
    all_results.append(res2)
    print("Wisconsin_Diagnostic complete.")
except Exception as e:
    print("Wisconsin_Diagnostic dataset skipped:", e)


# Breast Ultrasound (BUSI)
try:
    train_ultra, val_ultra = load_dataset_ultrasound("dataset/Breast Ultrasound/Dataset_BUSI_with_GT")
    print("Training CNN on BUSI (Ultrasound) dataset...")
    cnn_ultra = build_cnn_model()
    hist_ultra = cnn_ultra.fit(train_ultra, validation_data=val_ultra, epochs=10)
    cnn_ultra.save("models/CNN_BUSI.h5")
    print("BUSI CNN training complete and model saved.")
except Exception as e:
    print("BUSI dataset skipped:", e)


# MIAS Mammography Dataset
try:
    train_mias, val_mias = load_dataset_mias("dataset/MIAS Mammography Dataset/all-mias")
    print("Training CNN on MIAS (Mammography) dataset...")
    cnn_mias = build_cnn_model()
    hist_mias = cnn_mias.fit(train_mias, validation_data=val_mias, epochs=10)
    cnn_mias.save("models/CNN_MIAS.h5")
    print("MIAS CNN training complete and model saved.")
except Exception as e:
    print("MIAS dataset skipped:", e)


# Combine Tabular Results
if len(all_results):
    final_table = pd.concat(all_results, ignore_index=True)
    final_table.sort_values(["dataset", "roc_auc"], ascending=[True, False], inplace=True)
    final_table.to_csv("results/SUMMARY_TABULAR_RESULTS.csv", index=False)
    print("Tabular results saved: results/SUMMARY_TABULAR_RESULTS.csv")
else:
    print("No tabular datasets ran successfully. Please check data paths.")

print("\n Training complete for all datasets — tabular + CNN models ready.")


Kaggle_YasserH dataset skipped: [Errno 2] No such file or directory: 'dataset/Breast Cancer YasserH/breast-cancer.csv'
Wisconsin_Diagnostic dataset skipped: [Errno 2] No such file or directory: 'dataset/Wisconsin Diagnostic/data.csv'
BUSI dataset skipped: [Errno 2] No such file or directory: 'dataset/Breast Ultrasound/Dataset_BUSI_with_GT'
MIAS dataset skipped: [Errno 2] No such file or directory: 'dataset/MIAS Mammography Dataset/all-mias'
No tabular datasets ran successfully. Please check data paths.

 Training complete for all datasets — tabular + CNN models ready.


## Nice Summary Plots for Padlet
This cell visualises model performance across datasets by plotting ROC AUC scores from the summary CSV. It groups results by dataset, styles the chart with distinct colours, adds labels and legends, and saves the figure for performance comparison.

In [25]:
import matplotlib.pyplot as plt
import pandas as pd

try:
    df_path = "results/SUMMARY_TABULAR_RESULTS.csv"
    if not os.path.exists(df_path):
        raise FileNotFoundError(f"{df_path} not found. Run training cells first.")

    df = pd.read_csv(df_path)

    plt.figure(figsize=(12, 7))
    colors = plt.cm.Paired.colors

    for i, ds in enumerate(df['dataset'].unique()):
        sub = df[df['dataset'] == ds]
        x_labels = [f"{ds}-{m}" for m in sub['model']]
        plt.bar(x_labels, sub['roc_auc'], color=colors[i % len(colors)], label=ds)

    plt.xticks(rotation=70, ha='right', fontsize=9)
    plt.ylabel("ROC AUC", fontsize=11)
    plt.title("ROC AUC across Tabular Datasets × 4 Models", fontsize=13, weight='bold')
    plt.legend(title="Dataset", fontsize=9)
    plt.tight_layout()

    output_path = "results/BAR_ROC_AUC_TABULAR.png"
    plt.savefig(output_path, dpi=150)
    plt.close()
    print(f"Summary bar plot saved to {output_path}")

except Exception as e:
    print("Summary bar plot skipped due to error:", e)

Summary bar plot skipped due to error: results/SUMMARY_TABULAR_RESULTS.csv not found. Run training cells first.
