# Skin Cancer Classification: Case Study and Implementation

## Problem Statement

Late diagnosis of skin cancer in Africa, driven by limited diagnostic access, high treatment costs, and socio-cultural barriers, results in high mortality. In 2020, skin cancer accounted for approximately 10,000 deaths in Africa, with over 90% of cases diagnosed at advanced stages because of inadequate healthcare infrastructure (GLOBOCAN 2020). Existing measures, such as mobile health units and WHO screening programs, are constrained by insufficient funding, limited reach, and stigma. An accessible, low-cost, and accurate diagnostic tool for early detection is therefore needed to improve outcomes in underserved communities.

## Objective

Develop a modular machine learning pipeline using XGBoost to classify skin lesion images as benign or malignant, with data augmentation, hyperparameter tuning, and a retraining mechanism. The pipeline is split into `preprocessing.py`, `model.py`, and `prediction.py`, with this notebook demonstrating the full workflow and evaluation metrics.

## Dataset

- **Source**: ISIC dataset.
- **Structure**: `data/train/` and `data/test/` with subfolders `benign/` and `malignant/`.
- **Preprocessing**: Images resized to 172x251, normalized, and flattened for XGBoost. Training data includes augmentation (rotations, flips, brightness, grayscale).
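
The resize, normalize, and flatten steps can be sketched as follows. This is a minimal illustration, not the contents of `preprocessing.py`: the function name `flatten_image_sketch` is hypothetical, and it assumes that 172x251 means height x width, that images have three RGB channels, and that normalization means scaling pixel values to [0, 1].

```python
import numpy as np
from PIL import Image

# Assumption: 172x251 is (height, width); the source does not state the order.
TARGET_HEIGHT, TARGET_WIDTH = 172, 251


def flatten_image_sketch(img):
    """Resize, scale to [0, 1], and flatten an image into a 1-D feature vector."""
    # Pillow's resize() expects (width, height).
    rgb = img.convert("RGB").resize((TARGET_WIDTH, TARGET_HEIGHT))
    arr = np.asarray(rgb, dtype=np.float32) / 255.0
    return arr.reshape(-1)


# Synthetic stand-in for a lesion photo, so the sketch runs without the dataset.
demo = Image.new("RGB", (600, 450), color=(128, 64, 32))
features = flatten_image_sketch(demo)
print(features.shape)
```

Under these assumptions each image becomes a vector of 172 × 251 × 3 = 129,516 features, far more than the number of training images reported in Step 2, which is worth keeping in mind when tuning the model.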

## Requirements

```bash
pip install numpy pillow tensorflow scikit-learn xgboost matplotlib seaborn scikit-plot
```

## Step 1: Import Dependencies and Scripts

Import the necessary libraries and our modular scripts from the `src/` directory.

In [1]:
import os
import sys
# Add the project root to the import path; adjust this to your own checkout location
sys.path.append('C:\\Users\\TestSolutions\\Desktop\\Summative - ML Pipeline')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, auc
from src.preprocessing import load_dataset, preprocess_single_image
from src.model import create_model, train_model, evaluate_model, save_model, trigger_retrain
from src.prediction import load_model, predict_single_image, predict_batch

# Set paths
train_dir = '../data/train'
test_dir = '../data/test'
model_path = '../models/optimized_xgb_model.pkl'
new_data_dir = '../data/new_data'  # For retraining
os.makedirs('../models', exist_ok=True)



## Step 2: Load and Preprocess Data

Load the training and test datasets using `preprocessing.py`, applying augmentation to training data and no augmentation to test data.

In [2]:
# Load training data with augmentation
train_gen, train_samples = load_dataset(train_dir, batch_size=32, augmentation=True, normalize=True)
print(f'Loaded {train_samples} training samples')

# Load test data without augmentation
test_gen, test_samples = load_dataset(test_dir, batch_size=32, augmentation=False, normalize=True)
print(f'Loaded {test_samples} test samples')

2025-07-31 14:56:23,356 - INFO - Attempting to load dataset from: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train
2025-07-31 14:56:23,359 - INFO - Checking benign directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train\benign
2025-07-31 14:56:23,361 - INFO - Checking malignant directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train\malignant
2025-07-31 14:56:23,368 - INFO - Found 139 images in c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train (benign: 48, malignant: 91)


Found 139 images belonging to 2 classes.


2025-07-31 14:56:23,426 - INFO - Loaded dataset from c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train with 139 samples
2025-07-31 14:56:23,429 - INFO - Attempting to load dataset from: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test
2025-07-31 14:56:23,430 - INFO - Checking benign directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test\benign
2025-07-31 14:56:23,431 - INFO - Checking malignant directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test\malignant
2025-07-31 14:56:23,435 - INFO - Found 56 images in c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test (benign: 19, malignant: 37)


Loaded 139 training samples
Found 56 images belonging to 2 classes.


2025-07-31 14:56:23,472 - INFO - Loaded dataset from c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test with 56 samples


Loaded 56 test samples
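
The augmentations listed for the training data (rotations, flips, brightness, grayscale) could look roughly like the sketch below. It uses Pillow directly and is only an assumption about what `load_dataset(..., augmentation=True)` does internally; the function name `augment_sketch`, the parameter ranges, and the probabilities are all hypothetical.

```python
import random

from PIL import Image, ImageEnhance, ImageOps


def augment_sketch(img, rng):
    """Apply a random rotation, flip, brightness change, and occasional grayscale."""
    out = img.rotate(rng.uniform(-20, 20))            # small random rotation (degrees)
    if rng.random() < 0.5:
        out = ImageOps.mirror(out)                    # horizontal flip
    factor = rng.uniform(0.8, 1.2)                    # 1.0 leaves brightness unchanged
    out = ImageEnhance.Brightness(out).enhance(factor)
    if rng.random() < 0.1:
        # Drop colour information but keep three channels so shapes stay constant.
        out = ImageOps.grayscale(out).convert("RGB")
    return out


# Synthetic stand-in image; a seeded generator keeps the demo reproducible.
demo = Image.new("RGB", (251, 172), color=(180, 120, 90))
augmented = augment_sketch(demo, random.Random(0))
print(augmented.size, augmented.mode)
```

Augmentation is applied only to the training data so that test metrics reflect unmodified images.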


## Step 3: Train the Model

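Create and train the XGBoost model with hyperparameter tuning using `model.py`. In the run recorded below, all 81 hyperparameter-search fits failed, so `train_model` returned no model and nothing was saved.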

In [3]:
# Create and train model
model = create_model()
if model:
    model = train_model(model, train_dir, batch_size=32, tune_hyperparameters=True)
    if model:
        save_model(model, model_path)
    else:
        print('Model training failed.')
else:
    print('Model creation failed.')

2025-07-31 14:56:23,496 - INFO - XGBoost model created successfully
2025-07-31 14:56:23,500 - INFO - Attempting to load dataset from: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train
2025-07-31 14:56:23,502 - INFO - Checking benign directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train\benign
2025-07-31 14:56:23,504 - INFO - Checking malignant directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train\malignant
2025-07-31 14:56:23,509 - INFO - Found 139 images in c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train (benign: 48, malignant: 91)


Found 139 images belonging to 2 classes.


2025-07-31 14:56:23,570 - INFO - Loaded dataset from c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\train with 139 samples
2025-07-31 14:56:23,571 - INFO - Loaded 139 training samples
2025-07-31 14:56:52,831 - ERROR - Error training model: 
All the 81 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
81 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\venv\Lib\site-packages\xgboost\core.py", line 729, in inner_f
    return func(**kwargs)


Model training failed.
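
The log above recommends setting `error_score='raise'` to debug the failed fits. The sketch below shows what that option does, using a deliberately misconfigured scikit-learn decision tree as a stand-in; it does not reproduce the XGBoost failure, whose root cause is cut off in the traceback, and the search inside `model.py` is not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset with balanced binary labels.
rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = np.array([0, 1] * 20)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [-1]},  # deliberately invalid, so every fit fails
    cv=3,
    error_score="raise",             # re-raise the first underlying exception
)

try:
    search.fit(X, y)
    first_error = None
except Exception as exc:
    # With error_score="raise" this is the original error from the estimator,
    # not the summary "All the N fits failed" message.
    first_error = exc
    print(type(exc).__name__, exc)
```

Applying the same option to the search used for XGBoost should surface the first real exception with its full traceback.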


## Step 4: Evaluate the Model

Evaluate the model on the test dataset and display metrics (accuracy, precision, recall, F1-score, confusion matrix).

In [4]:
if model:
    metrics = evaluate_model(model, test_dir, batch_size=32)
    if metrics:
        print('Evaluation Metrics:')
        for key, value in metrics.items():
            if key != 'confusion_matrix':
                print(f'{key}: {value:.4f}')
            else:
                print(f'{key}:\n{np.array(value)}')

        # Visualize confusion matrix
        cm = np.array(metrics['confusion_matrix'])
        plt.figure(figsize=(6, 4))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
        plt.title('Confusion Matrix')
        plt.xlabel('Predicted')
        plt.ylabel('True')
        plt.show()

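        # ROC curve for the malignant class
        test_gen, _ = load_dataset(test_dir, batch_size=32, augmentation=False, normalize=True)
        X_test, y_test = [], []
        n_batches = -(-test_samples // 32)  # ceiling division: just enough batches to cover the test set
        for _ in range(n_batches):
            batch_x, batch_y = next(test_gen)
            X_test.append(batch_x)
            y_test.append(batch_y)
        X_test = np.vstack(X_test)
        y_test = np.hstack(y_test)
        y_scores = model.predict_proba(X_test)[:, 1]
        # Plot with the roc_curve/auc helpers imported in Step 1
        fpr, tpr, _ = roc_curve(y_test, y_scores)
        plt.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.2f}')
        plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend(loc='lower right')
        plt.title('ROC Curve for Malignant Class')
        plt.show()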
else:
    print('No model available for evaluation.')

No model available for evaluation.
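
Since the recorded run produced no metrics, the sketch below shows how the listed metrics relate to a confusion matrix on a small hand-made example. The labels are invented for illustration (0 = benign, 1 = malignant is an assumption), and the dictionary layout only mirrors how `metrics` is consumed above; the real `evaluate_model` may differ.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels: 0 = benign, 1 = malignant (assumed encoding).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

metrics_sketch = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # of predicted malignant, share truly malignant
    "recall": recall_score(y_true, y_pred),        # of truly malignant, share detected
    "f1_score": f1_score(y_true, y_pred),
    # Rows are true classes, columns are predicted classes.
    "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
}

for key, value in metrics_sketch.items():
    print(key, value)
```

For a screening tool, recall on the malignant class is the figure to watch most closely, because a false negative means a missed cancer.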


## Step 5: Make Predictions

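Demonstrate single-image and batch predictions using `prediction.py`. Replace `sample_image` with an actual image path from your dataset. In the recorded run, `model` is `None` after the failed training in Step 3, so the single-image branch is skipped and the unguarded batch call logs an error and returns no predictions.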

In [5]:
# Single image prediction
sample_image = 'data/test/benign/ISIC_1431322.jpg'  # Replace with an image path from your dataset
if model:
    result = predict_single_image(model, sample_image)
    if result:
        print(f"Single Image Prediction: {result['image_path']}")
        print(f"  Predicted: {result['prediction']}, Probability: {result['probability']:.4f}")

# Batch prediction
predictions = predict_batch(model, test_dir, batch_size=32)
print('\nBatch Prediction Results (first 10):')
for pred in predictions[:10]:
    print(f"Image: {pred['image_path']}, Predicted: {pred['prediction']}, Probability: {pred['probability']:.4f}")

2025-07-31 14:56:53,005 - INFO - Attempting to load dataset from: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test
2025-07-31 14:56:53,010 - INFO - Checking benign directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test\benign
2025-07-31 14:56:53,018 - INFO - Checking malignant directory: c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test\malignant
2025-07-31 14:56:53,030 - INFO - Found 56 images in c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test (benign: 19, malignant: 37)


Found 56 images belonging to 2 classes.


2025-07-31 14:56:53,204 - INFO - Loaded dataset from c:\Users\TestSolutions\Desktop\Summative - ML Pipeline\data\test with 56 samples
2025-07-31 14:56:53,206 - INFO - Loaded 56 test samples for batch prediction
2025-07-31 14:56:54,078 - ERROR - Error in batch prediction: 'NoneType' object has no attribute 'predict_proba'



Batch Prediction Results (first 10):
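
The prediction cell reads the keys `image_path`, `prediction`, and `probability` from each result. A minimal sketch of how a model probability might be turned into such a record is shown below. The function name `format_prediction_sketch`, the 0.5 threshold, the label strings, and the choice to report the malignant-class probability are all assumptions, not the behaviour of `prediction.py`.

```python
def format_prediction_sketch(image_path, malignant_probability, threshold=0.5):
    """Build a result record from the predicted probability of the malignant class."""
    label = "malignant" if malignant_probability >= threshold else "benign"
    return {
        "image_path": image_path,
        "prediction": label,
        "probability": float(malignant_probability),
    }


# Invented probabilities, as predict_proba(...)[:, 1] might return them.
examples = [("img_a.jpg", 0.91), ("img_b.jpg", 0.12)]
results = [format_prediction_sketch(path, p) for path, p in examples]

for r in results:
    print(f"Image: {r['image_path']}, Predicted: {r['prediction']}, "
          f"Probability: {r['probability']:.4f}")
```

Lowering the threshold trades more false alarms for fewer missed malignant lesions, which may be preferable in a screening setting.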


## Step 6: Retrain the Model

Trigger retraining if new data is available or performance is below threshold (0.8).

In [6]:
if model:
    model = trigger_retrain(model, new_data_dir, model_path, performance_threshold=0.8, test_dir=test_dir)
    if model:
        print('Retraining completed successfully.')
    else:
        print('Retraining failed.')
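
The retraining rule described above (new data available, or performance below 0.8) can be sketched as a small decision function. This only illustrates the rule; `should_retrain_sketch`, `has_new_images`, and the set of accepted file extensions are hypothetical and are not taken from `trigger_retrain` in `model.py`.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted image types


def has_new_images(new_data_dir):
    """Return True if the directory exists and contains at least one image file."""
    root = Path(new_data_dir)
    if not root.is_dir():
        return False
    return any(
        p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES for p in root.rglob("*")
    )


def should_retrain_sketch(new_data_dir, current_score, performance_threshold=0.8):
    """Retrain when new images exist or the current score is below the threshold."""
    return has_new_images(new_data_dir) or current_score < performance_threshold


# Demonstrate the three cases with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_dir_high = should_retrain_sketch(tmp, current_score=0.90)  # no data, good score
    empty_dir_low = should_retrain_sketch(tmp, current_score=0.70)   # no data, poor score
    benign_dir = Path(tmp) / "benign"
    benign_dir.mkdir()
    (benign_dir / "new_lesion.jpg").write_bytes(b"")                 # placeholder file
    new_data_high = should_retrain_sketch(tmp, current_score=0.90)   # new data present

print(empty_dir_high, empty_dir_low, new_data_high)
```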

## Step 7: Visualize Confusion Matrix (Chart.js)

Create an interactive confusion matrix visualization using Chart.js. A bar chart is used instead of a matrix chart because core Chart.js has no built-in matrix chart type.

In [7]:
%%javascript
// Assumes Chart.js is already loaded on the page and that the Python `metrics`
// dict has been exposed to the browser as `window.metrics`; a %%javascript cell
// cannot read Python variables directly.
const metrics = window.metrics;
if (window.Chart && metrics && metrics['confusion_matrix']) {
    const cm = metrics['confusion_matrix'];
    const ctx = document.createElement('canvas').getContext('2d');
    document.body.appendChild(ctx.canvas);

    new Chart(ctx, {
      "type": "bar",
      "data": {
        "labels": ["True Benign, Pred Benign", "True Benign, Pred Malignant", "True Malignant, Pred Benign", "True Malignant, Pred Malignant"],
        "datasets": [{
          "label": "Confusion Matrix Counts",
          "data": [cm[0][0], cm[0][1], cm[1][0], cm[1][1]],
          "backgroundColor": ["rgba(54, 162, 235, 0.5)", "rgba(255, 99, 132, 0.5)", "rgba(75, 192, 192, 0.5)", "rgba(255, 205, 86, 0.5)"],
          "borderColor": ["rgba(54, 162, 235, 1)", "rgba(255, 99, 132, 1)", "rgba(75, 192, 192, 1)", "rgba(255, 205, 86, 1)"],
          "borderWidth": 1
        }]
      },
      "options": {
        "scales": {
          "y": {
            "beginAtZero": true,
            "title": { "display": true, "text": "Count" }
          },
          "x": {
            "title": { "display": true, "text": "Confusion Matrix Categories" }
          }
        },
        "plugins": {
          "title": { "display": true, "text": "Confusion Matrix" }
        }
      }
    });
}

<IPython.core.display.Javascript object>
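
A `%%javascript` cell runs in the browser and cannot see the Python `metrics` variable, so the confusion matrix has to be serialized on the Python side first. The sketch below builds the bar-chart configuration as JSON from a 2x2 matrix; the function name `confusion_matrix_chart_config` and the example counts are hypothetical, and the row/column order (true class by predicted class) follows the labels used above.

```python
import json


def confusion_matrix_chart_config(cm):
    """Flatten a 2x2 confusion matrix into a Chart.js bar-chart configuration."""
    (tn, fp), (fn, tp) = cm  # rows: true class, columns: predicted class
    return {
        "type": "bar",
        "data": {
            "labels": [
                "True Benign, Pred Benign",
                "True Benign, Pred Malignant",
                "True Malignant, Pred Benign",
                "True Malignant, Pred Malignant",
            ],
            "datasets": [{
                "label": "Confusion Matrix Counts",
                # int() converts NumPy integers, which json cannot serialize.
                "data": [int(tn), int(fp), int(fn), int(tp)],
                "borderWidth": 1,
            }],
        },
        "options": {
            "scales": {"y": {"beginAtZero": True}},
            "plugins": {"title": {"display": True, "text": "Confusion Matrix"}},
        },
    }


# Invented counts for illustration only.
config_json = json.dumps(confusion_matrix_chart_config([[2, 1], [1, 4]]), indent=2)
print(config_json)
```

The resulting string could then be embedded in a script passed to `IPython.display.Javascript`, provided Chart.js has been loaded in the notebook front end.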