{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Skin Cancer Classification: Case Study and Implementation\n",
        "\n",
        "## Problem Statement\n",
        "\n",
        "Late skin cancer diagnosis in Africa, driven by limited diagnostic access, high treatment costs, and socio-cultural barriers, results in high mortality rates. In 2020, skin cancer accounted for approximately 10,000 deaths annually in Africa, with over 90% of cases diagnosed at advanced stages due to inadequate healthcare infrastructure (GLOBOCAN 2020). Current solutions, such as mobile health units and WHO screening programs, are constrained by insufficient funding, limited reach, and stigma, necessitating an accessible, low-cost, and accurate diagnostic tool for early detection to improve outcomes in underserved communities.\n",
        "\n",
        "## Objective\n",
        "\n",
        "Develop a modular machine learning pipeline using XGBoost to classify skin lesion images as benign or malignant, with data augmentation, hyperparameter tuning, and a retraining mechanism. The pipeline is split into `preprocessing.py`, `model.py`, and `prediction.py`, with this notebook demonstrating the full workflow and evaluation metrics.\n",
        "\n",
        "## Dataset\n",
        "\n",
        "- **Source**: ISIC dataset (assumed, based on file names like `ISIC_1431322.jpg`).\n",
        "- **Structure**: `data/train/` and `data/test/` with subfolders `benign/` and `malignant/`.\n",
        "- **Preprocessing**: Images resized to 172x251, normalized, and flattened for XGBoost. Training data includes augmentation (rotations, flips, brightness, grayscale).\n",
        "\n",
        "## Requirements\n",
        "\n",
        "```bash\n",
        "pip install numpy pillow tensorflow scikit-learn xgboost matplotlib seaborn scikit-plot\n",
        "```"
      ],
      "metadata": {}
    },
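    {
      "cell_type": "markdown",
      "source": [
        "The preprocessing described in the Dataset section can be sketched as follows. This is a minimal illustration, not the code in `preprocessing.py`: the helper name `image_to_feature_vector` is hypothetical, and both the width x height reading of 172x251 and the scaling to [0, 1] are assumptions.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "from PIL import Image\n",
        "\n",
        "TARGET_SIZE = (172, 251)  # assumed (width, height); the notebook only says 172x251\n",
        "\n",
        "\n",
        "def image_to_feature_vector(img, size=TARGET_SIZE):\n",
        "    # Force three channels, resize, scale pixel values to [0, 1], then flatten\n",
        "    resized = img.convert('RGB').resize(size)\n",
        "    pixels = np.asarray(resized, dtype=np.float32) / 255.0\n",
        "    return pixels.reshape(-1)\n",
        "\n",
        "\n",
        "# A solid-colour stand-in image; a real call would use Image.open(path)\n",
        "demo = Image.new('RGB', (600, 450), color=(120, 80, 60))\n",
        "features = image_to_feature_vector(demo)\n",
        "print(features.shape)  # (129516,) = 172 * 251 * 3\n",
        "```\n",
        "\n",
        "Flattening yields one feature per pixel channel, so each image becomes a 129,516-element vector under these assumptions."
      ],
      "metadata": {}
    },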
    {
      "cell_type": "markdown",
      "source": [
        "## Step 1: Import Dependencies and Scripts\n",
        "\n",
        "Import the necessary libraries and our modular scripts from the `src/` directory."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "import scikitplot as skplt\n",
        "from src.preprocessing import load_dataset, preprocess_single_image\n",
        "from src.model import create_model, train_model, evaluate_model, save_model, trigger_retrain\n",
        "from src.prediction import load_model, predict_single_image, predict_batch\n",
        "\n",
        "# Set paths\n",
        "train_dir = 'data/train'\n",
        "test_dir = 'data/test'\n",
        "model_path = 'models/optimized_xgb_model.pkl'\n",
        "new_data_dir = 'data/new_data'  # For retraining\n",
        "os.makedirs('models', exist_ok=True)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 2: Load and Preprocess Data\n",
        "\n",
        "Load the training and test datasets using `preprocessing.py`, applying augmentation to training data and no augmentation to test data."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Load training data with augmentation\n",
        "train_gen, train_samples = load_dataset(train_dir, batch_size=32, augmentation=True, normalize=True)\n",
        "print(f'Loaded {train_samples} training samples')\n",
        "\n",
        "# Load test data without augmentation\n",
        "test_gen, test_samples = load_dataset(test_dir, batch_size=32, augmentation=False, normalize=True)\n",
        "print(f'Loaded {test_samples} test samples')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },
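    {
      "cell_type": "markdown",
      "source": [
        "The augmentations applied when `augmentation=True` (rotations, flips, brightness, grayscale) are implemented in `preprocessing.py` and are not shown here. The sketch below illustrates the same four transformations with Pillow. The function name `augment`, the rotation angles, the brightness range, and the probabilities are all assumptions chosen for illustration.\n",
        "\n",
        "```python\n",
        "import random\n",
        "from PIL import Image, ImageEnhance, ImageOps\n",
        "\n",
        "\n",
        "def augment(img, rng):\n",
        "    # Rotation by a multiple of 90 degrees; expand=True keeps the whole image\n",
        "    out = img.rotate(rng.choice([0, 90, 180, 270]), expand=True)\n",
        "    # Horizontal flip half of the time\n",
        "    if rng.random() < 0.5:\n",
        "        out = ImageOps.mirror(out)\n",
        "    # Brightness factor between 0.8 (darker) and 1.2 (brighter)\n",
        "    out = ImageEnhance.Brightness(out).enhance(rng.uniform(0.8, 1.2))\n",
        "    # Occasional grayscale, converted back to RGB so the channel count is stable\n",
        "    if rng.random() < 0.2:\n",
        "        out = ImageOps.grayscale(out).convert('RGB')\n",
        "    return out\n",
        "\n",
        "\n",
        "rng = random.Random(0)  # seeded so the sketch is reproducible\n",
        "demo = Image.new('RGB', (172, 251), color=(150, 100, 90))\n",
        "variants = [augment(demo, rng) for _ in range(4)]\n",
        "print([v.size for v in variants])\n",
        "```\n",
        "\n",
        "Because 90 and 270 degree rotations swap width and height, augmentation in this sketch would have to run before the resize step."
      ],
      "metadata": {}
    },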
    {
      "cell_type": "markdown",
      "source": [
        "## Step 3: Train the Model\n",
        "\n",
        "Create and train the XGBoost model with hyperparameter tuning using `model.py`."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Create and train model\n",
        "model = create_model()\n",
        "if model:\n",
        "    model = train_model(model, train_dir, batch_size=32, tune_hyperparameters=True)\n",
        "    if model:\n",
        "        save_model(model, model_path)\n",
        "    else:\n",
        "        print('Model training failed.')\n",
        "else:\n",
        "    print('Model creation failed.')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },
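    {
      "cell_type": "markdown",
      "source": [
        "What `tune_hyperparameters=True` does inside `train_model` is not shown in this notebook. One common approach is a cross-validated grid search, sketched below on synthetic data. The parameter grid, the `recall` scoring choice, and the synthetic features are assumptions for illustration and are not taken from `model.py`.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "from sklearn.model_selection import GridSearchCV\n",
        "from xgboost import XGBClassifier\n",
        "\n",
        "# Synthetic stand-in for flattened image features (illustrative only)\n",
        "rng = np.random.default_rng(0)\n",
        "X = rng.random((60, 20))\n",
        "y = (X[:, 0] + X[:, 1] > 1.0).astype(int)\n",
        "\n",
        "# Assumed search space; a real grid would be chosen for the actual data\n",
        "param_grid = {'max_depth': [2, 3], 'learning_rate': [0.1, 0.3]}\n",
        "\n",
        "search = GridSearchCV(\n",
        "    XGBClassifier(n_estimators=20, eval_metric='logloss'),\n",
        "    param_grid,\n",
        "    scoring='recall',  # assumption: a missed malignant lesion is the costlier error\n",
        "    cv=3,\n",
        ")\n",
        "search.fit(X, y)\n",
        "best_model = search.best_estimator_\n",
        "print(search.best_params_)\n",
        "```\n",
        "\n",
        "After fitting, `best_estimator_` is refit on all the training data with the best parameters, so it can be saved and used directly."
      ],
      "metadata": {}
    },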
    {
      "cell_type": "markdown",
      "source": [
        "## Step 4: Evaluate the Model\n",
        "\n",
        "Evaluate the model on the test dataset and display metrics (accuracy, precision, recall, F1-score, confusion matrix)."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "if model:\n",
        "    metrics = evaluate_model(model, test_dir, batch_size=32)\n",
        "    if metrics:\n",
        "        print('Evaluation Metrics:')\n",
        "        for key, value in metrics.items():\n",
        "            if key != 'confusion_matrix':\n",
        "                print(f'{key}: {value:.4f}')\n",
        "            else:\n",
        "                print(f'{key}:\\n{np.array(value)}')\n",
        "\n",
        "        # Visualize confusion matrix\n",
        "        cm = np.array(metrics['confusion_matrix'])\n",
        "        plt.figure(figsize=(6, 4))\n",
        "        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])\n",
        "        plt.title('Confusion Matrix')\n",
        "        plt.xlabel('Predicted')\n",
        "        plt.ylabel('True')\n",
        "        plt.show()\n",
        "\n",
        "        # ROC Curve\n",
        "        test_gen, _ = load_dataset(test_dir, batch_size=32, augmentation=False, normalize=True)\n",
        "        X_test, y_test = [], []\n",
        "        for _ in range(test_samples // 32 + 1):\n",
        "            batch_x, batch_y = next(test_gen)\n",
        "            X_test.append(batch_x)\n",
        "            y_test.append(batch_y)\n",
        "        X_test = np.vstack(X_test)\n",
        "        y_test = np.hstack(y_test)\n",
        "        y_scores = model.predict_proba(X_test)[:, 1]\n",
        "        skplt.metrics.plot_roc(y_test, model.predict_proba(X_test), plot_micro=False, plot_macro=False, classes_to_plot=[1])\n",
        "        plt.title('ROC Curve for Malignant Class')\n",
        "        plt.show()\n",
        "else:\n",
        "    print('No model available for evaluation.')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },
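    {
      "cell_type": "markdown",
      "source": [
        "The internals of `evaluate_model` are not shown in this notebook. As a sketch of how the listed metrics relate to one another, the example below computes them with scikit-learn on a handful of toy labels. The labels, probabilities, and the 0.5 decision threshold are illustrative assumptions, not results from this project.\n",
        "\n",
        "```python\n",
        "from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,\n",
        "                             precision_score, recall_score, roc_auc_score)\n",
        "\n",
        "# Toy data: 1 = malignant, 0 = benign\n",
        "y_true = [0, 0, 0, 0, 1, 1, 1, 1]\n",
        "y_prob = [0.1, 0.4, 0.6, 0.2, 0.9, 0.7, 0.3, 0.8]\n",
        "y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 threshold\n",
        "\n",
        "summary = {\n",
        "    'accuracy': accuracy_score(y_true, y_pred),\n",
        "    'precision': precision_score(y_true, y_pred),\n",
        "    'recall': recall_score(y_true, y_pred),\n",
        "    'f1_score': f1_score(y_true, y_pred),\n",
        "    'roc_auc': roc_auc_score(y_true, y_prob),\n",
        "}\n",
        "# Rows are true classes, columns are predicted classes\n",
        "cm = confusion_matrix(y_true, y_pred)\n",
        "\n",
        "for name, value in summary.items():\n",
        "    print(f'{name}: {value:.4f}')\n",
        "print(cm)\n",
        "```\n",
        "\n",
        "Here one benign lesion is flagged as malignant and one malignant lesion is missed, so accuracy, precision, recall, and F1 all equal 0.75, while the ROC AUC of 0.875 reflects the ranking of the probabilities rather than the thresholded labels."
      ],
      "metadata": {}
    },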
    {
      "cell_type": "markdown",
      "source": [
        "## Step 5: Make Predictions\n",
        "\n",
        "Demonstrate single-image and batch predictions using `prediction.py`."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Single image prediction\n",
        "sample_image = 'data/test/benign/ISIC_1431322.jpg'\n",
        "if model:\n",
        "    result = predict_single_image(model, sample_image)\n",
        "    if result:\n",
        "        print(f\"Single Image Prediction: {result['image_path']}\")\n",
        "        print(f\"  Predicted: {result['prediction']}, Probability: {result['probability']:.4f}\")\n",
        "\n",
        "# Batch prediction\n",
        "predictions = predict_batch(model, test_dir, batch_size=32)\n",
        "print('\\nBatch Prediction Results (first 10):')\n",
        "for pred in predictions[:10]:\n",
        "    print(f\"Image: {pred['image_path']}, Predicted: {pred['prediction']}, Probability: {pred['probability']:.4f}\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 6: Retrain the Model\n",
        "\n",
        "Trigger retraining if new data is available or performance is below threshold (0.8)."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "if model:\n",
        "    model = trigger_retrain(model, new_data_dir, model_path, performance_threshold=0.8, test_dir=test_dir)\n",
        "    if model:\n",
        "        print('Retraining completed successfully.')\n",
        "    else:\n",
        "        print('Retraining failed.')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },
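    {
      "cell_type": "markdown",
      "source": [
        "The decision rule behind `trigger_retrain` (new data is available, or performance is below the threshold) can be sketched as a small predicate. The function name `should_retrain`, the list of image extensions, and the idea of passing in a single score are assumptions; the actual logic lives in `model.py` and is not shown here.\n",
        "\n",
        "```python\n",
        "import os\n",
        "import tempfile\n",
        "\n",
        "IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set of accepted files\n",
        "\n",
        "\n",
        "def should_retrain(new_data_dir, current_score, threshold=0.8):\n",
        "    # True when any image file exists under new_data_dir (searched recursively)\n",
        "    has_new_data = os.path.isdir(new_data_dir) and any(\n",
        "        name.lower().endswith(IMAGE_EXTENSIONS)\n",
        "        for _, _, files in os.walk(new_data_dir)\n",
        "        for name in files\n",
        "    )\n",
        "    # Retrain on new data or on a score below the threshold\n",
        "    return has_new_data or current_score < threshold\n",
        "\n",
        "\n",
        "with tempfile.TemporaryDirectory() as empty_dir:\n",
        "    print(should_retrain(empty_dir, current_score=0.91))  # False: no data, score is fine\n",
        "    print(should_retrain(empty_dir, current_score=0.74))  # True: score below 0.8\n",
        "```\n",
        "\n",
        "Keeping the decision separate from the training call makes the rule easy to test without loading a model."
      ],
      "metadata": {}
    },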
    {
      "cell_type": "markdown",
      "source": [
        "## Step 7: Visualize Confusion Matrix (Chart.js)\n",
        "\n",
        "Create an interactive confusion matrix visualization using Chart.js."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "%%javascript\n",
        "if (metrics && metrics['confusion_matrix']) {\n",
        "    const cm = metrics['confusion_matrix'];\n",
        "    const ctx = document.createElement('canvas').getContext('2d');\n",
        "    document.body.appendChild(ctx.canvas);\n",
        "\n",
        "```chartjs\n",
        "{\n",
        "  \"type\": \"matrix\",\n",
        "  \"data\": {\n",
        "    \"datasets\": [{\n",
        "      \"label\": \"Confusion Matrix\",\n",
        "      \"data\": [\n",
        "        {\"x\": \"Benign\", \"y\": \"Benign\", \"v\": cm[0][0]},\n",
        "        {\"x\": \"Malignant\", \"y\": \"Benign\", \"v\": cm[0][1]},\n",
        "        {\"x\": \"Benign\", \"y\": \"Malignant\", \"v\": cm[1][0]},\n",
        "        {\"x\": \"Malignant\", \"y\": \"Malignant\", \"v\": cm[1][1]}\n",
        "      ],\n",
        "      \"backgroundColor\": \"rgba(54, 162, 235, 0.5)\",\n",
        "      \"borderColor\": \"rgba(54, 162, 235, 1)\",\n",
        "      \"borderWidth\": 1\n",
        "    }]\n",
        "  },\n",
        "  \"options\": {\n",
        "    \"plugins\": {\n",
        "      \"title\": {\n",
        "        \"display\": true,\n",
        "        \"text\": \"Confusion Matrix\"\n",
        "      }\n",
        "    },\n",
        "    \"scales\": {\n",
        "      \"x\": {\n",
        "        \"title\": {\n",
        "          \"display\": true,\n",
        "          \"text\": \"Predicted\"\n",
        "        },\n",
        "        \"ticks\": {\n",
        "          \"autoSkip\": false,\n",
        "          \"maxRotation\": 0,\n",
        "          \"minRotation\": 0\n",
        "        }\n",
        "      },\n",
        "      \"y\": {\n",
        "        \"title\": {\n",
        "          \"display\": true,\n",
        "          \"text\": \"True\"\n",
        "        },\n",
        "        \"ticks\": {\n",
        "          \"autoSkip\": false,\n",
        "          \"maxRotation\": 0,\n",
        "          \"minRotation\": 0\n",
        "        }\n",
        "      }\n",
        "    }\n",
        "  }\n",
        "}\n",
        "```\n",
        "}"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }
  ]
}