# 🩺 Predictive Analytics for Resource Allocation in Breast-Cancer Care  
*Using the Kaggle Breast Cancer Wisconsin (Diagnostic) Dataset*

---

## 1. Introduction
Early detection and timely intervention are decisive factors in breast-cancer outcomes.  
Yet oncology clinics often face bottlenecks when determining which patients require immediate attention versus routine follow-up.  
This project uses machine learning to **triage** incoming cases into three urgency tiers (**high, medium, low**) so that scarce resources (imaging slots, specialist reviews, operating-theatre time) can be deployed where they have the greatest impact.

---

## 2. Problem Statement
Manual triage relies on clinician intuition and may vary between practitioners, leading to delays or misallocation of care.  
A data-driven, reproducible model that predicts case urgency can:

* Shorten wait times for high-risk patients.  
* Reduce unnecessary fast-tracking of clearly benign cases.  
* Provide transparent criteria for scheduling and staffing decisions.

---

## 3. Main Objective
> **Develop and validate a machine-learning model that predicts the priority level (high / medium / low) of breast-cancer cases using the Kaggle Breast Cancer Wisconsin dataset, thereby providing actionable insights for resource allocation in clinical workflows.**

---

## 4. Specific Objectives
1. **Data Understanding** – load dataset, explore feature distributions, flag quality issues.  
2. **Data Preparation** – clean data, engineer 3-class priority label, scale numeric features.  
3. **Dataset Partitioning** – create a stratified 80 / 20 train/test split.  
4. **Model Development** – train a Random Forest classifier (with Logistic Regression as an optional baseline).  
5. **Model Evaluation** – compute Accuracy & Macro F1; visualise confusion matrix.  
6. **Feature Interpretation** – identify top drivers of high-priority predictions.  
7. **Deployment-Readiness** – present a clear, annotated notebook with actionable insights.  
8. **Deliverables** – Jupyter Notebook plus performance-metrics summary.

---

## 5. Dataset
| Item | Details |
|------|---------|
| **Source** | Kaggle: `uciml/breast-cancer-wisconsin-data` |
| **Size** | 569 rows × 32 columns (the Kaggle CSV also carries an empty trailing `Unnamed: 32` column) |
| **Features** | 30 numeric tumour-cell measurements (radius, perimeter, area, concavity, etc.), plus ID & diagnosis |
| **Classes (original)** | `M` = malignant (37%), `B` = benign (63%) |

### 5.1 Priority-Label Engineering  
| Priority | Rule |
|----------|------|
| **High** | `diagnosis == "M"` |
| **Medium** | `diagnosis == "B"` **AND** `radius_mean` ≥ median `radius_mean` of benign cases |
| **Low** | `diagnosis == "B"` **AND** `radius_mean` < median `radius_mean` of benign cases |

This yields a roughly balanced 3-class target for hospital triage: about 37% High, with the benign 63% split approximately evenly between Medium and Low.
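The rules in the table can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's actual code: the helper name `add_priority` and the toy rows are hypothetical, and the sketch assumes the Kaggle column names `diagnosis` and `radius_mean`.

```python
import numpy as np
import pandas as pd


def add_priority(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 3-class `priority` column (hypothetical helper)."""
    out = df.copy()
    is_malignant = out["diagnosis"] == "M"
    is_benign = out["diagnosis"] == "B"
    # The threshold is the median radius_mean over benign cases only.
    benign_median = out.loc[is_benign, "radius_mean"].median()
    out["priority"] = np.select(
        [is_malignant, is_benign & (out["radius_mean"] >= benign_median)],
        ["high", "medium"],
        default="low",
    )
    return out


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B", "B", "B"],
        "radius_mean": [18.0, 10.0, 11.0, 12.0, 13.0],
    }
)
labelled = add_priority(toy)
print(labelled["priority"].tolist())
```

In the toy frame the benign median is 11.5, so the two smaller benign rows fall into the low tier and the two larger ones into the medium tier.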

---

## 6. Methodology

| Step | Action |
|------|--------|
| **Exploratory Analysis** | Checked feature distributions and confirmed there are no missing values in the feature columns. |
| **Cleaning** | Dropped `id` and the empty trailing column (`Unnamed: 32` in the Kaggle CSV). |
| **Scaling** | Applied `StandardScaler` to all numeric features. |
| **Split** | 80 / 20 stratified train–test split (`random_state=42`). |
| **Model** | `RandomForestClassifier` with `class_weight='balanced'` (default hyperparameters, used as a baseline). |
| **Metrics** | Accuracy and Macro-averaged F1-Score on unseen test set. |
| **Interpretability** | Impurity-based feature importances plotted (top 10). |

*(Hyper-parameter tuning can be added in future work.)*
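The steps in the table could be wired together roughly as shown below. Several things here are assumptions rather than the notebook's confirmed code: the sketch loads scikit-learn's bundled copy of the Wisconsin data via `load_breast_cancer` instead of the Kaggle CSV (so the column is named `mean radius` and the target is coded 0 = malignant, 1 = benign), it places the scaler inside a `Pipeline` so that it is fitted on the training split only, and it fixes the forest's `random_state` for reproducibility. Scores from this sketch will not necessarily match the reported ones.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
X, diagnosis = data.data, data.target  # target: 0 = malignant, 1 = benign

# Priority label as in section 5.1 ("mean radius" is Kaggle's radius_mean).
benign = diagnosis == 1
benign_median = X.loc[benign, "mean radius"].median()
y = np.select(
    [~benign, X["mean radius"] >= benign_median],
    ["high", "medium"],
    default="low",
)

# 80 / 20 stratified split, as in the methodology table.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Scaler and forest in one pipeline; the scaler only ever sees training data.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(class_weight="balanced", random_state=42),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}")
```

Scaling has no effect on a tree-based model's splits, so the `StandardScaler` step matters mainly if the optional Logistic Regression baseline is swapped in.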

---

## 7. Results

| Metric | Score |
|--------|-------|
| **Accuracy** | **0.965** |
| **Macro F1-Score** | **0.965** |

### 7.1 Class-wise Performance
| Priority | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| High     | 0.97 | 0.93 | 0.95 | 42 |
| Medium   | 0.95 | 0.97 | 0.96 | 36 |
| Low      | 0.97 | 1.00 | 0.99 | 36 |

### 7.2 Confusion-Matrix Highlights
* **High → High:** 39  **High → Medium:** 3  
* **Medium → Medium:** 35  
* **Low → Low:** 36 (no low-priority cases mis-flagged as high)
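
The headline accuracy and the per-class recall values can be re-derived from the counts reported above. The short check below uses only the diagonal counts from section 7.2 and the supports from section 7.1; the variable names are illustrative.

```python
# Diagonal counts and supports as reported in sections 7.1 and 7.2.
correct = {"high": 39, "medium": 35, "low": 36}
support = {"high": 42, "medium": 36, "low": 36}

# Recall = correctly predicted cases / true cases of that class.
recall = {label: correct[label] / support[label] for label in support}

# Accuracy = all correct predictions / all test cases.
accuracy = sum(correct.values()) / sum(support.values())

for label, value in recall.items():
    print(f"{label:<6} recall = {value:.2f}")
print(f"accuracy = {accuracy:.3f}")
```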

### 7.3 Top 10 Predictive Features
`radius_mean`, `area_worst`, `perimeter_mean`, `area_mean`, `perimeter_worst`,  
`radius_worst`, `concave points_worst`, `concave points_mean`,  
`concavity_mean`, `concavity_worst`.

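These align with the clinical intuition that size and shape irregularities signal malignancy and therefore urgency. The first-place ranking of `radius_mean` also reflects the label design: the Medium/Low boundary in section 5.1 is defined directly on that feature.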
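
A ranking of this kind can be read from a fitted forest's `feature_importances_` attribute. The sketch below is illustrative only: it assumes scikit-learn's bundled copy of the data (feature names such as `mean radius` rather than `radius_mean`), rebuilds the priority label from section 5.1, and skips scaling because it does not change tree splits. The exact ordering it prints may differ from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled WDBC copy stands in for the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
X, diagnosis = data.data, data.target  # target: 0 = malignant, 1 = benign

benign = diagnosis == 1
benign_median = X.loc[benign, "mean radius"].median()
y = np.select(
    [~benign, X["mean radius"] >= benign_median],
    ["high", "medium"],
    default="low",
)

X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

forest = RandomForestClassifier(class_weight="balanced", random_state=42)
forest.fit(X_train, y_train)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10.round(3))
```

Impurity-based importances tend to spread credit across strongly correlated features (radius, perimeter, and area here), so the ranking is best read as a group-level signal rather than a strict ordering.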

---

## 8. Conclusion
On the held-out test set, the Random Forest model classifies breast-cancer cases into high, medium, or low priority with a macro F1-Score of **0.965**, making it a promising baseline for a clinical triage decision-support tool.  
The tumour measurements that drive its predictions are explicit and clinically meaningful, so oncologists can inspect the basis for a recommendation before acting on it.  
Deploying this model within an electronic health-record or scheduling system can:

* Fast-track genuinely urgent cases.  
* Optimise utilisation of imaging and biopsy resources.  
* Provide consistent, data-driven triage across practitioners.

---

## 9. Future Work
* **Hyper-parameter optimisation** (`GridSearchCV` / Bayesian search) for a potential gain of 1-2 percentage points over the default settings.  
* Compare additional algorithms (e.g., XGBoost, LightGBM).  
* Deploy as a REST API and embed in hospital dashboards.  
* Validate on external datasets to test generalisability.  
* Incorporate cost-sensitive learning to reflect real-world resource constraints.
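
As a starting point for the first item, a small grid search might look like the sketch below. The grid values, the 3-fold setting, and the use of scikit-learn's bundled copy of the data are all assumptions chosen to keep the example quick; they are not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: scikit-learn's bundled WDBC copy stands in for the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
X, diagnosis = data.data, data.target  # target: 0 = malignant, 1 = benign

benign = diagnosis == 1
benign_median = X.loc[benign, "mean radius"].median()
y = np.select(
    [~benign, X["mean radius"] >= benign_median],
    ["high", "medium"],
    default="low",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Illustrative grid only; a real search would cover more settings.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1_macro",  # same metric family as the reported Macro F1
    cv=3,
)
search.fit(X_train, y_train)  # the test split stays untouched during tuning

print(search.best_params_)
print(f"cross-validated macro F1: {search.best_score_:.3f}")
```

Tuning on cross-validation folds of the training split keeps the 20% test set available for a final, unbiased comparison against the default baseline.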

---

## 10. References
1. Wolberg, W. et al. “Breast Cancer Wisconsin (Diagnostic).” UCI Machine Learning Repository.  
2. Breiman, L. “Random Forests.” *Machine Learning* 45, 5–32 (2001).  
3. scikit-learn documentation, v1.4.  

---

*Prepared by Emmanuel Ouma — June 2025*
