# Task 3 – Predictive Analytics for Resource Allocation
Dataset: Breast Cancer Dataset from Kaggle
Goal: Predict issue priority (High or Low; the binary diagnosis column provides no Medium class)


## Step 1 – Import Required Libraries

We'll import `pandas` and `numpy` for data handling, `scikit-learn` for machine learning, and metrics to evaluate model performance.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report


## Step 2 – Load and Explore the Dataset

We begin by loading the Breast Cancer dataset using `pandas`. This step helps us get a feel for the structure of the data, check for missing values, and understand feature distributions. We'll print the first few rows, view column types, and generate summary statistics.


In [None]:
# Load the dataset (make sure the .csv file is in the same folder as this notebook)
df = pd.read_csv("breast-cancer.csv")  # Replace with the correct filename if different

# Column types and non-null counts
df.info()

# Summary statistics for the numeric columns
print(df.describe())

# Preview the first 5 rows
df.head()


## Step 3 – Data Preprocessing and Feature Engineering

We’ll clean the dataset by checking for missing values and transforming categorical data. We'll also create a new `priority` label based on diagnosis, where:
- `M` (Malignant) = High priority
- `B` (Benign) = Low priority

This label will be used as the prediction target for our classifier.


In [None]:
# Check for missing values
print("Missing values per column:\n", df.isnull().sum())

# Encode the diagnosis column as 'priority' label
df['priority'] = df['diagnosis'].map({'M': 'high', 'B': 'low'})

# Confirm that the mapping worked
df['priority'].value_counts()
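
One thing worth checking after a dictionary-based `map` is whether any value fell outside the mapping, because pandas turns those rows into `NaN` without raising an error. The sketch below illustrates the check on a small, hypothetical stand-in frame (`toy` is not the real dataset, and the lowercase `"m"` is a made-up dirty value):

```python
import pandas as pd

# Hypothetical stand-in for the real `diagnosis` column, with one unexpected value ("m").
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "m", "M"]})

priority_map = {"M": "high", "B": "low"}
toy["priority"] = toy["diagnosis"].map(priority_map)

# Values that are missing from the mapping become NaN; list them and count the affected rows.
unmapped = toy.loc[toy["priority"].isna(), "diagnosis"].unique().tolist()
n_unmapped = int(toy["priority"].isna().sum())

print("Unmapped diagnosis values:", unmapped)
print("Rows affected:", n_unmapped)
```

If the same check on `df` reports zero affected rows, every diagnosis was mapped to a priority.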


## Step 4 – Encode the Target Variable

Machine learning models work with numerical labels, so we'll encode the `priority` column. `LabelEncoder` assigns integers in sorted (alphabetical) order of the class names, which gives:
- `high` → 0
- `low` → 1

This creates a new target column called `priority_encoded`.


In [None]:
from sklearn.preprocessing import LabelEncoder

# Create the encoder and apply it to the 'priority' column
le = LabelEncoder()
df['priority_encoded'] = le.fit_transform(df['priority'])

# Display a few rows to confirm encoding
df[['priority', 'priority_encoded']].head()
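
To see which integer each class actually receives, it can help to inspect the fitted encoder directly. The following is a minimal sketch on a hand-written list of labels (`labels` and `le_demo` are illustrative names, not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the notebook's real labels live in df['priority'].
labels = ["low", "high", "high", "low"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is stored in sorted order, so position i is the class encoded as integer i.
class_to_code = {cls: code for code, cls in enumerate(le_demo.classes_)}
print("Class to code:", class_to_code)

# inverse_transform turns integer codes back into the original class names.
decoded = le_demo.inverse_transform([0, 1]).tolist()
print("Codes 0 and 1 decode to:", decoded)
```

Keeping the fitted encoder around, as the notebook does with `le`, makes it possible to translate predictions back into readable labels later.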


## Step 5 – Split the Data into Training and Test Sets

We'll divide our dataset so the model can learn patterns from the training set and be fairly tested on unseen data. We'll use an 80/20 split and set a `random_state` for reproducibility.


In [None]:
# Define features (X) and target (y)
# Drop the label columns, plus the 'id' identifier if the file has one (it carries no predictive signal)
X = df.drop(columns=['id', 'diagnosis', 'priority', 'priority_encoded'], errors='ignore')
y = df['priority_encoded']

# Split the dataset (train_test_split was imported in Step 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))
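
When one class is rarer than the other, a plain random split can leave the training and test sets with slightly different class proportions. One optional variation, not used in the cell above, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) with a 30% positive class to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 4 random features, 30% positive class.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

Whether stratification changes the results noticeably depends on the dataset size and the degree of imbalance.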


## Step 6 – Train a Random Forest Classifier

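We'll use a `RandomForestClassifier` to model the relationship between the features and issue priority. A random forest averages many decision trees, each trained on a bootstrapped sample of the data, which typically reduces overfitting compared with a single tree. It does not correct class imbalance by itself; if the classes are noticeably unbalanced, options such as `class_weight='balanced'` can help.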


In [None]:
# Initialize the model
model = RandomForestClassifier(random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)
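
A fitted random forest also exposes `feature_importances_`, which gives a rough view of which inputs the trees relied on. The sketch below is an illustration on synthetic data from `make_classification` rather than the notebook's dataset, and it sets `class_weight="balanced"` as one possible way to account for unequal class sizes (an assumption, not a setting used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, mildly imbalanced data standing in for the real features.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_syn, y_syn)

# Rank features by impurity-based importance (the values sum to 1).
ranked = sorted(enumerate(clf.feature_importances_), key=lambda pair: pair[1], reverse=True)
for idx, importance in ranked:
    print(f"feature_{idx}: {importance:.3f}")
```

Impurity-based importances can be biased toward high-cardinality features, so they are best read as a first impression rather than a definitive ranking.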


## Step 7 – Evaluate Model Performance

We'll use the test data to measure how well our Random Forest model performs. Specifically, we'll check:
- **Accuracy**: how often the model predicts the correct class
- **F1-score**: balances precision and recall, useful if classes are imbalanced
- **Classification report**: provides detailed performance by class


In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print("✅ Accuracy:", round(accuracy * 100, 2), "%")
print("✅ F1 Score:", round(f1, 2))
print("\nDetailed Report:\n", classification_report(y_test, y_pred, target_names=le.classes_))
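
For a triage setting, the most costly mistake is usually a high-priority case predicted as low. A confusion matrix makes that count explicit. The sketch below uses small hand-written label lists (hypothetical values, not model output) and assumes the coding 0 = high, 1 = low that `LabelEncoder` produces from alphabetical order:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 0 = high priority, 1 = low priority.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
missed_high = int(cm[0, 1])  # true high, predicted low

# Recall for the high-priority class: share of true high cases that were caught.
recall_high = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(cm)
print("High-priority cases missed:", missed_high)
print("Recall for high priority:", round(recall_high, 3))
```

The same calls can be applied to `y_test` and `y_pred` from the cell above.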


## Step 8 – Conclusion

In this notebook, we tackled predictive analytics by building a classification model using the Breast Cancer dataset from Kaggle. We:

- Cleaned and explored the data
- Engineered a new `priority` label based on diagnosis:
  - Malignant (M) → High
  - Benign (B) → Low
- Encoded labels numerically for training
- Split the dataset into training and testing sets
- Trained a `RandomForestClassifier`
- Evaluated performance using:
  - Accuracy
  - F1-score
  - Classification report

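The resulting model assigns a high or low priority to each case from its measured features, and the same workflow can serve as a starting point for resource planning and triage automation. How well it performs should be judged from the test-set metrics reported in Step 7.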


## Step 9 – Save the Trained Model

We'll use `joblib` to save the trained Random Forest model to a `.pkl` file. This allows us to reuse the model later without retraining.


In [None]:
import joblib

# Save the trained model
joblib.dump(model, 'priority_classifier_model.pkl')

print("✅ Model saved successfully as priority_classifier_model.pkl")


In [None]:
# Load the model from file
loaded_model = joblib.load('priority_classifier_model.pkl')

# Predict using the loaded model
loaded_model.predict(X_test[:5])
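
A simple way to confirm that a saved model was restored intact is to compare its predictions with those of the original object. The sketch below does this on synthetic data and writes to a temporary directory, so the file name `demo_model.pkl` and the variables are illustrative only:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem standing in for the notebook's data.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_syn, y_syn)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(clf.predict(X_syn), restored.predict(X_syn)))
print("Predictions identical after reload:", same_predictions)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.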


# Final Summary – Predictive Pipeline Completed ✅

In this project, we implemented an end-to-end predictive analytics pipeline using the Breast Cancer dataset. Here's a quick recap of what we achieved:

📌 **Step 1–2: Setup and Exploration**
- Imported key libraries (`pandas`, `scikit-learn`, etc.)
- Loaded the dataset and reviewed structure and distributions

🧹 **Step 3–4: Preprocessing**
- Mapped `diagnosis` to a new `priority` label: Malignant → high, Benign → low
- Encoded labels numerically for training

✂️ **Step 5: Data Splitting**
- Used `train_test_split` to separate the data (80/20)

🌲 **Step 6: Model Training**
- Trained a `RandomForestClassifier` on the processed training data

📈 **Step 7: Evaluation**
- Computed accuracy and F1-score on the test data
- Verified model performance with `classification_report`

💾 **Step 9: Model Preservation**
- Saved the trained model using `joblib.dump()` for reuse and deployment

---

✅ This pipeline serves as a baseline for classification projects and a strong step forward in mastering applied ML. Next steps could include:
- Testing more models (Logistic Regression, Gradient Boosting)
- Improving label granularity (e.g., high/medium/low based on severity)
- Deploying with Flask or FastAPI for real-world usage
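
As a rough sketch of the first suggestion, several models can be compared with cross-validation. The example below uses scikit-learn's bundled `load_breast_cancer` data as a stand-in, on the assumption that it resembles the Kaggle CSV; note that its target coding (0 = malignant, 1 = benign) is its own and is not the notebook's `priority_encoded`. The model settings are illustrative defaults, not tuned choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data bundled with scikit-learn (not the Kaggle file used above).
X_bc, y_bc = load_breast_cancer(return_X_y=True)

candidates = {
    # Logistic regression benefits from scaled inputs, hence the pipeline.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated weighted F1 for each candidate.
cv_f1 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_bc, y_bc, cv=5, scoring="f1_weighted")
    cv_f1[name] = float(scores.mean())
    print(f"{name}: mean weighted F1 = {cv_f1[name]:.3f}")
```

Cross-validated scores are usually a steadier basis for comparison than a single 80/20 split, since each model is evaluated on every row once.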

