# Task 3: Predictive Analytics for Resource Allocation

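This notebook demonstrates predictive analytics for resource allocation using the Kaggle Breast Cancer Dataset. The workflow preprocesses the data, maps each diagnosis to an issue priority (high/medium/low) as a stand-in for real priority labels, trains a Random Forest classifier to predict that priority, and evaluates it with accuracy and F1-score. If the `diagnosis` column contains only `M` and `B`, as the mapping in Section 3 assumes, the medium class never occurs and the task is effectively binary.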

## 1. Import Required Libraries
Import pandas, numpy, scikit-learn, and matplotlib/seaborn for data handling, modeling, and visualization.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

## 2. Load and Explore the Kaggle Breast Cancer Dataset
Load the dataset into a pandas DataFrame, display the first few rows, and summarize key statistics and class distributions.

In [None]:
# Load the Kaggle Breast Cancer Dataset
# Please ensure the dataset CSV is in the same directory or provide the correct path
# Example filename: 'breast_cancer_data.csv'
df = pd.read_csv('breast_cancer_data.csv')
df.head()
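
If the Kaggle CSV is not at hand, scikit-learn ships a breast cancer dataset that can serve as a stand-in. The sketch below is an assumption, not part of the original workflow: it assumes the built-in data is an acceptable substitute for the Kaggle file, and the helper name `load_builtin_breast_cancer` is hypothetical. scikit-learn encodes the target as 0 = malignant and 1 = benign and uses its own column names, so the sketch rebuilds an `M`/`B` `diagnosis` column to match the rest of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def load_builtin_breast_cancer():
    """Return scikit-learn's breast cancer data with an M/B 'diagnosis' column."""
    bunch = load_breast_cancer(as_frame=True)
    frame = bunch.frame.copy()
    # scikit-learn's target is 0 for malignant and 1 for benign
    frame['diagnosis'] = frame['target'].map({0: 'M', 1: 'B'})
    return frame.drop(columns=['target'])


df_builtin = load_builtin_breast_cancer()
print(df_builtin.shape)
print(df_builtin['diagnosis'].value_counts())
```

Column names in this built-in copy differ from the Kaggle CSV (for example `mean radius` rather than an underscore-style name), so any code that selects features by name would need adjusting.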

In [None]:
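# Display summary statistics and class distribution
# (only the last expression in a cell is displayed automatically, so print each result)
print('Dataset shape:', df.shape)
print(df.describe())
print(df['diagnosis'].value_counts())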
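
Class counts are easier to read as proportions, and the share of the largest class is the accuracy a model would reach by always predicting that class. A minimal sketch on made-up labels (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy labels for illustration only; in the notebook this would be df['diagnosis']
diagnosis = pd.Series(['B', 'B', 'B', 'M', 'M', 'B', 'M', 'B'], name='diagnosis')

counts = diagnosis.value_counts()
proportions = diagnosis.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = proportions.max()

print(counts)
print(proportions.round(3))
print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
```

Any trained model should be compared against this baseline before its accuracy is taken as meaningful.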

## 3. Data Preprocessing (Cleaning, Label Encoding, Splitting)
Handle missing values, map each diagnosis to an issue priority (high/medium/low), encode that priority as an integer label, and split the data into training and test sets.
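
One pitfall with a blanket `dropna()` is a column that is empty in every row, for example a trailing unnamed column produced by a stray delimiter in a CSV: every row then contains a missing value, so all rows are dropped. The sketch below uses a made-up frame (the column names are illustrative) to contrast dropping rows directly with dropping all-empty columns first.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; 'empty_col' mimics a column with no values at all
toy = pd.DataFrame({
    'radius_mean': [14.1, 20.6, np.nan, 12.4],
    'diagnosis': ['B', 'M', 'B', 'B'],
    'empty_col': [np.nan, np.nan, np.nan, np.nan],
})

# Dropping rows directly removes everything, because every row has a NaN
naive = toy.dropna()

# Dropping all-empty columns first keeps the rows that are otherwise complete
cleaned = toy.dropna(axis=1, how='all').dropna()

print(len(naive), len(cleaned))
```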

In [None]:
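# Data Cleaning: Handle missing values
print('Missing values per column:')
print(df.isnull().sum())
# Drop columns that are entirely empty first (some CSV exports include one);
# otherwise dropping rows with any missing value would discard every row
df = df.dropna(axis=1, how='all')
df = df.dropna()  # Drop remaining rows with missing values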

# Label Encoding: Map diagnosis to priority (example mapping)
# 'M' -> high, 'B' -> low; any other value falls back to medium
# You may need to adjust this mapping based on your use case
def map_priority(diagnosis):
    if diagnosis == 'M':
        return 'high'
    elif diagnosis == 'B':
        return 'low'
    else:
        return 'medium'

df['priority'] = df['diagnosis'].apply(map_priority)

# Encode priority as numbers for classification
priority_mapping = {'low': 0, 'medium': 1, 'high': 2}
df['priority_label'] = df['priority'].map(priority_mapping)

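# Feature selection (drop non-feature columns)
X = df.drop(['diagnosis', 'priority', 'priority_label'], axis=1)
# Drop an identifier column if the CSV has one (assumed here to be named 'id');
# a row identifier carries no predictive signal
X = X.drop(columns=['id'], errors='ignore')
y = df['priority_label']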

# Split the data (stratify so train and test keep the same class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
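
With imbalanced classes, a purely random split can leave the test set with a different class mix than the training set. `train_test_split` accepts a `stratify` argument that preserves the class proportions in both parts. A small sketch on synthetic labels (the data is generated for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 samples of class 0 and 20 of class 2 (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [2] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the 80/20 class ratio
print('train share of class 2:', (y_tr == 2).mean())
print('test share of class 2:', (y_te == 2).mean())
```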

## 4. Train Random Forest Classifier to Predict Issue Priority
Initialize and train a Random Forest classifier on the training data to predict the issue priority class.

In [None]:
# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
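
Once fitted, a Random Forest exposes impurity-based feature importances through `feature_importances_`, which can hint at which measurements drive the predicted priority. The sketch below trains on synthetic data from `make_classification`, so the feature names (`feature_0`, ...) and values are placeholders rather than results from the notebook's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train / y_train (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_syn.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_syn, y_syn)

# Importances are non-negative and sum to 1 across features
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

In the notebook, the same pattern would use `rf.feature_importances_` with `X.columns` as the index.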

## 5. Evaluate Model Performance (Accuracy and F1-score)
Use the trained model to predict on the test set and calculate accuracy and F1-score using scikit-learn metrics.

In [None]:
# Predict and evaluate
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'Accuracy: {accuracy:.4f}')
print(f'F1-score: {f1:.4f}')
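
With `average='weighted'`, scikit-learn computes F1 per class and averages the per-class scores weighted by each class's number of true samples (its support). The sketch below checks this on a small hand-made example; the labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels: four 'low' (0) and two 'high' (2) samples
y_true_demo = np.array([0, 0, 0, 0, 2, 2])
y_pred_demo = np.array([0, 0, 0, 2, 2, 2])

# Per-class F1, in the order of the sorted labels [0, 2]
per_class = f1_score(y_true_demo, y_pred_demo, average=None)

# Support = number of true samples per class
support = np.array([(y_true_demo == 0).sum(), (y_true_demo == 2).sum()])
manual_weighted = (per_class * support).sum() / support.sum()

weighted = f1_score(y_true_demo, y_pred_demo, average='weighted')
macro = f1_score(y_true_demo, y_pred_demo, average='macro')

print(f'per-class F1: {per_class.round(4)}')
print(f'weighted F1: {weighted:.4f} (manual: {manual_weighted:.4f})')
print(f'macro F1: {macro:.4f}')
```

The weighted average favors the larger class, while `average='macro'` treats every class equally, which can be the more informative choice when the minority class matters most.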

## 6. Display and Interpret Performance Metrics
Visualize the confusion matrix and print the per-class classification report to see which priority classes the model confuses with each other.

In [None]:
# Confusion matrix and classification report
# Pass the labels explicitly so all three priority classes appear,
# even when a class (e.g. 'medium') is absent from the test set
labels = list(priority_mapping.values())
names = list(priority_mapping.keys())
cm = confusion_matrix(y_test, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=names, yticklabels=names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

print('Classification Report:')
print(classification_report(y_test, y_pred, labels=labels, target_names=names, zero_division=0))
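
In a scikit-learn confusion matrix, rows are actual classes and columns are predicted classes. Passing `labels` fixes the row and column order and keeps a row and column for a class that never occurs, which matters here if no sample is mapped to medium. A tiny sketch with invented predictions:

```python
from sklearn.metrics import confusion_matrix

# Same encoding as the notebook: low=0, medium=1, high=2
priority_codes = [0, 1, 2]

# Invented labels with no 'medium' samples
y_true_cm = [0, 0, 2, 2]
y_pred_cm = [0, 2, 2, 2]

cm_demo = confusion_matrix(y_true_cm, y_pred_cm, labels=priority_codes)
print(cm_demo)

# Row i, column j counts samples of actual class i predicted as class j
low_predicted_as_high = cm_demo[0, 2]
```

For resource allocation, the off-diagonal cell in the high row and low column (`cm_demo[2, 0]`) is usually the costliest error, since it counts high-priority cases the model ranked as low.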