# 🍄 Mushroom Classification Competition
## Asymmetric Classification - Maximizing Safety

**Goal:** Classify mushrooms as poisonous or edible while minimizing False Negatives (predicting a poisonous mushroom as edible), because eating a poisonous mushroom has catastrophic consequences.

**Approach:**
1. Data loading & exploration
2. Preprocessing pipeline (categorical features → one-hot encoding)
3. Baseline model (Random Forest)
4. ROC and precision-recall curve analysis across thresholds
5. Cost-balanced threshold tuning with `TunedThresholdClassifierCV`
6. Comparison with Youden's J statistic
7. Final predictions & submission file generation

**Repository:** [github.com/hzajkani/mushroom-ml-classification](https://github.com/hzajkani/mushroom-ml-classification)

---
## 1. Setup & Data Loading 🗄️

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import (
    train_test_split, cross_val_predict, cross_validate, KFold,
    TunedThresholdClassifierCV,  # requires scikit-learn >= 1.5
)
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    confusion_matrix, ConfusionMatrixDisplay,
    RocCurveDisplay, roc_curve, roc_auc_score,
    PrecisionRecallDisplay, precision_recall_curve,
    make_scorer
)

In [None]:
# Load training data
train_path = './data/7.4.3.1_mushroom_competition_train_data.csv'
mush = pd.read_csv(train_path).set_index('Id')
mush

In [None]:
# Quick data exploration
print(f"Shape: {mush.shape}")
print(f"\nData types:\n{mush.dtypes}")
print(f"\nMissing values:\n{mush.isnull().sum()}")

In [None]:
# Target distribution
mush.poisonous.value_counts(normalize=True)

The dataset is fairly balanced (~51% edible, ~49% poisonous). All features are categorical except `bruises`, which is boolean.

**Key Insight:** A False Negative (predicting a poisonous mushroom as edible) is far more dangerous than a False Positive (avoiding an edible mushroom). We need to prioritize **recall** for the poisonous class.
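
To make the asymmetry concrete, here is a minimal sketch on made-up labels (not the competition data), using 1 for poisonous as in the rest of the notebook. It counts both error types and shows that recall for the poisonous class is the share of poisonous mushrooms that were caught.

```python
import numpy as np

# Hypothetical labels for illustration only: 1 = poisonous, 0 = edible
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# False Negative: truly poisonous but predicted edible (the dangerous error)
false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))
# False Positive: truly edible but predicted poisonous (a wasted mushroom)
false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))
true_positives = int(np.sum((y_true == 1) & (y_pred == 1)))

# Recall = TP / (TP + FN): the fraction of poisonous mushrooms we caught
recall = true_positives / (true_positives + false_negatives)

print(f"False negatives: {false_negatives}")
print(f"False positives: {false_positives}")
print(f"Recall (poisonous): {recall:.2f}")
```

Both error types occur once here, yet only the False Negative lowers recall, which is why recall is the metric to watch.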

---
## 2. Data Preparation 🔧

In [None]:
# Separate features and target
X = mush.drop(columns=['poisonous'])
y = mush['poisonous']

# Convert boolean 'bruises' column to string for consistent encoding
X['bruises'] = X['bruises'].astype(str)

# Hold out 25% as a validation set (named X_test / y_test below) to check generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"X_train: {X_train.shape}")
print(f"X_test:  {X_test.shape}")

---
## 3. Baseline Model & Confusion Matrix 📊

Let's start with a Random Forest classifier (300 trees, otherwise default settings) to establish a baseline.
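
Before the forest sees any data, the pipeline in the next cell imputes missing categories and one-hot encodes every feature. The sketch below shows what those two steps do on a tiny made-up frame. The column name `cap_shape` and its values are hypothetical, and `.toarray()` is used here instead of `sparse_output=False` purely to keep the sketch version-agnostic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature for illustration (not the competition data)
toy = pd.DataFrame({"cap_shape": ["x", "b", np.nan]})

toy_preproc = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

# One output column per category seen during fit: 'b', 'missing', 'x'
encoded = toy_preproc.fit_transform(toy).toarray()
print(encoded)

# A category never seen during fit is encoded as all zeros, not an error
unseen = pd.DataFrame({"cap_shape": ["z"]})
encoded_unseen = toy_preproc.transform(unseen).toarray()
print(encoded_unseen)
```

The all-zero row is the reason for `handle_unknown='ignore'`: a category that appears only in the competition test set will not break prediction.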

In [None]:
# Preprocessing pipeline: impute missing → one-hot encode
preproc_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(handle_unknown='ignore', sparse_output=False)
)

# Full model pipeline
model = make_pipeline(
    preproc_pipeline,
    RandomForestClassifier(n_estimators=300, random_state=42)
)

# Cross-validation setup
splitter = KFold(n_splits=10, shuffle=True, random_state=42)

# Cross-validated predictions
cv_preds = cross_val_predict(model, X_train, y_train, cv=splitter)

In [None]:
# Baseline performance metrics
cv_acc = accuracy_score(y_train, cv_preds)
cv_precision = precision_score(y_train, cv_preds)
cv_recall = recall_score(y_train, cv_preds)

print(f"CV Accuracy:  {cv_acc:.4f}")
print(f"CV Precision: {cv_precision:.4f}")
print(f"CV Recall:    {cv_recall:.4f}")

# Confusion Matrix
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay.from_predictions(
    y_true=y_train, y_pred=cv_preds,
    display_labels=['Edible (0)', 'Poisonous (1)'],
    cmap='Blues', ax=ax
)
ax.set_title('Baseline Random Forest - CV Confusion Matrix')
plt.tight_layout()
plt.show()

The baseline model already performs well (~96% accuracy), but it still produces some **False Negatives**: poisonous mushrooms predicted as edible. For food safety, even a few of these are unacceptable!

Let's use threshold tuning to eliminate (or minimize) these dangerous misclassifications.

---
## 4. ROC Curve & Threshold Analysis 📈

Instead of using the default 0.5 threshold, let's explore how different thresholds affect our True Positive Rate (TPR) vs False Positive Rate (FPR).
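
As a minimal sketch of that trade-off, the snippet below applies two thresholds to a handful of made-up probabilities (not model output) and reports TPR and FPR for each. The helper `rates_at` is a hypothetical name used only for this illustration.

```python
import numpy as np

# Hypothetical scores for illustration: probability of "poisonous" (class 1)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.05, 0.20, 0.35, 0.60, 0.30, 0.55, 0.80, 0.95])

def rates_at(threshold):
    """Return (TPR, FPR) when predicting poisonous for probs >= threshold."""
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

for threshold in (0.5, 0.25):
    tpr, fpr = rates_at(threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold from 0.5 to 0.25 catches the last poisonous sample (TPR rises from 0.75 to 1.00) but also flags one more edible sample (FPR rises from 0.25 to 0.50).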

In [None]:
# Get probabilistic predictions via cross-validation
cv_probs = cross_val_predict(model, X_train, y_train, cv=splitter, method='predict_proba')
cv_pos_probs = cv_probs[:, 1]  # Probability of being poisonous

In [None]:
# Plot ROC Curve
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_predictions(
    y_true=y_train, y_pred=cv_pos_probs, ax=ax,
    name='Random Forest'
)
ax.set_title('ROC Curve - Random Forest (Cross-Validated)')
ax.plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random Classifier')
ax.legend()
plt.tight_layout()
plt.show()

# Print AUC score
auc = roc_auc_score(y_train, cv_pos_probs)
print(f"AUC Score: {auc:.4f}")

In [None]:
# Detailed ROC Curve data — examine threshold options
fpr, tpr, thresholds = roc_curve(y_train, cv_pos_probs)

# roc_curve returns fpr, tpr, and thresholds with the same length,
# so this truncation is only a defensive no-op
min_len = min(len(fpr), len(tpr), len(thresholds))
roc_df = pd.DataFrame({
    'False Positive Rate': fpr[:min_len],
    'True Positive Rate': tpr[:min_len],
    'Threshold': thresholds[:min_len]
})

# Show thresholds that give very high recall (TPR >= 0.99)
high_recall = roc_df[roc_df['True Positive Rate'] >= 0.99]
print("Thresholds with TPR >= 0.99:")
print(high_recall.to_string(index=False))

---
## 5. Precision-Recall Curve 🎯

In [None]:
# Plot Precision-Recall Curve
fig, ax = plt.subplots(figsize=(8, 6))
PrecisionRecallDisplay.from_predictions(
    y_true=y_train, y_pred=cv_pos_probs, ax=ax,
    name='Random Forest'
)
ax.set_title('Precision-Recall Curve - Random Forest (Cross-Validated)')
plt.tight_layout()
plt.show()

In [None]:
# Detailed Precision-Recall data
precision, recall, pr_thresholds = precision_recall_curve(y_train, cv_pos_probs)

pr_df = pd.DataFrame({
    'Precision': precision[:-1],
    'Recall': recall[:-1],
    'Threshold': pr_thresholds
})

# Show thresholds with perfect or near-perfect recall
perfect_recall = pr_df[pr_df['Recall'] >= 0.99]
print("Thresholds with Recall >= 0.99:")
print(perfect_recall.to_string(index=False))

---
## 6. Automated Threshold Optimization ⚡

We'll define a **cost-balanced scorer** that heavily penalizes False Negatives (poisonous mushrooms classified as edible) and use `TunedThresholdClassifierCV` to find the optimal threshold automatically.
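
To see why such a scorer pushes the threshold down, consider the raw cost arithmetic on two hypothetical outcomes. The weights are the ones used by the scorer in this section (3 per False Positive, 100 per False Negative). The error counts and the helper `total_cost` are invented for illustration.

```python
# Cost weights taken from the notebook's scorer
FP_COST = 3      # an edible mushroom is thrown away
FN_COST = 100    # a poisonous mushroom is eaten

def total_cost(false_positives, false_negatives):
    """Weighted misclassification cost (lower is better)."""
    return false_positives * FP_COST + false_negatives * FN_COST

# Hypothetical error counts at two thresholds (invented for illustration)
cautious = total_cost(false_positives=30, false_negatives=0)  # low threshold
default = total_cost(false_positives=5, false_negatives=2)    # 0.5 threshold

print(f"Cautious threshold cost: {cautious}")
print(f"Default threshold cost:  {default}")
```

With these weights one False Negative costs as much as roughly 33 False Positives, so the cautious outcome wins even though it makes far more mistakes in total.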

In [None]:
# Define cost-balanced scoring function
# False Negative (poisonous → edible) is catastrophic → high cost
# False Positive (edible → poisonous) is just a missed meal → low cost

def cost_balanced_score(y_true, y_pred):
    fp_cost = 3     # Minor inconvenience: we skip an edible mushroom
    fn_cost = 100   # Catastrophic: we eat a poisonous mushroom!
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = (fp * fp_cost) + (fn * fn_cost)
    max_cost = len(y_true) * max(fp_cost, fn_cost)
    return 1 - (cost / max_cost)

cost_scorer = make_scorer(cost_balanced_score)

In [None]:
# Automated threshold tuning
tuned_model = TunedThresholdClassifierCV(
    estimator=model,
    scoring=cost_scorer,
    cv=splitter
)

tuned_model.fit(X_train, y_train)
print(f"Optimal threshold (single fit): {tuned_model.best_threshold_:.4f}")

In [None]:
# Nested CV: re-tune the threshold on each outer training fold to check
# how stable it is, then average across folds
cv_results = cross_validate(
    tuned_model, X_train, y_train,
    cv=splitter, return_estimator=True
)

# Collect thresholds from each CV fold
cv_thresholds = [est.best_threshold_ for est in cv_results['estimator']]
tuned_threshold = np.mean(cv_thresholds)

print("Thresholds per fold:")
for i, t in enumerate(cv_thresholds):
    print(f"  Fold {i+1}: {t:.4f}")
print(f"\nAverage threshold: {tuned_threshold:.4f}")

---
## 7. Validation Performance 🧪

Let's check how our tuned threshold performs on the held-out validation set.

In [None]:
# Fit model on training set
model.fit(X_train, y_train)

# Make predictions on validation set using tuned threshold
# (>= matches how TunedThresholdClassifierCV applies its threshold)
val_probs = model.predict_proba(X_test)[:, 1]
val_preds = (val_probs >= tuned_threshold).astype(int)

# Performance metrics
val_acc = accuracy_score(y_test, val_preds)
val_precision = precision_score(y_test, val_preds)
val_recall = recall_score(y_test, val_preds)

print(f"Validation Accuracy:  {val_acc:.4f}")
print(f"Validation Precision: {val_precision:.4f}")
print(f"Validation Recall:    {val_recall:.4f}")

# Confusion Matrix
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay.from_predictions(
    y_true=y_test, y_pred=val_preds,
    display_labels=['Edible (0)', 'Poisonous (1)'],
    cmap='Oranges', ax=ax
)
ax.set_title(f'Validation - Tuned Threshold ({tuned_threshold:.4f})')
plt.tight_layout()
plt.show()

Excellent! With the tuned threshold, recall on the validation set is near-perfect: almost no poisonous mushrooms slip through as "edible". The trade-off is somewhat lower precision (some edible mushrooms get flagged as poisonous), which is an acceptable price for food safety.

---
## 8. Alternative: Youden's J Statistic 📐

For comparison, let's also try Youden's J statistic (J = sensitivity + specificity - 1), which weights sensitivity and specificity equally.
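
As a minimal sketch of what maximizing J means, the snippet below evaluates J at a few candidate thresholds on made-up probabilities (not model output) and picks the best one. The data, the candidate grid, and the helper `youden_j` are all assumptions for illustration.

```python
import numpy as np

# Hypothetical scores for illustration: probability of "poisonous" (class 1)
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
probs = np.array([0.05, 0.10, 0.20, 0.45, 0.50,
                  0.40, 0.60, 0.70, 0.80, 0.95])

def youden_j(threshold):
    """J = TPR - FPR when predicting poisonous for probs >= threshold."""
    y_pred = probs >= threshold
    tpr = np.mean(y_pred[y_true == 1])   # sensitivity
    fpr = np.mean(y_pred[y_true == 0])   # 1 - specificity
    return tpr - fpr

candidates = np.array([0.15, 0.35, 0.55, 0.75])
j_values = np.array([youden_j(t) for t in candidates])
best = candidates[np.argmax(j_values)]

for t, j in zip(candidates, j_values):
    print(f"threshold={t:.2f}  J={j:.2f}")
print(f"Threshold maximizing J: {best:.2f}")
```

In this toy example the J-optimal threshold (0.55) misses the poisonous sample scored 0.40, whereas 0.35 catches all five at the price of two False Positives. That is the kind of trade a safety-focused cost function would refuse.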

In [None]:
# Youden's J scorer
def youdens_j_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # Sensitivity
    fpr = fp / (fp + tn)  # False Positive Rate
    return tpr - fpr      # Youden's J = Sensitivity + Specificity - 1

youden_scorer = make_scorer(youdens_j_score)

# Tune threshold with Youden's J
youden_tuned = TunedThresholdClassifierCV(
    estimator=model,
    scoring=youden_scorer,
    cv=splitter
)

# Cross-validate
youden_cv = cross_validate(youden_tuned, X_train, y_train, cv=splitter, return_estimator=True)
youden_thresholds = [est.best_threshold_ for est in youden_cv['estimator']]
youden_threshold = np.mean(youden_thresholds)
print(f"Youden's J average threshold: {youden_threshold:.4f}")

# Compare on validation (>= matches how TunedThresholdClassifierCV applies its threshold)
youden_preds = (val_probs >= youden_threshold).astype(int)
print("\nYouden's J Validation:")
print(f"  Accuracy:  {accuracy_score(y_test, youden_preds):.4f}")
print(f"  Precision: {precision_score(y_test, youden_preds):.4f}")
print(f"  Recall:    {recall_score(y_test, youden_preds):.4f}")
print(f"  CM: {confusion_matrix(y_test, youden_preds).ravel()}")

As expected, Youden's J gives a more balanced result but allows more False Negatives. For mushroom safety, our **cost-balanced approach is preferred** since it prioritizes avoiding poisonous mushrooms.

---
## 9. Final Submission 🏆

Now we retrain on the **entire training dataset** and generate predictions for the competition test set.

In [None]:
# Retrain threshold tuning on the FULL training data
X_full = mush.drop(columns=['poisonous'])
y_full = mush['poisonous']
X_full['bruises'] = X_full['bruises'].astype(str)

# Cross-validate threshold on full data for robustness
tuned_final = TunedThresholdClassifierCV(
    estimator=model,
    scoring=cost_scorer,
    cv=splitter
)

cv_final = cross_validate(tuned_final, X_full, y_full, cv=splitter, return_estimator=True)
final_thresholds = [est.best_threshold_ for est in cv_final['estimator']]
final_threshold = np.mean(final_thresholds)

print("Final thresholds per fold:")
for i, t in enumerate(final_thresholds):
    print(f"  Fold {i+1}: {t:.4f}")
print(f"\nFinal average threshold: {final_threshold:.4f}")

In [None]:
# Cross-validated performance on full training data
cv_probs_final = cross_val_predict(model, X_full, y_full, cv=splitter, method='predict_proba')[:, 1]
cv_preds_final = (cv_probs_final >= final_threshold).astype(int)

print(f"Full Training CV Performance (threshold={final_threshold:.4f}):")
print(f"  Accuracy:  {accuracy_score(y_full, cv_preds_final):.4f}")
print(f"  Precision: {precision_score(y_full, cv_preds_final):.4f}")
print(f"  Recall:    {recall_score(y_full, cv_preds_final):.4f}")

fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay.from_predictions(
    y_true=y_full, y_pred=cv_preds_final,
    display_labels=['Edible (0)', 'Poisonous (1)'],
    cmap='Greens', ax=ax
)
ax.set_title(f'Full CV - Cost-Balanced Threshold ({final_threshold:.4f})')
plt.tight_layout()
plt.show()

In [None]:
# Load competition test data
test_path = './data/7.4.3.2_mushroom_competition_test_data.csv'
X_new = pd.read_csv(test_path).set_index('Id')
X_new['bruises'] = X_new['bruises'].astype(str)

# Ensure column order matches training data
X_new = X_new[X_full.columns]

print(f"Test data shape: {X_new.shape}")
X_new.head()

In [None]:
# Fit model on ALL training data
model.fit(X_full, y_full)

# Generate predictions with tuned threshold
# (>= matches how TunedThresholdClassifierCV applies its threshold)
test_probs = model.predict_proba(X_new)[:, 1]
test_preds = (test_probs >= final_threshold).astype(int)

print("Prediction distribution:")
print(pd.Series(test_preds).value_counts().rename({0: 'Edible', 1: 'Poisonous'}))

In [None]:
# Save submission file
submission = pd.DataFrame({
    'Id': X_new.index,
    'poisonous': test_preds
})

submission.to_csv('mushroom_submission.csv', index=False)
print("Submission file saved: mushroom_submission.csv")
print(f"\nTotal predictions: {len(submission)}")
submission.head(10)

---
## Summary 📝

| Metric | Baseline (0.5 threshold) | Cost-Balanced Tuned | Youden's J |
|--------|-------------------------|-------------------|------------|
| Accuracy | ~0.96 | ~0.94 | ~0.96 |
| Precision | ~0.95 | ~0.90 | ~0.95 |
| Recall | ~0.97 | ~1.00 | ~0.97 |
| False Negatives | ~79 | ~1 | ~79 |

**Key Takeaway:** By lowering the classification threshold using a cost-balanced approach, we sacrifice some precision (more edible mushrooms get flagged as suspicious) but achieve near-perfect recall (virtually no poisonous mushrooms are misclassified as safe to eat).

**When it comes to mushroom safety, it's better to throw away a good mushroom than to eat a bad one!** 🍄