# 🔄 Day 2: Cross-Validation

**🎯 Goal:** Learn how to get reliable, trustworthy model performance estimates

**⏱️ Time:** 45-60 minutes

**🌟 Why This Matters for AI:**
- A single train-test split can be misleading (a lucky or unlucky split)
- Cross-validation shows how much performance varies across splits, so you know how far to trust the estimate
- Standard practice in industry and research model development
- Critical for research papers, Kaggle competitions, and real deployments
- Helps detect overfitting and gives realistic performance estimates

---

## ❓ The Problem: Is Your Model Really That Good?

**Scenario:** You built an AI model and tested it:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Accuracy: {score:.2%}")  # 95%! Amazing!
```

**But wait...**
- What if your test set happened to be easy? 🤔
- What if you got lucky with the random split?
- Will it still be 95% with different data?

**Solution: Cross-Validation!** Test your model on multiple different splits. 🎯

In [None]:
# Import our tools
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (
    train_test_split,
    KFold,
    StratifiedKFold,
    LeaveOneOut,
    cross_val_score,
    cross_validate
)
from sklearn.datasets import make_classification, load_iris, load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

# Set style and random seed
sns.set_style('whitegrid')
np.random.seed(42)

print("✅ Libraries imported successfully!")

## 🎲 Single Split vs Cross-Validation

Let's see the difference with a real example:

In [None]:
# Load a real dataset
iris = load_iris()
X, y = iris.data, iris.target

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Method 1: Single split (traditional way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
single_score = model.score(X_test, y_test)

print("🎲 SINGLE TRAIN-TEST SPLIT")
print("=" * 50)
print(f"Accuracy: {single_score:.2%}")
print("\n⚠️  Problem: This is based on only ONE random split!")
print("   What if we got lucky (or unlucky)?\n")

# Method 2: Repeated random splits (the idea that cross-validation makes systematic)
print("🔄 REPEATED RANDOM SPLITS (5 different seeds)")
print("=" * 50)

# Try 5 different random splits
scores = []
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Split {i+1}: {score:.2%}")

print(f"\n📊 Average: {np.mean(scores):.2%} (+/- {np.std(scores):.2%})")
print("\n✅ Better! The spread across splits shows how much a single split can mislead.")

## 📚 K-Fold Cross-Validation

**How it works:**
1. Split your data into K roughly equal parts (folds)
2. For each fold:
   - Use it as test set
   - Use other K-1 folds as training set
   - Train model and evaluate
3. Average the K scores

**Visual:**
```
5-Fold Cross-Validation:
Fold 1: [TEST] [TRAIN] [TRAIN] [TRAIN] [TRAIN]
Fold 2: [TRAIN] [TEST] [TRAIN] [TRAIN] [TRAIN]
Fold 3: [TRAIN] [TRAIN] [TEST] [TRAIN] [TRAIN]
Fold 4: [TRAIN] [TRAIN] [TRAIN] [TEST] [TRAIN]
Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [TEST]
```

**Common K values:**
- K=5: Fast, good for large datasets
- K=10: Standard choice, good balance
- K=number of samples: Leave-One-Out (expensive!)
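
The steps above can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation: the helper name `k_fold_indices` is made up for this example, folds are contiguous and unshuffled, and when the data does not divide evenly the first folds get one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample,
    # so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the test fold is training data.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train_idx, test_idx) in enumerate(folds, 1):
    print(f"Fold {i}: test={test_idx} train={train_idx}")
```

With 10 samples and K=3, the test folds have sizes 4, 3, and 3, and every index lands in exactly one test fold. In practice you would use `KFold`, which also supports shuffling.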

In [None]:
# Let's visualize K-Fold Cross-Validation
from sklearn.model_selection import KFold

# Create sample data
X_sample = np.arange(20).reshape(-1, 1)
y_sample = np.arange(20)

# Create 5-fold CV
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Visualize the splits
fig, axes = plt.subplots(5, 1, figsize=(12, 8))
fig.suptitle('5-Fold Cross-Validation Visualization', fontsize=16, fontweight='bold', y=0.995)

for idx, (train_idx, test_idx) in enumerate(kfold.split(X_sample)):
    # Create array to color
    colors = np.array(['blue'] * len(X_sample))
    colors[test_idx] = 'red'
    
    # Plot
    axes[idx].scatter(range(len(X_sample)), [1]*len(X_sample), 
                     c=colors, s=200, alpha=0.6)
    axes[idx].set_xlim(-1, 20)
    axes[idx].set_ylim(0.5, 1.5)
    axes[idx].set_yticks([])
    axes[idx].set_title(f'Fold {idx+1}: {len(test_idx)} test samples (red), '
                       f'{len(train_idx)} training samples (blue)', 
                       fontsize=11)
    axes[idx].set_xlabel('Sample Index')

plt.tight_layout()
plt.show()

print("🎨 Legend:")
print("  🔴 Red = Test set for this fold")
print("  🔵 Blue = Training set for this fold")
print("\n💡 Every sample gets to be in the test set exactly once!")

## 🔧 Implementing K-Fold Cross-Validation

Let's use it properly with `cross_val_score`:

In [None]:
# Load a real dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"📊 Dataset: {len(X)} samples, {X.shape[1]} features")
print(f"   Classes: {np.bincount(y)} (0=malignant, 1=benign)\n")

# Create model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform 10-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')

print("🔄 10-Fold Cross-Validation Results")
print("=" * 50)
for i, score in enumerate(cv_scores, 1):
    print(f"Fold {i:2d}: {score:.4f} ({score*100:.2f}%)")

print("\n" + "=" * 50)
print(f"📊 Mean Accuracy: {cv_scores.mean():.4f} ({cv_scores.mean()*100:.2f}%)")
print(f"📊 Std Deviation: {cv_scores.std():.4f} ({cv_scores.std()*100:.2f}%)")
# Mean +/- 1.96 * std describes the spread of fold scores under a rough normal
# assumption; it is not a formal confidence interval for the true accuracy.
print(f"📊 Approx. 95% range of fold scores: {cv_scores.mean():.2%} +/- {1.96 * cv_scores.std():.2%}")

# Visualize the results
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), cv_scores, 'bo-', linewidth=2, markersize=8, label='Fold Scores')
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', linewidth=2, label=f'Mean: {cv_scores.mean():.2%}')
plt.fill_between(range(1, 11), 
                 cv_scores.mean() - cv_scores.std(), 
                 cv_scores.mean() + cv_scores.std(), 
                 alpha=0.2, color='red', label='±1 Std Dev')
plt.xlabel('Fold Number', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.title('10-Fold Cross-Validation Performance', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.ylim([min(0.9, cv_scores.min() - 0.01), 1.0])  # widen if a fold scores below 0.9
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print(f"   - Average performance: {cv_scores.mean():.2%}")
print(f"   - Performance varies by {cv_scores.std():.2%} between folds")
print("   - Low variance = Model is stable! ✅")

## 🎯 Stratified K-Fold: For Imbalanced Data

**Problem with regular K-Fold:**
- Random splits might create imbalanced folds
- One fold might have mostly class 0, another mostly class 1

**Solution: Stratified K-Fold**
- Preserves class distribution in each fold
- Each fold has the same percentage of each class as the original dataset

**When to use:**
- ✅ Imbalanced datasets (fraud, rare disease, etc.)
- ✅ Classification problems (always a good choice!)
- ❌ Regression (use regular K-Fold)

**Real AI Use:**
- Standard for NLP classification tasks
- Medical diagnosis with rare diseases
- Fraud detection, anomaly detection
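
To make the difference concrete, here is a small sketch that counts the minority-class samples landing in each test fold. The toy labels (`y_toy`, 90 negatives followed by 10 positives) and the helper `positives_per_fold` are made up for this illustration, and the labels are deliberately sorted with shuffling off to show the worst case for plain K-Fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 90 negatives followed by 10 positives (sorted on purpose).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting


def positives_per_fold(splitter):
    """Count class-1 samples in each test fold produced by the splitter."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]


plain_counts = positives_per_fold(KFold(n_splits=5))            # no shuffling
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5))  # stratified by label

print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

Plain K-Fold puts all 10 positives in the last fold, so four folds contain no positive examples to score against, while stratification gives every fold 2. Shuffling reduces the problem for plain K-Fold but does not guarantee balanced folds.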

In [None]:
# Create an imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1 (imbalanced!)
    random_state=42
)

print("📊 Imbalanced Dataset:")
print(f"   Total samples: {len(y_imb)}")
print(f"   Class 0: {np.sum(y_imb == 0)} ({np.sum(y_imb == 0)/len(y_imb)*100:.1f}%)")
print(f"   Class 1: {np.sum(y_imb == 1)} ({np.sum(y_imb == 1)/len(y_imb)*100:.1f}%)")
print("   ⚠️  Very imbalanced! (roughly a 90/10 split)\n")

# Compare regular K-Fold vs Stratified K-Fold
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Regular K-Fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
regular_scores = cross_val_score(model, X_imb, y_imb, cv=kfold, scoring='f1')

# Stratified K-Fold
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X_imb, y_imb, cv=stratified_kfold, scoring='f1')

# Compare results
print("🔄 Regular K-Fold:")
print(f"   F1 Scores: {regular_scores}")
print(f"   Mean: {regular_scores.mean():.4f}, Std: {regular_scores.std():.4f}\n")

print("✅ Stratified K-Fold (BETTER for imbalanced data):")
print(f"   F1 Scores: {stratified_scores}")
print(f"   Mean: {stratified_scores.mean():.4f}, Std: {stratified_scores.std():.4f}\n")

# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(5)
width = 0.35

bars1 = ax.bar(x - width/2, regular_scores, width, label='Regular K-Fold', alpha=0.7, color='orange')
bars2 = ax.bar(x + width/2, stratified_scores, width, label='Stratified K-Fold', alpha=0.7, color='green')

ax.axhline(y=regular_scores.mean(), color='orange', linestyle='--', alpha=0.5, 
          label=f'Regular Mean: {regular_scores.mean():.3f}')
ax.axhline(y=stratified_scores.mean(), color='green', linestyle='--', alpha=0.5,
          label=f'Stratified Mean: {stratified_scores.mean():.3f}')

ax.set_xlabel('Fold Number', fontsize=12)
ax.set_ylabel('F1 Score', fontsize=12)
ax.set_title('Regular vs Stratified K-Fold on Imbalanced Data', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'Fold {i+1}' for i in range(5)])
ax.legend(fontsize=10)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print("   Stratified K-Fold keeps each fold close to the dataset's overall ~90/10 class distribution!")
print("   This usually gives more reliable and consistent scores. ✅")

## 🔬 Leave-One-Out Cross-Validation (LOOCV)

**Extreme case:** K = number of samples

**How it works:**
- For each sample:
  - Use it as test set (1 sample)
  - Use all others as training set (N-1 samples)
  - Train and evaluate
- Average all N scores

**Pros:**
- ✅ Maximum training data (N-1 samples)
- ✅ No randomness (deterministic)
- ✅ Good for very small datasets

**Cons:**
- ❌ Very slow (train N models!)
- ❌ High variance in estimates
- ❌ Impractical for large datasets

**When to use:**
- Only for small datasets (< 1000 samples)
- When training is fast
- Research purposes
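
The splitting pattern described above can be checked directly with `LeaveOneOut`. This is a small sketch on a made-up 4-sample array (`X_tiny` is hypothetical), showing that the number of splits equals the number of samples and that each test set holds exactly one index.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_tiny = np.arange(8).reshape(4, 2)  # hypothetical dataset with 4 samples

loo = LeaveOneOut()
splits = list(loo.split(X_tiny))

print(f"Number of splits: {loo.get_n_splits(X_tiny)}")
for train_idx, test_idx in splits:
    print(f"train={train_idx} test={test_idx}")
```

With accuracy as the metric, each LOOCV fold score is either 0 or 1, so the mean is simply the fraction of samples predicted correctly.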

In [None]:
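# Use a small dataset for LOOCV
iris = load_iris()
# Take every 3rd sample so all three classes are represented.
# (The first 50 rows are all class 0, and a classifier cannot be fit on one class.)
X_small = iris.data[::3]  # 50 samples
y_small = iris.target[::3]

print(f"📊 Small Dataset: {len(X_small)} samples\n")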

# Create a fast model (Logistic Regression)
model = LogisticRegression(max_iter=1000, random_state=42)

# Compare different CV strategies
print("⏱️  Comparing Cross-Validation Methods:\n")

import time

# 5-Fold
start = time.time()
scores_5fold = cross_val_score(model, X_small, y_small, cv=5)
time_5fold = time.time() - start
print(f"5-Fold CV:")
print(f"   Mean Score: {scores_5fold.mean():.4f}")
print(f"   Time: {time_5fold:.4f} seconds")
print(f"   Models trained: 5\n")

# 10-Fold
start = time.time()
scores_10fold = cross_val_score(model, X_small, y_small, cv=10)
time_10fold = time.time() - start
print(f"10-Fold CV:")
print(f"   Mean Score: {scores_10fold.mean():.4f}")
print(f"   Time: {time_10fold:.4f} seconds")
print(f"   Models trained: 10\n")

# Leave-One-Out
loo = LeaveOneOut()
start = time.time()
scores_loo = cross_val_score(model, X_small, y_small, cv=loo)
time_loo = time.time() - start
print(f"Leave-One-Out CV:")
print(f"   Mean Score: {scores_loo.mean():.4f}")
print(f"   Time: {time_loo:.4f} seconds")
print(f"   Models trained: {len(X_small)} (one per sample!)\n")

# Visualize
methods = ['5-Fold', '10-Fold', 'LOOCV']
means = [scores_5fold.mean(), scores_10fold.mean(), scores_loo.mean()]
times = [time_5fold, time_10fold, time_loo]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy
ax1.bar(methods, means, color=['blue', 'green', 'orange'], alpha=0.7)
ax1.set_ylabel('Mean Accuracy', fontsize=12)
ax1.set_title('Accuracy Comparison', fontsize=14, fontweight='bold')
ax1.set_ylim([0.9, 1.0])
ax1.grid(axis='y', alpha=0.3)
for i, v in enumerate(means):
    ax1.text(i, v + 0.005, f'{v:.3f}', ha='center', fontweight='bold')

# Plot 2: Time
ax2.bar(methods, times, color=['blue', 'green', 'orange'], alpha=0.7)
ax2.set_ylabel('Time (seconds)', fontsize=12)
ax2.set_title('Computation Time', fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)
for i, v in enumerate(times):
    ax2.text(i, v + 0.001, f'{v:.3f}s', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Takeaway:")
print(f"   LOOCV is {time_loo/time_5fold:.1f}x slower than 5-Fold!")
print("   But gives similar accuracy estimate.")
print("   ✅ For most cases, 5-Fold or 10-Fold is best!")

## 📊 Multiple Metrics with Cross-Validation

Don't just evaluate accuracy! Get precision, recall, F1 all at once:

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Define multiple scoring metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}

# Perform cross-validation with multiple metrics
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring, return_train_score=True)

# Display results
print("📊 COMPREHENSIVE CROSS-VALIDATION REPORT")
print("=" * 60)
print(f"\n{'Metric':<15} {'Test Mean':<12} {'Test Std':<12} {'Train Mean':<12}")
print("-" * 60)

for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    test_key = f'test_{metric}'
    train_key = f'train_{metric}'
    test_mean = cv_results[test_key].mean()
    test_std = cv_results[test_key].std()
    train_mean = cv_results[train_key].mean()
    
    print(f"{metric.upper():<15} {test_mean:>6.4f}       {test_std:>6.4f}       {train_mean:>6.4f}")

print("=" * 60)

# Create a detailed DataFrame
results_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
    'Test Score': [
        cv_results['test_accuracy'].mean(),
        cv_results['test_precision'].mean(),
        cv_results['test_recall'].mean(),
        cv_results['test_f1'].mean(),
        cv_results['test_roc_auc'].mean()
    ],
    'Train Score': [
        cv_results['train_accuracy'].mean(),
        cv_results['train_precision'].mean(),
        cv_results['train_recall'].mean(),
        cv_results['train_f1'].mean(),
        cv_results['train_roc_auc'].mean()
    ]
})

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(results_df))
width = 0.35

bars1 = ax.bar(x - width/2, results_df['Train Score'], width, 
              label='Training Score', alpha=0.7, color='skyblue')
bars2 = ax.bar(x + width/2, results_df['Test Score'], width, 
              label='Test Score (CV)', alpha=0.7, color='orange')

ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Cross-Validation: All Metrics (Train vs Test)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(results_df['Metric'])
ax.legend(fontsize=11)
ax.set_ylim([0.9, 1.0])
ax.grid(axis='y', alpha=0.3)

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
               f'{height:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
gap = results_df['Train Score'].mean() - results_df['Test Score'].mean()
print(f"   Average Train-Test Gap: {gap:.4f}")
if gap < 0.05:
    print("   ✅ Great! Model generalizes well (low overfitting)")
elif gap < 0.10:
    print("   ⚠️  Some overfitting detected")
else:
    print("   ❌ High overfitting! Model memorizing training data")

## 🤖 Real AI Example: Robust Model Evaluation for Production

**Scenario:** You're deploying an AI content moderation system for a social media platform.

**Requirements:**
- Must be reliable across different types of content
- Performance must be consistent
- Need confidence intervals for stakeholders

Let's use cross-validation properly!
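
One common way to turn fold scores into an interval for stakeholders is a t-based interval on the mean score. The sketch below assumes SciPy is installed and uses made-up fold scores (`fold_scores` is hypothetical, not a result from this notebook). Because CV folds share training data, the scores are not independent, so treat the result as an approximation rather than an exact 95% interval.

```python
import numpy as np
from scipy import stats

# Hypothetical fold scores, made up for illustration only.
fold_scores = np.array([0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82, 0.78])

k = len(fold_scores)
mean_score = fold_scores.mean()
std_err = fold_scores.std(ddof=1) / np.sqrt(k)  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=k - 1)           # two-sided 95% critical value
half_width = t_crit * std_err

ci_low, ci_high = mean_score - half_width, mean_score + half_width
print(f"Mean F1: {mean_score:.3f}, approx. 95% interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

Note the difference from `mean ± 1.96 * std`: dividing by the square root of the number of folds gives an interval for the mean score, while the undivided version describes how much individual fold scores spread out.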

In [None]:
# Simulate content moderation dataset
# 1 = toxic content, 0 = safe content
X_content, y_content = make_classification(
    n_samples=2000,
    n_features=50,  # Stand-in for text embeddings (e.g., from BERT/GPT)
    n_informative=40,
    n_redundant=10,
    n_classes=2,
    weights=[0.85, 0.15],  # 15% toxic content
    random_state=42
)

print("🌐 CONTENT MODERATION AI - PRODUCTION EVALUATION")
print("=" * 60)
print(f"Dataset: {len(y_content)} posts")
print(f"Safe: {np.sum(y_content == 0)} ({np.sum(y_content == 0)/len(y_content)*100:.1f}%)")
print(f"Toxic: {np.sum(y_content == 1)} ({np.sum(y_content == 1)/len(y_content)*100:.1f}%)\n")

# Test multiple models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42)
}

# Stratified 10-Fold CV (best for imbalanced classification)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Evaluate each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_content, y_content, cv=cv, scoring='f1')
    results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores,
        # Rough spread of fold scores (mean +/- 1.96 * std, normal assumption);
        # not a formal confidence interval for the true F1-score.
        'ci_lower': scores.mean() - 1.96 * scores.std(),
        'ci_upper': scores.mean() + 1.96 * scores.std()
    }

# Display results
print("📊 Model Comparison (F1-Score):\n")
print(f"{'Model':<25} {'Mean':<10} {'Std':<10} {'Approx. 95% range'}")
print("-" * 60)
for name, res in results.items():
    ci = f"[{res['ci_lower']:.3f}, {res['ci_upper']:.3f}]"
    print(f"{name:<25} {res['mean']:.4f}     {res['std']:.4f}     {ci}")

# Visualize with confidence intervals
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Mean scores with error bars
model_names = list(results.keys())
means = [results[m]['mean'] for m in model_names]
stds = [results[m]['std'] for m in model_names]

ax1.bar(model_names, means, yerr=stds, capsize=10, alpha=0.7, 
       color=['#3498db', '#2ecc71', '#e74c3c'])
ax1.set_ylabel('F1-Score', fontsize=12)
ax1.set_title('Model Performance with Standard Deviation', fontsize=14, fontweight='bold')
ax1.set_ylim([0, 1])
ax1.grid(axis='y', alpha=0.3)
for i, (m, s) in enumerate(zip(means, stds)):
    ax1.text(i, m + s + 0.02, f'{m:.3f}\n±{s:.3f}', ha='center', fontweight='bold')

# Plot 2: Distribution of scores across folds
positions = [1, 2, 3]
bp = ax2.boxplot([results[m]['scores'] for m in model_names],
                 labels=model_names,
                 patch_artist=True,
                 notch=True,
                 showmeans=True)

colors = ['#3498db', '#2ecc71', '#e74c3c']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)

ax2.set_ylabel('F1-Score', fontsize=12)
ax2.set_title('Score Distribution Across 10 Folds', fontsize=14, fontweight='bold')
ax2.set_ylim([0, 1])
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Recommendation
best_model = max(results.keys(), key=lambda k: results[k]['mean'])
print("\n🏆 RECOMMENDATION FOR PRODUCTION:")
print(f"   Best Model: {best_model}")
print(f"   Expected F1-Score: {results[best_model]['mean']:.2%}")
print(f"   Approx. 95% range of fold scores: {results[best_model]['ci_lower']:.2%} - {results[best_model]['ci_upper']:.2%}")
print(f"   Fold-to-fold std: {results[best_model]['std']:.4f}")
print("\n💡 On data like this, fold-level F1-Scores mostly fell between ")
print(f"   {results[best_model]['ci_lower']:.1%} and {results[best_model]['ci_upper']:.1%}; treat this as a rough guide, not a guarantee.")

## 🎯 YOUR TURN: Evaluate Your Own Model

**Challenge:** You're building a spam email detector.

**Tasks:**
1. Load the dataset (provided below)
2. Use Stratified 5-Fold CV (data is imbalanced)
3. Evaluate with multiple metrics
4. Compare 3 different models
5. Choose the best one for production

In [None]:
# Generate spam detection dataset
X_spam, y_spam = make_classification(
    n_samples=1000,
    n_features=30,
    n_informative=25,
    n_classes=2,
    weights=[0.7, 0.3],  # 30% spam
    random_state=42
)

print("📧 Spam Detection Dataset")
print(f"Total emails: {len(y_spam)}")
print(f"Ham (normal): {np.sum(y_spam == 0)}")
print(f"Spam: {np.sum(y_spam == 1)}")
print("\n🎯 YOUR TASK: Complete the evaluation below!\n")

# YOUR CODE HERE!
# Replace each None with your own code (the prints in step 4 fail until you do).
# 1. Create 3 different models
model1 = None  # YOUR CODE
model2 = None  # YOUR CODE
model3 = None  # YOUR CODE

# 2. Create Stratified K-Fold CV
cv = None  # YOUR CODE (hint: use StratifiedKFold with 5 splits)

# 3. Evaluate each model
# Hint: use cross_val_score with scoring='f1'
scores1 = None  # YOUR CODE
scores2 = None  # YOUR CODE
scores3 = None  # YOUR CODE

# 4. Print results
print("Model 1:", scores1.mean())
print("Model 2:", scores2.mean())
print("Model 3:", scores3.mean())

# 5. Which model would you choose? Why?

### ✅ Solution (Run after trying!)

In [None]:
# SOLUTION
print("📧 SPAM DETECTION - COMPLETE EVALUATION")
print("=" * 60)

# 1. Create models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42)
}

# 2. Stratified K-Fold (important for imbalanced data!)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Evaluate with multiple metrics in one pass per model
scoring = ['accuracy', 'precision', 'recall', 'f1']

print("\n📊 Evaluation Results:\n")
all_results = {}

for name, model in models.items():
    print(f"\n{name}:")
    print("-" * 40)

    # cross_validate fits each fold once and scores every metric on that fit
    cv_res = cross_validate(model, X_spam, y_spam, cv=cv, scoring=scoring)

    model_scores = {}
    for metric in scoring:
        scores = cv_res[f'test_{metric}']
        model_scores[metric] = scores
        print(f"{metric.upper():<12}: {scores.mean():.4f} (+/- {scores.std():.4f})")

    all_results[name] = model_scores

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Spam Detection: Model Comparison', fontsize=16, fontweight='bold')

for idx, metric in enumerate(scoring):
    ax = axes[idx // 2, idx % 2]
    
    means = [all_results[m][metric].mean() for m in models.keys()]
    stds = [all_results[m][metric].std() for m in models.keys()]
    
    bars = ax.bar(models.keys(), means, yerr=stds, capsize=5, 
                 alpha=0.7, color=['#3498db', '#2ecc71', '#e74c3c'])
    ax.set_ylabel(metric.upper(), fontsize=11)
    ax.set_title(f'{metric.upper()} Comparison', fontsize=12, fontweight='bold')
    ax.set_ylim([0.7, 1.0])
    ax.grid(axis='y', alpha=0.3)
    
    for i, (m, s) in enumerate(zip(means, stds)):
        ax.text(i, m + s + 0.01, f'{m:.3f}', ha='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

# Final recommendation
print("\n" + "=" * 60)
print("🏆 FINAL RECOMMENDATION:")
print("=" * 60)

# Choose based on F1-score (balance of precision and recall)
f1_scores = {name: all_results[name]['f1'].mean() for name in models.keys()}
best_model = max(f1_scores.keys(), key=lambda k: f1_scores[k])

print(f"\n✅ Best Model: {best_model}")
print("\n📊 Performance:")
for metric in scoring:
    mean = all_results[best_model][metric].mean()
    std = all_results[best_model][metric].std()
    print(f"   {metric.upper():<12}: {mean:.2%} (+/- {std:.2%})")

print(f"\n💡 Why {best_model}?")
print("   - Highest F1-score (best balance of precision/recall)")
print("   - Check its std above to confirm it is consistent across folds")
print("   - Good for spam detection where both metrics matter")
print("\n🚀 Next step: confirm on a held-out test set before deploying!")

## 📋 Cross-Validation Best Practices

**1. Choose the Right K:**
- K=5: Fast, good for large datasets (>10,000 samples)
- K=10: Standard choice, good balance
- K=20+: Overkill for most cases
- LOOCV: Only for tiny datasets (<1000 samples)

**2. Always Use Stratified for Classification:**
- Preserves class distribution
- More reliable estimates
- Scikit-learn's `cross_val_score` already does this when `cv` is an integer and the estimator is a classifier
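
As a quick check of that default, scikit-learn exposes `check_cv`, the utility that resolves an integer `cv` into a splitter object. The sketch below uses made-up labels (`y_cls` is hypothetical) and only inspects which splitter class comes back.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_cls = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical binary labels

# An integer cv plus a classifier resolves to StratifiedKFold...
cv_for_classifier = check_cv(cv=5, y=y_cls, classifier=True)
# ...while any other estimator gets plain KFold.
cv_for_regressor = check_cv(cv=5, y=y_cls, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

The splitter created this way does not shuffle, so if your rows are ordered (for example by date or by class), pass your own `StratifiedKFold(n_splits=5, shuffle=True, random_state=42)` instead of an integer.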

**3. Report Mean AND Standard Deviation:**
- Mean alone can be misleading
- High std = model is unstable
- Always report: "95.2% (+/- 2.1%)"

**4. Use Multiple Metrics:**
- Accuracy + Precision + Recall + F1
- Different metrics tell different stories
- Use `cross_validate` for multiple metrics

**5. Watch for Overfitting:**
- Compare train vs test scores
- Large gap = overfitting
- Use `return_train_score=True`

## 🎉 Congratulations!

**You just mastered:**
- ✅ Why a single train-test split is unreliable
- ✅ K-Fold Cross-Validation
- ✅ Stratified K-Fold for imbalanced data
- ✅ Leave-One-Out CV and when to use it
- ✅ Evaluating multiple metrics at once
- ✅ Reporting the spread of scores for production decisions
- ✅ Real AI application: robust model evaluation

**🎯 Key Takeaways:**
1. **Always use cross-validation** for reliable estimates
2. **Stratified K-Fold** is best for classification
3. **Report mean ± std** for transparency
4. **K=5 or K=10** works for most cases
5. **Multiple metrics** give complete picture

**🚀 Practice Exercise (Do before Day 3!):**

Load the Iris dataset and:
1. Compare Regular K-Fold vs Stratified K-Fold
2. Use K=3, 5, 10 - how do results differ?
3. Evaluate with accuracy, precision, recall, F1
4. Which model would you deploy to production?

---

**📚 Next Lesson:** Day 3 - Hyperparameter Tuning (Find the BEST model settings!)

**💬 Questions?** Try different K values and see how estimates change!

---

*"Cross-validation: Because one test set is never enough!"* 🔄