# 📊 Day 1: Evaluation Metrics

**🎯 Goal:** Master the metrics that tell you how good your AI model really is

**⏱️ Time:** 45-60 minutes

**🌟 Why This Matters for AI:**
- Accuracy alone can be misleading - learn when to use precision, recall, and F1
- Essential for evaluating RAG systems, chatbots, image classifiers, and more
- Standard across industry and research, including in evaluations of large language models such as GPT-4, Gemini, and LLaMA
- Critical for medical AI (missing a disease = bad!), fraud detection, content moderation

---

## 🎯 The Problem: Is 99% Accuracy Good?

Imagine you built an AI model to detect credit card fraud:
- Dataset: 10,000 transactions
- Fraudulent: 100 (1%)
- Legitimate: 9,900 (99%)

**Your model:** "All transactions are legitimate!"
- **Accuracy:** 99% ✅ (Looks amazing!)
- **Problem:** It catches ZERO fraud cases! ❌

**This is why we need better metrics!** Let's learn them all. 👇

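To make the fraud example concrete, here is a minimal sketch that rebuilds the 10,000-transaction dataset and scores the "everything is legitimate" model. The variable names are illustrative, and `zero_division=0` is set only because precision is undefined when a model never predicts the positive class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 10,000 transactions: 100 fraudulent (label 1) and 9,900 legitimate (label 0)
fraud_true = np.array([1] * 100 + [0] * 9_900)

# A "model" that calls every transaction legitimate
fraud_pred = np.zeros_like(fraud_true)

baseline_accuracy = accuracy_score(fraud_true, fraud_pred)
# With no positive predictions, precision is 0/0; zero_division=0 reports it as 0
baseline_precision = precision_score(fraud_true, fraud_pred, zero_division=0)
baseline_recall = recall_score(fraud_true, fraud_pred, zero_division=0)
baseline_f1 = f1_score(fraud_true, fraud_pred, zero_division=0)

print(f"Accuracy:  {baseline_accuracy:.2%}")
print(f"Precision: {baseline_precision:.2%}")
print(f"Recall:    {baseline_recall:.2%}")
print(f"F1-Score:  {baseline_f1:.2%}")
```

Accuracy comes out at 99% while recall and F1 are both 0%, which is exactly the gap the rest of this lesson is about.
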
In [None]:
# Let's import our tools
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")

## 🔍 The Confusion Matrix: Foundation of All Metrics

Before understanding metrics, you need to understand the **confusion matrix**.

### Real Example: Email Spam Detection

When your model makes predictions, there are 4 possible outcomes:

| | Actually Spam | Actually Not Spam |
|---|---|---|
| **Predicted Spam** | ✅ True Positive (TP)<br>*Correctly caught spam* | ❌ False Positive (FP)<br>*Wrongly marked as spam* |
| **Predicted Not Spam** | ❌ False Negative (FN)<br>*Missed spam* | ✅ True Negative (TN)<br>*Correctly identified normal* |

**Let's see this in action:**

In [None]:
# Simulated spam detection results
# Let's say we tested 30 emails

y_true = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0,  # Actual labels (1=spam, 0=not spam)
          0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
          0, 1, 0, 0, 0, 1, 1, 0, 0, 1]

y_pred = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0,  # Model's predictions
          0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
          0, 1, 0, 0, 0, 0, 1, 0, 0, 1]

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Visualize it beautifully
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Not Spam', 'Spam'],
            yticklabels=['Not Spam', 'Spam'])
plt.title('Confusion Matrix: Email Spam Detection', fontsize=14, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.show()

# Extract values
tn, fp, fn, tp = cm.ravel()

print("📊 Confusion Matrix Breakdown:")
print(f"\n✅ True Positives (TP): {tp} - Correctly identified spam")
print(f"✅ True Negatives (TN): {tn} - Correctly identified normal emails")
print(f"❌ False Positives (FP): {fp} - Normal emails wrongly marked as spam")
print(f"❌ False Negatives (FN): {fn} - Spam that got through")

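One detail that is easy to trip over: the table in this section puts predictions in rows and actual labels in columns, while scikit-learn's `confusion_matrix` puts actual labels in rows and predictions in columns. The sketch below uses a tiny made-up example (the `toy_*` names are illustrative) to show the layout that `cm.ravel()` relies on.

```python
from sklearn.metrics import confusion_matrix

# Tiny hypothetical example: 1 = spam, 0 = not spam
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# labels=[0, 1] pins the row/column order explicitly
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(toy_cm)
# scikit-learn layout (rows = actual, columns = predicted):
# [[TN, FP],
#  [FN, TP]]

# ravel() flattens row by row, which is why the unpacking order is tn, fp, fn, tp
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()
print(f"TN={toy_tn}, FP={toy_fp}, FN={toy_fn}, TP={toy_tp}")
```

Passing `labels` is optional here, but it keeps the ordering unambiguous if a class happens to be missing from a small sample.
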
## 📏 Metric 1: Accuracy

**Definition:** What percentage of predictions were correct?

**Formula:**
```
Accuracy = (TP + TN) / (TP + TN + FP + FN)
```

**When to use:**
- ✅ Balanced datasets (roughly equal class sizes)
- ❌ Imbalanced datasets (like fraud detection)

**Real AI Use:**
- MNIST digit classification (roughly balanced across 10 digit classes)
- Multi-class image classification with similarly sized categories

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)

# Manual calculation to understand
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"🎯 Accuracy: {accuracy:.2%}")
print(f"📝 Manual calculation: ({tp} + {tn}) / ({tp} + {tn} + {fp} + {fn}) = {manual_accuracy:.2%}")
print(f"\n💡 This means {accuracy:.0%} of our predictions were correct!")

## 🎯 Metric 2: Precision

**Definition:** Of all the emails we flagged as spam, how many were actually spam?

**Formula:**
```
Precision = TP / (TP + FP)
```

**Question it answers:** "When my model says YES, how often is it right?"

**When to use:**
- ✅ False Positives are costly (e.g., medical diagnosis for expensive treatment)
- ✅ Content recommendation (don't recommend bad content)
- ✅ RAG systems (don't retrieve irrelevant documents)

**Real AI Use (2024-2025):**
- **RAG Systems:** High precision = retrieved documents are actually relevant
- **Agentic AI:** High precision = agent actions are correct when taken
- **Video content moderation:** Don't falsely flag safe content

In [None]:
# Calculate precision
precision = precision_score(y_true, y_pred)

# Manual calculation
manual_precision = tp / (tp + fp)

print(f"🎯 Precision: {precision:.2%}")
print(f"📝 Manual calculation: {tp} / ({tp} + {fp}) = {manual_precision:.2%}")
print(f"\n💡 When we flag an email as spam, we're correct {precision:.0%} of the time")
print(f"⚠️  {fp} normal emails were wrongly marked as spam (False Positives)")

## 🔍 Metric 3: Recall (Sensitivity)

**Definition:** Of all the actual spam emails, how many did we catch?

**Formula:**
```
Recall = TP / (TP + FN)
```

**Question it answers:** "Of all the actual positives, how many did I find?"

**When to use:**
- ✅ False Negatives are costly (e.g., cancer detection - can't miss cases!)
- ✅ Fraud detection (catch all fraud)
- ✅ Search engines (find all relevant results)

**Real AI Use (2024-2025):**
- **Medical AI:** High recall = catch all disease cases
- **Security systems:** High recall = detect all threats
- **RAG retrieval:** High recall = find all relevant documents

In [None]:
# Calculate recall
recall = recall_score(y_true, y_pred)

# Manual calculation
manual_recall = tp / (tp + fn)

print(f"🔍 Recall: {recall:.2%}")
print(f"📝 Manual calculation: {tp} / ({tp} + {fn}) = {manual_recall:.2%}")
print(f"\n💡 We caught {recall:.0%} of all spam emails")
print(f"⚠️  {fn} spam emails slipped through (False Negatives)")

## ⚖️ Metric 4: F1-Score (The Balance)

**Definition:** The harmonic mean of precision and recall

**Formula:**
```
F1 = 2 × (Precision × Recall) / (Precision + Recall)
```

**When to use:**
- ✅ You need a balance between precision and recall
- ✅ Imbalanced datasets
- ✅ You want a single metric that considers both FP and FN

**Real AI Use:**
- Standard metric for NLP tasks (text classification, named entity recognition)
- Hugging Face model evaluation
- Many Kaggle classification competitions score submissions with F1 or macro-F1

In [None]:
# Calculate F1-score
f1 = f1_score(y_true, y_pred)

# Manual calculation
manual_f1 = 2 * (precision * recall) / (precision + recall)

print(f"⚖️  F1-Score: {f1:.2%}")
print(f"📝 Manual calculation: 2 × ({precision:.2f} × {recall:.2f}) / ({precision:.2f} + {recall:.2f}) = {manual_f1:.2%}")
print(f"\n💡 F1 balances precision ({precision:.0%}) and recall ({recall:.0%})")

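Why a harmonic mean rather than a plain average? A quick sketch with made-up numbers shows the difference: the harmonic mean is pulled toward the weaker of the two values, so a model cannot hide poor recall behind perfect precision. The helper `f1_from` and the example values are illustrative, not part of scikit-learn.

```python
def f1_from(precision_value: float, recall_value: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision_value + recall_value == 0:
        return 0.0
    return 2 * precision_value * recall_value / (precision_value + recall_value)

# Hypothetical model: always right when it says "positive",
# but it finds only 10% of the actual positives
p, r = 1.0, 0.1

arithmetic_mean = (p + r) / 2
harmonic_mean = f1_from(p, r)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1 (harmonic):   {harmonic_mean:.2f}")
```

The arithmetic mean reports 0.55, while F1 reports about 0.18, much closer to the weak recall.
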
## 📋 All Metrics Together

Let's see a comprehensive report:

In [None]:
# Classification report shows everything!
print("📊 COMPLETE CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(y_true, y_pred, 
                          target_names=['Not Spam', 'Spam']))

print("\n💡 How to read this:")
print("- Precision: When model says 'Spam', how often is it right?")
print("- Recall: Of all actual spam, how much did we catch?")
print("- F1-score: Balanced metric combining both")
print("- Support: Number of actual occurrences in each class")

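The printed report is handy for reading, but when you want the numbers in code, `classification_report` also accepts `output_dict=True`. The sketch below uses a small made-up example (the `report_*` names are illustrative) and also shows how the "macro avg" row relates to the per-class F1 scores.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels: 1 = spam, 0 = not spam
report_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
report_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

report = classification_report(
    report_true,
    report_pred,
    target_names=["Not Spam", "Spam"],
    output_dict=True,  # return a nested dict instead of a formatted string
)

spam_f1 = report["Spam"]["f1-score"]
not_spam_f1 = report["Not Spam"]["f1-score"]
macro_f1 = report["macro avg"]["f1-score"]

print(f"Spam F1:     {spam_f1:.3f}")
print(f"Not Spam F1: {not_spam_f1:.3f}")
# Macro-F1 is the unweighted mean of the per-class F1 scores
print(f"Macro F1:    {macro_f1:.3f}")
print(f"Same value from f1_score: {f1_score(report_true, report_pred, average='macro'):.3f}")
```

Macro-F1 treats every class equally regardless of size, which is why it is a common choice when the rare class matters as much as the common one.
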
## 📈 ROC Curve & AUC

**ROC (Receiver Operating Characteristic) Curve:**
- Shows the trade-off between True Positive Rate (Recall) and False Positive Rate
- Helps you choose the best threshold for your model

**AUC (Area Under the Curve):**
- Single number to measure ROC curve quality
- Range: 0.0 to 1.0 (higher is better)
- 0.5 = random guessing, 1.0 = perfect model

**Real AI Use:**
- Standard for binary classification evaluation
- Used in medical diagnostics, fraud detection
- Threshold tuning for production systems

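One way to build intuition for AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count as half). The sketch below checks this on a handful of made-up scores; the `toy_*` names and values are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores (higher score = more likely positive)
toy_labels = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

pos_scores = toy_scores[toy_labels == 1]
neg_scores = toy_scores[toy_labels == 0]

# Compare every positive score against every negative score
wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
ties = (pos_scores[:, None] == neg_scores[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {roc_auc_score(toy_labels, toy_scores):.4f}")
```

Here 6 of the 9 positive/negative pairs are ranked correctly, so both numbers come out at about 0.67. This also explains why 0.5 corresponds to random guessing: a coin flip orders half the pairs correctly.
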
In [None]:
# For ROC curve, we need probability scores
# Let's create a simple classifier

# Generate a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                          n_informative=15, n_redundant=5,
                          random_state=42, weights=[0.7, 0.3])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of positive class

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--', label='Random Classifier (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve: Model Performance', fontsize=14, fontweight='bold')
plt.legend(loc="lower right", fontsize=10)
plt.grid(alpha=0.3)
plt.show()

print(f"🎯 AUC Score: {auc_score:.3f}")
print(f"\n💡 Interpretation:")
if auc_score >= 0.9:
    print("   Excellent model! 🌟")
elif auc_score >= 0.8:
    print("   Good model! ✅")
elif auc_score >= 0.7:
    print("   Fair model, room for improvement 📈")
else:
    print("   Poor model, needs work 🔧")

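The arrays returned by `roc_curve` can also be used to pick a threshold. The sketch below shows one possible rule, assuming you have a minimum recall you must reach: take the highest threshold whose true positive rate meets that target. The synthetic scores, the `demo_*` names, and the 95% target are all assumptions for illustration, and other rules may suit your costs better.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_curve

rng = np.random.default_rng(0)

# Hypothetical data: about 30% positives, and positives score higher on average
demo_labels = (rng.random(500) < 0.3).astype(int)
demo_scores = np.clip(rng.normal(loc=0.35 + 0.3 * demo_labels, scale=0.15), 0.0, 1.0)

demo_fpr, demo_tpr, demo_thresholds = roc_curve(demo_labels, demo_scores)

target_recall = 0.95  # assumed requirement, not a universal rule

# Thresholds come back in decreasing order, so TPR never decreases along the arrays.
# argmax on the boolean mask gives the first (highest) threshold that reaches the target.
idx = int(np.argmax(demo_tpr >= target_recall))
chosen_threshold = demo_thresholds[idx]

# Apply the threshold the same way roc_curve does: positive when score >= threshold
demo_decisions = (demo_scores >= chosen_threshold).astype(int)
achieved_recall = recall_score(demo_labels, demo_decisions)

print(f"Chosen threshold:     {chosen_threshold:.3f}")
print(f"Recall at threshold:  {achieved_recall:.2%}")
print(f"False positive rate:  {demo_fpr[idx]:.2%}")
```

In practice you would choose the threshold on a validation set and then confirm it on held-out test data, since a threshold tuned and evaluated on the same examples tends to look better than it is.
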
## 🤖 Real AI Example: Evaluating a RAG Retrieval System

**Scenario:** You built a RAG (Retrieval-Augmented Generation) system for a customer support chatbot.

**Question:** How do you evaluate if it retrieves the RIGHT documents?

Let's simulate this!

In [None]:
# Simulated RAG evaluation
# Let's say we labeled 50 candidate documents across our test queries

# Ground truth: which documents are actually relevant (human-labeled)
actually_relevant = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
                             0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
                             0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
                             1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
                             0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# What your RAG system retrieved
retrieved = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
                     0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
                     0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
                     1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
                     0, 1, 0, 1, 0, 1, 1, 1, 0, 1])

# Calculate metrics
rag_precision = precision_score(actually_relevant, retrieved)
rag_recall = recall_score(actually_relevant, retrieved)
rag_f1 = f1_score(actually_relevant, retrieved)

print("🤖 RAG SYSTEM EVALUATION")
print("=" * 50)
print(f"\n📊 Precision: {rag_precision:.2%}")
print("   → Of documents retrieved, how many were actually relevant?")
print("   → High precision = Low noise, users see relevant docs")

print(f"\n🔍 Recall: {rag_recall:.2%}")
print("   → Of all relevant documents, how many did we retrieve?")
print("   → High recall = Comprehensive, didn't miss important info")

print(f"\n⚖️  F1-Score: {rag_f1:.2%}")
print("   → Balanced metric for overall retrieval quality")

# Confusion matrix for RAG
cm_rag = confusion_matrix(actually_relevant, retrieved)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rag, annot=True, fmt='d', cmap='Greens',
           xticklabels=['Not Retrieved', 'Retrieved'],
           yticklabels=['Not Relevant', 'Relevant'])
plt.title('RAG System: Retrieval Performance', fontsize=14, fontweight='bold')
plt.ylabel('Actually Relevant?', fontsize=12)
plt.xlabel('Retrieved by RAG?', fontsize=12)
plt.show()

print("\n🎯 What This Means for Your RAG System:")
print(f"✅ {rag_precision:.0%} of retrieved documents were relevant (precision)")
print(f"✅ Found {rag_recall:.0%} of all relevant documents (recall)")
print("\n💡 Optimization Strategy:")
if rag_precision < 0.8:
    print("   - Improve embedding quality (better model or fine-tuning)")
    print("   - Add re-ranking stage")
if rag_recall < 0.8:
    print("   - Retrieve more candidates (increase top-k)")
    print("   - Improve query expansion")
    print("   - Check document chunking strategy")

## 🎯 YOUR TURN: Medical Diagnosis System

You're building an AI system to detect a rare disease from medical images.

**Dataset:**
- 1000 patients tested
- 50 actually have the disease (5%)
- 950 are healthy (95%)

**Your model's predictions:**
- Correctly identified: 45 sick patients (TP)
- Missed: 5 sick patients (FN)
- False alarms: 30 healthy patients wrongly flagged (FP)
- Correctly identified: 920 healthy patients (TN)

**Calculate:**
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Is this a good model for medical diagnosis? Why or why not?

In [None]:
# Given values
tp = 45   # Correctly identified sick patients
fn = 5    # Missed sick patients (FALSE NEGATIVE - VERY BAD!)
fp = 30   # Healthy wrongly flagged (False Positive - inconvenient but safer)
tn = 920  # Correctly identified healthy

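# YOUR CODE HERE - Calculate the metrics
# Hint: Use the formulas we learned!
# Each placeholder is NaN so the cell still runs; replace it with your formula.

accuracy = float("nan")   # YOUR CODE
precision = float("nan")  # YOUR CODE
recall = float("nan")     # YOUR CODE
f1 = float("nan")         # YOUR CODE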

print("🏥 MEDICAL DIAGNOSIS AI EVALUATION")
print("=" * 50)
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-Score:  {f1:.2%}")
print("\n🤔 Analysis:")
print(f"   - We missed {fn} sick patients (False Negatives)")
print(f"   - We wrongly flagged {fp} healthy patients (False Positives)")
print("\n❓ Which is worse for medical diagnosis? Think about it!")

### ✅ Solution (Run this cell after trying!)

In [None]:
# SOLUTION
tp, fn, fp, tn = 45, 5, 30, 920

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print("🏥 SOLUTION: Medical Diagnosis AI Evaluation")
print("=" * 50)
print(f"\n📊 Metrics:")
print(f"   Accuracy:  {accuracy:.2%} - {(tp+tn)}/{(tp+tn+fp+fn)} correct")
print(f"   Precision: {precision:.2%} - {tp}/{tp+fp} flagged patients actually sick")
print(f"   Recall:    {recall:.2%} - {tp}/{tp+fn} sick patients detected")
print(f"   F1-Score:  {f1:.2%} - Balanced metric")

print("\n🎯 ANALYSIS:")
print(f"\n✅ GOOD: {recall:.0%} recall means we caught most sick patients")
print(f"⚠️  CONCERN: We missed {fn} sick patients (False Negatives)")
print(f"⚠️  CONCERN: {fp} healthy people will undergo unnecessary tests (False Positives)")

print("\n💡 VERDICT:")
print("   For medical diagnosis:")
print("   - High RECALL is CRITICAL (can't miss sick patients!)")
print(f"   - {recall:.0%} recall is good, but not perfect")
print("   - Consider: Better to have false alarms than miss disease")
print("   - Recommendation: Aim for 95%+ recall, accept lower precision")

print("\n🔧 How to improve:")
print("   1. Lower the decision threshold → Higher recall (catch more cases)")
print("   2. Collect more training data for rare disease cases")
print("   3. Use ensemble models for better reliability")
print("   4. Always have a human doctor review flagged cases")

## 📊 Metric Selection Guide

**Quick Reference: Which Metric to Use?**

| Use Case | Metric | Why? |
|----------|--------|------|
| Balanced classes (MNIST) | Accuracy | Equal importance for all classes |
| Cancer detection | Recall | Can't miss positive cases! |
| Spam filter | Precision | Don't flag important emails |
| RAG retrieval | F1 | Balance relevance & coverage |
| Fraud detection | Recall | Catch all fraud |
| YouTube recommendations | Precision | Don't recommend bad videos |
| Search engines | Recall | Find all relevant results |
| Imbalanced data | F1 or AUC | F1 weighs both FP and FN; AUC summarizes ranking across all thresholds |

**2024-2025 AI Applications:**
- **LLM evaluation (e.g., GPT-4):** Precision & Recall of generated claims against reference facts
- **RAG systems:** Precision@k, Recall@k, MRR (Mean Reciprocal Rank)
- **Multimodal AI:** Per-modality F1 scores
- **Agentic AI:** Task completion rate + precision of actions

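Precision@k, Recall@k, and MRR work on ranked results rather than yes/no labels. Here is a minimal sketch of all three, assuming each query has a ranked list of document IDs and a set of relevant IDs. The helper names, document IDs, and queries are hypothetical, and real evaluation libraries may handle edge cases differently.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    top_k = ranked_ids[:k]
    return sum(doc in relevant_ids for doc in top_k) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = ranked_ids[:k]
    return sum(doc in relevant_ids for doc in top_k) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval results for two queries
rag_queries = [
    {"ranked": ["d3", "d1", "d7", "d2"], "relevant": {"d1", "d2"}},
    {"ranked": ["d5", "d9", "d4", "d8"], "relevant": {"d5"}},
]

k = 3
n_queries = len(rag_queries)
mean_precision_at_k = sum(precision_at_k(q["ranked"], q["relevant"], k) for q in rag_queries) / n_queries
mean_recall_at_k = sum(recall_at_k(q["ranked"], q["relevant"], k) for q in rag_queries) / n_queries
mrr = sum(reciprocal_rank(q["ranked"], q["relevant"]) for q in rag_queries) / n_queries

print(f"Precision@{k}: {mean_precision_at_k:.2f}")
print(f"Recall@{k}:    {mean_recall_at_k:.2f}")
print(f"MRR:          {mrr:.2f}")
```

MRR only looks at the first relevant hit per query, so it rewards putting one good document at the top, while Recall@k rewards covering everything relevant within the top k.
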
## 🎯 BONUS CHALLENGE: Build a Metric Dashboard

Create a function that takes predictions and returns a beautiful evaluation dashboard!

In [None]:
def evaluate_model(y_true, y_pred, class_names=['Negative', 'Positive']):
    """
    Complete evaluation dashboard for binary classification
    """
    # Calculate all metrics
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot 1: Confusion Matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
               xticklabels=class_names, yticklabels=class_names)
    axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')
    axes[0].set_ylabel('Actual', fontsize=12)
    axes[0].set_xlabel('Predicted', fontsize=12)
    
    # Plot 2: Metrics Bar Chart
    metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
    values = [acc, prec, rec, f1]
    colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']
    
    bars = axes[1].bar(metrics, values, color=colors, alpha=0.7)
    axes[1].set_ylim([0, 1.1])  # headroom so value labels above tall bars stay inside the axes
    axes[1].set_ylabel('Score', fontsize=12)
    axes[1].set_title('Performance Metrics', fontsize=14, fontweight='bold')
    axes[1].grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar, value in zip(bars, values):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height,
                    f'{value:.2%}', ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed report
    print("\n📊 COMPREHENSIVE EVALUATION REPORT")
    print("=" * 60)
    print(f"\n{'Metric':<15} {'Score':<10} {'Interpretation'}")
    print("-" * 60)
    print(f"{'Accuracy':<15} {acc:>6.2%}    Overall correctness")
    print(f"{'Precision':<15} {prec:>6.2%}    Positive prediction accuracy")
    print(f"{'Recall':<15} {rec:>6.2%}    True positive detection rate")
    print(f"{'F1-Score':<15} {f1:>6.2%}    Precision-Recall balance")
    print("=" * 60)
    
    return {'accuracy': acc, 'precision': prec, 'recall': rec, 'f1': f1}

# Test it!
test_true = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1])
test_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1])

results = evaluate_model(test_true, test_pred, class_names=['Normal', 'Anomaly'])

## 🎉 Congratulations!

**You just mastered:**
- ✅ Confusion matrix (TP, TN, FP, FN)
- ✅ Accuracy, Precision, Recall, F1-Score
- ✅ ROC curves and AUC
- ✅ When to use which metric
- ✅ Real AI applications (RAG systems, medical AI)
- ✅ How to evaluate models properly

**🎯 Key Takeaways:**
1. **Accuracy is not enough** - especially for imbalanced data
2. **Precision** = "When I predict positive, am I right?"
3. **Recall** = "Of all positives, how many did I find?"
4. **F1** = Balance between precision and recall
5. **Choose metrics based on your use case** (medical = high recall!)

**🚀 Practice Exercise (Do before Day 2!):**

Imagine you're building an AI content moderator for social media:
- Test set: 1000 posts
- Toxic: 100, Safe: 900
- Your model: TP=85, FN=15, FP=50, TN=850

Calculate all metrics and decide:
- Is this good enough for production?
- Which metric matters most?
- How would you improve it?

---

**📚 Next Lesson:** Day 2 - Cross-Validation (Make sure your metrics are reliable!)

**💬 Questions?** Review the ROC curve section; it's powerful for threshold tuning!

---

*"In God we trust, all others must bring data... and proper evaluation metrics!"* 📊