# 🧪 Day 3: Statistical Testing for AI/ML

**🎯 Goal:** Learn to compare models scientifically and make data-driven decisions

**⏱️ Time:** 45-60 minutes

**🌟 Why This Matters for AI:**
- How do you know if Model A is REALLY better than Model B?
- Is the difference due to skill or just random luck?
- Statistical testing provides the answer!
- Essential for A/B testing, model deployment decisions, research papers
- Critical for 2024-2025 AI: Comparing RAG systems, Agentic AI performance, Multimodal models

**Real examples today:**
- A/B test two LLM prompts
- Compare RAG retrieval systems statistically
- Determine if model improvements are significant

---

## 📚 What is Statistical Testing?

**Statistical Testing** answers the question: *"Is this difference real or just random chance?"*

**Everyday example:**
- You flip a coin 10 times, get 7 heads and 3 tails
- Is the coin unfair or did you just get lucky?
- Statistical testing tells you!

**AI example:**
- Model A: 85% accuracy
- Model B: 87% accuracy
- Is Model B REALLY better, or did it just get lucky on this test set?
- Statistical testing tells you!

---

## 🔧 Setup: Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import seaborn as sns

# Make plots look nice
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✅ Libraries imported successfully!")

## 🎯 Part 1: Hypothesis Testing Fundamentals

**The Scientific Method for AI:**

1. **Null Hypothesis (H₀):** "No difference" - the boring answer
   - Example: "Both models have the same accuracy"

2. **Alternative Hypothesis (H₁):** "There IS a difference" - what we want to find evidence for
   - Example: "Model B is better than Model A"

3. **P-value:** Probability of seeing results at least this extreme if H₀ is true
   - Low p-value (< 0.05) = Reject H₀ = the difference is unlikely to be chance alone
   - High p-value (≥ 0.05) = Can't reject H₀ = the data are consistent with chance (this does not prove the models are equal)

4. **Significance Level (α):** The cutoff chosen before testing, usually 0.05 (5%)
   - If p-value < α → "statistically significant"
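
To make the p-value definition concrete, here is a minimal sketch that computes a one-sided p-value by hand and compares it with SciPy's `stats.binomtest`. The numbers (60 correct out of 100, with a 50% chance level) are illustrative assumptions, not results from a real model.

```python
import numpy as np
from scipy import stats

n, k, p0 = 100, 60, 0.5  # illustrative: 60 correct out of 100, chance level 50%

# P(X >= k) under H0: add up the probability of every outcome at least as extreme as k
outcomes = np.arange(k, n + 1)
p_by_hand = stats.binom.pmf(outcomes, n, p0).sum()

# The same one-sided test, done by SciPy
p_scipy = stats.binomtest(k, n, p0, alternative="greater").pvalue

print(f"P-value (summed by hand): {p_by_hand:.4f}")
print(f"P-value (binomtest):      {p_scipy:.4f}")
```

Both numbers should agree: the p-value is nothing more than the total probability of the outcomes that are at least as extreme as the one observed, assuming H₀ holds.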

---

### Simple Example: Is This Model Better Than Random?

**Scenario:** Binary classifier (yes/no prediction). Random guessing = 50% accuracy.

Your model got 60% accuracy on 100 examples. Is it REALLY better than random?

In [None]:
# Data
n_samples = 100
correct_predictions = 60  # 60% accuracy
null_hypothesis_probability = 0.5  # Random guessing = 50%

# Binomial test: Is 60/100 significantly better than the 50/100 expected from guessing?
# (stats.binom_test was removed in SciPy 1.12; stats.binomtest returns a result object)
p_value = stats.binomtest(correct_predictions, n_samples, null_hypothesis_probability, alternative='greater').pvalue

print("🎯 HYPOTHESIS TEST: Is Model Better Than Random?")
print("=" * 60)
print(f"\nH₀ (Null):        Model is just guessing randomly (50%)")
print(f"H₁ (Alternative): Model is better than random")
print(f"\nObserved accuracy: {correct_predictions}/{n_samples} = {correct_predictions/n_samples:.0%}")
print(f"Expected (random): {null_hypothesis_probability:.0%}")
print(f"\nP-value: {p_value:.4f}")
print(f"Significance level: 0.05")

if p_value < 0.05:
    print(f"\n✅ CONCLUSION: Reject H₀")
    print(f"   The model IS significantly better than random! (p < 0.05)")
else:
    print(f"\n❌ CONCLUSION: Cannot reject H₀")
    print(f"   The model might just be getting lucky (p >= 0.05)")

print("=" * 60)

### Visualizing the P-value

In [None]:
# Simulate what we'd expect from random guessing
np.random.seed(42)
random_results = np.random.binomial(n_samples, 0.5, 10000)

plt.figure(figsize=(10, 6))
plt.hist(random_results, bins=30, color='lightblue', edgecolor='black', alpha=0.7, density=True)
plt.axvline(50, color='green', linestyle='--', linewidth=2, label='Expected (random guessing = 50)')
plt.axvline(correct_predictions, color='red', linestyle='--', linewidth=2, label=f'Our model = {correct_predictions}')
plt.xlabel('Number of Correct Predictions (out of 100)', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Distribution Under Null Hypothesis (Random Guessing)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"💡 The red line shows our model's performance")
print(f"💡 It's in the tail of the distribution → unlikely to happen by chance")
print(f"💡 P-value = area in the tail at or beyond the red line")
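
The "area in the tail" can also be estimated directly from simulated experiments. The sketch below is one way to do that, assuming the same illustrative numbers as the example (100 predictions, 60 correct); the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_samples, observed_correct = 100, 60  # illustrative values matching the example

# Simulate many "random guessing" experiments under H0
simulated = rng.binomial(n_samples, 0.5, size=200_000)

# Tail area: fraction of simulated experiments at least as good as the observed result
p_simulated = np.mean(simulated >= observed_correct)

# Exact tail probability for comparison: sf(k - 1) = P(X > k - 1) = P(X >= k)
p_exact = stats.binom.sf(observed_correct - 1, n_samples, 0.5)

print(f"Simulated p-value: {p_simulated:.4f}")
print(f"Exact p-value:     {p_exact:.4f}")
```

The simulated value only approximates the exact one, and the approximation improves as the number of simulated experiments grows.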

## 🎯 YOUR TURN: Test Your Model

**Scenario:** You built a sentiment classifier. Test if it's better than random!

In [None]:
# Your classifier's results
n_reviews = 200
correct_classifications = 130  # 65% accuracy

# YOUR CODE HERE:
# 1. Run binomial test (random = 50%)
p_value = stats.binomtest(correct_classifications, n_reviews, 0.5, alternative='greater').pvalue

# 2. Print results
print(f"Observed: {correct_classifications}/{n_reviews} = {correct_classifications/n_reviews:.0%}")
print(f"P-value: {p_value:.6f}")

# 3. Interpret (is p < 0.05?)
if p_value < 0.05:
    print(f"\n✅ Statistically significant! Your classifier beats random guessing.")
else:
    print(f"\n❌ Not significant. Might need more training data or a larger test set.")

## 📊 Part 2: Comparing Two Models (T-Test)

**Most common scenario in AI:** Which of two models is better?

**T-test** compares the means of two groups.

**Types:**
- **Paired t-test:** Both models scored on the same test items (MOST COMMON in ML)
- **Independent t-test:** Each model scored on a different, unrelated set of items
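
As a rough illustration of why the choice matters, the sketch below simulates two models scored on the same tasks, where each task has its own difficulty that affects both models. All numbers and variable names are made-up assumptions. It also checks that a paired t-test is the same thing as a one-sample t-test on the per-task differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tasks = 30

# Hypothetical setup: each task has its own difficulty, shared by both models
task_difficulty = rng.normal(loc=0.80, scale=0.05, size=n_tasks)
scores_a = task_difficulty + rng.normal(0.00, 0.01, n_tasks)
scores_b = task_difficulty + 0.02 + rng.normal(0.00, 0.01, n_tasks)  # B is 2 points better

paired = stats.ttest_rel(scores_b, scores_a)
independent = stats.ttest_ind(scores_b, scores_a)
on_differences = stats.ttest_1samp(scores_b - scores_a, popmean=0.0)

print(f"Paired p-value:      {paired.pvalue:.2e}")
print(f"Independent p-value: {independent.pvalue:.2e}")
print(f"Paired t == one-sample t on differences: "
      f"{np.isclose(paired.statistic, on_differences.statistic)}")
```

When task difficulty is shared, pairing cancels it out, so the paired test can detect a small improvement that the independent test would miss.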

Let's compare two models!

In [None]:
# Example: Two models tested on the same 30 tasks (scores simulated independently here for simplicity)
np.random.seed(42)

# Model A: Older model
model_a_scores = np.random.normal(loc=0.82, scale=0.05, size=30)  # Mean 82%, std 5%

# Model B: Your new model (slightly better)
model_b_scores = np.random.normal(loc=0.86, scale=0.05, size=30)  # Mean 86%, std 5%

# Perform paired t-test
t_statistic, p_value = stats.ttest_rel(model_b_scores, model_a_scores)

print("🤖 COMPARING TWO AI MODELS")
print("=" * 60)
print(f"\nModel A (baseline):")
print(f"  Mean: {np.mean(model_a_scores):.1%}")
print(f"  Std:  {np.std(model_a_scores):.1%}")

print(f"\nModel B (new model):")
print(f"  Mean: {np.mean(model_b_scores):.1%}")
print(f"  Std:  {np.std(model_b_scores):.1%}")

print(f"\nDifference: {np.mean(model_b_scores) - np.mean(model_a_scores):.1%}")

print(f"\n📊 Statistical Test:")
print(f"  T-statistic: {t_statistic:.3f}")
print(f"  P-value: {p_value:.4f}")

if p_value < 0.05:
    print(f"\n✅ SIGNIFICANT! Model B is statistically better than Model A")
    print(f"   Deploy Model B with confidence!")
else:
    print(f"\n❌ NOT SIGNIFICANT. Difference might be due to chance")
    print(f"   Need more testing or the improvement is too small")

print("=" * 60)

### Visualizing Model Comparison

In [None]:
# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
axes[0].boxplot([model_a_scores, model_b_scores], labels=['Model A', 'Model B'])
axes[0].set_ylabel('Accuracy Score', fontsize=12)
axes[0].set_title('Model Performance Comparison\n(Box Plot)', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Histogram overlay
axes[1].hist(model_a_scores, bins=15, alpha=0.6, label='Model A', color='blue', edgecolor='black')
axes[1].hist(model_b_scores, bins=15, alpha=0.6, label='Model B', color='orange', edgecolor='black')
axes[1].axvline(np.mean(model_a_scores), color='blue', linestyle='--', linewidth=2)
axes[1].axvline(np.mean(model_b_scores), color='orange', linestyle='--', linewidth=2)
axes[1].set_xlabel('Accuracy Score', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Score Distributions\n(Histogram)', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print(f"💡 Even though Model B appears better visually...")
print(f"💡 We need statistical testing to check that it's not just luck!")

## 🎯 YOUR TURN: Compare RAG Systems

**Scenario:** You're testing two RAG (Retrieval-Augmented Generation) configurations.

Which one retrieves more relevant documents?

In [None]:
# Two RAG systems tested on 25 queries
np.random.seed(123)

# RAG System 1: Basic embedding
rag_system_1 = np.random.normal(loc=0.75, scale=0.08, size=25)

# RAG System 2: Fine-tuned embedding
rag_system_2 = np.random.normal(loc=0.80, scale=0.08, size=25)

# YOUR CODE HERE:
# 1. Calculate means
mean_1 = np.mean(rag_system_1)
mean_2 = np.mean(rag_system_2)

# 2. Run paired t-test
t_stat, p_val = stats.ttest_rel(rag_system_2, rag_system_1)

# 3. Print results
print(f"RAG System 1 (Basic):      {mean_1:.1%}")
print(f"RAG System 2 (Fine-tuned): {mean_2:.1%}")
print(f"\nImprovement: {mean_2 - mean_1:.1%}")
print(f"P-value: {p_val:.4f}")

# 4. Decide which to use
if p_val < 0.05:
    print(f"\n✅ Use RAG System 2! Significantly better (p < 0.05)")
else:
    print(f"\n🤔 No significant difference. Stick with System 1 (simpler)")

## 🎯 Part 3: Confidence Intervals

**Confidence Interval:** Range of plausible values for the true quantity, given the data

**95% Confidence Interval:** Built by a procedure that captures the true value in about 95% of repeated experiments; informally, "we're 95% confident the true value is in this range"

**Why it matters:**
- Model shows 85% accuracy → but true accuracy might be 82-88%
- Confidence intervals show uncertainty
- Critical for reporting model performance honestly
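
When the metric is plain accuracy (correct or incorrect per example), an interval can be computed straight from the counts. The sketch below uses the Wilson interval from SciPy's `binomtest(...).proportion_ci`; the counts (85 correct out of 100) are an illustrative assumption, and Wilson is just one reasonable choice of method.

```python
from scipy import stats

n_examples = 100   # illustrative test-set size
n_correct = 85     # illustrative: 85% observed accuracy

result = stats.binomtest(n_correct, n_examples)
ci = result.proportion_ci(confidence_level=0.95, method="wilson")

print(f"Observed accuracy:   {n_correct / n_examples:.1%}")
print(f"95% Wilson interval: [{ci.low:.1%}, {ci.high:.1%}]")
```

The width of the interval depends heavily on the test-set size: the same 85% measured on many more examples would give a much narrower range.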

In [None]:
# Example: Model tested on 100 samples
np.random.seed(42)
model_scores = np.random.normal(loc=0.85, scale=0.10, size=100)

# Calculate 95% confidence interval
mean_score = np.mean(model_scores)
std_error = stats.sem(model_scores)  # Standard error of the mean
confidence_interval = stats.t.interval(
    confidence=0.95,
    df=len(model_scores)-1,
    loc=mean_score,
    scale=std_error
)

print("📊 MODEL PERFORMANCE WITH CONFIDENCE INTERVAL")
print("=" * 60)
print(f"\nMean accuracy: {mean_score:.1%}")
print(f"\n95% Confidence Interval: [{confidence_interval[0]:.1%}, {confidence_interval[1]:.1%}]")
print(f"\n💡 Interpretation:")
print(f"   We are 95% confident the TRUE accuracy is between")
print(f"   {confidence_interval[0]:.1%} and {confidence_interval[1]:.1%}")
print(f"\n💡 When reporting to stakeholders:")
print(f"   DON'T say: 'The model is {mean_score:.1%} accurate'")
print(f"   DO say: 'The model is {mean_score:.1%} accurate (95% CI: {confidence_interval[0]:.1%}-{confidence_interval[1]:.1%})'")
print("=" * 60)

### Visualizing Confidence Intervals

In [None]:
# Compare 3 models with confidence intervals
np.random.seed(42)

models = {
    'Model A\n(Baseline)': np.random.normal(0.80, 0.08, 50),
    'Model B\n(Improved)': np.random.normal(0.85, 0.08, 50),
    'Model C\n(SOTA)': np.random.normal(0.88, 0.08, 50),
}

# Calculate means and CIs
means = []
cis = []
names = []

for name, scores in models.items():
    mean = np.mean(scores)
    se = stats.sem(scores)
    ci = stats.t.interval(0.95, len(scores)-1, loc=mean, scale=se)
    
    names.append(name)
    means.append(mean)
    cis.append((mean - ci[0], ci[1] - mean))  # (lower, upper) error bar lengths, the order errorbar expects

# Plot
plt.figure(figsize=(10, 6))
x_pos = np.arange(len(names))
plt.errorbar(x_pos, means, yerr=np.array(cis).T, fmt='o', markersize=10, 
             capsize=10, capthick=2, linewidth=2, color='darkblue')
plt.xticks(x_pos, names, fontsize=11)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Model Comparison with 95% Confidence Intervals', fontsize=14, fontweight='bold')
plt.ylim(0.7, 1.0)
plt.grid(axis='y', alpha=0.3)
plt.axhline(y=0.85, color='red', linestyle='--', alpha=0.3, label='Target: 85%')
plt.legend()
plt.tight_layout()
plt.show()

print("💡 Error bars show uncertainty")
print("💡 If confidence intervals DON'T overlap → the difference is very likely significant")
print("💡 If they DO overlap → can't tell from the plot alone; run a test on the difference")

## 🎯 Part 4: A/B Testing for AI Models

**A/B Testing:** Compare two versions in real-world conditions

**Common in AI:**
- Testing two different prompts for LLMs
- Comparing two RAG configurations
- Testing two recommendation algorithms
- Evaluating two chatbot personalities

**Let's run a real A/B test!**

### Example: A/B Testing LLM Prompts

**Scenario:** You have two prompts for a customer service chatbot.

Which one gets better user satisfaction?

In [None]:
# Simulate A/B test data
np.random.seed(42)

# Group A: Standard prompt (500 users)
# Satisfaction score: 1-5 stars
group_a_satisfaction = np.random.choice([1, 2, 3, 4, 5], size=500, p=[0.05, 0.10, 0.25, 0.40, 0.20])

# Group B: Improved prompt (500 users)
# Slightly higher satisfaction
group_b_satisfaction = np.random.choice([1, 2, 3, 4, 5], size=500, p=[0.03, 0.07, 0.20, 0.45, 0.25])

# Calculate metrics
mean_a = np.mean(group_a_satisfaction)
mean_b = np.mean(group_b_satisfaction)

# Statistical test
t_stat, p_value = stats.ttest_ind(group_b_satisfaction, group_a_satisfaction)

# Effect size (Cohen's d), using sample standard deviations (ddof=1); the groups are equal-sized
pooled_std = np.sqrt((np.std(group_a_satisfaction, ddof=1)**2 + np.std(group_b_satisfaction, ddof=1)**2) / 2)
cohens_d = (mean_b - mean_a) / pooled_std

print("🧪 A/B TEST RESULTS: LLM Prompt Comparison")
print("=" * 70)
print(f"\nGroup A (Standard Prompt):")
print(f"  Users: {len(group_a_satisfaction)}")
print(f"  Average satisfaction: {mean_a:.2f} / 5 stars")
print(f"  Std: {np.std(group_a_satisfaction):.2f}")

print(f"\nGroup B (Improved Prompt):")
print(f"  Users: {len(group_b_satisfaction)}")
print(f"  Average satisfaction: {mean_b:.2f} / 5 stars")
print(f"  Std: {np.std(group_b_satisfaction):.2f}")

print(f"\n📊 Statistical Analysis:")
print(f"  Difference: {mean_b - mean_a:+.2f} stars ({(mean_b - mean_a)/mean_a * 100:+.1f}% change)")
print(f"  P-value: {p_value:.4f}")
print(f"  Effect size (Cohen's d): {cohens_d:.3f}")

print(f"\n🎯 DECISION:")
if p_value < 0.05:
    print(f"  ✅ STATISTICALLY SIGNIFICANT! (p < 0.05)")
    print(f"  ✅ Improved prompt is REALLY better")
    print(f"  ✅ RECOMMENDATION: Deploy improved prompt to all users")
else:
    print(f"  ❌ NOT significant (p >= 0.05)")
    print(f"  ❌ Difference might be random")
    print(f"  ❌ RECOMMENDATION: Keep testing or stick with standard prompt")

# Interpret effect size (Cohen's conventional thresholds: 0.2 small, 0.5 medium, 0.8 large)
if abs(cohens_d) < 0.2:
    effect_interpretation = "negligible"
elif abs(cohens_d) < 0.5:
    effect_interpretation = "small"
elif abs(cohens_d) < 0.8:
    effect_interpretation = "medium"
else:
    effect_interpretation = "large"

print(f"\n💡 Effect size is {effect_interpretation} (d = {cohens_d:.3f})")
print("=" * 70)
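
Star ratings are ordered categories rather than true measurements, and `stats.ttest_ind` assumes equal variances by default. As a cross-check, one option is to run Welch's t-test (`equal_var=False`) alongside a rank-based Mann-Whitney U test. The sketch below does this on freshly simulated ratings that reuse the probabilities from the example; the Mann-Whitney test is an addition here, not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
stars = [1, 2, 3, 4, 5]

# Illustrative ratings, using the same probabilities as the A/B example
ratings_a = rng.choice(stars, size=500, p=[0.05, 0.10, 0.25, 0.40, 0.20])
ratings_b = rng.choice(stars, size=500, p=[0.03, 0.07, 0.20, 0.45, 0.25])

# Welch's t-test: does not assume the two groups share the same variance
welch = stats.ttest_ind(ratings_b, ratings_a, equal_var=False)

# Mann-Whitney U: rank-based, so it only relies on the ordering of the star ratings
mwu = stats.mannwhitneyu(ratings_b, ratings_a, alternative="two-sided")

print(f"Welch's t-test:  t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"Mann-Whitney U:  U = {mwu.statistic:.0f}, p = {mwu.pvalue:.4f}")
```

If the two tests point the same way, the conclusion does not hinge on treating star ratings as numeric values.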

### Visualizing A/B Test Results

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. Distribution comparison
axes[0].hist(group_a_satisfaction, bins=5, alpha=0.6, label='Group A (Standard)', 
             color='blue', edgecolor='black', density=True)
axes[0].hist(group_b_satisfaction, bins=5, alpha=0.6, label='Group B (Improved)', 
             color='orange', edgecolor='black', density=True)
axes[0].set_xlabel('Satisfaction (1-5 stars)', fontsize=11)
axes[0].set_ylabel('Density', fontsize=11)
axes[0].set_title('Distribution Comparison', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# 2. Box plot
axes[1].boxplot([group_a_satisfaction, group_b_satisfaction], 
                labels=['Group A\n(Standard)', 'Group B\n(Improved)'])
axes[1].set_ylabel('Satisfaction (1-5 stars)', fontsize=11)
axes[1].set_title('Box Plot Comparison', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

# 3. Mean comparison with error bars
se_a = stats.sem(group_a_satisfaction)
se_b = stats.sem(group_b_satisfaction)
ci_a = stats.t.interval(0.95, len(group_a_satisfaction)-1, loc=mean_a, scale=se_a)
ci_b = stats.t.interval(0.95, len(group_b_satisfaction)-1, loc=mean_b, scale=se_b)

groups = ['Group A\n(Standard)', 'Group B\n(Improved)']
means_ab = [mean_a, mean_b]
errors = [(mean_a - ci_a[0], ci_a[1] - mean_a), (mean_b - ci_b[0], ci_b[1] - mean_b)]

axes[2].bar(groups, means_ab, yerr=np.array(errors).T, capsize=10, 
            color=['blue', 'orange'], alpha=0.7, edgecolor='black', linewidth=2)
axes[2].set_ylabel('Average Satisfaction', fontsize=11)
axes[2].set_title('Mean with 95% CI', fontsize=12, fontweight='bold')
axes[2].set_ylim(0, 5)
axes[2].grid(axis='y', alpha=0.3)

# Add values on bars
for i, (group, mean) in enumerate(zip(groups, means_ab)):
    axes[2].text(i, mean + 0.15, f'{mean:.2f}', ha='center', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.show()

## 🎯 YOUR TURN: A/B Test Two RAG Configurations

**Scenario:** Testing two RAG chunk sizes for retrieval quality.

Run a complete A/B test analysis!

In [None]:
# A/B test data: Relevance scores (0-100)
np.random.seed(456)

# Configuration A: 512 token chunks (200 queries)
config_a_scores = np.random.normal(loc=78, scale=12, size=200)

# Configuration B: 256 token chunks (200 queries)
config_b_scores = np.random.normal(loc=82, scale=12, size=200)

# YOUR CODE HERE:
# 1. Calculate means
mean_config_a = np.mean(config_a_scores)
mean_config_b = np.mean(config_b_scores)

# 2. Run t-test
t_stat, p_val = stats.ttest_ind(config_b_scores, config_a_scores)

# 3. Calculate effect size (sample standard deviations, ddof=1)
pooled_std = np.sqrt((np.std(config_a_scores, ddof=1)**2 + np.std(config_b_scores, ddof=1)**2) / 2)
cohens_d = (mean_config_b - mean_config_a) / pooled_std

# 4. Print comprehensive results
print("🔍 RAG CONFIGURATION A/B TEST")
print("=" * 60)
print(f"\nConfig A (512 tokens): {mean_config_a:.1f}")
print(f"Config B (256 tokens): {mean_config_b:.1f}")
print(f"\nImprovement: {mean_config_b - mean_config_a:.1f} points")
print(f"P-value: {p_val:.4f}")
print(f"Effect size: {cohens_d:.3f}")

# 5. Make recommendation
if p_val < 0.05:
    print(f"\n✅ Use Config B (256 tokens)! Significantly better.")
else:
    print(f"\n🤔 No significant difference. Use Config A (simpler).")

print("=" * 60)

## 🎯 REAL AI EXAMPLE: Comparing Multimodal Models Statistically

**Scenario:** You're evaluating 3 multimodal models (vision + text) for image captioning.

Let's perform a complete statistical comparison!

In [None]:
# Generate realistic multimodal model performance data
np.random.seed(42)
n_images = 100

# Model 1: CLIP-based (baseline)
clip_scores = np.random.normal(loc=72, scale=8, size=n_images)

# Model 2: Fine-tuned BLIP
blip_scores = np.random.normal(loc=78, scale=7, size=n_images)

# Model 3: Custom transformer
custom_scores = np.random.normal(loc=81, scale=9, size=n_images)

models_data = {
    'CLIP (Baseline)': clip_scores,
    'BLIP (Fine-tuned)': blip_scores,
    'Custom Transformer': custom_scores
}

print("🖼️ MULTIMODAL MODEL COMPARISON (Image Captioning)")
print("=" * 70)
print(f"\nTest set: {n_images} images")
print(f"Metric: BLEU score (0-100)\n")

# Calculate statistics for each model
results = {}
for model_name, scores in models_data.items():
    mean = np.mean(scores)
    std = np.std(scores)
    se = stats.sem(scores)
    ci = stats.t.interval(0.95, len(scores)-1, loc=mean, scale=se)
    
    results[model_name] = {
        'mean': mean,
        'std': std,
        'ci': ci,
        'scores': scores
    }
    
    print(f"{model_name}:")
    print(f"  Mean BLEU: {mean:.2f}")
    print(f"  Std Dev:   {std:.2f}")
    print(f"  95% CI:    [{ci[0]:.2f}, {ci[1]:.2f}]\n")

# Pairwise comparisons
print("\n📊 PAIRWISE STATISTICAL TESTS:")
print("=" * 70)

comparisons = [
    ('BLIP (Fine-tuned)', 'CLIP (Baseline)'),
    ('Custom Transformer', 'CLIP (Baseline)'),
    ('Custom Transformer', 'BLIP (Fine-tuned)')
]

for model1, model2 in comparisons:
    t_stat, p_val = stats.ttest_rel(results[model1]['scores'], results[model2]['scores'])
    diff = results[model1]['mean'] - results[model2]['mean']
    
    print(f"\n{model1} vs {model2}:")
    print(f"  Difference: {diff:+.2f} points")
    print(f"  T-statistic: {t_stat:.3f}")
    print(f"  P-value: {p_val:.4f}")
    
    if p_val < 0.05:
        winner = model1 if diff > 0 else model2
        print(f"  ✅ {winner} is SIGNIFICANTLY better (p < 0.05)")
    else:
        print(f"  ❌ No significant difference (p >= 0.05)")

print("\n" + "=" * 70)
print("\n🎯 FINAL RECOMMENDATION:")
best_model = max(results.items(), key=lambda x: x[1]['mean'])[0]
print(f"\n  Deploy: {best_model}")
print(f"  Mean BLEU: {results[best_model]['mean']:.2f}")
print(f"  95% CI: [{results[best_model]['ci'][0]:.2f}, {results[best_model]['ci'][1]:.2f}]")
print("\n  ✅ Highest mean score")
print("  ✅ Backed by the pairwise tests above (check their p-values)")
print("  ✅ Ready for production")
print("=" * 70)
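
Running three pairwise tests at the 0.05 level raises the chance of at least one false positive (to roughly 14% if the tests were independent). A common remedy is to adjust the p-values for the number of comparisons. The sketch below implements the Bonferroni and Holm adjustments by hand; the function names and the example p-values are hypothetical, and these corrections are an addition to the analysis in this notebook.

```python
import numpy as np

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * len(p), 1.0)

def holm(p_values):
    """Holm step-down adjustment: never larger than the Bonferroni adjustment."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Smallest p-value is multiplied by m, the next by m - 1, and so on
    scaled = p[order] * (m - np.arange(m))
    # Running maximum keeps adjusted values from decreasing along the sorted order
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # put values back in the original order
    return adjusted

raw_p = [0.010, 0.040, 0.030]  # hypothetical p-values from three pairwise tests
print("Bonferroni:", bonferroni(raw_p))
print("Holm:      ", holm(raw_p))
```

A comparison counts as significant after correction only if its adjusted p-value is still below 0.05.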

### Comprehensive Visualization

In [None]:
# Create publication-ready visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Box plots
axes[0, 0].boxplot([results[m]['scores'] for m in models_data.keys()], 
                    labels=['CLIP', 'BLIP', 'Custom'])
axes[0, 0].set_ylabel('BLEU Score', fontsize=11)
axes[0, 0].set_title('Score Distribution (Box Plot)', fontsize=12, fontweight='bold')
axes[0, 0].grid(axis='y', alpha=0.3)

# 2. Violin plots (shows distribution shape)
positions = [1, 2, 3]
parts = axes[0, 1].violinplot([results[m]['scores'] for m in models_data.keys()], 
                               positions=positions, showmeans=True, showmedians=True)
axes[0, 1].set_xticks(positions)
axes[0, 1].set_xticklabels(['CLIP', 'BLIP', 'Custom'])
axes[0, 1].set_ylabel('BLEU Score', fontsize=11)
axes[0, 1].set_title('Score Distribution (Violin Plot)', fontsize=12, fontweight='bold')
axes[0, 1].grid(axis='y', alpha=0.3)

# 3. Mean comparison with confidence intervals
model_names = list(models_data.keys())
means = [results[m]['mean'] for m in model_names]
errors = [(results[m]['mean'] - results[m]['ci'][0], 
           results[m]['ci'][1] - results[m]['mean']) for m in model_names]

x_pos = np.arange(len(model_names))
axes[1, 0].bar(x_pos, means, yerr=np.array(errors).T, capsize=10, 
               color=['skyblue', 'lightcoral', 'lightgreen'], 
               edgecolor='black', linewidth=2, alpha=0.8)
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(['CLIP', 'BLIP', 'Custom'])
axes[1, 0].set_ylabel('Mean BLEU Score', fontsize=11)
axes[1, 0].set_title('Mean Performance with 95% CI', fontsize=12, fontweight='bold')
axes[1, 0].grid(axis='y', alpha=0.3)

# Add values on bars
for i, mean in enumerate(means):
    axes[1, 0].text(i, mean + 2, f'{mean:.1f}', ha='center', fontweight='bold')

# 4. Overlapping histograms
for model_name, color in zip(model_names, ['blue', 'red', 'green']):
    axes[1, 1].hist(results[model_name]['scores'], bins=15, alpha=0.5, 
                    label=model_name.split()[0], color=color, edgecolor='black')
    axes[1, 1].axvline(results[model_name]['mean'], color=color, 
                       linestyle='--', linewidth=2)

axes[1, 1].set_xlabel('BLEU Score', fontsize=11)
axes[1, 1].set_ylabel('Frequency', fontsize=11)
axes[1, 1].set_title('Score Distributions (Histogram)', fontsize=12, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 This is how you present model comparisons in research papers!")
print("💡 Multiple visualizations + statistical tests = robust conclusions")

## 📝 Summary & Key Takeaways

**You just learned:**

### Hypothesis Testing:
- ✅ Null hypothesis (H₀) vs Alternative hypothesis (H₁)
- ✅ P-value: probability of results at least this extreme if H₀ is true
- ✅ p < 0.05 = statistically significant
- ✅ Used to check whether improvements are more than noise

### Comparing Models:
- ✅ T-test compares means of two groups
- ✅ Paired t-test for same data, different models
- ✅ Independent t-test for different datasets
- ✅ Essential for choosing best model

### Confidence Intervals:
- ✅ Range of plausible values for the true quantity
- ✅ 95% CI = procedure that captures the true value about 95% of the time
- ✅ Shows uncertainty in estimates
- ✅ Critical for honest reporting

### A/B Testing:
- ✅ Compare two versions in production
- ✅ Used for prompts, models, configurations
- ✅ Statistical validation of improvements
- ✅ Effect size shows practical significance

### Real Applications:
- ✅ Compared multimodal models scientifically
- ✅ A/B tested LLM prompts
- ✅ Validated RAG configurations
- ✅ Made data-driven deployment decisions

---

## 🚀 Final Challenge: Complete Model Evaluation

**Scenario:** You're presenting to stakeholders. They want to know:
1. Which model to deploy?
2. Is it REALLY better?
3. How confident are you?

Perform a complete analysis!
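
For the "how confident are you?" question, one useful number is a confidence interval on the improvement itself. The sketch below builds a 95% t-interval on per-task differences for paired scores; the simulated data and variable names are hypothetical and separate from the challenge data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical paired scores: both agents evaluated on the same 50 tasks
current = rng.normal(loc=75, scale=10, size=50)
new = current + rng.normal(loc=5, scale=5, size=50)

# Work with per-task differences, exactly as the paired t-test does
diffs = new - current
mean_diff = diffs.mean()
ci_low, ci_high = stats.t.interval(0.95, len(diffs) - 1, loc=mean_diff, scale=stats.sem(diffs))
p_val = stats.ttest_rel(new, current).pvalue

print(f"Mean improvement: {mean_diff:.2f} points")
print(f"95% CI for the improvement: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"Paired t-test p-value: {p_val:.4g}")
```

This interval excludes zero exactly when the two-sided paired t-test gives p < 0.05, so it carries the same verdict plus a range for how large the improvement plausibly is.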

In [None]:
# Two Agentic AI systems tested on 50 tasks
np.random.seed(789)

agent_current = np.random.normal(loc=75, scale=10, size=50)  # Current production agent
agent_new = np.random.normal(loc=80, scale=10, size=50)      # New improved agent

# YOUR COMPLETE ANALYSIS:

print("🤖 AGENTIC AI SYSTEM EVALUATION REPORT")
print("=" * 70)

# 1. Descriptive statistics
print("\n1️⃣ DESCRIPTIVE STATISTICS:\n")
for name, scores in [('Current Agent', agent_current), ('New Agent', agent_new)]:
    mean = np.mean(scores)
    std = np.std(scores)
    se = stats.sem(scores)
    ci = stats.t.interval(0.95, len(scores)-1, loc=mean, scale=se)
    
    print(f"{name}:")
    print(f"  Mean:   {mean:.2f}")
    print(f"  Std:    {std:.2f}")
    print(f"  95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]\n")

# 2. Statistical test
print("2️⃣ STATISTICAL TEST (Paired T-Test):\n")
t_stat, p_val = stats.ttest_rel(agent_new, agent_current)
print(f"  T-statistic: {t_stat:.3f}")
print(f"  P-value: {p_val:.4f}")
print(f"  Significant: {'YES ✅' if p_val < 0.05 else 'NO ❌'}\n")

# 3. Effect size
print("3️⃣ EFFECT SIZE:\n")
diff = np.mean(agent_new) - np.mean(agent_current)
pooled_std = np.sqrt((np.std(agent_current, ddof=1)**2 + np.std(agent_new, ddof=1)**2) / 2)
cohens_d = diff / pooled_std
print(f"  Improvement: {diff:.2f} points ({diff/np.mean(agent_current)*100:.1f}%)")
print(f"  Cohen's d: {cohens_d:.3f}")
# Cohen's conventional thresholds: 0.2 small, 0.5 medium, 0.8 large
if abs(cohens_d) < 0.2:
    print(f"  Interpretation: Negligible effect\n")
elif abs(cohens_d) < 0.5:
    print(f"  Interpretation: Small effect\n")
elif abs(cohens_d) < 0.8:
    print(f"  Interpretation: Medium effect\n")
else:
    print(f"  Interpretation: Large effect\n")

# 4. Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].boxplot([agent_current, agent_new], labels=['Current', 'New'])
axes[0].set_ylabel('Performance Score')
axes[0].set_title('Agent Comparison', fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

axes[1].hist(agent_current, bins=15, alpha=0.6, label='Current', color='blue', edgecolor='black')
axes[1].hist(agent_new, bins=15, alpha=0.6, label='New', color='orange', edgecolor='black')
axes[1].set_xlabel('Performance Score')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Score Distributions', fontweight='bold')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# 5. Final recommendation
print("\n" + "=" * 70)
print("4️⃣ RECOMMENDATION FOR STAKEHOLDERS:\n")
if p_val < 0.05:
    print("  ✅ DEPLOY NEW AGENT")
    print(f"  ✅ {diff:.1f} point improvement ({diff/np.mean(agent_current)*100:.1f}% better)")
    print(f"  ✅ Statistically significant (p = {p_val:.4f})")
    print(f"  ✅ A difference this large would be unlikely if the agents were equal")
    print("\n  Expected impact: Better user experience, higher success rate")
else:
    print("  🤔 KEEP CURRENT AGENT")
    print(f"  🤔 Improvement not statistically significant (p = {p_val:.4f})")
    print("  🤔 Difference might be due to random variation")
    print("\n  Recommendation: Collect more data or improve new agent further")

print("=" * 70)
print("\n💡 This is how professionals make deployment decisions!")

## 🎯 Why This Matters in 2024-2025

**Statistical testing is CRITICAL for modern AI:**

1. **RAG Systems:**
   - A/B test different chunk sizes
   - Compare embedding models statistically
   - Validate retrieval improvements
   - Optimize re-ranking strategies

2. **Agentic AI:**
   - Compare agent architectures
   - Validate decision-making improvements
   - Test different prompting strategies
   - Measure reliability statistically

3. **Multimodal Models:**
   - Compare fusion strategies
   - Test different modality combinations
   - Validate cross-modal improvements
   - Optimize attention mechanisms

4. **LLM Development:**
   - A/B test prompts
   - Compare fine-tuning approaches
   - Validate RLHF improvements
   - Test different sampling strategies

**Bottom line:** Without statistical testing, you're guessing. With it, you can say how strong the evidence is! 🎯

---

## 📚 Week 5 Complete!

**🎉 Congratulations! You've mastered:**

**Day 1: Descriptive Statistics**
- Mean, median, mode
- Variance and standard deviation
- Distributions and histograms

**Day 2: Probability Theory**
- Probability fundamentals
- Conditional probability and Bayes' Theorem
- Probability distributions
- Built a Naive Bayes classifier!

**Day 3: Statistical Testing**
- Hypothesis testing
- T-tests and p-values
- Confidence intervals
- A/B testing for AI models

**You can now:**
- ✅ Analyze any AI dataset scientifically
- ✅ Compare models statistically
- ✅ Make data-driven deployment decisions
- ✅ Present results with confidence
- ✅ Understand modern AI research papers

---

## 🚀 What's Next?

**Statistics is the foundation. Now you're ready for:**

- **Week 6:** Linear Algebra (matrices, vectors, transformations)
- **Week 7:** Calculus for ML (gradients, optimization)
- **Week 8:** NumPy & Data Processing
- **Phase 2:** Machine Learning Algorithms
- **Phase 3:** Deep Learning & Neural Networks
- **Phase 4:** Modern AI (Transformers, RAG, Agents!)

**Keep going! You're building the foundation for an AI career!** 🚀

---

**💬 Questions?** Review all three notebooks and practice with real datasets!

*Remember: Statistics isn't just numbers - it's the scientific method for AI!* 📊