# 📊 Day 1: Descriptive Statistics for AI/ML

**🎯 Goal:** Master statistical measures to understand and describe your AI data

**⏱️ Time:** 45-60 minutes

**🌟 Why This Matters for AI:**
- Statistics is the foundation of ALL machine learning
- Understand your data before training models (garbage in = garbage out!)
- Evaluate model performance with statistical metrics
- Critical for 2024-2025 AI: RAG systems, Agentic AI, Multimodal models all rely on statistics
- Data scientists spend much of their time exploring and cleaning data, and these techniques are where that work starts

---

## 📚 What is Descriptive Statistics?

**Descriptive statistics** summarizes and describes data in a meaningful way.

Imagine you're training an AI model:
- You have 10,000 training examples
- Each example has features (numbers, measurements)
- How do you understand what your data looks like?

**That's where descriptive statistics comes in!** 🎯

We'll learn 3 key concepts:
1. **Measures of Central Tendency** (mean, median, mode) - "What's typical?"
2. **Measures of Spread** (variance, standard deviation) - "How spread out is the data?"
3. **Distributions** (visualizing data patterns) - "What does the data look like?"

---

## 🔧 Setup: Import Libraries

We'll use:
- `numpy` - Fast numerical computations
- `matplotlib` - Data visualization
- `statistics` - Built-in Python statistics

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import statistics

# Make plots look nice
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✅ Libraries imported successfully!")

## 📏 Part 1: Measures of Central Tendency

**Central tendency** answers: *"What's the typical or middle value?"*

### 1️⃣ Mean (Average)

**Formula:** Sum all values ÷ Number of values

**When to use:** The default choice; works best for roughly symmetric data without extreme outliers

**AI Example:** Average model accuracy across training epochs

In [None]:
# Example: Model accuracy scores over 10 training epochs
accuracy_scores = [0.72, 0.78, 0.81, 0.85, 0.87, 0.89, 0.90, 0.91, 0.92, 0.93]

# Calculate mean
mean_accuracy = np.mean(accuracy_scores)

print("🤖 Model Accuracy Scores:", accuracy_scores)
print(f"📊 Average Accuracy: {mean_accuracy:.2%}")
print(f"\n💡 Interpretation: On average, the model is {mean_accuracy:.1%} accurate")
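
The formula above can also be checked by hand. The following is a minimal sketch that uses only the standard library and the same scores as the cell above; `manual_mean` and `library_mean` are illustrative names, not something defined elsewhere in this notebook.

```python
import statistics

# Same accuracy scores as in the cell above
accuracy_scores = [0.72, 0.78, 0.81, 0.85, 0.87, 0.89, 0.90, 0.91, 0.92, 0.93]

# Mean = sum of all values / number of values
manual_mean = sum(accuracy_scores) / len(accuracy_scores)

# statistics.fmean should agree up to floating-point rounding
library_mean = statistics.fmean(accuracy_scores)

print(f"Manual mean:  {manual_mean:.4f}")
print(f"Library mean: {library_mean:.4f}")
```

Both lines should print the same value, and `np.mean(accuracy_scores)` should match them as well.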

### 2️⃣ Median (Middle Value)

**Definition:** The middle value when data is sorted

**When to use:** Better than mean when you have outliers (extreme values)

**AI Example:** Median response time for AI chatbot (handles slow outliers better)

In [None]:
# Example: Chatbot response times in seconds (with some slow outliers)
response_times = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 15.0, 20.0]  # Last two are outliers!

# Calculate mean vs median
mean_time = np.mean(response_times)
median_time = np.median(response_times)

print("⏱️ Response Times (seconds):", response_times)
print(f"\n📊 Mean (Average): {mean_time:.2f} seconds")
print(f"📊 Median (Middle): {median_time:.2f} seconds")
print(f"\n💡 See the difference? Outliers (15.0, 20.0) made the mean misleading!")
print(f"💡 Median ({median_time:.2f}s) better represents typical response time")
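
To see why the median ignores how extreme the outliers are, it helps to compute it by sorting. This is a sketch of the textbook definition; `median_by_sorting` is a hypothetical helper written for illustration, and `np.median` should give the same results.

```python
def median_by_sorting(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

response_times = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 15.0, 20.0]

# Odd count (9 values): the 5th sorted value
print(median_by_sorting(response_times))

# Even count (drop the last value): average of the 4th and 5th sorted values
print(median_by_sorting(response_times[:-1]))
```

Only the position of a value in the sorted order matters, so replacing 20.0 with 2000.0 would leave the median unchanged while the mean would jump.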

### 3️⃣ Mode (Most Frequent Value)

**Definition:** The value that appears most often

**When to use:** Categorical data or finding most common outcomes

**AI Example:** Most common prediction class in a classifier

In [None]:
# Example: Image classification predictions over 20 images
predictions = ['cat', 'dog', 'cat', 'bird', 'cat', 'dog', 'cat', 'cat', 
               'bird', 'cat', 'dog', 'cat', 'cat', 'bird', 'cat', 'dog', 
               'cat', 'cat', 'dog', 'cat']

# Calculate mode
mode_prediction = statistics.mode(predictions)

print("🖼️ Model Predictions:", predictions)
print(f"\n📊 Mode (Most Common): {mode_prediction}")
print(f"💡 The model predicts '{mode_prediction}' most frequently")
print(f"💡 This might indicate class imbalance in your training data!")
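
The mode alone does not say how dominant the top class is. As a rough sketch, `collections.Counter` gives the full frequency table for the same predictions, and `statistics.multimode` reports every value tied for first place (the second call uses a small made-up list to show a tie).

```python
from collections import Counter
import statistics

predictions = ['cat', 'dog', 'cat', 'bird', 'cat', 'dog', 'cat', 'cat',
               'bird', 'cat', 'dog', 'cat', 'cat', 'bird', 'cat', 'dog',
               'cat', 'cat', 'dog', 'cat']

counts = Counter(predictions)
total = len(predictions)

# Count and share of each predicted class, most frequent first
for label, count in counts.most_common():
    print(f"{label:<5} {count:>2}  ({count / total:.0%})")

# multimode returns a list with every value tied for most frequent
print(statistics.multimode(predictions))
print(statistics.multimode(['cat', 'dog', 'cat', 'dog']))  # made-up tie
```

A share well above what you expect for a class is a reason to look at the label balance of the training set.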

## 🎯 YOUR TURN: Calculate Central Tendency

**Scenario:** You're analyzing a RAG (Retrieval-Augmented Generation) system's retrieval scores.

Calculate mean, median, and interpret the results!

In [None]:
# RAG retrieval relevance scores (0-100)
retrieval_scores = [85, 92, 78, 88, 95, 90, 87, 15, 91, 89]  # One outlier at 15!

# YOUR CODE HERE:
# 1. Calculate the mean
mean_score = np.mean(retrieval_scores)

# 2. Calculate the median
median_score = np.median(retrieval_scores)

# 3. Print both
print(f"Mean Score: {mean_score:.2f}")
print(f"Median Score: {median_score:.2f}")

# 4. Which one is better for this data and why?
print(f"\n💡 The median ({median_score:.2f}) is better because it's barely affected by the outlier (15)")

## 📐 Part 2: Measures of Spread (Variability)

**Spread** answers: *"How much do values differ from each other?"*

Two models can have the same average accuracy but very different consistency!

### 1️⃣ Variance

**Definition:** Average of squared differences from the mean

**Formula:** Σ(x - mean)² / n

**AI Use:** Measures model stability/consistency

In [None]:
# Two different AI models with nearly the SAME average accuracy (~85%) but different consistency
model_a_accuracy = [0.85, 0.86, 0.85, 0.86, 0.85, 0.86, 0.85, 0.86, 0.85, 0.86]  # Very consistent
model_b_accuracy = [0.70, 0.95, 0.75, 0.90, 0.80, 0.88, 0.82, 0.92, 0.78, 0.95]  # Very inconsistent

# Calculate means
mean_a = np.mean(model_a_accuracy)
mean_b = np.mean(model_b_accuracy)

# Calculate variances
var_a = np.var(model_a_accuracy)
var_b = np.var(model_b_accuracy)

print("📊 Model A (Consistent):")
print(f"   Mean Accuracy: {mean_a:.2%}")
print(f"   Variance: {var_a:.6f}")

print("\n📊 Model B (Inconsistent):")
print(f"   Mean Accuracy: {mean_b:.2%}")
print(f"   Variance: {var_b:.6f}")

print(f"\n💡 Both models have ~85% average accuracy")
print(f"💡 But Model B has {var_b/var_a:.1f}x higher variance = less reliable!")
print(f"💡 In production, you'd choose Model A (lower variance = more predictable)")
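
The variance formula can be worked through step by step. The sketch below applies it to the Model B scores and also shows the sample variant that divides by n - 1, which is what many textbooks use for data drawn from a larger population. `np.var` divides by n by default; passing `ddof=1` switches it to n - 1. The variable names are illustrative.

```python
import statistics

model_b_accuracy = [0.70, 0.95, 0.75, 0.90, 0.80, 0.88, 0.82, 0.92, 0.78, 0.95]

n = len(model_b_accuracy)
mean_b = sum(model_b_accuracy) / n

# Squared difference of each value from the mean
squared_diffs = [(x - mean_b) ** 2 for x in model_b_accuracy]

population_var = sum(squared_diffs) / n        # divide by n (the formula in this section)
sample_var = sum(squared_diffs) / (n - 1)      # divide by n - 1 (Bessel's correction)

print(f"Population variance: {population_var:.6f}")
print(f"Sample variance:     {sample_var:.6f}")

# The standard library exposes both conventions under separate names
print(f"statistics.pvariance: {statistics.pvariance(model_b_accuracy):.6f}")
print(f"statistics.variance:  {statistics.variance(model_b_accuracy):.6f}")
```

With only 10 values the two conventions differ noticeably; the gap shrinks as n grows.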

### 2️⃣ Standard Deviation (Most Important!)

**Definition:** Square root of variance

**Why better than variance?** Same units as original data (easier to interpret)

**Rule of thumb (Normal distribution):**
- 68% of data within 1 standard deviation of mean
- 95% of data within 2 standard deviations of mean
- 99.7% of data within 3 standard deviations of mean
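
The rule of thumb can be checked empirically. This is a small simulation sketch using the standard library; the mean of 75, standard deviation of 10, sample size, and seed are arbitrary illustrative choices, and `fraction_within` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated normal data: mean 75, standard deviation 10 (illustrative values)
samples = [random.gauss(75, 10) for _ in range(10_000)]

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)  # population standard deviation (divide by n)

def fraction_within(k):
    """Fraction of samples within k standard deviations of the mean."""
    inside = sum(1 for x in samples if abs(x - mean) <= k * std)
    return inside / len(samples)

for k in (1, 2, 3):
    print(f"Within {k} std: {fraction_within(k):.1%}")
```

The printed fractions should land close to 68%, 95%, and 99.7%. For skewed or bimodal data they can be quite different, which is why the rule is stated for the normal distribution only.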

In [None]:
# Example: Training loss values during model training
training_losses = [2.5, 2.3, 2.1, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4]

# Calculate mean and standard deviation
mean_loss = np.mean(training_losses)
std_loss = np.std(training_losses)

print("📉 Training Losses:", training_losses)
print(f"\n📊 Mean Loss: {mean_loss:.2f}")
print(f"📊 Standard Deviation: {std_loss:.2f}")
print(f"\n💡 One std around the mean spans {mean_loss - std_loss:.2f} to {mean_loss + std_loss:.2f}")
print(f"💡 These losses fall steadily, so std here mostly reflects the downward trend; compare std across runs to judge stability")

## 🎯 YOUR TURN: Analyze Model Variability

**Scenario:** You're testing two multimodal AI models (combining vision + text) for image captioning.

In [None]:
# BLEU scores (0-100) for two multimodal models
model_vision_text_a = [78, 82, 79, 81, 80, 82, 79, 81, 80, 82]
model_vision_text_b = [65, 85, 70, 90, 68, 88, 72, 86, 75, 92]

# YOUR CODE HERE:
# 1. Calculate mean for both models
mean_a = np.mean(model_vision_text_a)
mean_b = np.mean(model_vision_text_b)

# 2. Calculate standard deviation for both
std_a = np.std(model_vision_text_a)
std_b = np.std(model_vision_text_b)

# 3. Print results
print(f"Model A: Mean={mean_a:.2f}, Std={std_a:.2f}")
print(f"Model B: Mean={mean_b:.2f}, Std={std_b:.2f}")

# 4. Which model would you deploy and why?
print(f"\n💡 Choose Model A: Similar mean but {std_b/std_a:.1f}x lower standard deviation = more reliable!")

## 📊 Part 3: Distributions and Histograms

**Distribution:** How values are spread across possible values

**Histogram:** Visual representation of distribution

**Why it matters:**
- Identify data patterns
- Detect outliers
- Understand data shape (normal, skewed, bimodal)
- Critical for feature engineering in ML

### Creating Your First Histogram

In [None]:
# Example: 1000 simulated model confidence scores (nominal 0-100 scale; the draw is not clipped, so a few values can exceed 100)
np.random.seed(42)  # For reproducibility
confidence_scores = np.random.normal(loc=75, scale=10, size=1000)  # Normal distribution

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(confidence_scores, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
plt.axvline(np.mean(confidence_scores), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(confidence_scores):.2f}')
plt.axvline(np.median(confidence_scores), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(confidence_scores):.2f}')

plt.xlabel('Confidence Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Model Confidence Scores', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"📊 Mean: {np.mean(confidence_scores):.2f}")
print(f"📊 Median: {np.median(confidence_scores):.2f}")
print(f"📊 Std Dev: {np.std(confidence_scores):.2f}")
print(f"\n💡 This is a NORMAL distribution - a very common modeling assumption in AI/ML!")
print(f"💡 Symmetric, bell-shaped, mean ≈ median")
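
A histogram is just a count of how many values fall into each bin. The sketch below builds those counts by hand with equal-width bins between the minimum and maximum, which is the same basic idea `plt.hist` uses by default. `bin_counts` is a hypothetical helper and the scores are made-up values.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values identical: a single bin holds everything
        return [len(values)], [low, high]
    counts = [0] * n_bins
    for x in values:
        # The maximum value would compute to index n_bins, so clamp it into the last bin
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return counts, edges

scores = [62, 68, 70, 71, 73, 74, 75, 75, 76, 77, 78, 80, 82, 85, 92]  # made-up values
counts, edges = bin_counts(scores, n_bins=5)

# Text version of the histogram: one '#' per value in the bin
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"{left:5.1f} - {right:5.1f} | {'#' * count}")
```

Changing `n_bins` changes the picture: too few bins hide structure, too many make every bar noisy.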

### Understanding Different Distribution Shapes

In [None]:
# Create 3 different distributions
np.random.seed(42)

# 1. Normal distribution (symmetric)
normal_data = np.random.normal(loc=50, scale=10, size=1000)

# 2. Right-skewed (long tail on right) - common in real-world data
skewed_data = np.random.exponential(scale=20, size=1000)

# 3. Bimodal (two peaks) - might indicate two different user groups
bimodal_data = np.concatenate([np.random.normal(30, 5, 500), np.random.normal(70, 5, 500)])

# Plot all three
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].hist(normal_data, bins=30, color='lightblue', edgecolor='black')
axes[0].set_title('Normal Distribution\n(Symmetric)', fontweight='bold')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(skewed_data, bins=30, color='lightcoral', edgecolor='black')
axes[1].set_title('Right-Skewed Distribution\n(Long right tail)', fontweight='bold')
axes[1].set_xlabel('Value')

axes[2].hist(bimodal_data, bins=30, color='lightgreen', edgecolor='black')
axes[2].set_title('Bimodal Distribution\n(Two peaks)', fontweight='bold')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()

print("💡 Understanding these patterns helps you:")
print("   1. Choose the right ML algorithm")
print("   2. Decide if data needs transformation")
print("   3. Identify potential data quality issues")
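
Comparing mean and median gives a quick numeric hint about shape before you plot anything. This sketch regenerates the three shapes with the standard library's `random` module, so the exact numbers will differ from the NumPy cell above even with the same seed; the parameters are chosen to mirror that cell.

```python
import random
import statistics

random.seed(42)

# Stdlib stand-ins for the three shapes plotted above
normal_data = [random.gauss(50, 10) for _ in range(1000)]
skewed_data = [random.expovariate(1 / 20) for _ in range(1000)]  # exponential with mean 20
bimodal_data = ([random.gauss(30, 5) for _ in range(500)]
                + [random.gauss(70, 5) for _ in range(500)])

datasets = [("normal", normal_data), ("right-skewed", skewed_data), ("bimodal", bimodal_data)]
gaps = {}
for name, data in datasets:
    mean = statistics.fmean(data)
    median = statistics.median(data)
    gaps[name] = mean - median
    print(f"{name:<13} mean={mean:6.2f}  median={median:6.2f}  mean-median={gaps[name]:+.2f}")
```

A mean clearly above the median points to a right tail. Note that the bimodal data is also roughly symmetric, so a small gap does not prove the data is normal; the histogram is still needed to see the two peaks.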

## 🎯 REAL AI EXAMPLE: Analyzing Model Performance Metrics

**Scenario:** You're building an **Agentic AI** system (AI agents that can take actions). You need to analyze how well different agents perform across various tasks.

Let's use everything we learned!

In [None]:
# Performance scores for 3 different AI agents across 50 tasks
np.random.seed(42)

agent_gpt = np.random.normal(loc=85, scale=5, size=50)    # Consistent, high performer
agent_claude = np.random.normal(loc=87, scale=3, size=50) # More consistent, slightly better
agent_custom = np.random.normal(loc=82, scale=12, size=50) # Less consistent

# Calculate all statistics
agents = {
    'GPT Agent': agent_gpt,
    'Claude Agent': agent_claude,
    'Custom Agent': agent_custom
}

print("🤖 AGENTIC AI PERFORMANCE ANALYSIS")
print("=" * 60)

for name, scores in agents.items():
    print(f"\n{name}:")
    print(f"  Mean Score:      {np.mean(scores):.2f}")
    print(f"  Median Score:    {np.median(scores):.2f}")
    print(f"  Std Deviation:   {np.std(scores):.2f}")
    print(f"  Min-Max:         {np.min(scores):.2f} - {np.max(scores):.2f}")

# Visualize
plt.figure(figsize=(12, 6))

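# Box plot for comparison
plt.subplot(1, 2, 1)
plt.boxplot([agent_gpt, agent_claude, agent_custom])
# Label the boxes via xticks; boxplot's `labels` keyword was renamed to `tick_labels` in newer Matplotlib releases
plt.xticks([1, 2, 3], ['GPT', 'Claude', 'Custom'])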
plt.ylabel('Performance Score', fontsize=12)
plt.title('Agent Performance Comparison\n(Box Plot)', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)

# Histograms
plt.subplot(1, 2, 2)
plt.hist(agent_gpt, bins=15, alpha=0.5, label='GPT', color='blue')
plt.hist(agent_claude, bins=15, alpha=0.5, label='Claude', color='orange')
plt.hist(agent_custom, bins=15, alpha=0.5, label='Custom', color='green')
plt.xlabel('Performance Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Performance Distribution\n(Histogram)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 60)
print("💡 DECISION: Choose Claude Agent!")
print(f"   ✅ Highest mean ({np.mean(agent_claude):.1f})")
print(f"   ✅ Lowest std dev ({np.std(agent_claude):.1f}) = most reliable")
print("   ✅ Tightest distribution = predictable performance")
print("=" * 60)
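
The box plot above is drawn from quartiles: the box spans Q1 to Q3, the line inside it is the median, and by Matplotlib's default the whiskers reach at most 1.5 × IQR beyond the box, with points outside shown as outliers. The sketch below computes those numbers directly for the RAG retrieval scores from the earlier exercise. The 1.5 × IQR cutoff is a common convention, not the only possible one.

```python
import statistics

# RAG retrieval scores from the earlier exercise (one low outlier at 15)
retrieval_scores = [85, 92, 78, 88, 95, 90, 87, 15, 91, 89]

# Quartiles Q1, median, Q3; method="inclusive" interpolates linearly between data points
q1, q2, q3 = statistics.quantiles(retrieval_scores, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: width of the middle 50% of the data

# Flag points more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in retrieval_scores if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

Like the median, the quartiles barely move when the outlier becomes more extreme, which makes this check more robust than one based on the mean and standard deviation.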

## 🎯 CHALLENGE: Analyze Transformer Model Attention Scores

**Scenario:** You're analyzing attention scores from a Transformer model (like GPT, BERT) to understand how it focuses on different tokens.

Complete the analysis below!

In [None]:
# Simulated attention weights (0-1 scale)
np.random.seed(123)
attention_weights = np.random.beta(a=2, b=5, size=200)  # Beta(2, 5): a right-skewed stand-in for attention weights

# YOUR TASK: Complete this analysis

# 1. Calculate mean, median, std
mean_attention = np.mean(attention_weights)
median_attention = np.median(attention_weights)
std_attention = np.std(attention_weights)

print("🔍 Transformer Attention Analysis")
print(f"Mean:   {mean_attention:.3f}")
print(f"Median: {median_attention:.3f}")
print(f"Std:    {std_attention:.3f}")

# 2. Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(attention_weights, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.axvline(mean_attention, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_attention:.3f}')
plt.xlabel('Attention Weight', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Transformer Attention Weights', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

# 3. Interpretation
print(f"\n💡 Interpretation:")
print(f"   - Most attention weights are LOW (right-skewed distribution)")
print(f"   - Model focuses on a FEW important tokens (sparse attention)")
print(f"   - This is TYPICAL for transformers - they learn to be selective!")

## 📝 Summary & Key Takeaways

**You just learned:**

### Measures of Central Tendency:
- ✅ **Mean** - Average value (sensitive to outliers)
- ✅ **Median** - Middle value (robust to outliers)
- ✅ **Mode** - Most frequent value

### Measures of Spread:
- ✅ **Variance** - Average squared deviation
- ✅ **Standard Deviation** - Square root of variance (same units as data)
- ✅ Lower spread = more consistent/reliable model

### Distributions:
- ✅ **Histograms** - Visualize data patterns
- ✅ **Normal, Skewed, Bimodal** - Different distribution shapes
- ✅ Understanding shape guides ML decisions

### Real AI Applications:
- ✅ Model performance evaluation
- ✅ Agentic AI comparison
- ✅ Transformer attention analysis
- ✅ RAG system optimization
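
The measures in this summary can be bundled into one small helper for quick checks. This is only a sketch: `describe` is a hypothetical function (not part of NumPy or this notebook), and it uses the population formulas (divide by n) to match `np.var` and `np.std` defaults.

```python
import statistics

def describe(values):
    """Return the summary statistics covered in this lesson as a dict."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),  # population variance (divide by n)
        "std": statistics.pstdev(values),          # square root of the variance above
        "min": min(values),
        "max": max(values),
    }

# RAG retrieval scores from the earlier exercise
retrieval_scores = [85, 92, 78, 88, 95, 90, 87, 15, 91, 89]
summary = describe(retrieval_scores)

for name, value in summary.items():
    print(f"{name:<9}{value:.2f}")
```

Reading the output top to bottom follows the lesson's order: typical value first, then spread, then the range that hints at outliers.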

---

## 🚀 Practice Exercises

**Before Day 2, complete these exercises:**

1. **RAG Retrieval Analysis:**
   - Generate 100 random retrieval scores (use `np.random.normal`)
   - Calculate all statistics
   - Create a histogram
   - Add mean and median lines

2. **Multimodal Model Comparison:**
   - Create data for 2 multimodal models (vision+text+audio)
   - One should be consistent, one inconsistent
   - Calculate statistics
   - Decide which to deploy

3. **Outlier Detection:**
   - Create a dataset with deliberate outliers
   - Show how mean vs median differ
   - Visualize with histogram
   - Explain when to use each measure

---

## 🎯 Real-World Connection: Why This Matters in 2024-2025

**These exact techniques are used daily by AI engineers for:**

1. **RAG Systems (widely adopted by 2024):**
   - Analyze retrieval score distributions
   - Optimize chunk size using statistical metrics
   - Compare different embedding models

2. **Agentic AI (2025 trend):**
   - Measure agent reliability (low variance = good!)
   - Compare agent performance across tasks
   - Detect anomalous agent behavior

3. **Multimodal Models:**
   - Analyze attention distributions across modalities
   - Balance vision/text/audio contributions
   - Evaluate cross-modal alignment

4. **Transformer Models:**
   - Analyze attention weight distributions
   - Understand token importance
   - Debug model behavior

**Bottom line:** Statistics is not optional - it's ESSENTIAL for AI/ML! 🎯

---

## 📚 Next Lesson

**Day 2: Probability Theory**
- Probability fundamentals
- Conditional probability and Bayes' theorem
- Probability distributions
- Build a Naive Bayes classifier!

---

**💬 Questions?** Review this notebook, experiment with different datasets, break things and fix them!

*Remember: Statistics is the language of AI. Master it, and you master AI!* 🚀