# Statistics Foundations for Machine Learning

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what's possible!<br>
📝 **Notation**: Breaking down mathematical notation into plain English.<br>

### Learning Objectives

By the end of this workshop, you will be able to:

1. [Describe data using summary statistics](#part1) – mean, variance, and standard deviation
2. [Understand distributions](#part2) – especially the normal distribution and why it matters
3. [Reason from samples to populations](#part3) – the sampling distribution and standard error
4. [Quantify uncertainty](#part4) – confidence intervals and what they really mean
5. [Evaluate evidence](#part5) – hypothesis testing, p-values, and effect sizes
6. [Understand relationships](#part6) – correlation and its limitations
7. [Bridge to ML/NLP](#part7) – vectors, matrices, and similarity measures

### Prerequisites

This workshop assumes basic Python familiarity (variables, lists, functions). No prior statistics knowledge is required.

## Setup

Let's import the libraries we'll use throughout this workshop.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For nicer plots
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)

# Set random seed for reproducibility
np.random.seed(42)

## Our Dataset: Community Survey

Throughout this workshop, we'll work with a simulated dataset from a community survey. This dataset contains information about 200 respondents, including their demographics and attitudes.

Let's create and explore this dataset:

In [None]:
# Create our simulated survey dataset
n = 200

# Generate correlated variables
education_years = np.random.normal(14, 3, n).clip(8, 22).round()
age = np.random.normal(42, 15, n).clip(18, 85).round()

# Income correlates with education (with noise)
income = (25000 + education_years * 4000 + np.random.normal(0, 15000, n)).clip(15000, 200000).round(-2)

# Political engagement correlates with education and age
engagement_score = (
    2 + 
    0.3 * (education_years - 14) + 
    0.02 * (age - 42) + 
    np.random.normal(0, 1.5, n)
).clip(0, 10).round(1)

# Social trust score
social_trust = (
    5 + 
    0.2 * (education_years - 14) + 
    np.random.normal(0, 2, n)
).clip(0, 10).round(1)

# News consumption (hours per week)
news_hours = np.random.exponential(5, n).clip(0, 30).round(1)

# Create DataFrame
survey = pd.DataFrame({
    'age': age,
    'education_years': education_years,
    'income': income,
    'social_trust': social_trust,
    'news_hours': news_hours,
    'political_engagement': engagement_score
})

survey.head(10)

In [None]:
# Basic info about our dataset
print(f"Dataset shape: {survey.shape[0]} respondents, {survey.shape[1]} variables")
print(f"\nVariables:")
for col in survey.columns:
    print(f"  - {col}")

---

<a id='part1'></a>

# Part 1: Describing Data

Before we can make claims about the world, we need to describe what we see in our data. This section covers the fundamental tools for summarizing data.

## The Big Question

> If someone asked you to describe a variable in your dataset using just one or two numbers, what would you tell them?

## Central Tendency: Where is the "Middle"?

The most basic question about a variable is: **What's a typical value?**

We have three main ways to answer this:

| Measure | What it is | When to use it |
|---------|------------|----------------|
| **Mean** | Sum of all values ÷ count | Symmetric data without extreme outliers |
| **Median** | Middle value when sorted | Skewed data or when outliers are present |
| **Mode** | Most frequent value | Categorical data or finding peaks |

### The Mean (Average)

The mean is the "center of gravity" of your data. Add up all values and divide by how many you have.

📝 **Notation**:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$

Let's break this down piece by piece:
- $\bar{x}$ (read as "x-bar") is the mean
- $n$ is how many data points we have
- $\sum$ (capital sigma) means "add up"
- $x_i$ means "the i-th value" (so $x_1$ is the first value, $x_2$ is the second, etc.)
- $\sum_{i=1}^{n}$ means "add up from i=1 to i=n"; in other words, add up all values

In plain English: **Add up all the x values, then divide by how many there are.**

In [None]:
# Let's calculate the mean of age, step by step
ages = survey['age'].values

# Step 1: Sum all values
total = sum(ages)
print(f"Sum of all ages: {total}")

# Step 2: Count how many values
n = len(ages)
print(f"Number of respondents: {n}")

# Step 3: Divide
mean_age = total / n
print(f"Mean age: {mean_age:.1f} years")

In [None]:
# Of course, NumPy and pandas do this for us
print(f"Mean age (numpy): {np.mean(ages):.1f}")
print(f"Mean age (pandas): {survey['age'].mean():.1f}")

### The Median

The median is the middle value when you sort your data. Half of the values are above it, half below.

Why do we need both mean and median? Let's see:

In [None]:
# Compare mean and median for income
print(f"Income - Mean: ${survey['income'].mean():,.0f}")
print(f"Income - Median: ${survey['income'].median():,.0f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Age (roughly symmetric)
axes[0].hist(survey['age'], bins=20, edgecolor='white', alpha=0.7)
axes[0].axvline(survey['age'].mean(), color='red', linestyle='--', label=f"Mean: {survey['age'].mean():.1f}")
axes[0].axvline(survey['age'].median(), color='blue', linestyle='--', label=f"Median: {survey['age'].median():.1f}")
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Count')
axes[0].set_title('Age Distribution (roughly symmetric)')
axes[0].legend()

# News hours (skewed)
axes[1].hist(survey['news_hours'], bins=20, edgecolor='white', alpha=0.7)
axes[1].axvline(survey['news_hours'].mean(), color='red', linestyle='--', label=f"Mean: {survey['news_hours'].mean():.1f}")
axes[1].axvline(survey['news_hours'].median(), color='blue', linestyle='--', label=f"Median: {survey['news_hours'].median():.1f}")
axes[1].set_xlabel('News Consumption (hours/week)')
axes[1].set_ylabel('Count')
axes[1].set_title('News Hours (right-skewed)')
axes[1].legend()

plt.tight_layout()
plt.show()

🔔 **Question**: Look at the two histograms above. In which case are the mean and median similar? In which case are they different? Why?

💡 **Tip**: When data is skewed (has a long tail in one direction), the mean gets "pulled" toward the tail. The median is more robust to outliers and skew.
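
To make the "pull" concrete, here is a minimal sketch using a handful of made-up values (the numbers in `hours` are hypothetical and not taken from the survey). It adds one extreme respondent and compares how far the mean and the median move.

```python
import numpy as np

# Hypothetical weekly news hours for nine people (illustrative values only)
hours = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6], dtype=float)
mean_before, median_before = hours.mean(), np.median(hours)

# Add one extreme respondent who reports 60 hours a week
with_outlier = np.append(hours, 60.0)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(f"Without outlier: mean = {mean_before:.2f}, median = {median_before:.2f}")
print(f"With outlier:    mean = {mean_after:.2f}, median = {median_after:.2f}")
```

A single extreme value nearly triples the mean here, while the median only shifts by half an hour.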

## Spread: How Variable is the Data?

Knowing the center isn't enough. Two datasets can have the same mean but look completely different:


In [None]:
# Two datasets with the same mean, different spread
narrow = np.random.normal(50, 5, 1000)   # Mean 50, small spread
wide = np.random.normal(50, 20, 1000)    # Mean 50, large spread

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(narrow, bins=30, edgecolor='white', alpha=0.7)
axes[0].axvline(np.mean(narrow), color='red', linestyle='--')
axes[0].set_xlim(-20, 120)
axes[0].set_title(f'Narrow spread (mean = {np.mean(narrow):.1f})')

axes[1].hist(wide, bins=30, edgecolor='white', alpha=0.7)
axes[1].axvline(np.mean(wide), color='red', linestyle='--')
axes[1].set_xlim(-20, 120)
axes[1].set_title(f'Wide spread (mean = {np.mean(wide):.1f})')

plt.tight_layout()
plt.show()

print(f"Both have nearly the same mean, but very different spreads!")

### Variance and Standard Deviation

We need a number that captures "how spread out" the data is. Here's the idea:

1. Find how far each point is from the mean
2. Square these distances (to make them all positive)
3. Take the average of these squared distances → **Variance**
4. Take the square root to get back to original units → **Standard Deviation**

📝 **Notation**:

**Variance** (sigma squared):
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

**Standard Deviation** (sigma):
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Let's break this down:
- $(x_i - \bar{x})$ is how far point $i$ is from the mean (the "deviation")
- $(x_i - \bar{x})^2$ is that deviation squared
- We sum all squared deviations and divide by $n$ to get the average

In plain English: **The variance is the average squared distance from the mean. The standard deviation is the square root of that.**

In [None]:
# Calculate variance step by step for education_years
edu = survey['education_years'].values

# Step 1: Calculate the mean
mean_edu = np.mean(edu)
print(f"Mean education: {mean_edu:.2f} years")

# Step 2: Calculate deviations from the mean
deviations = edu - mean_edu
print(f"\nFirst 5 deviations: {deviations[:5].round(2)}")
print(f"(These show how far each person is from the mean)")

# Step 3: Square the deviations
squared_deviations = deviations ** 2
print(f"\nFirst 5 squared deviations: {squared_deviations[:5].round(2)}")

# Step 4: Take the mean of squared deviations = Variance
variance = np.mean(squared_deviations)
print(f"\nVariance: {variance:.2f} years²")

# Step 5: Square root = Standard Deviation
std_dev = np.sqrt(variance)
print(f"Standard Deviation: {std_dev:.2f} years")

In [None]:
# Verify with NumPy
print(f"NumPy std: {np.std(edu):.2f}")
print(f"Our calculation: {std_dev:.2f}")

⚠️ **Warning**: You may see formulas that divide by $(n-1)$ instead of $n$. This is the "sample standard deviation", used when estimating the population SD from a sample. For now, don't worry about this distinction: both give very similar results for large samples. NumPy's `np.std` divides by $n$ by default (`ddof=0`); pandas' `.std()` divides by $n-1$ (`ddof=1`).
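
The two defaults are easy to compare side by side. The sketch below uses a small made-up array (`values` is illustrative, not survey data) so the arithmetic can be checked by hand: the squared deviations sum to 32, giving a variance of 32/8 = 4 with $n$ and 32/7 with $n-1$.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

sd_population = np.std(values)        # divides by n (ddof=0, the NumPy default)
sd_sample = np.std(values, ddof=1)    # divides by n - 1
sd_pandas = pd.Series(values).std()   # pandas default is ddof=1

print(f"NumPy default (n):      {sd_population:.3f}")
print(f"NumPy with ddof=1:      {sd_sample:.3f}")
print(f"pandas default (n - 1): {sd_pandas:.3f}")
```

If a NumPy result and a pandas result disagree in the second or third decimal place, this default is usually the reason.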

### Interpreting Standard Deviation

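The standard deviation tells you: **Roughly how far is a typical data point from the mean?** (Strictly, it is the square root of the average squared deviation, so large deviations count a little extra.)

- Small SD → data points cluster tightly around the mean
- Large SD → data points are spread out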


In [None]:
# Summary statistics for all variables
summary_stats = survey.describe().T[['mean', 'std', 'min', 'max']]
summary_stats.columns = ['Mean', 'Std Dev', 'Min', 'Max']
summary_stats.round(2)

## 🥊 Challenge 1: Calculate Summary Statistics

Calculate the mean, median, and standard deviation for the `social_trust` variable **by hand** (using basic Python operations), then verify with NumPy/pandas.

In [None]:
# Get the social_trust values
trust = survey['social_trust'].values

# YOUR CODE HERE
# Calculate mean
mean_trust = ___

# Calculate median (hint: sort first, then find middle value)
median_trust = ___

# Calculate standard deviation
std_trust = ___

print(f"Mean: {mean_trust:.2f}")
print(f"Median: {median_trust:.2f}")
print(f"Std Dev: {std_trust:.2f}")

In [None]:
# Verify with NumPy
print(f"\nVerification:")
print(f"Mean: {np.mean(trust):.2f}")
print(f"Median: {np.median(trust):.2f}")
print(f"Std Dev: {np.std(trust):.2f}")

---

<a id='part2'></a>

# Part 2: The Shape of Data

Data has shape, and that shape matters. The most important shape in statistics is the **normal distribution** (also called the bell curve or Gaussian distribution).

## The Normal Distribution

The normal distribution appears everywhere:
- Heights and weights of people
- Measurement errors
- Test scores
- Many natural phenomena

It's defined by just two parameters:
- **μ (mu)**: the mean (center of the bell)
- **σ (sigma)**: the standard deviation (width of the bell)

In [None]:
# Visualize normal distributions with different parameters
x = np.linspace(-10, 20, 1000)

def normal_pdf(x, mu, sigma):
    """Calculate the normal probability density function."""
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Different means
for mu in [0, 5, 10]:
    axes[0].plot(x, normal_pdf(x, mu, 2), label=f'μ = {mu}')
axes[0].set_title('Same spread (σ = 2), different centers')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')
axes[0].legend()

# Different standard deviations
for sigma in [1, 2, 4]:
    axes[1].plot(x, normal_pdf(x, 5, sigma), label=f'σ = {sigma}')
axes[1].set_title('Same center (μ = 5), different spreads')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Density')
axes[1].legend()

plt.tight_layout()
plt.show()

### The 68-95-99.7 Rule

For normally distributed data, there's a handy rule:

- **68%** of data falls within **1 standard deviation** of the mean
- **95%** of data falls within **2 standard deviations** of the mean
- **99.7%** of data falls within **3 standard deviations** of the mean

This is incredibly useful for understanding what values are "typical" vs. "unusual."
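
The rule can be checked empirically. The sketch below draws a large simulated sample from a standard normal distribution (the seed and sample size are arbitrary choices) and counts the share of values within 1, 2, and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
draws = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Share of simulated values within k standard deviations of the mean
within = {k: np.mean(np.abs(draws) <= k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"Within {k} SD: {share:.1%}")
```

The simulated shares should land close to 68%, 95%, and 99.7%, with small differences due to random sampling.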

In [None]:
# Visualize the 68-95-99.7 rule
x = np.linspace(-4, 4, 1000)
y = normal_pdf(x, 0, 1)

fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(x, y, 'k', linewidth=2)

# Fill regions
ax.fill_between(x, y, where=(x >= -1) & (x <= 1), alpha=0.3, color='blue', label='68% (±1σ)')
ax.fill_between(x, y, where=((x >= -2) & (x < -1)) | ((x > 1) & (x <= 2)), alpha=0.3, color='green', label='95% (±2σ)')
ax.fill_between(x, y, where=((x >= -3) & (x < -2)) | ((x > 2) & (x <= 3)), alpha=0.3, color='orange', label='99.7% (±3σ)')

ax.set_xlabel('Standard Deviations from Mean', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('The 68-95-99.7 Rule', fontsize=14)
ax.legend(loc='upper right')
ax.set_xlim(-4, 4)

plt.show()

### Z-Scores: Standardizing Data

A **z-score** tells you how many standard deviations a value is from the mean.

📝 **Notation**:

$$z = \frac{x - \bar{x}}{\sigma}$$

In plain English: **Subtract the mean, then divide by the standard deviation.**

After this transformation:
- A z-score of 0 means the value equals the mean
- A z-score of 1 means the value is one SD above the mean
- A z-score of -2 means the value is two SDs below the mean

Z-scores let you compare values from different distributions and are essential for data preprocessing in ML.

In [None]:
# Calculate z-scores for age
age_mean = survey['age'].mean()
age_std = survey['age'].std()

survey['age_zscore'] = (survey['age'] - age_mean) / age_std

# Look at a few examples
print("Original ages and their z-scores:")
print(survey[['age', 'age_zscore']].head(10).to_string())

print(f"\nInterpretation:")
print(f"Mean age: {age_mean:.1f}, SD: {age_std:.1f}")
print(f"A z-score of 1.0 means the person is {age_std:.1f} years older than average")

In [None]:
# After standardization, the z-scores have mean ≈ 0 and SD ≈ 1
print(f"Z-score mean: {survey['age_zscore'].mean():.6f}")
print(f"Z-score std: {survey['age_zscore'].std():.6f}")

🔔 **Question**: If someone has an income z-score of 2.5, what can you say about their income relative to the sample?

💡 **Tip**: In ML preprocessing, we often standardize all features so they're on the same scale. This prevents features with large values (like income) from dominating features with small values (like age).
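
As a rough sketch of what standardizing several features at once can look like, the example below applies the z-score formula to every column of a small DataFrame. The `features` table and its values are hypothetical, made up only to show two columns on very different scales.

```python
import pandas as pd

# Hypothetical features on very different scales (illustrative values only)
features = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [28_000, 41_000, 56_000, 73_000, 120_000],
})

# Subtract each column's mean and divide by its (sample) standard deviation;
# pandas applies the operation column by column
standardized = (features - features.mean()) / features.std()

print(standardized.round(2))
```

After this step both columns have mean 0 and SD 1, so neither dominates purely because of its units. When a model is later evaluated on held-out data, a common practice is to compute the means and SDs on the training data only and reuse them for the test data.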

### When Data Isn't Normal

Not all data follows a normal distribution. It's important to visualize your data before assuming normality.

In [None]:
# Compare distributions of different variables
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

variables = ['age', 'education_years', 'income', 'news_hours']
titles = ['Age (roughly normal)', 'Education (bounded)', 'Income (right-skewed)', 'News Hours (exponential)']

for ax, var, title in zip(axes.flat, variables, titles):
    ax.hist(survey[var], bins=25, edgecolor='white', alpha=0.7, density=True)
    ax.set_xlabel(var)
    ax.set_ylabel('Density')
    ax.set_title(title)

plt.tight_layout()
plt.show()

## 🥊 Challenge 2: Find Unusual Values

Using z-scores, find respondents with "unusual" income (more than 2 standard deviations from the mean). How many are there? Are they unusually high, low, or both?

In [None]:
# YOUR CODE HERE

# Step 1: Calculate z-scores for income
income_zscore = ___

# Step 2: Find respondents with |z| > 2
unusual = ___

# Step 3: Examine them
print(f"Number of unusual income values: ___")
print(f"\nThese respondents:")
# Show the unusual cases

---

<a id='part3'></a>

# Part 3: From Sample to Population

This is the heart of statistics. We almost never see the whole population; we only see a sample. How can we make claims about the population based on our sample?

## The Problem

Imagine we want to know the average political engagement of all adults in California. We can't survey everyone, so we survey 200 people.

Our sample mean is 4.8. But:
- If we surveyed a *different* 200 people, would we get exactly 4.8 again?
- Probably not! We'd get something close, but not identical.

This is **sampling variability**: different samples give different results.

## The Sampling Distribution

Here's a thought experiment:
1. Take a sample of 200 people, calculate the mean
2. Repeat this 1000 times
3. Plot all 1000 means

The distribution of these sample means is called the **sampling distribution**.

In [None]:
# Simulate the sampling distribution
# For demonstration, we create a large synthetic "population"
# (not our survey data) whose true mean we know,
# then repeatedly draw samples from it
population = np.random.normal(50, 15, 100000)  # Large population, mean=50, sd=15

# Take many samples of different sizes and compute their means
sample_sizes = [10, 30, 100]
n_samples = 1000

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, n in zip(axes, sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n, replace=False)) 
                    for _ in range(n_samples)]
    
    ax.hist(sample_means, bins=40, edgecolor='white', alpha=0.7, density=True)
    ax.axvline(50, color='red', linestyle='--', label=f'True mean = 50')
    ax.axvline(np.mean(sample_means), color='blue', linestyle='--', 
               label=f'Mean of sample means = {np.mean(sample_means):.2f}')
    ax.set_xlabel('Sample Mean')
    ax.set_ylabel('Density')
    ax.set_title(f'Sample size n = {n}\nSD of sample means = {np.std(sample_means):.2f}')
    ax.legend(fontsize=8)
    ax.set_xlim(35, 65)

plt.tight_layout()
plt.show()

🔔 **Question**: Look at the three plots above. As sample size increases, what happens to the spread of the sampling distribution? Why does this matter?

## Key Insights

Two crucial observations from the simulation:

1. **The sampling distribution is centered on the true population mean.** Sample means aren't biased: on average, they equal the population mean.

2. **Larger samples give more precise estimates.** The spread of the sampling distribution shrinks as sample size grows.

## The Central Limit Theorem

One of the most important results in statistics:

> **The sampling distribution of the mean is approximately normal, regardless of the shape of the population distribution, as long as the sample size is large enough (and the population has a finite variance).**

This is why the normal distribution is so central to statistics: it describes how sample means behave.

In [None]:
# Demonstrate the Central Limit Theorem with a non-normal population
# Let's use an exponential distribution (very skewed)

skewed_population = np.random.exponential(10, 100000)  # Very right-skewed

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Top row: the population
axes[0, 0].hist(skewed_population, bins=50, edgecolor='white', alpha=0.7, density=True)
axes[0, 0].set_title('Population Distribution (exponential)')
axes[0, 0].set_xlabel('Value')

axes[0, 1].axis('off')
axes[0, 2].axis('off')

# Bottom row: sampling distributions for different sample sizes
sample_sizes = [5, 30, 100]

for ax, n in zip(axes[1], sample_sizes):
    sample_means = [np.mean(np.random.choice(skewed_population, n, replace=False)) 
                    for _ in range(1000)]
    
    ax.hist(sample_means, bins=40, edgecolor='white', alpha=0.7, density=True)
    ax.set_xlabel('Sample Mean')
    ax.set_ylabel('Density')
    ax.set_title(f'Sampling Distribution (n = {n})')

plt.tight_layout()
plt.show()

print("Notice: Even though the population is very skewed,")
print("the sampling distribution of the mean looks more and more normal as n increases!")

## Standard Error: The Precision of Our Estimate

The **standard error (SE)** is the standard deviation of the sampling distribution. It tells us how much sample means typically vary from the true population mean.

📝 **Notation**:

$$SE = \frac{\sigma}{\sqrt{n}}$$

Where:
- $\sigma$ is the population standard deviation
- $n$ is the sample size

In plain English: **The standard error equals the SD divided by the square root of the sample size.**

Key insights:
- Larger sample → smaller SE → more precise estimate
- To halve the SE, you need to quadruple the sample size (because of the square root)
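
The square-root relationship is easy to verify with a little arithmetic. The sketch below assumes a population SD of 15 (an arbitrary illustrative value) and computes the SE for sample sizes that each quadruple the previous one.

```python
import numpy as np

sigma = 15.0  # assumed population SD, chosen only for illustration

# Each sample size is four times the previous one
se_by_n = {n: sigma / np.sqrt(n) for n in (25, 100, 400)}

for n, se in se_by_n.items():
    print(f"n = {n:3d}: SE = {se:.2f}")
```

Each fourfold increase in sample size cuts the SE in half: 3.00, then 1.50, then 0.75.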

In [None]:
# Calculate standard error for our survey
engagement = survey['political_engagement']

sample_mean = engagement.mean()
sample_std = engagement.std()
n = len(engagement)

# Standard Error
SE = sample_std / np.sqrt(n)

print(f"Sample mean: {sample_mean:.3f}")
print(f"Sample SD: {sample_std:.3f}")
print(f"Sample size: {n}")
print(f"Standard Error: {SE:.3f}")
print(f"\nInterpretation: Sample means typically land within about {SE:.3f} points")
print(f"of the true population mean (and within {2 * SE:.3f} points about 95% of the time).")

⚠️ **Warning**: Don't confuse **standard deviation** (spread of individual data points) with **standard error** (precision of the mean estimate). They answer different questions:
- SD: "How spread out are people's engagement scores?"
- SE: "How precisely do we know the average engagement score?"

## 🥊 Challenge 3: Sample Size and Precision

Calculate the standard error for the mean income:
1. Using all 200 respondents
2. Using only the first 50 respondents
3. Using only the first 25 respondents

What happens to the SE as sample size decreases?

In [None]:
# YOUR CODE HERE

income = survey['income'].values

# Calculate SE for n=200
se_200 = ___

# Calculate SE for n=50
se_50 = ___

# Calculate SE for n=25
se_25 = ___

print(f"SE with n=200: ${se_200:,.0f}")
print(f"SE with n=50: ${se_50:,.0f}")
print(f"SE with n=25: ${se_25:,.0f}")

---

<a id='part4'></a>

# Part 4: Quantifying Uncertainty

We have a sample mean, and we know it's probably close to the population mean (thanks to the Central Limit Theorem). But how close? We need to quantify our uncertainty.

## Confidence Intervals

A **confidence interval** gives us a range of plausible values for the population parameter.

📝 **Notation** (for a 95% confidence interval):

$$CI = \bar{x} \pm 1.96 \times SE$$

Where:
- $\bar{x}$ is the sample mean
- 1.96 is the z-value that captures 95% of a normal distribution
- $SE$ is the standard error

In plain English: **Take the sample mean, then go about 2 standard errors in each direction.**
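
Where does 1.96 come from? It is the point on the standard normal distribution that leaves 2.5% in each tail, so that 95% sits in the middle. As a small sketch, the same logic gives the multipliers for other confidence levels, here using `scipy.stats.norm.ppf` (the inverse of the normal CDF).

```python
from scipy import stats

z_values = {}
for confidence in (0.90, 0.95, 0.99):
    # Leave (1 - confidence) / 2 in each tail of the standard normal
    z_values[confidence] = stats.norm.ppf(1 - (1 - confidence) / 2)
    print(f"{confidence:.0%} confidence: z = {z_values[confidence]:.3f}")
```

Higher confidence levels need a larger multiplier, which is why a 99% interval is wider than a 95% interval built from the same data.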

In [None]:
# Calculate a 95% confidence interval for political engagement
engagement = survey['political_engagement']

mean = engagement.mean()
se = engagement.std() / np.sqrt(len(engagement))

# 95% CI
ci_lower = mean - 1.96 * se
ci_upper = mean + 1.96 * se

print(f"Sample mean: {mean:.3f}")
print(f"Standard error: {se:.3f}")
print(f"\n95% Confidence Interval: [{ci_lower:.3f}, {ci_upper:.3f}]")

### What Does "95% Confident" Mean?

This is often misunderstood!

**Wrong interpretation**: "There's a 95% chance the true mean is in this interval."

**Correct interpretation**: "If we repeated this sampling process many times, 95% of the confidence intervals we construct would contain the true mean."

The true mean is either in the interval or it's not; we just don't know which. The "95%" refers to the procedure's reliability, not the probability for any single interval.

Let's visualize this:

In [None]:
# Simulate many confidence intervals
true_mean = 50
true_sd = 15
n = 100
n_simulations = 100

fig, ax = plt.subplots(figsize=(10, 12))

captured = 0
for i in range(n_simulations):
    sample = np.random.normal(true_mean, true_sd, n)
    sample_mean = np.mean(sample)
    sample_se = np.std(sample, ddof=1) / np.sqrt(n)  # ddof=1: sample SD (divides by n - 1)
    ci_low = sample_mean - 1.96 * sample_se
    ci_high = sample_mean + 1.96 * sample_se
    
    # Check if CI contains true mean
    contains_mean = ci_low <= true_mean <= ci_high
    color = 'blue' if contains_mean else 'red'
    if contains_mean:
        captured += 1
    
    ax.plot([ci_low, ci_high], [i, i], color=color, linewidth=1)
    ax.plot(sample_mean, i, 'o', color=color, markersize=3)

ax.axvline(true_mean, color='green', linestyle='--', linewidth=2, label=f'True mean = {true_mean}')
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Sample Number', fontsize=12)
ax.set_title(f'100 Confidence Intervals: {captured}% captured the true mean\n(Blue = captured, Red = missed)', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()

### Interpreting CI Width

- **Narrow CI** → we have a precise estimate
- **Wide CI** → lots of uncertainty

What makes CIs wider or narrower?
- More variability in the data → wider CI
- Smaller sample size → wider CI
- Higher confidence level (99% vs 95%) → wider CI

In [None]:
# Calculate CIs for all variables
def calculate_ci(data, confidence=0.95):
    """Calculate confidence interval for the mean."""
    n = len(data)
    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)  # ddof=1: sample SD, matching pandas' .std()
    
    # Z-value for common confidence levels
    z_values = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}
    z = z_values.get(confidence, 1.96)
    
    margin = z * se
    return mean, mean - margin, mean + margin

print("95% Confidence Intervals for each variable:")
print("-" * 60)

for col in ['age', 'education_years', 'income', 'social_trust', 'political_engagement']:
    mean, ci_low, ci_high = calculate_ci(survey[col])
    print(f"{col:25} {mean:10.2f}  [{ci_low:10.2f}, {ci_high:10.2f}]")

💡 **Tip**: When you report statistics in research, always include a measure of uncertainty. A mean without a confidence interval or standard error is incomplete information.

## 🥊 Challenge 4: Compare Confidence Intervals

Calculate and compare:
1. A 90% confidence interval for mean income
2. A 95% confidence interval for mean income  
3. A 99% confidence interval for mean income

Which is widest? Why?

In [None]:
# YOUR CODE HERE

income = survey['income']

# Hint: Z-values are approximately:
# 90% CI: z = 1.645
# 95% CI: z = 1.96
# 99% CI: z = 2.576

# Calculate all three CIs


---

<a id='part5'></a>

# Part 5: Is This Real or Just Noise?

We see patterns in our data all the time. But could they just be random chance? This is where hypothesis testing comes in.

## The Logic of Hypothesis Testing

Imagine someone claims a coin is fair (50% heads). You flip it 100 times and get 60 heads. Is the coin unfair, or did you just get lucky?

The logic of hypothesis testing:

1. **Start with skepticism**: Assume there's no effect (the "null hypothesis")
2. **Ask**: How surprising is our data if the null were true?
3. **Decide**: If very surprising, maybe the null is wrong
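
These three steps can be acted out with a simulation. The sketch below is one possible way to do it, not the only one: it assumes the coin is fair, simulates many 100-flip experiments (the seed and the number of experiments are arbitrary), and counts how often the result is at least as far from 50 heads as the observed 60.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

n_flips, observed_heads = 100, 60
n_experiments = 100_000

# Step 1: assume the null hypothesis (fair coin) and simulate many experiments
heads = rng.binomial(n=n_flips, p=0.5, size=n_experiments)

# Step 2: how often is a fair-coin result at least as far from 50 as ours,
# in either direction (60 or more heads, or 40 or fewer)?
observed_gap = abs(observed_heads - n_flips / 2)
as_extreme = np.abs(heads - n_flips / 2) >= observed_gap
p_simulated = as_extreme.mean()

# Step 3: decide how surprising that is
print(f"Share of fair-coin experiments this extreme: {p_simulated:.3f}")
```

The share comes out at roughly 6%, so 60 heads is somewhat unusual for a fair coin but far from impossible.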

## The Null Hypothesis

The **null hypothesis (H₀)** represents the "boring" explanation:
- "There is no difference between groups"
- "There is no relationship between variables"
- "The true mean equals some specific value"

The **alternative hypothesis (H₁ or Hₐ)** is what we're testing for:
- "There IS a difference"
- "There IS a relationship"

## The P-Value

The **p-value** is the probability of seeing data as extreme as ours (or more extreme) **if the null hypothesis were true**.

📝 **Important**: The p-value is NOT:
- ❌ The probability that the null hypothesis is true
- ❌ The probability that your result is a fluke
- ❌ The probability that you made an error

The p-value IS:
- ✅ The probability of data at least as extreme as ours, calculated assuming the null hypothesis is true

Let's see this with the coin example:

In [None]:
# Coin flip example
from scipy import stats

# We got 60 heads out of 100 flips
n_flips = 100
n_heads = 60

# If the coin is fair (p=0.5), how likely is a result at least this far from 50?
# One-tailed: 60 or more heads. Two-tailed: also count 40 or fewer heads.

# Calculate p-value (two-tailed); doubling one tail works because the
# fair-coin distribution is symmetric around 50
p_value = 2 * (1 - stats.binom.cdf(n_heads - 1, n_flips, 0.5))

print(f"Observed: {n_heads} heads out of {n_flips} flips")
print(f"Expected if fair: {n_flips * 0.5} heads")
print(f"\nP-value: {p_value:.4f}")
print(f"\nInterpretation: If the coin were fair, there's about a {p_value*100:.1f}% chance")
print(f"of seeing results this extreme (60+ or 40- heads).")

In [None]:
# Visualize
x = np.arange(0, 101)
prob = stats.binom.pmf(x, n_flips, 0.5)

fig, ax = plt.subplots(figsize=(12, 5))

# Plot all probabilities
ax.bar(x, prob, color='lightblue', edgecolor='white')

# Highlight extreme values
ax.bar(x[x >= 60], prob[x >= 60], color='red', edgecolor='white', label='60+ heads')
ax.bar(x[x <= 40], prob[x <= 40], color='red', edgecolor='white', label='40- heads')

ax.axvline(60, color='darkred', linestyle='--', label='Our result (60)')
ax.axvline(50, color='green', linestyle='--', label='Expected if fair (50)')

ax.set_xlabel('Number of Heads', fontsize=12)
ax.set_ylabel('Probability', fontsize=12)
ax.set_title('Distribution of Heads in 100 Flips (if coin is fair)\nRed regions = "extreme" results', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()

## Statistical Significance

By convention, we often use **p < 0.05** as a threshold for "statistical significance."

- p < 0.05 → "Statistically significant" → We reject the null hypothesis
- p ≥ 0.05 → "Not statistically significant" → We fail to reject the null

⚠️ **Warning**: This threshold is arbitrary! There's nothing magical about 0.05. A p-value of 0.049 is not meaningfully different from 0.051. Always consider the context and effect size.

## Example: Testing a Mean

Let's test whether the average political engagement in our sample differs from a hypothetical population value of 5.0.

In [None]:
# One-sample t-test
engagement = survey['political_engagement']
hypothesized_mean = 5.0

# Perform the test
t_stat, p_value = stats.ttest_1samp(engagement, hypothesized_mean)

print(f"Sample mean: {engagement.mean():.3f}")
print(f"Hypothesized population mean: {hypothesized_mean}")
print(f"\nT-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print(f"\nConclusion: The difference is statistically significant (p < 0.05).")
    print(f"We reject the null hypothesis that the true mean equals {hypothesized_mean}.")
else:
    print(f"\nConclusion: The difference is not statistically significant (p ≥ 0.05).")
    print(f"We cannot reject the null hypothesis.")

## Effect Size: What P-Values Don't Tell You

A p-value tells you how surprising your data would be if there were no effect at all. It does NOT tell you whether the effect is **large enough to matter**.

With a large enough sample, even tiny effects become "statistically significant."

Always ask:
1. Is it statistically significant? (p-value)
2. Is it practically significant? (effect size)

In [None]:
# Demonstrate how sample size affects significance
# True effect is tiny: population mean is 5.1, not 5.0

true_mean = 5.1
hypothesized = 5.0
population_sd = 1.5

print("Testing whether mean differs from 5.0 when true mean is 5.1")
print("(A difference of just 0.1 points!)")
print("-" * 50)

for n in [50, 200, 1000, 10000]:
    sample = np.random.normal(true_mean, population_sd, n)
    t_stat, p_val = stats.ttest_1samp(sample, hypothesized)
    print(f"n = {n:5}: sample mean = {sample.mean():.3f}, p = {p_val:.4f} {'*' if p_val < 0.05 else ''}")

print("\n* = statistically significant at p < 0.05")
print("\nNotice: The same tiny effect becomes 'significant' with enough data!")

### Cohen's d: A Common Effect Size Measure

**Cohen's d** expresses the difference in terms of standard deviations:

$$d = \frac{\bar{x} - \mu_0}{\sigma}$$

Rough guidelines:
- d ≈ 0.2: small effect
- d ≈ 0.5: medium effect
- d ≈ 0.8: large effect

In [None]:
# Calculate Cohen's d for our engagement test
engagement = survey['political_engagement']
hypothesized_mean = 5.0

cohens_d = (engagement.mean() - hypothesized_mean) / engagement.std()

print(f"Sample mean: {engagement.mean():.3f}")
print(f"Hypothesized mean: {hypothesized_mean}")
print(f"Sample SD: {engagement.std():.3f}")
print(f"\nCohen's d: {cohens_d:.3f}")

# Interpret
if abs(cohens_d) < 0.2:
    size = "negligible"
elif abs(cohens_d) < 0.5:
    size = "small"
elif abs(cohens_d) < 0.8:
    size = "medium"
else:
    size = "large"
    
print(f"Effect size interpretation: {size}")

## 🥊 Challenge 5: Interpret Results

A researcher conducts two studies:

**Study A**: n = 30, difference from expected = 0.8 points, p = 0.12  
**Study B**: n = 5000, difference from expected = 0.08 points, p = 0.001

Questions:
1. Which study has a statistically significant result?
2. Which study found a more practically meaningful effect?
3. What explains this apparent paradox?

*Your answer here:*



---

<a id='part6'></a>

# Part 6: Relationships Between Variables

Most research is about relationships: Does X predict Y? Are X and Y related?

## Correlation

**Correlation** measures the strength and direction of a linear relationship between two variables.

📝 **Notation**: Pearson's correlation coefficient:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

This looks complicated! The intuition:
- When X is above its mean, is Y also above its mean? (positive correlation)
- When X is above its mean, is Y below its mean? (negative correlation)
- No pattern? (no correlation)

The result is always between -1 and +1:
- r = +1: perfect positive correlation
- r = 0: no linear correlation
- r = -1: perfect negative correlation
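
To make the formula less intimidating, here is a minimal sketch that computes r term by term and checks it against NumPy. The data (`hours`, `scores`) are hypothetical toy values made up for illustration, not part of the survey.

```python
import numpy as np

# Hypothetical toy data (an assumption for illustration only)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

# Deviations of each value from its own mean
dev_x = hours - hours.mean()
dev_y = scores - scores.mean()

# Numerator: sum of the products of paired deviations
numerator = np.sum(dev_x * dev_y)

# Denominator: square root of (sum of squared x deviations * sum of squared y deviations)
denominator = np.sqrt(np.sum(dev_x ** 2) * np.sum(dev_y ** 2))

r_manual = numerator / denominator
r_numpy = np.corrcoef(hours, scores)[0, 1]

print(f"r (from the formula): {r_manual:.4f}")
print(f"r (np.corrcoef):      {r_numpy:.4f}")
```

The two values should agree up to floating-point rounding, since `np.corrcoef` computes the same quantity.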

In [None]:
# Visualize different correlations
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Generate example data
n = 100
x = np.random.normal(0, 1, n)

correlations = [0.9, 0.5, 0, -0.7]
titles = ['Strong positive (r ≈ 0.9)', 'Moderate positive (r ≈ 0.5)',
          'No correlation (r ≈ 0)', 'Strong negative (r ≈ -0.7)']

for ax, r, title in zip(axes, correlations, titles):
    # Generate y with desired correlation
    noise = np.random.normal(0, 1, n)
    y = r * x + np.sqrt(1 - r**2) * noise
    
    ax.scatter(x, y, alpha=0.6)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    actual_r = np.corrcoef(x, y)[0, 1]
    ax.set_title(f'{title}\n(actual r = {actual_r:.2f})')

plt.tight_layout()
plt.show()

In [None]:
# Calculate correlations in our survey data
variables = ['age', 'education_years', 'income', 'social_trust', 'political_engagement']
correlation_matrix = survey[variables].corr()

# Display as heatmap
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(correlation_matrix, cmap='RdBu_r', vmin=-1, vmax=1)

# Add labels
ax.set_xticks(range(len(variables)))
ax.set_yticks(range(len(variables)))
ax.set_xticklabels(variables, rotation=45, ha='right')
ax.set_yticklabels(variables)

# Add correlation values
for i in range(len(variables)):
    for j in range(len(variables)):
        text = ax.text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}',
                       ha='center', va='center', color='black', fontsize=10)

ax.set_title('Correlation Matrix', fontsize=14)
plt.colorbar(im, ax=ax, label='Correlation')
plt.tight_layout()
plt.show()

🔔 **Question**: Looking at the correlation matrix, which pairs of variables are most strongly related? Does the direction (positive/negative) make intuitive sense?

### r²: Variance Explained

If you square the correlation, you get **r²** ("r-squared"), which tells you the proportion of variance in Y that's explained by X.

For example, if r = 0.6, then r² = 0.36, meaning 36% of the variance in Y can be explained by its linear relationship with X.
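
One way to check the "variance explained" reading is to fit a straight line and compare the leftover variance with the total. The sketch below uses simulated data (the names `x_demo`, `y_demo` and the chosen slope and noise level are assumptions for illustration) and its own random generator, so it does not disturb the notebook's seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves np.random's global state alone

# Simulated data: y depends linearly on x, plus noise (illustrative values)
x_demo = rng.normal(0, 1, 200)
y_demo = 2.0 * x_demo + rng.normal(0, 1.5, 200)

r_demo = np.corrcoef(x_demo, y_demo)[0, 1]

# Fit a least-squares line and compute its predictions
slope, intercept = np.polyfit(x_demo, y_demo, 1)
y_pred = slope * x_demo + intercept

# Proportion of variance explained = 1 - (unexplained variation / total variation)
ss_residual = np.sum((y_demo - y_pred) ** 2)
ss_total = np.sum((y_demo - y_demo.mean()) ** 2)
variance_explained = 1 - ss_residual / ss_total

print(f"r squared:          {r_demo ** 2:.4f}")
print(f"Variance explained: {variance_explained:.4f}")
```

For a straight-line fit with an intercept, the two numbers match, which is why r² can be read as a proportion of variance.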

In [None]:
# Education and income relationship
r = survey['education_years'].corr(survey['income'])
r_squared = r ** 2

print(f"Correlation between education and income: r = {r:.3f}")
print(f"R-squared: {r_squared:.3f}")
print(f"\nInterpretation: {r_squared*100:.1f}% of the variance in income")
print(f"can be explained by education years.")

## ⚠️ Correlation Does Not Imply Causation

This is perhaps the most important warning in statistics. If X and Y are correlated, there are several possibilities:

1. **X causes Y** (education → income)
2. **Y causes X** (reverse causation)
3. **Z causes both X and Y** (confounding variable)
4. **Pure coincidence** (spurious correlation)

Correlation alone cannot distinguish between these!
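
Possibility 3 is easy to simulate. In the sketch below, a hypothetical confounder `z_conf` drives both `x_sim` and `y_sim`, which never influence each other, yet they end up clearly correlated. All names and numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator; leaves the notebook's seed untouched
n_sim = 5000

# Z is a hypothetical confounder; X and Y each depend on Z but not on each other
z_conf = rng.normal(0, 1, n_sim)
x_sim = z_conf + rng.normal(0, 1, n_sim)
y_sim = z_conf + rng.normal(0, 1, n_sim)

r_xy = np.corrcoef(x_sim, y_sim)[0, 1]

# Remove the part of X and Y that Z predicts linearly, then correlate what is left
x_resid = x_sim - np.polyval(np.polyfit(z_conf, x_sim, 1), z_conf)
y_resid = y_sim - np.polyval(np.polyfit(z_conf, y_sim, 1), z_conf)
r_xy_given_z = np.corrcoef(x_resid, y_resid)[0, 1]

print(f"Correlation of X and Y:                 {r_xy:.3f}")
print(f"Correlation after accounting for Z:     {r_xy_given_z:.3f}")
```

The raw correlation is sizable, while the correlation after removing Z's influence is close to zero. In real data you rarely know (or measure) every confounder, which is why correlation alone cannot settle causal questions.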

In [None]:
# Scatter plot with regression line
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(survey['education_years'], survey['income'], alpha=0.5)

# Add regression line
z = np.polyfit(survey['education_years'], survey['income'], 1)
p = np.poly1d(z)
x_line = np.linspace(survey['education_years'].min(), survey['education_years'].max(), 100)
ax.plot(x_line, p(x_line), 'r-', linewidth=2, label=f'r = {r:.2f}')

ax.set_xlabel('Education (years)', fontsize=12)
ax.set_ylabel('Income ($)', fontsize=12)
ax.set_title('Education vs. Income', fontsize=14)
ax.legend()

plt.tight_layout()
plt.show()

## 🥊 Challenge 6: Explore Relationships

1. Find the correlation between `social_trust` and `political_engagement`
2. Is the correlation statistically significant?
3. Create a scatter plot of the relationship
4. Write one sentence interpreting what you found

In [None]:
# YOUR CODE HERE

# 1. Calculate correlation

# 2. Test significance (hint: use stats.pearsonr)

# 3. Create scatter plot


---

<a id='part7'></a>

# Part 7: Bridge to ML/NLP (Vectors and Matrices)

Everything we've covered so far is foundational for statistics. Now let's add one more perspective that's essential for machine learning and NLP: **thinking about data geometrically**.

## The Key Insight

> **Every row in your dataset is a point in space.**

This might sound abstract, but it's the foundation of modern ML and NLP.

## Vectors: Data as Points

A **vector** is just an ordered list of numbers. In data science:

- Each person/document/observation is a vector
- Each number in the vector represents a feature/measurement

📝 **Notation**:

$$\vec{v} = [v_1, v_2, ..., v_n]$$

or in column form:

$$\vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

The number of elements is the **dimensionality** of the vector.

In [None]:
# Each row in our survey is a vector
# Let's look at the first person represented as a vector

person_0 = survey[['age', 'education_years', 'income', 'social_trust', 'political_engagement']].iloc[0].values

print("First respondent as a vector:")
print(person_0)
print(f"\nThis is a {len(person_0)}-dimensional vector.")
print(f"\nEach dimension represents:")
for i, col in enumerate(['age', 'education_years', 'income', 'social_trust', 'political_engagement']):
    print(f"  Dimension {i+1}: {col} = {person_0[i]}")

In [None]:
# In 2D, we can visualize vectors as points
# Let's use just age and income

fig, ax = plt.subplots(figsize=(10, 8))

ax.scatter(survey['age'], survey['income'], alpha=0.6)

# Highlight a few specific people as vectors
for i in [0, 10, 50]:
    ax.scatter(survey.iloc[i]['age'], survey.iloc[i]['income'], 
               s=100, edgecolor='red', facecolor='none', linewidth=2)
    ax.annotate(f'Person {i}\n({survey.iloc[i]["age"]:.0f}, ${survey.iloc[i]["income"]:,.0f})', 
                (survey.iloc[i]['age'], survey.iloc[i]['income']),
                xytext=(10, 10), textcoords='offset points')

ax.set_xlabel('Age (dimension 1)', fontsize=12)
ax.set_ylabel('Income (dimension 2)', fontsize=12)
ax.set_title('Each Person is a Point (Vector) in Space', fontsize=14)

plt.tight_layout()
plt.show()

print("Each dot is a person represented by a 2D vector: [age, income]")
print("With 5 features, each person is a point in 5-dimensional space!")

## Matrices: Collections of Vectors

A **matrix** is a 2D array of numbers. In data science:

- Each row is one observation (a vector)
- Each column is one feature
- The whole dataset is a matrix

📝 **Notation**:

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix}$$

Where:
- $n$ = number of rows (observations)
- $p$ = number of columns (features)
- $x_{i,j}$ = value in row $i$, column $j$

We say $X$ has shape $(n \times p)$ or "n by p."

In [None]:
# Our survey as a matrix
X = survey[['age', 'education_years', 'income', 'social_trust', 'political_engagement']].values

print(f"Matrix shape: {X.shape}")
print(f"This means: {X.shape[0]} observations × {X.shape[1]} features")
print(f"\nFirst 5 rows of the matrix:")
print(X[:5])

## Why This Matters for NLP

In NLP, we represent text as vectors:

- **Bag of Words**: A document is a vector of word counts
- **TF-IDF**: A document is a vector of weighted word frequencies
- **Word Embeddings**: A word is a vector of learned features

A corpus (collection of documents) becomes a matrix where:
- Each row is a document
- Each column is a word (or feature)


In [None]:
# Simple example: documents as vectors
# Vocabulary: ['cat', 'dog', 'fish', 'pet', 'animal']

# Three "documents" represented as word count vectors
doc1 = np.array([2, 0, 0, 1, 1])  # "The cat is a pet animal. I love my cat."
doc2 = np.array([0, 3, 0, 1, 1])  # "My dog is the best pet. Dogs are great animals. Dog!"
doc3 = np.array([1, 1, 2, 1, 0])  # "I have a cat, a dog, and a fish. The fish is my favorite pet."

# Stack into a document-term matrix
doc_term_matrix = np.vstack([doc1, doc2, doc3])

print("Document-Term Matrix:")
print("Columns: ['cat', 'dog', 'fish', 'pet', 'animal']")
print(doc_term_matrix)
print(f"\nShape: {doc_term_matrix.shape} (3 documents × 5 words)")
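
The count vectors above were typed in by hand. As a rough sketch of how such a vector could be derived from raw text, the helper below (`count_vector`, a hypothetical name) lowercases the text, splits it into words, and counts vocabulary matches. The plural handling is a deliberately naive assumption for this toy example; real pipelines typically rely on a library such as scikit-learn's `CountVectorizer` plus proper stemming or lemmatization.

```python
import re
from collections import Counter

import numpy as np

vocab = ['cat', 'dog', 'fish', 'pet', 'animal']

def count_vector(text, vocabulary):
    """Return word counts for `text`, in the order given by `vocabulary`."""
    # Lowercase and keep only runs of letters as tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # Naive plural handling (toy assumption): "dogs" -> "dog" if "dog" is in the vocabulary
    tokens = [t[:-1] if t.endswith('s') and t[:-1] in vocabulary else t for t in tokens]
    counts = Counter(tokens)
    # Counter returns 0 for words that never appeared
    return np.array([counts[word] for word in vocabulary])

example_text = "My dog is the best pet. Dogs are great animals. Dog!"
example_vector = count_vector(example_text, vocab)
print(example_vector)  # [0 3 0 1 1], the same as doc2 above
```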

## Measuring Similarity: The Dot Product

Once we have vectors, we can measure how similar they are. The **dot product** is the foundation of similarity measurement.

📝 **Notation**:

$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i \times b_i = a_1 b_1 + a_2 b_2 + ... + a_n b_n$$

In plain English: **Multiply corresponding elements and add them up.**

In [None]:
# Calculate dot product step by step
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Step by step
print(f"Vector a: {a}")
print(f"Vector b: {b}")
print(f"\nStep by step:")
print(f"  {a[0]} × {b[0]} = {a[0] * b[0]}")
print(f"  {a[1]} × {b[1]} = {a[1] * b[1]}")
print(f"  {a[2]} × {b[2]} = {a[2] * b[2]}")
print(f"  Sum: {a[0]*b[0]} + {a[1]*b[1]} + {a[2]*b[2]} = {np.dot(a, b)}")

# Using numpy
print(f"\nNumPy dot product: {np.dot(a, b)}")

## Cosine Similarity

For comparing documents (or any vectors), we often use **cosine similarity**. It measures the angle between two vectors, ignoring their length.

📝 **Notation**:

$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \times ||\vec{b}||}$$

Where $||\vec{a}||$ is the length (magnitude) of vector $a$:

$$||\vec{a}|| = \sqrt{\sum_{i=1}^{n} a_i^2}$$

Cosine similarity ranges from -1 to 1:
- 1 = identical direction (very similar)
- 0 = perpendicular (no similarity)
- -1 = opposite direction

**Why cosine similarity for documents?** It ignores document length. A long document about cats and a short document about cats will have high cosine similarity, even though their raw word counts differ.
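
A quick sketch of that claim: take a word count vector, make a "longer document" with the same word proportions by multiplying every count by 10, and compare the two. The vectors `short_doc` and `long_doc` are hypothetical examples.

```python
import numpy as np

short_doc = np.array([2, 0, 0, 1, 1])  # hypothetical short document about cats
long_doc = 10 * short_doc              # same word proportions, ten times longer

# Cosine similarity: dot product divided by the product of the vector lengths
cos_sim = np.dot(short_doc, long_doc) / (
    np.linalg.norm(short_doc) * np.linalg.norm(long_doc)
)

# Euclidean distance, for contrast, is sensitive to document length
euclid_dist = np.linalg.norm(short_doc - long_doc)

print(f"Cosine similarity:  {cos_sim:.3f}")
print(f"Euclidean distance: {euclid_dist:.3f}")
```

The cosine similarity is 1 (up to rounding) because both vectors point in the same direction, while the Euclidean distance is large simply because one document is longer.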

In [None]:
def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    dot_product = np.dot(a, b)
    magnitude_a = np.sqrt(np.sum(a ** 2))
    magnitude_b = np.sqrt(np.sum(b ** 2))
    return dot_product / (magnitude_a * magnitude_b)

# Compare our documents
print("Document similarity (cosine):")
print(f"Doc1 vs Doc2: {cosine_similarity(doc1, doc2):.3f}")
print(f"Doc1 vs Doc3: {cosine_similarity(doc1, doc3):.3f}")
print(f"Doc2 vs Doc3: {cosine_similarity(doc2, doc3):.3f}")

print("\nInterpretation:")
print("Doc1 (cat-focused) and Doc2 (dog-focused) are least similar.")
print("Doc3 (mixed) is somewhat similar to both.")

In [None]:
# Visualize in 2D (using just 'cat' and 'dog' dimensions)
fig, ax = plt.subplots(figsize=(8, 8))

# Plot vectors as arrows from origin
colors = ['blue', 'red', 'green']
labels = ['Doc1 (cat)', 'Doc2 (dog)', 'Doc3 (mixed)']

for i, (doc, color, label) in enumerate(zip([doc1, doc2, doc3], colors, labels)):
    # Just use first two dimensions (cat, dog)
    ax.arrow(0, 0, doc[0], doc[1], head_width=0.1, head_length=0.1, 
             fc=color, ec=color, linewidth=2)
    ax.annotate(label, (doc[0], doc[1]), fontsize=12, 
                xytext=(5, 5), textcoords='offset points')

ax.set_xlim(-0.5, 3.5)
ax.set_ylim(-0.5, 3.5)
ax.set_xlabel('"cat" count', fontsize=12)
ax.set_ylabel('"dog" count', fontsize=12)
ax.set_title('Documents as Vectors\n(showing only cat/dog dimensions)', fontsize=14)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

plt.tight_layout()
plt.show()

print("Cosine similarity measures the ANGLE between vectors.")
print("Doc1 and Doc2 point in very different directions → low similarity.")

## Matrix Multiplication: Batch Operations

In ML, we often need to perform many dot products at once. This is what **matrix multiplication** does.

📝 **Notation**:

If $A$ is $(m \times n)$ and $B$ is $(n \times p)$, then $C = AB$ is $(m \times p)$.

Each element $c_{i,j}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.

**In practice, you don't need to do this by hand**: NumPy handles it. But understanding that it's "many dot products at once" helps you understand why linear algebra is so central to ML.
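
To see the "many dot products at once" idea concretely, the sketch below fills in each element of the product with an explicit dot product and compares the result with NumPy's `@` operator. The small matrices `A` and `B` are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])      # shape (3, 2)

# Build C one element at a time: row i of A dotted with column j of B
C_loop = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        C_loop[i, j] = np.dot(A[i, :], B[:, j])

# The same result in one step
C_matmul = A @ B

print("Loop of dot products:")
print(C_loop)
print("\nA @ B:")
print(C_matmul)
print(f"\nResult shape: {C_matmul.shape}")  # (2, 2), i.e. (m x p)
```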

In [None]:
# Matrix multiplication example: computing all pairwise similarities at once
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine

# All pairwise cosine similarities
similarity_matrix = sklearn_cosine(doc_term_matrix)

print("Document Similarity Matrix:")
print("             Doc1    Doc2    Doc3")
for i, row in enumerate(similarity_matrix):
    print(f"Doc{i+1}        {row[0]:.3f}   {row[1]:.3f}   {row[2]:.3f}")
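
The scikit-learn call hides the matrix multiplication, so here is a sketch of one way to get the same similarity matrix with plain NumPy: scale every row to length 1, then multiply the matrix by its own transpose. The array `docs` repeats the three word count vectors from earlier so that the example runs on its own.

```python
import numpy as np

# The three document vectors from the earlier example, repeated for self-containment
docs = np.array([[2, 0, 0, 1, 1],
                 [0, 3, 0, 1, 1],
                 [1, 1, 2, 1, 0]], dtype=float)

# Length (magnitude) of each row, kept as a column so it divides row by row
row_lengths = np.linalg.norm(docs, axis=1, keepdims=True)
docs_unit = docs / row_lengths

# Each entry (i, j) is the dot product of unit-length document i with document j,
# which is exactly their cosine similarity
sim_manual = docs_unit @ docs_unit.T

print(np.round(sim_manual, 3))
```

The diagonal is 1 because every document is identical in direction to itself, and the matrix is symmetric because the similarity of Doc1 to Doc2 equals that of Doc2 to Doc1.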

## 🥊 Challenge 7: Compute Document Similarity

Here are three new documents represented as word count vectors:

Vocabulary: `['data', 'science', 'machine', 'learning', 'statistics']`

- Doc A: "Data science uses statistics." → `[1, 1, 0, 0, 1]`
- Doc B: "Machine learning is great." → `[0, 0, 1, 1, 0]`  
- Doc C: "Data science and machine learning." → `[1, 1, 1, 1, 0]`

1. Calculate the cosine similarity between each pair of documents by hand
2. Which two documents are most similar?
3. Verify with NumPy

In [None]:
# YOUR CODE HERE

doc_a = np.array([1, 1, 0, 0, 1])
doc_b = np.array([0, 0, 1, 1, 0])
doc_c = np.array([1, 1, 1, 1, 0])

# Calculate similarities


---

# Summary and Next Steps

## What We Covered

In this workshop, you learned:

1. **Describing data**: Mean, variance, and standard deviation, summarizing what your data looks like

2. **Distributions**: The normal distribution, z-scores, and why standardization matters

3. **Sample to population**: The sampling distribution, Central Limit Theorem, and standard error (understanding uncertainty)

4. **Confidence intervals**: Quantifying how precise our estimates are

5. **Hypothesis testing**: P-values, statistical significance, and effect sizes (evaluating evidence)

6. **Correlation**: Measuring relationships between variables (and its limitations)

7. **Vectors and matrices**: Thinking geometrically about data, dot products, and cosine similarity

## Key Takeaways

- **Variability is the game**: Statistics is fundamentally about understanding and quantifying uncertainty
- **Sample → Population is the key inference**: We almost never see the whole population
- **p < 0.05 is not magic**: Always consider effect sizes alongside significance
- **Correlation ≠ Causation**: One of the most important lessons in data analysis
- **Data is geometry**: Every row is a point in space, and this perspective unlocks ML and NLP

## Where This Leads

| This Workshop | ML Workshop | NLP Workshop |
|---------------|-------------|---------------|
| Mean, variance, standardization | → Preprocessing, feature scaling | |
| Distributions, probability | → Classification, probability outputs | |
| Correlation, relationships | → Feature selection, model coefficients | |
| Vectors, matrices | → | → Bag of words, TF-IDF, embeddings |
| Cosine similarity | → | → Document similarity, word vectors |

## Recommended Next Steps

1. **D-Lab's Python Machine Learning** workshop: covers regression, classification, and preprocessing
2. **D-Lab's Python Natural Language Processing** workshop: covers text preprocessing, bag of words, and word embeddings

## Resources

- [Seeing Theory](https://seeing-theory.brown.edu/): Beautiful visualizations of statistical concepts
- [StatQuest with Josh Starmer](https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw): Excellent YouTube explanations
- [3Blue1Brown Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab): Visual introduction to vectors and matrices

---

## 🥊 Take-Home Challenge: Full Analysis

Using the survey data, conduct a mini-analysis:

1. Choose two variables you think might be related
2. Calculate descriptive statistics for both
3. Test whether they're correlated (and if it's statistically significant)
4. Calculate and interpret the effect size
5. Create a visualization
6. Write 2-3 sentences summarizing your findings

Remember: Correlation doesn't mean causation!

In [None]:
# YOUR CODE HERE
