# Statistics and Probability for Data Science

Understanding statistics and probability is fundamental for data science, machine learning, and NLP. This notebook covers essential statistical concepts that you'll encounter in data analysis and modeling.

## Why Statistics Matter:
- **Data Analysis**: Describe and summarize datasets
- **Machine Learning**: Understand model performance and uncertainty
- **NLP**: Analyze text patterns, word frequencies, language models
- **Hypothesis Testing**: Make data-driven decisions
- **Feature Engineering**: Create meaningful variables from raw data

## Topics Covered:
- Descriptive statistics
- Probability distributions
- Central Limit Theorem
- Correlation and causation
- Hypothesis testing
- Confidence intervals
- Practical applications with Python

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, binom, poisson, chi2_contingency
import warnings

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Descriptive Statistics

In [None]:
# Generate sample dataset: student exam scores
n_students = 1000
exam_scores = np.random.normal(75, 12, n_students)  # mean=75, std=12
exam_scores = np.clip(exam_scores, 0, 100)  # Ensure scores are between 0-100

print("📊 Descriptive Statistics for Exam Scores:")
print("=" * 50)

# Measures of Central Tendency
mean_score = np.mean(exam_scores)
median_score = np.median(exam_scores)
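# Mode of the rounded scores (np.unique avoids SciPy-version differences in stats.mode output)
rounded_values, rounded_counts = np.unique(exam_scores.round(), return_counts=True)
mode_score = rounded_values[np.argmax(rounded_counts)]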

print(f"Mean (average): {mean_score:.2f}")
print(f"Median (middle value): {median_score:.2f}")
print(f"Mode (most frequent): {mode_score}")
print()

# Measures of Spread/Variability
std_score = np.std(exam_scores, ddof=1)  # Sample standard deviation
var_score = np.var(exam_scores, ddof=1)  # Sample variance
range_score = np.max(exam_scores) - np.min(exam_scores)
iqr_score = np.percentile(exam_scores, 75) - np.percentile(exam_scores, 25)

print(f"Standard Deviation: {std_score:.2f}")
print(f"Variance: {var_score:.2f}")
print(f"Range: {range_score:.2f}")
print(f"Interquartile Range (IQR): {iqr_score:.2f}")
print()

# Percentiles
percentiles = [10, 25, 50, 75, 90, 95, 99]
print("Percentiles:")
for p in percentiles:
    value = np.percentile(exam_scores, p)
    print(f"  {p:2d}th percentile: {value:.2f}")

# Skewness and excess kurtosis (scipy's default is Fisher's definition, where a normal distribution scores 0)
skewness = stats.skew(exam_scores)
kurtosis = stats.kurtosis(exam_scores)

print(f"\nSkewness: {skewness:.3f}")
if skewness > 0.5:
    print("  → Right-skewed (tail extends to the right)")
elif skewness < -0.5:
    print("  → Left-skewed (tail extends to the left)")
else:
    print("  → Approximately symmetric")

print(f"Excess kurtosis: {kurtosis:.3f}")
if kurtosis > 0:
    print("  → Heavy-tailed (more extreme values than normal distribution)")
else:
    print("  → Light-tailed (fewer extreme values than normal distribution)")
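
The `ddof=1` argument used above switches NumPy from the population formula (divide by n) to the sample formula (divide by n - 1). The sketch below works through both by hand on a small, made-up array of scores (the values are illustrative assumptions, not taken from the dataset above) and checks them against `np.var`.

```python
import numpy as np

# Hypothetical scores, chosen only to keep the arithmetic easy to follow
scores = np.array([62.0, 70.0, 75.0, 81.0, 92.0])

n_scores = len(scores)
squared_deviations = (scores - scores.mean()) ** 2

# Population variance divides by n; sample variance divides by n - 1
population_var = squared_deviations.sum() / n_scores
sample_var = squared_deviations.sum() / (n_scores - 1)

# NumPy exposes the same choice through the ddof argument
print(f"Population variance: {population_var:.2f} (np.var ddof=0: {np.var(scores, ddof=0):.2f})")
print(f"Sample variance:     {sample_var:.2f} (np.var ddof=1: {np.var(scores, ddof=1):.2f})")
```

The sample version is always the larger of the two, and the gap shrinks as n grows.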

In [None]:
# Visualize the distribution
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Statistical Analysis of Exam Scores', fontsize=16)

# Histogram
axes[0, 0].hist(exam_scores, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].axvline(mean_score, color='red', linestyle='--', label=f'Mean: {mean_score:.1f}')
axes[0, 0].axvline(median_score, color='green', linestyle='--', label=f'Median: {median_score:.1f}')
axes[0, 0].set_title('Distribution of Exam Scores')
axes[0, 0].set_xlabel('Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()

# Box plot
axes[0, 1].boxplot(exam_scores, vert=True)
axes[0, 1].set_title('Box Plot of Exam Scores')
axes[0, 1].set_ylabel('Score')
axes[0, 1].grid(True, alpha=0.3)

# Q-Q plot (Quantile-Quantile plot)
stats.probplot(exam_scores, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot (Normal Distribution)')
axes[1, 0].grid(True, alpha=0.3)

# Cumulative Distribution Function
sorted_scores = np.sort(exam_scores)
cumulative_prob = np.arange(1, len(sorted_scores) + 1) / len(sorted_scores)
axes[1, 1].plot(sorted_scores, cumulative_prob, linewidth=2)
axes[1, 1].set_title('Cumulative Distribution Function')
axes[1, 1].set_xlabel('Score')
axes[1, 1].set_ylabel('Cumulative Probability')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Identify outliers using IQR method
Q1 = np.percentile(exam_scores, 25)
Q3 = np.percentile(exam_scores, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = exam_scores[(exam_scores < lower_bound) | (exam_scores > upper_bound)]
print(f"\n🔍 Outlier Analysis:")
print(f"Number of outliers: {len(outliers)}")
print(f"Percentage of outliers: {len(outliers)/len(exam_scores)*100:.2f}%")
if len(outliers) > 0:
    print(f"Outlier range: {outliers.min():.2f} to {outliers.max():.2f}")
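
The IQR rule above can be wrapped in a small reusable helper. This is a minimal sketch: the function name `iqr_outlier_mask`, its `k` parameter, and the toy data are assumptions introduced for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy data with one obviously extreme value
toy_data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
toy_mask = iqr_outlier_mask(toy_data)
print("Flagged values:", toy_data[toy_mask])
```

A larger `k` (for example 3.0) flags only more extreme points; 1.5 is the usual box-plot convention.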

## Probability Distributions

In [None]:
# Common probability distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Common Probability Distributions', fontsize=16)

# 1. Normal Distribution
x_norm = np.linspace(-4, 4, 100)
y_norm = norm.pdf(x_norm, 0, 1)
axes[0, 0].plot(x_norm, y_norm, 'b-', linewidth=2, label='μ=0, σ=1')
axes[0, 0].fill_between(x_norm, y_norm, alpha=0.3)
axes[0, 0].set_title('Normal Distribution')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Probability Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Binomial Distribution
n, p = 20, 0.3
x_binom = np.arange(0, n+1)
y_binom = binom.pmf(x_binom, n, p)
axes[0, 1].bar(x_binom, y_binom, alpha=0.7, color='orange')
axes[0, 1].set_title(f'Binomial Distribution (n={n}, p={p})')
axes[0, 1].set_xlabel('Number of Successes')
axes[0, 1].set_ylabel('Probability')
axes[0, 1].grid(True, alpha=0.3)

# 3. Poisson Distribution
lam = 3
x_poisson = np.arange(0, 15)
y_poisson = poisson.pmf(x_poisson, lam)
axes[0, 2].bar(x_poisson, y_poisson, alpha=0.7, color='green')
axes[0, 2].set_title(f'Poisson Distribution (λ={lam})')
axes[0, 2].set_xlabel('Number of Events')
axes[0, 2].set_ylabel('Probability')
axes[0, 2].grid(True, alpha=0.3)

# 4. Exponential Distribution
x_exp = np.linspace(0, 5, 100)
y_exp = stats.expon.pdf(x_exp, scale=1)
axes[1, 0].plot(x_exp, y_exp, 'r-', linewidth=2, label='λ=1')
axes[1, 0].fill_between(x_exp, y_exp, alpha=0.3, color='red')
axes[1, 0].set_title('Exponential Distribution')
axes[1, 0].set_xlabel('Value')
axes[1, 0].set_ylabel('Probability Density')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Uniform Distribution
x_uniform = np.linspace(-2, 2, 100)
y_uniform = stats.uniform.pdf(x_uniform, loc=-1, scale=2)
axes[1, 1].plot(x_uniform, y_uniform, 'purple', linewidth=2, label='a=-1, b=1')
axes[1, 1].fill_between(x_uniform, y_uniform, alpha=0.3, color='purple')
axes[1, 1].set_title('Uniform Distribution')
axes[1, 1].set_xlabel('Value')
axes[1, 1].set_ylabel('Probability Density')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Chi-square Distribution
x_chi2 = np.linspace(0, 15, 100)
dfs = [1, 3, 5, 9]
colors = ['red', 'blue', 'green', 'orange']
for df, color in zip(dfs, colors):
    y_chi2 = stats.chi2.pdf(x_chi2, df)
    axes[1, 2].plot(x_chi2, y_chi2, color=color, linewidth=2, label=f'df={df}')
axes[1, 2].set_title('Chi-square Distribution')
axes[1, 2].set_xlabel('Value')
axes[1, 2].set_ylabel('Probability Density')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Practical examples of when to use each distribution
print("📊 When to Use Each Distribution:")
print("=" * 50)
distribution_uses = {
    "Normal": "Heights, test scores, measurement errors, many natural phenomena",
    "Binomial": "Number of successes in fixed trials (coin flips, A/B testing)",
    "Poisson": "Counts of independent events in a fixed interval (website visits, defects, accidents)",
    "Exponential": "Time between events, survival analysis, reliability",
    "Uniform": "Random number generation, equal probability outcomes",
    "Chi-square": "Goodness of fit tests, independence testing"
}

for dist, use_case in distribution_uses.items():
    print(f"{dist:12}: {use_case}")
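
Beyond plotting, the same `scipy.stats` objects answer probability questions directly through `cdf` (P(X ≤ k)) and `sf` (P(X > k)). The sketch below reuses the parameters plotted above (standard normal, binomial with n=20 and p=0.3, Poisson with λ=3); the specific questions asked are illustrative choices.

```python
from scipy.stats import norm, binom, poisson

# Normal: probability mass within 1, 2, and 3 standard deviations of the mean
within_k_sigma = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
for k, prob in within_k_sigma.items():
    print(f"P(|Z| <= {k}) = {prob:.4f}")

# Binomial: P(X <= 5) for 20 trials with success probability 0.3
p_binom_le_5 = binom.cdf(5, 20, 0.3)
print(f"Binomial P(X <= 5) = {p_binom_le_5:.4f}")

# Poisson: P(X >= 6) with mean rate 3; sf(5) gives P(X > 5), which equals P(X >= 6)
p_poisson_ge_6 = poisson.sf(5, 3)
print(f"Poisson P(X >= 6) = {p_poisson_ge_6:.4f}")
```

The first three numbers are the familiar 68-95-99.7 rule for the normal distribution.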

## Central Limit Theorem

In [None]:
# Demonstrate Central Limit Theorem
def demonstrate_clt(population_dist='uniform', n_samples=1000, sample_sizes=(1, 5, 10, 30)):
    """
    Demonstrate the Central Limit Theorem with different sample sizes.
    """
    # Generate population based on distribution type
    if population_dist == 'uniform':
        population = np.random.uniform(0, 10, 10000)
        dist_name = 'Uniform (0, 10)'
    elif population_dist == 'exponential':
        population = np.random.exponential(2, 10000)
        dist_name = 'Exponential (λ=0.5)'
    else:  # skewed
        population = np.random.gamma(2, 2, 10000)
        dist_name = 'Gamma (skewed)'
    
    fig, axes = plt.subplots(2, len(sample_sizes), figsize=(16, 8))
    fig.suptitle(f'Central Limit Theorem Demonstration\nPopulation: {dist_name}', fontsize=14)
    
    # The population histogram is drawn in the first column inside the loop below
    
    sample_means = []
    
    for i, sample_size in enumerate(sample_sizes):
        # Generate sample means
        means = []
        for _ in range(n_samples):
            sample = np.random.choice(population, sample_size, replace=False)
            means.append(np.mean(sample))
        
        sample_means.append(means)
        
        # Plot histogram of sample means
        if i == 0:
            axes[0, i].hist(population, bins=50, alpha=0.7, color='lightcoral', density=True)
            axes[0, i].set_title('Original Population')
        else:
            axes[0, i].hist(means, bins=30, alpha=0.7, color='skyblue', density=True)
            axes[0, i].set_title(f'Sample Means (n={sample_size})')
        
        axes[0, i].set_xlabel('Value')
        if i == 0:
            axes[0, i].set_ylabel('Density')
        
        # Q-Q plot to check normality
        if i == 0:
            stats.probplot(population[:1000], dist="norm", plot=axes[1, i])
            axes[1, i].set_title('Q-Q Plot: Population')
        else:
            stats.probplot(means, dist="norm", plot=axes[1, i])
            axes[1, i].set_title(f'Q-Q Plot: n={sample_size}')
        
        axes[1, i].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Statistical analysis
    print(f"\n📊 Central Limit Theorem Analysis:")
    print("=" * 50)
    print(f"Population mean: {np.mean(population):.3f}")
    print(f"Population std: {np.std(population):.3f}")
    print()
    
    for i, sample_size in enumerate(sample_sizes[1:], 1):  # Skip population
        means = sample_means[i]
        theoretical_std = np.std(population) / np.sqrt(sample_size)
        actual_std = np.std(means)
        
        print(f"Sample size n={sample_size}:")
        print(f"  Mean of sample means: {np.mean(means):.3f}")
        print(f"  Std of sample means: {actual_std:.3f}")
        print(f"  Theoretical std (σ/√n): {theoretical_std:.3f}")
        print(f"  Difference: {abs(actual_std - theoretical_std):.3f}")
        
        # Test normality with Shapiro-Wilk test
        shapiro_stat, shapiro_p = stats.shapiro(means[:5000])  # Limit sample size for test
        print(f"  Normality test p-value: {shapiro_p:.4f}")
        print(f"  {'✅ No evidence against normality' if shapiro_p > 0.05 else '❌ Normality rejected at α = 0.05'}")
        print()

# Demonstrate CLT with uniform distribution
demonstrate_clt('uniform')

print("\n🎯 Key Insights from Central Limit Theorem:")
clt_insights = [
    "Sample means approach a normal distribution for any population shape with finite variance",
    "Larger sample sizes bring the distribution of sample means closer to normal",
    "The expected value of the sample mean equals the population mean",
    "Standard error decreases as sample size increases (σ/√n)",
    "CLT is foundation for confidence intervals and hypothesis testing"
]

for insight in clt_insights:
    print(f"• {insight}")
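
The σ/√n relationship can also be checked numerically without any plotting. This is a minimal vectorized sketch, assuming an exponential population with scale 2 (so mean 2 and standard deviation 2); the seed, the number of repeats, and the variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator, so the global seed is untouched
scale = 2.0                      # exponential population: mean = 2, std = 2
n_repeats = 20_000

clt_results = {}
for size in (5, 30, 100):
    # Each row is one sample of the given size; the row mean is one sample mean
    means_for_size = rng.exponential(scale, size=(n_repeats, size)).mean(axis=1)
    empirical_se = means_for_size.std(ddof=1)
    theoretical_se = scale / np.sqrt(size)
    clt_results[size] = (means_for_size.mean(), empirical_se, theoretical_se)
    print(f"n={size:3d}: empirical SE = {empirical_se:.4f}, sigma/sqrt(n) = {theoretical_se:.4f}")
```

The empirical and theoretical standard errors should agree closely at every sample size, even though the population itself is strongly skewed.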

## Correlation and Causation

In [None]:
# Generate correlated datasets
n = 100

# Dataset 1: Strong positive correlation
x1 = np.random.normal(0, 1, n)
y1 = 2 * x1 + np.random.normal(0, 0.5, n)

# Dataset 2: No correlation
x2 = np.random.normal(0, 1, n)
y2 = np.random.normal(0, 1, n)

# Dataset 3: Negative correlation
x3 = np.random.normal(0, 1, n)
y3 = -1.5 * x3 + np.random.normal(0, 0.8, n)

# Dataset 4: Non-linear relationship
x4 = np.linspace(-2, 2, n)
y4 = x4**2 + np.random.normal(0, 0.3, n)

# Calculate correlations
corr1 = np.corrcoef(x1, y1)[0, 1]
corr2 = np.corrcoef(x2, y2)[0, 1]
corr3 = np.corrcoef(x3, y3)[0, 1]
corr4 = np.corrcoef(x4, y4)[0, 1]

# Plotting
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Correlation Examples', fontsize=16)

datasets = [(x1, y1, corr1, 'Strong Positive'), (x2, y2, corr2, 'No Correlation'), 
           (x3, y3, corr3, 'Negative'), (x4, y4, corr4, 'Non-linear')]

positions = [(0,0), (0,1), (1,0), (1,1)]

for (x, y, corr, title), (i, j) in zip(datasets, positions):
    axes[i, j].scatter(x, y, alpha=0.6)
    
    # Add trend line for linear relationships
    if title != 'Non-linear':
        z = np.polyfit(x, y, 1)
        p = np.poly1d(z)
        axes[i, j].plot(x, p(x), "r--", alpha=0.8)
    
    axes[i, j].set_title(f'{title}\nCorrelation: {corr:.3f}')
    axes[i, j].set_xlabel('X')
    axes[i, j].set_ylabel('Y')
    axes[i, j].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation interpretation
def interpret_correlation(r):
    abs_r = abs(r)
    if abs_r >= 0.9:
        return "Very strong"
    elif abs_r >= 0.7:
        return "Strong"
    elif abs_r >= 0.5:
        return "Moderate"
    elif abs_r >= 0.3:
        return "Weak"
    else:
        return "Very weak"

print("\n📊 Correlation Analysis:")
print("=" * 40)
for title, corr in [("Strong Positive", corr1), ("No Correlation", corr2),
                    ("Negative", corr3), ("Non-linear", corr4)]:
    strength = interpret_correlation(corr)
    direction = "positive" if corr > 0 else "negative" if corr < 0 else "no"
    print(f"{title:15}: r = {corr:6.3f} ({strength} {direction} correlation)")

print("\n⚠️ Important Notes:")
print("• Correlation measures LINEAR relationships only")
print("• Non-linear relationships may have low correlation but strong association")
print("• Correlation ≠ Causation (correlation does not imply causation)")
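
To make the "linear only" caveat concrete, the sketch below compares Pearson's r with Spearman's rank correlation on two noise-free relationships. The functions chosen (a parabola and a steep exponential) are illustrative assumptions; `scipy.stats.spearmanr` measures monotonic rather than linear association.

```python
import numpy as np
from scipy import stats

x = np.linspace(-2, 2, 101)

# Symmetric parabola: y is fully determined by x, yet Pearson's r is about 0
y_quadratic = x ** 2
r_quadratic, _ = stats.pearsonr(x, y_quadratic)

# Steep exponential: strictly increasing, so ranks agree perfectly
y_exp = np.exp(3 * x)
r_exp, _ = stats.pearsonr(x, y_exp)
rho_exp, _ = stats.spearmanr(x, y_exp)

print(f"Parabola:    Pearson r = {r_quadratic:.3f}")
print(f"Exponential: Pearson r = {r_exp:.3f}, Spearman rho = {rho_exp:.3f}")
```

Spearman's coefficient recovers the monotonic exponential relationship, but neither coefficient detects the parabola, which is why a scatter plot remains the first check.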

In [None]:
# Spurious correlation example
np.random.seed(123)
years = np.arange(2010, 2021)
n_years = len(years)

# Create two series with no causal link that both trend upward over time
# Ice cream sales (increases over time due to warming)
ice_cream_sales = 1000 + 50 * np.arange(n_years) + np.random.normal(0, 20, n_years)

# Sunglasses sales (also increases over time due to fashion trends)
sunglasses_sales = 500 + 30 * np.arange(n_years) + np.random.normal(0, 15, n_years)

# Calculate correlation
spurious_corr = np.corrcoef(ice_cream_sales, sunglasses_sales)[0, 1]

# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

# Time series
ax1.plot(years, ice_cream_sales, 'o-', label='Ice Cream Sales', color='red')
ax1.set_xlabel('Year')
ax1.set_ylabel('Ice Cream Sales', color='red')
ax1.tick_params(axis='y', labelcolor='red')

ax1_twin = ax1.twinx()
ax1_twin.plot(years, sunglasses_sales, 's-', label='Sunglasses Sales', color='blue')
ax1_twin.set_ylabel('Sunglasses Sales', color='blue')
ax1_twin.tick_params(axis='y', labelcolor='blue')

ax1.set_title('Time Series: Both Trending Up')
ax1.grid(True, alpha=0.3)

# Scatter plot
ax2.scatter(ice_cream_sales, sunglasses_sales, alpha=0.7, color='green')
z = np.polyfit(ice_cream_sales, sunglasses_sales, 1)
p = np.poly1d(z)
ax2.plot(ice_cream_sales, p(ice_cream_sales), "r--", alpha=0.8)
ax2.set_xlabel('Ice Cream Sales')
ax2.set_ylabel('Sunglasses Sales')
ax2.set_title(f'Spurious Correlation\nr = {spurious_corr:.3f}')
ax2.grid(True, alpha=0.3)

# After detrending
# Subtract the known linear trend used to generate each series
# (with real data, estimate the trend first, e.g. with np.polyfit)
ice_cream_detrended = ice_cream_sales - (1000 + 50 * np.arange(n_years))
sunglasses_detrended = sunglasses_sales - (500 + 30 * np.arange(n_years))
detrended_corr = np.corrcoef(ice_cream_detrended, sunglasses_detrended)[0, 1]

ax3.scatter(ice_cream_detrended, sunglasses_detrended, alpha=0.7, color='purple')
ax3.set_xlabel('Ice Cream Sales (detrended)')
ax3.set_ylabel('Sunglasses Sales (detrended)')
ax3.set_title(f'After Detrending\nr = {detrended_corr:.3f}')
ax3.grid(True, alpha=0.3)
ax3.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax3.axvline(x=0, color='black', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()

print("🚨 Spurious Correlation Example:")
print("=" * 40)
print(f"Original correlation: {spurious_corr:.3f}")
print(f"After detrending: {detrended_corr:.3f}")
print()
print("Explanation:")
print("• Both variables increase over time (common trend)")
print("• The shared trend produces a high correlation")
print("• After removing the trend, only the correlation between the noise terms remains")
print("• Ice cream and sunglasses sales are not causally related in this simulation")
print("• The correlation comes from a shared dependence on time (a confounding variable)")

print("\n🎯 Correlation vs Causation Guidelines:")
guidelines = [
    "High correlation doesn't prove causation",
    "Look for confounding variables",
    "Consider reverse causation",
    "Use experimental design to establish causation",
    "Apply domain knowledge and logic",
    "Beware of spurious correlations in time series data"
]

for guideline in guidelines:
    print(f"• {guideline}")
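
The example above removes the trend using the coefficients that generated the data, which are unknown in practice. A minimal sketch of the more realistic route is to estimate each linear trend with `np.polyfit` and correlate the residuals. The series below are freshly simulated with the same form as above; the seed and the helper name `detrend_linear` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(11)

# Two independent noise series riding on different upward trends
series_a = 1000 + 50 * t + rng.normal(0, 20, t.size)
series_b = 500 + 30 * t + rng.normal(0, 15, t.size)

def detrend_linear(y, t):
    """Subtract the least-squares straight line fitted to (t, y)."""
    slope, intercept = np.polyfit(t, y, 1)
    return y - (slope * t + intercept)

resid_a = detrend_linear(series_a, t)
resid_b = detrend_linear(series_b, t)

raw_r = np.corrcoef(series_a, series_b)[0, 1]
detrended_r = np.corrcoef(resid_a, resid_b)[0, 1]
print(f"Raw correlation:       {raw_r:.3f}")
print(f"Detrended correlation: {detrended_r:.3f}")
```

With only 11 points the detrended correlation will not be exactly zero; it reflects chance alignment of the noise rather than any real relationship.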

## Hypothesis Testing

In [None]:
# Hypothesis testing examples

def perform_hypothesis_test():
    print("🧪 Hypothesis Testing Examples")
    print("=" * 50)
    
    # Example 1: One-sample t-test
    print("\n1️⃣ One-Sample T-Test:")
    print("H₀: μ = 100 (population mean is 100)")
    print("H₁: μ ≠ 100 (population mean is not 100)")
    
    # Generate sample data
    np.random.seed(42)
    sample = np.random.normal(103, 8, 50)  # True mean is 103
    
    # Perform t-test
    t_stat, p_value = stats.ttest_1samp(sample, 100)
    
    print(f"Sample mean: {np.mean(sample):.2f}")
    print(f"Sample size: {len(sample)}")
    print(f"t-statistic: {t_stat:.3f}")
    print(f"p-value: {p_value:.4f}")
    
    alpha = 0.05
    if p_value < alpha:
        print(f"✅ Reject H₀ (p < {alpha}): Evidence that mean ≠ 100")
    else:
        print(f"❌ Fail to reject H₀ (p ≥ {alpha}): Insufficient evidence")
    
    # Example 2: Two-sample t-test
    print("\n2️⃣ Two-Sample T-Test:")
    print("H₀: μ₁ = μ₂ (no difference between group means)")
    print("H₁: μ₁ ≠ μ₂ (difference between group means)")
    
    # Generate two samples
    group1 = np.random.normal(100, 10, 30)  # Control group
    group2 = np.random.normal(105, 10, 35)  # Treatment group
    
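    # Perform independent t-test (Student's version, which assumes equal variances;
    # pass equal_var=False for Welch's test when the variances may differ)
    t_stat2, p_value2 = stats.ttest_ind(group1, group2)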
    
    print(f"Group 1 mean: {np.mean(group1):.2f} (n={len(group1)})")
    print(f"Group 2 mean: {np.mean(group2):.2f} (n={len(group2)})")
    print(f"Difference: {np.mean(group2) - np.mean(group1):.2f}")
    print(f"t-statistic: {t_stat2:.3f}")
    print(f"p-value: {p_value2:.4f}")
    
    if p_value2 < alpha:
        print(f"✅ Reject H₀ (p < {alpha}): Significant difference between groups")
    else:
        print(f"❌ Fail to reject H₀ (p ≥ {alpha}): No significant difference")
    
    # Example 3: Chi-square test of independence
    print("\n3️⃣ Chi-Square Test of Independence:")
    print("H₀: Variables are independent")
    print("H₁: Variables are dependent")
    
    # Create contingency table
    # Gender vs Preference
    contingency_table = np.array([[30, 20, 10],   # Male: A, B, C
                                 [20, 35, 25]])   # Female: A, B, C
    
    chi2_stat, chi2_p, dof, expected = chi2_contingency(contingency_table)
    
    print("Contingency Table (Gender vs Preference):")
    print("        A    B    C")
    print(f"Male   {contingency_table[0, 0]:2d}  {contingency_table[0, 1]:2d}  {contingency_table[0, 2]:2d}")
    print(f"Female {contingency_table[1, 0]:2d}  {contingency_table[1, 1]:2d}  {contingency_table[1, 2]:2d}")
    print()
    print(f"Chi-square statistic: {chi2_stat:.3f}")
    print(f"Degrees of freedom: {dof}")
    print(f"p-value: {chi2_p:.4f}")
    
    if chi2_p < alpha:
        print(f"✅ Reject H₀ (p < {alpha}): Evidence that the variables are dependent")
    else:
        print(f"❌ Fail to reject H₀ (p ≥ {alpha}): Insufficient evidence of dependence")
    
    return sample, group1, group2

# Perform the tests
sample_data, group1_data, group2_data = perform_hypothesis_test()

# Visualize the hypothesis tests
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# One-sample test visualization
axes[0].hist(sample_data, bins=15, alpha=0.7, color='lightblue', density=True)
axes[0].axvline(100, color='red', linestyle='--', linewidth=2, label='H₀: μ = 100')
axes[0].axvline(np.mean(sample_data), color='green', linestyle='-', linewidth=2, label=f'Sample mean = {np.mean(sample_data):.1f}')
axes[0].set_title('One-Sample T-Test')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Two-sample test visualization
axes[1].hist(group1_data, bins=15, alpha=0.6, color='lightcoral', density=True, label=f'Group 1 (mean={np.mean(group1_data):.1f})')
axes[1].hist(group2_data, bins=15, alpha=0.6, color='lightgreen', density=True, label=f'Group 2 (mean={np.mean(group2_data):.1f})')
axes[1].set_title('Two-Sample T-Test')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Density')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
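
It can help to see what `stats.ttest_1samp` computes under the hood. The sketch below rebuilds the one-sample t-statistic, t = (x̄ − μ₀) / (s / √n), and its two-sided p-value by hand, then compares them with SciPy's result. The sample is regenerated with a local generator, so the numbers will differ from the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
t_sample = rng.normal(103, 8, 50)   # same design as above: true mean 103, std 8, n = 50
mu_0 = 100                          # value claimed by the null hypothesis

n_obs = t_sample.size
standard_error = t_sample.std(ddof=1) / np.sqrt(n_obs)

# Test statistic: how many standard errors the sample mean lies from mu_0
t_manual = (t_sample.mean() - mu_0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_obs - 1)

t_scipy, p_scipy = stats.ttest_1samp(t_sample, mu_0)
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```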

## Confidence Intervals

In [None]:
# Confidence intervals demonstration
def calculate_confidence_intervals(data, confidence_levels=(0.90, 0.95, 0.99)):
    """
    Calculate confidence intervals for different confidence levels.
    """
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)  # Standard error of the mean
    
    print(f"📊 Confidence Intervals (n={n}):")
    print("=" * 40)
    print(f"Sample mean: {mean:.3f}")
    print(f"Standard error: {std_err:.3f}")
    print()
    
    intervals = {}
    
    for confidence in confidence_levels:
        # Calculate t-critical value
        alpha = 1 - confidence
        t_critical = stats.t.ppf(1 - alpha/2, df=n-1)
        
        # Calculate margin of error
        margin_error = t_critical * std_err
        
        # Calculate confidence interval
        ci_lower = mean - margin_error
        ci_upper = mean + margin_error
        
        intervals[confidence] = (ci_lower, ci_upper)
        
        print(f"{confidence*100:4.0f}% CI: [{ci_lower:7.3f}, {ci_upper:7.3f}] (width: {ci_upper - ci_lower:.3f})")
    
    return intervals

# Generate sample data
np.random.seed(42)
sample_size = 50
true_mean = 25
sample = np.random.normal(true_mean, 5, sample_size)

# Calculate confidence intervals
ci_intervals = calculate_confidence_intervals(sample)

# Visualize confidence intervals
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Sample distribution with confidence intervals
ax1.hist(sample, bins=15, alpha=0.7, color='lightblue', density=True)
ax1.axvline(true_mean, color='red', linestyle='--', linewidth=2, label=f'True mean = {true_mean}')
ax1.axvline(np.mean(sample), color='green', linestyle='-', linewidth=2, label=f'Sample mean = {np.mean(sample):.2f}')

# Add confidence intervals
colors = ['orange', 'purple', 'brown']
confidences = [0.90, 0.95, 0.99]
y_positions = [0.12, 0.10, 0.08]

for i, (conf, color, y_pos) in enumerate(zip(confidences, colors, y_positions)):
    ci_lower, ci_upper = ci_intervals[conf]
    ax1.plot([ci_lower, ci_upper], [y_pos, y_pos], color=color, linewidth=4, 
             label=f'{conf*100:.0f}% CI')
    ax1.plot([ci_lower, ci_lower], [y_pos-0.005, y_pos+0.005], color=color, linewidth=2)
    ax1.plot([ci_upper, ci_upper], [y_pos-0.005, y_pos+0.005], color=color, linewidth=2)

ax1.set_title('Sample Distribution with Confidence Intervals')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: CI simulation - show coverage probability
def simulate_ci_coverage(true_mean, std_dev, sample_size, n_simulations=1000, confidence=0.95):
    """
    Simulate confidence interval coverage.
    """
    coverage_count = 0
    ci_lowers = []
    ci_uppers = []
    sample_means = []
    
    alpha = 1 - confidence
    t_critical = stats.t.ppf(1 - alpha/2, df=sample_size-1)
    
    for _ in range(n_simulations):
        # Generate random sample
        sample = np.random.normal(true_mean, std_dev, sample_size)
        sample_mean = np.mean(sample)
        std_err = stats.sem(sample)
        
        # Calculate CI
        margin_error = t_critical * std_err
        ci_lower = sample_mean - margin_error
        ci_upper = sample_mean + margin_error
        
        # Check if CI contains true mean
        if ci_lower <= true_mean <= ci_upper:
            coverage_count += 1
        
        ci_lowers.append(ci_lower)
        ci_uppers.append(ci_upper)
        sample_means.append(sample_mean)
    
    coverage_probability = coverage_count / n_simulations
    return coverage_probability, ci_lowers, ci_uppers, sample_means

# Run simulation
coverage_prob, ci_lowers, ci_uppers, sample_means = simulate_ci_coverage(
    true_mean, 5, sample_size, n_simulations=100, confidence=0.95)

# Plot first 50 confidence intervals
n_show = 50
for i in range(n_show):
    color = 'green' if ci_lowers[i] <= true_mean <= ci_uppers[i] else 'red'
    ax2.plot([ci_lowers[i], ci_uppers[i]], [i, i], color=color, alpha=0.7, linewidth=2)
    ax2.plot(sample_means[i], i, 'o', color='blue', markersize=3, alpha=0.7)

ax2.axvline(true_mean, color='black', linestyle='--', linewidth=2, label=f'True mean = {true_mean}')
ax2.set_title(f'95% Confidence Intervals\nCoverage: {coverage_prob*100:.1f}% (Expected: 95%)')
ax2.set_xlabel('Value')
ax2.set_ylabel('Sample Number')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n🎯 Confidence Interval Interpretation:")
print("=" * 50)
interpretations = [
    "95% CI means: If we repeated this study many times, 95% of the",
    "confidence intervals would contain the true population mean",
    "",
    "Common misconceptions:",
    "❌ 'There's a 95% chance the true mean is in this interval'",
    "✅ 'This interval was created by a method that captures the",
    "   true mean 95% of the time'",
    "",
    "Factors affecting CI width:",
    "• Higher confidence level → Wider interval",
    "• Larger sample size → Narrower interval",
    "• Higher variability → Wider interval"
]

for interpretation in interpretations:
    print(interpretation)
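
The manual calculation above (mean ± t-critical × standard error) can be cross-checked against SciPy's built-in `stats.t.interval`. In this sketch the confidence level is passed positionally, on the assumption that the keyword name for that argument differs between SciPy versions; the data are regenerated with a local generator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ci_data = rng.normal(25, 5, 50)   # same design as above: true mean 25, std 5, n = 50

n_ci = ci_data.size
ci_mean = ci_data.mean()
ci_sem = stats.sem(ci_data)       # standard error of the mean (uses ddof=1)

# Manual 95% interval: mean +/- t_critical * standard error
t_crit = stats.t.ppf(0.975, df=n_ci - 1)
manual_ci = (ci_mean - t_crit * ci_sem, ci_mean + t_crit * ci_sem)

# Built-in equivalent: confidence level first, then degrees of freedom
scipy_ci = stats.t.interval(0.95, n_ci - 1, loc=ci_mean, scale=ci_sem)

print(f"Manual 95% CI: [{manual_ci[0]:.3f}, {manual_ci[1]:.3f}]")
print(f"SciPy 95% CI:  [{scipy_ci[0]:.3f}, {scipy_ci[1]:.3f}]")
```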

## Key Takeaways

### Essential Statistical Concepts:

1. **Descriptive Statistics**:
   - **Central Tendency**: Mean, median, mode
   - **Variability**: Standard deviation, variance, IQR
   - **Shape**: Skewness, kurtosis
   - **Position**: Percentiles, quartiles

2. **Probability Distributions**:
   - **Normal**: Symmetric and bell-shaped; arises often because of the Central Limit Theorem
   - **Binomial**: Fixed number of trials, binary outcomes
   - **Poisson**: Counts of independent events in a fixed interval of time or space
   - **Exponential**: Time between events

3. **Central Limit Theorem**:
   - Sample means approach a normal distribution as sample size grows (given finite variance)
   - Foundation for inference
   - Standard error = σ/√n

4. **Correlation vs Causation**:
   - Correlation measures linear association
   - High correlation ≠ causation
   - Watch for confounding variables
   - Beware spurious correlations

5. **Hypothesis Testing**:
   - Set null (H₀) and alternative (H₁) hypotheses
   - Choose significance level (α)
   - Calculate test statistic and p-value
   - Reject H₀ if p < α; otherwise fail to reject (which is not proof that H₀ is true)

6. **Confidence Intervals**:
   - Provide range of plausible values
   - Interpretation is about the method, not the specific interval
   - Width depends on confidence level, sample size, and variability

### For Data Science Applications:

- **Exploratory Data Analysis**: Use descriptive statistics to understand data
- **Feature Engineering**: Apply statistical transformations
- **Model Validation**: Use statistical tests to compare models
- **A/B Testing**: Apply hypothesis testing to business decisions
- **Uncertainty Quantification**: Use confidence intervals for predictions

### For NLP Applications:

- **Text Analysis**: Word frequency distributions, n-gram statistics
- **Language Models**: Probability distributions over words/sequences
- **Evaluation**: Statistical significance of model improvements
- **Sampling**: Confidence intervals for accuracy metrics
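
As a small illustration of "probability distributions over words", the sketch below estimates unigram probabilities from raw counts (relative frequencies). The sentence is a made-up toy example and whitespace splitting stands in for a real tokenizer; both are assumptions for illustration.

```python
from collections import Counter

# Toy corpus; real text would need proper tokenization and normalization
toy_text = "the cat sat on the mat the end"
tokens = toy_text.split()

word_counts = Counter(tokens)
total_tokens = sum(word_counts.values())

# Relative frequency of each word = count / total number of tokens
unigram_probs = {word: count / total_tokens for word, count in word_counts.items()}

for word, prob in sorted(unigram_probs.items(), key=lambda item: -item[1]):
    print(f"{word:4s} {prob:.3f}")
```

The probabilities sum to 1 by construction, and unseen words get probability 0, which is the motivation for smoothing in real language models.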

## Practice Exercises

1. **Analyze a real dataset**: Calculate all descriptive statistics and create visualizations
2. **A/B Test simulation**: Design and analyze an experiment with statistical tests
3. **Correlation analysis**: Find and explain spurious correlations in time series data
4. **Distribution fitting**: Identify which probability distribution best fits your data
5. **Power analysis**: Calculate required sample sizes for detecting effects
6. **Bootstrap confidence intervals**: Use resampling methods for CI estimation
7. **Multiple testing correction**: Handle multiple hypothesis testing problems

## Next Steps

Build on these foundations with:
- **Advanced statistics**: Regression, ANOVA, time series analysis
- **Machine learning**: Understanding model assumptions and validation
- **Bayesian statistics**: Alternative approach to inference
- **Experimental design**: Planning studies for causal inference

Statistics is the foundation that allows you to draw reliable conclusions from data!