# SciPy Statistics

## Learning Objectives

By the end of this notebook, you will be able to:

1. Calculate descriptive statistics using `scipy.stats`
2. Work with probability distributions (normal, uniform, exponential)
3. Perform hypothesis testing (t-tests, chi-square tests)
4. Calculate and interpret confidence intervals
5. Apply statistical methods to real-world scientific data

---

## 1. Introduction to scipy.stats

The `scipy.stats` module provides a comprehensive collection of statistical functions, probability distributions, and statistical tests. It's essential for scientific computing and data analysis.

In [None]:
import numpy as np
import scipy
from scipy import stats
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Set up plotting style
plt.style.use('seaborn-v0_8-whitegrid')

# scipy.__version__ holds the version string (stats.__name__ is just "scipy.stats")
print("SciPy version:", scipy.__version__)
print("Ready for statistical analysis!")

---

## 2. Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. SciPy provides efficient functions for computing these measures.

### 2.1 Basic Descriptive Statistics

In [None]:
# Generate sample data: daily temperatures in Celsius
temperatures = np.random.normal(loc=20, scale=5, size=365)

# Calculate descriptive statistics
print("Descriptive Statistics for Daily Temperatures")
print("=" * 45)
print(f"Mean:              {np.mean(temperatures):.2f}°C")
print(f"Median:            {np.median(temperatures):.2f}°C")
print(f"Standard Deviation: {np.std(temperatures, ddof=1):.2f}°C")  # ddof=1: sample estimate
print(f"Variance:          {np.var(temperatures, ddof=1):.2f}")  # matches stats.describe()
print(f"Min:               {np.min(temperatures):.2f}°C")
print(f"Max:               {np.max(temperatures):.2f}°C")
print(f"Range:             {np.ptp(temperatures):.2f}°C")  # peak-to-peak
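
One detail worth checking when comparing these numbers with `stats.describe()`: `np.std` and `np.var` use the population formulas (`ddof=0`) unless told otherwise, whereas `stats.describe()` reports the sample variance (`ddof=1`). The following is a minimal sketch of the difference; the array `sample` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=20, scale=5, size=30)

pop_var = np.var(sample)                        # divides by n (ddof=0)
sample_var = np.var(sample, ddof=1)             # divides by n - 1 (ddof=1)
describe_var = stats.describe(sample).variance  # uses ddof=1 by default

print(f"Population variance (ddof=0): {pop_var:.4f}")
print(f"Sample variance (ddof=1):     {sample_var:.4f}")
print(f"stats.describe variance:      {describe_var:.4f}")
```

For large samples the two estimates are nearly identical; the gap matters mostly for small `n`.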

### 2.2 Using scipy.stats.describe()

In [None]:
# Get comprehensive statistics with one function call
result = stats.describe(temperatures)

print("scipy.stats.describe() Output")
print("=" * 40)
print(f"Number of observations: {result.nobs}")
print(f"Minimum: {result.minmax[0]:.2f}°C")
print(f"Maximum: {result.minmax[1]:.2f}°C")
print(f"Mean: {result.mean:.2f}°C")
print(f"Variance: {result.variance:.2f}")
print(f"Skewness: {result.skewness:.4f}")
print(f"Kurtosis: {result.kurtosis:.4f}")

### 2.3 Skewness and Kurtosis

- **Skewness**: Measures asymmetry of the distribution
  - Positive skew: tail extends to the right
  - Negative skew: tail extends to the left
  - Zero: symmetric distribution

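- **Kurtosis**: Measures "tailedness" of the distribution. SciPy reports *excess* kurtosis (Fisher's definition) by default, so a normal distribution scores 0
  - Positive kurtosis: heavier tails than normal
  - Negative kurtosis: lighter tails than normal
  - Zero: tails similar to a normal distribution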
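
The "zero means normal" reading of kurtosis depends on which definition is in use. As a rough sketch, `stats.kurtosis` can return either the excess (Fisher) value or the raw (Pearson) value through its `fisher` argument; the sample below is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=50_000)  # hypothetical sample

excess = stats.kurtosis(normal_sample)                 # Fisher definition (default)
pearson = stats.kurtosis(normal_sample, fisher=False)  # Pearson definition

print(f"Excess (Fisher) kurtosis: {excess:.3f}")   # close to 0 for normal data
print(f"Pearson kurtosis:         {pearson:.3f}")  # close to 3 for normal data
```

The two values always differ by exactly 3, so it is worth stating which one a reported kurtosis refers to.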

In [None]:
# Create datasets with different skewness
symmetric_data = np.random.normal(0, 1, 10000)
right_skewed = np.random.exponential(2, 10000)
left_skewed = -np.random.exponential(2, 10000)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

datasets = [
    (symmetric_data, "Symmetric (Normal)"),
    (right_skewed, "Right Skewed (Exponential)"),
    (left_skewed, "Left Skewed")
]

for ax, (data, title) in zip(axes, datasets):
    ax.hist(data, bins=50, density=True, alpha=0.7, edgecolor='black')
    ax.set_title(f"{title}\nSkewness: {stats.skew(data):.2f}")
    ax.set_xlabel("Value")
    ax.set_ylabel("Density")

plt.tight_layout()
plt.show()

### 2.4 Percentiles and Quantiles

In [None]:
# Calculate percentiles for temperature data
percentiles = [10, 25, 50, 75, 90]

print("Temperature Percentiles")
print("=" * 30)
for p in percentiles:
    value = np.percentile(temperatures, p)
    print(f"{p}th percentile: {value:.2f}°C")

# Interquartile range (IQR)
q1 = np.percentile(temperatures, 25)
q3 = np.percentile(temperatures, 75)
iqr = q3 - q1
print(f"\nInterquartile Range (IQR): {iqr:.2f}°C")

# Using scipy's IQR function
print(f"IQR (scipy.stats.iqr): {stats.iqr(temperatures):.2f}°C")

### 2.5 Mode and Other Statistics

In [None]:
# Mode works best with discrete data
# Let's use exam scores as an example
exam_scores = np.random.choice([60, 65, 70, 75, 80, 85, 90, 95, 100], 
                               size=100, 
                               p=[0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.04, 0.01])

mode_result = stats.mode(exam_scores, keepdims=True)
print(f"Mode: {mode_result.mode[0]} (appears {mode_result.count[0]} times)")

# Geometric mean (useful for growth rates)
growth_rates = [1.05, 1.08, 1.02, 1.10, 1.03]  # 5%, 8%, 2%, 10%, 3% growth
geo_mean = stats.gmean(growth_rates)
print(f"\nGeometric mean of growth rates: {geo_mean:.4f}")
print(f"Average annual growth rate: {(geo_mean - 1) * 100:.2f}%")

# Harmonic mean (useful for averaging rates, e.g. average speed over equal distances)
speeds = [60, 80, 70]  # mph over equal-distance segments of a trip
harm_mean = stats.hmean(speeds)
print(f"\nHarmonic mean of speeds: {harm_mean:.2f} mph")

---

## 3. Probability Distributions

SciPy provides a unified interface for working with many probability distributions. Each distribution has methods for:
- `pdf(x)` / `pmf(x)`: Probability density/mass function
- `cdf(x)`: Cumulative distribution function
- `ppf(q)`: Percent point function (inverse of CDF)
- `rvs(size)`: Random variates
- `fit(data)`: Fit distribution parameters to data (continuous distributions only)
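
The methods listed above can be tried together on a single "frozen" distribution object. This is a small sketch under assumed parameters (`loc=10`, `scale=2` are hypothetical), showing that `ppf` undoes `cdf` and that `fit` approximately recovers the parameters from random draws.

```python
import numpy as np
from scipy import stats

dist = stats.norm(loc=10, scale=2)  # hypothetical frozen distribution

p = dist.cdf(12.0)     # P(X <= 12)
x_back = dist.ppf(p)   # ppf inverts cdf, so this recovers 12.0

# Draw samples, then estimate loc and scale by maximum likelihood
draws = dist.rvs(size=5000, random_state=0)
loc_hat, scale_hat = stats.norm.fit(draws)

print(f"cdf(12) = {p:.4f}, ppf(cdf(12)) = {x_back:.4f}")
print(f"Fitted loc = {loc_hat:.3f}, fitted scale = {scale_hat:.3f}")
```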

### 3.1 Normal Distribution

In [None]:
# Create a normal distribution object
# loc = mean, scale = standard deviation
mu, sigma = 100, 15  # IQ scores: mean=100, std=15
normal_dist = stats.norm(loc=mu, scale=sigma)

# Generate random samples
samples = normal_dist.rvs(size=1000)

# Calculate probabilities
print("IQ Score Analysis (Normal Distribution)")
print("=" * 45)

# What's the probability of IQ < 85?
prob_below_85 = normal_dist.cdf(85)
print(f"P(IQ < 85): {prob_below_85:.4f} ({prob_below_85*100:.2f}%)")

# What's the probability of IQ between 85 and 115?
prob_85_to_115 = normal_dist.cdf(115) - normal_dist.cdf(85)
print(f"P(85 < IQ < 115): {prob_85_to_115:.4f} ({prob_85_to_115*100:.2f}%)")

# What IQ score is at the 95th percentile?
iq_95th = normal_dist.ppf(0.95)
print(f"95th percentile IQ: {iq_95th:.1f}")

# Probability density at mean
pdf_at_mean = normal_dist.pdf(mu)
print(f"PDF at mean (x=100): {pdf_at_mean:.6f}")

In [None]:
# Visualize the normal distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)

# PDF plot
ax1 = axes[0]
ax1.plot(x, normal_dist.pdf(x), 'b-', linewidth=2, label='PDF')
ax1.fill_between(x, normal_dist.pdf(x), where=(x >= 85) & (x <= 115), 
                  alpha=0.3, label='P(85 < IQ < 115)')
ax1.axvline(mu, color='red', linestyle='--', label=f'Mean = {mu}')
ax1.set_xlabel('IQ Score')
ax1.set_ylabel('Probability Density')
ax1.set_title('Normal Distribution PDF')
ax1.legend()

# CDF plot
ax2 = axes[1]
ax2.plot(x, normal_dist.cdf(x), 'g-', linewidth=2)
ax2.axhline(0.5, color='gray', linestyle=':', alpha=0.5)
ax2.axvline(mu, color='red', linestyle='--', label=f'Mean = {mu}')
ax2.set_xlabel('IQ Score')
ax2.set_ylabel('Cumulative Probability')
ax2.set_title('Normal Distribution CDF')
ax2.legend()

plt.tight_layout()
plt.show()

### 3.2 Uniform Distribution

In [None]:
# Uniform distribution: all values equally likely within a range
# scipy.stats.uniform uses loc (start) and scale (width)
# For U(a, b): loc=a, scale=b-a

a, b = 0, 10  # Random number between 0 and 10
uniform_dist = stats.uniform(loc=a, scale=b-a)

print("Uniform Distribution U(0, 10)")
print("=" * 35)
print(f"Mean: {uniform_dist.mean():.2f}")
print(f"Variance: {uniform_dist.var():.2f}")
print(f"Standard Deviation: {uniform_dist.std():.2f}")

# Probability of value between 3 and 7
prob_3_to_7 = uniform_dist.cdf(7) - uniform_dist.cdf(3)
print(f"\nP(3 < X < 7): {prob_3_to_7:.4f}")

# Generate samples
uniform_samples = uniform_dist.rvs(size=10000)

# Visualize
fig, ax = plt.subplots(figsize=(10, 4))
ax.hist(uniform_samples, bins=50, density=True, alpha=0.7, 
        edgecolor='black', label='Samples')
x = np.linspace(-1, 11, 100)
ax.plot(x, uniform_dist.pdf(x), 'r-', linewidth=2, label='Theoretical PDF')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Uniform Distribution U(0, 10)')
ax.legend()
plt.show()

### 3.3 Exponential Distribution

In [None]:
# Exponential distribution: models time between events
# Often used for: time between customer arrivals, radioactive decay, etc.
# scipy uses scale = 1/lambda (mean of distribution)

# Example: Customer arrivals with average rate of 5 per hour
rate = 5  # customers per hour (lambda)
mean_time = 1/rate  # mean time between arrivals in hours

exp_dist = stats.expon(scale=mean_time)

print("Exponential Distribution (Customer Arrivals)")
print("=" * 45)
print(f"Rate (lambda): {rate} customers/hour")
print(f"Mean time between arrivals: {mean_time*60:.1f} minutes")
print(f"Variance: {exp_dist.var()*3600:.2f} minutes^2")

# Probability of waiting more than 15 minutes (0.25 hours)
prob_wait_15 = 1 - exp_dist.cdf(0.25)
print(f"\nP(wait > 15 minutes): {prob_wait_15:.4f}")

# Probability of waiting less than 5 minutes
prob_wait_5 = exp_dist.cdf(5/60)
print(f"P(wait < 5 minutes): {prob_wait_5:.4f}")

# Median wait time
median_wait = exp_dist.ppf(0.5)
print(f"Median wait time: {median_wait*60:.2f} minutes")
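
As a possible refinement, SciPy distributions expose a survival function `sf(x)`, which returns P(X > x) directly instead of computing `1 - cdf(x)`. The sketch below reuses the arrival rate from the example above and also checks the memoryless property of the exponential distribution; the value `s = 0.5` is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

rate = 5                       # arrivals per hour, as in the example above
exp_dist = stats.expon(scale=1 / rate)

t = 0.25                       # 15 minutes, expressed in hours
via_cdf = 1 - exp_dist.cdf(t)
via_sf = exp_dist.sf(t)        # survival function: P(X > t)
closed_form = np.exp(-rate * t)

# Memoryless property: P(X > s + t | X > s) equals P(X > t)
s = 0.5                        # arbitrary elapsed time, in hours
conditional = exp_dist.sf(s + t) / exp_dist.sf(s)

print(f"1 - cdf: {via_cdf:.6f}, sf: {via_sf:.6f}, exp(-rate*t): {closed_form:.6f}")
print(f"P(X > s + t | X > s): {conditional:.6f}")
```

For probabilities far out in the tail, `sf` tends to be numerically more accurate than subtracting the CDF from 1.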

In [None]:
# Visualize exponential distribution
x = np.linspace(0, 1, 200)  # 0 to 1 hour

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# PDF
ax1 = axes[0]
ax1.plot(x*60, exp_dist.pdf(x), 'b-', linewidth=2)
ax1.fill_between(x*60, exp_dist.pdf(x), where=(x <= 0.25), alpha=0.3)
ax1.axvline(15, color='red', linestyle='--', label='15 minutes')
ax1.set_xlabel('Wait Time (minutes)')
ax1.set_ylabel('Probability Density')
ax1.set_title('Exponential Distribution PDF\n(Time Between Customer Arrivals)')
ax1.legend()

# Survival function (1 - CDF)
ax2 = axes[1]
ax2.plot(x*60, 1 - exp_dist.cdf(x), 'g-', linewidth=2)
ax2.axhline(0.5, color='gray', linestyle=':', alpha=0.5, label='50%')
ax2.set_xlabel('Wait Time (minutes)')
ax2.set_ylabel('Probability of Waiting Longer')
ax2.set_title('Survival Function P(X > t)')
ax2.legend()

plt.tight_layout()
plt.show()

### 3.4 Comparing Distributions

In [None]:
# Compare different distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Normal distributions with different parameters
ax = axes[0, 0]
x = np.linspace(-6, 6, 200)
for mu, sigma in [(0, 1), (0, 2), (2, 1)]:
    ax.plot(x, stats.norm.pdf(x, mu, sigma), 
            label=f'μ={mu}, σ={sigma}')
ax.set_title('Normal Distributions')
ax.legend()

# Exponential distributions with different rates
ax = axes[0, 1]
x = np.linspace(0, 5, 200)
for lam in [0.5, 1.0, 2.0]:
    ax.plot(x, stats.expon.pdf(x, scale=1/lam), 
            label=f'λ={lam}')
ax.set_title('Exponential Distributions')
ax.legend()

# Gamma distributions
ax = axes[0, 2]
x = np.linspace(0, 15, 200)
for shape in [1, 2, 5]:
    ax.plot(x, stats.gamma.pdf(x, shape), 
            label=f'k={shape}')
ax.set_title('Gamma Distributions')
ax.legend()

# Beta distributions
ax = axes[1, 0]
x = np.linspace(0, 1, 200)
for a, b in [(0.5, 0.5), (2, 5), (5, 2)]:
    ax.plot(x, stats.beta.pdf(x, a, b), 
            label=f'α={a}, β={b}')
ax.set_title('Beta Distributions')
ax.legend()

# Chi-square distributions
ax = axes[1, 1]
x = np.linspace(0, 20, 200)
for df in [2, 4, 8]:
    ax.plot(x, stats.chi2.pdf(x, df), 
            label=f'df={df}')
ax.set_title('Chi-Square Distributions')
ax.legend()

# t-distributions
ax = axes[1, 2]
x = np.linspace(-4, 4, 200)
ax.plot(x, stats.norm.pdf(x), 'k--', label='Normal', linewidth=2)
for df in [1, 5, 30]:
    ax.plot(x, stats.t.pdf(x, df), 
            label=f'df={df}')
ax.set_title('Student\'s t-Distributions')
ax.legend()

plt.tight_layout()
plt.show()

---

## 4. Hypothesis Testing

Hypothesis testing helps us make statistical decisions about populations based on sample data.

### 4.1 One-Sample t-Test

Tests whether a sample mean differs significantly from a hypothesized population mean.
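
To see what the test computes, the statistic can be reproduced by hand as t = (sample mean - hypothesized mean) / (s / sqrt(n)), with a two-tailed p-value taken from the t-distribution with n - 1 degrees of freedom. The sketch below uses hypothetical measurements and compares the manual result with `stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=50.5, scale=2, size=30)  # hypothetical measurements
mu0 = 50                                         # hypothesized mean

n = len(sample)
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-tailed p-value

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```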

In [None]:
# Example: Testing if a manufacturing process produces parts with correct mean weight
# Specification: parts should weigh 50 grams

# Sample of parts from the production line
np.random.seed(123)
sample_weights = np.random.normal(loc=50.5, scale=2, size=30)  # slightly off-target

# Null hypothesis: μ = 50 grams
# Alternative hypothesis: μ ≠ 50 grams (two-tailed test)
target_weight = 50

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample_weights, target_weight)

print("One-Sample t-Test: Manufacturing Quality Control")
print("=" * 50)
print(f"Sample size: {len(sample_weights)}")
print(f"Sample mean: {np.mean(sample_weights):.3f} grams")
print(f"Sample std: {np.std(sample_weights, ddof=1):.3f} grams")
print(f"Target mean: {target_weight} grams")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion: Reject H0 (p < {alpha})")
    print("The sample mean is significantly different from the target.")
else:
    print(f"\nConclusion: Fail to reject H0 (p >= {alpha})")
    print("No significant difference from the target weight.")

### 4.2 Two-Sample t-Test (Independent Samples)

Tests whether two independent groups have different means.

In [None]:
# Example: Comparing effectiveness of two treatments
np.random.seed(456)

# Treatment A: recovery time in days
treatment_a = np.random.normal(loc=14, scale=3, size=25)

# Treatment B: recovery time in days
treatment_b = np.random.normal(loc=12, scale=3, size=25)

# Perform two-sample t-test (assuming equal variances)
t_stat, p_value = stats.ttest_ind(treatment_a, treatment_b)

print("Two-Sample t-Test: Treatment Comparison")
print("=" * 45)
print(f"Treatment A: n={len(treatment_a)}, mean={np.mean(treatment_a):.2f}, std={np.std(treatment_a, ddof=1):.2f}")
print(f"Treatment B: n={len(treatment_b)}, mean={np.mean(treatment_b):.2f}, std={np.std(treatment_b, ddof=1):.2f}")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Welch's t-test (does not assume equal variances)
t_stat_welch, p_value_welch = stats.ttest_ind(treatment_a, treatment_b, equal_var=False)
print(f"\nWelch's t-test (unequal variances):")
print(f"t-statistic: {t_stat_welch:.4f}")
print(f"p-value: {p_value_welch:.4f}")

In [None]:
# Visualize the two groups
fig, ax = plt.subplots(figsize=(10, 5))

# Create box plots
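bp = ax.boxplot([treatment_a, treatment_b], patch_artist=True)
# Label the groups on the axis; boxplot's `labels` keyword is deprecated in Matplotlib 3.9+
ax.set_xticklabels(['Treatment A', 'Treatment B'])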

# Color the boxes
colors = ['lightblue', 'lightgreen']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

# Add individual points
for i, (data, color) in enumerate(zip([treatment_a, treatment_b], ['blue', 'green']), 1):
    x = np.random.normal(i, 0.04, size=len(data))
    ax.scatter(x, data, alpha=0.6, color=color, s=30)

ax.set_ylabel('Recovery Time (days)')
ax.set_title(f'Treatment Comparison\np-value = {p_value:.4f}')
plt.show()

### 4.3 Paired t-Test

Tests whether the mean difference between paired observations is significantly different from zero.
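
Put differently, a paired t-test is a one-sample t-test applied to the per-subject differences. A minimal sketch with hypothetical before/after data illustrates the equivalence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(loc=85, scale=10, size=20)        # hypothetical data
after = before - rng.normal(loc=3, scale=2, size=20)  # hypothetical data

paired = stats.ttest_rel(before, after)
one_sample = stats.ttest_1samp(before - after, 0.0)   # test differences against 0

print(f"Paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.6f}")
print(f"One-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.6f}")
```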

In [None]:
# Example: Before and after measurements (e.g., weight loss program)
np.random.seed(789)

n_subjects = 20
weight_before = np.random.normal(loc=85, scale=10, size=n_subjects)
# After program: average loss of 3 kg with some variation
weight_loss = np.random.normal(loc=3, scale=2, size=n_subjects)
weight_after = weight_before - weight_loss

# Paired t-test
t_stat, p_value = stats.ttest_rel(weight_before, weight_after)

print("Paired t-Test: Weight Loss Program Effectiveness")
print("=" * 50)
print(f"Number of subjects: {n_subjects}")
print(f"Mean weight before: {np.mean(weight_before):.2f} kg")
print(f"Mean weight after: {np.mean(weight_after):.2f} kg")
print(f"Mean difference: {np.mean(weight_before - weight_after):.2f} kg")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nConclusion: The weight loss is statistically significant.")
else:
    print("\nConclusion: No statistically significant weight change detected.")

### 4.4 Chi-Square Test

Tests for independence between categorical variables.
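
Under the null hypothesis of independence, each expected count is (row total × column total) / grand total. The sketch below computes these by hand for a small hypothetical 2×2 table and compares them with the expected frequencies returned by `stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [20, 40]])   # hypothetical 2x2 counts

row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / table.sum()

# The fourth returned value holds the expected frequencies
_, _, _, expected_scipy = stats.chi2_contingency(table)

print("Manual expected counts:\n", expected_manual)
print("SciPy expected counts:\n", expected_scipy)
```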

In [None]:
# Example: Testing if customer satisfaction depends on product category
# Contingency table: rows = categories, columns = satisfaction levels

# Observed frequencies
observed = np.array([
    [50, 30, 20],   # Electronics: Satisfied, Neutral, Dissatisfied
    [40, 35, 25],   # Clothing
    [60, 25, 15],   # Home & Garden
])

# Perform chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-Square Test of Independence")
print("=" * 45)
print("\nObserved Frequencies:")
print("               Satisfied  Neutral  Dissatisfied")
categories = ['Electronics', 'Clothing', 'Home & Garden']
for cat, row in zip(categories, observed):
    print(f"{cat:14} {row[0]:9} {row[1]:8} {row[2]:12}")

print("\nExpected Frequencies (if independent):")
print("               Satisfied  Neutral  Dissatisfied")
for cat, row in zip(categories, expected):
    print(f"{cat:14} {row[0]:9.1f} {row[1]:8.1f} {row[2]:12.1f}")

print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nConclusion: Satisfaction is significantly associated with product category.")
else:
    print("\nConclusion: No significant relationship between satisfaction and category.")

### 4.5 Chi-Square Goodness of Fit Test

In [None]:
# Example: Testing if a die is fair
# Expected: each face should appear 1/6 of the time

# Observed frequencies from 600 rolls
observed_rolls = np.array([92, 108, 97, 103, 95, 105])
expected_rolls = np.array([100, 100, 100, 100, 100, 100])  # fair die

chi2, p_value = stats.chisquare(observed_rolls, f_exp=expected_rolls)

print("Chi-Square Goodness of Fit: Testing a Die")
print("=" * 45)
print(f"\nTotal rolls: {sum(observed_rolls)}")
print("\nFace   Observed  Expected")
for i in range(6):
    print(f"  {i+1}      {observed_rolls[i]:3}       {expected_rolls[i]}")

print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nConclusion: The die appears to be biased.")
else:
    print("\nConclusion: No evidence that the die is biased.")
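
The goodness-of-fit statistic is the sum of (observed - expected)² / expected over all categories, compared against a chi-square distribution with (number of categories - 1) degrees of freedom. As a sketch, the die example can be reproduced by hand; for these counts the statistic works out to 196 / 100 = 1.96.

```python
import numpy as np
from scipy import stats

observed_rolls = np.array([92, 108, 97, 103, 95, 105])  # counts from the die example
expected_rolls = np.full(6, observed_rolls.sum() / 6)   # fair die: equal counts

chi2_manual = np.sum((observed_rolls - expected_rolls) ** 2 / expected_rolls)
dof = len(observed_rolls) - 1
p_manual = stats.chi2.sf(chi2_manual, dof)  # upper-tail probability

chi2_scipy, p_scipy = stats.chisquare(observed_rolls, f_exp=expected_rolls)

print(f"Manual: chi2 = {chi2_manual:.2f}, p = {p_manual:.4f}")
print(f"SciPy:  chi2 = {chi2_scipy:.2f}, p = {p_scipy:.4f}")
```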

---

## 5. Confidence Intervals

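A confidence interval gives a range of plausible values for a population parameter. A 95% interval comes from a procedure that captures the true value in about 95% of repeated samples.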
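
One way to make the meaning of a confidence level concrete is a small simulation: draw many samples from a population whose mean is known, build a 95% t-interval from each, and count how often the interval contains the true mean. The population parameters and trial counts below are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_mean, true_sd = 8.5, 1.2   # hypothetical population parameters
n, n_trials = 30, 2000          # sample size and number of repeated samples

samples = rng.normal(true_mean, true_sd, size=(n_trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

# An interval covers the true mean when the sample mean is within t_crit * SEM of it
covered = np.abs(means - true_mean) <= t_crit * sems
coverage = covered.mean()

print(f"Intervals containing the true mean: {coverage:.1%}")
```

The observed fraction should land near 95%, with some simulation noise.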

### 5.1 Confidence Interval for the Mean

In [None]:
# Example: Estimating average battery life
np.random.seed(42)
battery_life = np.random.normal(loc=8.5, scale=1.2, size=50)  # hours

# Calculate 95% confidence interval
confidence = 0.95
n = len(battery_life)
mean = np.mean(battery_life)
sem = stats.sem(battery_life)  # standard error of the mean

# Using t-distribution (small sample or unknown population variance)
ci = stats.t.interval(confidence, df=n-1, loc=mean, scale=sem)

print("95% Confidence Interval for Battery Life")
print("=" * 45)
print(f"Sample size: {n}")
print(f"Sample mean: {mean:.3f} hours")
print(f"Standard deviation: {np.std(battery_life, ddof=1):.3f} hours")
print(f"Standard error: {sem:.3f} hours")
print(f"\n95% Confidence Interval: ({ci[0]:.3f}, {ci[1]:.3f}) hours")
print(f"Margin of error: ±{(ci[1] - ci[0])/2:.3f} hours")

In [None]:
# Compare confidence intervals at different confidence levels
confidence_levels = [0.90, 0.95, 0.99]

print("Confidence Intervals at Different Levels")
print("=" * 50)

fig, ax = plt.subplots(figsize=(10, 5))

for i, conf in enumerate(confidence_levels):
    ci = stats.t.interval(conf, df=n-1, loc=mean, scale=sem)
    margin = (ci[1] - ci[0]) / 2
    print(f"{conf*100:.0f}% CI: ({ci[0]:.3f}, {ci[1]:.3f}), margin: ±{margin:.3f}")
    
    # Plot
    ax.errorbar(mean, i, xerr=margin, fmt='o', capsize=10, 
                capthick=2, markersize=10, label=f'{conf*100:.0f}% CI')

ax.axvline(mean, color='red', linestyle='--', alpha=0.5)
ax.set_yticks(range(len(confidence_levels)))
ax.set_yticklabels([f'{c*100:.0f}%' for c in confidence_levels])
ax.set_xlabel('Battery Life (hours)')
ax.set_ylabel('Confidence Level')
ax.set_title('Confidence Intervals at Different Levels')
ax.legend()
plt.tight_layout()
plt.show()

### 5.2 Confidence Interval for Proportions

In [None]:
# Example: Estimating proportion of defective products
n_samples = 200
n_defective = 12
p_hat = n_defective / n_samples

# Using the normal (Wald) approximation; the Wilson score interval is more accurate for small p
# Standard error for proportion
se_prop = np.sqrt(p_hat * (1 - p_hat) / n_samples)

# 95% CI using normal distribution
z = stats.norm.ppf(0.975)  # z-score for 95% CI
ci_lower = p_hat - z * se_prop
ci_upper = p_hat + z * se_prop

print("95% Confidence Interval for Defect Rate")
print("=" * 45)
print(f"Sample size: {n_samples}")
print(f"Number of defects: {n_defective}")
print(f"Sample proportion: {p_hat:.4f} ({p_hat*100:.2f}%)")
print(f"Standard error: {se_prop:.4f}")
print(f"\n95% CI: ({ci_lower:.4f}, {ci_upper:.4f})")
print(f"        ({ci_lower*100:.2f}%, {ci_upper*100:.2f}%)")
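
For comparison, a Wilson score interval can be obtained from `stats.binomtest(...).proportion_ci(method='wilson')`, available in SciPy 1.7 and later. The sketch below reuses the defect counts from the example above and places the normal-approximation and Wilson intervals side by side.

```python
import numpy as np
from scipy import stats

n_samples, n_defective = 200, 12   # counts from the example above
p_hat = n_defective / n_samples

# Normal-approximation (Wald) interval
z = stats.norm.ppf(0.975)
se = np.sqrt(p_hat * (1 - p_hat) / n_samples)
wald = (p_hat - z * se, p_hat + z * se)

# Wilson score interval (requires SciPy >= 1.7)
wilson = stats.binomtest(n_defective, n_samples).proportion_ci(
    confidence_level=0.95, method='wilson')

print(f"Wald 95% CI:   ({wald[0]:.4f}, {wald[1]:.4f})")
print(f"Wilson 95% CI: ({wilson.low:.4f}, {wilson.high:.4f})")
```

The Wilson interval is not centered on the sample proportion and never extends below 0 or above 1, which is why it is often preferred when the proportion is close to either boundary.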

### 5.3 Sample Size Determination

In [None]:
# How large a sample do we need for a desired margin of error?

def sample_size_for_mean(margin_of_error, std_dev, confidence=0.95):
    """Calculate required sample size for estimating a mean."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = ((z * std_dev) / margin_of_error) ** 2
    return int(np.ceil(n))

def sample_size_for_proportion(margin_of_error, p=0.5, confidence=0.95):
    """Calculate required sample size for estimating a proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z**2 * p * (1 - p)) / margin_of_error**2
    return int(np.ceil(n))

# Example: Battery life study
print("Sample Size Calculations")
print("=" * 45)

# For mean estimation
desired_margin = 0.2  # hours
estimated_std = 1.2   # hours
n_needed = sample_size_for_mean(desired_margin, estimated_std)
print(f"\nFor mean (margin = ±{desired_margin} hours, σ = {estimated_std}):")
print(f"Required sample size: {n_needed}")

# For proportion estimation
desired_margin_prop = 0.03  # 3 percentage points
n_needed_prop = sample_size_for_proportion(desired_margin_prop)
print(f"\nFor proportion (margin = ±{desired_margin_prop*100:.0f}%, p = 0.5):")
print(f"Required sample size: {n_needed_prop}")
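
A quick sanity check on the sample size formula: the computed `n` should be the smallest sample size whose margin of error, z·σ/√n, does not exceed the target. This sketch uses the values from the battery example and assumes, as the formula does, a known σ and a normal critical value.

```python
import numpy as np
from scipy import stats

margin, sigma, confidence = 0.2, 1.2, 0.95   # values from the battery example
z = stats.norm.ppf((1 + confidence) / 2)

n_needed = int(np.ceil((z * sigma / margin) ** 2))
margin_at_n = z * sigma / np.sqrt(n_needed)       # margin achieved with n_needed
margin_below = z * sigma / np.sqrt(n_needed - 1)  # margin with one fewer observation

print(f"n = {n_needed}: margin = ±{margin_at_n:.4f}")
print(f"n = {n_needed - 1}: margin = ±{margin_below:.4f}")
```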

---

## Exercises

Practice what you've learned with these exercises.

### Exercise 1: Descriptive Statistics

A scientist measures the length of 50 fish from a lake (in cm). Calculate:
1. Mean, median, and standard deviation
2. Skewness and kurtosis
3. The 10th and 90th percentiles
4. Create a histogram with the mean and median marked

In [None]:
# Fish length data (in cm)
np.random.seed(100)
fish_lengths = np.concatenate([
    np.random.normal(25, 3, 35),  # Main population
    np.random.normal(35, 2, 15)   # Larger fish
])

# Your code here


<details>
<summary>Click to see solution</summary>

```python
# 1. Basic statistics
print("Descriptive Statistics for Fish Lengths")
print("=" * 45)
print(f"Mean: {np.mean(fish_lengths):.2f} cm")
print(f"Median: {np.median(fish_lengths):.2f} cm")
print(f"Standard Deviation: {np.std(fish_lengths, ddof=1):.2f} cm")

# 2. Skewness and kurtosis
print(f"\nSkewness: {stats.skew(fish_lengths):.4f}")
print(f"Kurtosis: {stats.kurtosis(fish_lengths):.4f}")

# 3. Percentiles
p10 = np.percentile(fish_lengths, 10)
p90 = np.percentile(fish_lengths, 90)
print(f"\n10th percentile: {p10:.2f} cm")
print(f"90th percentile: {p90:.2f} cm")

# 4. Histogram
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(fish_lengths, bins=15, edgecolor='black', alpha=0.7)
ax.axvline(np.mean(fish_lengths), color='red', linestyle='--', 
           linewidth=2, label=f'Mean = {np.mean(fish_lengths):.2f}')
ax.axvline(np.median(fish_lengths), color='green', linestyle='-', 
           linewidth=2, label=f'Median = {np.median(fish_lengths):.2f}')
ax.set_xlabel('Fish Length (cm)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Fish Lengths')
ax.legend()
plt.show()
```
</details>

### Exercise 2: Working with Distributions

The time between customer arrivals at a bank follows an exponential distribution with a mean of 3 minutes.

1. Create the distribution object
2. Calculate the probability of waiting more than 5 minutes
3. Calculate the probability of waiting between 2 and 4 minutes
4. Find the median wait time
5. Generate 1000 random samples and verify the theoretical mean

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
# 1. Create exponential distribution with mean = 3 minutes
mean_time = 3
exp_dist = stats.expon(scale=mean_time)

print("Bank Customer Wait Time Analysis")
print("=" * 40)
print(f"Distribution: Exponential with mean = {mean_time} minutes")

# 2. P(X > 5)
prob_more_than_5 = 1 - exp_dist.cdf(5)
print(f"\nP(wait > 5 min): {prob_more_than_5:.4f} ({prob_more_than_5*100:.2f}%)")

# 3. P(2 < X < 4)
prob_2_to_4 = exp_dist.cdf(4) - exp_dist.cdf(2)
print(f"P(2 < wait < 4 min): {prob_2_to_4:.4f} ({prob_2_to_4*100:.2f}%)")

# 4. Median wait time
median = exp_dist.ppf(0.5)
print(f"Median wait time: {median:.2f} minutes")

# 5. Simulate and verify
samples = exp_dist.rvs(size=1000)
print(f"\nSimulation (n=1000):")
print(f"Sample mean: {np.mean(samples):.2f} minutes")
print(f"Theoretical mean: {mean_time} minutes")

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(samples, bins=30, density=True, alpha=0.7, edgecolor='black', label='Samples')
x = np.linspace(0, 15, 100)
ax.plot(x, exp_dist.pdf(x), 'r-', linewidth=2, label='Theoretical PDF')
ax.axvline(median, color='green', linestyle='--', label=f'Median = {median:.2f}')
ax.set_xlabel('Wait Time (minutes)')
ax.set_ylabel('Density')
ax.set_title('Customer Wait Time Distribution')
ax.legend()
plt.show()
```
</details>

### Exercise 3: Hypothesis Testing

A pharmaceutical company claims their new drug reduces blood pressure by an average of 10 mmHg. In a clinical trial, the reductions for 25 patients were recorded (see the data in the cell below). Using that data:

1. Perform a one-sample t-test to check if the mean reduction differs from 10 mmHg
2. Calculate and interpret the 95% confidence interval
3. What conclusion would you draw at α = 0.05?

In [None]:
# Blood pressure reductions (mmHg)
np.random.seed(42)
bp_reductions = np.random.normal(loc=8.5, scale=4, size=25)

# Your code here


<details>
<summary>Click to see solution</summary>

```python
# Claimed reduction
claimed_reduction = 10

# Basic statistics
n = len(bp_reductions)
mean_reduction = np.mean(bp_reductions)
std_reduction = np.std(bp_reductions, ddof=1)
sem = stats.sem(bp_reductions)

print("Blood Pressure Reduction Analysis")
print("=" * 45)
print(f"Sample size: {n}")
print(f"Sample mean: {mean_reduction:.2f} mmHg")
print(f"Sample std: {std_reduction:.2f} mmHg")
print(f"Claimed reduction: {claimed_reduction} mmHg")

# 1. One-sample t-test
t_stat, p_value = stats.ttest_1samp(bp_reductions, claimed_reduction)
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# 2. 95% confidence interval
ci = stats.t.interval(0.95, df=n-1, loc=mean_reduction, scale=sem)
print(f"\n95% CI: ({ci[0]:.2f}, {ci[1]:.2f}) mmHg")

# 3. Conclusion
print("\n" + "=" * 45)
print("Conclusion:")
if p_value < 0.05:
    print(f"At α = 0.05, we reject the null hypothesis (p = {p_value:.4f}).")
    print(f"The actual reduction ({mean_reduction:.2f} mmHg) is significantly")
    print(f"different from the claimed {claimed_reduction} mmHg.")
else:
    print(f"At α = 0.05, we fail to reject the null hypothesis.")

# Check if 10 is in the CI
if ci[0] <= claimed_reduction <= ci[1]:
    print(f"\nNote: {claimed_reduction} mmHg is within the 95% CI.")
else:
    print(f"\nNote: {claimed_reduction} mmHg is outside the 95% CI.")
```
</details>

### Exercise 4: Chi-Square Test

A researcher wants to know if there's a relationship between exercise frequency and sleep quality. Survey data from 300 people are provided in the cell below. Using that data:

1. Perform a chi-square test of independence
2. Calculate the expected frequencies
3. Interpret the results

In [None]:
# Observed frequencies
# Rows: Exercise frequency (Never, 1-2x/week, 3+x/week)
# Columns: Sleep quality (Poor, Fair, Good)
observed = np.array([
    [40, 30, 20],   # Never exercise
    [25, 45, 40],   # 1-2x/week
    [15, 35, 50],   # 3+x/week
])

# Your code here


<details>
<summary>Click to see solution</summary>

```python
# Labels for better output
exercise_levels = ['Never', '1-2x/week', '3+x/week']
sleep_quality = ['Poor', 'Fair', 'Good']

# 1 & 2. Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-Square Test: Exercise Frequency vs Sleep Quality")
print("=" * 55)

print("\nObserved Frequencies:")
print(f"{'Exercise':<12} {'Poor':>8} {'Fair':>8} {'Good':>8} {'Total':>8}")
print("-" * 48)
for i, ex in enumerate(exercise_levels):
    print(f"{ex:<12} {observed[i,0]:>8} {observed[i,1]:>8} {observed[i,2]:>8} {sum(observed[i]):>8}")
print("-" * 48)
print(f"{'Total':<12} {observed[:,0].sum():>8} {observed[:,1].sum():>8} {observed[:,2].sum():>8} {observed.sum():>8}")

print("\nExpected Frequencies (if independent):")
print(f"{'Exercise':<12} {'Poor':>8} {'Fair':>8} {'Good':>8}")
print("-" * 40)
for i, ex in enumerate(exercise_levels):
    print(f"{ex:<12} {expected[i,0]:>8.1f} {expected[i,1]:>8.1f} {expected[i,2]:>8.1f}")

print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.6f}")

# 3. Interpretation
print("\n" + "=" * 55)
print("Interpretation:")
if p_value < 0.05:
    print(f"At α = 0.05, we reject the null hypothesis of independence.")
    print(f"There IS a significant relationship between exercise frequency")
    print(f"and sleep quality (χ² = {chi2:.2f}, p = {p_value:.4f}).")
    print(f"\nLooking at the data, people who exercise more frequently")
    print(f"tend to report better sleep quality.")
else:
    print(f"At α = 0.05, we fail to reject the null hypothesis.")
    print(f"No significant relationship between exercise and sleep quality.")
```
</details>

### Exercise 5: Confidence Intervals and Sample Size

A quality control engineer wants to estimate the average lifespan of light bulbs.

1. From a sample of 40 bulbs with mean = 1200 hours and std = 150 hours, calculate the 99% CI
2. If we want the margin of error to be at most ±20 hours (95% CI), how many bulbs should we test?
3. How does the margin of error change as confidence level increases?

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
# Given data
n = 40
mean = 1200
std = 150
sem = std / np.sqrt(n)

print("Light Bulb Lifespan Analysis")
print("=" * 45)
print(f"Sample size: {n}")
print(f"Sample mean: {mean} hours")
print(f"Sample std: {std} hours")
print(f"Standard error: {sem:.2f} hours")

# 1. 99% Confidence Interval
ci_99 = stats.t.interval(0.99, df=n-1, loc=mean, scale=sem)
margin_99 = (ci_99[1] - ci_99[0]) / 2
print(f"\n99% CI: ({ci_99[0]:.2f}, {ci_99[1]:.2f}) hours")
print(f"Margin of error: ±{margin_99:.2f} hours")

# 2. Sample size for margin of error ±20 hours
desired_margin = 20
z = stats.norm.ppf(0.975)  # Using z for large sample approximation
n_required = ((z * std) / desired_margin) ** 2
print(f"\nFor margin of error ≤ ±{desired_margin} hours (95% CI):")
print(f"Required sample size: {int(np.ceil(n_required))} bulbs")

# 3. Margin of error vs confidence level
print("\nMargin of Error at Different Confidence Levels:")
conf_levels = [0.80, 0.90, 0.95, 0.99]
for conf in conf_levels:
    ci = stats.t.interval(conf, df=n-1, loc=mean, scale=sem)
    margin = (ci[1] - ci[0]) / 2
    print(f"{conf*100:>5.0f}% CI: margin = ±{margin:.2f} hours")

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
levels = np.linspace(0.50, 0.99, 50)
# The interval is symmetric about the mean, so the margin is the upper bound minus the mean
margins = [stats.t.interval(c, df=n-1, loc=mean, scale=sem)[1] - mean for c in levels]
ax.plot(levels * 100, margins, 'b-', linewidth=2)
ax.set_xlabel('Confidence Level (%)')
ax.set_ylabel('Margin of Error (hours)')
ax.set_title('Trade-off: Confidence Level vs Margin of Error')
ax.grid(True, alpha=0.3)
plt.show()
```
</details>

---

## Summary

In this notebook, you learned:

1. **Descriptive Statistics**
   - Central tendency: mean, median, mode
   - Dispersion: variance, standard deviation, IQR
   - Shape: skewness and kurtosis
   - `scipy.stats.describe()` for comprehensive statistics

2. **Probability Distributions**
   - Normal distribution for symmetric continuous data
   - Uniform distribution for equally likely outcomes
   - Exponential distribution for time between events
   - Key methods: `pdf()`, `cdf()`, `ppf()`, `rvs()`, `fit()`

3. **Hypothesis Testing**
   - One-sample t-test: comparing sample mean to population mean
   - Two-sample t-test: comparing two independent groups
   - Paired t-test: comparing paired observations
   - Chi-square test: testing independence between categorical variables

4. **Confidence Intervals**
   - Interpretation: range of plausible values for population parameter
   - Effect of sample size and confidence level
   - Sample size determination for desired precision

---

## Next Steps

Continue your SciPy journey with the next notebook:

**[02_interpolation_fitting.ipynb](02_interpolation_fitting.ipynb)** - Learn about interpolation, curve fitting, and splines for data approximation and smoothing.