# Chapter 32: Random and Statistics

This notebook covers Python's `random` module for pseudorandom number generation and the `statistics` module for descriptive statistics. You will learn how to generate random values, sample from sequences, control reproducibility with seeds, and compute common statistical measures.

## Key Concepts
- **Seeds**: `random.seed()` makes pseudorandom sequences reproducible
- **Random values**: `random()`, `randint()`, `uniform()`, `gauss()`
- **Sequence operations**: `choice()`, `choices()`, `sample()`, `shuffle()`
- **Distributions**: `gauss()`, `expovariate()`, `triangular()`, `betavariate()`
- **Descriptive statistics**: `mean()`, `median()`, `mode()`, `stdev()`, `variance()`
- **Advanced statistics**: `quantiles()`, `correlation()`, `linear_regression()`

## Section 1: Pseudorandom Number Generation and Seeds

The `random` module generates pseudorandom numbers with a deterministic algorithm (the Mersenne Twister). Seeding the generator with the same value reproduces the same sequence, which is essential for testing and debugging.

In [None]:
import random

# random() returns a float in [0.0, 1.0)
random.seed(42)
values: list[float] = [random.random() for _ in range(5)]
print("Five random floats [0, 1):")
for i, v in enumerate(values):
    print(f"  [{i}] {v:.6f}")

# Resetting the seed reproduces the same sequence
random.seed(42)
reproduced: list[float] = [random.random() for _ in range(5)]
print(f"\nReproducible: {values == reproduced}")
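
When several parts of a program each need their own reproducible stream, one option is a dedicated `random.Random` instance per stream instead of the shared module-level generator. The sketch below is illustrative only; the names `rng_a` and `rng_b` are made up for this example.

```python
import random

# Two independent generators seeded identically produce identical streams.
rng_a = random.Random(123)
rng_b = random.Random(123)

seq_a = [rng_a.random() for _ in range(3)]
random.random()  # drawing from the module-level generator does not disturb rng_b
seq_b = [rng_b.random() for _ in range(3)]

print(f"Streams match: {seq_a == seq_b}")
```

Because each instance carries its own state, seeding one never changes what another produces.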

In [None]:
import random

random.seed(42)

# randint(a, b) returns an integer in [a, b] (inclusive)
dice_rolls: list[int] = [random.randint(1, 6) for _ in range(10)]
print(f"Dice rolls: {dice_rolls}")

# randrange(start, stop, step) -- like range() but returns one random value
even: int = random.randrange(0, 100, 2)
print(f"\nRandom even [0, 100): {even}")

# uniform(a, b) returns a float between a and b (whether b itself can occur depends on floating-point rounding)
temperature: float = random.uniform(20.0, 35.0)
print(f"Random temperature: {temperature:.2f} C")
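
The difference between the inclusive upper bound of `randint()` and the exclusive `stop` of `randrange()` is easy to overlook. A small check, assuming a local `random.Random` instance so the module-level state is left alone:

```python
import random

rng = random.Random(42)

# randint(1, 6) can return 6; randrange(1, 6) cannot, just like range(1, 6).
rolls = [rng.randint(1, 6) for _ in range(1000)]
picks = [rng.randrange(1, 6) for _ in range(1000)]

print(f"randint(1, 6)   min/max: {min(rolls)}/{max(rolls)}")
print(f"randrange(1, 6) min/max: {min(picks)}/{max(picks)}")
```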

## Section 2: Choosing and Sampling from Sequences

The `random` module provides several ways to select elements from sequences: single selection, weighted selection, sampling without replacement, and in-place shuffling.

In [None]:
import random

random.seed(42)

colors: list[str] = ["red", "green", "blue", "yellow"]

# choice() picks one element
picked: str = random.choice(colors)
print(f"choice: {picked}")
print(f"picked is in colors: {picked in colors}")

# choices() picks k elements WITH replacement (may repeat)
picks: list[str] = random.choices(colors, k=6)
print(f"\nchoices(k=6): {picks}")

# Weighted choices
weights: list[int] = [10, 1, 1, 1]  # red is 10x more likely
weighted: list[str] = random.choices(colors, weights=weights, k=10)
print(f"\nWeighted (red=10x): {weighted}")
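
Weights are relative, so `[10, 1, 1, 1]` gives red a probability of 10/13 rather than making it certain. One way to see this is to count a large number of draws; the sample size of 13,000 is an arbitrary choice for this sketch.

```python
import random
from collections import Counter

rng = random.Random(42)
colors = ["red", "green", "blue", "yellow"]
weights = [10, 1, 1, 1]

draws = rng.choices(colors, weights=weights, k=13_000)
counts = Counter(draws)

red_share = counts["red"] / len(draws)
expected_share = weights[0] / sum(weights)  # 10 / 13

print(f"Observed red share: {red_share:.3f}, expected: {expected_share:.3f}")
```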

In [None]:
import random

random.seed(42)

# sample() picks k UNIQUE elements (without replacement)
population: list[int] = list(range(100))
selected: list[int] = random.sample(population, k=5)
print(f"sample(k=5): {selected}")
print(f"All unique:  {len(selected) == len(set(selected))}")

# shuffle() randomizes a list IN PLACE
deck: list[str] = ["A", "K", "Q", "J", "10"]
print(f"\nBefore shuffle: {deck}")
random.shuffle(deck)
print(f"After shuffle:  {deck}")

# The same elements are present, just reordered
print(f"Same elements:  {sorted(deck) == ['10', 'A', 'J', 'K', 'Q']}")
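
Two related behaviors are worth checking directly: `sample()` with `k=len(seq)` can serve as a non-mutating shuffle, and asking for more unique elements than the population holds raises `ValueError`. A minimal sketch:

```python
import random

rng = random.Random(7)
deck = ["A", "K", "Q", "J", "10"]

# sample() with k=len(deck) returns a shuffled copy and leaves deck untouched.
shuffled = rng.sample(deck, k=len(deck))
print(f"Original:      {deck}")
print(f"Shuffled copy: {shuffled}")

# Requesting more unique elements than exist raises ValueError.
error_message = ""
try:
    rng.sample(deck, k=len(deck) + 1)
except ValueError as exc:
    error_message = str(exc)
print(f"ValueError: {error_message}")
```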

## Section 3: Random Distributions

Beyond uniform random numbers, the `random` module can generate values from various probability distributions.

In [None]:
import random

random.seed(42)

# Gaussian (normal) distribution
gaussian_values: list[float] = [random.gauss(mu=0.0, sigma=1.0) for _ in range(10)]
print("Gaussian (mu=0, sigma=1):")
for v in gaussian_values:
    print(f"  {v:+.4f}")

# Triangular distribution (most values near the mode)
tri_values: list[float] = [
    random.triangular(low=0.0, high=10.0, mode=7.0) for _ in range(5)
]
print(f"\nTriangular (low=0, high=10, mode=7):")
for v in tri_values:
    print(f"  {v:.4f}")
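
A handful of draws says little about a distribution's shape. With a larger sample, the observed mean and standard deviation should land near the theoretical values; for a triangular distribution the mean is (low + high + mode) / 3. The sample size below is an arbitrary choice for this sketch.

```python
import random
import statistics

rng = random.Random(42)
n = 20_000

gauss_draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
tri_draws = [rng.triangular(0.0, 10.0, 7.0) for _ in range(n)]

# Theoretical mean of a triangular distribution: (low + high + mode) / 3
tri_expected = (0.0 + 10.0 + 7.0) / 3

print(f"gauss mean:      {statistics.fmean(gauss_draws):+.3f} (theory: 0)")
print(f"gauss stdev:     {statistics.stdev(gauss_draws):.3f} (theory: 1)")
print(f"triangular mean: {statistics.fmean(tri_draws):.3f} (theory: {tri_expected:.3f})")
```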

In [None]:
import random

random.seed(42)

# Exponential distribution (models time between events)
lambd: float = 1.0 / 5.0  # average 5 minutes between events
wait_times: list[float] = [random.expovariate(lambd) for _ in range(8)]
print("Exponential wait times (avg=5 min):")
for t in wait_times:
    print(f"  {t:.2f} min")

# Beta distribution (values between 0 and 1)
beta_values: list[float] = [random.betavariate(alpha=2, beta=5) for _ in range(5)]
print(f"\nBeta(alpha=2, beta=5):")
for v in beta_values:
    print(f"  {v:.4f}")
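
The argument to `expovariate()` is a rate, not a mean: the expected value is `1 / lambd`. A quick empirical check, assuming 50,000 draws are enough for the sample mean to settle:

```python
import random
import statistics

rng = random.Random(42)
lambd = 1.0 / 5.0  # rate: one event per 5 minutes on average

waits = [rng.expovariate(lambd) for _ in range(50_000)]
mean_wait = statistics.fmean(waits)
share_under_mean = sum(w < 5.0 for w in waits) / len(waits)

print(f"Mean wait: {mean_wait:.2f} min (theory: {1 / lambd:.2f})")
print(f"Share of waits under 5 min: {share_under_mean:.3f}")
```

In theory the share of waits shorter than the mean is 1 - 1/e, roughly 63%, which reflects how skewed the exponential distribution is.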

## Section 4: Basic Descriptive Statistics

The `statistics` module (Python 3.4+) provides functions for calculating common statistical measures. Most functions accept any sequence or iterable of numeric data; `mode()` also works with non-numeric values such as strings.

In [None]:
import statistics

data: list[int] = [1, 2, 3, 4, 5]

# Central tendency
print(f"Data: {data}")
print(f"mean:   {statistics.mean(data)}")
print(f"median: {statistics.median(data)}")

# Median with even-length data returns average of two middle values
even_data: list[int] = [1, 3, 5, 7]
print(f"\nData: {even_data}")
print(f"median: {statistics.median(even_data)}")

# Mode -- most frequent value (works with non-numeric data; on a tie, Python 3.8+ returns the first mode encountered)
grades: list[str] = ["A", "B", "A", "C", "A", "B"]
print(f"\nGrades: {grades}")
print(f"mode:   {statistics.mode(grades)}")
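
Two edge cases come up often with these functions: ties for the most frequent value, and empty input. The sketch below uses `multimode()` (Python 3.8+) for the first and catches `StatisticsError` for the second; the `votes` data is made up for illustration.

```python
import statistics

votes = ["red", "blue", "red", "blue", "green"]

# On a tie, mode() (Python 3.8+) returns the first mode encountered.
first_mode = statistics.mode(votes)
# multimode() returns every value tied for most frequent, in first-seen order.
all_modes = statistics.multimode(votes)
print(f"mode: {first_mode}, multimode: {all_modes}")

# Empty input raises StatisticsError instead of returning a placeholder value.
empty_failed = False
try:
    statistics.mean([])
except statistics.StatisticsError:
    empty_failed = True
print(f"mean([]) raised StatisticsError: {empty_failed}")
```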

In [None]:
import statistics

# Different types of mean (harmonic_mean requires Python 3.6+, geometric_mean 3.8+)
values: list[float] = [1.0, 2.0, 4.0, 8.0]

print(f"Data: {values}")
print(f"Arithmetic mean:  {statistics.mean(values)}")
print(f"Geometric mean:   {statistics.geometric_mean(values):.4f}")
print(f"Harmonic mean:    {statistics.harmonic_mean(values):.4f}")

# Median variants
data: list[int] = [1, 3, 5, 7]
print(f"\nData: {data}")
print(f"median:      {statistics.median(data)}")
print(f"median_low:  {statistics.median_low(data)}")
print(f"median_high: {statistics.median_high(data)}")
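
The geometric and harmonic means can be computed by hand from their definitions, which makes it clearer what the library functions return. A minimal sketch (requires Python 3.8+ for `math.prod`):

```python
import math
import statistics

values = [1.0, 2.0, 4.0, 8.0]
n = len(values)

# Geometric mean: n-th root of the product of the values.
geometric = math.prod(values) ** (1 / n)
# Harmonic mean: n divided by the sum of the reciprocals.
harmonic = n / sum(1 / v for v in values)

print(f"Geometric by hand: {geometric:.4f}, library: {statistics.geometric_mean(values):.4f}")
print(f"Harmonic by hand:  {harmonic:.4f}, library: {statistics.harmonic_mean(values):.4f}")
```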

## Section 5: Spread and Variability

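Variance and standard deviation measure how far values spread around the mean. The `statistics` module provides sample versions (`stdev()`, `variance()`, which divide by N-1) and population versions (`pstdev()`, `pvariance()`, which divide by N).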

In [None]:
import statistics

data: list[int] = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample standard deviation and variance (N-1 denominator)
print(f"Data: {data}")
print(f"stdev:    {statistics.stdev(data):.4f}")
print(f"variance: {statistics.variance(data):.4f}")

# Population standard deviation and variance (N denominator)
print(f"\npstdev:    {statistics.pstdev(data):.4f}")
print(f"pvariance: {statistics.pvariance(data):.4f}")

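# For this data the population stdev is 2.0; the sample stdev (N-1) is larger, about 2.14
print(f"\npstdev close to 2.0: {abs(statistics.pstdev(data) - 2.0) < 1e-9}")
print(f"stdev > pstdev:      {statistics.stdev(data) > statistics.pstdev(data)}")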
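
The only difference between the sample and population formulas is the denominator. Computing both by hand from the sum of squared deviations shows this directly; the variable names here are illustrative.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)  # same formula as statistics.variance()
population_variance = squared_deviations / n    # same formula as statistics.pvariance()

print(f"Sample variance:     {sample_variance:.4f} vs {statistics.variance(data):.4f}")
print(f"Population variance: {population_variance:.4f} vs {statistics.pvariance(data):.4f}")
```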

## Section 6: Quantiles and Advanced Statistics

Python 3.8 added `quantiles()`, which returns the n - 1 cut points that divide data into n intervals of equal probability. Python 3.10 added `correlation()` and `linear_regression()`.

In [None]:
import statistics

data: list[int] = list(range(1, 101))  # 1 to 100

# Quartiles (3 cut points that split the data into 4 groups)
quartiles: list[float] = statistics.quantiles(data, n=4)
print(f"Quartiles (Q1, Q2, Q3): {quartiles}")

# Deciles (9 cut points that split the data into 10 groups)
deciles: list[float] = statistics.quantiles(data, n=10)
print(f"Deciles: {deciles}")

# Percentiles using quantiles
percentiles: list[float] = statistics.quantiles(data, n=100)
print(f"\n25th percentile: {percentiles[24]}")
print(f"50th percentile: {percentiles[49]}")
print(f"75th percentile: {percentiles[74]}")
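
`quantiles()` also takes a `method` argument. The default, `"exclusive"`, treats the data as a sample from a larger population; `"inclusive"` treats the data as the whole population, with its minimum and maximum as the extremes. A short comparison on the same 1 to 100 data:

```python
import statistics

data = list(range(1, 101))

exclusive = statistics.quantiles(data, n=4)  # default method
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(f"exclusive quartiles: {exclusive}")
print(f"inclusive quartiles: {inclusive}")
```

The two methods agree on the median here and differ slightly at the first and third quartiles, so it helps to state which one was used when reporting results.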

In [None]:
import statistics

# Correlation between two variables (Python 3.10+)
hours_studied: list[float] = [1, 2, 3, 4, 5, 6, 7, 8]
test_scores: list[float] = [52, 58, 65, 70, 74, 80, 85, 90]

r: float = statistics.correlation(hours_studied, test_scores)
print(f"Correlation (hours vs scores): {r:.4f}")
print(f"Strong positive correlation: {r > 0.9}")

# Linear regression (Python 3.10+)
slope, intercept = statistics.linear_regression(hours_studied, test_scores)
print(f"\nLinear regression: score = {slope:.2f} * hours + {intercept:.2f}")

# Predict score for 10 hours of study
predicted: float = slope * 10 + intercept
print(f"Predicted score for 10 hours: {predicted:.1f}")

## Section 7: Practical Patterns

This section applies `random` and `statistics` to common tasks such as simulations and random string generation.

In [None]:
import random
import statistics

def simulate_dice_rolls(num_rolls: int, seed: int = 42) -> dict[str, float]:
    """Simulate dice rolls and return summary statistics."""
    random.seed(seed)
    rolls: list[int] = [random.randint(1, 6) for _ in range(num_rolls)]
    return {
        "count": float(len(rolls)),
        "mean": statistics.mean(rolls),
        "median": statistics.median(rolls),
        "stdev": statistics.stdev(rolls),
        "min": float(min(rolls)),
        "max": float(max(rolls)),
    }

# Small sample
small: dict[str, float] = simulate_dice_rolls(20)
print("20 rolls:")
for key, value in small.items():
    print(f"  {key:>7}: {value:.2f}")

# Large sample (law of large numbers: mean approaches 3.5)
large: dict[str, float] = simulate_dice_rolls(100_000)
print("\n100,000 rolls:")
for key, value in large.items():
    print(f"  {key:>7}: {value:.4f}")
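
A simulation is easier to judge against the exact values it should converge to. For a fair die these follow from treating the six faces as a population. The sketch below also uses a local `random.Random` instance, one way to avoid reseeding the module-level generator as a side effect.

```python
import random
import statistics

faces = [1, 2, 3, 4, 5, 6]
theory_mean = statistics.fmean(faces)    # 3.5
theory_stdev = statistics.pstdev(faces)  # sqrt(35/12), about 1.708

rng = random.Random(42)  # local generator: module-level state is untouched
rolls = [rng.choice(faces) for _ in range(100_000)]

sim_mean = statistics.fmean(rolls)
sim_stdev = statistics.pstdev(rolls)

print(f"mean:  simulated {sim_mean:.4f}, theory {theory_mean:.4f}")
print(f"stdev: simulated {sim_stdev:.4f}, theory {theory_stdev:.4f}")
```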

In [None]:
import random

def generate_password(
    length: int = 16,
    seed: int | None = None,
) -> str:
    """Generate a random password from letters, digits, and a few symbols (not cryptographically secure)."""
    if seed is not None:
        random.seed(seed)
    chars: str = (
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "!@#$%^&*"
    )
    return "".join(random.choices(chars, k=length))

# Generate a few passwords
random.seed(42)
for i in range(5):
    print(f"Password {i + 1}: {generate_password()}")

# Note: for real security, use the secrets module instead
import secrets
secure_token: str = secrets.token_urlsafe(16)
print(f"\nSecure token (secrets): {secure_token}")
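
For passwords that need to resist guessing, the same character-by-character approach can draw from `secrets` instead of `random`. The function name `generate_secure_password` is hypothetical, and the alphabet simply mirrors the one used above.

```python
import secrets
import string


def generate_secure_password(length: int = 16) -> str:
    """Build a password by drawing each character with secrets.choice()."""
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*"
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = generate_secure_password()
print(f"Secure password: {password}")
```

Unlike `random`, `secrets` cannot be seeded, so the output is intentionally not reproducible.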

## Summary

### `random` Module
- **`seed(n)`**: Set the random seed for reproducibility
- **`random()`**: Float in [0.0, 1.0)
- **`randint(a, b)`**: Integer in [a, b] inclusive
- **`uniform(a, b)`**: Float between a and b (whether b can occur depends on floating-point rounding)
- **`choice(seq)`**: Pick one element from a sequence
- **`choices(seq, k=n)`**: Pick k elements with replacement
- **`sample(seq, k=n)`**: Pick k unique elements without replacement
- **`shuffle(list)`**: Randomize a list in place
- **`gauss(mu, sigma)`**: Gaussian/normal distribution
- **`expovariate(lambd)`**: Exponential distribution

### `statistics` Module
- **Central tendency**: `mean()`, `median()`, `mode()`, `geometric_mean()`, `harmonic_mean()`
- **Spread**: `stdev()`, `variance()`, `pstdev()`, `pvariance()`
- **Median variants**: `median_low()`, `median_high()`
- **Quantiles**: `quantiles(data, n=4)` for quartiles, percentiles, etc.
- **Regression**: `correlation()`, `linear_regression()` (Python 3.10+)

### Important Notes
- `random` is **not** cryptographically secure -- use `secrets` for security
- `statistics.stdev()` uses **sample** standard deviation (N-1); use `pstdev()` for population
- Always set a seed in tests and simulations for reproducibility
- `sample()` raises `ValueError` if k > population size; `choices()` does not