# ML Statistics Primer – Intuition-Focused Notebook

This notebook is meant as a **learning notebook** packed with explanations, not just code.

It covers the core statistics ideas that show up again and again in ML:

- Distributions, central tendency, and spread (mean, median, variance, standard deviation)
- Correlation and covariance
- Sampling, the Law of Large Numbers (LLN), and the Central Limit Theorem (CLT) intuition
- Simple linear regression and residuals
- Bias–variance and overfitting

Use this notebook to:
- Refresh how statistics concepts connect to ML
- Play with small simulations and visualizations
- Build intuition you can reuse in Kaggle work and real projects.


In [None]:
# ========== 1. Imports & Basic Setup ==========

from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 100


## 2. Distributions, Mean, Median, Variance, Standard Deviation

In ML, your **features** and **target** variables all have underlying distributions.
Understanding their shape helps you choose:
- Transformations (log, Box–Cox, etc.)
- Models (linear vs tree-based)
- Whether outliers are a problem.

Key quantities:
- **Mean**: average value – sensitive to outliers.
- **Median**: middle value – robust to outliers.
- **Variance / standard deviation**: how spread out the values are.
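
A tiny sketch of the "sensitive vs robust" point above. The arrays `values` and `with_outlier` are made-up illustration data, not part of the notebook's datasets.

```python
import numpy as np

# Hypothetical toy data: five ordinary values, then the same data plus one extreme value
values = np.array([2.0, 3.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 100.0)

print('Mean without / with outlier:  ', values.mean(), with_outlier.mean())
print('Median without / with outlier:', np.median(values), np.median(with_outlier))
```

The single outlier moves the mean from 3.4 to 19.5, while the median only moves from 3.0 to 3.5.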


In [None]:
# ========== 2.1 Simulate a skewed distribution ==========

np.random.seed(42)
n = 1000
data = np.random.exponential(scale=1.0, size=n)  # right-skewed

mean_val = data.mean()
median_val = np.median(data)
# ddof=1 gives the sample (n - 1) estimators instead of the population (n) versions
var_val = data.var(ddof=1)
std_val = data.std(ddof=1)

print(f'Mean:   {mean_val:.3f}')
print(f'Median: {median_val:.3f}')
print(f'Var:    {var_val:.3f}')
print(f'Std:    {std_val:.3f}')

plt.hist(data, bins=40)
plt.title('Right-skewed distribution (Exponential)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()


**Observation:**

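- For a right-skewed distribution, the **mean > median** in most cases, because the long right tail pulls the mean up. For Exponential(scale=1) the theoretical mean is 1 and the theoretical median is ln 2 ≈ 0.693.
- The standard deviation summarizes how far values typically sit from the mean, so a larger value means a wider histogram.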

In ML:
- Highly skewed features can hurt some models (especially linear regression),
  and may benefit from log transforms.
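
A minimal sketch of what a log transform does to skew. It assumes lognormal data (chosen because its log is exactly normal) and a hand-written helper `sample_skewness`; both are illustration choices, not something the cells above define.

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # strongly right-skewed, all values > 0
logged = np.log(raw)  # np.log1p is a common alternative when zeros are possible

def sample_skewness(a):
    """Third standardized moment: about 0 for symmetric data, positive for a right tail."""
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

print(f'Skewness before log: {sample_skewness(raw):.2f}')
print(f'Skewness after log:  {sample_skewness(logged):.2f}')
print(f'Mean - median before: {raw.mean() - np.median(raw):.3f}')
print(f'Mean - median after:  {logged.mean() - np.median(logged):.3f}')
```

After the transform the skewness should be close to 0 and the mean and median should nearly coincide.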


## 3. Covariance and Correlation

In ML, we care about how features move together and how they relate to the target.

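- **Covariance** measures how two variables vary together, but its size depends on the units of both variables.
- **Correlation** (Pearson) is covariance divided by the product of the two standard deviations, so it is unit-free and lies in [-1, 1].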
  - +1: perfect positive linear relationship
  - 0: no linear relationship
  - -1: perfect negative linear relationship

Correlation is critical for:
- Feature selection and redundancy checks
- Diagnosing multicollinearity in linear models
- Understanding which features are promising for predicting the target.
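
A short sketch of "correlation is standardized covariance". The variables `a` and `b` are hypothetical simulated features; rescaling `a` by 100 stands in for a change of units.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = 3.0 * a + rng.normal(size=300)

# Correlation = covariance / (std_a * std_b), using the same ddof throughout
cov_ab = np.cov(a, b, ddof=1)[0, 1]
corr_manual = cov_ab / (a.std(ddof=1) * b.std(ddof=1))
corr_numpy = np.corrcoef(a, b)[0, 1]

# Change the units of a: covariance scales by 100, correlation does not move
cov_scaled = np.cov(100 * a, b, ddof=1)[0, 1]
corr_scaled = np.corrcoef(100 * a, b)[0, 1]

print(f'Manual corr: {corr_manual:.4f} | np.corrcoef: {corr_numpy:.4f}')
print(f'Cov: {cov_ab:.3f} -> {cov_scaled:.3f} after rescaling a')
print(f'Corr after rescaling a: {corr_scaled:.4f}')
```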


In [None]:
# ========== 3.1 Simulate correlated features ==========

np.random.seed(0)
n = 500
x = np.random.normal(loc=0, scale=1, size=n)
noise = np.random.normal(loc=0, scale=0.5, size=n)
y = 2 * x + noise  # strongly linearly related

df_corr = pd.DataFrame({'x': x, 'y': y})
display(df_corr.head())

print('Covariance matrix:')
display(df_corr.cov())
print('Correlation matrix:')
display(df_corr.corr())

plt.scatter(df_corr['x'], df_corr['y'], alpha=0.4)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of x vs y (correlated)')
plt.show()


**Takeaways:**

- Correlation helps you see **linear** relationships quickly.
- A strong correlation with the target often means a feature will be useful in simple models.
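- For non-linear models (trees, neural nets), correlation is still helpful but not the only story: a feature with near-zero correlation can still be predictive through non-linear effects or interactions.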
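
A quick sketch of why correlation only captures **linear** structure. The names `x_sym` and `y_quad` are hypothetical; the target is a deterministic function of the feature, yet the correlation is essentially zero.

```python
import numpy as np

x_sym = np.linspace(-3, 3, 601)  # symmetric around 0
y_quad = x_sym ** 2              # fully determined by x_sym, but not linearly

r = np.corrcoef(x_sym, y_quad)[0, 1]
print(f'Correlation between x and x^2: {r:.6f}')
```

A tree or a model with an `x**2` feature would predict `y_quad` perfectly here, even though the correlation suggests there is nothing to find.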


## 4. Sampling, LLN, and CLT Intuition

ML almost always works with **samples** from some larger population (e.g., all future games, all future customers).

Two key ideas:

1. **Law of Large Numbers (LLN)**:
   - Sample mean \( \bar{X} \) converges to the true mean \( \mu \) as the sample size grows (for independent draws from a distribution with a finite mean).
   - More data → more stable estimates.

2. **Central Limit Theorem (CLT)**:
   - The distribution of sample means is approximately normal,
     even if the original data are not, for a large enough sample size (assuming finite variance).

These justify many ML practices:
- Using validation metrics from random splits.
- Using confidence intervals on performance metrics.
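
A rough sketch of a confidence interval on a validation metric, using the normal approximation that the CLT motivates. The `correct` array is simulated (a hypothetical classifier with true accuracy 0.8 on 400 validation examples); in practice it would come from comparing predictions with labels.

```python
import numpy as np

rng = np.random.default_rng(7)
n_valid = 400
correct = rng.random(n_valid) < 0.8  # hypothetical per-example correctness flags

acc = correct.mean()
se = np.sqrt(acc * (1 - acc) / n_valid)  # standard error of a proportion
ci_low, ci_high = acc - 1.96 * se, acc + 1.96 * se

print(f'Accuracy: {acc:.3f}')
print(f'Approx. 95% CI: [{ci_low:.3f}, {ci_high:.3f}]')
```

With a few hundred validation examples the interval is several accuracy points wide, which is worth remembering when comparing two models that differ by 1%.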


In [None]:
# ========== 4.1 Demonstrate LLN: sample means converge ==========

np.random.seed(123)
population = np.random.exponential(scale=1.0, size=100000)
true_mean = population.mean()

sample_sizes = [10, 50, 100, 500, 1000, 5000]
approx_means = []

for n in sample_sizes:
    sample = np.random.choice(population, size=n, replace=False)
    approx_means.append(sample.mean())

print('True population mean:', round(true_mean, 3))
for n, m in zip(sample_sizes, approx_means):
    print(f'Sample size {n:4d} -> sample mean {m:.3f}')


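As sample size grows, the sample mean tends to get closer to the true population mean.
The trend is not strictly monotone: each row above is a single random draw, so a smaller
sample can occasionally land closer than a larger one.
This is why **more data usually helps**, both for training and for
more reliable estimates of performance.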
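
The CLT can be checked with a similar simulation. This is a sketch under assumed settings (`sample_size = 50`, `n_repeats = 10000`, Exponential(scale=1) data, whose true mean and standard deviation are both 1).

```python
import numpy as np

rng = np.random.default_rng(123)
sample_size = 50
n_repeats = 10000

# Each row is one sample; each row mean is one draw from the distribution of sample means
samples = rng.exponential(scale=1.0, size=(n_repeats, sample_size))
sample_means = samples.mean(axis=1)

se_theory = 1.0 / np.sqrt(sample_size)  # sigma / sqrt(n) with sigma = 1
within = np.mean(np.abs(sample_means - 1.0) < 1.96 * se_theory)

print(f'Mean of sample means: {sample_means.mean():.3f} (theory: 1.000)')
print(f'Std of sample means:  {sample_means.std(ddof=1):.3f} (theory: {se_theory:.3f})')
print(f'Share within 1.96 standard errors: {within:.3f} (normal theory: about 0.95)')
```

Even though the raw data are strongly skewed, the sample means behave close to a normal distribution with standard deviation sigma / sqrt(n).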


## 5. Simple Linear Regression & Residuals

Linear regression assumes:
- \( y \approx \beta_0 + \beta_1 x + \epsilon \)
- The error term \( \epsilon \) is (ideally) zero-mean with roughly constant variance; the residuals are our estimates of it.

In ML terms:
- We fit a line to predict a continuous target.
- The **residuals** (prediction errors) tell us:
  - How well the model captures the pattern
  - If there is systematic nonlinearity or heteroskedasticity.
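
For a single feature, the least-squares line has a closed form that ties back to Section 3: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). A minimal sketch on simulated data (the names `x_demo`, `y_demo` and the true coefficients 2.0 and 1.5 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=200)
y_demo = 2.0 + 1.5 * x_demo + rng.normal(0, 1, size=200)

# Closed-form simple linear regression
beta1 = np.cov(x_demo, y_demo, ddof=1)[0, 1] / x_demo.var(ddof=1)
beta0 = y_demo.mean() - beta1 * x_demo.mean()
residuals_demo = y_demo - (beta0 + beta1 * x_demo)

# Cross-check against NumPy's least-squares polynomial fit (highest degree first)
slope_np, intercept_np = np.polyfit(x_demo, y_demo, deg=1)

print(f'Closed form: beta0={beta0:.3f}, beta1={beta1:.3f}')
print(f'np.polyfit:  beta0={intercept_np:.3f}, beta1={slope_np:.3f}')
print(f'Mean residual: {residuals_demo.mean():.2e}')
```

With an intercept in the model, the residuals average to zero by construction, so a zero mean alone does not prove the fit is good.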


In [None]:
# ========== 5.1 Fit a simple linear regression and inspect residuals ==========

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(0)
n = 200
x = np.random.uniform(-3, 3, size=n)
noise = np.random.normal(0, 1, size=n)
y = 1.5 * x + 2 + noise

X = x.reshape(-1, 1)
linreg = LinearRegression()
linreg.fit(X, y)
y_pred = linreg.predict(X)
residuals = y - y_pred

print('Estimated intercept (beta0):', linreg.intercept_)
print('Estimated slope (beta1):   ', linreg.coef_[0])
print('MSE:', mean_squared_error(y, y_pred))

plt.scatter(x, y, alpha=0.4, label='data')
xs = np.linspace(x.min(), x.max(), 100)
ys = linreg.predict(xs.reshape(-1, 1))
plt.plot(xs, ys, color='red', label='fitted line')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear regression fit')
plt.legend()
plt.show()

plt.scatter(y_pred, residuals, alpha=0.4)
plt.axhline(0)
plt.xlabel('Predicted y')
plt.ylabel('Residual (y - y_pred)')
plt.title('Residual plot')
plt.show()


In a good linear regression fit:

- Residuals should be roughly centered around 0 with no strong patterns.
- If you see **curvature** in residuals vs predictions, that suggests nonlinearity.
- If residual spread grows with predicted values, that suggests non-constant variance.
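
A sketch of the non-constant variance case, checked numerically instead of by eye. The data are simulated so that the noise scale grows with `x_h` (an assumption made for this demo); the check compares residual spread in the lower and upper half of the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
x_h = rng.uniform(0, 10, size=1000)
noise_h = rng.normal(0, 1, size=1000) * (0.2 + 0.3 * x_h)  # noise scale grows with x
y_h = 1.0 + 2.0 * x_h + noise_h

slope, intercept = np.polyfit(x_h, y_h, deg=1)
pred_h = intercept + slope * x_h
resid_h = y_h - pred_h

# Compare residual spread for small vs large predicted values
low_half = pred_h <= np.median(pred_h)
spread_low = resid_h[low_half].std(ddof=1)
spread_high = resid_h[~low_half].std(ddof=1)

print(f'Residual std, lower half of predictions: {spread_low:.2f}')
print(f'Residual std, upper half of predictions: {spread_high:.2f}')
```

A clearly larger spread in the upper half is the numeric counterpart of a "funnel" shape in the residual plot.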


## 6. Bias–Variance and Overfitting (Very Intuitive)

In ML, we balance:

- **Bias**: error from making the model too simple (underfitting).
- **Variance**: error from the model being too flexible and fitting noise (overfitting).

Typical pattern when increasing model complexity:

- Training error ↓ (gets smaller and smaller)
- Validation error ↓ then ↑ (U-shaped curve)

Your goal: choose complexity where **validation error is minimized**.
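
Bias and variance can be estimated directly by refitting the same model on many noisy copies of a dataset. This is a sketch under assumed settings: a fixed grid `x_fixed`, a true function `sin(x)`, noise standard deviation 0.3, and plain polynomial fits via `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)
x_fixed = np.linspace(-3, 3, 30)  # fixed inputs; only the noise is redrawn
f_true = np.sin(x_fixed)          # assumed true function
noise_sd, n_repeats = 0.3, 300

def bias2_and_variance(degree):
    """Refit on many noisy datasets; return (bias^2, variance) averaged over x_fixed."""
    preds = np.empty((n_repeats, x_fixed.size))
    for i in range(n_repeats):
        y_noisy = f_true + rng.normal(0, noise_sd, size=x_fixed.size)
        coefs = np.polyfit(x_fixed, y_noisy, deg=degree)
        preds[i] = np.polyval(coefs, x_fixed)
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - f_true) ** 2)  # how far the average fit is from the truth
    variance = np.mean(preds.var(axis=0))      # how much fits differ between datasets
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print(f'degree {d}: bias^2 = {b2:.4f}, variance = {var:.4f}')
```

The straight line should show the largest bias and the smallest variance; the degree-9 fit should show the reverse.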


In [None]:
# ========== 6.1 Demonstrate under/overfitting with polynomial degree ==========

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
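from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error  # also imported in 5.1; repeated so this cell runs on its own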

np.random.seed(1)
n = 200
X = np.linspace(-3, 3, n).reshape(-1, 1)
y_true = np.sin(X).ravel()
noise = np.random.normal(scale=0.3, size=n)
y = y_true + noise

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

degrees = [1, 3, 5, 9]
train_errors = []
valid_errors = []

for d in degrees:
    poly = PolynomialFeatures(degree=d, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_valid_poly = poly.transform(X_valid)

    model = Ridge(alpha=1.0)
    model.fit(X_train_poly, y_train)

    y_train_pred = model.predict(X_train_poly)
    y_valid_pred = model.predict(X_valid_poly)

    train_mse = mean_squared_error(y_train, y_train_pred)
    valid_mse = mean_squared_error(y_valid, y_valid_pred)

    train_errors.append(train_mse)
    valid_errors.append(valid_mse)

print('Degree | Train MSE | Valid MSE')
for d, tr, va in zip(degrees, train_errors, valid_errors):
    print(f'{d:6d} | {tr:9.4f} | {va:9.4f}')

plt.plot(degrees, train_errors, marker='o', label='Train MSE')
plt.plot(degrees, valid_errors, marker='o', label='Valid MSE')
plt.xlabel('Polynomial degree')
plt.ylabel('MSE')
plt.title('Underfitting vs Overfitting demo')
plt.legend()
plt.show()


**What you should see:**

- Very low degree (1) underfits → both train and valid error relatively high.
- Intermediate degree (3 or 5) often best on validation.
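- Very high degree (9) may overfit → train error keeps shrinking while validation error stops improving or gets worse. With 140 training points and `Ridge(alpha=1.0)` in this demo, the gap can be small.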

This is the core idea behind:
- Choosing tree depth / number of leaves
- Choosing polynomial degree / neural network size
- Using regularization (Ridge, Lasso, weight decay).
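
A sketch of how regularization strength acts as a complexity knob. It uses a hand-written closed-form ridge solver (`ridge_fit`, a helper defined here, not scikit-learn's implementation) on a small simulated dataset with degree-10 polynomial features; all names and settings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x_small = rng.uniform(-1, 1, size=15)
y_small = np.sin(3 * x_small) + rng.normal(0, 0.3, size=15)

degree = 10
Phi = np.vander(x_small, N=degree + 1, increasing=True)  # columns: 1, x, ..., x^degree

def ridge_fit(Phi, y, alpha):
    """Closed-form ridge: solve (Phi^T Phi + alpha * I) w = Phi^T y, intercept unpenalized."""
    penalty = alpha * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)

alphas = [1e-4, 1e-2, 1.0, 100.0]
coef_norms, train_mses = [], []
for alpha in alphas:
    w = ridge_fit(Phi, y_small, alpha)
    coef_norms.append(np.linalg.norm(w[1:]))  # size of the penalized coefficients
    train_mses.append(np.mean((Phi @ w - y_small) ** 2))

print(' alpha  | coef norm | train MSE')
for alpha, norm, mse in zip(alphas, coef_norms, train_mses):
    print(f'{alpha:7.4f} | {norm:9.3f} | {mse:9.4f}')
```

Larger `alpha` shrinks the coefficients and raises training error: the model becomes simpler (more bias, less variance) without changing the polynomial degree.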
