# Statistics in Machine Learning 3 - Shape of Distributions

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Shape of Distributions

Now that you know center (mean/median) and spread (SD/IQR), the third pillar is shape — how the data is distributed.

1. Symmetry vs. Skewness
- Symmetric distribution: left = right (like the normal distribution). Mean ≈ Median.
- Right-skewed (positive skew): long tail on the right. Mean > Median.
- Left-skewed (negative skew): long tail on the left. Mean < Median.

2. Kurtosis
- Describes how heavy the tails are compared to normal; often pictured as how “peaked” or “flat” the curve looks.
- Leptokurtic: sharp peak, heavy tails.
- Platykurtic: flat, light tails.
- Mesokurtic: normal-like.

3. Why Shape Matters
- Shape affects which measure of center and spread is most appropriate.
- Example:
    - Normal-like → mean & SD are reliable.
    - Skewed/outliers → median & IQR are better.

#### 1. Symmetry vs. Skewness

Think of a balance scale:
- If the left and right sides look the same → symmetric.
- If one side drags out longer (a “tail”) → skewed.

👉 Quick check:
- Symmetric: mean ≈ median.
- Right-skewed: tail stretches right → mean is pulled above the median.
- Left-skewed: tail stretches left → mean is pulled below the median.

Question for you: If you saw a histogram of people’s income, do you expect it to be symmetric, right-skewed, or left-skewed?

#### 2. Kurtosis

This is less about left vs. right, and more about how “tall and skinny” or “flat and spread” the hump is, compared to a normal curve.
- Leptokurtic (positive kurtosis): tall peak, heavy tails → more extreme outliers.
- Platykurtic (negative kurtosis): flat, light tails → fewer outliers.
- Mesokurtic: similar to normal distribution.

👉 Shortcut: Kurtosis ≈ "outlier-proneness."

#### 3. Why Shape Matters

Shape guides which statistics we trust:
- Normal-like data: mean & SD are good summaries.
- Skewed or outlier-heavy data: better to use median & IQR.

Example:
- Test scores in a well-designed exam → symmetric, so mean/SD are fine.
- House prices → right-skewed (few very expensive ones), so median/IQR make more sense.
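
To see how much the choice of summary matters, the sketch below compares mean/SD against median/IQR on two small samples: roughly symmetric exam scores and right-skewed house prices. The numbers are made up for illustration (they are not real data), and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(values):
    """Return (mean, SD, median, IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return (
        statistics.mean(values),
        statistics.stdev(values),   # sample SD (divides by n - 1)
        statistics.median(values),
        q3 - q1,
    )

# Hypothetical data, for illustration only
exam_scores = [62, 68, 70, 72, 75, 75, 78, 80, 82, 88]
house_prices = [180, 200, 210, 220, 240, 250, 260, 300, 950, 2400]  # in $1,000s

for name, values in [("exam scores", exam_scores), ("house prices", house_prices)]:
    mean, sd, median, iqr = summarize(values)
    print(f"{name:12s} mean={mean:7.1f} SD={sd:7.1f} median={median:7.1f} IQR={iqr:7.1f}")
```

For the exam scores the mean and median agree, so either pair of summaries tells the same story. For the house prices the two expensive homes drag the mean and SD far away from what a typical house looks like, while the median and IQR stay close to the bulk of the data.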

### 🔹 Symmetry

A distribution is symmetric when the left side mirrors the right side.
- Example: heights of adult men in a population often look roughly symmetric.
- In symmetric, single-peaked data → mean ≈ median ≈ mode.

👉 In practice: If your histogram looks like a “bell” centered in the middle, symmetry is likely.
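
One quick way to eyeball symmetry without a plotting library is a text histogram. The sketch below is a minimal illustration on a small made-up sample; `text_histogram` is a hypothetical helper, not a library function.

```python
from collections import Counter

def text_histogram(values):
    """Return one 'value | bar' line per distinct value, in sorted order."""
    counts = Counter(values)
    return [f"{v:>3} | {'#' * counts[v]}" for v in sorted(counts)]

sample = [2, 3, 3, 4, 4, 4, 5, 5, 6]  # illustrative data
lines = text_histogram(sample)
print("\n".join(lines))

# A symmetric sample has bar lengths that read the same in both directions
bar_lengths = [line.count("#") for line in lines]
print("Symmetric bars:", bar_lengths == bar_lengths[::-1])
```

This only works as written for data with a handful of distinct values; continuous data would need to be binned first.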

### 🔹 Skewness

Skewness tells us if the distribution leans left or right — like a kite with one tail longer.

1. Right-skewed (positive skew)
- Long tail stretches to the right (higher values).
- Mean > Median (because extreme high values pull the mean up).
- Examples: income, house prices.

2. Left-skewed (negative skew)
- Long tail stretches to the left (lower values).
- Mean < Median (because extreme low values pull the mean down).
- Examples: exam scores where most did well, but a few got very low marks.

3. Symmetric (zero skew)
- Tails are balanced.
- Mean ≈ Median.
- Example: well-designed test scores, IQ scores.


#### 🔹 Why skewness matters

- Mean is sensitive to skew.
    - In right-skewed data, mean is “too high” compared to most people’s values.
    - In left-skewed data, mean is “too low.”
- Median is more robust. That’s why we often report the median income, not the mean income.

👉 Quick test for you:
If the mean salary in a company is $80,000 but the median salary is $50,000, is the distribution symmetric, right-skewed, or left-skewed?

Here’s a useful rule of thumb you can keep in mind (it usually holds for single-peaked data, though exceptions exist):
- Mean ≈ Median → Symmetric
- Mean > Median → Right-skewed
- Mean < Median → Left-skewed

Let me give you a mental picture of the three cases:
- Symmetric: bell shape, center balanced.
- Right-skewed: “mountain on the left, tail on the right” (like income).
- Left-skewed: “mountain on the right, tail on the left” (like most did well but some bombed an exam).
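
The mean-versus-median rule of thumb can be turned into a small check. This is a rough heuristic sketch: the helper name `skew_direction` and the tolerance of 0.05 standard deviations are assumptions chosen for illustration, and the data sets are made up.

```python
import statistics

def skew_direction(values, tolerance=0.05):
    """Guess the skew direction by comparing mean and median.

    The gap is measured in units of the SD so the check does not
    depend on the units of the data.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)
    if sd == 0 or abs(mean - median) <= tolerance * sd:
        return "roughly symmetric"
    return "right-skewed" if mean > median else "left-skewed"

salaries = [30, 35, 40, 45, 50, 55, 60, 120]  # in $1,000s: one high earner
exam = [35, 80, 85, 88, 90, 92, 95]           # most did well, one bombed
print(skew_direction(salaries))
print(skew_direction(exam))
print(skew_direction([1, 2, 3, 4, 5]))
```

A check like this is a first look only; a histogram or a skewness coefficient gives a more reliable answer.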

#### 🔹 Formula for Skewness

$$g_1 = \frac{\frac{1}{n}\sum^n_{i=1}(x_i - \bar{x})^3}{s^3}$$

where:

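- $\bar{x}$ is the sample mean
- $s$ is the standard deviation, computed here with divisor $n$ (so $s^3 = m_2^{1.5}$, where $m_2$ is the second central moment)
- Numerator = the third central moment (measures asymmetry)
- Denominator = cube of SD (to normalize scale)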

#### 🔹 Interpretation

- $g_1 = 0$ → no skew (e.g., perfectly symmetric data)
- $g_1 > 0$ → right-skewed
- $g_1 < 0$ → left-skewed

👉 Rule of thumb:

- |skewness| < 0.5 → fairly symmetric
- 0.5 ≤ |skewness| ≤ 1 → moderately skewed
- |skewness| > 1 → highly skewed
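
The formula and the rule-of-thumb thresholds can be sketched in plain Python. This is a minimal illustration: it uses divide-by-$n$ moments, the helper names `g1` and `describe_skew` are hypothetical, and the data sets are made up.

```python
def g1(values):
    """Moment-based skewness: m3 / m2**1.5 with divide-by-n moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_skew(skewness):
    """Apply the rule-of-thumb thresholds to a skewness value."""
    size = abs(skewness)
    if size < 0.5:
        return "fairly symmetric"
    if size <= 1:
        return "moderately skewed"
    return "highly skewed"

balanced = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
left_tail = [-x for x in right_tail]  # mirror image flips the sign

for data in (balanced, right_tail, left_tail):
    value = g1(data)
    print(f"g1 = {value:+.3f} -> {describe_skew(value)}")
```

Mirroring the data changes only the sign of $g_1$, not its size, which is why the thresholds are applied to the absolute value.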

#### 🔹 Simple Analogy

Think of skewness as checking if the “weight” of the data is evenly balanced around the mean.
- If the weight tilts right → positive skew.
- If the weight tilts left → negative skew.

Dataset (salaries in $1,000s):

30,35,40,45,50,55,60,120

Clearly, most are in the 30–60 range, with one big outlier (120).

In [26]:
import numpy as np

data = np.array([30, 35, 40, 45, 50, 55, 60, 120])

# Method 1: Using scipy (default bias=True, i.e. the biased moment estimator)
from scipy import stats
skewness_scipy = stats.skew(data)
print(f"Scipy skewness: {skewness_scipy}")

# Method 2: Manual calculation (CORRECT version)
n = len(data)
mean = np.mean(data)
# Third moment about the mean
m3 = np.sum((data - mean) ** 3) / n
# Second moment about the mean (variance)
m2 = np.sum((data - mean) ** 2) / n
# Skewness = third standardized moment
skewness_manual = m3 / (m2 ** 1.5)
print(f"Manual skewness (biased): {skewness_manual}")

# Method 3: correction factor n/((n-1)(n-2)) combined with the POPULATION SD.
# This mixes two conventions and is not a standard estimator (comparison only).
std_pop = np.std(data, ddof=0)  # population SD (divides by n)
skewness_mixed = (n / ((n-1) * (n-2))) * np.sum((data - mean) ** 3) / (std_pop ** 3)
print(f"Correction factor with population SD (non-standard): {skewness_mixed}")

# Method 4: m3 / sigma^3 with the population SD (same quantity as Method 2)
m3 = np.sum((data - mean) ** 3) / n
skewness_pop = m3 / (std_pop ** 3)
print(f"m3 / population SD cubed (biased): {skewness_pop}")

# Method 5: adjusted Fisher-Pearson coefficient, using the SAMPLE SD.
# This is the bias-adjusted value that stats.skew(data, bias=False) returns.
std_sample = np.std(data, ddof=1)  # sample SD (divides by n - 1)
skewness_adjusted = (n / ((n-1) * (n-2))) * np.sum((data - mean) ** 3) / (std_sample ** 3)
print(f"Correction factor with sample SD (adjusted Fisher-Pearson): {skewness_adjusted}")

# Method 6: m3 / s^3 with the sample SD and no correction factor
skewness_m3_s3 = m3 / (std_sample ** 3)
print(f"m3 / sample SD cubed (no correction factor): {skewness_m3_s3}")


Scipy skewness: 1.7255998349149277
Manual skewness (biased): 1.7255998349149277
Correction factor with population SD (non-standard): 2.6294854627275095
m3 / population SD cubed (biased): 1.725599834914928
Correction factor with sample SD (adjusted Fisher-Pearson): 2.152201122975111
m3 / sample SD cubed (no correction factor): 1.4123819869524166


In [28]:
import numpy as np
from scipy import stats

data = np.array([30, 35, 40, 45, 50, 55, 60, 120])
n = len(data)
mean = np.mean(data)

print("="*60)
print("SCIPY REFERENCE VALUES")
print("="*60)
print(f"Scipy biased (bias=True):    {stats.skew(data, bias=True):.10f}")
print(f"Scipy unbiased (bias=False): {stats.skew(data, bias=False):.10f}")
print()

print("="*60)
print("MANUAL CALCULATION - BIASED (Population Skewness)")
print("="*60)
print("Formula: m3 / (m2^1.5)")
print("where m3 and m2 are population moments (divide by n)")
print()

# Calculate population moments
m2 = np.sum((data - mean) ** 2) / n  # Second moment (variance)
m3 = np.sum((data - mean) ** 3) / n  # Third moment

# Calculate biased skewness
skewness_biased = m3 / (m2 ** 1.5)

print(f"Mean: {mean}")
print(f"m2 (variance): {m2}")
print(f"m3 (third moment): {m3}")
print(f"Biased skewness: {skewness_biased:.10f}")
print()

print("="*60)
print("MANUAL CALCULATION - UNBIASED (Sample Skewness)")
print("="*60)
print("Formula: [n / ((n-1)(n-2))] × Σ(xi - mean)³ / [Σ(xi - mean)²]^1.5")
print()

# Calculate components
sum_cubed_deviations = np.sum((data - mean) ** 3)
sum_squared_deviations = np.sum((data - mean) ** 2)
correction_factor = n / ((n - 1) * (n - 2))

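# Calculate "unbiased" skewness - FIRST ATTEMPT (incorrect)
# Bug: the denominator must be s**3, where s is the sample SD (ddof=1),
# not the raw sum of squared deviations raised to 1.5. The verification
# below reports False for this reason; the next cell uses a correct form.
skewness_unbiased = (correction_factor * sum_cubed_deviations) / (sum_squared_deviations ** 1.5)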

print(f"n: {n}")
print(f"Correction factor [n/((n-1)(n-2))]: {correction_factor:.10f}")
print(f"Sum of cubed deviations: {sum_cubed_deviations}")
print(f"Sum of squared deviations: {sum_squared_deviations}")
print(f"Unbiased skewness: {skewness_unbiased:.10f}")
print()

print("="*60)
print("VERIFICATION")
print("="*60)
print(f"Biased matches scipy?   {np.isclose(skewness_biased, stats.skew(data, bias=True))}")
print(f"Unbiased matches scipy? {np.isclose(skewness_unbiased, stats.skew(data, bias=False))}")
print()

print("="*60)
print("KEY INSIGHTS")
print("="*60)
print(f"• Biased value:   {skewness_biased:.4f}")
print(f"• Unbiased value: {skewness_unbiased:.4f}")
print(f"• Difference:     {skewness_unbiased - skewness_biased:.4f}")
print(f"• The unbiased estimator is {(skewness_unbiased/skewness_biased - 1)*100:.1f}% larger")
print(f"• Both indicate strong positive skew (outlier at 120)")

SCIPY REFERENCE VALUES
Scipy biased (bias=True):    1.7255998349
Scipy unbiased (bias=False): 2.1522011230

MANUAL CALCULATION - BIASED (Population Skewness)
Formula: m3 / (m2^1.5)
where m3 and m2 are population moments (divide by n)

Mean: 54.375
m2 (variance): 702.734375
m3 (third moment): 32145.99609375
Biased skewness: 1.7255998349

MANUAL CALCULATION - UNBIASED (Sample Skewness)
Formula: [n / ((n-1)(n-2))] × Σ(xi - mean)³ / [Σ(xi - mean)²]^1.5

n: 8
Correction factor [n/((n-1)(n-2))]: 0.1904761905
Sum of cubed deviations: 257167.96875
Sum of squared deviations: 5621.875
Unbiased skewness: 0.1162079376

VERIFICATION
Biased matches scipy?   True
Unbiased matches scipy? False

KEY INSIGHTS
• Biased value:   1.7256
• Unbiased value: 0.1162
• Difference:     -1.6094
• The unbiased estimator is -93.3% larger
• Both indicate strong positive skew (outlier at 120)


In [29]:
import numpy as np
from scipy import stats

data = np.array([30, 35, 40, 45, 50, 55, 60, 120])
n = len(data)
mean = np.mean(data)

print("="*60)
print("SCIPY REFERENCE VALUES")
print("="*60)
print(f"Scipy biased (bias=True):    {stats.skew(data, bias=True):.10f}")
print(f"Scipy unbiased (bias=False): {stats.skew(data, bias=False):.10f}")
print()

print("="*60)
print("MANUAL CALCULATION - BIASED (Population Skewness)")
print("="*60)
print("Formula: m3 / (m2^1.5)")
print("where m3 and m2 are population moments (divide by n)")
print()

# Calculate population moments
m2 = np.sum((data - mean) ** 2) / n  # Second moment (variance)
m3 = np.sum((data - mean) ** 3) / n  # Third moment

# Calculate biased skewness
skewness_biased = m3 / (m2 ** 1.5)

print(f"Mean: {mean}")
print(f"m2 (variance): {m2}")
print(f"m3 (third moment): {m3}")
print(f"Biased skewness: {skewness_biased:.10f}")
print()

print("="*60)
print("MANUAL CALCULATION - UNBIASED (Sample Skewness)")
print("="*60)
print("Formula: sqrt(n(n-1))/(n-2) × [Σ(xi - mean)³/n] / [(Σ(xi - mean)²/n)^1.5]")
print()

# Calculate components
sum_cubed_deviations = np.sum((data - mean) ** 3)
sum_squared_deviations = np.sum((data - mean) ** 2)

# Correct formula for unbiased skewness (Fisher-Pearson)
# G1 = [sqrt(n(n-1)) / (n-2)] * [m3 / m2^1.5]
# where m3 and m2 are sample moments
m3_sample = sum_cubed_deviations / n
m2_sample = sum_squared_deviations / n
adjustment_factor = np.sqrt(n * (n - 1)) / (n - 2)

skewness_unbiased = adjustment_factor * (m3_sample / (m2_sample ** 1.5))

print(f"n: {n}")
print(f"Adjustment factor [sqrt(n(n-1))/(n-2)]: {adjustment_factor:.10f}")
print(f"m3 (third sample moment): {m3_sample}")
print(f"m2 (second sample moment): {m2_sample}")
print(f"m3/m2^1.5: {m3_sample / (m2_sample ** 1.5):.10f}")
print(f"Unbiased skewness: {skewness_unbiased:.10f}")
print()

print("="*60)
print("VERIFICATION")
print("="*60)
print(f"Biased matches scipy?   {np.isclose(skewness_biased, stats.skew(data, bias=True))}")
print(f"Unbiased matches scipy? {np.isclose(skewness_unbiased, stats.skew(data, bias=False))}")
print()

print("="*60)
print("KEY INSIGHTS")
print("="*60)
print(f"• Biased value:   {skewness_biased:.4f}")
print(f"• Unbiased value: {skewness_unbiased:.4f}")
print(f"• Difference:     {skewness_unbiased - skewness_biased:.4f}")
print(f"• The unbiased estimator is {(skewness_unbiased/skewness_biased - 1)*100:.1f}% larger")
print(f"• Both indicate strong positive skew (outlier at 120)")

SCIPY REFERENCE VALUES
Scipy biased (bias=True):    1.7255998349
Scipy unbiased (bias=False): 2.1522011230

MANUAL CALCULATION - BIASED (Population Skewness)
Formula: m3 / (m2^1.5)
where m3 and m2 are population moments (divide by n)

Mean: 54.375
m2 (variance): 702.734375
m3 (third moment): 32145.99609375
Biased skewness: 1.7255998349

MANUAL CALCULATION - UNBIASED (Sample Skewness)
Formula: sqrt(n(n-1))/(n-2) × [Σ(xi - mean)³/n] / [(Σ(xi - mean)²/n)^1.5]

n: 8
Adjustment factor [sqrt(n(n-1))/(n-2)]: 1.2472191289
m3 (third sample moment): 32145.99609375
m2 (second sample moment): 702.734375
m3/m2^1.5: 1.7255998349
Unbiased skewness: 2.1522011230

VERIFICATION
Biased matches scipy?   True
Unbiased matches scipy? True

KEY INSIGHTS
• Biased value:   1.7256
• Unbiased value: 2.1522
• Difference:     0.4266
• The unbiased estimator is 24.7% larger
• Both indicate strong positive skew (outlier at 120)
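
The adjusted value can be written either as $\frac{n}{(n-1)(n-2)} \sum d_i^3 / s^3$ (with the sample SD) or as $\frac{\sqrt{n(n-1)}}{n-2}\, g_1$; the two are the same estimator. A short derivation sketch, using notation introduced here for convenience: $d_i = x_i - \bar{x}$, $m_k = \frac{1}{n}\sum_i d_i^k$, $g_1 = m_3 / m_2^{3/2}$, and $s$ the sample SD with divisor $n-1$, so that $s^2 = \frac{n}{n-1} m_2$ and $\sum_i d_i^3 = n\, m_3$.

$$
\begin{aligned}
G_1 &= \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^n d_i^3}{s^3} \\
    &= \frac{n}{(n-1)(n-2)} \cdot \frac{n\, m_3}{\left(\frac{n}{n-1}\right)^{3/2} m_2^{3/2}} \\
    &= \frac{n^2\,(n-1)^{3/2}}{(n-1)(n-2)\, n^{3/2}} \cdot \frac{m_3}{m_2^{3/2}} \\
    &= \frac{\sqrt{n(n-1)}}{n-2}\; g_1
\end{aligned}
$$

For $n = 8$ the factor is $\sqrt{56}/6 \approx 1.2472$, and $1.2472 \times 1.7256 \approx 2.1522$, which agrees with the printed values.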


In [14]:
from scipy import stats

data = [30,35,40,45,50,55,60,120]
skewness = stats.skew(data, bias=False) # sample skewness
print(skewness)

2.152201122975111


In [8]:
import pandas as pd

df = pd.Series([30,35,40,45,50,55,60,120])
skewness = df.skew()  # sample skewness
print(skewness)

2.1522011229751112


In [13]:
import numpy as np
import pandas as pd
from scipy import stats

data = [30, 35, 40, 45, 50, 55, 60, 120]

# Scipy skewness
skew_scipy = stats.skew(data, bias=False)  
print(f"Scipy: {skew_scipy}")

# Pandas skewness
skew_pandas = pd.Series(data).skew()
print(f"Pandas: {skew_pandas}")

# They're actually the same!
print(f"Are they equal? {np.isclose(float(skew_scipy), float(skew_pandas))}")

Scipy: 2.152201122975111
Pandas: 2.1522011229751112
Are they equal? True
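
A quick way to confirm that the single value 120 drives the skew is to recompute the skewness without it. The sketch below uses NumPy and the biased, moment-based definition ($m_3 / m_2^{1.5}$); `moment_skew` is a hypothetical helper name.

```python
import numpy as np

def moment_skew(values):
    """Biased skewness: third central moment over the second to the power 1.5."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

salaries = [30, 35, 40, 45, 50, 55, 60, 120]
without_outlier = [s for s in salaries if s != 120]

print(f"With 120:    {moment_skew(salaries):.4f}")
print(f"Without 120: {moment_skew(without_outlier):.4f}")
```

The remaining seven values are evenly spaced around 45, so their skewness is zero: the whole effect comes from one observation.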


### 🔹 What is Kurtosis?

Kurtosis measures how heavy the tails of a distribution are compared to the normal distribution. It is often described as how “peaked” or “flat” the curve looks, but the tails are what drive the number.
It’s not about skew (left vs right), but about the tails and peak.

Think of it like comparing mountains:
- Some are tall & sharp → lots of data packed near the center but also heavy tails (extremes).
- Some are flat & wide → more spread in the middle, fewer extremes.

#### 🔹 Types of Kurtosis

- Mesokurtic → normal-like (baseline).
    - Bell-shaped, moderate tails.
    - Example: height distribution.
- Leptokurtic → “peaked” with heavy tails.
    - More extreme outliers.
    - Example: financial returns (many values near mean, but occasional huge swings).
- Platykurtic → “flat” with light tails.
    - Fewer outliers, data more evenly spread.
    - Example: uniform-like distributions.

#### 🔹 Formula (classical definition)

$$\text{kurtosis} = \frac{\frac{1}{n} \sum^n_{i=1}(x_i - \bar{x})^4}{s^4}$$

- Numerator: 4th power of deviations (sensitive to extremes).
- Denominator: standard deviation to the 4th power (with $s$ computed using divisor $n$).

By convention, excess kurtosis $g_2 = \text{kurtosis} - 3$ (so normal distribution = 0).

#### 🔹 Why It Matters

- High kurtosis = prone to outliers (risky if you assume normality).
- Low kurtosis = fewer outliers, flatter distribution.
- Tells us about data reliability: in finance or ML, high kurtosis warns you that rare but extreme values could dominate.
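
A small simulation makes the three kurtosis types concrete. The sketch below draws from a uniform, a normal, and a Laplace distribution (chosen here as stand-ins for platykurtic, mesokurtic, and leptokurtic shapes) and computes excess kurtosis with divide-by-$n$ moments. The seed, sample size, and the helper name `excess_kurtosis` are arbitrary choices for illustration.

```python
import numpy as np

def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
n = 200_000
samples = {
    "uniform (platykurtic)": rng.uniform(-1, 1, n),
    "normal (mesokurtic)": rng.normal(0, 1, n),
    "Laplace (leptokurtic)": rng.laplace(0, 1, n),
}
results = {name: excess_kurtosis(x) for name, x in samples.items()}
for name, k in results.items():
    print(f"{name:24s} excess kurtosis = {k:+.2f}")
```

The theoretical excess kurtosis values are -1.2 for the uniform, 0 for the normal, and 3 for the Laplace distribution, so the simulated numbers should land close to those.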

A refined mental picture of each type:

- Leptokurtic
    - Center peak: taller and sharper (thinner in the middle).
    - Tails: heavier/longer than normal → more outliers.
    - Think: a mountain spike with deep valleys → most data near the mean, but a few way out.
- Platykurtic
    - Center peak: flatter, plateau-like.
    - Tails: lighter/shorter than normal → fewer outliers.
    - Think: a mesa (flat top hill) → data more evenly spread out, no big extremes.
- Mesokurtic (normal)
    - In-between → bell curve.

In short:
- Leptokurtic has a taller peak and longer/heavier tails.
- Platykurtic is flatter (plateau), and its tails are shorter/lighter.

#### 🔹 What “heavier tails” means

When statisticians say a distribution has heavier tails, they don’t mean that most of the data sits in the tails. They mean:

The probability of extreme values is higher than in a normal distribution with the same mean and SD.

So, if you randomly sample from the distribution, you’re more likely to occasionally get a very large or very small value.

👉 In other words: outliers are still rare, but they turn up more often, and further out, than a normal distribution would predict.

#### 🔹 Example analogy

Imagine two cities:
- City A (Normal kurtosis): most people are average height, some are tall/short, but very few are very tall or very short.
- City B (Leptokurtic): also has mostly average heights, but once in a while, you meet someone extremely tall or extremely short.

So the tails are “heavier” in the sense that the rare values are further out compared to normal.
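
The idea of “heavier tails” can be made concrete by comparing exact tail probabilities. The sketch below compares a normal distribution with a Laplace distribution scaled to the same standard deviation; the Laplace distribution is an assumption chosen here as a simple leptokurtic example, and the helper names are hypothetical.

```python
import math

def normal_tail(k):
    """P(|X - mean| > k standard deviations) for a normal distribution."""
    return math.erfc(k / math.sqrt(2))

def laplace_tail(k):
    """Same probability for a Laplace distribution with equal SD.

    A Laplace distribution with scale b has SD = b * sqrt(2),
    and P(|X - mean| > t) = exp(-t / b).
    """
    return math.exp(-k * math.sqrt(2))

for k in (2, 3, 4, 5):
    p_norm, p_lap = normal_tail(k), laplace_tail(k)
    print(f"beyond {k} SD: normal={p_norm:.2e}  Laplace={p_lap:.2e}  ratio={p_lap / p_norm:.1f}")
```

Extreme values are rare under both distributions, but far less rare under the heavier-tailed one, and the gap widens the further out you look.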

#### 🔹 Platykurtic comparison

- Platykurtic = flatter top, lighter tails.
- People’s heights are more spread around the average, but extreme outliers are less likely.
- You almost never meet the “giants” or “tiny” people.

#### 🔹 Recap: Kurtosis Types

- Mesokurtic (normal-like)
    - Bell-shaped, tails like the normal distribution.
    - Example: standardized test scores.
- Leptokurtic (high kurtosis, >0 excess)
    - Tall/narrow peak + heavy tails.
    - Most values cluster tightly near the mean, but outliers (when they appear) are very extreme.
    - Example: financial returns (mostly small day-to-day moves, but sometimes huge crashes or spikes).
- Platykurtic (low kurtosis, <0 excess)
    - Flat, plateau-like peak + light tails.
    - Data spread more evenly, fewer extreme outliers.
    - Example: uniform-like distributions.

#### 🔹 Why It Matters

- High kurtosis: means outlier-prone → need robust statistics (median, IQR, robust regression).
- Low kurtosis: fewer surprises, data is more “stable.”
- In ML/EDA: kurtosis is a quick way to check tail risk (good for finance, quality control, anomaly detection).

So the mapping goes like this:
- Leptokurtic: tall peak, fat tails → more prone to big outliers.
- Platykurtic: flat peak, thin tails → fewer outliers.
- Mesokurtic: normal curve in between.

#### 🔹 Formula for Sample Kurtosis

For a dataset $x_1, x_2, \ldots, x_n$:

$$g_2 = \frac{\frac{1}{n} \sum^n_{i=1}(x_i - \bar{x})^4}{\left(\frac{1}{n} \sum^n_{i=1}(x_i - \bar{x})^2\right)^2} - 3$$

- Numerator = 4th central moment (dominated by the values farthest from the mean)
- Denominator = squared variance (normalizes scale)
- Subtract 3 → so that normal distribution = 0 (this is called excess kurtosis).

#### 🔹 Interpretation

- $g_2 > 0$: Leptokurtic (sharper peak, heavier tails).
- $g_2 < 0$: Platykurtic (flatter peak, lighter tails).
- $g_2 = 0$: Mesokurtic (normal).

Summary of results for [2,3,3,4,4,4,5,5,6]:

- n = 9
- Mean = 4.0
- Population variance = 1.333333
- Population SD = 1.154701
- 4th central moment = 4.0
- Raw kurtosis (no subtraction) = 4.0 / 1.333333² = 2.25
- Excess kurtosis = 2.25 − 3 = −0.75

That’s below 0, which means the distribution is platykurtic → flatter peak and lighter tails than normal.

So the interpretation is:
- The data are symmetric (mean = median = 4, balanced around the middle).
- But the shape is flatter than a normal bell curve, with fewer extreme values → platykurtic.
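
The worked example can be reproduced step by step in plain Python. The sketch follows the excess-kurtosis formula with divide-by-$n$ moments; the variable names are chosen here for readability.

```python
data = [2, 3, 3, 4, 4, 4, 5, 5, 6]

n = len(data)
mean = sum(data) / n
m2 = sum((x - mean) ** 2 for x in data) / n  # population variance
m4 = sum((x - mean) ** 4 for x in data) / n  # 4th central moment
raw_kurtosis = m4 / m2 ** 2
excess_kurtosis = raw_kurtosis - 3

print(f"n = {n}, mean = {mean}")
print(f"variance = {m2:.6f}, SD = {m2 ** 0.5:.6f}")
print(f"4th central moment = {m4}")
print(f"raw kurtosis = {raw_kurtosis:.2f}, excess kurtosis = {excess_kurtosis:.2f}")
```

`scipy.stats.kurtosis(data)` with its default arguments (`fisher=True`, `bias=True`) computes this same excess kurtosis.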

💡 Memory aid:
- Skewness → left vs right tilt.
- Kurtosis → heavy vs light tails (outlier-proneness), often seen as a sharp vs flat peak.

### 🔹 Shape Summary

Symmetry vs Skewness
- Symmetric → mean ≈ median.
- Right-skew → tail stretches right, mean > median.
- Left-skew → tail stretches left, mean < median.

Kurtosis
- Mesokurtic (≈0) → normal-like.
- Leptokurtic (>0) → sharp peak, heavy tails → more extreme outliers.
- Platykurtic (<0) → flatter, light tails → fewer outliers.

Why Shape Matters
- Shape tells you which summary stats are reliable.
- Normal-like → use mean & SD.
- Skewed or outlier-prone → use median & IQR.
- Heavy tails → be cautious, extremes can dominate.

So now you’ve covered the three pillars of describing data:
- Center (mean, median)
- Spread (SD, IQR)
- Shape (skewness, kurtosis)

Together, these give a complete statistical picture of a dataset before moving on to inferential or modeling work.

### 🔹 Natural next topics

- Probability distributions
    - Why we model data with distributions (normal, uniform, binomial, etc.).
    - Normal distribution as the “benchmark” (ties back to mean/SD/skew/kurtosis).
- The Normal Distribution in depth
    - Properties, z-scores (you’ve touched on these), standardization.
    - Empirical rule (68–95–99.7).
    - Why so many real-world things are approximately normal.
- Sampling distributions & Central Limit Theorem (CLT)
    - Bridge from descriptive stats → inferential stats.
    - Key idea: for large enough samples, sample means are approximately normally distributed even if the data themselves aren’t normal.
- Inferential statistics
    - Confidence intervals.
    - Hypothesis testing (p-values, t-tests, etc.).