# Statistics

### What is Statistics?
Statistics is the science of working with data - **collecting**, **summarizing**, analyzing, and interpreting it to understand the real world through numbers.
Example:
- Suppose we collect the marks of all students in a class.
- The **mean** tells us the class’s average performance, the **median** shows the middle score, and the **standard deviation** tells us how spread out the marks are.
That’s statistics - finding meaning in data.
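
As a quick illustration of the example above, the sketch below computes the three summaries with Python's standard `statistics` module. The marks are made-up values chosen only for illustration, and the population standard deviation (`pstdev`) is used on the assumption that the list covers the whole class.

```python
import statistics

# Hypothetical marks for a small class (illustrative values only).
marks = [55, 62, 70, 70, 78, 85, 91]

mean_mark = statistics.mean(marks)      # average performance
median_mark = statistics.median(marks)  # middle score after sorting
spread = statistics.pstdev(marks)       # population SD: the list is the whole class

print(f"mean={mean_mark:.2f}, median={median_mark}, std={spread:.2f}")
```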

### What is Statistics for Machine Learning?
Statistics for ML helps us **understand the data before we train models.**
A model learns patterns from data, so if we don’t understand our data’s behavior, we risk building biased or misleading models.
We use statistics in ML to study **the center**, **spread**, **shape**, and **relationships** within data.
It forms the foundation for preprocessing, feature scaling, and model evaluation.

### Types of Statistics
There are two main types of statistics:
1. **Descriptive Statistics** - describe or summarize data.
   Example: the average mark of a class is 78.
2. **Inferential Statistics** - draw conclusions about a population from a sample.
   Example: using one class’s marks to estimate the average for the whole university.

In the data-preparation stage of machine learning, we rely mostly on descriptive statistics, because the goal at that point is to understand and prepare the dataset. Inferential ideas come back later, when we evaluate how well a model generalizes to unseen data.

## Descriptive Statistics and Distributions

Descriptive statistics helps us summarize and visualize data using numbers and graphs.
We’ll learn how to describe data’s center, spread, and shape.

### Measures of Central Tendency
1. **Mean** - The average value. Good for normal, balanced data.
    Formula: 1/N Σ(xi)
2. **Median** - The middle value when data is sorted. Good for skewed or outlier-rich data.
    Formula: (N is odd) middle value; (N is even) average of two middle values.
3. **Mode** - The most frequent value. Mainly used for categorical data.

| Situation               | Best Measure       | Reasoning                          |
|-------------------------|--------------------|------------------------------------|
| Symmetric Distribution (e.g. height, temperature)   | Mean               | All values contribute equally.     |
| Skewed Distribution (e.g. income, house prices)     | Median             | Less affected by outliers.         |
| Categorical Data (e.g. color, brand)                 | Mode               | Data is non-numeric; the mode shows the most common category. |

**Common Mistakes**
1. Using mean with skewed data → misleading
2. Using median for categorical → meaningless
3. Using mode for continuous numeric → rarely helpful
4. Forgetting to check distribution before imputing missing values.
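
A small sketch of how the choice of measure plays out, using the standard `statistics` module. The income values match the ones used in the outlier cell later in this notebook; the colour list is a made-up categorical feature added only for illustration.

```python
import statistics

incomes = [22, 25, 27, 29, 35, 40, 42, 100, 110, 115]    # right-skewed numeric data
colors = ["red", "blue", "red", "green", "red", "blue"]   # hypothetical categorical data

income_mean = statistics.mean(incomes)
income_median = statistics.median(incomes)
top_color = statistics.mode(colors)   # most frequent category

# The three large incomes pull the mean well above the median,
# so the median is the safer fill value for missing incomes here.
print("mean:", income_mean, "median:", income_median, "mode:", top_color)
```

Checking the gap between mean and median like this is a cheap first test for skew before deciding how to impute missing values.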

### Measures of Spread
1. **Range** - The difference between the maximum and minimum values.
2. **Variance** - The average of the squared differences from the mean.
    i. **Population Variance (σ²)** - Used when you have data for the entire population.
    Formula: 1/N Σ (xi - μ)², where μ is the population mean.
   ii. **Sample Variance (s²)** - Used when you have a sample from a larger population.
    Formula: 1/(N-1) Σ (xi - x̄)², where x̄ is the sample mean; dividing by N-1 instead of N is Bessel's correction.
3. **Standard Deviation** - The square root of the variance.
    i. **Population Standard Deviation(σ)** - Square root of population variance.
    ii. **Sample Standard Deviation(s)** - Square root of sample variance.
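
The population and sample versions differ only in the divisor, which NumPy exposes through the `ddof` argument. The sketch below uses made-up values; note that NumPy defaults to `ddof=0` (population) while pandas' `.std()` and `.var()` default to `ddof=1` (sample), so the two libraries can give different answers for the same data.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative values

pop_var = np.var(x, ddof=0)      # divide by N
sample_var = np.var(x, ddof=1)   # divide by N - 1 (Bessel's correction)
pop_sd = np.sqrt(pop_var)
sample_sd = np.sqrt(sample_var)

print(f"population: var={pop_var:.3f}, sd={pop_sd:.3f}")
print(f"sample:     var={sample_var:.3f}, sd={sample_sd:.3f}")
```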

**Variance & SD in ML Pipeline**
1. Feature Scaling: Standardization makes features comparable by rescaling each one to mean = 0 and standard deviation = 1.
    Formula: z = (x - μ) / σ
2. Regularization: Ridge and Lasso shrink coefficients, which lowers the model's variance and helps prevent overfitting.
3. Model Diagnostics: High variance in model performance across cross-validation folds suggests instability.
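
The standardization formula from item 1 can be sketched directly in NumPy. The two features (age and income) and their values are hypothetical, and the population SD is used here as an assumption.

```python
import numpy as np

# Hypothetical features on very different scales: age (years) and income.
X = np.array([[25, 30_000],
              [32, 45_000],
              [47, 52_000],
              [51, 90_000]], dtype=float)

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature population SD (ddof=0)
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```

In a real pipeline, μ and σ would normally be computed on the training split only and then reused for validation and test data.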


**Bias-Variance Trade-off**

In model evaluation, variance also refers to how much a model’s predictions change with different training data.
A high-variance model memorizes the training set (overfitting), while a low-variance, high-bias model underfits. So the variance of data and the variance of a model's predictions are different concepts, but they share the same intuition: instability due to spread.

| Concept                | Variance (σ²)                       | Standard Deviation (σ)         |
|------------------------|-------------------------------------|---------------------------------|
| Definition              | Average of squared deviations       | Square root of variance                         |
| Formula                 | σ² = 1/N Σ (xi - μ)²               | σ = √σ²                         |
| Units                   | Squared units of data               | Same units as data              |
| Interpretation          | Harder to read directly (squared units) | Intuitive: typical distance of a value from the mean |
| Use Case                | Theoretical/ statistical analysis    | Communication & scaling        |


### Data Distributions
Understanding how data is distributed is crucial for analysis.
1. **Normal Distribution** - Bell-shaped curve, symmetric about the mean.
2. **Skewed Distribution** - Data is not symmetric; can be left or right-skewed.
3. **Bimodal Distribution** - Two peaks in the data.

Visualizations like histograms and box plots help us see these distributions clearly.

## Understanding Percentiles, Quartiles, IQR, and Z-Score in Machine Learning

### Percentiles
Percentile tells us the value below which a given percentage of observations fall.
For example, the 70th percentile is the value below which 70% of the observations may be found.
    Formula: Percentile rank = (number of values below x / total number of values) × 100
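
The percentile rank formula above translates into a few lines of plain Python. The function name `percentile_rank` and the marks are illustrative choices, not part of any library.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

marks = [40, 55, 60, 65, 70, 75, 80, 85, 90, 95]  # hypothetical marks
rank_80 = percentile_rank(marks, 80)
print(f"A mark of 80 has percentile rank {rank_80}")
```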

### Quartiles
Quartiles divide data into four equal parts:
1. Q1 (25th percentile) - 25% of data below this value.
2. Q2 (50th percentile/median) - 50% of data below this value.
3. Q3 (75th percentile) - 75% of data below this value.

We often use percentiles to handle outliers in data.
For example, when preprocessing numerical features, we can remove values below the 1st percentile or above the 99th percentile to reduce the effect of extreme data points.
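
A minimal sketch of quartiles and percentile-based trimming with `np.percentile`, which uses linear interpolation by default. The feature is a made-up sequence from 1 to 100 so that the cut-offs are easy to check by hand.

```python
import numpy as np

values = np.arange(1, 101, dtype=float)  # illustrative feature: 1, 2, ..., 100

q1, q2, q3 = np.percentile(values, [25, 50, 75])   # quartiles
p01, p99 = np.percentile(values, [1, 99])          # trimming cut-offs

# Keep only the values between the 1st and 99th percentiles.
trimmed = values[(values >= p01) & (values <= p99)]

print("Q1, Q2, Q3:", q1, q2, q3)
print("kept", len(trimmed), "of", len(values), "values")
```

Trimming discards rows; if the rows must be kept, clipping the values to the same cut-offs with `np.clip` is a common alternative.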

### Interquartile Range (IQR)
IQR is the range between Q1 and Q3 and represents the middle 50% of the data.
    Formula: IQR = Q3 - Q1

- A smaller IQR means data points are tightly packed.
- A larger IQR means data has high variability.

**Outlier Detection Rule**:
Any point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is an outlier.
Here 1.5 is a commonly used multiplier, but it can be adjusted based on the specific dataset and context.

IQR-based filtering is common before training models to prevent extreme values from skewing results, for example, in housing price prediction, or salary datasets, where a few very large values can distort the model’s understanding of normal behavior.
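
The outlier rule can be wrapped in a small helper. The function name `iqr_outliers`, the parameter `k`, and the sample values are assumptions made for this sketch; `k=1.5` is the conventional multiplier mentioned above.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the Q1 - k*IQR and Q3 + k*IQR fences."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, arr[(arr < lower) | (arr > upper)]

prices = [25, 27, 28, 29, 30, 31, 35, 5000]  # hypothetical values with one extreme point
lower, upper, outliers = iqr_outliers(prices)
print("fences:", lower, upper)
print("outliers:", outliers)
```

Raising `k` makes the rule more lenient, lowering it makes the rule stricter.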

### Z-Score
Z-score measures how many standard deviations a data point is from the mean.
    Formula: Z = (X - μ) / σ
Where:
- X is the value
- μ is the mean
- σ is the standard deviation
- A Z-score of 0 means the value is exactly at the mean.
- A positive Z-score means the value is above the mean.
- A negative Z-score means the value is below the mean.

Z-score normalization is part of standard scaling, which transforms all features to have mean 0 and standard deviation 1.
This is critical for models like KNN, SVM, or Gradient Descent-based models, which are sensitive to feature scale.


|Concept          | Definition                                      | Formula                             | Use Case in ML Pipeline                      |
|-----------------|-------------------------------------------------|-------------------------------------|----------------------------------------------|
| Percentiles     | Value below which a certain percentage of data falls | Percentile rank = (number of values below x / total number of values) × 100 | Handling outliers by trimming extreme values |
| Quartiles       | Values that divide data into four equal parts   | Q1 (25th percentile), Q2 (50th percentile/median), Q3 (75th percentile) | Summarizing data distribution and spread     |
| IQR             | Range between Q1 and Q3 representing middle 50% of data | IQR = Q3 - Q1                        | Outlier detection and removal               |
| Z-Score         | Number of standard deviations a data point is from the mean | Z = (X - μ) / σ                  | Feature scaling for distance-based models    | 

### If standard deviation already tells how far something sits from the mean, why drag in another number like z-score?

A deviation measured in the data's own units gives only the raw distance. Suppose a value sits 12 units above the mean.

But that “12” is stuck to the scale of the data.
- If the data is in kilograms, it’s 12 kg.
- If the data is in exam marks, it’s 12 marks.

Depending on the dataset's standard deviation, that “12” might be huge or tiny. Because SD is expressed in the same units as the data, raw deviations can’t be compared reliably across datasets.

Z-score gives the meaning. Z-score asks: How many standard deviations away from the mean are you?

- Z = +2 means “two standard deviations above average,” whether it is measuring height, income, marks, rainfall, or CPU temperature.
- Z = –1 means “one standard deviation below average,” same interpretation everywhere.

That makes it unitless and comparable anywhere. It puts everything on a universal scale. 

* SD → describes the spread of the entire dataset in original units.
* Z → describes the value’s position relative to the mean, in units of SD.

#### Example
If SD = 12 and Z = 12, then the actual deviation from the mean is:
12 (SD) × 12 (Z) = 144 units above the mean.
So the SD tells us the “size” of one step, and the z-score tells us how many steps the value has moved.
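
To see why the same raw deviation can mean different things, the sketch below converts a 12-mark deviation into z-scores for two hypothetical exams with different spreads. The exam names, means, and SDs are invented for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Both scores are 12 marks above their exam's mean.
math_z = z_score(82, mean=70, sd=12)     # wide spread: 12 marks is one step
physics_z = z_score(72, mean=60, sd=4)   # narrow spread: 12 marks is three steps

print("math z:", math_z, "physics z:", physics_z)

# Going back the other way: deviation = SD * Z
print("physics deviation:", 4 * physics_z)
```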

### The problems of IQR and Z-Score

Both IQR and Z-Score are popular methods for detecting outliers, but they have limitations:
1. **IQR Limitations**:
   - Not effective for small datasets: In small datasets, the quartiles may not be well-defined, leading to unreliable IQR calculations.
   - Assumes a symmetric distribution: IQR works best with symmetric distributions; in highly skewed data, it may misclassify normal points as outliers.
   - Depends on Q3 and Q1: If these quartiles are affected by outliers, the IQR method may fail to identify true outliers.
2. **Z-Score Limitations**:
   - Assumes normal distribution: Z-score is most effective when the data follows a normal distribution. In skewed distributions, it may misidentify outliers.
   - Sensitive to mean and standard deviation: Extreme values can distort the mean and standard deviation, leading to misleading z-scores.
   - Not robust for small samples: In small datasets, z-scores can be unreliable due to high variability in mean and standard deviation estimates.

To solve these problems, we use robust methods like the Modified Z-Score or the Median Absolute Deviation (MAD) for outlier detection in skewed or small datasets.

### Modified Z-Score
The Modified Z-Score is a robust alternative to the traditional z-score, especially useful for small or skewed datasets. It uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation.
Formula:
Modified Z = 0.6745 * (X - Median) / MAD
Where:
- X is the value
- Median is the median of the dataset
- MAD is the Median Absolute Deviation

### Median Absolute Deviation (MAD)
MAD is a robust measure of variability that is less affected by outliers than standard deviation.
Formula:
MAD = median(|Xi - Median|)
Where:
- Xi is each value in the dataset
- Median is the median of the dataset

By using the Modified Z-Score and MAD, we can more effectively identify outliers in datasets that are not normally distributed or contain extreme values.

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Income': [22, 25, 27, 29, 35, 40, 42, 100, 110, 115]})

# IQR 

# Steps to calculate quartiles (pandas' default linear interpolation):
# 1. Sort the data: [22, 25, 27, 29, 35, 40, 42, 100, 110, 115]
# 2. Find the 0-based position in the sorted array: (n-1) * quantile, where n is the number of data points and quantile is 0.25 for Q1 and 0.75 for Q3.
#   - For Q1: (10-1) * 0.25 = 2.25 → between index 2 and index 3 (values 27 and 29)
#   - For Q3: (10-1) * 0.75 = 6.75 → between index 6 and index 7 (values 42 and 100)
# 3. Interpolate between these values if necessary
#   - Q1 = 27 + 0.25 * (29 - 27) = 27.5
#   - Q3 = 42 + 0.75 * (100 - 42) = 85.5

Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
print("Q1:", Q1, "Q3:", Q3, "IQR:", IQR)

# Note: this cell uses a multiplier of 1.0, which is stricter than the usual 1.5.
lower = Q1 - 1.0 * IQR
upper = Q3 + 1.0 * IQR
print("Lower Bound:", lower, "Upper Bound:", upper)
outliers_iqr = df[(df['Income'] < lower) | (df['Income'] > upper)]      # Outliers should be outside the lower and upper bounds

print("Outliers using IQR:\n", outliers_iqr)

# Z-score 
mean = df['Income'].mean()
std = df['Income'].std()        # pandas defaults to the sample SD (ddof=1)
df['z_score'] = (df['Income'] - mean) / std
outliers_z = df[np.abs(df['z_score']) > 2.5]        # Common threshold is 2.5 or 3
print("Outliers using Z-score:\n", outliers_z)

# Modified Z-score (Robust)
median = df['Income'].median()
mad = np.median(np.abs(df['Income'] - median))
df['mod_z'] = 0.6745 * (df['Income'] - median) / mad
outliers_mz = df[np.abs(df['mod_z']) > 3.5]     # Common threshold is 3.5
print("Outliers using Modified Z-score:\n", outliers_mz)

# Here we can see that only the Modified Z-score flags the high income values (100, 110, 115) as outliers.
# The IQR and Z-score methods find none, because the extreme values themselves inflate Q3 and the standard deviation.

Q1: 27.5 Q3: 85.5 IQR: 58.0
Lower Bound: -30.5 Upper Bound: 143.5
Outliers using IQR:
 Empty DataFrame
Columns: [Income]
Index: []
Outliers using Z-score:
 Empty DataFrame
Columns: [Income, z_score]
Index: []
Outliers using Modified Z-score:
    Income   z_score     mod_z
7     100  1.202258  3.665761
8     110  1.466491  4.252283
9     115  1.598607  4.545543


## Distribution Shapes for ML: Symmetric vs Skewed, Long Tails, and IQR Outliers

### Symmetric Distributions
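A symmetric distribution looks the same on both sides of its center. The most common example is the Normal Distribution, where the mean, median, and mode all coincide at the center. In other symmetric distributions (for example, a symmetric bimodal one), the mean and median still sit at the center, but the mode need not be the same.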

Many ML algorithms, such as Linear Regression, SVMs, and KNN, tend to behave better when features are roughly symmetric and on comparable scales, even though they do not strictly require normally distributed features. If data is symmetric, we often don’t need transformations.

### Skewed Distributions
A skewed distribution is not symmetric. It can be positively skewed (right-skewed) or negatively skewed (left-skewed).

In right-skewed data, the frequency of higher values tapers off more slowly than that of lower values, creating a long tail on the right side, so the bulk of the data is concentrated on the left. In left-skewed data, the opposite occurs: the long tail is on the left side and the data is concentrated on the right.
In right-skewed distributions, the mean is typically greater than the median, which is greater than the mode (Mean > Median > Mode). In left-skewed distributions, the mean is typically less than the median, which is less than the mode (Mean < Median < Mode).

Right-skewed data often appears in income, price, or reaction time datasets. For ML models, skewness can reduce model accuracy and cause bias. To fix this, we often apply log transformation, Box–Cox, or Yeo–Johnson transformations to make data more symmetric.
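
A rough sketch of how a log transformation reduces right skew. The data are synthetic log-normal samples drawn with a fixed seed (an assumption made so the run is repeatable), and `skewness` is a hand-written moment-based helper, not a library function.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=3.0, sigma=1.0, size=5000)  # synthetic right-skewed feature

log_incomes = np.log(incomes)  # valid because every value is positive

print("mean > median:", incomes.mean() > np.median(incomes))
print("skewness before:", round(skewness(incomes), 2))
print("skewness after log:", round(skewness(log_incomes), 2))
```

When a feature contains zeros, `np.log1p` is a common substitute for `np.log`.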

### Long Tails
A long tail means extreme values stretch far from the center. Long-tailed distributions are common in real-world datasets, such as user engagement or social media follower counts, where a few points dominate the range.

Long-tailed data challenges ML models. Models may overfit to frequent cases and ignore rare but important events, as in fraud detection or rare disease prediction. Handling long tails often involves data normalization, resampling, or outlier-aware models.

| Concept            | Description                                      | Example Features in ML                     | Handling Techniques                      |
|--------------------|--------------------------------------------------|--------------------------------------------|------------------------------------------|
| Symmetric Distribution | Data is evenly distributed around the center.    | Height, weight, test scores                | Often no transformation needed.         |
| Skewed Distribution   | Data is unevenly distributed, with a long tail on one side; the side of the tail gives the direction of the skew. | Income, house prices, reaction times       | Log transformation, Box-Cox, Yeo–Johnson |
| Long Tails          | Rare, extreme values stretch far from the center.      | User engagement, social media followers        | Normalization, resampling, outlier-aware models |

## Outliers
Outliers are data points that differ significantly from most observations. They can distort model training and degrade performance. A simple yet effective method to detect outliers is using the Interquartile Range (IQR).

#### When to Keep vs. Remove Outliers?
Not all outliers are bad. In some ML problems, like fraud detection or medical diagnosis, outliers represent critical cases. Instead of removing them, we may label or model them separately.

**Example:**
- In a credit card fraud detection dataset, fraudulent transactions are outliers. Removing them would eliminate crucial information needed to train an effective model.
- In a medical dataset, patients with rare diseases may appear as outliers. These cases are vital for diagnosis and treatment, so they should be retained and analyzed separately.

#### Why Mean and SD Fail Sometimes?
Mean and standard deviation work great for symmetric, normally distributed data - like heights or IQ scores in large populations.
But if outliers are present (for example, one billionaire in an income dataset), the mean gets pulled toward the extreme.

In [2]:
import numpy as np
data = [25, 27, 28, 29, 30, 31, 35, 5000]  # Example dataset with an outlier 5000
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
print(f"Mean: {mean}, Median: {median}, Standard Deviation: {std_dev}")

# The mean shoots up, but the median stays stable — that’s why we call median a robust measure of central tendency!

Mean: 650.625, Median: 29.5, Standard Deviation: 1643.9115652537396


#### When to Use Median and IQR?
Use Median and IQR when:
1. Data are skewed or have long tails
    - Example: Income, house prices, YouTube views, medical costs.
    - These are right-skewed, and the mean gets dragged upward by a few large values.
    - The median gives a better picture of a “typical” case.
2. There are outliers or extreme values
    - Outliers inflate SD and make the spread look huge even if most data are normal.
    - IQR ignores extremes by focusing on the middle 50%.
3. You need robust statistics for ML preprocessing
    - Tools like RobustScaler in sklearn use median and IQR instead of mean and SD to scale data safely.
    - It’s perfect for datasets where outliers shouldn’t dominate scaling.
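
The difference between the two scaling strategies can be sketched by hand in NumPy, using the same dataset as the mean-versus-median cell above. The robust version here is meant to mirror the default behaviour of sklearn's RobustScaler (center on the median, divide by the 25th to 75th percentile range); treat that correspondence as an assumption and check the sklearn documentation before relying on it.

```python
import numpy as np

data = np.array([25, 27, 28, 29, 30, 31, 35, 5000], dtype=float)

# Standard scaling: the outlier inflates the SD and squashes the inliers together.
standard = (data - data.mean()) / data.std()

# Robust scaling: center on the median and divide by the IQR.
q1, q3 = np.percentile(data, [25, 75])
robust = (data - np.median(data)) / (q3 - q1)

# Compare how much of the scale is left for the seven ordinary points.
inlier_spread_standard = standard[:-1].max() - standard[:-1].min()
inlier_spread_robust = robust[:-1].max() - robust[:-1].min()

print("inlier spread (standard):", round(inlier_spread_standard, 4))
print("inlier spread (robust):  ", round(inlier_spread_robust, 4))
```

With standard scaling the seven ordinary values become almost indistinguishable, while robust scaling keeps them spread out and leaves the outlier as a clearly extreme value.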

#### When Mean and SD Are Still Better?
Use Mean and SD when:
- Data are roughly normal (bell-shaped)
- You plan to use methods that assume roughly normal data or errors (for instance, Linear Regression with normally distributed residuals, or Gaussian Naive Bayes), or to apply Z-score scaling

**So to summarize**:
- Mean & SD → Best for normal, clean data.
- Median & IQR → Best for skewed, messy, or outlier-heavy data.