# Introduction to Probability and Statistics
## Manual Implementation of Descriptive Statistics Using Python
This notebook aims to build an understanding of the basic formulas used to obtain descriptive values from a dataset. It avoids third-party libraries (only `collections.Counter` from the standard library is used) so that the concepts are internalized, and it highlights the small differences between theoretical formulas and their implementation in code.

## Dataset and Preprocessing

We begin by defining the dataset and sorting it. Many descriptive statistics
(median, quartiles, IQR) require ordered data.


In [34]:
data = [22, 49, 26, 15, 31, 29, 25, 33, 25, 30, 24, 34, 25, 39, 22, 24, 26, 19, 31, 37]
data_sorted = sorted(data)
n = len(data)

print(data_sorted)

[15, 19, 22, 22, 24, 24, 25, 25, 25, 26, 26, 29, 30, 31, 31, 33, 34, 37, 39, 49]


## Measures of Central Tendency

These statistics describe the "center" of the distribution.


### Mean

The mean is the arithmetic average of the data and is sensitive to outliers.

In [3]:
mean = sum(data) / n
mean


28.3

### Median

The median is the middle value of the ordered dataset and is robust to outliers.


In [6]:
mid = n // 2
if n % 2 == 0:
    median = (data_sorted[mid - 1] + data_sorted[mid]) / 2
else:
    median = data_sorted[mid]

median


26.0
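To make the contrast between the mean and the median concrete, here is a minimal sketch on a small toy list (the values and the helper names `mean_of` and `median_value` are hypothetical, not part of the notebook's dataset). Adding one extreme value moves the mean a lot and the median only slightly.

```python
def mean_of(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def median_value(values):
    """Median of a non-empty list (averages the two middle values when the count is even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

# Hypothetical toy values, chosen only for illustration
clean = [10, 12, 13, 15, 16]
with_outlier = clean + [500]

# How far each statistic moves when the extreme value is added
mean_shift = mean_of(with_outlier) - mean_of(clean)
median_shift = median_value(with_outlier) - median_value(clean)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```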

### Mode

The mode is the value (or values) that occurs most frequently in the dataset.
A dataset may have no mode (every value is equally frequent) or several modes.


In [11]:
from collections import Counter

counts = Counter(data)
max_freq = max(counts.values())
modes = [x for x, freq in counts.items() if freq == max_freq]  # keeps every tied value, so multi-modal datasets are handled

if max_freq == 1 or len(modes) == len(counts):  # all values are equally frequent, so there is no mode
    mode = None
else:
    mode = modes

mode


[25]
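The three possible outcomes (one mode, several modes, no mode) can be sketched on small toy lists. The function name `find_modes` and the inputs are hypothetical; the rule is the same one used in the cell above.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent values, or None when every distinct value is equally frequent."""
    counts = Counter(values)
    max_freq = max(counts.values())
    modes = [x for x, freq in counts.items() if freq == max_freq]
    # If all distinct values tie, no value stands out, so report no mode
    return None if len(modes) == len(counts) else modes

single = find_modes([1, 2, 2, 3])      # one value is most frequent
multi = find_modes([1, 1, 2, 2, 3])    # two values tie for most frequent
no_mode = find_modes([1, 2, 3])        # every value appears once

print(single, multi, no_mode)
```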

## Measures of Dispersion

These statistics quantify how spread out the data is.


### Range

The range is the difference between the maximum and minimum values.


In [8]:
data_range = data_sorted[-1] - data_sorted[0]
data_range

34

### Variance (Population)

Variance measures the average squared deviation from the mean.


In [9]:
pop_variance = sum((x - mean)**2 for x in data) / n
pop_variance

55.71

### Variance (Sample)

When the data represents a *sample* from a larger population, variance is
computed using Bessel’s correction.
We divide by \( n - 1 \) instead of \( n \) because estimating the mean from
the sample uses up one degree of freedom, and this correction makes the
variance an unbiased estimator of the population variance.
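
In symbols, writing \( x_1, \dots, x_n \) for the observations (this notation is introduced here for illustration, with \( \mu \) the population mean and \( \bar{x} \) the sample mean), the two definitions differ only in the divisor:

\[
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]

In the code, both formulas use the same variable `mean`, because it is computed from the data at hand in either case.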


In [12]:
# Sample variance (Bessel's correction)
samp_variance = sum((x - mean)**2 for x in data) / (n - 1)
round(samp_variance, 4)  # rounded for display


58.6421

**Key idea:**  
One degree of freedom is lost because the deviations must sum to zero once the
sample mean is fixed.
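
This can be checked numerically. The sketch below uses the first five observations of the dataset; the names `sample`, `deviations` and `implied_last` are illustrative. It shows that the deviations cancel out, so the last one is fully determined by the other \( n - 1 \).

```python
import math

sample = [22, 49, 26, 15, 31]   # first five observations of the notebook's dataset
sample_mean = sum(sample) / len(sample)
deviations = [x - sample_mean for x in sample]

# The deviations from the sample mean cancel out (up to floating-point rounding)
total = sum(deviations)
print(math.isclose(total, 0.0, abs_tol=1e-9))

# Hence the last deviation is implied by the others: only n - 1 of them are free
implied_last = -sum(deviations[:-1])
print(math.isclose(implied_last, deviations[-1], abs_tol=1e-9))
```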


### Standard Deviation

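The standard deviation is the square root of the variance and is expressed
in the same units as the data. The cell below takes the square root of the
population variance; use the sample variance instead when the data is a sample.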


In [13]:
std_dev = pop_variance ** 0.5
std_dev

7.463913182774837
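
For comparison, a sample standard deviation can be obtained the same way from the sample variance. This is a minimal sketch that recomputes everything from the dataset so it runs on its own; the names `pop_std` and `samp_std` are illustrative.

```python
data = [22, 49, 26, 15, 31, 29, 25, 33, 25, 30, 24, 34, 25, 39, 22, 24, 26, 19, 31, 37]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations, shared by both versions
squared_devs = sum((x - mean) ** 2 for x in data)

pop_std = (squared_devs / n) ** 0.5         # divides by n
samp_std = (squared_devs / (n - 1)) ** 0.5  # divides by n - 1 (Bessel's correction)

print(f"Population std: {pop_std:.4f}")
print(f"Sample std:     {samp_std:.4f}")
```

Because the divisor is smaller, the sample version is always slightly larger; the gap shrinks as `n` grows.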

### Mean Absolute Deviation (MAD)

MAD is the average of the absolute deviations from the mean and is less
sensitive to extreme values than variance.


In [16]:
def mean_absolute_deviation(data):
    n = len(data)
    mean = sum(data)/n
    return sum(abs(x - mean) for x in data) / n

mad = mean_absolute_deviation(data)
mad


5.83

## Relative Dispersion

Relative dispersion measures describe variability *in relation to the size of the data values*.
They are useful when comparing how spread out different datasets are, especially when the datasets are on different scales.


### Coefficient of Variation (CoV)

The coefficient of variation compares the standard deviation to the mean.
It tells us how large the variability is relative to the average value.

The CoV is only meaningful when the mean is positive and the data is measured
on a ratio scale, because it relies on dividing by the mean.


In [18]:
assert mean > 0, "Coefficient of Variation requires a positive mean"
cov = std_dev / mean
print(f"Coefficient of Variation: {cov:.2f} or {cov*100:.2f}%")

Coefficient of Variation: 0.26 or 26.37%
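
To illustrate why the CoV suits comparisons across scales, the sketch below expresses the same hypothetical measurements in metres and in centimetres. The function name `coefficient_of_variation` and the values are assumptions made for this example. The standard deviation changes with the unit, while the CoV does not.

```python
import math

def population_std(values):
    """Population standard deviation (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return (sum((x - mean) ** 2 for x in values) / n) ** 0.5

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (requires a positive mean)."""
    mean = sum(values) / len(values)
    if mean <= 0:
        raise ValueError("Coefficient of variation requires a positive mean")
    return population_std(values) / mean

heights_m = [1.60, 1.72, 1.68, 1.81, 1.75]   # hypothetical values in metres
heights_cm = [h * 100 for h in heights_m]    # the same values in centimetres

cov_m = coefficient_of_variation(heights_m)
cov_cm = coefficient_of_variation(heights_cm)

print(f"std (m):  {population_std(heights_m):.4f}   CoV: {cov_m:.4f}")
print(f"std (cm): {population_std(heights_cm):.4f}   CoV: {cov_cm:.4f}")
```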


## Measures of Position (Order Statistics)


### Quartiles

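Q1 and Q3 are the medians of the lower and upper halves of the ordered data.
When \( n \) is odd, the overall median is excluded from both halves. Other
quartile conventions exist and can give slightly different values.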


In [22]:
# helper function to obtain the median of a sorted list:
def median_of(sorted_data):
    m = len(sorted_data)
    mid = m // 2
    if m % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:
        return sorted_data[mid]

if n % 2 == 0:
    lower_half = data_sorted[:n//2]
    upper_half = data_sorted[n//2:]
else:
    lower_half = data_sorted[:n//2]
    upper_half = data_sorted[n//2 + 1:]

Q1 = median_of(lower_half)
Q3 = median_of(upper_half)
print(f"Q1: {Q1}")
print(f"Median: {median}")
print(f"Q3: {Q3}")

Q1: 24.0
Median: 26.0
Q3: 32.0
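
Quartile definitions vary between textbooks and software. As a cross-check only (the notebook itself stays library-free), the sketch below compares the median-of-halves method with the two interpolation methods offered by `statistics.quantiles` in the Python standard library (available from Python 3.8). The variable names are illustrative.

```python
import statistics

data = [22, 49, 26, 15, 31, 29, 25, 33, 25, 30, 24, 34, 25, 39, 22, 24, 26, 19, 31, 37]
data_sorted = sorted(data)
n = len(data_sorted)

def median_of(sorted_data):
    """Median of an already sorted list."""
    m = len(sorted_data)
    mid = m // 2
    if m % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    return sorted_data[mid]

# Median-of-halves (n is even here, so the two halves split cleanly)
q1_halves = median_of(data_sorted[:n // 2])
q3_halves = median_of(data_sorted[n // 2:])

# Standard-library conventions: each call returns [Q1, Q2, Q3]
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(f"halves:    Q1={q1_halves}, Q3={q3_halves}")
print(f"exclusive: Q1={q_exclusive[0]}, Q3={q_exclusive[2]}")
print(f"inclusive: Q1={q_inclusive[0]}, Q3={q_inclusive[2]}")
```

The methods agree on the median but need not agree on Q1 and Q3, so it is worth stating which convention a reported quartile uses.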


### Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data and is robust to outliers.


In [21]:
IQR = Q3 - Q1
IQR

8.0

## Robust Measures of Central Tendency

These measures reduce the influence of extreme values.


### Trimmed Mean

The trimmed mean removes a fixed percentage of the smallest and largest values
before computing the mean.


In [26]:
def trimmed_mean(data, trim_percent):
    data_sorted = sorted(data)
    n = len(data)
    k = int(n * (trim_percent / 100))  # int() truncates, so a fractional count is rounded down to whole observations

    if 2 * k >= n:
        raise ValueError("Trim percentage too large")

    trimmed = data_sorted[k : n - k]
    return sum(trimmed) / len(trimmed)

trimmedMean = trimmed_mean(data, 10)
trimmedMean

27.75

## Weighted Mean

A weighted mean assigns different importance to observations.


In [28]:
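def weighted_mean(data, weights):
    # Each observation needs exactly one weight
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")

    total_weight = sum(weights)  # computed once, not on every iteration
    return sum(x * w for x, w in zip(data, weights)) / total_weight

weights = [1, 3, 2, 4, 2, 3, 5, 1, 2, 4, 1, 6]
subset = data[:len(weights)]  # only 12 weights are defined, so use the first 12 observations
print(f"Mean: {sum(subset) / len(subset):.2f}")
print(f"Weighted Mean: {weighted_mean(subset, weights)}")

Mean: 28.58
Weighted Mean: 29.0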
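
A quick way to see what the weights do is to compare equal and unequal weights on a tiny example. This is a minimal sketch with hypothetical scores; the function is a self-contained variant written for this illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

scores = [70, 80, 90]   # hypothetical values

# Equal weights reproduce the ordinary arithmetic mean
equal = weighted_mean(scores, [1, 1, 1])

# Doubling the weight of the last score pulls the result towards it
skewed = weighted_mean(scores, [1, 1, 2])

print(equal, skewed)
```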


## Outlier Detection

Outliers are identified using Tukey's 1.5 × IQR rule.
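
Written out, the rule flags any value that falls outside two fences built from the quartiles:

\[
\text{lower fence} = Q_1 - 1.5 \times \mathrm{IQR},
\qquad
\text{upper fence} = Q_3 + 1.5 \times \mathrm{IQR},
\qquad
\mathrm{IQR} = Q_3 - Q_1
\]

For this dataset, \( Q_1 = 24 \) and \( Q_3 = 32 \), so \( \mathrm{IQR} = 8 \) and the fences are \( 24 - 12 = 12 \) and \( 32 + 12 = 44 \).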


In [35]:
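def find_outliers(data):
    data_sorted = sorted(data)
    n = len(data_sorted)  # local length, so the function does not rely on the global n

    # Same halves as in the Quartiles section: skip the median when n is odd
    lower_half = data_sorted[:n // 2]
    upper_half = data_sorted[n // 2:] if n % 2 == 0 else data_sorted[n // 2 + 1:]
    Q1, Q3 = median_of(lower_half), median_of(upper_half)
    iqr = Q3 - Q1

    lower_fence = Q1 - 1.5 * iqr
    upper_fence = Q3 + 1.5 * iqr

    outliers = []
    inliers = []

    for x in data:
        if lower_fence <= x <= upper_fence:
            inliers.append(x)
        else:
            outliers.append(x)

    return outliers, inliers

# The function has its own name, so the result variables no longer overwrite it
outliers, inliers = find_outliers(data)
print(f"Outliers: {outliers}")
print(f"inliers: {inliers}")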



Outliers: [49]
inliers: [22, 26, 15, 31, 29, 25, 33, 25, 30, 24, 34, 25, 39, 22, 24, 26, 19, 31, 37]


## Final Summary Function

This function aggregates most of the descriptive statistics computed in this notebook and prints them. The sample variance and the weighted mean are not included.


In [44]:
def describe_stats(data, trim_percent=0):
    data_sorted = sorted(data)
    n = len(data)

    # Central tendency
    mean = sum(data) / n

    # Median
    mid = n // 2
    if n % 2 == 0:
        median = (data_sorted[mid - 1] + data_sorted[mid]) / 2
    else:
        median = data_sorted[mid]

    # Mode
    from collections import Counter
    counts = Counter(data)
    max_freq = max(counts.values())
    modes = [x for x, f in counts.items() if f == max_freq]
    mode = None if max_freq == 1 or len(modes) == len(counts) else modes

    # Dispersion
    data_range = data_sorted[-1] - data_sorted[0]

    pop_variance = sum((x - mean)**2 for x in data) / n
    pop_std = pop_variance ** 0.5

    mad = sum(abs(x - mean) for x in data) / n

    cov = pop_std / mean if mean > 0 else None

    # Quartiles & IQR
    def median_of(sorted_data):
        m = len(sorted_data)
        mid = m // 2
        if m % 2 == 0:
            return (sorted_data[mid - 1] + sorted_data[mid]) / 2
        return sorted_data[mid]

    if n % 2 == 0:
        lower = data_sorted[:n//2]
        upper = data_sorted[n//2:]
    else:
        lower = data_sorted[:n//2]
        upper = data_sorted[n//2 + 1:]

    Q1 = median_of(lower)
    Q3 = median_of(upper)
    IQR = Q3 - Q1

    # Outliers (Tukey rule)
    lower_fence = Q1 - 1.5 * IQR
    upper_fence = Q3 + 1.5 * IQR

    outliers = [x for x in data if x < lower_fence or x > upper_fence]

    # Trimmed mean (optional)
    trimmed_mean = None
    if trim_percent > 0:
        k = int(n * (trim_percent / 100))
        if 2 * k < n:
            trimmed = data_sorted[k:n-k]
            trimmed_mean = sum(trimmed) / len(trimmed)

    stats = {
        "Mean": mean,
        "Median": median,
        "Mode": mode,
        "Range": data_range,
        "Variance": pop_variance,
        "Std_dev": pop_std,
        "Mad": mad,
        "CoV": cov,
        "Q1": Q1,
        "Q3": Q3,
        "IQR": IQR,
        "Outliers": outliers,
        "Trimmed Mean": trimmed_mean}

    for stat, value in stats.items():
        print(f"{stat}: {value}")

describe_stats(data, 10)

Mean: 28.3
Median: 26.0
Mode: [25]
Range: 34
Variance: 55.71
Std_dev: 7.463913182774837
Mad: 5.83
CoV: 0.26374251529239706
Q1: 24.0
Q3: 32.0
IQR: 8.0
Outliers: [49]
Trimmed Mean: 27.75
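
As a final sanity check, the manual formulas can be compared with Python's standard-library `statistics` module. This is only a verification sketch, not a replacement for the manual implementations; `statistics.multimode` requires Python 3.8 or later.

```python
import math
import statistics

data = [22, 49, 26, 15, 31, 29, 25, 33, 25, 30, 24, 34, 25, 39, 22, 24, 26, 19, 31, 37]
n = len(data)

# Manual formulas, as used throughout the notebook
mean = sum(data) / n
pop_variance = sum((x - mean) ** 2 for x in data) / n
samp_variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Each comparison prints True when the manual and library results agree
print(math.isclose(mean, statistics.mean(data)))
print(math.isclose(pop_variance, statistics.pvariance(data)))
print(math.isclose(samp_variance, statistics.variance(data)))
print(math.isclose(pop_variance ** 0.5, statistics.pstdev(data)))

# Median and mode(s) straight from the library
lib_median = statistics.median(data)
lib_modes = statistics.multimode(data)
print(lib_median, lib_modes)
```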
