# Introduction to Probability and Statistics
## Manual Implementation of Descriptive Statistics Using Python
This notebook works through the basic formulas for obtaining descriptive statistics from a dataset. Apart from `collections.Counter` for the mode and a NumPy cross-check, the calculations avoid libraries, with the goal of internalizing the concepts and highlighting the small differences between the theoretical formulas and their implementation in code.

Dataset:

In [32]:
data = [45,55,43,60,54,59,55,49,52,65,41,55]
data

[45, 55, 43, 60, 54, 59, 55, 49, 52, 65, 41, 55]

Sorted Dataset:
Sorting is needed later to compute the median, the range, and the quartiles. Note that `data.sort()` sorts the list in place.

In [33]:
data.sort()
data

[41, 43, 45, 49, 52, 54, 55, 55, 55, 59, 60, 65]
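If the original ordering of the data needs to be kept, the built-in `sorted()` is an alternative to `list.sort()`. The following is a minimal sketch; the names `original` and `ordered` are illustrative and do not appear elsewhere in the notebook.

```python
original = [45, 55, 43, 60, 54, 59, 55, 49, 52, 65, 41, 55]

# sorted() returns a new list and leaves the input untouched,
# whereas list.sort() reorders the list in place and returns None.
ordered = sorted(original)

print(original[:3])  # first three values, still in the original order
print(ordered[:3])   # three smallest values
```

The rest of the notebook works on the in-place sorted `data`, so this variant is only useful when the unsorted list is needed again.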

## 1. Mean, Median, Standard Deviation and Variance

In [34]:
n = len(data)

# Mean:
mean = sum(data)/n

# Median:
# Python indexes start at 0, so for an even n the two middle values are
# data[n//2 - 1] and data[n//2], not the 1-based positions n/2 and n/2 + 1
# used in the textbook formula.
mid = n // 2
if n % 2 == 0:
    median = (data[mid - 1] + data[mid])/ 2
else:
    median = data[mid]

# Variance:
# Population variance (divide by n):
variance = sum((num - mean)**2 for num in data)/n
# Sample variance (Bessel's correction, divide by n - 1):
# variance = sum((num - mean)**2 for num in data)/(n - 1)

# Standard Deviation:
std_dev = variance ** 0.5

print(f"Mean: {mean}\nMedian: {median}\nVariance: {variance:.2f}\nStandard Deviation: {std_dev:.2f}")

Mean: 52.75
Median: 54.5
Variance: 47.19
Standard Deviation: 6.87
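The cell above computes the population variance and leaves the sample version commented out. As a rough sketch of how the two differ on this dataset, both can be computed side by side. The variable names `sum_sq_dev`, `population_variance`, and `sample_variance` are introduced here for illustration only.

```python
# Same dataset as in the notebook.
data = [45, 55, 43, 60, 54, 59, 55, 49, 52, 65, 41, 55]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, shared by both formulas.
sum_sq_dev = sum((x - mean) ** 2 for x in data)

population_variance = sum_sq_dev / n      # data is treated as the whole population
sample_variance = sum_sq_dev / (n - 1)    # Bessel's correction for a sample

population_std = population_variance ** 0.5
sample_std = sample_variance ** 0.5

print(f"Population variance: {population_variance:.2f}, std: {population_std:.2f}")
print(f"Sample variance:     {sample_variance:.2f}, std: {sample_std:.2f}")
```

Because the divisor is smaller, the sample variance is always the larger of the two, and the gap shrinks as `n` grows.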


In [35]:
import numpy as np

# Convert list to NumPy array
np_data = np.array(data)

# NumPy equivalents (population versions)
np_mean = np.mean(np_data)
np_median = np.median(np_data)
np_variance = np.var(np_data, ddof=0)
np_std_dev = np.std(np_data, ddof=0)

print("Manual vs NumPy comparison\n")

print(f"Mean: {mean} | NumPy: {np_mean}")
print(f"Median: {median} | NumPy: {np_median}")
print(f"Variance: {variance:.2f} | NumPy: {np_variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f} | NumPy: {np_std_dev:.2f}")

Manual vs NumPy comparison

Mean: 52.75 | NumPy: 52.75
Median: 54.5 | NumPy: 54.5
Variance: 47.19 | NumPy: 47.19
Standard Deviation: 6.87 | NumPy: 6.87
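When NumPy is not available, the standard-library `statistics` module can serve as the reference instead. The sketch below assumes the population versions (`pvariance`, `pstdev`) to match the manual formulas, and compares floats with `math.isclose` rather than by rounding in the print statement. The `checks` dictionary is an illustrative helper.

```python
import math
import statistics

data = [45, 55, 43, 60, 54, 59, 55, 49, 52, 65, 41, 55]
n = len(data)

# Manual population statistics, as in the notebook.
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n
std_dev = variance ** 0.5

# Standard-library equivalents: pvariance/pstdev are the population versions.
checks = {
    "mean": (mean, statistics.mean(data)),
    "variance": (variance, statistics.pvariance(data)),
    "std_dev": (std_dev, statistics.pstdev(data)),
}

for name, (manual, reference) in checks.items():
    # math.isclose compares floats within a relative tolerance.
    status = "match" if math.isclose(manual, reference) else "MISMATCH"
    print(f"{name}: manual={manual:.4f} reference={reference:.4f} -> {status}")
```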


## 2. Mode and Coefficient of Variation

In [None]:
from collections import Counter

# Mode:
counts = Counter(data)
max_freq = max(counts.values())

modes = [x for x, freq in counts.items() if freq == max_freq]  # This definition accounts for multimodal datasets

if max_freq == 1 or len(modes) == len(counts):   # If every value is equally frequent, there is no mode
    mode = None
else:
    mode = modes

# Coefficient of Variation:
assert mean > 0, "Coefficient of Variation requires a positive mean"
CoV = std_dev / mean

print(f"Mode: {mode}\nCoefficient of Variation: {CoV:.2f} or {CoV*100:.2f}%")


Mode: [55]
Coefficient of Variation: 0.13 or 13.02%
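To see how the mode logic behaves on multimodal data and on data without a mode, it can be wrapped in a function and run on a few small lists. This is a sketch: `find_modes` is a hypothetical helper name, and the small test lists are made up for illustration. The `max_freq == 1` check is dropped here because it is already covered by the case where every distinct value is equally frequent.

```python
from collections import Counter

def find_modes(values):
    """Return the list of modes, or None when all values are equally frequent."""
    counts = Counter(values)
    max_freq = max(counts.values())  # raises ValueError on empty input
    modes = [x for x, freq in counts.items() if freq == max_freq]
    # If every distinct value reaches the maximum frequency, none stands out.
    if len(modes) == len(counts):
        return None
    return modes

print(find_modes([41, 43, 45, 49, 52, 54, 55, 55, 55, 59, 60, 65]))  # [55]
print(find_modes([1, 1, 2, 2, 3]))                                   # [1, 2]
print(find_modes([1, 2, 3, 4]))                                      # None
print(find_modes([1, 1, 2, 2]))                                      # None
```

Returning `None` for the last case is a design choice of this notebook's definition; other conventions would report both values as modes.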


## 3. Range and Interquartile Range

In [50]:
# Data is assumed to be sorted
# Range:
Range = data[-1] - data[0]


# IQR:
# Quartiles are computed as the medians of the lower and upper halves
def median_of(sorted_data):
    m = len(sorted_data)
    mid = m // 2
    if m % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:
        return sorted_data[mid]

# Split dataset into halves
if n % 2 == 0:
    lower_half = data[:n//2]
    upper_half = data[n//2:]
else:
    lower_half = data[:n//2]      # exclude median
    upper_half = data[n//2 + 1:]

# Calculate Quartiles
Q1 = median_of(lower_half)
Q3 = median_of(upper_half)

# Calculate IQR
IQR = Q3 - Q1

print(f"Range: {Range}")
print(f"Interquartile Range: {IQR}")


Range: 24
Interquartile Range: 10.0
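The median-of-halves rule used above is only one of several quartile conventions, and different conventions can give different IQR values for the same data. As a hedged comparison, the sketch below uses `statistics.quantiles` from the standard library (Python 3.8+) with its two interpolation methods. The names `q_exclusive` and `q_inclusive` are illustrative.

```python
import statistics

data = [41, 43, 45, 49, 52, 54, 55, 55, 55, 59, 60, 65]

# statistics.quantiles returns the n - 1 cut points; n=4 gives Q1, Q2, Q3.
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

for label, (q1, q2, q3) in [("exclusive", q_exclusive), ("inclusive", q_inclusive)]:
    print(f"{label}: Q1={q1}, Q3={q3}, IQR={q3 - q1}")
```

On this dataset the exclusive method gives an IQR of 12.0 and the inclusive method 8.0, compared with 10.0 from the median-of-halves rule, so it is worth stating which convention is in use when reporting quartiles.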


## 4. Mean Absolute Deviation

In [51]:
def mean_absolute_deviation(data):
    n = len(data)
    mean = sum(data)/n
    return sum(abs(x - mean) for x in data) / n

print(f"Mean Absolute Deviation: {mean_absolute_deviation(data)}")


Mean Absolute Deviation: 5.625
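The mean absolute deviation and the standard deviation both measure spread, but squaring gives large deviations more weight. The sketch below compares the two on the notebook's dataset and on a copy with one extreme value appended. The value 120 is hypothetical and not part of the original data, and `population_std` is a helper name introduced here.

```python
def mean_absolute_deviation(values):
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

def population_std(values):
    n = len(values)
    mean = sum(values) / n
    return (sum((x - mean) ** 2 for x in values) / n) ** 0.5

data = [41, 43, 45, 49, 52, 54, 55, 55, 55, 59, 60, 65]
with_outlier = data + [120]  # hypothetical extreme value, for illustration only

ratios = {}
for label, values in [("original", data), ("with outlier", with_outlier)]:
    mad = mean_absolute_deviation(values)
    std = population_std(values)
    ratios[label] = std / mad
    print(f"{label}: MAD={mad:.2f}, std={std:.2f}, std/MAD={ratios[label]:.2f}")
```

The mean absolute deviation never exceeds the population standard deviation, and the ratio between them grows once the outlier is added, which reflects the standard deviation's greater sensitivity to extreme values.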
