# NumPy Stats Toolkit — Examples

This notebook demonstrates how to use each function in the `stats_toolkit.py` module for basic statistical analysis using only NumPy.


In [1]:
import numpy as np
import stats_toolkit as st

In [2]:
data = np.array([4, 8, 6, 5, 3, 9, 7, 6, 5, 27, 11, 34, 25, 49, 18, 41, 37, 39, 23, 20, 29, 47, 2, 17, 22, 1, 33, 30, 44, 36, 46, 10, 0, 38, 50, 13, 14, 43,  8, 26, 19, 24, 21, 35, 31, 9, 45, 42, 28, 32, 40, 6, 7, 16, 3, 5, 12,  4, 15, 1, 17, 48, 20, 10, 2, 25, 34, 49, 11, 14, 6, 22, 13, 26, 24, 27, 7, 0, 32, 21,16, 12, 15, 28, 35, 30, 19,  8, 18, 23,29, 31, 38, 36, 43, 33, 45, 41, 39, 40
])

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])


## Mean

The **mean** (or average) is a measure of the central value of a dataset. It is calculated by summing all the values and dividing by the number of elements. It is often read as the "typical" value of a distribution, although a few extreme values can pull it noticeably in their direction.

$$
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
$$


**Where:**
- $ n $ = total number of values in the population  
- $ x_i $ = each individual value in the dataset  
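
The formula can be written directly in NumPy. The sketch below uses a small hypothetical array (`sample`) and does not show how `st.mean` is implemented; it only checks that the formula agrees with NumPy's built-in `np.mean`.

```python
import numpy as np

# Small hypothetical array, used only for illustration.
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Direct translation of the formula: sum the values, then divide by n.
manual_mean = sample.sum() / sample.size

# NumPy's built-in mean should agree with the manual computation.
assert np.isclose(manual_mean, np.mean(sample))
print(manual_mean)  # 5.0
```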


In [3]:
print(f'Mean: {st.mean(data)}') 

Mean: 22.68


## Population Variance

The **population variance** measures how much each value in the dataset differs from the mean, considering **all** values in the population. It quantifies the overall spread of the data by calculating the average of the squared differences from the mean.

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
$$

**Where:**
- $ n $ = total number of values in the population  
- $ x_i $ = each individual value in the dataset  
- $ \mu $ = the population mean (average of all values)  
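
As a rough cross-check of the formula, the sketch below computes the population variance by hand on a small hypothetical array (`sample`) and compares it with `np.var`, whose default `ddof=0` divides by $n$. It is not taken from `stats_toolkit.py`, whose implementation is not shown here.

```python
import numpy as np

# Small hypothetical array, used only for illustration.
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Average of the squared deviations from the mean (divide by n).
mu = sample.mean()
pop_var = np.mean((sample - mu) ** 2)

# np.var uses ddof=0 by default, which is the population form.
assert np.isclose(pop_var, np.var(sample))
print(pop_var)  # 5.0
```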


In [4]:
print(f'Population variance: {st.population_variance(data)}')

Population variance: 204.23760000000004


## Sample Variance

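The **sample variance** is used when you only have a subset (sample) of the full population. Because the deviations are measured from the sample mean $\bar{x}$ rather than the true population mean, dividing by $n$ would tend to underestimate the variance. Dividing by $n - 1$ instead (Bessel's correction) removes that bias.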

$$
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$
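
The only difference from the population form is the divisor. The sketch below, on a small hypothetical array (`sample`), checks the manual formula against `np.var(..., ddof=1)` and shows that the two variances differ by a factor of $n / (n - 1)$. It is an illustration, not the code behind `st.sample_variance`.

```python
import numpy as np

# Small hypothetical array, used only for illustration.
sample = np.array([2.0, 4.0, 6.0, 8.0])
n = sample.size

# Sum of squared deviations from the sample mean, divided by n - 1.
x_bar = sample.mean()
samp_var = np.sum((sample - x_bar) ** 2) / (n - 1)

# ddof=1 tells NumPy to divide by n - 1 instead of n.
assert np.isclose(samp_var, np.var(sample, ddof=1))

# The sample variance is the population variance scaled by n / (n - 1).
assert np.isclose(samp_var, np.var(sample) * n / (n - 1))
print(samp_var)  # roughly 6.667
```

The same relationship holds for the values printed in this notebook: with $n = 100$, $204.2376 \times 100 / 99 \approx 206.3006$.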

In [5]:
print(f'Sample variance: {st.sample_variance(data)}')

Sample variance: 206.30060606060613


## Standard Deviation

**Standard deviation** is a measure of how spread out the values in a dataset are around the mean. It is the root-mean-square deviation from the mean, so it can be read as a typical distance of a data point from the mean (it is not the same as the mean absolute deviation), and it gives insight into the variability or dispersion in the data.

It is calculated as the square root of the variance, which makes it easier to interpret since it has the same units as the original data.

$$
\sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 }
$$

**Where:**
- $ n $ = number of data points  
- $ x_i $ = each individual value in the dataset  
- $ \mu $ = the mean of the dataset  
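
A minimal sketch of the formula on a small hypothetical array (`sample`): `np.std` with its default `ddof=0` gives the population form shown above, and `ddof=1` gives the sample form. This is not the implementation of `st.standard_deviation`.

```python
import numpy as np

# Small hypothetical array, used only for illustration.
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Square root of the population variance (divide by n).
pop_std = np.sqrt(np.mean((sample - sample.mean()) ** 2))
assert np.isclose(pop_std, np.std(sample))

# The sample standard deviation divides by n - 1 and is slightly larger.
samp_std = np.std(sample, ddof=1)
assert samp_std > pop_std
```

The value printed in this notebook (about 14.29) equals the square root of the population variance reported earlier (204.2376), which suggests that `st.standard_deviation` uses the population form.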





In [6]:
print("Standard Deviation:", st.standard_deviation(data))

Standard Deviation: 14.291172100286248


## Z-Scores

A **z-score** indicates how many standard deviations a data point is from the mean of the dataset. It is a way to standardize values, allowing for comparison across different scales or distributions.

Z-scores are useful for detecting outliers, comparing values from different distributions, and preparing data for machine learning algorithms that are sensitive to feature scales.

$$
z = \frac{x - \mu}{\sigma}
$$

**Where:**
- $ x $ = the data point  
- $ \mu $ = the mean of the dataset  
- $ \sigma $ = the standard deviation of the dataset
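
The sketch below standardizes a small hypothetical array (`sample`) with the formula above, assuming the population standard deviation. A useful sanity check is that the resulting z-scores have mean 0 and standard deviation 1. Note that a constant array has $\sigma = 0$, so the division is undefined in that case.

```python
import numpy as np

# Small hypothetical array, used only for illustration.
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Subtract the mean, then divide by the (population) standard deviation.
z = (sample - sample.mean()) / sample.std()

# Standardized values are centered at 0 with unit spread.
assert np.isclose(z.mean(), 0.0)
assert np.isclose(z.std(), 1.0)
```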



In [7]:
print("Z-scores:", st.z_scores(data))

Z-scores: [-1.30710063 -1.02720756 -1.16715409 -1.23712736 -1.37707389 -0.95723429
 -1.09718083 -1.16715409 -1.23712736  0.30228451 -0.81728776  0.79209738
  0.16233798  1.84169639 -0.32747489  1.28191025  1.00201718  1.14196372
  0.02239145 -0.18752836  0.44223105  1.70174985 -1.44704716 -0.39744816
 -0.04758182 -1.51702043  0.72212411  0.51220431  1.49183005  0.93204392
  1.63177658 -0.88726102 -1.58699369  1.07199045  1.91166965 -0.67734122
 -0.60736796  1.42185678 -1.02720756  0.23231125 -0.25750162  0.09236471
 -0.11755509  0.86207065  0.58217758 -0.95723429  1.56180332  1.35188352
  0.37225778  0.65215085  1.21193698 -1.16715409 -1.09718083 -0.46742142
 -1.37707389 -1.23712736 -0.74731449 -1.30710063 -0.53739469 -1.51702043
 -0.39744816  1.77172312 -0.18752836 -0.88726102 -1.44704716  0.16233798
  0.79209738  1.84169639 -0.81728776 -0.60736796 -1.16715409 -0.04758182
 -0.67734122  0.23231125  0.09236471  0.30228451 -1.09718083 -1.58699369
  0.65215085 -0.11755509 -0.46742142 -0.7

## Min-Max Normalization

**Min-max normalization** is a technique used to scale numerical data to a fixed range, typically **between 0 and 1**. This transformation preserves the relationships between values but shifts and scales them so that the smallest value becomes 0 and the largest becomes 1.

This method is especially useful in **machine learning algorithms** that are sensitive to the scale of features, such as K-nearest neighbors (KNN), support vector machines (SVM), and gradient descent-based models.

$$
x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}
$$

**Where:**
- $ x $ = the original value  
- $ \min(x) $ = the minimum value in the dataset  
- $ \max(x) $ = the maximum value in the dataset

After normalization:
- The minimum value becomes **0**
- The maximum value becomes **1**
- All other values are proportionally scaled between 0 and 1
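
The scaling can be sketched in a few lines on a small hypothetical array (`sample`). This is an illustration of the formula, not the code behind `st.min_max_normalize`. If every value is identical, $\max(x) - \min(x) = 0$ and the formula is undefined, so real code would need to handle that case separately.

```python
import numpy as np

# Small hypothetical array, used only for illustration.
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Shift so the minimum is 0, then divide by the range so the maximum is 1.
lo, hi = sample.min(), sample.max()
scaled = (sample - lo) / (hi - lo)

assert scaled.min() == 0.0
assert scaled.max() == 1.0
```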


In [8]:
print("Min-Max Normalization:\n", st.min_max_normalize(data))

Min-Max Normalization:
 [0.08 0.16 0.12 0.1  0.06 0.18 0.14 0.12 0.1  0.54 0.22 0.68 0.5  0.98
 0.36 0.82 0.74 0.78 0.46 0.4  0.58 0.94 0.04 0.34 0.44 0.02 0.66 0.6
 0.88 0.72 0.92 0.2  0.   0.76 1.   0.26 0.28 0.86 0.16 0.52 0.38 0.48
 0.42 0.7  0.62 0.18 0.9  0.84 0.56 0.64 0.8  0.12 0.14 0.32 0.06 0.1
 0.24 0.08 0.3  0.02 0.34 0.96 0.4  0.2  0.04 0.5  0.68 0.98 0.22 0.28
 0.12 0.44 0.26 0.52 0.48 0.54 0.14 0.   0.64 0.42 0.32 0.24 0.3  0.56
 0.7  0.6  0.38 0.16 0.36 0.46 0.58 0.62 0.76 0.72 0.86 0.66 0.9  0.82
 0.78 0.8 ]


## Quantiles

**Quantiles** are cut points that divide a dataset into equal-sized, ordered subgroups based on the distribution of values. They help describe the **spread and shape** of the data by showing where values fall relative to others in the distribution.

Common types of quantiles include:
- **Quartiles**: divide the data into four parts (25% each)
- **Percentiles**: divide the data into 100 parts (1% each)
- **Deciles**: divide the data into 10 parts (10% each)

Quantiles are especially useful for:
- Identifying the **center** (like the median)
- Describing the **spread** of the data
- Detecting **outliers** using the interquartile range (IQR)

### Interquartile Range (IQR)
The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1):

$$
\text{IQR} = Q3 - Q1
$$

This shows the range where the middle 50% of the data lies.

**Where:**
- $ Q1 $ = 25th percentile (first quartile)  
- $ Q3 $ = 75th percentile (third quartile)  
- IQR = width of the interval containing the middle 50% of the data, often used to detect outliers
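
One common convention for IQR-based outlier detection, assumed here and not taken from `stats_toolkit.py`, flags any value more than $1.5 \times \text{IQR}$ below Q1 or above Q3. The sketch below applies that rule to a small hypothetical array (`sample`) using `np.percentile` with its default linear interpolation.

```python
import numpy as np

# Small hypothetical array with one obvious extreme value.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# First and third quartiles (default linear interpolation).
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, not a law).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = sample[(sample < lower) | (sample > upper)]
print(outliers)  # [100.]
```

For the quartiles printed in this notebook (9.75 and 34.25), the same arithmetic gives $\text{IQR} = 24.5$ and fences at $-27$ and $71$.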


In [9]:
print("Quantiles (25%, 50%, 75%):", st.quantiles(data))


Quantiles (25%, 50%, 75%): [ 9.75 22.   34.25]


In [None]:
print("Skewness:", st.skewness(data))


In [None]:
print("Kurtosis:", st.kurtosis(data))