# **Day03 of Machine Learning**

## **Estimates of Variability** 
Variability, also known as dispersion, measures whether the data values are tightly clustered or spread out.

**Estimates of variability** (also called measures of dispersion or spread) describe how far the data values scatter around a measure of central tendency (mean, median, or mode). They help in judging the range and consistency of the data, and how well a central measure represents the dataset: the larger the spread, the less a single central value says about a typical observation.

The main estimates of variability are:

#### **1. Mean Absolute Deviation** 
The Mean Absolute Deviation is a measure of variability that tells us how spread out the data is, on average, from the mean (or another central point like the median). It is particularly useful for understanding how much the data points deviate from a central value in absolute terms, meaning we disregard the direction (positive or negative) of the deviation.
- A deviation is the difference between a data point and the mean (or another central point). It can be positive or negative, so the absolute value is taken before averaging; otherwise the deviations from the mean would cancel out to zero.
- It gives the average distance between each data point and the mean, offering a clear picture of the dataset's variability.
- It is simpler to interpret and less sensitive to extreme values (outliers) than the standard deviation, because the deviations are not squared.

$$
    \text{Mean Absolute Deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|
$$

In [21]:
import statistics
import numpy as np
import pandas as pd
import scipy.stats as stats

In [22]:
marks = [77, 59, 64, 85, 75, 68, 80, 73, 59]

In [23]:
arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])

In [24]:
rock = pd.read_csv("datasets/rockR.csv", index_col=0)

In [25]:
#using numpy
arr_mean = np.mean(arr)

mean_absolute_deviation = np.mean(np.abs(arr - arr_mean))
mean_absolute_deviation

6.888888888888889

In [26]:
#using pandas dataframe
rock_mean_abs_dev = np.mean(np.abs(rock - np.mean(rock, axis=0)), axis=0)
rock_mean_abs_dev

area     2133.846354
peri     1330.352646
shape       0.064423
perm      397.125000
dtype: float64
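
To illustrate the claim that the mean absolute deviation reacts less strongly to outliers than the standard deviation, here is a minimal sketch. The helper `mean_abs_dev` and the extra value `200` are hypothetical, introduced only for this comparison; they are not part of the original notebook.

```python
import numpy as np

def mean_abs_dev(values):
    """Average absolute distance of each value from the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])
arr_outlier = np.append(arr, 200)  # assumed extreme value, for illustration only

# How many times larger each estimate becomes once the outlier is added
mean_abs_dev_growth = mean_abs_dev(arr_outlier) / mean_abs_dev(arr)
sd_growth = np.std(arr_outlier) / np.std(arr)

print("mean absolute deviation grows by a factor of", round(mean_abs_dev_growth, 2))
print("standard deviation grows by a factor of", round(sd_growth, 2))
```

Both estimates increase, but the standard deviation increases by the larger factor, since squaring amplifies the single large deviation.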

&nbsp;

#### **2. Median Absolute Deviation** 
The Median Absolute Deviation (MAD) is a robust measure of the variability or dispersion in a dataset. Unlike the mean absolute deviation or the standard deviation, which are sensitive to outliers, the median absolute deviation focuses on the median (instead of the mean), making it more resistant to the influence of extreme values. It is widely used in robust statistics as a reliable indicator of the spread when data contains outliers.
- It is largely unaffected by a small number of extreme values, making it more reliable for skewed data or datasets with outliers.
- When the data doesn’t follow a normal distribution or has heavy tails, it is often a more appropriate measure than the standard deviation.
- It can be used to detect outliers: values that lie more than a few MADs away from the median can be considered outliers.

$$
    \text{Median Absolute Deviation (MAD)} = \text{median} \left( |x_i - \text{median}(x)| \right)
$$

In [27]:
#using scipy
mad = stats.median_abs_deviation(arr)
mad

4.0

In [28]:
#using scipy on a dataframe (computed per column, since axis=0 is the default)
stats.median_abs_deviation(rock)

array([1.87950e+03, 1.33993e+03, 4.50255e-02, 1.18800e+02])
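
The outlier-detection idea mentioned above can be sketched as follows. The cutoff of 3 MADs and the appended value `200` are assumptions chosen for illustration; the text only says "a few MADs", and the right threshold depends on the data.

```python
import numpy as np
from scipy import stats

arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])
data = np.append(arr, 200)  # assumed extreme value, for illustration only

median = np.median(data)
mad = stats.median_abs_deviation(data)

threshold = 3  # assumed cutoff: "a few MADs" is not a fixed rule
distance_in_mads = np.abs(data - median) / mad

# Keep only the values lying more than `threshold` MADs from the median
outliers = data[distance_in_mads > threshold]
print(outliers)
```

Note that this divides by the MAD, so it would need a guard for datasets where more than half the values are identical (the MAD is then 0).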

&nbsp;

#### **3. Variance** 
Variance is the average of the squared differences between each data point and the mean. It measures how much the data points deviate from the mean on average.
- It gives more weight to values far from the mean because it squares the differences. 
- Larger variances indicate that data points are more spread out from the mean.
- The units of variance are the square of the original data units, making interpretation less intuitive.

$$
    \text{Population Variance}(\sigma^2) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

$$
    \text{Sample Variance} (s^2) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

In [29]:
#using numpy (default ddof=0, i.e. population variance: divides by N)
np.var(arr)

71.33333333333333

In [30]:
#using numpy on a dataframe (population variance per column, ddof=0)
np.var(rock, axis=0)

area     7.052981e+06
peri     2.006953e+06
shape    6.826414e-03
perm     1.876914e+05
dtype: float64

In [31]:
#using pandas (default ddof=1, i.e. sample variance: divides by n-1, hence the slightly larger values)
rock.var()

area     7.203045e+06
peri     2.049654e+06
shape    6.971657e-03
perm     1.916848e+05
dtype: float64
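
The NumPy and pandas results differ because the two libraries use different defaults for the divisor. A minimal sketch on `arr` showing how the `ddof` ("delta degrees of freedom") argument switches between the population and sample formulas:

```python
import numpy as np
import pandas as pd

arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])

population_var = np.var(arr)        # ddof=0 (NumPy default): divide by N
sample_var = np.var(arr, ddof=1)    # ddof=1: divide by n - 1
pandas_var = pd.Series(arr).var()   # pandas default is ddof=1

print(population_var)  # matches the np.var(arr) result above
print(sample_var)      # 80.25
print(pandas_var)      # same as sample_var
```

Passing `ddof=0` to pandas (`rock.var(ddof=0)`) or `ddof=1` to NumPy makes the two libraries agree.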

&nbsp;

#### **4. Standard Deviation** 
Standard deviation (SD) is the square root of variance, bringing the units back to the same scale as the original data. It indicates the typical distance of a data point from the mean.
- It is more interpretable than variance because it has the same units as the original data.
- A smaller standard deviation indicates that data points are clustered closely around the mean, while a larger standard deviation indicates that data points are spread out.

$$
    \text{Population Standard Deviation} (\sigma) = \sqrt{\text{Population Variance}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
$$

$$
    \text{Sample Standard Deviation} (s) = \sqrt{\text{Sample Variance}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

&nbsp;

In [32]:
#using numpy (default ddof=0, i.e. population standard deviation)
np.std(arr)

8.445906306213285

In [33]:
#using numpy on a dataframe (population standard deviation per column, ddof=0)
np.std(rock, axis=0)

area     2655.744958
peri     1416.669535
shape       0.082622
perm      433.233616
dtype: float64

In [34]:
#using pandas (default ddof=1, i.e. sample standard deviation, hence the slightly larger values)
rock.std()

area     2683.848862
peri     1431.661164
shape       0.083496
perm      437.818226
dtype: float64
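
As a cross-check, the two formulas above can be compared against Python's built-in `statistics` module, which provides separate functions for the population (`pstdev`) and sample (`stdev`) versions. This is a small sketch on `arr`, not part of the original notebook:

```python
import statistics
import numpy as np

arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])
values = arr.tolist()  # plain Python ints for the statistics module

population_sd = np.std(arr)       # sqrt of the population variance (ddof=0)
sample_sd = np.std(arr, ddof=1)   # sqrt of the sample variance (ddof=1)

print(population_sd, statistics.pstdev(values))
print(sample_sd, statistics.stdev(values))

# The standard deviation squared gives back the variance
print(population_sd ** 2, np.var(arr))
```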

&nbsp;

#### **5. Range** 
The range is a basic measure of dispersion in statistics that represents the difference between the highest and lowest values in a dataset. It gives an idea of how spread out the data points are but doesn’t provide detailed information about the distribution of values between the extremes.
- It only accounts for the two extreme values (the maximum and minimum) in a dataset. As a result, it ignores the distribution of the other data points.
- The range is highly sensitive to outliers because even one unusually high or low value can dramatically change the range.
- While the range gives a sense of the spread of data, it does not provide any insight into the shape or variability of the distribution between the minimum and maximum values.

$$
    \text{Range} = \text{Max} - \text{Min}
$$

In [35]:
#using numpy
arr_range = np.max(arr) - np.min(arr)
arr_range

27

In [36]:
#using pandas
rock_range = rock.max() - rock.min()
rock_range

area     11196.000000
peri      4555.578000
shape        0.373795
perm      1293.700000
dtype: float64
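
NumPy also offers `np.ptp` ("peak to peak") as a shortcut for max minus min. The sketch below uses it to show how a single extreme value changes the range; the appended value `200` is an assumption for illustration.

```python
import numpy as np

arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])
arr_outlier = np.append(arr, 200)  # assumed extreme value, for illustration only

arr_range = np.ptp(arr)              # same as np.max(arr) - np.min(arr)
outlier_range = np.ptp(arr_outlier)  # one new value changes the result completely

print(arr_range)      # 27
print(outlier_range)  # 191
```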

&nbsp;

#### **6. IQR - Interquartile Range** 
The Interquartile Range (IQR) is a robust measure of statistical dispersion that indicates the spread of the middle 50% of the data. It measures the difference between the first quartile (Q1) and the third quartile (Q3), thus showing where the central portion of the dataset lies and ignoring the extreme values (outliers).
- Unlike the range, which is sensitive to extreme values, the IQR focuses on the middle 50% of the data and completely ignores the highest and lowest quarters. This makes it a more reliable measure when outliers are present.
- The IQR is based on quartiles, which divide the data into four equal parts. The IQR captures the spread of the central two quartiles. This helps analysts understand where the "bulk" of the data lies, giving insights into the distribution's concentration.

$$
    \text{IQR} = Q_3 - Q_1
$$

In [37]:
#using numpy (np.percentile expects percentiles on a 0-100 scale, so 25 and 75, not 0.25 and 0.75)
q1 = np.percentile(arr, 25)
q3 = np.percentile(arr, 75)

arr_iqr = q3 - q1
arr_iqr

8.0

In [38]:
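#using scipy (without an axis argument the dataframe is flattened, so this is a single IQR over all columns pooled together; pass axis=0 for per-column values)
stats.iqr(rock)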

4342.75646875

In [39]:
#using pandas
rock_q1 = rock.quantile(0.25)
rock_q3 = rock.quantile(0.75)

In [40]:
rock_iqr = rock_q3 - rock_q1
rock_iqr

area     3564.250000
peri     2574.615000
shape       0.100408
perm      701.050000
dtype: float64
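
A short sketch comparing the three routes to the IQR on `arr`, and checking the robustness claim by appending one extreme value (the value `200` is an assumption for illustration). All three use linear interpolation between data points by default, so they should agree.

```python
import numpy as np
import pandas as pd
from scipy import stats

arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])

# NumPy: percentiles are given on a 0-100 scale
q1, q3 = np.percentile(arr, [25, 75])
iqr_numpy = q3 - q1

# SciPy: computes Q3 - Q1 directly
iqr_scipy = stats.iqr(arr)

# pandas: quantiles are given on a 0-1 scale
series = pd.Series(arr)
iqr_pandas = series.quantile(0.75) - series.quantile(0.25)

print(iqr_numpy, iqr_scipy, iqr_pandas)

# Effect of one extreme value on the IQR compared with the range
arr_outlier = np.append(arr, 200)  # assumed extreme value
iqr_outlier = stats.iqr(arr_outlier)
range_outlier = np.max(arr_outlier) - np.min(arr_outlier)
print(iqr_outlier, range_outlier)
```

The IQR moves only a little when the outlier is added, while the range jumps from 27 to 191.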