# Measures of Spread
---

## Import Python Libraries

In [3]:
# import Python libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats

---

## Left Align Cell Contents

In [4]:
%%html
<style>
table {float:left}
</style>

---

**Spread** describes how far the data is dispersed around its center. Common measures of spread are:

- range
- interquartile range (IQR)
- variance
- standard deviation

---

## Range

Range is the difference between the largest and smallest values in a data set.

In [12]:
golf_scores = [66,67,67,68,68,68,68,69,69,69,69,70,70,71,71,72,73,75]
print(f'Number of golf scores {len(golf_scores)}')
# Use max() and min() so the result does not depend on the list being sorted
print(f'Range of golf scores is {max(golf_scores) - min(golf_scores)}')

Number of golf scores 18
Range of golf scores is 9


---

## Interquartile Range (IQR)

Imagine splitting a sorted data set in half at the median. Then imagine splitting each half at its own median to get quarters. Each quarter holds 25% of the data, and the values at the split points are the quartiles.

- **Q1** = median of the lower half; marks the top of the bottom 25% of the data
- **Q2** = median of the whole data set; marks the top of the bottom 50% of the data
- **Q3** = median of the upper half; marks the top of the bottom 75% of the data

The interquartile range, or **IQR**, is the difference between the median of the upper half of the data and the median of the lower half: **Q3 - Q1**.

Calculating IQR for an **even** number of data points (sorted):
1. Split the data into 2 halves.
2. Calculate the median of each half: 1) the lower-half median is Q1, 2) the upper-half median is Q3.
3. The IQR is Q3 - Q1.

Calculating IQR for an **odd** number of data points (sorted):
1. Find the median of the data, which is the number in the middle of the data set.
2. Split the data into 2 halves: 1) lower half - all the data up to but not including the median, 2) upper half - all the data after but not including the median.
3. Calculate the median of each half: 1) the lower-half median is Q1, 2) the upper-half median is Q3.
4. The IQR is Q3 - Q1.


Example:

Golf scores: 66,67,67,68,68,68,68,69,69,69,69,70,70,71,71,72,73,75.  
There are 18 golf scores, so split the data into 2 halves of 9 scores each.  
Lower half: 66,67,67,68,68,68,68,69,69.  
Median of lower half, or Q1, is 68.  
Upper half: 69,69,70,70,71,71,72,73,75.  
Median of upper half, or Q3, is 71.  
IQR = 71 - 68 = 3.  
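
The manual steps above can be sketched in Python. This is a minimal illustration rather than a library routine: the function name `iqr_median_of_halves` is made up for this example, and it assumes the data set has at least two values.

```python
from statistics import median


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method.

    Hypothetical helper for illustration; assumes at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count, skip the middle value (the median) itself
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1


golf_scores = [66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 71, 71, 72, 73, 75]
q1, q3, iqr = iqr_median_of_halves(golf_scores)
print(f'Q1 = {q1}, Q3 = {q3}, IQR = {iqr}')
```

For the 18 golf scores this reproduces the hand calculation: Q1 = 68, Q3 = 71, IQR = 3.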

## Using scipy.stats

In [13]:
# Calculate the interquartile range
iqr_value = scipy.stats.iqr(golf_scores)

print(f"The interquartile range (IQR) of the data set is: {iqr_value}")

The interquartile range (IQR) of the data set is: 2.75


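Note:  
`scipy.stats.iqr` doesn't follow the algorithm described above.  
Instead it computes `np.percentile(x, 75) - np.percentile(x, 25)`, which by default interpolates linearly between neighboring values of the sorted data rather than taking the median of each half.  
Here the 75th percentile falls between 70 and 71 and is interpolated to 70.75, while the 25th percentile is 68, so the result is 2.75 instead of the 3 from the manual calculation.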
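
To see where the 2.75 comes from, the two percentiles can be computed separately. This is a small sketch that relies on the default linear interpolation of `np.percentile`; the variable names are chosen for this example only.

```python
import numpy as np

golf_scores = [66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 71, 71, 72, 73, 75]

# Default behaviour: linear interpolation between neighboring sorted values
q1, q3 = np.percentile(golf_scores, [25, 75])

print(f'25th percentile (Q1) = {q1}')
print(f'75th percentile (Q3) = {q3}')
print(f'IQR = {q3 - q1}')
```

With 18 values, the 75th percentile sits at zero-based position 0.75 × 17 = 12.75 in the sorted data, three quarters of the way from 70 to 71, which gives 70.75.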

---

## Outliers

Outliers are points in the data that lie unusually far from the rest of the data.  

A common rule for identifying outliers is the 1.5 IQR rule:

- low outlier is any value less than Q1 - 1.5(IQR)
- high outlier is any value greater than Q3 + 1.5(IQR)
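
As a rough sketch, the 1.5 IQR rule can be applied to the golf scores using the quartiles from the manual calculation (Q1 = 68, Q3 = 71). The helper name `iqr_outliers` and the extra score of 80 are made up for illustration.

```python
def iqr_outliers(values, q1, q3):
    """Return (low_fence, high_fence, outliers) under the 1.5 IQR rule.

    Hypothetical helper; q1 and q3 are supplied by the caller.
    """
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return low_fence, high_fence, outliers


golf_scores = [66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 71, 71, 72, 73, 75]

low_fence, high_fence, outliers = iqr_outliers(golf_scores, q1=68, q3=71)
print(f'Fences: {low_fence} to {high_fence}, outliers: {outliers}')

# Adding a made-up score of 80 (not part of the data set) to show a high outlier
_, _, outliers_with_80 = iqr_outliers(golf_scores + [80], q1=68, q3=71)
print(f'Outliers after adding 80: {outliers_with_80}')
```

With these quartiles the fences are 63.5 and 75.5, so none of the 18 golf scores is flagged, while the made-up score of 80 would be.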

---

## Variance

The formula for the **population variance** $\sigma^2$ is given by:

$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $

where:
- $\sigma^2$ is the population variance
- $N$ is the number of observations in the population
- $x_i$ represents each individual observation
- $\mu$ is the population mean


The formula for the **sample variance** $s^2$ is given by:

$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $

where:
- $s^2$ is the sample variance
- $n$ is the number of observations in the sample
- $x_i$ represents each individual observation
- $\bar{x}$ is the sample mean


Variance measures how far the data is spread from the mean.  
Both formulas start from the difference between each observation and the mean, which is the deviation of that observation from the mean.  
Squaring the deviations makes every term non-negative, so observations below the mean do not cancel out observations above it.  
The units of variance are the square of the units of the observations, which is usually hard to interpret in the context of the data.
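
The two formulas can be checked against NumPy with a short sketch. It treats the golf scores first as a whole population and then as a sample, purely for illustration; the variable names are chosen for this example.

```python
import numpy as np

scores = np.array([66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 71, 71, 72, 73, 75])

# Deviation of each observation from the mean, then squared
deviations = scores - scores.mean()
sum_of_squares = (deviations ** 2).sum()

# Population variance divides by N, sample variance divides by n - 1
population_variance = sum_of_squares / len(scores)
sample_variance = sum_of_squares / (len(scores) - 1)

print(f'Population variance: {population_variance:.4f} (np.var: {np.var(scores):.4f})')
print(f'Sample variance:     {sample_variance:.4f} (np.var, ddof=1: {np.var(scores, ddof=1):.4f})')
```

`np.var` divides by N by default (`ddof=0`); passing `ddof=1` switches the divisor to n - 1.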

---

## Standard Deviation

The formula for the **population standard deviation** $\sigma$ is given by:

$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $

where:
- $\sigma$ is the population standard deviation
- $N$ is the number of observations in the population
- $x_i$ represents each individual observation
- $\mu$ is the population mean


The formula for the **sample standard deviation** $s$ is given by:

$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $

where:
- $s$ is the sample standard deviation
- $n$ is the number of observations in the sample
- $x_i$ represents each individual observation
- $\bar{x}$ is the sample mean


Standard deviation is a measure of how much the data varies from the mean. It is the square root of the variance.  
The units of the standard deviation will be the same units as the observations.  
The larger the value of the standard deviation the more the data varies, or is spread, from the mean.
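
A minimal sketch of both standard deviations with NumPy, again using the golf scores only as illustrative data:

```python
import numpy as np

scores = np.array([66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 71, 71, 72, 73, 75])

# Population standard deviation: square root of the variance with divisor N
population_std = np.std(scores)

# Sample standard deviation: divisor n - 1
sample_std = np.std(scores, ddof=1)

print(f'Population standard deviation: {population_std:.4f}')
print(f'Sample standard deviation:     {sample_std:.4f}')
```

The defaults differ between libraries: `np.std` uses `ddof=0` (population), while pandas `Series.std()` uses `ddof=1` (sample), so it is worth stating the choice explicitly.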

---