# Technical Walkthrough for Measures of Dispersion

Situation 1: Characteristic range for flower petal lengths.


Let's revisit our list of petal lengths for members of the Iris virginica species. In a previous lesson, we computed measures of central tendency for this data (i.e. the mean and median). This data was symmetrically distributed, without outliers and with a clear value of central tendency. In this case, all the measures of dispersion (standard deviation, MAD and IQR) should be on the same scale. However, because they measure different things, we should not expect them to be identical. Let's go ahead and calculate all of these. First, we start with the variance, which we will use to compute the standard deviation:

In [5]:
virginica_petal_length = [6.0, 5.1, 5.9, 5.6, 5.8, 6.6, 4.5, 6.3, 5.8, 6.1, 5.1, 5.3, 5.5, 5.0, 5.1, 5.3, 5.5, 6.7, 6.9, 5.0, 5.7, 4.9, 6.7, 4.9, 5.7, 6.0, 4.8, 4.9, 5.6, 5.8, 6.1, 6.4, 5.6, 5.1,
5.6, 6.1, 5.6, 5.5, 4.8, 5.4, 5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1]

We now write a function that takes a list of numbers as input and returns the sample variance as output, using the corrected formula:

$$ s_{corr}^2 = \frac{\sum_i^N (x_i - \bar{x})^2}{N-1}$$

In [6]:
def get_variance(sample):

    # First, calculate the sample mean 
    sample_mean = sum(sample)/len(sample)
    
    sum_of_squares = 0
    for length in sample:
        # Now, calculate the sum of squares by subtracting the sample mean
        # from each length, squaring the result, and adding it to the total
        sum_of_squares += (length - sample_mean)**2
        
    # Divide the sum of squares by the number of items in the sample -1 to calculate variance 
    variance = sum_of_squares/(len(sample)- 1)
    
    return variance


Applying this function to our list of petal lengths:

In [7]:
get_variance(virginica_petal_length)

0.30458775510204084

#### Calculating variance with NumPy
The same calculation can be done readily with NumPy. The steps are:
- make sure you have `numpy` imported
- convert the list into an array
- call the `.var(ddof = 1)` method on your array. This calculates the corrected sample variance. The `ddof = 1` argument is what provides the $N-1$ correction to the variance.

In [8]:
import numpy as np

virginica_array = np.array(virginica_petal_length)
virginica_array.var(ddof = 1)

0.30458775510204084
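
To see what the $N-1$ correction changes, here is a minimal sketch using only Python's standard `statistics` module (a choice made for this example; the lesson itself uses NumPy). `statistics.pvariance` divides the sum of squares by $N$, while `statistics.variance` divides by $N-1$, which corresponds to `ddof = 1`. The `toy_sample` list is made up for illustration and is not part of the lesson data.

```python
import statistics

# Hypothetical toy sample (not from the lesson); its mean is exactly 5
toy_sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Uncorrected variance: sum of squared deviations divided by N
uncorrected = statistics.pvariance(toy_sample)

# Corrected sample variance: sum of squared deviations divided by N - 1
corrected = statistics.variance(toy_sample)

n = len(toy_sample)
# Rescaling the corrected value by (N - 1) / N recovers the uncorrected one
rescaled = corrected * (n - 1) / n

print(uncorrected, corrected, rescaled)
```

The corrected value is always slightly larger than the uncorrected one, and the gap shrinks as the sample grows.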

Now let's calculate the sample standard deviation. The corrected formula, based on the variance, is:

$$ s_{corr} = \sqrt{\frac{\sum_i^N (x_i - \bar{x})^2}{N-1}}$$

First, let's write a function that builds on our `get_variance` function:

In [9]:
# write function for the standard deviation

def get_stddev(sample):

    stdev = (get_variance(sample))**0.5 # this takes the square root of the corrected sample variance

    return stdev

Applying this function on our petal lengths yields our corrected sample standard deviation:

In [10]:
get_stddev(virginica_petal_length)

0.5518946956639834

Using NumPy is also fairly straightforward:
- just call the `.std(ddof = 1)` method on our petal length array

In [11]:
virginica_array.std(ddof = 1)

0.5518946956639834

Now let us calculate the MAD (mean absolute deviation):

$$ MAD = \frac{\sum_i^N |x_i - \bar{x}|}{N}$$

We can write a function that takes on a form similar to our `get_variance` function. You will need to use the Python abs() function for this:

In [44]:

def get_mad(sample):

    # First, calculate the sample mean 
    sample_mean = sum(sample)/len(sample)
    
    sum_of_devs = 0
    for length in sample:
        # Find the deviation by subtracting the sample mean,
        # take its absolute value, and add it to the total
        sum_of_devs += abs(length - sample_mean) 
        
    # Divide the sum of deviations by the number of items in the sample to calculate MAD  
    mad = sum_of_devs/(len(sample))
    
    return mad


Extracting the MAD for our petal lengths:

In [45]:
get_mad(virginica_petal_length)

0.43999999999999995

This is comparable in magnitude to the standard deviation.

Now let's calculate the IQR. In principle, we could write a function that sorts the data, splits it into quarters, and then extracts the range of the middle half. We are not going to take this route here. Instead, we will show you two methods to extract the IQR using `numpy` and `scipy`.

Method 1:
- This relies on calculating the 25th and 75th percentiles (Q1 and Q3, respectively) using `numpy` and subtracting them to obtain the IQR:
$$ IQR = Q3 - Q1$$
- The `numpy` function `np.percentile()` is used to calculate percentiles as follows:

In [14]:
# calculating the 25th percentile (Q1)
Q1 = np.percentile(virginica_petal_length , 25)
Q1

5.1

We can do the same for the 75th percentile (Q3):

In [15]:
Q3 = np.percentile(virginica_petal_length , 75)
Q3

5.875

The IQR then is:

In [16]:
IQR_petals = Q3 - Q1
IQR_petals

0.7750000000000004
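
For readers curious about the "sort and split" route mentioned earlier, here is a minimal pure-Python sketch of a percentile computed by linear interpolation between neighbouring sorted values, which is the default behaviour of `np.percentile`. The function name `percentile_linear` and the `toy_sample` list are hypothetical, introduced only for this illustration, and the sketch assumes a non-empty list.

```python
import math

def percentile_linear(data, q):
    """Return the q-th percentile (0 <= q <= 100) by linear interpolation."""
    ordered = sorted(data)
    # Fractional position of the percentile among the sorted values
    position = (q / 100) * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    # Interpolate between the two neighbouring sorted values
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical toy sample (not from the lesson)
toy_sample = [8, 1, 6, 3, 5, 2, 7, 4]

q1 = percentile_linear(toy_sample, 25)
q3 = percentile_linear(toy_sample, 75)
toy_iqr = q3 - q1
print(q1, q3, toy_iqr)
```

With eight sorted values, the 25th percentile sits three quarters of the way between the second and third values, which is why the quartiles need not be actual data points.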

Method 2:
This involves importing `scipy`, a powerful scientific computing package built on `numpy`, and using the `iqr` function from scipy's `stats` submodule:

In [17]:
from scipy.stats import iqr
iqr(virginica_petal_length)

0.7750000000000004

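Remember that the IQR is the *full characteristic range*, not just a measure of deviation from the value of central tendency. It is therefore best compared with twice the MAD (about 0.88) or twice the standard deviation (about 1.10). On that basis, the IQR (about 0.78) is in rough concordance with the range estimates from the MAD and standard deviation.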
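
As a rough sanity check on how the three measures relate, the sketch below compares the ratios observed for the petal lengths with the ratios expected for an exactly normal distribution. Normality is an assumption made only for this comparison; the lesson states only that the data are symmetric and free of outliers. For a normal distribution the mean absolute deviation is $\sigma\sqrt{2/\pi}$ and the IQR is about $1.35\sigma$. The variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values reported earlier in this walkthrough for the petal lengths
std_petals = 0.5518946956639834
mad_petals = 0.43999999999999995
iqr_petals = 0.7750000000000004

# Ratios expected if the data were exactly normally distributed
expected_mad_ratio = math.sqrt(2 / math.pi)          # roughly 0.80
expected_iqr_ratio = 2 * NormalDist().inv_cdf(0.75)  # roughly 1.35

observed_mad_ratio = mad_petals / std_petals
observed_iqr_ratio = iqr_petals / std_petals

print(observed_mad_ratio, expected_mad_ratio)
print(observed_iqr_ratio, expected_iqr_ratio)
```

The observed ratios land close to the normal-theory values, which fits the claim that all three measures describe the same spread on the same scale.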

Situation 2: Characteristic range for weekly sales of air conditioner units.

Recall that this data contained large outliers. There were two weeks, in particular, where the store made considerably more on air conditioner sales than in other weeks. We will now see how the presence of outliers affects the different measures of dispersion:

In [18]:
weekly_sales = [5333, 4211, 35204, 4409, 5884, 5100, 4964, 3786, 5494, 5344, 5233, 5713,
                4481, 4325, 5105, 4167, 5011, 5088, 5732, 4623, 6217, 5492, 4189, 6435,
                5270, 5720, 5437, 6572, 4203, 26110, 4105, 3993, 5546, 4751, 4139, 4431,
                4271, 5537, 6177, 4439, 5394, 6525, 4512, 3940, 4078, 5430, 5112, 5255, 5379,
                5493, 6659, 4826]


Computing the standard deviation:

In [20]:
np.array(weekly_sales).std(ddof = 1)

5104.157200459102

Computing MAD:

In [21]:
get_mad(weekly_sales)

1979.286982248521

Computing the IQR:

In [22]:
iqr(weekly_sales)

1113.75

There are pretty large differences here, particularly between the standard deviation and the other measures of dispersion. It's useful to look at a histogram of weekly sales.

![image](weekly_sales_outlier.png)

The standard deviation is influenced heavily by the outliers. An inspection of the main cluster of values shows that an estimate of the characteristic deviation of ~5100 is much too high. The MAD (~1980) is more reasonable but is still somewhat high as an estimate of the characteristic deviation from the value of central tendency, because the MAD is also influenced by the high-valued outliers. The IQR estimates the range of the middle half of the data, and its value (~1114) seems reasonable upon inspecting the bulk of the distribution.

Now let us remove outliers and see what happens to the three measures of dispersion:


In [33]:
weekly_sales_no_outliers = [elem for elem in weekly_sales if elem < 20000]
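
The cutoff of 20000 above was chosen by eye for this particular data set. A common alternative, shown here only as a hedged sketch and not as the method used in this lesson, is the 1.5 × IQR rule (Tukey's fences): flag any value more than 1.5 IQRs below Q1 or above Q3. The sketch uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between sorted values; the names `lower_fence`, `upper_fence`, `flagged` and `kept` are illustrative.

```python
import statistics

weekly_sales = [5333, 4211, 35204, 4409, 5884, 5100, 4964, 3786, 5494, 5344, 5233, 5713,
                4481, 4325, 5105, 4167, 5011, 5088, 5732, 4623, 6217, 5492, 4189, 6435,
                5270, 5720, 5437, 6572, 4203, 26110, 4105, 3993, 5546, 4751, 4139, 4431,
                4271, 5537, 6177, 4439, 5394, 6525, 4512, 3940, 4078, 5430, 5112, 5255, 5379,
                5493, 6659, 4826]

# Quartiles via linear interpolation between sorted values
q1, _, q3 = statistics.quantiles(weekly_sales, n=4, method="inclusive")
iqr_sales = q3 - q1

# Tukey's fences: 1.5 IQRs beyond the quartiles
lower_fence = q1 - 1.5 * iqr_sales
upper_fence = q3 + 1.5 * iqr_sales

flagged = [x for x in weekly_sales if x < lower_fence or x > upper_fence]
kept = [x for x in weekly_sales if lower_fence <= x <= upper_fence]

print(flagged)    # the weeks treated as outliers under this rule
print(len(kept))  # number of weeks retained
```

For this data the rule flags the same two high-sales weeks that the fixed cutoff removes, but it does so without a hand-picked threshold.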

Calculating the standard deviation with outliers removed:

In [37]:
np.array(weekly_sales_no_outliers).std(ddof = 1)

760.6934001157049

This is a substantial reduction in the standard deviation! It is certainly more characteristic of the dispersion about the mean than the previous value when outliers were included.

Calculating the MAD with outliers removed:

In [35]:
get_mad(weekly_sales_no_outliers)

627.9680000000001

The MAD (Mean Absolute Deviation) also sees a substantial drop from its previous value when outliers were included. This drop is not as drastic as in the case of the standard deviation. Note that now the MAD and standard deviation are fairly similar in value.

Calculating the IQR with outliers removed:

In [36]:
iqr(weekly_sales_no_outliers)

1079.25

The IQR remains stable, changing only from 1113.75 to 1079.25. This is consistent with the notion that the IQR is largely insensitive to outliers.
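
To put the three changes side by side, the short sketch below computes the percentage drop in each measure after the outliers are removed, using the values reported in this walkthrough. The dictionary names `reported` and `percent_drop` are illustrative.

```python
# Values reported in this walkthrough: (with outliers, without outliers)
reported = {
    "standard deviation": (5104.157200459102, 760.6934001157049),
    "MAD": (1979.286982248521, 627.9680000000001),
    "IQR": (1113.75, 1079.25),
}

percent_drop = {}
for name, (with_outliers, without_outliers) in reported.items():
    # Relative reduction, expressed as a percentage of the original value
    percent_drop[name] = 100 * (with_outliers - without_outliers) / with_outliers
    print(f"{name}: {percent_drop[name]:.1f}% drop")
```

The standard deviation falls by roughly 85%, the MAD by roughly 68%, and the IQR by only about 3%, which ranks the three measures by their sensitivity to outliers.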