### Lab 6 - Measures of spread (Range, Variance, and Standard Deviation)

In this lab, we will learn some different ways to measure the *spread* of data: range, variance, and standard deviation.

### 6.1 Properties of the variance
First, we'll look at some properties of the variance with made-up data.

Import the matplotlib and pandas packages, and set plots to appear in the Jupyter notebook.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The following code creates a new dataframe called `data` containing the values 3, 5, 2, 7, 7.

In [2]:
data = pd.DataFrame([3,5,2,7,7])

Display the dataframe `data`.

In [3]:
data

   0
0  3
1  5
2  2
3  7
4  7


<details> <summary>Hint:</summary>
    We display this dataframe the same way as dataframes made from a CSV file:
    <code>data</code>
</details>

Is this what you expected to display?

To compute the sample variance of `data`, run the code below.

In [5]:
data.var()

0    5.2
dtype: float64
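To see where this number comes from, here is a minimal sketch that computes the sample variance by hand, assuming the usual definition (sum of squared deviations from the mean, divided by n - 1). The names `values`, `squared_devs`, and `manual_var` are made up for this illustration.

```python
import pandas as pd

# Same made-up data as above, stored as a Series for simplicity
values = pd.Series([3, 5, 2, 7, 7])

n = len(values)                       # number of data values
mean = values.mean()                  # sample mean
squared_devs = (values - mean) ** 2   # squared distance of each value from the mean

# Sample variance: divide by n - 1 rather than n
manual_var = squared_devs.sum() / (n - 1)

print(manual_var)     # should agree with the pandas result (about 5.2)
print(values.var())
```

Both printed values should match, up to floating-point rounding.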

What is the sample variance if all the data is the same number?  For example, if the data is 2,2,2,2,2.

We can verify our guess with the code below.

In [8]:
data = pd.DataFrame([2,2,2,2,2])
data.var()

0    0.0
dtype: float64

Let's return to the original made-up data set.  Can you change, add, and/or remove numbers to increase its sample variance?   Try making the changes below. The sample variance of the original data is 5.2.

In [9]:
data = pd.DataFrame([3,5,2,7,7])
data.var()

0    5.2
dtype: float64

Now, try changing, adding, and/or removing numbers to decrease the variance.  The sample variance of the original data is 5.2.

In [None]:
data = pd.DataFrame([3,5,2,7,7])
data.var()

The above exercises help you understand how the sample variance is affected by changes in the data.

### 6.2 Reading the data
We'll now look at range, variance, and standard deviation using the 2019 Green Taxi Trip dataset from Lab 4.

Load the data from the CSV file into a dataframe called `taxi`.

Check that the dataframe was created properly by displaying it.

Our measures of spread can only be used with quantitative data.  Which columns contain quantitative data?

### 6.3 Mean, median, and shape of trip distances

We will look at the `trip_distance` column, which is the distance of the taxi trip in miles.  First let's compute the mean and median, which we learned how to compute in Lab 5.

What is the mean trip distance?

<details> <summary>Answer:</summary>
    <code>taxi["trip_distance"].mean()</code>
</details>

What is the median trip distance?

<details> <summary>Pattern:</summary>
    <code>dataframe_name["column_name"].median()</code>
</details>

Are the mean and median significantly different?  Why do you think this is the case?  It might be helpful to plot the histogram of the trip distances, to remember their distribution.  Do so below:

<details> <summary>Pattern:</summary>
    <code>dataframe_name["column_name"].hist(bins = 80)</code>
</details>

The long trips are increasing the mean, but not the median.
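A small made-up example (not the taxi data) can illustrate this effect. The two Series below are hypothetical; they differ only in the last value, which plays the role of one very long trip.

```python
import pandas as pd

# Hypothetical trip distances in miles
short_trips = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0])
with_long_trip = pd.Series([1.0, 1.5, 2.0, 2.5, 30.0])  # last trip replaced by a long one

# The mean is pulled toward the long trip; the median stays at the middle value
print(short_trips.mean(), short_trips.median())
print(with_long_trip.mean(), with_long_trip.median())
```

The median depends only on the middle of the sorted data, so a single extreme value leaves it unchanged here.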

### 6.4 Range, variance, and standard deviation

The first measure of spread is the *range*, which is calculated as *range = max data value - min data value*.

We computed max and min data values in Lab 5.  Can you figure out how to compute the range for the trip distance data?  You can store values in variables if you want.

<details> <summary>Answer:</summary>
    <code>taxi["trip_distance"].max() - taxi["trip_distance"].min()</code><br>
    or<br>
    <code>max_dist = taxi["trip_distance"].max()
min_dist = taxi["trip_distance"].min()
max_dist - min_dist</code>
</details>

The second measure of spread is *variance*, which measures how spread out the data is from the mean.

The formula for variance changes a little, depending on whether the data is a *sample* or a *population*.  By default, pandas computes the sample variance.  To compute the sample variance of the trip distance, type `taxi["trip_distance"].var()` below and run it.
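As a sketch of the sample/population difference, pandas exposes the divisor through the `ddof` parameter of `var()`: the default `ddof=1` divides by n - 1 (sample variance), while `ddof=0` divides by n (population variance). The example uses the made-up data from section 6.1 rather than the taxi data.

```python
import pandas as pd

# Made-up data from section 6.1
values = pd.Series([3, 5, 2, 7, 7])

sample_var = values.var()            # default ddof=1: divide by n - 1
population_var = values.var(ddof=0)  # ddof=0: divide by n

print(sample_var)      # about 5.2
print(population_var)  # about 4.16, always a little smaller than the sample variance
```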

The *standard deviation* is the square root of the variance: the population standard deviation is the square root of the population variance, and the sample standard deviation is the square root of the sample variance.  By default, pandas computes the sample standard deviation.  To compute the sample standard deviation of the trip distance, type `taxi["trip_distance"].std()` below and run it.
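The square-root relationship can be checked directly. This is a minimal sketch on the made-up data from section 6.1; the variable names are chosen only for this illustration.

```python
import math
import pandas as pd

# Made-up data from section 6.1
values = pd.Series([3, 5, 2, 7, 7])

sample_std = values.std()                 # sample standard deviation (ddof=1 by default)
root_of_var = math.sqrt(values.var())     # square root of the sample variance

print(sample_std)
print(root_of_var)   # should match sample_std up to rounding
```

A practical advantage of the standard deviation is that it is in the same units as the data (miles, for trip distances), whereas the variance is in squared units.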

### 6.5 Chebyshev's Inequality

We will use the variance and standard deviation throughout the course. One direct application of the standard deviation is that at least 75% of the data values are within 2 standard deviations of the mean, and at least 89% of the data values are within 3 standard deviations of the mean.  This rule is called *Chebyshev's Inequality*.
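These percentages come from the general form of the inequality. As a sketch, writing $k$ for the number of standard deviations (a symbol introduced here for illustration, with $k > 1$), the inequality guarantees a minimum proportion of the data near the mean:

$$
\text{proportion of data within } k \text{ standard deviations of the mean} \;\ge\; 1 - \frac{1}{k^2}
$$

$$
k = 2:\quad 1 - \frac{1}{4} = 0.75
\qquad\qquad
k = 3:\quad 1 - \frac{1}{9} \approx 0.889
$$

Because these are lower bounds, the actual proportion in a given data set is often larger.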

So if we save the (sample) standard deviation of the trip distances in the variable `sigma` and the mean of the trip distances in the variable `mu`, then at least 75% of the data is greater than `mu - 2*sigma` but less than `mu + 2*sigma`.  Compute `mu - 2*sigma` and `mu + 2*sigma` for the trip distances.  Does this interval make sense?

<details> <summary>Answer:</summary>
    <code>mu = taxi["trip_distance"].mean()
sigma = taxi["trip_distance"].std()
print(mu - 2*sigma)
print(mu + 2*sigma)</code>
</details>
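One way to check the inequality on actual data is to count how many values fall inside the interval. The sketch below uses a hypothetical helper, `fraction_within`, and a small made-up list of distances (not the taxi data); the same function could be applied to a column such as `taxi["trip_distance"]`.

```python
import pandas as pd

def fraction_within(values, k):
    """Return the fraction of values strictly within k sample standard deviations of the mean."""
    mu = values.mean()
    sigma = values.std()
    inside = (values > mu - k * sigma) & (values < mu + k * sigma)
    return inside.mean()  # mean of True/False values = fraction that are True

# Made-up, right-skewed distances in miles (for illustration only)
distances = pd.Series([0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 4.5, 8.0, 25.0])

for k in (2, 3):
    bound = 1 - 1 / k**2  # minimum fraction guaranteed by Chebyshev's inequality
    print(k, fraction_within(distances, k), bound)
```

For each `k`, the observed fraction should be at least as large as the bound printed next to it.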

#### Challenges:
- What are the range, sample variance, and sample standard deviation of the number of passengers?
- What are the range, sample variance, and sample standard deviation of the fare amount?
- Use Chebyshev's inequality to determine the interval that contains 94% of the trip distance data.