# Lab 8: Correlation, Variance of Sample Means

Welcome to Lab 8!

This lab is to be taken on **Friday 12/18**. To allow you enough time to complete it, it is due on **Tue 12/22 at 11:00pm**.

**Note: If you do not attend the lab in person on Friday 12/18 and do not seek an exemption from attendance, you will NOT get a grade for this lab, even if you complete and submit it.**

In today's lab, we will cover two relatively orthogonal concepts. First, we will investigate the variance of sample means, found in [Section 14.5](https://umass-data-science.github.io/190fwebsite/textbook/14/5/variability-of-the-sample-mean/) of our textbook. We will also get some hands-on practice with understanding the association between two variables, which you can read more about in [Section 15.1](https://umass-data-science.github.io/190fwebsite/textbook/15/1/correlation/).

In [3]:
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import warnings
warnings.filterwarnings('ignore')

import otter
grader = otter.Notebook()

# 1. How Faithful is Old Faithful? 

(Note: clever title comes from [here](http://web.pdx.edu/~jfreder/M212/oldfaithful.pdf).)

Old Faithful is a geyser in Yellowstone National Park in the western United States.  It's famous for erupting on a fairly regular schedule. [See a video here](https://www.youtube.com/watch?v=wE8NDuzt8eg).

Some of Old Faithful's eruptions last longer than others.  When it has a long eruption, there's generally a longer wait until the next eruption.

If you visit Yellowstone, you might want to predict when the next eruption will happen, so you can see the rest of the park and come back to see the geyser when it erupts.  Today, we will use a dataset on eruption durations and waiting times to see if we can make such predictions accurately with linear regression.

The dataset has one row for each observed eruption.  It includes the following columns:
- **duration**: Eruption duration, in minutes
- **wait**: Time between this eruption and the next, also in minutes

Run the next cell to load the dataset.

In [4]:
faithful = Table.read_table("faithful.csv")
faithful

We would like to use linear regression to make predictions, but that won't work well if the data aren't roughly linearly related.  To check that, we should look at the data.

**Question 1:** Make a scatter plot of the data.  It's conventional to put the column we will try to predict on the vertical axis and the other column on the horizontal axis.

<!-- BEGIN QUESTION -->



In [5]:
...

<!-- END QUESTION -->

**Question 2:** Look at the scatter plot. Are eruption duration and waiting time roughly linearly related?  Is the relationship positive, as we claimed earlier?  You may want to consult [the textbook chapter 15](https://umass-data-science.github.io/190fwebsite/textbook/15/1/correlation/) for the definition of a linear association.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

We're going to continue with the provisional assumption that they are linearly related, so it's reasonable to use linear regression to analyze this data.

We'd next like to plot the data in standard units.  Recall that, if `nums` is an array of numbers, then

    (nums - np.mean(nums)) / np.std(nums)

...is an array of those numbers in standard units. This is called normalizing your data to zero mean and unit variance.
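
As a small illustration of the formula above, the following sketch converts a made-up array to standard units using plain NumPy. The helper name `standard_units` and the values in `toy` are assumptions for illustration only; they are not part of the lab's starter code or of `faithful.csv`.

```python
import numpy as np

def standard_units(nums):
    """Convert an array of numbers to standard units (hypothetical helper)."""
    nums = np.asarray(nums, dtype=float)
    # np.std uses the population SD (ddof=0) by default
    return (nums - np.mean(nums)) / np.std(nums)

# Made-up values chosen so that the mean is 5 and the SD is 2
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
toy_su = standard_units(toy)

print(toy_su)
print("mean:", np.mean(toy_su), "SD:", np.std(toy_su))
```

After the conversion, the array has mean 0 and SD 1 (up to floating-point error), so each value reads as "how many SDs above or below average".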

**Question 3:** Compute the mean and standard deviation of the eruption durations and waiting times.  **Then** create a table called `faithful_standard` containing the eruption durations and waiting times in standard units.  (The columns should be named `"duration (standard units)"` and `"wait (standard units)"`.)

In [6]:
duration_mean = ...
duration_std = ...
wait_mean = ...
wait_std = ...

faithful_standard = ...

faithful_standard

In [None]:
grader.check("q1.3")

**Question 4:** Plot the data again, but this time in standard units.

<!-- BEGIN QUESTION -->



In [8]:
...

<!-- END QUESTION -->

You'll notice that this plot looks exactly the same as the last one!  The data really are different, but the axes are scaled differently.  (The method `scatter` scales the axes so the data fill up the available space.)  So it's important to read the ticks on the axes.

**Question 5:** Among the following numbers, which would you guess is closest to the correlation between eruption duration and waiting time in this dataset?

* -1
* 0
* 1

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 6:** Compute the correlation `r`.  *Hint:* Use `faithful_standard`.  Section [15.1](https://umass-data-science.github.io/190fwebsite/textbook/15/1/correlation/) explains how to do this.

In [9]:
r = ...
r

In [None]:
grader.check("q1.6")
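
For reference, one common way to think about the correlation coefficient is as the average of the products of the two variables in standard units. The sketch below checks that idea on made-up data against NumPy's built-in `np.corrcoef`. The arrays `x` and `y` and the helper names are hypothetical and unrelated to the geyser dataset.

```python
import numpy as np

def standard_units(nums):
    """Convert an array of numbers to standard units (hypothetical helper)."""
    nums = np.asarray(nums, dtype=float)
    return (nums - np.mean(nums)) / np.std(nums)

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up data with a positive but imperfect linear association
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r_toy = correlation(x, y)
print("average of products:", r_toy)
print("np.corrcoef:        ", np.corrcoef(x, y)[0, 1])

# Changing units (for example minutes to seconds) or shifting a variable
# leaves the correlation unchanged, because standard units remove both.
r_rescaled = correlation(60 * x, y + 10)
print("after rescaling:    ", r_rescaled)
```

Because the correlation is unit-free, the scatter plots in original units and in standard units describe the same strength of linear association.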

# 2. Variability of the Sample Mean

By the Central Limit Theorem, the probability distribution of the mean of a large random sample is roughly normal. The bell curve is centered at the population mean. Some of the sample means are higher, and some lower, but the deviations from the population mean are roughly symmetric on either side, as we have seen repeatedly. Formally, probability theory shows that the sample mean is an unbiased estimate of the population mean.

In our simulations, we also noticed that the means of larger samples tend to be more tightly clustered around the population mean than means of smaller samples. In this section, we will quantify the variability of the sample mean and develop a relation between the variability and the sample size.
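
As a rough preview of that relation, the sketch below uses a synthetic right-skewed population (an assumption, not the San Francisco salary data) and compares the observed SD of sample means with the population SD divided by the square root of the sample size. The names `toy_population`, `toy_pop_sd`, and `sd_of_sample_means` are hypothetical, and sampling is done with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed population; NOT the salary data used in this lab
toy_population = rng.exponential(scale=50000, size=20000)
toy_pop_sd = np.std(toy_population)

def sd_of_sample_means(population, sample_size, repetitions, rng):
    """Draw `repetitions` random samples (with replacement) and return the SD of their means."""
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return np.std(samples.mean(axis=1))

observed_sd = {}
for n in (100, 400, 625):
    observed_sd[n] = sd_of_sample_means(toy_population, n, 2000, rng)
    predicted = toy_pop_sd / np.sqrt(n)
    print("n =", n, "observed:", round(observed_sd[n]), "pop SD / sqrt(n):", round(predicted))
```

In this sketch the two printed columns should be close to each other for every sample size, even though the population itself is far from normal.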

Let's take a look at the salaries of employees of the City of San Francisco in 2014. The mean salary reported by the city government was about $75463.92.

In [11]:
salaries = Table.read_table('sf_salaries_2014.csv').select("salary")
salaries

In [12]:
salary_mean = np.mean(salaries.column('salary'))
salary_mean

In [13]:
salaries.hist('salary', bins=np.arange(0, 300000+10000*2, 10000))

**Question 1:** Clearly, the population does not follow a normal distribution. Keep that in mind as we progress through these exercises.

Let's take random samples and look at the probability distribution of the sample mean. As usual, we will use simulation to get an empirical approximation to this distribution.

We will define a function `simulate_sample_mean` to do this, because we are going to vary the sample size later. The arguments are the name of the table, the label of the column containing the variable, the sample size, and the number of simulations.

<!-- BEGIN QUESTION -->



In [14]:
"""Empirical distribution of random sample means"""

def simulate_sample_mean(table, label, sample_size, repetitions):
    
    means = make_array()

    for i in np.arange(repetitions):
        new_sample = ...
        new_sample_mean = ...
        means = ...

    sample_means = Table().with_column('Sample Means', means)
    
    # Display empirical histogram and print all relevant quantities – don't change this!
    sample_means.hist(bins=20)
    plt.xlabel('Sample Means')
    plt.title('Sample Size ' + str(sample_size))
    print("Sample size: ", sample_size)
    print("Population mean:", np.mean(table.column(label)))
    print("Average of sample means: ", np.mean(means))
    print("Population SD:", np.std(table.column(label)))
    print("SD of sample means:", np.std(means))

<!-- END QUESTION -->

Verify with your neighbor or TA that you've implemented the above function correctly. If you haven't implemented it correctly, the rest of the lab won't work properly, so this step is crucial.

**Question 2:** In the following cell, we will draw 10,000 random samples of size 100 from the salaries table and plot the distribution of their means using our new `simulate_sample_mean` function.

In [15]:
simulate_sample_mean(salaries, 'salary', 100, 10000) 
plt.xlim(50000, 100000)

In the following two cells, simulate the mean of a random sample of 400 salaries and 625 salaries, respectively. In each case, perform 10,000 repetitions. Don't worry about the `plt.xlim` line; it just makes sure that all of the plots have the same x-axis.

In [20]:
simulate_sample_mean(salaries, 'salary', 400, 10000)
plt.xlim(50000, 100000)

In [21]:
simulate_sample_mean(salaries, 'salary', 625, 10000)
plt.xlim(50000, 100000)

Write your conclusions about what you just saw in the cell below.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3:** Assign the variable `bootstrap_sampled_SD` to the integer corresponding to your answer to the following question:

When I increase the number of bootstrap samples that I take, for a fixed sample size, the SD of my sample mean will...

1. Increase
2. Decrease
3. Stay about the same
4. Vary wildly

In [22]:
bootstrap_sampled_SD = ...

In [None]:
grader.check("q2.3")

Below, we'll look at what happens when we keep the sample size fixed at 100 and vary the number of resamples (500, 1,000, 5,000, and 10,000). How does the distribution of the resampled means change?

In [24]:
simulate_sample_mean(salaries, 'salary', 100, 500)
plt.xlim(50000, 100000)

In [25]:
simulate_sample_mean(salaries, 'salary', 100, 1000)
plt.xlim(50000, 100000)

In [26]:
simulate_sample_mean(salaries, 'salary', 100, 5000)
plt.xlim(50000, 100000)

In [27]:
simulate_sample_mean(salaries, 'salary', 100, 10000)
plt.xlim(50000, 100000)

What did you notice about the sample means of the four bootstrapped samples above? Discuss with your neighbors. If you're unsure of your conclusion, ask your TA.
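
If you'd like to check your conclusion numerically, here is a minimal sketch that repeats the same experiment on a synthetic population using plain NumPy. The population and the names `toy_population`, `sample_means`, and `sd_by_repetitions` are assumptions for illustration; the sketch does not use the salary data or `simulate_sample_mean`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed population; NOT the salary data used in this lab
toy_population = rng.exponential(scale=50000, size=20000)
sample_size = 100

def sample_means(population, sample_size, repetitions, rng):
    """Means of `repetitions` random samples drawn with replacement."""
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return samples.mean(axis=1)

# Same sample size every time; only the number of repetitions changes
sd_by_repetitions = {}
for repetitions in (500, 1000, 5000, 10000):
    means = sample_means(toy_population, sample_size, repetitions, rng)
    sd_by_repetitions[repetitions] = np.std(means)
    print("repetitions =", repetitions, "SD of sample means:", round(sd_by_repetitions[repetitions]))
```

More repetitions give a smoother histogram of sample means, but compare the printed SDs to see whether the spread itself depends on the number of repetitions.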

**Question 4:** Let's think about how the relationships between population SD, sample SD, and SD of sample means change with varying sample size. Which of the following is true? Again, assign the variable `pop_vs_sample` to the integer corresponding to your answer.

1. Sample SD gets smaller with increasing sample size, SD of sample means gets smaller with increasing sample size
2. Sample SD gets larger with increasing sample size, SD of sample means stays the same with increasing sample size
3. Sample SD becomes more consistent with population SD with increasing sample size, SD of sample means gets smaller with increasing sample size
4. Sample SD becomes more consistent with population SD with increasing sample size, SD of sample means stays the same with increasing sample size

In [39]:
pop_vs_sample = ...

In [None]:
grader.check("q2.4")

Let's see what happens: First, we calculate the population SD so that we can compare the SD of each sample to the SD of the population.

In [28]:
pop_sd = np.std(salaries.column("salary"))
pop_sd

Let's then see how a small sample behaves. Run the following cells multiple times to see how the SD of the sample changes from sample to sample. Adjust the bins as necessary.

In [29]:
sample_10 = salaries.sample(10)
sample_10.hist("salary")
print("Sample SD: ", np.std(sample_10.column("salary")))
simulate_sample_mean(sample_10, 'salary', 10, 1000)
plt.xlim(5,120000)
plt.ylim(0, .0001);

In [30]:
sample_200 = salaries.sample(200)
sample_200.hist("salary")
print("Sample SD: ", np.std(sample_200.column("salary")))
simulate_sample_mean(sample_200, 'salary', 200, 1000)
plt.xlim(5,100000)
plt.ylim(0, .00015);

In [31]:
sample_1000 = salaries.sample(1000)
sample_1000.hist("salary")
print("Sample SD: ", np.std(sample_1000.column("salary")))
simulate_sample_mean(sample_1000, 'salary', 1000, 1000)
plt.xlim(5,100000)
plt.ylim(0, .00025);

Let's illustrate this trend. Below, you will see how the average absolute error between the sample SD and the population SD changes with sample size (N).

In [34]:
# Don't change this cell, just run it!
sample_n_errors = make_array()
for i in np.arange(10, 200, 10):
    sample_n_errors = np.append(sample_n_errors, np.average([abs(np.std(salaries.sample(i).column("salary"))-pop_sd)
                                                      for d in np.arange(100)]))
Table().with_columns("Average absolute error in SD", sample_n_errors, "N", np.arange(10, 200, 10)).plot("N", "Average absolute error in SD")

You should notice that the distribution of means gets spikier, and that the distribution of the sample increasingly looks like the distribution of the population as we get to larger sample sizes.

Is there a relationship between the sample size and absolute error in standard deviation? Identify this relationship – if you're having trouble, take a look at [Section 14.5](https://umass-data-science.github.io/190fwebsite/textbook/14/5/variability-of-the-sample-mean/) in our textbook.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()