# Class 20 Warmup - Unpaired t-test by simulation

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

Question 5 from the Mt. St. Helens mini-project asked you to compare the recovery of two different plots to see if, on average, one plot location has recovered more than another. You were told:

"Independent samples, unpaired, t-test\
Assumptions, differs from paired test used above and in Lab 07. The main difference is that we are comparing two different groups of plots as compared to the question 4 test which was applied to the same plots undergoing a 'treatment' of time passage following the eruption (paired)."
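
For reference, a formula-based statistic for two independent samples is often written as shown below. This is the unequal-variance (Welch) form, which is an assumption here, since the text quoted above does not say which variant was intended. The symbols $\bar{x}_i$, $s_i^2$, and $n_i$ denote the sample mean, sample variance, and number of observations for plot $i$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
$$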

**You were directed to use the formula-based approach. But how would we tackle this using simulation?**

## Step One: Load the data

In [None]:
# Read the data
datafile = "../../../Mini Project II/data/MSH_STRUCTURE_PLOT_YEAR.csv"
MSH_YEAR = Table.read_table(datafile)
MSH_YEAR.show(2)

## Step Two: Pick two plots to compare

In [None]:
# Compare plots
MSH_YEAR.group("PLOT_NAME", np.mean)

It is pretty obvious that, for example, the percent cover at BUCA is much greater than at ABPL, but some of the plots are much closer. To make this interesting, let's pick two plots that at first glance have similar percent cover and see whether the difference is statistically significant.

Let's pick plot1 = ABPL and plot2 = LAHR

In [None]:
# Extract the data for each plot into a table
plot1 = MSH_YEAR.where("PLOT_NAME", "ABPL")
plot2 = MSH_YEAR.where("PLOT_NAME", "LAHR")

In [None]:
# Check the years and number of subplots for each plot
plot1.group("YEAR").show()

In [None]:
plot2.group("YEAR").show()

ABPL has more subplots. LAHR spans more years. How do we perform a comparison?

The first year we have data for both plots is 1995, and the last year we have data for both is 2009. So we will compare the two plots between 1996 and 2009. Let's test for a difference in the average percent cover. The same approach would work for richness.
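
The overlapping range of years can also be found programmatically rather than by eye. The sketch below uses only the standard library and made-up year lists. The function name `common_year_range` and the sample years are hypothetical and are not taken from the MSH data.

```python
def common_year_range(first, second):
    """Return (earliest, latest) year that appears in both sequences."""
    shared = set(first) & set(second)
    if not shared:
        raise ValueError("the two plots share no survey years")
    return min(shared), max(shared)

# Hypothetical survey years for two plots (NOT the real MSH data)
years_a = [1981, 1983, 1985, 1990]
years_b = [1979, 1983, 1984, 1990, 1992]

start, end = common_year_range(years_a, years_b)
print(start, end)
```

With the tables above, the same function could presumably be called as `common_year_range(plot1.column("YEAR"), plot2.column("YEAR"))`, provided it runs before either table is restricted by year.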

## Step Three: Calculate our test statistic
Our test statistic is the absolute difference in the average percent cover between 1996 and 2009 for ABPL versus LAHR.

In [None]:
# Restrict the data to common years
plot1 = plot1.where("YEAR", are.between_or_equal_to(1996, 2009))
plot2 = plot2.where("YEAR", are.between_or_equal_to(1996, 2009))

In [None]:
plot1.hist("COVER_%", bins=np.arange(0, 60, 5))

In [None]:
plot2.hist("COVER_%", bins=np.arange(0, 60, 5))

In [None]:
plot1_mean_cover = np.mean(plot1.column("COVER_%"))
plot2_mean_cover = np.mean(plot2.column("COVER_%"))
test_statistic = abs(plot1_mean_cover - plot2_mean_cover)

print("The average percent cover for ABPL was: ", plot1_mean_cover)
print("The average percent cover for LAHR was: ", plot2_mean_cover)
print("The test statistic is the difference in means: ", test_statistic)

## Step Four: Formulate our Hypotheses

### Null Hypothesis
The difference in the percent cover between the two plots is due to random variation.

### Alternative Hypothesis
The difference in percent cover is too large to be due to random variation alone. The two plots have a statistically significant difference in percent cover during the recovery period of 1996-2009.

## Step Five: Simulate the Null Hypothesis
If the null hypothesis is true, it would not matter whether a plot was labelled ABPL or LAHR, because there is no difference between them. So to test this we randomly permute the labels while keeping the number of plots in each category the same. Then we calculate the difference in the means between the two relabeled groups. We do this over and over to build up a distribution of the difference in means under the null hypothesis.
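
Before using the `Table` methods, it may help to see the label shuffle in a stripped-down form. The sketch below uses only the standard library and made-up percent-cover values. The function `shuffled_difference` and the sample numbers are illustrative assumptions, not part of the course code or the MSH data.

```python
import random
from statistics import mean

def shuffled_difference(values, labels, rng):
    """Shuffle labels across values and return the difference in group means."""
    shuffled = labels[:]       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)      # a permutation, so group sizes stay the same
    groups = sorted(set(shuffled))
    first = [v for v, lab in zip(values, shuffled) if lab == groups[0]]
    second = [v for v, lab in zip(values, shuffled) if lab == groups[1]]
    return mean(second) - mean(first)

# Made-up percent-cover values, for illustration only
cover = [12.0, 15.5, 9.0, 20.0, 14.0, 18.5, 11.0]
names = ["ABPL"] * 4 + ["LAHR"] * 3

rng = random.Random(0)  # fixed seed so the run is repeatable
null_differences = [shuffled_difference(cover, names, rng) for _ in range(1000)]
```

Because the labels are permuted rather than resampled, every shuffle still has four values labelled ABPL and three labelled LAHR. Only the assignment of labels to values changes.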

In [None]:
# Select just the columns we need
plot1 = plot1.select("PLOT_NAME", "COVER_%")
plot1.show(3)

In [None]:
plot2 = plot2.select("PLOT_NAME", "COVER_%")
plot2.show(3)

In [None]:
# Combine into one table
plot1_plot2 = plot1.append(plot2)

# We should get the same result as above
plot1_plot2.group("PLOT_NAME", np.mean)

In [None]:
def difference_of_means(table, label):
    """
    Takes: a table and the name of its label column.
    Returns: the difference in mean COVER_% between the two groups
    (second row minus first row of the grouped table).
    """
    means_table = table.group(label, np.mean)
    means = means_table.column("COVER_% mean")
    return means.item(1) - means.item(0)

In [None]:
# Test the function
difference_of_means(plot1_plot2, "PLOT_NAME")

In [None]:
def one_simulated_difference_of_means(tbl, label):
    """
    Returns: Difference between mean after shuffling labels
    """
    
    # array of shuffled labels
    shuffled_labels = tbl.sample(with_replacement=False).column(label)
    
    # table of percent cover and shuffled labels
    shuffled_table = tbl.select('COVER_%').with_column(
        'Shuffled Label', shuffled_labels)
    
    return difference_of_means(shuffled_table, 'Shuffled Label') 

In [None]:
# Test our functions
one_simulated_difference_of_means(plot1_plot2, "PLOT_NAME")

In [None]:
# Run the simulation
differences = []
repetitions = 3000
for i in np.arange(repetitions):
    new_difference = one_simulated_difference_of_means(plot1_plot2, "PLOT_NAME")
    differences.append(new_difference)
# Convert the list to an array so it supports element-wise comparisons
differences = np.array(differences)

In [None]:
# Plot the simulation results
Table().with_column('Difference Between Group Means', differences).hist(bins=30)
ax = plt.gca()
ax.plot(test_statistic, 0,  marker='^', markersize=40, mec='red')
ax.set_title('Prediction Under the Null Hypothesis');

In [None]:
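# The test statistic is an absolute difference, so compare it against the
# absolute simulated differences (two-sided test)
p_value = np.count_nonzero(np.abs(differences) >= test_statistic) / repetitions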
p_value
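
One way to read the result: the p-value is the fraction of simulated differences that are at least as extreme as the observed one. The sketch below works through that arithmetic on a handful of made-up numbers. The helper `two_sided_p_value`, the sample values, and the 0.05 cutoff are assumptions for illustration; the notebook itself does not fix a significance level.

```python
def two_sided_p_value(simulated, observed):
    """Fraction of simulated differences at least as extreme as observed."""
    extreme = sum(1 for d in simulated if abs(d) >= abs(observed))
    return extreme / len(simulated)

# Made-up simulated differences, for illustration only
simulated = [-3.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0, 4.0]
observed = 2.0

p = two_sided_p_value(simulated, observed)
alpha = 0.05  # assumed significance level; not specified in the notebook
print(p, "reject the null" if p < alpha else "fail to reject the null")
```

Here three of the eight simulated values (-3.0, 2.0, and 4.0) are at least as far from zero as the observed difference, so the p-value is 3/8.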