## Lab 9: Simpson's Paradox, Multiple Testing and Review ##

In today's lab, we're doing a few things. First, we'll practice using two-way tables to investigate a surprising phenomenon. Next, we'll run through an example of multiple testing to see why it is important. Finally, as a review for the final, we'll walk through a basic simulation example. 

As usual, **run the cell below** to prepare the lab.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the NumPy and pandas modules.
import numpy as np
import pandas as pd

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Simpson's Paradox ##

Suppose we've collected data on patients from 2 different hospitals, hospital A and hospital B. For each patient, we know whether they were in good or bad condition when they arrived, and we know whether or not they survived. Run the following cell to load and see the data.

In [None]:
hospitals = pd.read_csv('hospital_survival.csv')
hospitals

#### Question 1.1 ####

Make a two-way table that classifies hypothetical hospital patients by the hospital that treated them and whether they survived or died. (Hint: you can add an [`aggfunc`](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) argument to a pivot table to do this quickly.)

In [None]:
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
hospitals.pivot_table(index='status', columns='hospital', aggfunc='count')
</pre>
</details>

#### Question 1.2 ####

Based on this table, which of the 2 hospitals has a higher survival rate?

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
Hospital B: 80.8% survival rate vs. 78.9% at hospital A
</pre>
</details>

#### Question 1.3 ####

Now, make a three-way table that further separates patients by their condition. A three-way table is just like a two-way table, except now we are cutting the data along three dimensions instead of two. (Hint: `pivot_table` can do this easily if you pass a list of column names to `columns` instead of a single name. The function expects an extra column to aggregate over, so we are adding a dummy column for you.)

In [None]:
hospitals['total'] = 1
hospitals.pivot_table(...)


<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
hospitals.pivot_table(index='status', columns=['hospital', 'condition'], aggfunc='count')
</pre>
</details>

#### Question 1.4 #### 

Now what do you observe? Does hospital B have a better survival rate for either condition? Why does it have a better survival rate overall?

In [None]:
# put your computations here

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
You should find that hospital B has a worse survival rate for both categories of patients, but a higher rate overall because a larger share of its patients arrive in good condition.
</pre>
</details>

This is known as Simpson's paradox: a trend that appears in several different groups of data disappears or reverses when the groups are combined. See [Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox) for more information.
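
To see how such a reversal can arise, here is a minimal sketch with made-up counts (they are not the values in `hospital_survival.csv`, and the `toy` table and its column names are assumptions for illustration). In this toy data, hospital B mostly treats patients in good condition, so its pooled rate comes out higher even though it does worse within each condition.

```python
import pandas as pd

# Hypothetical counts chosen only to illustrate the reversal.
toy = pd.DataFrame({
    'hospital':  ['A', 'A', 'B', 'B'],
    'condition': ['good', 'bad', 'good', 'bad'],
    'survived':  [95, 140, 270, 12],
    'total':     [100, 200, 300, 20],
})

# Survival rate within each (hospital, condition) group.
toy['rate'] = toy['survived'] / toy['total']
by_condition = toy.pivot_table(index='condition', columns='hospital', values='rate')

# Overall survival rate per hospital, pooling both conditions.
pooled = toy.groupby('hospital')[['survived', 'total']].sum()
overall = pooled['survived'] / pooled['total']

print(by_condition)  # A is higher in both rows
print(overall)       # B is higher overall
```

The pooled rate is a weighted average of the per-condition rates, and the two hospitals use very different weights, which is what allows the ordering to flip.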


## 2. Multiple Testing ##

Suppose we're studying 10,000 different fad diets, and we are interested in whether or not they are effective. For each of these fad diets, we can compute a p-value comparing the average weight loss between a treatment group and a control group, where we are testing the hypothesis that the treatment group loses more weight than the control group. 

We don't actually have this data, but we are going to run a simulation that guarantees that none of these 'diets' is effective: we will load a small sample of bodyweight data, randomly choose the control and treatment groups from it, and use a two-sided t-test to compute a p-value for the null hypothesis that the two groups have the same mean. Since both groups are drawn at random from the same data, any difference between them is due to chance alone. This means that the null hypothesis is true for all diets.

Run the following code - it produces an array of the 10,000 p-values described above.

In [None]:
from scipy import stats
# if you don't have this library, run `pip install scipy` in your terminal

np.random.seed(420)
population = pd.read_csv('bodyweight.csv')['Bodyweight']
alpha = 0.05
N = 12
m = 10000
pvals = []
for i in range(m):
    control = np.random.choice(population, N)
    treatment = np.random.choice(population, N)
    t, pval = stats.ttest_ind(treatment,control)
    pvals = np.append(pvals, pval)

#### Question 2.1 ####
Make a histogram of the simulated p-values. What do you observe?

In [None]:
...

<details><summary><button>Click here to reveal the answer!</button></summary>

The distribution of the p-values should be nearly uniform on [0, 1], because the null hypothesis is true for every test.

</details>

#### Question 2.2 ####
Suppose we reject any hypothesis with a p-value less than .05. How many of the hypotheses in our simulation would we reject?

In [None]:
...

<details><summary><button>Click here to reveal the answer!</button></summary>
One way to do this is shown below; any equivalent approach works.
<pre>
sum(pvals < .05)  # 512
</pre>
</details>
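
As a sanity check on that count: when every null hypothesis is true, p-values are approximately uniform on [0, 1], so roughly `alpha * m` = 0.05 × 10,000 = 500 of them fall below the cutoff by chance alone. The sketch below draws stand-in uniform p-values rather than re-running the t-tests; the names `rng` and `fake_pvals` and the seed are assumptions made for illustration.

```python
import numpy as np

alpha = 0.05
m = 10000

# Stand-in p-values: uniform draws, which is how p-values behave under a true null.
rng = np.random.default_rng(0)
fake_pvals = rng.uniform(0, 1, size=m)

# Number of rejections we expect by chance, and the number seen in this draw.
expected_rejections = alpha * m
observed_rejections = np.count_nonzero(fake_pvals < alpha)

print(expected_rejections, observed_rejections)
```

Every one of these rejections is a false positive, since no diet in the simulation has any real effect.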

#### Question 2.3 ####
What are some potential issues if we were to incorrectly reject this many hypotheses?

In [None]:
...

## 3. Review: Number of Heads in 100 Tosses ##

It is natural to expect that in 100 tosses of a coin, there will be 50 heads, give or take a few.

But how many is “a few”? What’s the chance of getting exactly 50 heads? Questions like these matter in data science not only because they are about interesting aspects of randomness, but also because they can be used in analyzing experiments where assignments to treatment and control groups are decided by the toss of a coin.

In this example we will simulate the number of heads in 100 tosses of a coin. The histogram of our results will give us some insight into how many heads are likely.

Let’s get started on the simulation.


#### Question 3.1 ####
Create a "coin" object that we can use to simulate.

In [None]:
coin = ...


<details><summary><button>Click here to reveal the answer!</button></summary>
    
coin = ['Heads', 'Tails']

</details>

#### Question 3.2 ####

Simulate 100 tosses of the coin, assuming it is fair, and count the number of heads in the result.

In [None]:
outcome = ...
num_heads = ...

<details><summary><button>Click here to reveal the answer!</button></summary>
    
outcome = np.random.choice(coin, 100)

num_heads = np.count_nonzero(outcome == 'Heads')

</details>

#### Question 3.3 ####

We now want to simulate flipping a coin 100 times to estimate the number of heads that appear in the long run. How many repetitions we want is up to us. The more we use, the more reliable our simulations will be, but the longer it will take to run the code. Python is pretty fast at tossing coins. Let’s go for 10,000 repetitions. That means we are going to do the following 10,000 times:

 - Toss a coin 100 times and count the number of heads.

That’s a lot of tossing! It’s good that we have Python to do it for us.

Complete the following code to run the full simulation.

In [None]:
# An empty array to collect the simulated values
heads = []

# Repetitions sequence
num_repetitions = 10000
repetitions_sequence = ...

# for loop
for i in repetitions_sequence:
    
    # simulate one value
    outcomes = ...
    num_heads = ...
    
    # augment the collection array with the simulated value
    heads = ... 

<details><summary><button>Click here to reveal the answer!</button></summary>

heads = []

num_repetitions = 10000
repetitions_sequence = np.arange(num_repetitions)

for i in repetitions_sequence:
    
    # simulate one value
    outcomes = np.random.choice(coin, 100)
    num_heads = np.count_nonzero(outcomes == 'Heads')
    
    # augment the collection array with the simulated value
    heads = np.append(heads, num_heads)  
</details>
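
For comparison, the same simulation can be written without an explicit loop. This is only a sketch of an alternative, not the approach the lab asks for; the names `rng`, `tosses`, and `heads_vectorized` and the seed are assumptions, and heads are encoded as 1 rather than as the string `'Heads'`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Each row is one repetition of 100 tosses; 1 stands for heads, 0 for tails.
tosses = rng.integers(0, 2, size=(10000, 100))

# Summing across each row counts the heads in that repetition.
heads_vectorized = tosses.sum(axis=1)

print(heads_vectorized[:5], heads_vectorized.mean())
```

The loop version is easier to read step by step, while the array version tends to run faster because NumPy generates all the tosses at once.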

#### Question 3.4 ####

We now want to analyze the results. Create a table with a row for each repetition containing the repetition number and the number of heads found in that run of the simulation. Then, make a histogram of the results and comment.

In [None]:
simulation_results = ...

<details><summary><button>Click here to reveal the answer!</button></summary>

simulation_results = pd.DataFrame({
    'Repetition': np.arange(1, num_repetitions + 1),
    'Number of Heads': heads
  }
)

simulation_results.hist('Number of Heads', bins = np.arange(30.5, 69.6, 1))

</details>
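
The question posed at the start of this section, the chance of getting exactly 50 heads, also has an exact answer from the binomial formula, $\binom{100}{50} / 2^{100}$. A short check using only the standard library (the variable name `p_exactly_50` is just for illustration) gives a number you can compare with the fraction of repetitions in your simulation that produced exactly 50 heads:

```python
from math import comb

# Exact chance of exactly 50 heads in 100 tosses of a fair coin.
p_exactly_50 = comb(100, 50) / 2**100
print(round(p_exactly_50, 4))  # 0.0796
```

So even the single most likely outcome happens only about 8% of the time.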


Great job! :D You're finished with lab 9!

**Acknowledgement**: The materials for this lab are based on the [data8](http://data8.org/) course at UC Berkeley and [this book](http://genomicsclass.github.io/book/pages/multiple_testing.html) for the multiple testing example.