# Comparing Two Samples

In this workbook, you will learn how to test whether two populations are the same, first by simulation (a permutation test) and then by theory (a normal approximation).

In [None]:
from symbulate import *
%matplotlib inline

In [None]:
import numpy as np

import pandas as pd
pd.set_option("display.max_rows", 15)

In [None]:
data = pd.read_csv("/data/harris.csv")
data

## By Simulation

Let's test whether the beginning salaries of men and women are the same, using simulation. This simulation-based test is usually called a **permutation test**.

Suppose that the salaries of men and women really are the same and that any observed difference is just due to chance. (This is the null hypothesis.) Then the `Sex` labels carry no information about salary, so reassigning them at random should not change the distribution of the difference. By shuffling the `Sex` variable many times, we get the distribution of differences between the salaries of men and women that we would expect if the null hypothesis were true.

**Step 1.** To shuffle the `Sex` column, we can simply put that column into a box model and pull out all the tickets without replacement. The set of tickets we get will be the same every time, but the order will be different.

In [None]:
model = BoxModel(box=list(data["Sex"]), size=len(data), replace=False)
model.sim(100)
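
To see what this box model does without Symbulate, here is a minimal sketch in plain Python. The `labels` list is made-up toy data standing in for the `Sex` column, and `random.sample` is swapped in for `BoxModel`; drawing every ticket without replacement is the same as shuffling.

```python
import random
from collections import Counter

# Toy stand-in for the Sex column (hypothetical data, not from harris.csv).
labels = ["Male", "Male", "Female", "Female", "Female"]

rng = random.Random(0)  # seeded so the sketch is reproducible

# Drawing len(labels) tickets without replacement amounts to a shuffle.
shuffled = rng.sample(labels, k=len(labels))

# The tickets are the same as before; only their order can change.
print(shuffled)
print(Counter(shuffled) == Counter(labels))
```

Because the counts of each label never change, every shuffle keeps the group sizes fixed and only changes which salaries are paired with which label.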

**Step 2.** Let's define a function that, given a sex column, returns the difference in average beginning salary between men and women. 

(Note the use of vectorization and boolean masking!)

In [None]:
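def calculate_difference(sex):
    # Convert the simulated outcome to an array so it can be used as a boolean mask.
    sex = np.array(sex)
    # Mean beginning salary of men minus mean beginning salary of women.
    return data[sex == "Male"]["Bsal"].mean() - data[sex == "Female"]["Bsal"].mean()

# Apply the function to each shuffled column and simulate 10,000 differences.
diffs = RV(model, calculate_difference)
sims = diffs.sim(10000)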

**Step 3.** Let's plot the distribution of the differences and locate the observed difference on this distribution to obtain a $p$-value.

In [None]:
sims.plot(type="bar", bins=30)

In [None]:
obs_diff = calculate_difference(data["Sex"])
obs_diff

In [None]:
# One-sided p-value: the fraction of shuffled differences at least as large as the observed one.
sims.count_geq(obs_diff) / 10000
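
The three steps above can also be sketched end to end using only the Python standard library. This is an illustration under stated assumptions, not the Symbulate implementation: the `salaries` and `labels` lists are hypothetical toy data, and `mean_difference` and `permutation_p_value` are helper names introduced here.

```python
import random
from statistics import mean

def mean_difference(salaries, labels):
    """Mean salary of "Male" rows minus mean salary of "Female" rows."""
    male = [s for s, lab in zip(salaries, labels) if lab == "Male"]
    female = [s for s, lab in zip(salaries, labels) if lab == "Female"]
    return mean(male) - mean(female)

def permutation_p_value(salaries, labels, n_sims=10000, seed=0):
    """One-sided p-value: share of shuffles whose difference is >= the observed one."""
    rng = random.Random(seed)
    observed = mean_difference(salaries, labels)
    shuffled = list(labels)  # copy so the caller's labels are left untouched
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # reassign the labels at random
        if mean_difference(salaries, shuffled) >= observed:
            hits += 1
    return hits / n_sims

# Hypothetical toy data: four men and four women.
salaries = [6000, 6200, 5900, 6100, 5000, 5200, 4900, 5100]
labels = ["Male"] * 4 + ["Female"] * 4

p = permutation_p_value(salaries, labels, n_sims=2000)
print(mean_difference(salaries, labels), p)
```

In this toy data every man earns more than every woman, so only 1 of the 70 possible label assignments reproduces the observed difference, and the estimated $p$-value should be small.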

## By Theory

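Under the null hypothesis that the populations are the same, the observed difference in sample means $\bar X_1 - \bar X_2$ approximately follows a

$$\textrm{Normal}\left(0, \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right)$$

distribution. Here the second parameter is the standard deviation (the standard error of the difference), $S_i^2$ is the sample variance of group $i$, and $n_i$ is its sample size. Let's repeat the above analysis using this theoretical approximation. Do we get the same answer?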

**Step 1.** Calculate the standard error.

In [None]:
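# Beginning salaries for each group.
m = data[data["Sex"] == "Male"]["Bsal"]
f = data[data["Sex"] == "Female"]["Bsal"]

# Standard error of the difference in means; .var() is the sample variance (n - 1 denominator).
se = np.sqrt(m.var() / m.count() + f.var() / f.count())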
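
As a quick check of the formula, here is a small worked sketch using only the standard library. The two groups are hypothetical numbers chosen so the arithmetic is easy to follow by hand, and `standard_error` is a helper name introduced here.

```python
import math
from statistics import variance

def standard_error(group1, group2):
    """sqrt(S1^2 / n1 + S2^2 / n2), using sample variances (n - 1 denominator)."""
    return math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))

# Hypothetical toy groups.
group1 = [2, 4, 6]      # sample variance 4, n = 3
group2 = [1, 3, 5, 7]   # sample variance 20/3, n = 4

# 4/3 + (20/3)/4 = 4/3 + 5/3 = 3, so the standard error is sqrt(3).
se_toy = standard_error(group1, group2)
print(se_toy)
```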

**Step 2.** Simulate from the normal distribution, and compare the observed difference to this distribution to obtain a $p$-value.

In [None]:
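# Distribution of the difference under the null hypothesis: Normal with mean 0 and SD equal to the standard error.
Z = RV(Normal(mean=0, sd=se))
sims = Z.sim(10000)
sims.plot(type="bar", bins=30)

# One-sided p-value: the fraction of simulated values at least as large as the observed difference.
sims.count_geq(obs_diff) / 10000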
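
Because the normal distribution has a known CDF, the tail probability can also be computed exactly rather than estimated by simulation. The sketch below uses `statistics.NormalDist` from the standard library; `normal_p_value` is a helper name introduced here, and the inputs in the example calls are hypothetical numbers, not values from the Harris Bank data.

```python
from statistics import NormalDist

def normal_p_value(obs_diff, se):
    """One-sided p-value P(Z >= obs_diff) for Z ~ Normal(mean=0, sd=se)."""
    return 1 - NormalDist(mu=0, sigma=se).cdf(obs_diff)

# Hypothetical inputs for illustration only.
p_center = normal_p_value(obs_diff=0.0, se=2.0)   # half the mass lies above the mean
p_tail = normal_p_value(obs_diff=1.96, se=1.0)    # roughly 0.025
print(p_center, p_tail)
```

The simulated estimate should agree with this exact value up to simulation error, which shrinks as the number of simulations grows.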

## Now You Try It!

The main contention in the Harris Bank lawsuit was that men and women were treated differently in terms of salary _increases_. Test the null hypothesis that salary _increases_ did not differ between men and women. Try doing this by simulation and by theory.