# Class 14 Warm-up: Populations and Sampling continued... Hypothesis Testing

This activity will build on the one from Class 13, Tuesday's class.

If we have a sample of only 20 height measurements, can we show a statistically significant difference between the average heights of men and women?

In [None]:
import numpy as np
from datascience import *
import matplotlib.pyplot as plt
%matplotlib inline

## Weight and Height Data
Let's load our table again.

In [None]:
population = Table.read_table("./data/weight-height.csv")
population.show(5)

Let's just focus on height.

In [None]:
pop_ht = population.select("Gender", "Height")
pop_ht.show(5)

## Plot the distribution by gender for all 10,000 measurements

In [None]:
male_heights = pop_ht.where("Gender", "Male").column("Height")
female_heights = pop_ht.where("Gender", "Female").column("Height")

all_heights = pop_ht.column("Height")
bins = np.linspace(np.min(all_heights), np.max(all_heights), 30)

plt.figure(figsize=(8, 5))

# Plot overlapped histograms with transparency
plt.hist(male_heights, bins=bins, color="steelblue", alpha=0.6, label="Male", edgecolor="white")
plt.hist(female_heights, bins=bins, color="tomato", alpha=0.6, label="Female", edgecolor="white")

plt.title("Height Distribution by Gender")
plt.xlabel("Height (inches)")
plt.ylabel("Count")
plt.legend()
plt.tight_layout()
plt.show()

## Challenge #1: Describe in words what this plot shows.

## Sampling
Data tables have a built-in method for random sampling, and there is a trick to ensure you are all working with the same "random" sample. We just need to seed the random number generator, which tells it where to start in its random sequence.

Even if you run the cell below multiple times, you'll get the same sample.
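
To see what seeding does on its own, here is a minimal sketch that uses NumPy directly rather than the table's `sample` method. It assumes, as the next cell does, that the sampling draws from NumPy's global random number generator. The variable names `first_draw` and `second_draw` are made up for this illustration.

```python
import numpy as np

# Seeding resets the generator, so the same "random" draws come out each time.
np.random.seed(42)
first_draw = np.random.choice(100, size=5, replace=False)

# Re-seed with the same integer and draw again.
np.random.seed(42)
second_draw = np.random.choice(100, size=5, replace=False)

print(first_draw)
print(second_draw)
print(np.array_equal(first_draw, second_draw))  # True: same seed, same draws
```

If you change the seed to a different integer, the two draws will generally no longer match.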

In [None]:
np.random.seed(42)  # you can pick any integer seed, but don't change it
pop_ht_sample = pop_ht.sample(20)  # reproducible sample
pop_ht_sample

In [None]:
# Split the sample by gender
men_sample = pop_ht_sample.where("Gender", "Male").column("Height")
women_sample = pop_ht_sample.where("Gender", "Female").column("Height")

A sample of 20 is really too small to make separate histograms of men's and women's heights, so let's compare them with a scatter plot. We add a little bit of jitter to the x-values so that points with the same height don't plot on top of each other.

In [None]:
# Scatter individual points for each group with slight horizontal jitter
jitter = 0.07
plt.scatter(np.full(len(men_sample), 1) + np.random.uniform(-jitter, jitter, len(men_sample)),
            men_sample, color="steelblue", alpha=0.8, edgecolors="white", linewidths=0.5)
plt.scatter(np.full(len(women_sample), 2) + np.random.uniform(-jitter, jitter, len(women_sample)),
            women_sample, color="tomato", alpha=0.8, edgecolors="white", linewidths=0.5)

# Label each group with its sample size
plt.xticks([1, 2], [f"Male (n={len(men_sample)})", f"Female (n={len(women_sample)})"])
plt.ylabel("Height (inches)")
plt.title("Sample (n=20) Heights by Gender")
plt.tight_layout()
plt.show()

## Challenge #2: 
**What does this plot tell you about the sample data?**

**Do you think it is representative of the whole population?**

## Create the test statistic
The test statistic will be the difference in the mean height of the men and the women.

In [None]:
# Use the 20-person sample here, not the full population
test_statistic = np.mean(men_sample) - np.mean(women_sample)
test_statistic

The value above is the average height of the men in our sample minus the average height of the women. Could a difference this large be just a matter of random variation? After all, both the men and women range widely in height, and there are certainly women taller than some of the men.

## Challenge #3: Hypothesis Test Thought Experiment

State the null hypothesis and the alternative hypothesis.

NULL Hypothesis:


Alternative Hypothesis:

## Simulating the Null Hypothesis
**Here is the key idea -- pay attention!**

*If the Null Hypothesis is true, then it doesn't matter if we change the male and female labels. The distribution of height by gender will be the same as long as we keep the number of males and females the same.*

So, to simulate this we just keep shuffling the gender labels and computing the difference in means between the new "males" and "females."
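
As a rough illustration of a single shuffle, here is a sketch with plain NumPy arrays. The eight heights and labels are made up for this example and are not the class data; the names `heights`, `labels`, and `shuffled_labels` are likewise just for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the example is repeatable

# Made-up data for illustration only (not the class data set)
heights = np.array([70.1, 68.4, 72.3, 66.0, 64.2, 63.5, 65.8, 61.9])
labels = np.array(["Male", "Male", "Male", "Male",
                   "Female", "Female", "Female", "Female"])

# Shuffle the labels; the heights stay where they are
shuffled_labels = np.random.permutation(labels)

observed_diff = heights[labels == "Male"].mean() - heights[labels == "Female"].mean()
shuffled_diff = (heights[shuffled_labels == "Male"].mean()
                 - heights[shuffled_labels == "Female"].mean())

print("Observed difference:", observed_diff)
print("Difference after one shuffle:", shuffled_diff)
print("Males after shuffle:", np.sum(shuffled_labels == "Male"))  # still 4
```

Shuffling only rearranges the labels, so the group sizes never change. Only the difference in means does.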

In [None]:
# Here is an example of how we shuffle the labels:
# sampling every row without replacement returns the rows in a random order
new_gender_labels = pop_ht_sample.sample(with_replacement=False).column("Gender")
pop_ht_sample_shuffled = pop_ht_sample.with_columns("Shuffle Gender", new_gender_labels)
pop_ht_sample_shuffled.show(5)

**To perform the simulation we:**
* Loop many times
* Shuffle the gender labels each time
* Calculate the test statistic each time
* Store the result
  
This builds up the distribution of mean height differences under the Null that we can then use to calculate the p-value.
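
The whole procedure, including the p-value step, can be sketched with plain NumPy on made-up numbers. This is only an illustration under stated assumptions: the heights are invented, the helper `diff_in_means` is hypothetical, and the p-value is one-sided, which assumes the alternative hypothesis is that men are taller on average.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the example is repeatable

# Made-up data for illustration only (not the class data set)
heights = np.array([70.1, 68.4, 72.3, 66.0, 64.2, 63.5, 65.8, 61.9])
is_male = np.array([True] * 4 + [False] * 4)

def diff_in_means(values, group_mask):
    """Mean of the masked group minus mean of the remaining group."""
    return values[group_mask].mean() - values[~group_mask].mean()

observed = diff_in_means(heights, is_male)

# Build the distribution of differences under the Null by shuffling labels
num_simulations = 2000
simulated = np.empty(num_simulations)
for i in range(num_simulations):
    shuffled_mask = np.random.permutation(is_male)
    simulated[i] = diff_in_means(heights, shuffled_mask)

# One-sided p-value: share of shuffles with a difference
# at least as large as the one actually observed
p_value = np.mean(simulated >= observed)

print("Observed difference:", observed)
print("Approximate p-value:", p_value)
```

A small p-value means that shuffled labels rarely produce a difference as large as the observed one, which counts as evidence against the Null.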

In [None]:
num_simulations = 5000
sim_test_statistics = []
for i in np.arange(num_simulations):
    new_gender_labels = pop_ht_sample.sample(with_replacement=False).column("Gender")
    pop_ht_sample_shuffled = pop_ht_sample.with_columns("Shuffle Gender", new_gender_labels)
    male_shuffled = pop_ht_sample_shuffled.where("Shuffle Gender", "Male")
    female_shuffled = pop_ht_sample_shuffled.where("Shuffle Gender", "Female")        
    mean_male_ht = np.mean(male_shuffled.column("Height"))
    mean_female_ht = np.mean(female_shuffled.column("Height"))
    sim_test_statistics.append(mean_male_ht - mean_female_ht)

In [None]:
# Show the first ten simulated average height differences
sim_test_statistics[0:10]

In [None]:
plt.hist(sim_test_statistics, bins=30)
plt.scatter(test_statistic, 0, color="red", s=300, marker="^");

## Challenge #4: Conclusions
**What is your p-value?**

**Can you reject the Null Hypothesis?**