# Hypothesis testing
Elements of Data Science

## Hypothesis Testing Learning Goals
Develop and test a hypothesis
- Hypothesis
    - testable hypothesis
    - statistic
- Simulation: Sample the distribution
    - Repeat and collect outcomes
    - Iteration: 
        `for i in np.arange(samples)`
- Examine resulting distribution of outcomes
    - Probability distribution
    - Uncertainty
- p-test
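
The steps in this outline can be sketched with a toy example. This is a minimal illustration, not part of the lab: the fair-coin statistic, the helper name `heads_in_tosses`, and the choice of 100 tosses and 1000 samples are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded so the sketch is reproducible


def heads_in_tosses(num_tosses):
    """Statistic (hypothetical example): number of heads in num_tosses fair coin flips."""
    flips = rng.choice(["heads", "tails"], size=num_tosses)
    return np.count_nonzero(flips == "heads")


samples = 1000           # number of repeated simulations (assumed value)
outcomes = np.array([])  # collects one statistic per simulation

# Iteration: repeat the simulation and collect the outcomes
for i in np.arange(samples):
    outcomes = np.append(outcomes, heads_in_tosses(100))

# Examine the resulting distribution of outcomes
print(outcomes.mean(), outcomes.std())
```

The spread of `outcomes` is what expresses the uncertainty: an observed statistic is judged against this simulated distribution.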

In [None]:
import numpy as np
from datascience import *

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

## A digression -- Python formatting

You know Python has a few formatting rules. Indentation, for example, is critical for defining the structure of loops, functions, and conditional statements. Beyond that, you can pretty much do as you like. Following formatting conventions, however, will make your code more readable.

Black is a Python code formatter. You can install it and run it on your own code, or you can simply paste some code you have written into the [Black Playground](https://black.vercel.app) to see the suggested format.

In [None]:
# First Example: Paste into the playground. Note: the code will run as written

a=2
b=3
c=a+b
print( c)

In [None]:
# Paste the formatted code below:



In [None]:
# Second example: Again, the code as written will run without error. But long lines are hard to read.
# Create a list
CST_departments = ['Earth & Environmental Science','Biology','Chemistry','Physics','Computer & Information Sciences','Mathematics']

# Now loop over each element and print it.
for dept in CST_departments:
    print(dept)

In [None]:
# Paste the formatted code below:



In [None]:
# Formatting complex statements
from datascience import *
data = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full_census_table = Table.read_table(data)
partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2020')
partial_census_table = partial_census_table.relabeled('SEX', 'GENDER').relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2020', '2020')
partial_census_table.show(3)

In [None]:
# Paste the formatted code below:



Black makes some opinionated choices. You do not have to follow all of its formatting suggestions, but you should pick a consistent way to format your code, and Black uses what many Pythonistas consider to be best practice.

## End of digression...

## LAB 06 TIPS

In [None]:
modifier = 11
num_observations = 7

def simulate_observations(modifier, num_observations):
    """Produces an array of 7 simulated modified die rolls"""
    ...

observations = ...
observations

In [None]:
# How to make an empty array and append values.
# np.append returns a new array, so assign the result back to the name.
test = make_array()
test = np.append(test, 5)
test
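
One detail worth checking when appending: `np.append` does not change the array it is given; it returns a new one. The sketch below shows the difference. It uses `np.array([])` as a stand-in for `make_array()` so that it depends only on NumPy, and the values appended are arbitrary.

```python
import numpy as np

values = np.array([])  # start with an empty array

np.append(values, 5)   # result is discarded, so values is still empty
print(len(values))

values = np.append(values, 5)  # reassign to keep the new array
for roll in [3, 6]:            # arbitrary example values
    values = np.append(values, roll)
print(values)
```

The same reassignment pattern applies inside a simulation loop that collects one result per iteration.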

# Back to Ground Hogs

## Hypothesis Testing

#### Ground Hog


In [None]:
groundhogdata = Table.read_table('../Lab05/GroundHogData/summarizedGroundhogData_20210326.csv')
groundhogdata

In [None]:
yes_late = groundhogdata.where('shadowPres', 'yes').where('earlyOrLate', 'late')
no_early = groundhogdata.where('shadowPres', 'no').where('earlyOrLate', 'early')

correct_tbl = yes_late.append(no_early)

In [None]:
num_correct = correct_tbl.num_rows / groundhogdata.num_rows * 100
print(f"The groundhogs were correct {num_correct:0.1f} percent of the time.")

## Is 51.1% correct "significantly" better than random guessing?

To answer this question, we simulate 520 random guesses (the size of our data set), repeat that simulation many times, and see in what fraction of the repeats guessing does at least this much better than 50-50.
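
As a rough sketch of that idea, the whole experiment can be simulated with NumPy arrays instead of loops. A random guess counts as right when the shadow guess and the spring guess line up (shadow with late, no shadow with early), which happens half the time. The variable names and the seed here are illustrative assumptions; 520 comes from the text above and 1500 is the number of simulations this notebook uses.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the sketch is reproducible

num_observations = 520  # data set size quoted in the text
num_simulations = 1500  # number of repeated trials

# Each row is one trial; each entry is an independent 50-50 guess.
shadow_yes = rng.integers(0, 2, size=(num_simulations, num_observations)) == 1
spring_late = rng.integers(0, 2, size=(num_simulations, num_observations)) == 1

# Right when (shadow, late) or (no shadow, early), i.e. when the two guesses agree.
right = shadow_yes == spring_late

# Fraction of right guesses in each trial
fraction_correct = right.mean(axis=1)
print(fraction_correct.mean(), fraction_correct.std())
```

The array `fraction_correct` holds one simulated accuracy per trial and clusters around 0.5.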

In [None]:
shadow_options = make_array('yes', 'no')
spring_options = make_array('late', 'early')
num_observations = groundhogdata.num_rows
num_simulations = 1500
num_observations

In [None]:
# Simulate by guessing -- this is a single trial.

right = 0 
wrong = 0

for obs in range(num_observations):
    shadow = np.random.choice(shadow_options)
    spring = np.random.choice(spring_options)
    
    if shadow == 'yes' and spring == 'late':
        right += 1
    elif shadow == 'no' and spring == 'early':
        right += 1
    else:
        wrong += 1

# Fraction correct in this trial: divide by the number of guesses made
simulated_num_correct = right / num_observations
simulated_num_correct

In [None]:
# Create a function so we can generate many trials.
def sim_ground(repeats):
    correct_obs = []
    for i in np.arange(repeats):
        right = 0 
        wrong = 0
        for obs in range(num_observations):
            shadow = np.random.choice(shadow_options)
            spring = np.random.choice(spring_options)
            if shadow == 'yes' and spring == 'late':
                right += 1
            elif shadow == 'no' and spring == 'early':
                right += 1
            else:
                wrong += 1
        simulated_num_correct = right / num_observations
        correct_obs.append(simulated_num_correct)
    return correct_obs        

In [None]:
# Plot the histogram of 1500 trials
plt.hist(sim_ground(num_simulations), bins=np.arange(0.4, 0.6, 0.01), color="skyblue", ec="red")
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground.png')
plt.show()

In [None]:
hogsimdata = Table().with_columns("Correct",sim_ground(num_simulations))
hogsimdata

In [None]:
# The red dot shows the groundhog prediction accuracy
hogsimdata.hist(0,bins=np.arange(0.4,0.6,.01))
plt.scatter(num_correct/100, 0, color='red', s=200);
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground_correct.png')
plt.show()

In [None]:
# What is the probability of doing at least as well as the groundhogs with random guessing?
np.count_nonzero(hogsimdata.column('Correct') >= num_correct/100) / num_simulations
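
The simulated fraction above can be cross-checked against an exact calculation, assuming each random guess is right with probability 1/2. This is only a sketch: the helper name `tail_probability` is made up for illustration, and the count 266 is a hypothetical stand-in for roughly 51% of 520 rather than a value read from the data.

```python
from math import comb


def tail_probability(num_observations, num_right):
    """P(at least num_right correct) when each guess is right with probability 1/2."""
    favorable = sum(
        comb(num_observations, k) for k in range(num_right, num_observations + 1)
    )
    return favorable / 2**num_observations


# Hypothetical count; substitute the actual number of correct rows.
p_value = tail_probability(520, 266)
print(f"{p_value:.3f}")
```

If the simulation is behaving, its estimate should land close to this value, with small differences from run to run.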

#### Essex Ed

In [None]:
groundhogdata.where('hogID',"EE")

In [None]:
correct_tbl.where('hogID',"EE").sort('year')

In [None]:
num_correct = correct_tbl.where('hogID',"EE").num_rows / groundhogdata.where('hogID',"EE").num_rows * 100
num_correct

In [None]:
hogsimdata.hist(0,bins=np.arange(0.4,0.6,.01))
plt.scatter(num_correct/100, 0, color='red', s=200);
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground_correct.png')
plt.show()

In [None]:
np.count_nonzero(hogsimdata.column('Correct') >= num_correct/100) / num_simulations

In [None]:
num_observations = 7
hogsimdata = Table().with_columns("Correct",sim_ground(num_simulations))
hogsimdata

In [None]:
#hogsimdata.hist(0,bins=np.arange(0.1,0.9,.01))
hogsimdata.hist(bins=15)
plt.scatter(num_correct/100, 0, color='red', s=200);
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground_correct.png')
plt.show()
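
The histogram for 7 observations is far wider than the one for the full data set. A quick way to see why, assuming each guess is right with probability 1/2, is the standard error of a proportion, sqrt(p(1-p)/n). The helper name below is an illustrative assumption.

```python
from math import sqrt


def standard_error(num_observations, p=0.5):
    """Standard deviation of the fraction correct over num_observations guesses."""
    return sqrt(p * (1 - p) / num_observations)


for n in (520, 7):
    print(n, round(standard_error(n), 3))
```

With 520 observations the simulated fractions typically stay within a few percentage points of 0.5, while with only 7 they swing by roughly 19 points, so a single groundhog's record is much harder to distinguish from guessing.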

# Acknowledgement in the paper
This study was the result of a truly laboratory-driven effort that started over speculation on a Friday at the campus pub and has led to a cumulative effort of students in the Community Ecology and Energetics Laboratory at Lakehead University over several years. 