# STA130 Tutorial 5: Hypothesis Testing

|<img src="https://pictures.abebooks.com/inventory/md/md31377899338.jpg" alt="Scientific Revolutions" style="width: 300px; height: 450px;"/>|<img src="https://i.ytimg.com/vi/Yn8cCDtVd5w/maxresdefault.jpg" alt="Kuhn Cycle" style="width: 800px; height: 450px;"/>|
|-|-|
| | |

# Today's agenda (5 minutes)

- Lecture/Example: Stella McStat's "Wheel of Destiny"

    - **Parameters** versus **Statistics**, **independent** samples, etc.

- Hypothesis test walk-through (emphasize and explain vocab)

- Group Presentation

- Q&A/Vocabulary list (if time allows)


# The Wheel of Destiny (5 minutes)

Stella McStat had been running a small-time gambling operation on campus for several months during her first year at U of T. It was disrupted during COVID, but now that courses seem to be reliably back to in-person formats, Stella is getting things back up and running.

| <img src="stella2.png" style="height: 450px;"/> |  <img src="fair.png" style="height: 450px;"/> |
|-|-|
| | |

<sub><sup>Adapted from Lawton, L. (2009) An Exercise for Illustrating the Logic of Hypothesis Testing, Journal of Stat. Education, 17(2).</sup></sub>


# The Wheel of Destiny

The format: For each spin of the wheel, two gamblers take part. For a toonie each (\\$2 Canadian), Stella sells one a red ticket and one a black ticket  (i.e., total \\$4). Then Stella spins the Wheel of Destiny. The person who holds the colour on which the spinner stops gets \\$3.50 (Stella keeps \\$0.50 per spin for running the game and providing snacks).

| <img src="stella2.png" style="height: 450px;"/> |  <img src="fair.png" style="height: 450px;"/> |
|-|-|
| | |

<sub><sup>Adapted from Lawton, L. (2009) An Exercise for Illustrating the Logic of Hypothesis Testing, Journal of Stat. Education, 17(2).</sup></sub>


# The Wheel of Destiny

Stella just bought a new spinner, the critical piece of equipment for this game. She's heard some mixed reviews about the manufacturer she has purchased from. Before she begins using this spinner, she wants to make sure that it is, in fact, fair: she wants both colours to come up equally often. Because of the set-up of the game, Stella has no incentive to cheat and wants the game to be as fair as possible.

| <img src="stella2.png" style="height: 450px;"/> |  <img src="fair.png" style="height: 450px;"/> |
|-|-|
| | |

<sub><sup>Adapted from Lawton, L. (2009) An Exercise for Illustrating the Logic of Hypothesis Testing, Journal of Stat. Education, 17(2).</sup></sub>


# The Wheel of Destiny

Everything she can examine about the wheel seems fine; there is the same number of sectors of each colour and they each have the same area. BUT! Stella has a great idea and decides to come to YOU, her statistical guru, and ask you to verify that the new spinner is fit to use.

| <img src="stella2.png" style="height: 450px;"/> |  <img src="fair.png" style="height: 450px;"/> |
|-|-|
| | |

<sub><sup>Adapted from Lawton, L. (2009) An Exercise for Illustrating the Logic of Hypothesis Testing, Journal of Stat. Education, 17(2).</sup></sub>


## Tutorial Activity [Groups of 3 or 4] (15 minutes): submit answers to the following questions/prompts 

- How can we examine the wheel for fairness?
    - What's a *null hypothesis* here? 
        - Hint: the alternative hypothesis is just "$H_1: H_0 \text{ is False}$"
- What's "data" here? 
    - What's a sample?
        - Hint: the population would be every spin result ever 
    - What's the difference between a **parameter** and a **statistic**?
    - What's the difference between a **dependent** and **independent** sample?
- How would/could we go about conducting a simulation-based hypothesis test here?
    - Hint: what is a p-value?
    - *Write and submit as a group (with all your names) the process of using simulation to create a p-value for this problem*

## Answers Review (10 minutes)

- How can we examine the wheel for fairness?

$$
\begin{align}
H_0 &{}: {} p_r = p_b = 0.5 \quad\text{a spin is equally likely to come up red or black} \quad\text{(null hypothesis)}\\
H_1 &:{} p_r \neq p_b \quad\text{a spin is not equally likely to come up red or black} \quad\text{(alternative hypothesis)}
\end{align}$$


- We could perform actual spins and calculate the proportion of landing red
    - The observed proportions of red would be our **statistic**
    - The actual true chance of a red spin would be our **parameter**
    - If each spin result doesn't affect the others the spins are **independent**  
    
    
- How would/could we conduct a simulation-based hypothesis test here?
    - p-value: [TYPE OUT DEFINITION WITH HELP FROM CLASS]
    - *Demonstrate/Discuss the simulation code creating a p-value!*

# Code I (15 minutes)
- Discuss how line 6 specifies a **parameter** while line 9 calculates a **statistic** (which line 10 prints)
    - Try out some different parameter values 
    - Differentiate this exercise from specifying a *null hypothesis*
- Discuss/Explore what changing the seed in line 4 does
- Discuss/Explore question on line 3
- Discuss `replace=True` on line 7 and **independence**

In [None]:
# Simulating an observed test statistic
import numpy as np
spins = 100  # chose 100 spins, pros/cons of doing more/less?
np.random.seed(1) # Experiment with changing this
# What if the spinner weren't actually fair and slightly favored red (say 51-49)?
spin_results = np.random.choice(['Red', 'Black'], p=[0.51, 0.49], # Truth we don't know:
                                size=spins, replace=True)         # not a null hypothesis
# convert spin results to [1/True]'s for red and [0/False]'s for black
observed_test_stat = (spin_results == 'Red').sum() / spins
print('Our test statistic is ' + str(observed_test_stat) + ', meaning we spun ' + str(int(observed_test_stat * spins)) + ' reds out of ' + str(spins) + ' spins.')

# Intermission (5 minutes)

Let's now simulate the sampling distribution of the proportion of "Red" spins in a sample under a 50/50 assumption for the null hypothesis. 
> This is a nonparametric simulation-based analysis, as there are no assumptions about the distribution of the data (such as normality): the only assumption is the parameter value $p=0.5$, i.e., that the chance of getting "Red" is the same as the chance of getting "Black" in this data.

- Do you see why we've changed `p=[0.51, 0.49]` on the previous slide to `p=[0.5, 0.5]` on the next slide?
<!-- Previously we imagined the spinner was not actually fair; but, this is different than a null hypothesis assumption that the spinner is fair, which is the assumption we use to generate our sampling distribution on the next slide --> 
- Are the simulated draws from the sampling distribution on the next slide independent? 
<!-- Yes, each simulated sample does not affect the other simulated samples --> 

# Code II (10 minutes)

In [None]:
import pandas as pd
import plotly.express as px
num_samples,sample_size = 1000,100  # What are each of these? Do the choices here change results? 
simulated_sample_proportions = []
np.random.seed(seed=1)
for i in range(num_samples):
    sample = np.random.choice(['Red', 'Black'], size=sample_size, p=[0.5, 0.5], replace=True)
    simulated_sample_proportions += [(sample == 'Red').sum() / sample_size]
px.histogram(pd.DataFrame({'Proportion of Red': simulated_sample_proportions}), x='Proportion of Red', 
             title=str(num_samples)+' draws from the Sampling Distribution of the "Proportion of Red" (for a sample of size n='+str(sample_size)+')')

# Code III (15 minutes)

In [None]:
#observed_test_stat=0.54
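# Small tolerance so equally extreme values (e.g. 0.46 vs 0.54) are not missed due to floating-point rounding
num_as_or_more_extreme = \
  (abs(np.array(simulated_sample_proportions) - 0.5) >= abs(observed_test_stat - 0.5) - 1e-12).sum()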
p_value = num_as_or_more_extreme / num_samples
print(str(num_as_or_more_extreme) + ' sample statistics were as or more extreme than our observed test statistic of ' + str(observed_test_stat) + '\ngiving us a p-value of ' + str(p_value))

<p align="center">
  <img src="https://www.jcpcarchives.org/userfiles/values-of-p-Inference.jpg" />
</p>

[WRITE OUT SOME CONCLUSIONS AS A CLASS FOR DIFFERENT SETTINGS]
<!-- Since our p-value is more than 0.10, we have no evidence against the null hypothesis of the probability of spinning red is 0.5 -->
- num_samples,sample_size = 1000,100; observed_test_stat=0.51
- num_samples,sample_size = 1000,100; observed_test_stat=0.54
- num_samples,sample_size = 10000,1000; observed_test_stat=0.51
- num_samples,sample_size = 10000,1000; observed_test_stat=0.54
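
To compare the four settings listed above side by side, the simulation can be wrapped in a function. What follows is only a minimal sketch under some assumptions: the helper name `simulated_p_value` is hypothetical, it uses `np.random.default_rng` rather than `np.random.seed`, and it compares counts of red spins rather than proportions so that "as or more extreme" is an exact integer comparison.

```python
import numpy as np

def simulated_p_value(num_samples, sample_size, observed_test_stat, seed=1):
    """Two-sided simulation-based p-value for H0: P(Red) = 0.5 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Number of reds in each simulated sample under the null hypothesis
    simulated_red_counts = rng.binomial(n=sample_size, p=0.5, size=num_samples)
    # Convert the observed proportion back to a count of reds
    observed_red_count = round(observed_test_stat * sample_size)
    expected_red_count = sample_size / 2
    as_or_more_extreme = (
        np.abs(simulated_red_counts - expected_red_count)
        >= abs(observed_red_count - expected_red_count)
    )
    return as_or_more_extreme.mean()

settings = [(1000, 100, 0.51), (1000, 100, 0.54),
            (10000, 1000, 0.51), (10000, 1000, 0.54)]
p_values = {s: simulated_p_value(*s) for s in settings}
for (num_samples, sample_size, observed_test_stat), p_value in p_values.items():
    print(f"num_samples={num_samples}, sample_size={sample_size}, "
          f"observed_test_stat={observed_test_stat}: p-value = {p_value}")
```

Notice that the same observed proportion can lead to quite different p-values depending on `sample_size`, which is worth discussing when writing the conclusions.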

- We'll consider a theoretical **nonparametric** test alternative
- We'll consider a theoretical **parametric** test alternative

The graph above is a *sampling distribution*: the distribution of our test statistic (the proportion of red) across repeated samples from a population. This differs from the *data distribution*, which describes the individual observations in a given sample or population rather than a statistic computed from them (and is often drawn as a smooth curve rather than a histogram).
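
The distinction can be illustrated with a short sketch. It is an illustration only, and assumes a fair wheel, 100 spins per sample, and 1000 simulated samples; the variable names are made up for this example. The spread of the sampling distribution is compared with the standard error formula $\sqrt{p(1-p)/n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
sample_size, num_samples = 100, 1000

# Data distribution: the individual outcomes within ONE sample of spins
one_sample = rng.choice(['Red', 'Black'], size=sample_size, p=[0.5, 0.5])
colours, counts = np.unique(one_sample, return_counts=True)
print({str(c): int(k) for c, k in zip(colours, counts)})

# Sampling distribution: the statistic (proportion of red) across MANY samples
red_counts = rng.binomial(n=sample_size, p=0.5, size=num_samples)
simulated_proportions = red_counts / sample_size

simulated_sd = simulated_proportions.std()
theoretical_sd = np.sqrt(0.5 * (1 - 0.5) / sample_size)
print(f"SD of simulated proportions: {simulated_sd:.4f}")
print(f"sqrt(p(1-p)/n):              {theoretical_sd:.4f}")
```

Changing `sample_size` changes the spread of the sampling distribution, while changing `num_samples` only changes how precisely the simulation approximates it.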

Calculating a p-value (the probability of a test statistic being as or more extreme than the observed test statistic if the null hypothesis were true) will help us determine whether we have evidence against the null hypothesis.

## Discussion 2: Different Types of P-Values (30 minutes)
### One sample t-test
- Parametric null hypothesis assumption

- Estimate standard deviation ${\sigma}$ of null hypothesis with the sample standard deviation ${\hat\sigma}$

    - Hat symbolizes estimation
<br/><br/>
- Allows us to theoretically derive sampling distribution of test statistic, so p-value derived *theoretically*

    - P-value has same definition, but quantified by integrating area under theoretical distribution curve, rather than counting values as in our simulation based approach
    
    - Relies on parametric assumption

    - How the theoretical derivation works is a topic for later stats courses

Exp. Our null hypothesis for the "Wheel of Destiny" problem becomes  
$
\begin{align}
H_0 &: {} \mu = 0.5 \quad\text{(null hypothesis)}\\
&\color{white}{:} \text{ } \sigma \text{ estimated as sample standard deviation } \hat{\sigma} \\
&\color{white}{:} {} \text{ and the population is normally distributed}\\
H_1 &:{} H_0 \text{ is } \text{False}
\end{align}$
- Need all the assumptions in the null hypothesis to hold for the p-value to make sense
- Different from the simulation approach, where we just needed the assumptions in the null hypothesis to form a sampling distribution, from which a p-value can be calculated.

In [None]:
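# From "Wheel of Destiny Example"
from scipy import stats
# Assumed encoding (not defined earlier): 1 for each 'Red' spin, 0 for each 'Black' spin
numeric_spin_results = (spin_results == 'Red').astype(int)
stats.ttest_1samp(numeric_spin_results, 0.5)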

We observe a theoretical t-test based p-value of 0.705... which again provides no evidence against the null hypothesis that the probability of spinning red is 0.5
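
To see what "integrating the area under the theoretical curve" means in practice, the t-test can be reproduced by hand. This is a rough sketch on hypothetical data (100 simulated spins coded 1 for red and 0 for black, stored in a made-up variable `numeric_spins`), not the data from the cells above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical data: 100 spins coded 1 for red, 0 for black
numeric_spins = rng.binomial(n=1, p=0.5, size=100)
null_mean = 0.5

n = len(numeric_spins)
sample_mean = numeric_spins.mean()
sigma_hat = numeric_spins.std(ddof=1)  # sample standard deviation estimates sigma
t_stat = (sample_mean - null_mean) / (sigma_hat / np.sqrt(n))

# Two-sided p-value: area in both tails of the t distribution with n - 1 degrees of freedom
p_value_by_hand = 2 * stats.t.sf(abs(t_stat), df=n - 1)

scipy_result = stats.ttest_1samp(numeric_spins, null_mean)
print(t_stat, p_value_by_hand)
print(scipy_result.statistic, scipy_result.pvalue)
```

Both lines of output should agree, since `stats.ttest_1samp` performs the same calculation.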

### Median Sign Test

- Null hypothesis about the median

- Under the null hypothesis, each observation is equally likely to fall above or below the hypothesized median, so we expect about half of the sample on each side

    - Implies a binomial sampling distribution for the number of observations above the median (the binomial counts "successes" in $n$ independent trials that each have two possible outcomes)
    
    - Nonparametric test: no distributional assumption about the data is needed, because the sampling distribution under $H_0$ is known
    
- The "sign test" puts a negative sign on observations less than the median and a positive sign on observations greater than the median; under $H_0$ this should be roughly a 50-50 split
- We can derive a sampling distribution from this test and use it to calculate a p-value
    


Exp. Our null hypothesis for the "Wheel of Destiny" problem becomes  
$
\begin{align}
H_0 &: {} Median = 0.5\\
H_1 &:{} H_0 \text{ is } \text{False}
\end{align}$


In [None]:
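# sign test for theoretical p-value. Relies on knowledge of binomial, cumulative distribution and sign test
# Assumption: spins_greater_than is the number of 'Red' spins (coded 1, i.e. above the median of 0.5)
spins_greater_than = int((spin_results == 'Red').sum())
min(1.0, 2 * min(stats.binom(n=spins, p=0.5).cdf(spins_greater_than), stats.binom(n=spins, p=0.5).sf(spins_greater_than - 1)))  # two-sided: double the smaller tail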

Our theoretical p-value is 0.681... which again provides no evidence against the null hypothesis that the probability of spinning red is 0.5
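
The sign test is easier to see on numeric data. Here is a minimal sketch, assuming a hypothetical helper `sign_test_p_value` and made-up example values; observations exactly equal to the hypothesized median are dropped, which is one common convention.

```python
import numpy as np
from scipy import stats

def sign_test_p_value(values, hypothesized_median):
    """Two-sided exact sign test p-value (hypothetical helper; ties with the median are dropped)."""
    values = np.asarray(values)
    num_above = int((values > hypothesized_median).sum())  # "+" signs
    num_below = int((values < hypothesized_median).sum())  # "-" signs
    n = num_above + num_below
    null_dist = stats.binom(n=n, p=0.5)
    lower_tail = null_dist.cdf(num_above)     # P(X <= num_above)
    upper_tail = null_dist.sf(num_above - 1)  # P(X >= num_above)
    return min(1.0, 2 * min(lower_tail, upper_tail))

# Made-up data: 8 of 10 values lie above the hypothesized median of 50
example_values = [52, 61, 47, 55, 58, 49, 63, 51, 57, 60]
p_value = sign_test_p_value(example_values, hypothesized_median=50)
print(p_value)
```

Only the signs matter here: replacing 63 with 630 would not change the p-value.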

### Wilcoxon Signed-Rank Test
- Nonparametric test (no parametric distributional assumption such as normality)

- Compare medians of two dependent samples to see if they are statistically significantly different from each other

    - Dependent samples means that observations in one group are paired with observations in the other group (e.g., the same subjects measured twice), so one group provides information about the other
<br/><br/>
- The "signed-rank test" ranks the absolute differences between the paired observations (if your two samples have size n, rank 1 is given to the smallest absolute difference and rank n to the largest). It then attaches a negative sign to the rank when the difference is negative (and a positive sign when the difference is positive)
    
    - Tests whether the mean of these signed ranks is 0
<br/><br/>

Exp. Is there a difference in the 6th grade and 7th grade exam scores for a group of students?  
$
\begin{align}
H_0 &: {} \text{median}_{\text{grade 6}} - \text{median}_{\text{grade 7}} = 0\\
H_1 &:{} H_0 \text{ is } \text{False}
\end{align}$

In [None]:
import random
random.seed(1)
# Assume score populations are given by below code
grade_6_exam_scores = [random.randint(30, 70) for i in range(0, 100)]
grade_7_exam_scores = [random.randint(50, 80) for i in range(0, 100)]
diff = [grade_7_exam_scores[i]-grade_6_exam_scores[i] for i in range(len(grade_6_exam_scores))]  # difference in scores between grade 7 and grade 6
diff  # most students appear to have scored better in grade 7

In [None]:
stats.wilcoxon(diff)

We observe a p-value < 0.001... which provides very strong evidence against the null hypothesis that there is no difference between grade 6 and grade 7 test scores of the same students
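
The ranking step can be made concrete with a small hand calculation. The sketch below uses hypothetical paired scores for 8 students (not the simulated data above), chosen so that there are no zero or tied differences, and compares the hand-computed rank sums with the statistic reported by `stats.wilcoxon`.

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores for 8 students
before = np.array([62, 55, 70, 48, 66, 59, 73, 51])
after = np.array([68, 54, 75, 55, 64, 67, 83, 60])
differences = after - before  # [6, -1, 5, 7, -2, 8, 10, 9]

# Rank the absolute differences (rank 1 = smallest), then split the ranks by sign
ranks = stats.rankdata(np.abs(differences))
rank_sum_positive = ranks[differences > 0].sum()
rank_sum_negative = ranks[differences < 0].sum()
print(rank_sum_positive, rank_sum_negative)

# For a two-sided test, scipy reports the smaller of the two rank sums
scipy_result = stats.wilcoxon(differences)
print(scipy_result.statistic, scipy_result.pvalue)
```

If the two grades did not differ, the positive and negative rank sums would be expected to be roughly equal; a very lopsided split is what produces a small p-value.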

## Discussion 3 (10 minutes)
#### Similarities between all 3 tests? Differences?

<!-- - **Statistical Inference for a Single Sample Proportion**: -->

<!--   1. State null hypothesis $H_0$ and alternative hypothesis $H_1$ -->

<!--   2. Calculate a *test statistic* based on one observed *random sample* -->

<!--   3. *Simulate* samples under $H_0$ and calculate the statistic for the simulated sample. Repeat many times. (this constitutes our estimated *sampling distribution*) -->

<!--   4. Evaluate the evidence against $H_0$ by calculating the *p-value* -->

<!--   5. Make a conclusion (based on the strength of evidence against $H_0$) -->

# Group Presentation (30 min) :

*I will post the following material on Quercus in advance for students to access*

You are an amateur data detective who reads studies in your free time and checks whether the researchers adhere to academic integrity. Here are some studies that you recently found. Identify the hypotheses (One-Group or Two-Group) of the researchers and describe how you could validate the findings using a simulation test. Make a presentation on your hypothesis test.

### SCENARIOS

(a) A health survey asked 200 individuals aged 20-45 living in Toronto to report the number of minutes they exercised last week. Researchers were interested in determining whether the average duration of exercise differed between people who consume cannabis and those who do not consume cannabis. Assume the researchers who conducted this study found that people who used cannabis exercised, on average, 20 minutes per week. In contrast, people who did not use cannabis exercised 40 minutes per week, on average. The researchers reported a p-value of 0.249.

(b) A study was conducted to examine whether a mother's exposure to alcohol while she was pregnant affects the odds of developmental disabilities. The researchers used a birth registry of all children born in Ontario in 2018, which included approximately 130,000 births. The researchers found that 0.6% of mothers reported that their babies have developmental disabilities, and that 56% of these babies were born to mothers who consumed alcohol at least once during pregnancy. In contrast, 55% of healthy babies were born to mothers who were not exposed to alcohol. The researchers reported a p-value of 0.50.

(c) A group of researchers wanted to find out whether completing a degree in Statistics is a worthy investment for an individual. They conducted a survey of graduates from the University of Toronto and included 1,000 recent graduates who completed their Bachelor’s degree in the last five years in the study; 80% of the respondents were female and 20% were male. The median reported income of a UofT graduate was \\$76,000. Among statistics graduates, the median reported income was \\$84,000. The researchers reported a p-value of 0.014.

(d) A team of researchers were interested in understanding millennials’ views regarding housing affordability in Toronto. The team interviewed 850 millennials currently living in Toronto. 84% reported that they felt housing prices were unaffordable in the city. Suppose the researchers were interested in testing whether this proportion was different from a study published last year, which found that 92% of millennials reported that housing costs were unaffordable. The researchers reported a p-value of 0.023.

(e) Suppose a drug company was interested in testing a new weight-loss drug. They enrolled 20,000 participants and assigned 10,000 to take their new drug, SlimX, and 10,000 to take a placebo. The researchers found that over 2 months, participants who took SlimX lost, on average, 5 lbs. In comparison, the control group lost 4.5 lbs during the same time. The researchers reported a p-value of <0.0001.

### In your presentation, it is important to:  

- Give an introduction of the question
- State the hypotheses $H_0$ and $H_1$ 
- Specify the assumption(s) of your hypothesis test, if any
- Describe the procedures for data collection (sampling) and simulation
- Include and explain at least two terms from the vocabulary
- Summarize your findings and interpret the results 
- Provide limitations (optional but good practice) (e.g. sample size, study design issue)

## Some things to keep in mind  

- The introduction should state the overall purpose of the presentation   

- Each main point should be “signaled.” Use *signal phrases* or *signposts* to transition between your points.

- Every group member must speak during the presentation or marks will be deducted for participation.


## Vocabulary (If Time)
- statistical inference 
- population 
- random sample 
- sampling distribution 
- simulation 
- parameter 
- simulation statistic 
- test statistic 
- p-value

#### This Week's Vocab (15-20 min) :

- Vocabulary required for the group presentation 
- Type 1 and 2 errors (also written as Type I and II errors)
- Comparing two population means/medians/proportions
- One-group hypothesis tests
- sample
- for() loops
- diff()