# STA130 Tutorial 6 (with \<Your Favorite TA\>): <br>Two Sample Hypothesis Testing

# Today's agenda (5 minutes)

- Two-sample Hypothesis testing lecture/example: Gratitude Journaling

    - **Parameters** versus **Statistics**, **independent** samples, etc.

- Group Presentations

- Gender Disparity Readings

### Gratitude Journaling (5 minutes) *[click down...]*
**Skip over the videos and links for now**  

There are multiple studies exploring the relationship between gratitude journaling and life satisfaction, mental health, adjusting to university and more. An example of gratitude journaling is writing down three things that you are thankful for at the end of the day. We might all be able to benefit from this science-backed way to live more satisfying lives. To learn more, check out this 10-minute video [An antidote to dissatisfaction](https://www.youtube.com/watch?v=WPPPFqsECz0) from the Kurzgesagt YouTube channel.  To hear a little bit more about mental health more generally, check out this 20-minute TED talk [This could be why you're depressed or anxious](https://www.ted.com/talks/johann_hari_this_could_be_why_you_re_depressed_or_anxious?language=en) by Johann Hari.

The data this week is synthetic, but is loosely based on a study of students in their first year of a five-year college in Turkey. Specifically, 61 Turkish students in their first year of college were randomly assigned to either practice gratitude journaling for three weeks (treatment group, n=30) or write down a factual account of their day for three weeks (control group, n=31). At the end of the three-week period, their life satisfaction (on a scale from 0 to 100) was measured, as was their adjustment to university life (on a scale from 48 to 336, where scores over 144 are considered to indicate successful adjustment, while lower scores indicate difficulties with adjusting). To read more about the study, see the original [article](https://link.springer.com/article/10.1007/s10447-017-9289-8).

In [None]:
import pandas as pd
gratitude = pd.read_csv("gratitude.csv", encoding="ISO-8859-1"); gratitude

In [None]:
print(gratitude.isnull().sum())  # check for missing data
gratitude = gratitude.dropna()  # technically no change as no missing data

## Tutorial Activity I: Thinking about two-sample hypothesis testing (5 minutes)
0. How could we examine the effectiveness of gratitude journaling?
1. What is the null hypothesis in a two-sample context? Hint: it involves a parameter for each group
2. (*As class*) Discuss how `LINE 2` and `LINE 4` in the code below calculate our test statistic, and how this differs from a one-sample problem

In [None]:
import numpy as np
adjust_college_means = gratitude.groupby('treatment')["adjust_college"].mean()  # LINE 2
print(adjust_college_means)
observed_test_statistic = np.diff(adjust_college_means)  # LINE 4
observed_test_statistic_numeric = observed_test_statistic[0]  # convert to number
print('\nOur observed test statistic is ' + str(observed_test_statistic_numeric))

#### Sample Answer:
- Two sample hypothesis test!
- We can use the `treatment` and `control` groups to form our two samples

> Let's work with the means ($\mu$) for each group

$$
\begin{align}
H_0 &: \mu_{treatment} = \mu_{control} \quad\text{mean adjust_college of control and treatment are equal}\\
H_1 &: \mu_{treatment} \neq \mu_{control} \quad\text{mean adjust_college of control and treatment are not equal}
\end{align}
$$

Our test statistic is still a single value, but it is now the **difference** between the two groups' sample means
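
To make the direction of that difference concrete, here is a minimal sketch on hypothetical toy data (the six scores below are made up and are not from `gratitude.csv`). It assumes the same column names as the tutorial and shows that `np.diff` on the grouped means gives treatment minus control, because `groupby` sorts the group labels alphabetically.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the gratitude dataset): six scores in two groups
toy = pd.DataFrame({
    "treatment": ["control", "treatment", "control", "treatment", "control", "treatment"],
    "adjust_college": [100.0, 150.0, 110.0, 160.0, 120.0, 170.0],
})

# groupby sorts the group labels alphabetically: 'control' first, then 'treatment'
toy_means = toy.groupby("treatment")["adjust_college"].mean()

# np.diff subtracts each entry from the one after it, so this is treatment - control
diff_via_np = np.diff(toy_means)[0]

# Indexing by label gives the same number and makes the direction explicit
diff_via_labels = toy_means["treatment"] - toy_means["control"]

print(toy_means)
print(diff_via_np, diff_via_labels)
```

Indexing by label is a little longer, but it keeps working if the group names change in a way that alters their alphabetical order.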

## Tutorial Activity II: Data Distribution and Taking Samples (10 minutes)
The distplot below helps us visualize the *data distribution* of `adjust_college` scores for each group.  

3. Does gratitude journaling appear to be effective? Why or why not?

In [None]:
import plotly.figure_factory as ff
gratitude_data = [gratitude.adjust_college[gratitude.treatment=='treatment'],  # Group data by treatment type
             gratitude.adjust_college[gratitude.treatment=='control']]
group_labels = ['treatment', 'control']
fig = ff.create_distplot(gratitude_data, group_labels, bin_size = 8)
fig.update_layout(xaxis_title = "adjust_college"); fig.show()

4. How do we take samples from the population?  
   - Would this be independent or dependent sampling?  
   - What would a *sampling distribution* look like for `adjust_college` scores?  

#### Sample Answer:
- Under the null hypothesis, there is no difference between $\mu_{treatment}$ and $\mu_{control}$
    - The treatment label then tells us nothing about `adjust_college` scores
    - To simulate a sample under this assumption, we can randomly shuffle the labels across the students we observed
    - This is *independent* sampling: each student belongs to exactly one group, so the two groups are not paired

In [None]:
np.random.seed(130)
gratitude_copy = gratitude.copy()  # work on a copy to preserve original data
gratitude_copy.treatment = gratitude.treatment.sample(frac=1).values  # .values is crucial: without it, pandas
# realigns the shuffled labels on the index and they end up back in their original rows
gratitude_copy.head()  # notice how treatment column is different!

In [None]:
gratitude_copy.groupby('treatment').size()  # verify only treatment labels were shuffled, 
# we still have 31 control and 30 treatment
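
The role of `.values` in the shuffle can be seen in a small sketch. The six-row frame below is hypothetical and only stands in for the `treatment` column. It compares assigning the shuffled Series directly with assigning its underlying array.

```python
import numpy as np
import pandas as pd

np.random.seed(130)

# Hypothetical toy labels (NOT the gratitude dataset)
toy = pd.DataFrame({"treatment": ["control"] * 3 + ["treatment"] * 3})

# sample(frac=1) reorders the rows but keeps each row's original index label
shuffled = toy.treatment.sample(frac=1)

# Assigning a Series aligns on the index, so every label returns to its original row
aligned = toy.copy()
aligned["treatment"] = shuffled

# Assigning the underlying array goes by position, so the shuffled order is kept
positional = toy.copy()
positional["treatment"] = shuffled.values

print(aligned.treatment.tolist())     # same order as the original column
print(positional.treatment.tolist())  # same order as `shuffled`
```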

In [None]:
def A(): np.random.seed(1)
def B(): return 1000
def C(): return []
def D(): return gratitude.treatment.sample(frac=1).values
def E(): return gratitude_copy.groupby('treatment')["adjust_college"].mean()
def F(x: pd.Series): return np.diff(x)[0]

## Simulating Sampling Distribution (10 minutes)
The nonparametric permutation test for two samples resembles the one-sample permutation test, except for how each simulated sample is drawn and how the test statistic is calculated (both under the assumption that the null hypothesis is true).

As a class, fill in the `...` in the code block below with the letters below (including the `()` after the letter) to complete the sampling distribution simulation
- `A(): np.random.seed(1)`
- `B(): 1000`
- `C(): []`
- `D(): gratitude.treatment.sample(frac=1).values`
- `E(): gratitude_copy.groupby('treatment')["adjust_college"].mean()`
- `F(): np.diff(...)[0]`

In [None]:
try:  # ignore the try-except, just so all cells can be run at beginning
    ...
    difference_in_means = ...
    for i in range(...):
        gratitude_copy.treatment = ...
        difference_in_means += [...] 
except TypeError: pass

In [None]:
# SOLUTION
A()
difference_in_means = C()
for i in range(B()):
    gratitude_copy.treatment = D()
    difference_in_means += [F(E())]

In [None]:
np.random.seed(1); num_samples = 1000; difference_in_means = []  # record test statistic
for i in range(num_samples):
    gratitude_copy.treatment = gratitude.treatment.sample(frac=1).values  # same shuffle as above
    difference_in_means += [np.diff(gratitude_copy.groupby('treatment')["adjust_college"].mean())[0]]  # treatment-control
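
The same simulation could also be wrapped in a reusable function. The sketch below is one possible way to do that, not the tutorial's method: the function name `permutation_diff_in_means` is hypothetical, it uses NumPy's `default_rng` generator rather than `np.random.seed`, and it runs on synthetic stand-in scores rather than `gratitude.csv`.

```python
import numpy as np

def permutation_diff_in_means(values, labels, num_samples=1000, seed=1):
    """Simulate differences in means (treatment - control) under shuffled labels."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    diffs = np.empty(num_samples)
    for i in range(num_samples):
        shuffled = rng.permutation(labels)  # shuffle labels; group sizes are unchanged
        diffs[i] = (values[shuffled == "treatment"].mean()
                    - values[shuffled == "control"].mean())
    return diffs

# Synthetic stand-in data (assumption): 61 scores drawn from one normal distribution
data_rng = np.random.default_rng(0)
toy_values = data_rng.normal(loc=150, scale=30, size=61)
toy_labels = np.array(["treatment"] * 30 + ["control"] * 31)

toy_diffs = permutation_diff_in_means(toy_values, toy_labels)
print(toy_diffs.mean(), toy_diffs.std())  # the mean should be close to 0
```

Subtracting the group means by label, instead of calling `np.diff`, keeps the treatment-minus-control direction explicit.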

In [None]:
import plotly.express as px
fig = px.histogram(pd.DataFrame({'Difference in Means': difference_in_means}), x='Difference in Means')
fig.update_layout(title=r'$$\text{Sampling Distribution of } \bar{x}_{treatment} - \bar{x}_{control}$$')  # raw string avoids invalid escapes
fig.add_vline(x=np.quantile(difference_in_means,(0.05, .95))[0], line_dash = 'dash', line_color = 'red', 
              annotation_text = '90% confidence interval')
fig.add_vline(x=np.quantile(difference_in_means,(0.05, .95))[1], line_dash = 'dash', line_color = 'red', 
              annotation_text = '90% confidence interval')
fig.show()
print("90% confidence interval for the difference in means", np.quantile(difference_in_means,(0.05, .95)))

## Follow up on Graph
- The simulated differences should appear centered at 0, as expected under the assumption of the null hypothesis!
- We can tie our simulated sampling distribution back to confidence intervals!
    5. What does the 90% confidence interval tell us?
    6. How can we shrink the range of this interval?

### Sample Solution
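- The interval from -16.060 to 15.988 contains the middle 90% of the simulated differences in means: if the null hypothesis were true, about 90% of label shuffles would produce a difference in this range
    - Strictly speaking, because it is built from shuffled labels and centered at 0, it describes what to expect under the null hypothesis rather than estimating the true difference in means
- To shrink this interval, we can lower the confidence level (for example, 80% instead of 90%), at the cost of capturing less of the distribution
- Alternatively, we could collect more data (a larger sample gives a tighter estimate and better inference); running more simulations only smooths the histogram. Diminishing returns concern?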
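
Both ways of shrinking the interval can be checked with a short simulation. This is a minimal sketch on synthetic scores (normal with mean 150 and standard deviation 30, an assumption made only for illustration), and the helper name `null_interval_width` is hypothetical.

```python
import numpy as np

def null_interval_width(n_per_group, level, num_samples=2000, seed=1):
    """Width of the central `level` interval of shuffled differences in means."""
    rng = np.random.default_rng(seed)
    values = rng.normal(loc=150, scale=30, size=2 * n_per_group)  # synthetic scores
    diffs = np.empty(num_samples)
    for i in range(num_samples):
        # Shuffling the scores and splitting them in half is equivalent to shuffling labels
        shuffled = rng.permutation(values)
        diffs[i] = shuffled[:n_per_group].mean() - shuffled[n_per_group:].mean()
    tail = (1 - level) / 2
    lower, upper = np.quantile(diffs, (tail, 1 - tail))
    return upper - lower

width_90 = null_interval_width(30, 0.90)
width_80 = null_interval_width(30, 0.80)             # lower level, same data
width_90_more_data = null_interval_width(120, 0.90)  # same level, four times the data

print(width_90, width_80, width_90_more_data)
```

Quadrupling the sample size roughly halves the width, which is one way to see the diminishing returns mentioned above.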

In [None]:
from typing import Any
def G(): return np.array(difference_in_means)
def H(): return observed_test_statistic_numeric
def I(x: Any): return abs(x)

## P-value calculation and null hypothesis conclusion (10 minutes)
As a class, fill in the `...` in the code block below with the letters below (including the `()` after the letter) and any other required functions to complete the p-value calculation for our observed test statistic of 37.779...
- `G(): np.array(difference_in_means)`
- `H(): observed_test_statistic_numeric`
- `I(): abs(...)`

In [None]:
num_as_or_more_extreme = ...
p_value = ...

In [None]:
# SOLUTION
num_as_or_more_extreme = (I(G()) >= I(H())).sum()
p_value = num_as_or_more_extreme / B()
print(str(num_as_or_more_extreme) + ' sample statistics out of ' + str(B()) + 
    ' were as or more extreme than our observed test statistic of ' + str(H()) + 
    '\ngiving us a p-value of ' + str(p_value))

In [None]:
# our observed test stat was 37.7796...
num_as_or_more_extreme = \
  (abs(np.array(difference_in_means)) >= abs(observed_test_statistic_numeric)).sum()
p_value = num_as_or_more_extreme / num_samples
print(str(num_as_or_more_extreme) + ' sample statistics out of ' + str(num_samples) + 
    ' were as or more extreme than our observed test statistic of ' + str(observed_test_statistic_numeric) + 
    '\ngiving us a p-value of ' + str(p_value))
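
To see what the `abs(...) >= abs(...)` comparison counts, here is a small hand-checkable sketch. The ten simulated values and the observed statistic are hypothetical and chosen only so the count is easy to verify by eye.

```python
import numpy as np

# Hypothetical simulated differences and observed statistic (illustration only)
toy_simulated = np.array([-12.0, -3.0, 0.5, 4.0, 9.0, -8.0, 2.0, 11.0, -1.0, 6.0])
toy_observed = -9.0

# abs() makes the comparison two-sided: -12, 9 and 11 are all at least as far from 0 as -9
as_or_more_extreme = abs(toy_simulated) >= abs(toy_observed)
toy_p_value = as_or_more_extreme.sum() / len(toy_simulated)

print(as_or_more_extreme.sum(), toy_p_value)
```

Without `abs`, only one tail would be counted, which would not match the two-sided alternative $\mu_{treatment} \neq \mu_{control}$.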

### Conclusion?
<img src="https://www.jcpcarchives.org/userfiles/values-of-p-Inference.jpg" style="height: 700px; width: 1400px"/>

## Some theoretical parametric and non-parametric test alternatives from last week ... but in two sample context! (5 minutes)

### Two-sample t-test
- Null hypothesis: the means of two independent samples are the same
- Parametric: assumes each independent sample is drawn from a theoretical normal population

In [None]:
fig = ff.create_distplot(gratitude_data, group_labels, bin_size = 8, curve_type = 'normal')  # add a normal distribution curve
fig.update_layout(xaxis_title = "adjust_college"); fig.show()

In [None]:
from scipy import stats
stats.ttest_ind(gratitude.adjust_college[gratitude.treatment=='treatment'],
               gratitude.adjust_college[gratitude.treatment=='control'])
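
By default, `stats.ttest_ind` also pools the two sample variances (`equal_var=True`). The sketch below uses synthetic stand-in samples (not the gratitude data) to show how the result can be read by attribute, and how `equal_var=False` requests Welch's t-test when equal variances seem doubtful.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (assumption): normal scores with different means
rng = np.random.default_rng(130)
toy_treatment = rng.normal(loc=170, scale=30, size=30)
toy_control = rng.normal(loc=140, scale=30, size=31)

# Default: pooled-variance t-test (equal_var=True)
pooled = stats.ttest_ind(toy_treatment, toy_control)

# Welch's t-test: does not assume the two populations share a variance
welch = stats.ttest_ind(toy_treatment, toy_control, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

The sign of the statistic follows the order of the arguments, so passing the treatment group first gives a treatment-minus-control direction, matching the permutation test.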

### Nonparametric median test
> Now working with medians instead of means ($\mu$)

$$
\begin{align}
H_0 &: \text{median}_{treatment} = \text{median}_{control} \quad\text{median adjust_college of control and treatment are equal}\\
H_1 &: \text{median}_{treatment} \neq \text{median}_{control} \quad\text{median adjust_college of control and treatment are not equal}
\end{align}
$$


In [None]:
stats.median_test(gratitude.adjust_college[gratitude.treatment=='treatment'], 
                  gratitude.adjust_college[gratitude.treatment=='control'])
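
`stats.median_test` (Mood's median test) returns four values. The sketch below unpacks them on synthetic stand-in samples (not the gratitude data) so each piece of the output can be read separately.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (assumption): normal scores with different centers
rng = np.random.default_rng(130)
toy_treatment = rng.normal(loc=170, scale=30, size=30)
toy_control = rng.normal(loc=140, scale=30, size=31)

# Returns the test statistic, the p-value, the grand median of the pooled data,
# and a 2 x 2 table counting values above / not above that grand median per sample
statistic, p_value, grand_median, table = stats.median_test(toy_treatment, toy_control)

print("grand median:", grand_median)
print("contingency table:\n", table)
print("statistic:", statistic, "p-value:", p_value)
```

The table is what the test is built on: if the two medians were equal, each sample would be expected to have about half of its values above the grand median.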

(*As class*) Consider the null hypotheses and respective p-values for these tests. Do they align with our conclusion from the nonparametric permutation test?

#### Sample Solution
- Both p-values are very small, providing strong evidence against their respective null hypotheses
- We have strong evidence that the mean and median `adjust_college` scores differ between the control and treatment groups

## Vocabulary Review (10 min max, skip if all clear)

- statistical inference   
- population    
- assumptions    
- random sample  
- independent, dependent sample   
- sampling distribution   
- one and two-sample hypothesis tests
- permutation tests  
- simulation  
- parameter  
- simulation statistic   
- test statistic  
- p-value  
- parametric, nonparametric  

## Code/Functions Review (5 min max, skip if all clear)
- `numpy.random.seed()`  
- `numpy.random.choice()`  
- `(sample == 'H').sum()`  
- `(abs(...) >= abs(...)).sum()`  
- `numpy.diff()`  
- `dataframe.groupby()`  
- `dataframe.column.sample(frac=1).values`  
- `series.mean()`, `series.median()`, `series.std()`  
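
As a quick refresher, the one-sample idioms in this list can be combined into a single simulation. The scenario below (15 heads in 20 flips of a possibly unfair coin) is hypothetical and chosen only to exercise `numpy.random.choice()`, `(sample == 'H').sum()`, and the `abs(...) >= abs(...)` comparison.

```python
import numpy as np

np.random.seed(130)

# Hypothetical one-sample setup: 20 flips with 15 heads observed
n_flips = 20
observed_heads = 15
expected_heads = n_flips * 0.5  # expected count under the null of a fair coin
num_samples = 1000

simulated_heads = []
for i in range(num_samples):
    sample = np.random.choice(['H', 'T'], size=n_flips)  # fair coin under the null
    simulated_heads += [(sample == 'H').sum()]           # simulated test statistic
simulated_heads = np.array(simulated_heads)

# Two-sided: count simulations at least as far from the expected count as we observed
num_as_or_more_extreme = (abs(simulated_heads - expected_heads)
                          >= abs(observed_heads - expected_heads)).sum()
p_value = num_as_or_more_extreme / num_samples
print(p_value)
```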

# MOVE TO ANOTHER WEEK?
## Gender Disparity Paper
*I will post the following material on Quercus in advance for students to access*  

Prior to this tutorial, students should have reviewed the videos on Plagiarism and Hedging.
Please read the following article: Ross et al. (2022) “Women are Credited Less in Science than are Men”.
This article is available on Quercus (**LINK LATER**) and as the [paper online](https://www.nature.com/articles/s41586-022-04966-w). Consider reading everything up to the ["Ethical approval"](https://www.nature.com/articles/s41586-022-04966-w#Sec7) section. Everything after that is more information about the authors' study, which you're welcome to read if you are interested.

(If you don’t yet know how to get access to a journal article through UT library, it’s a good time to ask for help from your TA or your peers.)

Once you have watched the videos and read the article, you will prepare a 10-min group presentation providing a summary of the above study.



# MOVE TO ANOTHER WEEK?
### Make sure you explain the following, at a minimum:
- Objective: What are the authors interested in studying? Why is it important to conduct this study?
- Methods: What type of study design is used? Who are the participants? What statistical tests are used?
- Results: What are the main findings of the study? Make sure you support any statements with facts (e.g. proportions, p-values, etc.), and present the results in the context of the study.
- Conclusions: What are the most important take away messages to you? Are there any important limitations or future directions?

### Some things to keep in mind
- The introduction should state the overall purpose of the presentation
- Each main point should be “signaled.” Use signal phrases or signposts to transition between your points.
- Every group member must speak during the presentation or marks will be deducted for participation.


# Group Presentation (Remaining Time) :
> Presentations begin at the start of the next tutorial! Use this time to introduce scenarios, form groups, come up with a plan...

*I will post the following material on Quercus in advance for students to access*

You are an amateur data detective that reads studies in your free time and checks whether the researchers conform to academic integrity. Here are some studies that you recently found. Identify the hypotheses (One-Group or Two-Group) of the researchers and describe how you could validate the findings using a simulation test. Make a presentation on your hypothesis test.

### 5 SCENARIOS: (continued below) *[click down...]*

(a) A health survey asked 200 individuals aged 20-45 living in Toronto to report the number of minutes they exercised last week. Researchers were interested in determining whether the average duration of exercise differed between people who consume cannabis and those who do not consume cannabis. Assume the researchers who conducted this study found that people who used cannabis exercised, on average, 20 minutes per week. In contrast, people who did not use cannabis exercised 40 minutes per week, on average. The researchers reported a p-value of 0.249.

(b) A study was conducted to examine whether a mother's exposure to alcohol while she was pregnant affects the odds of developmental disabilities. The researchers used a birth registry of all children born in Ontario in 2018, which included approximately 130,000 births. The researchers found that 0.6% of mothers reported that their babies have developmental disabilities, and that 56% of these babies were born to mothers who consumed alcohol at least once during pregnancy. In contrast, 55% of healthy babies were born to mothers who were not exposed to alcohol. The researchers reported a p-value of 0.50.

(c) A group of researchers wanted to find out whether completing a degree in Statistics is a worthy investment for an individual. They conducted a survey of graduates from the University of Toronto and included 1,000 recent graduates who completed their Bachelor’s degree in the last five years in the study; 80% of the respondents were female and 20% were male. The median reported income of a UofT graduate was \$76,000. Among statistics graduates, the median reported income was \$84,000. The researchers reported a p-value of 0.014.

(d) A team of researchers were interested in understanding millennials’ views regarding housing affordability in Toronto. The team interviewed 850 millennials currently living in Toronto. 84% reported that they felt housing prices were unaffordable in the city. Suppose the researchers were interested in testing whether this proportion was different from a study published last year, which found that 92% of millennials reported that housing costs were unaffordable. The researchers reported a p-value of 0.023.

(e) Suppose a drug company was interested in testing a new weight-loss drug. They enrolled 20,000 participants and assigned 10,000 to take their new drug, SlimX, and 10,000 to take a placebo. The researchers found that over 2 months, participants who took SlimX lost, on average, 5 lbs. In comparison, the control group lost 4.5 lbs during the same time. The researchers reported a p-value of <0.0001.

## In your presentation, it is important to:  

- Give an introduction of the question
- State the hypotheses $H_0$ and $H_A$ 
- Specify the assumption(s) of your hypothesis test, if any
- Describe the procedures for data collection (sampling) and simulation
- Include and explain at least two terms from the vocabulary
- Summarize your findings and interpret the results 
- Provide limitations (optional but good practice) (e.g. sample size, study design issue)

## Some things to keep in mind  

- The introduction should state the overall purpose of the presentation   

- Each main point should be “signaled.” Use [*signal phrases*](STA130_Oral_Presentation_useful_phrases.docx) or [*signposts*](Signposting.pdf) to transition between your points.

- Every group member must speak during the presentation or marks will be deducted for participation.