# **Exercise Session 6 - Solutions**
# Developed by Biljana Jonoska Stojkova, PhD
# Revised by Johnson Chen

## **Lecture 6 - Hypothesis Testing: Basic Concepts and Basic Tests (t-test, Paired t-test) and Their Assumptions**

Today's exercise will focus on answering a predefined research question and determining which of the statistical concepts discussed in today's lecture are applicable to our Copper study dataset and research questions. 

On Day 5, we presented a data-driven story, with various teams exploring different aspects of the data. Some teams created plots to answer the primary and secondary research questions, while others explored the bivariate relationships of the structural variables in the dataset.

So far we have seen exploratory plots about the primary and secondary research questions and have an exploratory analysis story. Now, it is time to enrich our data-driven story from the graphs by inserting numbers into the story. We need to identify which statistical methods would be appropriate to enrich our data-driven story with numerical insights.

We will keep the same teams as on the previous days. Each team will continue working on the research question and variables assigned to it on Day 2. Please open Exercise Session2.ipynb to refresh your memory of the primary and secondary statistical hypotheses for the Copper study. Each team member must answer the questions and upload their Jupyter Notebook to Canvas.


### **Today's Learning Goal:**

- Determine which statistical concepts discussed today can add scientific rigor to answering the primary and secondary statistical hypotheses of the Copper study.

- Assess whether the t-test or paired t-test is appropriate for handling the complexities in the Copper dataset and the primary and secondary statistical hypotheses. Will it enrich our story from Day 5?

- If either the t-test or the paired t-test is appropriate, assess whether the dataset violates the assumptions of that test.
     
- Consider whether a data transformation, e.g., aggregation, would make the t-test applicable. If so, explain how the data can be aggregated and at which level (e.g., at the stanchion, vehicle, or city level).

- Make sure you upload your Jupyter Notebook at the end of the day.

### **Tasks for all teams**


Continue working within your team to complete the following tasks:

-  Decide whether a t-test or paired t-test is appropriate for answering your research question.

- Justify your decision based on the research question of interest and data structures.

- Discuss which assumptions may be violated and how useful and trustworthy the results would be.

#### **Assumptions of Two-Sample t-test:**

At1. Compares two uncorrelated samples (comparison groups).

At2. The outcome variable of interest is normally distributed within each comparison group.

At3. The outcome variable of interest has equal variance within each of the two comparison groups (the test is less sensitive to unequal variances when the two groups have equal sample sizes).

At4. Independent observations: Both samples (comparison groups) are simple random samples from their respective populations and are independent of each other.
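To make these assumptions concrete, here is a minimal sketch of a two-sample t-test in R. It uses simulated data only (the data frame `sim` and its columns `group` and `y` are hypothetical and are not part of the Copper dataset). `var.equal = TRUE` corresponds to assumption At3; leaving it out gives Welch's test, which is R's default and does not assume equal variances.

```r
# Simulated data only: two independent groups, NOT the Copper study data.
set.seed(1)
sim <- data.frame(
  group = rep(c("A", "B"), each = 30),
  y     = c(rnorm(30, mean = 3.0, sd = 1), rnorm(30, mean = 2.5, sd = 1))
)

# Student's two-sample t-test: var.equal = TRUE encodes assumption At3.
student <- t.test(y ~ group, data = sim, var.equal = TRUE)

# Welch's t-test (the default in R) drops the equal-variance assumption.
welch <- t.test(y ~ group, data = sim)

# Compare degrees of freedom and p-values of the two versions.
c(student_df = unname(student$parameter), welch_df = unname(welch$parameter))
c(student_p = student$p.value, welch_p = welch$p.value)
```

Neither version protects against a violation of At4: both treat every row as an independent observation.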

#### **Assumptions of Paired t-test:**

Apt1. Compares two paired (dependent) samples (comparison groups).

Apt2. The difference in the outcome variable of interest between the paired groups is normally distributed.

Apt3. There are no significant outliers in the difference in the outcome variable of interest between the two groups.

Apt4. Independent observations: The pairs are a simple random sample from their population and are independent of one another.
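As a rough illustration of why the paired test works on differences, the sketch below simulates paired data (all names such as `pair_effect`, `control`, and `treated` are hypothetical, and the numbers are made up). It shows that a paired t-test is the same as a one-sample t-test on the within-pair differences, which is why Apt2 and Apt3 are stated in terms of the differences.

```r
# Simulated paired data only, NOT the Copper study data.
set.seed(2)
n_pairs     <- 25
pair_effect <- rnorm(n_pairs, mean = 3, sd = 1)         # shared by both members of a pair
control     <- pair_effect + rnorm(n_pairs, sd = 0.3)
treated     <- pair_effect - 0.5 + rnorm(n_pairs, sd = 0.3)

# The two members of a pair share pair_effect, so the samples are correlated.
cor(control, treated)

# Paired t-test ...
paired_fit <- t.test(treated, control, paired = TRUE)

# ... is a one-sample t-test on the within-pair differences.
diff_fit <- t.test(treated - control, mu = 0)

all.equal(unname(paired_fit$statistic), unname(diff_fit$statistic))
```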

Let's load the dataset again and have a look:

In [None]:
#Run this code
library(tidyverse)
ds=read_csv("../data/CopperData.csv")
head(ds)

**Teams 1 - 4:**

On Day 3, you presented the primary research question and statistical hypothesis for the Copper Study. Today you will assess whether two-sample or paired t-tests are appropriate to answer the primary statistical hypothesis for the Copper Study. Start by assessing Assumption 1 for each of the two-sample and paired t-tests (At1 and Apt1, respectively).

**Q1.** How many comparison groups are there in the Copper Study? State the group names clearly in your answer below.

**Q2.** Are the comparison groups uncorrelated or paired?

Type your answer in the cells below:

**Answer 1.** Two: the Copper group and the Control group.

**Answer 2.** Paired: each Copper stanchion is matched to a Control stanchion within a vehicle.

**Teams 5 - 7:**

On Day 3, you presented the primary research question and statistical hypothesis for the Copper Study. Today you will assess whether two-sample or paired t-tests are appropriate to answer the primary statistical hypothesis for the Copper Study. Start by assessing Assumption 2 for each of the two-sample and paired t-tests (At2 and Apt2, respectively).

**Q1.** Is the outcome variable of interest normally distributed? (Run the cell code below.)

**Q2.** Is the difference in the outcome variable of interest normally distributed? (Run the cell code below.)

Type your answer in the cells below:

In [None]:
# Run this code for At2 in Q1
g1 = ds %>% ggplot(aes(x=logCFU)) + geom_histogram() + facet_wrap(~Group)+ labs(title="Distribution of logCFU for each group")
g1

In [None]:
# Run this code for Apt2 in Q2
library(dplyr)
#create new variable new_ID that assigns rownumber (1,2,3) for each sample within Pairing ID
ds <- ds %>%
	group_by(StanchionID, Pairing) %>%
	mutate(new_ID = row_number())
# create two data sets, one for each group
copper_sample <- ds %>% filter(Group == "Copper") %>% select(VehicleNumber, City, StanchionID, Pairing, logCFU,new_ID)
control_sample <- ds %>% filter(Group == "Control") %>% select(VehicleNumber, City, StanchionID, Pairing, logCFU,new_ID)
#join the data sets by Pairing ID and the new_ID that assigns rownumber (1,2,3) for each sample within Pairing ID
joined_ds <- inner_join(copper_sample, control_sample, by = c("VehicleNumber", "City","Pairing","new_ID"), suffix = c("_Copper", "_Control"))

# Calculate the difference of logCFU between Copper and Control
diff_dataset <- joined_ds %>%
	mutate(diff_logCFU = logCFU_Copper - logCFU_Control)
 

g2 = diff_dataset %>% ggplot(aes(x=diff_logCFU)) + geom_histogram() +labs(title="Distribution of difference logCFU between the groups")
g2

**Answer 1.** Approximately yes.

**Answer 2.** Approximately yes.
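Histograms can be hard to judge by eye, so a normal Q-Q plot is a common complement when checking At2 and Apt2. The sketch below uses simulated differences (the vector `sim_diff` is hypothetical, not the `diff_logCFU` column from the Copper data). The Shapiro-Wilk test is included only as an optional numerical check; its p-value depends strongly on sample size, so it should not replace the plots.

```r
# Simulated differences only, NOT the Copper study data.
set.seed(3)
sim_diff <- rnorm(40, mean = -1, sd = 0.8)

# Points close to the reference line suggest approximate normality.
qqnorm(sim_diff, main = "Normal Q-Q plot of simulated differences")
qqline(sim_diff)

# Optional numerical check: a small p-value suggests departure from normality.
sw <- shapiro.test(sim_diff)
sw$p.value
```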

**Teams 8 - 10:**

On Day 3, you presented the primary research question and statistical hypothesis for the Copper Study. Today you will assess whether two-sample or paired t-tests are appropriate to answer the primary statistical hypothesis for the Copper Study. Start by assessing Assumption 3 for each of the two-sample and paired t-tests (At3 and Apt3, respectively).

**Q1.** Does the outcome variable of interest have equal variance within each of the two comparison groups (related to equal sample size within each group, At3)? (Run the cell code below.)

**Q2.** Are there any significant outliers in the difference in the outcome variable of interest between the two groups (Apt3)? (Run the cell code below.)

Type your answer in the cells below:

In [None]:
# Run this code for At3 in Q1
g3 = ds %>% ggplot(aes(y=logCFU,x=Group)) + geom_boxplot() + labs(title="Boxplots of logCFU for each group, check for approximately equal variance assumption")
g3

In [None]:
# Run this code for Apt3 in Q2
library(dplyr)
#create new variable new_ID that assigns rownumber (1,2,3) for each sample within Pairing ID
ds <- ds %>%
	group_by(StanchionID, Pairing) %>%
	mutate(new_ID = row_number())
# create two data sets, one for each group
copper_sample <- ds %>% filter(Group == "Copper") %>% select(VehicleNumber, City, StanchionID, Pairing, logCFU,new_ID)
control_sample <- ds %>% filter(Group == "Control") %>% select(VehicleNumber, City, StanchionID, Pairing, logCFU,new_ID)
#join the data sets by Pairing ID and the new_ID that assigns rownumber (1,2,3) for each sample within Pairing ID
joined_ds <- inner_join(copper_sample, control_sample, by = c("VehicleNumber", "City","Pairing","new_ID"), suffix = c("_Copper", "_Control"))

# Calculate the difference of logCFU between Copper and Control
diff_dataset <- joined_ds %>%
	mutate(diff_logCFU = logCFU_Copper - logCFU_Control)
 

g5 = diff_dataset %>% ggplot(aes(x=diff_logCFU)) + geom_histogram() +labs(title="Distribution of difference logCFU between the groups")
g5

**Answer 1.** Approximately yes.


**Answer 2.** There are likely no influential outliers.
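The visual checks for At3 and Apt3 can be backed up with a few numbers. The following is a minimal sketch on simulated data (the objects `sim`, `sim_diff`, and the planted outlier are hypothetical). It compares group standard deviations, runs an F test for equal variances, and flags points beyond the boxplot whiskers (the usual 1.5 x IQR rule), which is one common convention for outliers rather than a formal definition.

```r
# Simulated data only, NOT the Copper study data.
set.seed(4)
sim <- data.frame(
  group = rep(c("Control", "Copper"), each = 30),
  y     = c(rnorm(30, mean = 3, sd = 1), rnorm(30, mean = 2, sd = 1))
)

# At3: compare the sample standard deviations of the two groups.
sds <- tapply(sim$y, sim$group, sd)
sds

# F test for equal variances (note: sensitive to non-normality).
f_fit <- var.test(y ~ group, data = sim)
f_fit$p.value

# Apt3: flag differences outside the boxplot whiskers.
sim_diff <- c(rnorm(29, mean = -1, sd = 0.5), 4)   # last value is a planted outlier
flagged  <- boxplot.stats(sim_diff)$out
flagged
```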

**Teams 11 - 13:**

On Day 3, you presented the primary research question and statistical hypothesis for the Copper Study. Today you will assess whether two-sample or paired t-tests are appropriate to answer the primary statistical hypothesis for the Copper Study. Start by assessing Assumption 4 for each of the two-sample and paired t-tests (At4 and Apt4, respectively).

**Q1.** Are the observations in both samples (comparison groups) simple random samples from their respective populations, and are the samples independent of each other (At4)?

**Q2.** Are the paired observations independent (Apt4)?

Discuss within your team and type your answers in the cells below (YES/NO, and a short explanation why YES/NO).

**Answer 1.** Yes

**Answer 2.** No, the pairs are clustered within vehicles (and vehicles are clustered within cities).

**Teams 14 - 18:**

On Day 3, you presented the secondary research question and statistical hypothesis for the Copper Study. Today you will assess whether two-sample or paired t-tests are appropriate to answer the secondary statistical hypothesis for the Copper Study. Start by assessing Assumption 1 for each of the two-sample and paired t-tests (At1 and Apt1, respectively).

**Q1.** How many comparison groups are there in the Copper Study? State the group names clearly in your answer below.

**Q2.** Are the comparison groups uncorrelated or paired?

Type your answers in the cells below:

**Answer 1.** Two groups: the Copper group and the Control group.

**Answer 2.** Paired.

#### **Bonus:**

**Q1.** Can a t-test or paired t-test help us assess the effect of the Group on `log 10 CFU` across all vehicles and cities (secondary statistical hypothesis)?

**Q2.** Can a t-test or paired t-test help us assess the effect of the Group on `log 10 CFU` across all vehicles for each city (secondary statistical hypothesis)?

**Q3.** Is it possible to transform the data so that a t-test or paired t-test is appropriate for our statistical hypotheses given the nested structures? What are the implications of such transformations of the data (e.g., would we introduce bias or lose statistical power due to information loss)?

**Q4.** Write down possible statistical hypotheses that can be tested with a t-test or paired t-test, along with the necessary data transformations and the implications for bias and information loss (hint: run the R code with plots below).


**Answer 1. (YES/NO, Why YES/NO?)**

No, because the observations are not independent (samples are clustered within stanchions, stanchions within pairs, pairs within vehicles, and vehicles within cities).


**Answer 2. (YES/NO, Why YES/NO?)**

No. We could potentially run two paired t-tests, one for each city; however, the independent-observations assumption is still violated within each city, because samples are still clustered within stanchions, stanchions within pairs, and pairs within vehicles.


**Answer 3.** 
Not exactly, given this data structure and the primary and secondary research questions. We could aggregate the repeated measures within each stanchion (take the mean of `logCFU` over the three samples for each stanchion), which would create two paired groups (Control/Copper pairs) with one value per stanchion. However, the pairs are still nested within vehicles and cities, so the paired t-test estimates would be biased. Also, by averaging `logCFU` over the samples within a stanchion we lose information: the variability in `logCFU` between samples within stanchions is discarded, which reduces statistical power (the test is less likely to produce statistically significant results).
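The aggregation described in this answer could be sketched as follows. The data are simulated: the column names mirror those used earlier in this notebook (`City`, `VehicleNumber`, `Pairing`, `Group`, `logCFU`), but the design (2 cities, 3 vehicles per city, 2 pairs per vehicle, 3 samples per stanchion) and all values are assumptions made for illustration. The sketch also assumes that `Pairing` is only unique within a vehicle, so it groups by city and vehicle as well.

```r
library(dplyr)
library(tidyr)

# Simulated nested data only, NOT the Copper study data.
set.seed(5)
sim <- expand.grid(
  Sample        = 1:3,
  Group         = c("Control", "Copper"),
  Pairing       = 1:2,
  VehicleNumber = 1:3,
  City          = c("Toronto", "Vancouver"),
  stringsAsFactors = FALSE
)
sim$logCFU <- rnorm(nrow(sim), mean = ifelse(sim$Group == "Copper", 2, 3), sd = 0.5)

# Step 1: average the repeated samples within each stanchion.
stanchion_means <- sim %>%
  group_by(City, VehicleNumber, Pairing, Group) %>%
  summarise(mean_logCFU = mean(logCFU), .groups = "drop")

# Step 2: one row per pair, with Control and Copper side by side.
paired_wide <- stanchion_means %>%
  pivot_wider(names_from = Group, values_from = mean_logCFU) %>%
  mutate(diff_logCFU = Copper - Control)

# Information loss: far fewer rows enter the test than were measured.
c(samples = nrow(sim), pairs = nrow(paired_wide))

# Paired t-test on stanchion-level means; it still ignores the nesting
# of pairs within vehicles and cities.
t.test(paired_wide$Copper, paired_wide$Control, paired = TRUE)
```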


In [None]:
# Run this code for Q4
g6 = ds %>% ggplot(aes(x=factor(Group), y=logCFU)) + geom_boxplot() + facet_wrap(~City+VehicleNumber) +labs(title="The effect of the Group on log 10 CFU by City and VehicleNumber", x="Group",y="log 10 CFU")
g6

**Answer 4:** The mean log 10 CFU for Copper products is lower by 1 (on the log 10 scale) than the mean log 10 CFU for Controls, for each vehicle in each of the Canadian cities (Vancouver and Toronto) separately. We would need to aggregate `logCFU` over all samples within each stanchion. Since a paired t-test would be run separately for each vehicle in each city, the aggregated stanchion-level `logCFU` values are now independent, so bias from violating the independence assumption is no longer an issue for this research question. However, aggregating `logCFU` to the stanchion level loses information, so these tests will have reduced statistical power (they are less likely to produce statistically significant results).
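Running one paired t-test per vehicle, as described in this answer, might look like the sketch below. It starts from hypothetical stanchion-level means that are already in paired form; the data frame `pairs_df`, the design (6 pairs per vehicle, 2 vehicles per city), and all values are simulated assumptions, not the Copper data.

```r
# Simulated stanchion-level means only, NOT the Copper study data.
set.seed(6)
pairs_df <- expand.grid(
  Pairing       = 1:6,
  VehicleNumber = 1:2,
  City          = c("Toronto", "Vancouver"),
  stringsAsFactors = FALSE
)
pairs_df$Control <- rnorm(nrow(pairs_df), mean = 3, sd = 0.5)
pairs_df$Copper  <- pairs_df$Control - 1 + rnorm(nrow(pairs_df), sd = 0.3)

# Split into one data frame per City x VehicleNumber combination.
by_vehicle <- split(pairs_df, list(pairs_df$City, pairs_df$VehicleNumber), drop = TRUE)

# Run a paired t-test within each vehicle and collect the results.
results <- do.call(rbind, lapply(names(by_vehicle), function(key) {
  d   <- by_vehicle[[key]]
  fit <- t.test(d$Copper, d$Control, paired = TRUE)
  data.frame(
    vehicle   = key,
    n_pairs   = nrow(d),
    mean_diff = unname(fit$estimate),
    p_value   = fit$p.value
  )
}))
results
```

Each test here uses only the pairs from a single vehicle, which illustrates the reduced power noted in the answer: the number of pairs per test is small.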

**Upload your work from Lecture 6 Exercise session**

**Note.** A Jupyter Notebook is acceptable for the class participation mark.
          Please make sure you save your Jupyter Notebook with your answers.

- Each student will upload the Jupyter Notebook to Canvas Course 1 (https://canvas.ubc.ca/courses/144703) using this file name:

 `[Lecture_6_Exercise_Session 6]_[TeamNumber]_[student name].ipynb`
e.g., `Lecture_6_Exercise_Session 6_Team21_Biljana_Jonoska_Stojkova.ipynb`

- Please note in the title who was responsible for writing each paragraph.

Navigate to the Assignments section on Canvas Course 1 and upload the Jupyter Notebook under:
`Class Participation\Lecture 6 -  Hypothesis testing, basic concepts, basic tests (t-test, paired t-test) and their assumptions` 

