# **Exercise Session 2**
# Developed by Biljana Jonoska Stojkova, PhD
# Revised by Johnson Chen

## **Lecture 2 - Experimental Designs and Statistical Problem Formulation**

Today, we will continue exploring the Copper study to translate the research question defined on Day 1 into a statistical problem. The statistical problem needs to be defined using variable names and should account for the structure in the data. On Day 1, you learned that this study has a very rich and complex data structure, and the structural variables in the dataset need to be included in the statistical hypotheses.

Each student will have the opportunity to practice formulating research questions. The purpose of this exercise session is to guide you through a scientific process and help you identify the basic statistical concepts behind the study, including:

- **Primary Research Question (Day 1)**
- **Study Design Type (Day 1)**
- **Limitations from the Study Design and Analysis Methods (Day 1)**
- **Statistical Hypothesis Formulation (Day 2)**
- **Sample Size Considerations (Day 2)**
- **Presenting Your Study Protocol (Day 3)**

This exercise will be conducted in three parts. You will work within your team, and each of you will have a role to play. The final part of the exercise will involve a short presentation of your findings.

## **Today's Learning Goals:**

- Use the clarified research questions and Study Design from Day 1.
- Clearly formulate the statistical hypothesis.
- Discuss sample size and effect size.



<img src="../images/ProblemFormulation-StatsMethodologie.drawio.png">

## **Refresher info**:
### **Study: One-Year Trial Evaluating the Durability and Antimicrobial Efficacy of Copper in Public Transportation Systems**

### **Introduction**

The main objective of this study is to test the effect of three Copper products after 12 months of use on public transit. Copper is known for its biocidal properties against microorganisms. During the COVID pandemic, epidemiological measures were taken to reduce the spread of transmissible diseases. A mining company that produces Copper funded this research study to assess the usability of Copper on public transit.

The application of copper (Cu) alloys to high-touch surfaces could help reduce the risk of cross-contamination; however, little is known about the durability and efficacy of engineered copper surfaces after prolonged use.

Three different commercially available Cu alloy products, ranging from 80 to 91.3% Cu content, were installed on high-touch surfaces in buses and trains (SkyTrain) in Vancouver, as well as subway cars, streetcars, and buses in Toronto, and monitored over the course of one year. The primary objective of this study was to establish the antimicrobial efficacy and durability of Cu alloy surfaces over 12 months of use in public transit vehicles located in two Canadian cities.

For more details, read the published scientific article in Nature - Scientific Reports:
[https://www.nature.com/articles/s41598-024-56225-9](https://www.nature.com/articles/s41598-024-56225-9) 

Please note that the data set used in this course is simulated; it follows the descriptions of the data in the published article.


### **Primary Research Question**

Does the use of Copper products reduce the biomass on public transit in Vancouver and Toronto after 12 months?

### **Study Design**

Stanchions (handrails on public transit) are high-touch surfaces frequently used by passengers.

In this study, copper products were randomly installed on 110 stanchions across three buses and four trains (SkyTrain) in Vancouver, and three buses, two subway cars, and two streetcars in Toronto. Each copper-coated stanchion was paired with a control stanchion placed nearby to ensure close proximity for comparison. Bacterial counts (Colony Forming Units, CFU) were measured every two months after peak morning routes. A Petrifilm Plate Reader Advanced imager was used to collect and process the microbial samples in the microbiology lab to obtain CFU numbers. Three replicate samples were taken from each stanchion for both copper-coated and control surfaces.

There is extensive literature supporting CFU as a reliable measure of biomass on high-touch surfaces in public transit, so CFU will be used as the primary outcome measure for this study.

The 12-month trial was conducted in collaboration with the Toronto Transit Commission (TTC) and Vancouver TransLink (TL). A total of 14 vehicles were used.

Microbial samples were collected one to three hours after the last passenger departed from the transit vehicle and prior to cleaning.

Although cleaning protocols were changing over the study period, microbial samples for copper and control stanchions were taken simultaneously, as the main objective was to compare the copper-coated surfaces to the control surfaces after 12 months.

Read more here: https://asda.stat.ubc.ca/Workshops/asda.stat.ubc.ca/Workshop/2024-07-VSP_Course1/ExerciseStudyDesign.html

**Teams 1 - 11:**

Each team member is assigned a role: a researcher or a statistician.

**Researcher:**

1. Pull up the document with the research question and basic study design details discussed on Day 1 (Jupyter Notebook from Day 1).

2. Write down the research questions from Day 1 in the cells below.

**Primary Research Question - Researcher**:


**Secondary Research Question - Researcher**:


**Statistician:**

1. Introduce the concept of statistical problem formulation and help the researcher formulate a data-driven problem based on the research question.

2. Use the answers from your team's researcher from Day 1 on study design to formulate the primary statistical problem.

### **Task 1, Statistician:**

- Identify the statistical problem type from today's lecture. A list of 5 statistical problem types is given here. The problem type relevant to the primary research question may be a combination of 2 statistical problem types. Type the answer in the cell below.

### **List of Most Common Statistical Problem Types:**

    A. Hypothesis test, typically if one variable affects the outcome of interest.

    B. Quantification (estimation) of the strength of associative relationships between the variables.

    C. Exploring, quantifying, and identifying which factors are affecting the outcome of interest.

    D. Quantification (estimation) of the strength of causal relationships between the variables.

    E. Predicting the outcome of interest.

#### **Task 1 Answer, Statistician:** 



Choose from the list ..

#### **Task 2, Statistician:**

Explain to the researcher why you think the chosen statistical problem type is appropriate for the primary research question.

The objective of your primary research question is to assess (or quantify) the effect of Copper on biomass reduction, so the statistical problem type is ... , based on the list of the most common statistical problem types.

#### **Task 2 Answer, Statistician:**


**Statistician:**

I will provide an example of what kind of statement this type of statistical problem will produce:

Example: The mean log 10 CFU is lower in the Copper Group compared to the Controls by d amount, across all cities and vehicles.

Explain that the estimate of d is the quantified effect of Copper on biomass on transit. Since d is an estimate, there is uncertainty around it, and this uncertainty is expressed with a confidence interval. The best practice is to report the estimate of d along with its confidence interval:

Example: The mean log 10 CFU is lower in the Copper Group compared to the Controls by d (CI95%: d - 2 x StdError, d + 2 x StdError) amount, across all cities and vehicles.
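As a rough illustration of the statement above, the sketch below computes d and its approximate 95% confidence interval using the d ± 2 x StdError rule. The data are hypothetical values simulated inside the code, not the Copper study data, and the group means, standard deviation, and sample size are assumptions chosen only for illustration. The calculation treats the two groups as independent samples and ignores the pairing, vehicle, and city structure, which a real analysis would need to account for.

```r
# Hypothetical simulated data (NOT the study data): 60 log 10 CFU values per group
set.seed(2024)
control <- rnorm(60, mean = 3.0, sd = 1.0)  # assumed Control mean and SD
copper  <- rnorm(60, mean = 2.0, sd = 1.0)  # assumed Copper mean and SD

# Estimated effect d: how much lower the Copper mean is than the Control mean
d <- mean(control) - mean(copper)

# Standard error of the difference between two independent group means
std_error <- sqrt(var(control) / length(control) + var(copper) / length(copper))

# Approximate 95% confidence interval: d +/- 2 x StdError
ci_lower <- d - 2 * std_error
ci_upper <- d + 2 * std_error

cat(sprintf("d = %.2f (CI95%%: %.2f, %.2f)\n", d, ci_lower, ci_upper))
```

A narrower interval indicates a more precise estimate of d; collecting more observations per group shrinks the standard error and therefore the interval.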

In [None]:
# Run this code to explore the effect size of interest
library(tidyverse)

ds <- read_csv("../data/CopperData.csv")

# Convert these identifier variables from double to character type
ds <- ds %>% mutate(StanchionID = as.character(StanchionID),
                    Pairing = as.character(Pairing),
                    VehicleNumber = as.character(VehicleNumber))

head(ds)

# Mean log 10 CFU in each group
group_means <- ds %>%
  group_by(Group) %>%
  summarize(mean_logCFU = mean(logCFU))

# Higher and lower group means, used to place the label between the two dashed lines
upper_mean <- max(group_means$mean_logCFU)
lower_mean <- min(group_means$mean_logCFU)

# Boxplots of log 10 CFU by group, with a dashed line at each group mean
g5 <- ds %>%
  ggplot(aes(x = factor(Group), y = logCFU, fill = Group)) +
  geom_boxplot() +
  labs(title = "The effect of the Group on log 10 CFU", x = "Group", y = "log 10 CFU") +
  geom_hline(data = group_means, aes(yintercept = mean_logCFU, color = Group),
             linetype = "dashed", linewidth = 3)  # linewidth replaces size for lines in ggplot2 >= 3.4.0

# Label the gap between the group means with the size of the difference
g5 <- g5 + annotate("label", x = 1.5, y = (upper_mean + lower_mean) / 2,
                    label = sprintf("Mean log 10 CFU difference between Copper and Control group: %.2f",
                                    upper_mean - lower_mean),
                    color = "red", size = 4)
g5

**Researcher:**

Yes! That is exactly the kind of statement I would like to make at the end of the analysis.

I would also like to confirm the statistical significance of the quantified effect of Copper on biomass reduction.

**Statistician:**

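Alright! We have not actually discussed statistical significance yet. A statistically significant result means that the observed effect would be unlikely if Copper truly had no effect on biomass. To have a good chance of obtaining such a result when a practically significant effect really exists, we need enough observations in the dataset (that is, enough statistical power).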

So I will need to ask you, what is the practically significant effect size for this study?

#### **Task, Statistician:** 

Hint. Explain to the researcher what the effect size in the context of the primary research question means. Choose the correct answer from the list below:

    A. The mean log 10 CFU in the Copper group is lower by X amount than that of the Control group, across all vehicles and across two Canadian cities (Vancouver and Toronto).

    B. The mean log 10 CFU in the Copper group is lower by X amount than that of the Control group, across all vehicles in each of the Canadian cities (Vancouver and Toronto) separately.

    C. The mean log 10 CFU in the Copper group is lower by X amount than that of the Control group, for each vehicle in each of the Canadian cities (Vancouver and Toronto) separately.

    D. The mean log 10 CFU in the Copper group is lower by X amount than that of the Control group, for each vehicle separately.

    Ask how large the hypothesized reduction X in log 10 CFU is, or rephrase it more specifically in terms of the reduction in log 10 CFU between the two groups.
 

**Answer, Statistician:** 

**Researcher:**

Ah yes! Thanks for pointing this out. We do have previous literature that shows that Copper can reduce CFU by 1 on the log 10 scale.

**Statistician:**

Ok, we can calculate the sample size to ensure we have enough statistical power to detect a reduction of 1 on the log 10 CFU scale from the Control to the Copper group. Sample size calculation is performed for the primary statistical hypothesis. Secondary and tertiary statistical hypotheses will have unknown statistical power because the sample size is not calculated for them; they are less important from a research perspective.
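To make the idea of a sample size calculation concrete, here is a minimal sketch using `power.t.test()` from base R. The reduction of 1 on the log 10 CFU scale comes from the discussion above; the standard deviation of 1, the 80% power, and the 5% two-sided Type I Error are assumptions made only for this illustration. The sketch also treats the comparison as a simple two-sample t-test, so it ignores the pairing of stanchions and the vehicle and city structure of the actual study.

```r
# Sample size sketch for detecting a 1-unit reduction in mean log 10 CFU.
# ASSUMPTIONS (not from the study): SD of 1, 80% power, two-sided 5% Type I Error,
# and a simple comparison of two independent groups.
hypothesized_reduction <- 1  # effect size suggested by the researcher
assumed_sd <- 1              # hypothetical standard deviation of log 10 CFU

result <- power.t.test(delta = hypothesized_reduction,
                       sd = assumed_sd,
                       sig.level = 0.05,
                       power = 0.80,
                       type = "two.sample",
                       alternative = "two.sided")

# Round up, because a fraction of an observation cannot be collected
n_per_group <- ceiling(result$n)
n_per_group
```

A larger assumed standard deviation or a smaller hypothesized reduction would increase the required number of observations per group.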

If we determine that we have two equally important primary statistical hypotheses, we have to adjust the Type I Error for two tests (also known as the multiple comparison problem), which will increase the sample size requirement. The implications are that the study will have to run longer and use more resources. 
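The effect of adjusting the Type I Error on the sample size can be sketched as follows. This example assumes a Bonferroni adjustment (dividing the Type I Error by the number of tests), which is one common choice and not necessarily the one used in the study. The standard deviation of 1 and the 80% power are again hypothetical values, and the design is simplified to a two-sample t-test.

```r
# Compare the sample size for one primary hypothesis with that for two
# equally important primary hypotheses.
# ASSUMPTIONS (not from the study): Bonferroni adjustment, SD of 1, 80% power.
alpha <- 0.05
n_tests <- 2
alpha_adjusted <- alpha / n_tests  # Bonferroni-adjusted Type I Error per test

n_one_test  <- power.t.test(delta = 1, sd = 1, sig.level = alpha,
                            power = 0.80)$n
n_two_tests <- power.t.test(delta = 1, sd = 1, sig.level = alpha_adjusted,
                            power = 0.80)$n

# Required observations per group, rounded up
ceiling(c(one_primary = n_one_test, two_primary = n_two_tests))
```

The stricter per-test threshold requires more observations per group to keep the same power.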

This is why we prioritize the research objectives and analyze the most important questions first to maximize the strength of the evidence from the data. We can still analyze the secondary and tertiary research questions, but the Type I Error is inflated with each new statistical test, and the strength of the evidence is diluted. Therefore, fewer analyses produce more powerful evidence from the data.
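The inflation of the Type I Error can be illustrated with a short calculation. This sketch assumes that each test is run at a 5% Type I Error and that the tests are independent, which is a simplification; under those assumptions the chance of at least one false positive among k tests is 1 - (1 - 0.05)^k.

```r
# Chance of at least one false positive across several tests.
# ASSUMPTIONS: each test uses a 5% Type I Error and the tests are independent.
alpha <- 0.05
n_tests <- 1:5
fwer <- 1 - (1 - alpha)^n_tests  # family-wise error rate for 1 to 5 tests

data.frame(n_tests = n_tests, family_wise_error = round(fwer, 3))
```

With a single test the error stays at 5%, and it grows with every additional test.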

#### **Task, Statistician:**

Write a clear statistical problem formulation to include the hypothesized effect size for each research question.

**Hint:** Following the discussion with the researcher, choose which of the statistical hypotheses below is primary and which is secondary. Write your answers in the cells below.

    A. The mean `log 10 CFU` for Copper products is lower by 1 (on the log 10 scale) than the mean `log 10 CFU` for Controls, across all public transit vehicles and two Canadian cities (Vancouver and Toronto).

    B. The mean `log 10 CFU` for Copper products is lower by 1 (on the log 10 scale) than the mean `log 10 CFU` for Controls, across all vehicles in each of the Canadian cities (Vancouver and Toronto) separately.

    C. The mean `log 10 CFU` for Copper products is lower by 1 (on the log 10 scale) than the mean `log 10 CFU` for Controls, for each vehicle in each of the Canadian cities (Vancouver and Toronto) separately.

    D. The mean `log 10 CFU` for Copper products is lower by 1 (on the log 10 scale) than the mean `log 10 CFU` for Controls, for each vehicle separately.
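Statements of this kind can also be written in symbols. The notation below is one possible sketch and is not taken from the study: $\mu_{\text{Control}}$ and $\mu_{\text{Copper}}$ are assumed names for the mean `log 10 CFU` in each group, and $\Delta$ is an assumed name for their difference. The population over which the means are taken (all vehicles and both cities, each city, or each vehicle) is what distinguishes options A to D.

```latex
\begin{aligned}
\Delta &= \mu_{\text{Control}} - \mu_{\text{Copper}} \\
H_0 &: \Delta = 0 \\
H_1 &: \Delta > 0, \quad \text{with hypothesized effect size } \Delta = 1
\end{aligned}
```

A two-sided alternative, $H_1: \Delta \neq 0$, is also commonly used; the choice should be agreed with the researcher before the analysis.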

**Primary Statistical hypothesis, Statistician:** 

**Secondary Statistical hypothesis, Statistician:** 

**Upload your work from Lecture 2 Exercise session**

- Each student will upload the Jupyter Notebook to Canvas Course 1 (https://canvas.ubc.ca/courses/144703), named as follows:

 `[Lecture_1_Exercise_Session 2]_[TeamNumber]_[student name].ipynb`
e.g., `Lecture_1_Exercise_Session 2_Team21_Biljana_Jonoska_Stojkova.ipynb`

- Please indicate in the title who was responsible for writing each paragraph.

Navigate to the Assignments section on Canvas Course 1, and upload the Jupyter Notebook under:
`Class Participation\Lecture 2 - Experimental designs and statistical problem formulation` 

