# Practical: Hypothesis testing and CIs

In this practical you will

- Interpret p-values alone and in conjunction with estimates and 95% confidence intervals
- Apply a two-sample t-test to compare two population means
- Apply the chi-squared test to compare two population proportions

## The two-sample t-test

**Question 1.** 

i. Re-open the mother-baby dataset. Use the code below to perform a two-sample t-test. 

ii. What is the null hypothesis here?

iii. What is the p-value? Interpret the p-value.

iv. This output gives a 95% confidence interval and a p-value. Given the p-value alone, without looking at the 95% confidence interval, what can we say about the interval, using the connection between p-values and confidence intervals?

v. Add the option "var.equal=TRUE" to the command. What does this do? (You may want to look at the help file: ?t.test).

In [1]:
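# Assumes the mother-baby data have already been loaded into a data frame called `baby`
# With data = baby, the formula can refer to the columns by name
t.test(Birth.Weight ~ Maternal.Smoker, data = baby, conf.level = 0.95)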

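To get a feel for what the `var.equal` option changes, the following is a minimal sketch on simulated data. The data frame `sim` and its columns `weight` and `smoker` are made up for illustration and are not the mother-baby dataset; only `t.test` and its `var.equal` argument come from the practical.

```r
set.seed(1)

# Simulated stand-in data (NOT the mother-baby dataset)
sim <- data.frame(
  weight = c(rnorm(40, mean = 123, sd = 17), rnorm(25, mean = 114, sd = 18)),
  smoker = rep(c("No", "Yes"), times = c(40, 25))
)

# Default: Welch's t-test, which does not assume equal variances
welch <- t.test(weight ~ smoker, data = sim)

# var.equal = TRUE: classical two-sample t-test with a pooled variance
pooled <- t.test(weight ~ smoker, data = sim, var.equal = TRUE)

welch$method      # name of the test that was run
pooled$method
welch$parameter   # degrees of freedom (usually not a whole number)
pooled$parameter  # degrees of freedom: n1 + n2 - 2
```

Comparing the two sets of degrees of freedom, confidence intervals and p-values shows how much (or how little) the equal-variance assumption matters for a given dataset.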


## Interpreting confidence intervals and p-values

**Question 2.**  

This question is loosely based on the following article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5640476/. However, please note that for pedagogic purposes (i.e. teaching) we have made up the study results, so the numbers below do not contain useful information about sepsis scores!

Sepsis is a costly and often fatal medical condition which can affect hospitalized patients. To prevent death, it is vital to identify sepsis cases early. There are various scores intended to measure suspicion of sepsis. These can be dichotomised by choosing cut-offs to classify patients as suspected sepsis or not. We will consider the systemic inflammatory response syndrome (SIRS) criteria to be the reference (comparison) score, and explore whether other scores do better in terms of the proportion of sepsis cases correctly identified (the sensitivity).

We will consider the following three scores:

- National and Modified Early Warning Score (NEWS), a very simple-to-apply early warning score
- quick Sepsis-related Organ Failure Assessment (qSOFA) score, also very simple-to-apply
- eCART score, derived from a random forest algorithm including vital signs, laboratory values, and patient demographics

Two studies were undertaken to compare NEWS to the reference score SIRS (studies 1 and 2). Two studies were undertaken to compare qSOFA to the reference score SIRS (studies 3 and 4). One study was undertaken to compare eCART to SIRS (study 5).

To make a sufficient clinical difference to justify changing systems, a new system would need to increase the sensitivity by at least 5 percentage points.

The (made-up) results are shown in the table below. Note that we have rounded the numbers a little for ease of reading.


| Study  | System | Cost |  Sample size |  Increase in sensitivity (%) | SE | 95% CI | p-value |
|---- | ---- | :-: | :-: | :-: | :-: | :-: | :-: |
| 1 | NEWS | Cheap | 20 | 15 | 15.3 | (-15, 45) | 0.33 |
| 2 | NEWS | Cheap | 2,000 | 15 | 1.53 | (12, 18) | <0.001 |
| 3 | qSOFA | Cheap | 15 | 10 | 18.4 | (-26, 46) | 0.59 |
| 4 | qSOFA | Cheap | 15,000 | 0.3 | 0.57 | (-0.8, 1.4) | 0.59 |
| 5 | eCART | Expensive | 20,000 | 3 | 0.5 | (2, 4) | 0.02 |
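The confidence interval and p-value in each row are both built from the same point estimate and standard error. As a rough sketch, assuming a normal approximation (estimate ± 1.96 × SE, and a two-sided z-test of no difference), the rows for studies 1 and 2 can be approximately reproduced as follows. The variable names are ours, and the table values are rounded, so small discrepancies are expected.

```r
# Point estimates and standard errors for studies 1 and 2, taken from the table
est <- c(study1 = 15, study2 = 15)
se  <- c(study1 = 15.3, study2 = 1.53)

z_crit <- qnorm(0.975)   # about 1.96 for a 95% confidence interval

# Approximate 95% confidence limits
lower <- est - z_crit * se
upper <- est + z_crit * se

# Two-sided p-value for the null hypothesis of no difference in sensitivity
p_value <- 2 * pnorm(-abs(est / se))

round(cbind(lower, upper, p_value), 3)
```

The two studies share the same point estimate; only the standard error differs, and that alone drives the difference in interval width and p-value.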


i. Consider study 1. This compares the NEWS score to the default score, SIRS.

- What null hypothesis is being tested? How would you interpret the p-value? 
- What does the point estimate tell us?
- What does the 95% confidence interval tell us? 

Overall, what is the conclusion of study 1?

ii. Consider study 2. This also compares the NEWS score to the default score, SIRS. 

- Compare the point estimates from studies 1 and 2. 
- What is the most important difference between studies 1 and 2? 
- Interpret the 95% confidence interval.
- Interpret the p-value.

Overall, what is the conclusion of study 2? Are the results of studies 1 and 2 consistent with one another?

iii. Consider studies 3 and 4. These have exactly the same p-values. Are the conclusions of the two studies therefore the same?

iv. Study 5 was a large study comparing the eCART score to the default score SIRS. This study resulted in a p-value of p=0.02 and a confidence interval which excludes 0. Should we recommend that large hospitals implement this system? 
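One way to organise the comparison across studies is to set each confidence interval against the 5-point threshold for clinical importance stated earlier. The sketch below is only one possible reading aid, not a formal decision rule: the labels and the data frame `ci` are our own, and the limits are copied from the table.

```r
threshold <- 5   # minimum clinically important increase in sensitivity

# Confidence limits copied from the results table
ci <- data.frame(
  study = 1:5,
  lower = c(-15, 12, -26, -0.8, 2),
  upper = c(45, 18, 46, 1.4, 4)
)

# Classify each interval relative to the threshold
ci$reading <- ifelse(
  ci$lower >= threshold, "whole interval at or above threshold",
  ifelse(ci$upper < threshold, "whole interval below threshold",
         "interval spans threshold")
)

ci
```

Note that this classification uses the interval and the clinical threshold only; it does not use the p-value at all.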

## Comparing two proportions

**Question 3.** 

i. Define low birthweight as less than 88 ounces. Create a new variable containing a binary outcome of low birthweight. 

ii. Obtain the sample estimates of the proportion of low birthweight babies born to mothers who smoke and those who do not. Which is higher? Is this expected?

In [None]:
# Code here
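As a starting point, here is a minimal sketch of the two steps on simulated data. The data frame `sim` is made up, and the coding of `Maternal.Smoker` as "False"/"True" and the new column name `low_bw` are assumptions; adapt them to how the variables appear in your copy of the mother-baby dataset.

```r
set.seed(2)

# Simulated stand-in data (NOT the mother-baby dataset)
sim <- data.frame(
  Birth.Weight    = round(rnorm(200, mean = 120, sd = 18)),
  Maternal.Smoker = sample(c("False", "True"), 200, replace = TRUE)
)

# Binary outcome: TRUE if birthweight is below 88 ounces
sim$low_bw <- sim$Birth.Weight < 88

# Proportion of low birthweight babies within each smoking group
# (the mean of a logical vector is the proportion of TRUE values)
prop_low <- tapply(sim$low_bw, sim$Maternal.Smoker, mean)
prop_low

# Counts behind the proportions
table(sim$Maternal.Smoker, sim$low_bw)
```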


iii. Use the prop.test function below to perform a chi-squared test. What is the null hypothesis being tested here? (You may want to use the help page to help you answer the question).


In [None]:
### Compare proportion of low birthweight between mothers who do and don't smoke

prop.test(c(21, 36), c(715, 459), p = NULL,
          alternative = "two.sided",
          conf.level = 0.95, correct = TRUE)
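To see why `prop.test` is described as a chi-squared test, the same counts can be arranged as a 2×2 table and passed to `chisq.test`. This is a sketch using the counts from the call above; the object names `low`, `total`, `tab`, `pt` and `cs` are ours.

```r
# Counts used in the prop.test call above
low   <- c(21, 36)     # low birthweight babies in each group
total <- c(715, 459)   # total babies in each group

pt <- prop.test(low, total, correct = TRUE)

# The same data as a 2x2 table: rows = outcome, columns = group
tab <- rbind(low = low, not_low = total - low)
cs  <- chisq.test(tab, correct = TRUE)

# The test statistic and p-value agree between the two functions
all.equal(unname(pt$statistic), unname(cs$statistic))
all.equal(pt$p.value, cs$p.value)
```

The main practical difference is that `prop.test` also reports the two sample proportions and a confidence interval for their difference.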

iv. What is the p-value? Interpret the p-value.

## Interpreting tables of results


**Question 4.** [Optional] 

i. The table below is a fairly standard set of results from an observational study. 

ii. Find the p-values in the central column for (a) Aminocaproic acid vs control, and (b) history of intravenous drug use. Interpret these p-values. What are the null hypotheses here? 

iii. Look at the estimates and confidence intervals. Do these add useful information?

<img src="OR_table.png" alt="Interpretation of p-values" width="600"/>
