# Comparing counts between 2x2 categories (unrelated subjects)

We will explore how to compare counts across categories measured on unrelated (independent) subjects, to see whether two categorical variables are independent or associated

There are 2 approaches for comparing counts
- `Fisher's test`: exact calculation of permutations (usually for small counts)
- `Chi-squared test`: approximation using the chi-squared distribution (larger counts)

We will cover the first approach (`Fisher's test`) that uses the exact calculation of permutations for 2 categories in a 2x2 table
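As a rough illustration of the two approaches, the sketch below runs both tests on a small hypothetical table (the counts are made up and are not the ICU data). It uses the base R functions `fisher.test` and `chisq.test`; with counts this small, `chisq.test` warns that its approximation may be inaccurate, which is why the exact test is usually preferred in this situation.

```r
# Hypothetical 2x2 table (not the ICU data): rows = exposure, columns = outcome
toy <- matrix(c(3, 1,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Exposure = c("yes", "no"),
                              Outcome  = c("yes", "no")))

# Exact test: sums hypergeometric probabilities
p_exact <- fisher.test(toy)$p.value

# Approximate test: R warns here because the expected counts are small
p_approx <- suppressWarnings(chisq.test(toy)$p.value)

c(fisher = p_exact, chi_squared = p_approx)
```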

![test_count_fisher_death.png](images/test_count_fisher_death.png)

---
## Data preparation

We will be using the dataset from https://vincentarelbundock.github.io/Rdatasets/doc/Stat2Data/ICU.html

The dataset includes 200 observations from ICU patients with the following variables
- `Age`
- `AgeGroup`: 1= young (under 50), 2= middle (50-69), 3 = old (70+)
- `Sex`: 1=female or 0=male
- `Emergency`: 1=emergency admission or 0=elective admission
- `SysBP`: Systolic blood pressure (mmHg)
- `PulseRate`: Heart rate (beats per minute)
- `Infection`: 1=infection suspected or 0=no infection
- `Death`: 1=patient died or 0=patient survived to discharge


In [1]:
library(tidyverse)

data <- read_csv("https://raw.githubusercontent.com/kennethban/dataset/main/icu.csv")

data <- data %>% mutate(...1 = NULL,
                        Survive=as.factor(Survive),
                        Sex=as.factor(Sex),
                        Infection=as.factor(Infection),
                        Emergency=as.factor(Emergency)) %>%
                 mutate(Death = case_when(Survive == "0" ~ "1",
                                          Survive == "1" ~ "0"),
                        Death = as.factor(Death),
                        Survive = NULL) %>%

                 # change order of levels to format table
                 mutate(Infection=fct_relevel(Infection, "1", "0"),
                        Death=fct_relevel(Death, "1", "0"))

head(data)

As an example, we would like to see if the infection status is associated with the death of ICU patients

We start by selecting the 2 variables of interest (`Infection`, `Death`) and generating a table using the `table` function. We will define a `print_table` function to print a nicer table

In [2]:
# cross-table convenience function
print_table <- function(input, margin=F) {
    
    if (margin == T) { input <- addmargins(input)}
    
    input <- htmlTable::txtRound(input,1)

    input %>% 
    htmlTable::htmlTable(css.rgroup = "font-weight: 900; text-align: left;") %>%
    IRdisplay::display_html() 
    
}

We can use the `table` function to create a 2x2 table
- Note that the order of the levels was changed using `fct_relevel` to ensure that the association of interest (presence of `Infection` and `Death`) is in the upper left cell of the table
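To see why the level order matters, here is a minimal sketch using made-up 0/1 vectors (not the ICU data). It uses base R `factor(levels = ...)` as a stand-in for `fct_relevel`, which should give the same level order in this case.

```r
# Hypothetical 0/1 vectors (not the ICU data)
infection <- c("1", "0", "0", "1", "0", "1")
death     <- c("1", "0", "0", "0", "1", "1")

# Default factor order is alphabetical: "0" comes before "1"
table(Infection = factor(infection), Death = factor(death))

# Setting the levels explicitly puts "1" first, as fct_relevel(x, "1", "0") does
infection_f <- factor(infection, levels = c("1", "0"))
death_f     <- factor(death,     levels = c("1", "0"))
tab <- table(Infection = infection_f, Death = death_f)
tab
```

With the releveled factors, the upper left cell holds the subjects with both infection and death.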

In [3]:
# print 2 x 2 contingency table for Infection and Death categories
data_table <- data %>% 
              select(Infection, Death) %>%
              table

data_table %>% print_table

We can plot it as a barchart to look at the proportions of ICU patients who survived/died in the 2 groups (with suspected infection and without)
- `geom_bar`: use `position="fill"` option

In [4]:
# set plot dimensions
options(repr.plot.width=6, repr.plot.height=8)

# stacked proportional chart for Infection and Death categories
data %>% 
ggplot(aes(x=Infection,fill=Death)) +
  geom_bar(position="fill") +
  theme_grey(base_size=20)

We notice that the proportion of patients who died is higher in the group of patients with suspected infection. However, we do not know if this observation could have occurred by chance
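The proportions shown in the barchart can also be read off numerically. The sketch below uses the base R function `prop.table` on a hypothetical table (made-up counts, not the ICU data) to get row-wise proportions.

```r
# Hypothetical counts (not the ICU data)
toy <- matrix(c(12, 28,
                 6, 54),
              nrow = 2, byrow = TRUE,
              dimnames = list(Infection = c("1", "0"),
                              Death     = c("1", "0")))

# margin = 1 divides each row by its row total
row_prop <- prop.table(toy, margin = 1)
round(row_prop, 2)
```

Each row sums to 1, so the first column gives the proportion who died within each exposure group.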

---
## Exact permutation approach

In this approach, we will 
- examine all the possible permutations of values in the table that can occur by chance (i.e. no association between 2 categories) and calculate the probabilities of each permutation
- calculate the probability of observing values in permuted data (null distribution with no association) that are equal or more extreme than those observed in the data

A small probability value would suggest that the observed data is less likely to have arisen from the null distribution 
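One way to build intuition for the null distribution is to shuffle the outcome labels at random and count how often each table arises. The sketch below does this for a small made-up dataset (not the ICU data). Note that this is a Monte Carlo approximation, not the exact enumeration used in this notebook; the simulated frequencies should land close to the exact hypergeometric probabilities from `dhyper`.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical data (not the ICU data): 8 subjects, 4 exposed, 4 with the outcome
exposure <- rep(c(1, 0), each = 4)
outcome  <- c(1, 1, 1, 0, 1, 0, 0, 0)

# Shuffle the outcome labels many times and count exposed subjects with the outcome
n_sim <- 20000
sim_a <- replicate(n_sim, sum(sample(outcome)[exposure == 1]))

# Simulated frequencies of each count vs. the exact hypergeometric probabilities
sim_freq <- as.numeric(table(factor(sim_a, levels = 0:4))) / n_sim
exact    <- dhyper(0:4, m = 4, n = 4, k = 4)
round(rbind(simulated = sim_freq, exact = exact), 3)
```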

In [5]:
data_table %>% 
print_table

In a contingency table, the counts of 2 different variables are presented in a 2 x 2 table. 

![test_count_fisher_table_small_label.png](images/test_count_fisher_table_small_label.png)

By convention:
- The independent variable (_exposure_) is specified _row-wise_, with the baseline values (no exposure) in the 2nd row (```c,d```)
- The dependent variable (_outcome_) is specified _column-wise_, the outcome values that may be associated with the exposure in the 1st column (```a,c```)

### 1. Calculate the probability of observing the exposure/outcome

![test_count_fisher_prob.png](images/test_count_fisher_prob.png)

The counts in the 2x2 table follow a hypergeometric distribution (given fixed row and column totals) and we can use the `dhyper` function to calculate the probability of observing counts in the table
- In this case, we are interested in `a` for a potential association between exposure and outcome
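For reference, the hypergeometric probability can be written with binomial coefficients as `choose(a+b, a) * choose(c+d, c) / choose(n, a+c)`, where `n` is the grand total. The sketch below checks this against `dhyper` using made-up counts (not the ICU data).

```r
# Hypothetical counts (not the ICU data)
a <- 3; b <- 1; c <- 1; d <- 3
n <- a + b + c + d

# Closed-form hypergeometric probability written with binomial coefficients
p_formula <- choose(a + b, a) * choose(c + d, c) / choose(n, a + c)

# Same quantity from dhyper(x, m, n, k):
#   x = cell a, m = first row total, n = second row total, k = first column total
p_dhyper <- dhyper(a, a + b, c + d, a + c)

c(formula = p_formula, dhyper = p_dhyper)
```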

In [6]:
data_table %>% 
print_table(margin = T)

In [7]:
# probability of observing count in cell a (association between exposure/outcome)
a <- data_table[1,1]
b <- data_table[1,2]
c <- data_table[2,1]
d <- data_table[2,2]

dhyper(a,a+b,c+d,a+c)

Here, we obtain the probability of observing exactly 24 patients who died in the group that had a suspected infection, given the row and column totals and assuming no association between infection and death

### 2. Generate a null distribution by permutation and calculate the p-value

We have to consider all the tables in which the cell counts vary while both the _row_ and _column_ margins stay fixed
- To do this, we calculate the probability of each possible count
- We then sum the probabilities of all counts whose probability is equal to or smaller than that of the count we observed, and this sum is the _p-value_

![test_count_fisher_sum.png](images/test_count_fisher_sum.png)

First, we generate the null distribution by calculating the probability of each possible value of cell `a`, i.e. patients who are exposed (Infection) and died
- The minimum number who died with infection is 0
- The maximum number who died with infection is `a+c` (all deaths occur in the infected group), provided the infected group is at least that large
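In general, the fixed margins can restrict the possible values of `a` further: the lower bound is `max(0, (a+c) - (c+d))` and the upper bound is `min(a+b, a+c)`. The sketch below uses made-up margins (not the ICU data) where the lower bound is above 0. Scanning from 0 to `a+c` is still harmless, because `dhyper` returns 0 for impossible counts.

```r
# Hypothetical margins (not the ICU data) where the lower bound is not 0
row1 <- 6   # a + b
row2 <- 2   # c + d
col1 <- 5   # a + c

a_min <- max(0, col1 - row2)   # cannot leave more than row2 for cell c
a_max <- min(row1, col1)       # cannot exceed either margin

# Probabilities over 0..col1; impossible counts get probability 0
probs <- dhyper(0:col1, row1, row2, col1)
data.frame(count_a = 0:col1, prob_a = round(probs, 4))
```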

In [8]:
# we will choose to vary the count for infected patients in the death group
a <- data_table[1,1]
b <- data_table[1,2]
c <- data_table[2,1]
d <- data_table[2,2]

range_a <- seq(0,a+c) # range of a from 0 to a+c

prob_df <- tibble(count_a = range_a,
                  prob_a = dhyper(range_a, a+b, c+d, a+c))

head(prob_df)
tail(prob_df)

We can check which of these probability values in the null distribution are as extreme or more extreme (equal to or smaller) than the probability of the observed data

In [9]:
prob_obs_a <- dhyper(a, a+b, c+d, a+c)

prob_df <- prob_df %>%
           mutate(smaller = ifelse(prob_a <= prob_obs_a,"Y","N") %>%
                            fct_relevel("Y","N"))

head(prob_df)

We can visualize the null distribution and highlight the counts whose probability is equal to or smaller than that of the count observed in our table

In [10]:
# set plot dimension
options(repr.plot.width=8, repr.plot.height=8)

# plot distribution and highlight regions with extreme probabilities
prob_df %>% 
ggplot(aes(x=count_a,y=prob_a, fill=smaller)) + 
  geom_bar(stat = "identity") +
  labs(fill="Prob <= Observed") +
  theme_grey(base_size=16) +
  theme(legend.position="top")

Finally, we can calculate the p-value by summing the probabilities of all counts that are as extreme or more extreme than the observed value of `a`

In [11]:
# calculate sum of probabilities (p-value) of values that are equal or smaller than observed

prob_df %>% 
filter(smaller == "Y") %>%
pull(prob_a) %>%
sum

The p-value is small, which suggests that the observed data are unlikely under the null hypothesis of no association. Thus, the association between the exposure and the outcome is unlikely to be due to random chance alone

**Using a function**

We can also use the ```fisher_test``` function from ```rstatix```

In [12]:
data_table %>% 
rstatix::fisher_test(detailed = T)
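As a cross-check, the manual sum of hypergeometric probabilities should agree with the two-sided p-value from the base R function `fisher.test`. The sketch below shows this on made-up counts (not the ICU data); the small relative tolerance is an added safeguard so that tied probabilities are not dropped because of floating-point error.

```r
# Hypothetical counts (not the ICU data)
toy <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE)
a <- toy[1, 1]; b <- toy[1, 2]; c <- toy[2, 1]; d <- toy[2, 2]

# Manual two-sided p-value: sum probabilities no larger than the observed one
counts <- 0:(a + c)
probs  <- dhyper(counts, a + b, c + d, a + c)
p_obs  <- dhyper(a, a + b, c + d, a + c)

# Small relative tolerance so ties are not lost to floating-point error
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

# Base R equivalent
p_base <- fisher.test(toy)$p.value

c(manual = p_manual, fisher.test = p_base)
```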

---
## Effect size

We can calculate the effect size of the association between the exposure and  outcome using 2 related measures
- Odds ratio (OR)
- Relative risk (RR)

In [13]:
data_table %>% 
print_table

Here, we identify ```Infection``` as the ```independent``` variable, which may alter the ```dependent``` outcome variable ```Death```. By convention:
- The independent variable ```Infection``` is specified _row-wise_ and the baseline (no infection) is in the second row
- The dependent variable ```Death``` is specified _column-wise_ and the possible associated outcome (death) is in the first column

![test_count_fisher_table_small_label.png](images/test_count_fisher_table_small_label.png)

### 1. Odds ratio

In this case, we are interested in calculating the odds ratio for death in patients who are infected (```numerator```) vs non-infected (```denominator```). To do this, we
- calculate the odds for death in infected patients (```numerator```)
- calculate the odds for death in non-infected patients (```denominator```)
- calculate the odds ratio (```numerator/denominator```)

![test_count_odds_death.png](images/test_count_odds_death.png)

The odds ratio can be interpreted in the following way:

<table align="left">
<thead>
<tr><th>Odds ratio</th><th>Interpretation</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>Equal odds of the outcome in the exposed and unexposed groups</td></tr>
<tr><td>&lt; 1</td><td>Decreased odds of the outcome in the exposed group (numerator)</td></tr>
<tr><td>&gt; 1</td><td>Increased odds of the outcome in the exposed group (numerator)</td></tr>
</tbody>
</table>


In [14]:
data_table %>% 
print_table

In [15]:
# calculate odds of death in non-infected patients (baseline)
c <- data_table[2,1]
d <- data_table[2,2]

odds_death_in_noninfected <- c/d

# calculate odds of death in infected patients (exposed)
a <- data_table[1,1]
b <- data_table[1,2]

odds_death_in_infected <- a/b

# calculate odds ratio of death in infected vs non-infected patients
OR <- odds_death_in_infected/odds_death_in_noninfected

OR

**Using a function**

We can use the `oddsratio` function from the `epitools` library and specify the `rev="both"` option to keep to the contingency table conventions for exposure/outcome
- To get the output, we use the `$` selector for `measure`

In [16]:
epitools::oddsratio(data_table, rev="both")$measure

We can see that the odds ratio is > 1, indicating that patients with infection have higher odds of death
- The magnitude of the effect may not be intuitive
- Here, the odds of death in infected patients are about 2.5 times the odds of death in non-infected patients (a ratio of 2.5:1 = 5:2)
- Note that this is a ratio of odds: it does not mean that 5 infected patients died for every 2 non-infected patients who died, nor that the probability of death is 2.5 times higher
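The odds ratio is algebraically the same as the cross-product ratio `(a*d)/(b*c)`. The sketch below checks this on made-up counts (not the ICU data) and adds a Wald 95% interval on the log scale. The Wald interval is a common large-sample approximation chosen here for illustration; it is not necessarily the method a given package uses by default, so small differences from function output are to be expected.

```r
# Hypothetical counts (not the ICU data)
a <- 12; b <- 28   # exposed:   outcome yes / no
c <- 6;  d <- 54   # unexposed: outcome yes / no

# Ratio of odds equals the cross-product ratio
or_ratio <- (a / b) / (c / d)
or_cross <- (a * d) / (b * c)

# Wald 95% interval on the log scale (a large-sample approximation)
se_log_or <- sqrt(1/a + 1/b + 1/c + 1/d)
z         <- qnorm(0.975)
or_lower  <- exp(log(or_cross) - z * se_log_or)
or_upper  <- exp(log(or_cross) + z * se_log_or)

round(data.frame(OR = or_cross, lower = or_lower, upper = or_upper), 2)
```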

### 2. Relative risk

The odds ratio may not be intuitive as it refers to the chance in favor of an event over the chance against it. 
- For example, the odds of getting an even number in the roll of a die = 3/3 = 1

In contrast, a risk or probability refers to the number of times an event occurs over the total number of possible outcomes
- Using the example of the roll of a die, the probability of getting an even number = 3/6 = 0.5
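The two quantities are related by `odds = p / (1 - p)` and `p = odds / (1 + odds)`. A minimal sketch with the die example, plus a second case (rolling a six) where odds and probability differ more visibly:

```r
# Even number on a fair six-sided die: 3 favourable faces, 3 unfavourable
odds_even <- 3 / 3          # chance in favour over chance against
prob_even <- 3 / (3 + 3)    # favourable over all possible outcomes

# Rolling a six: 1 favourable face, 5 unfavourable
odds_six <- 1 / 5
prob_six <- 1 / 6

# Converting between the two
odds_from_prob <- prob_six / (1 - prob_six)
prob_from_odds <- odds_six / (1 + odds_six)

c(odds_even = odds_even, prob_even = prob_even,
  odds_six = odds_six, prob_six = prob_six)
```

For rare events the odds and the probability are close; for common events they diverge.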

Similarly, we can calculate the risk of an event by considering the probabilities instead of odds

![test_count_risk_death.png](images/test_count_risk_death.png)

The relative risk can be interpreted in the following way:

<table align="left">
<thead>
<tr><th>Relative risk</th><th>Interpretation</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>Equal risk of the outcome in the exposed and unexposed groups</td></tr>
<tr><td>&lt; 1</td><td>Decreased risk of the outcome in the exposed group (numerator)</td></tr>
<tr><td>&gt; 1</td><td>Increased risk of the outcome in the exposed group (numerator)</td></tr>
</tbody>
</table>

In [17]:
data_table %>% 
print_table

In [18]:
# calculate probability of death in non-infected patients (baseline)
c <- data_table[2,1]
d <- data_table[2,2]

prob_death_in_not_infected <- c/(c+d)

# calculate probability of death in infected patients (exposed)

a <- data_table[1,1]
b <- data_table[1,2]

prob_death_in_infected <- a/(a+b)

# calculate relative risk of death in infected vs non-infected patients
RR <- prob_death_in_infected/prob_death_in_not_infected

RR

The relative risk indicates that the risk of death in patients with infection is
- ~2 x the risk in those without infection
- i.e. ~100% higher than in those without infection
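The conversion from a relative risk to a percentage change is `(RR - 1) * 100`. A small sketch with made-up counts (not the ICU data), where the relative risk works out to 3, i.e. a 200% higher risk:

```r
# Hypothetical counts (not the ICU data)
a <- 12; b <- 28   # exposed:   outcome yes / no
c <- 6;  d <- 54   # unexposed: outcome yes / no

risk_exposed   <- a / (a + b)   # 12 of 40
risk_unexposed <- c / (c + d)   # 6 of 60

rr <- risk_exposed / risk_unexposed

# Percentage change in risk relative to the unexposed group
pct_change <- (rr - 1) * 100

round(data.frame(RR = rr, percent_change = pct_change), 1)
```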

**Using a function**

We can use the `riskratio` function from the `epitools` library and specify the `rev="both"` option to keep to the contingency table conventions for exposure/outcome 
- To get the output, we use the `$` selector for `measure`

In [19]:
epitools::riskratio(data_table, rev = "both")$measure

### Caution

Note that the relative risk is appropriate only when the population at risk is _known_
- For example, the probability of death in infected patients can be calculated because we know the number of patients who have an infection and are at risk of dying
- It is best suited for prospective cohort data where all patients are enrolled and tracked, hence the population at risk is **known**
- In retrospective (e.g. case-control) studies, subjects are often selected by outcome, so the population at risk is usually **not** known and the risk cannot be estimated directly; the odds ratio is used instead

For details see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640017/
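To illustrate the caution above, the sketch below takes a made-up cohort (not the ICU data) and then mimics outcome-based sampling by keeping every case but only half of the non-cases in each exposure group. The sampling fraction is an arbitrary assumption for the example. The odds ratio is unchanged, while the relative risk computed from the sample no longer matches the cohort value.

```r
# Hypothetical full cohort (not the ICU data)
a <- 12; b <- 28   # exposed:   outcome yes / no
c <- 6;  d <- 54   # unexposed: outcome yes / no

or_cohort <- (a * d) / (b * c)
rr_cohort <- (a / (a + b)) / (c / (c + d))

# Case-control style sample: keep every case, but only 1 in 2 non-cases
b_s <- b / 2
d_s <- d / 2

or_sample <- (a * d_s) / (b_s * c)
rr_sample <- (a / (a + b_s)) / (c / (c + d_s))

round(data.frame(design = c("cohort", "sampled"),
                 OR = c(or_cohort, or_sample),
                 RR = c(rr_cohort, rr_sample)), 2)
```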

---
## Exercise - Comparing counts 2x2 categories

For this exercise, we will use the same ICU dataset. The dataset includes 200 observations from ICU patients with the following variables
- `Age`
- `AgeGroup`: 1= young (under 50), 2= middle (50-69), 3 = old (70+)
- `Sex`: 1=female or 0=male
- `Emergency`: 1=emergency admission or 0=elective admission
- `SysBP`: Systolic blood pressure (mmHg)
- `PulseRate`: Heart rate (beats per minute)
- `Infection`: 1=infection suspected or 0=no infection
- `Death`: 1=patient died or 0=patient survived to discharge

In [None]:
library(tidyverse)

data <- read_csv("https://raw.githubusercontent.com/kennethban/dataset/main/icu.csv")

data <- data %>% mutate(...1 = NULL,
                        Survive=as.factor(Survive),
                        Sex=as.factor(Sex),
                        Infection=as.factor(Infection),
                        Emergency=as.factor(Emergency)) %>%
                 mutate(Death = case_when(Survive == "0" ~ "1",
                                          Survive == "1" ~ "0"),
                        Death = as.factor(Death),
                        Survive = NULL)
 
head(data)

We will create a local `print_table` function to format the crosstables

In [None]:
# cross-table convenience function
print_table <- function(input, margin=F) {
    
    # add margins while the table is still numeric, then round for display
    if (margin == T) { input <- addmargins(input)}
    input <- htmlTable::txtRound(input,1)

    input %>% 
    htmlTable::htmlTable(css.rgroup = "font-weight: 900; text-align: left;") %>%
    IRdisplay::display_html() 
    
}

### Part 1

Create a contingency table of counts with the `Emergency` and `Death` variables using the `table` function and store it in `table_ex`
- Ensure the exposure and outcome are ordered in the conventional manner (Hint: use `fct_relevel` if necessary)

![test_count_fisher_table_small_label.png](images/test_count_fisher_table_small_label.png)

Print out the table using the local `print_table` function

In [None]:
# start here

In [None]:
# solution

table_ex <- data %>% 
            select(Emergency, Death) %>%
            mutate(Emergency=fct_relevel(Emergency, "1", "0"))  %>%
            mutate(Death=fct_relevel(Death, "1", "0"))  %>%
            table

table_ex %>% print_table

### Part 2

Plot a proportional barchart with `Emergency` and `Death` variables using `ggplot` from `tidyverse`
- `x`: Independent variable (exposure)
- `fill`: Dependent variable (outcome)

In [None]:
library(tidyverse)

# start here

In [None]:
# solution

# set plot dimension
options(repr.plot.width=8, repr.plot.height=8)

data %>% 
ggplot(aes(x=Emergency,fill=Death)) +
  geom_bar(position="fill") +
  theme_grey(base_size=20)

### Part 3

Use `table_ex` to calculate the p-value using the `fisher_test` function from the `rstatix` library

In [None]:
# start here

In [None]:
# solution

table_ex %>% 
rstatix::fisher_test(detailed=T)

### Part 4

Use `table_ex` to calculate the odds ratio of death if the patient had an emergency admission using the `oddsratio` function from the `epitools` library
- Note the order of the values in the table, which determines the reference

In [None]:
# start here

In [None]:
# solution

epitools::oddsratio(table_ex, rev = "both")$measure