# 2. Analyzing Contingency Tables
A contingency table cross-classifies counts by a row variable and a column variable.
- can use a binomial for a single row or single column with two categories
- use a multinomial to consider multiple cells at once

## Relative risk and odds ratio
- Relative risk = $\pi_i/\pi_j$ 
	- sampling distribution is highly skewed unless sample sizes are large 
- odds = $\pi_i / (1-\pi_i)$ 
	- odds of success, e.g. if odds = 3, then we expect 3 successes for each failure 
	- $\pi =$ odds/(1 + odds) 
- odds ratio = $\theta = \Large \frac{\text{odds}_1}{\text{odds}_2} = \frac{\pi_1 / (1-\pi_1)}{\pi_2 / (1 - \pi_2) }  = \frac{n_{11}n_{22}}{n_{21}n_{12}}$  
	- like RR, sampling distribution is also highly skewed unless sample size is large 
	- invariant under transposing rows/columns
	- $\theta = RR \times \large \frac{1 - \pi_2}{1 - \pi_1} \implies$ if $\pi_1, \pi_2 \approx 0$, then $\theta \approx RR$ 
- in some designs, only one of RR or OR can be estimated
	- e.g. Table 2.3, a case-control design: equal numbers of patients with and without lung cancer were recruited, and the number of smokers was then counted within each group. 
		- Because the group sizes are fixed by design, the sample has $P(\text{disease}) = P(\text{no disease}) = 0.5$, which is unrealistic for the population. 
		- Thus, we cannot estimate $P(\text{disease} \mid \text{smoking}) = P(\text{smoking} \cap \text{disease}) / P(\text{smoking})$, and so cannot estimate the RR of disease. 
		- However, we can estimate $P(\text{smoking} \mid \text{disease})$ and $P(\text{smoking} \mid \text{no disease})$, and since $OR$ is symmetric, we can use $P(X \mid Y)$ or $P(Y \mid X)$ and get the same result. 
		- From above, if we expect $P(\text{disease} \mid \text{smoking})$ and $P(\text{disease} \mid \text{no smoking})$ to both be small, then $OR \approx RR$ 
- inference for $\theta$ and $\log \theta$
	- since the sampling distribution of $\hat{\theta}$ is highly skewed, mostly use $\log \hat{\theta}$, which is closer to normal
	- thus, use CLT to find Wald CIs for $\log \theta$, then exponentiate the endpoints to get a CI for $\theta$ in untransformed space

$$
\begin{align} 
var(\log \hat{\theta}) &\approx \sum_{i,j} \frac{1}{n_{ij}}  \\
E[\log \hat{\theta}] &\approx \log \theta \\
\mathcal{I} &= \left[ \log \hat{\theta} \pm 
q_{\alpha/2} \sqrt{\sum_{i,j} \frac{1}{n_{ij}}}  \right] 
\end{align} 
$$
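
As a rough illustration of the formulas above, the following sketch computes the sample odds ratio, the relative risk, and the Wald interval for a 2x2 table. The function name `odds_ratio_summary` and the hard-coded `z = 1.96` (an approximation to the 0.975 standard normal quantile) are assumptions made for this example; the counts are the marginal defendant-by-death-penalty table from the 3-way section of these notes.

```julia
# Sketch: odds ratio, relative risk, and Wald CI for a 2x2 table of counts.
# table[i, j]: row i = group, column 1 = "success", column 2 = "failure".
# z = 1.96 approximates the 0.975 standard normal quantile (95% interval).
function odds_ratio_summary(table::AbstractMatrix{<:Real}; z::Real = 1.96)
    n11, n12, n21, n22 = table[1, 1], table[1, 2], table[2, 1], table[2, 2]
    theta = (n11 * n22) / (n12 * n21)                  # sample odds ratio
    rr = (n11 / (n11 + n12)) / (n21 / (n21 + n22))     # sample relative risk
    se = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)   # SE of log(theta)
    ci = (exp(log(theta) - z * se), exp(log(theta) + z * se))
    return (theta = theta, rr = rr, se = se, ci = ci)
end

# Rows: white / black defendant; columns: death penalty yes / no.
res = odds_ratio_summary([53 430; 15 176])
println(res)
```

The interval is built on the log scale and only exponentiated at the end, which is why it is not symmetric around $\hat{\theta}$.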

- standardized residual 
	- $p_{ij}$ = observed fraction in the $(i,j)$-th cell
	- $\pi_i$ and $\pi_j$ are the estimated marginal row and column proportions, e.g. $\pi_i =$ (sum of row i)/$n$ and $\pi_j =$ (sum of column j)/$n$ 
	- used because the raw difference (observed - expected) is hard to interpret on its own: cells with larger expected frequencies tend to show larger raw differences
$$
r_{ij} = \frac{n(p_{ij} - \pi_i \pi_j)}{\sqrt{n\pi_i\pi_j(1-\pi_i)(1-\pi_j)}} \sim N(0,1) \quad \text{(approximately, under } H_0 \text{)}
$$
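
The residual formula can be sketched in a few lines of Julia. The helper name `standardized_residuals` is hypothetical; the counts are the race-by-party table from exercise 2.17 in these notes.

```julia
# Sketch: standardized residuals for an r x c table of counts.
function standardized_residuals(counts::AbstractMatrix{<:Real})
    n = sum(counts)
    row = sum(counts; dims = 2) ./ n   # r x 1 marginal row proportions
    col = sum(counts; dims = 1) ./ n   # 1 x c marginal column proportions
    mu = n .* (row * col)              # expected counts under independence
    return (counts .- mu) ./ sqrt.(mu .* (1 .- row) .* (1 .- col))
end

# Rows: White, Black; columns: Democrat, Republican, Independent.
resid = standardized_residuals([871 821 336; 347 42 83])
display(resid)
```

Values beyond roughly 2 or 3 in absolute value point to cells that fit independence poorly.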

### Tests of independence
$$
\begin{align} 
\text{Chi-squared} \quad 
X^2 &= \sum_{i,j} \frac{(n_{ij} - \mu_{ij})^2}{\mu_{ij}} 
\rightarrow \chi^2 \\
\text{Likelihood ratio} \quad 
G^2 &= 2 \sum_{ij} n_{ij} \log \frac{n_{ij}}{\mu_{ij}} 
\rightarrow \chi^2 \\
\mu_{ij} &= n \hat{\pi}_{i,:} \hat{\pi}_{:,j} = 
n \cdot \frac{n_{i,:}}{n} \cdot \frac{n_{:,j}}{n} \\
	&= \frac{n_{i,:} \ n_{:,j}}{n} 
\end{align} 
$$
Where
- $n_{ij}$ are counts in the (i,j)-th cell
- $\mu_{ij}$ is the expected frequency of $n_{ij}$, assuming independence
- $n_{i,:}$ and $n_{:,j}$ are marginal totals of the i-th row and j-th column, resp. 
- for a $(r \times c)$ contingency table, the limiting $\chi^2$ distributions of $X^2$ and $G^2$ have $(r-1)(c-1)$ degrees of freedom (dofs)
	- since each row/column contains probabilities that sum to 1, one cell in each row/column can be solved knowing the rest 
	- under the null hypothesis of independence, each cell probability is the product of its marginal probabilities
		- ie $\pi_{ij} = \pi_{i,:} \, \pi_{:,j}$, so the marginal row and column probabilities are the only free parameters
		- since the marginal row probabilities sum to 1, there are $r-1$ non-redundant row probabilities
		- the same logic above applies to the columns as well, so there are $(r-1) + (c-1)$ total parameters when we assume independence, ie under $H_0$ 
	- under the alternative hypothesis, we only constrain the total sum to 1, so there are $rc-1$ non redundant parameters 
	- dofs = no. parameters under $H_1$ - no. parameters under $H_0$ = $rc - 1 - (r-1) - (c-1) = (r-1)(c-1)$ 

- chi-squared test of independence
	- Pearson $X^2$, see [[L15 Goodness of Fit Test for Discrete Distributions#Squared difference goes to Chi-squared with d-1 dofs|L15 Goodness of Fit Test]]
	- so-called $G^2$, ie likelihood ratio test 
- $X^2$ vs $G^2$
	- for $G^2$, contingency tables can be decomposed such that the $G^2$ for individual subtables will sum to the $G^2$ for the full table 
		- however, this is not the case for $X^2$
	- $G^2$ converges more slowly to $\chi^2$ distribution, so needs larger sample size
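
A minimal sketch of both statistics, using only Base Julia. The function name `independence_stats` is an assumption, as is the convention that a zero count contributes 0 to $G^2$. The counts are the income-by-satisfaction table from exercise 2.21 in these notes.

```julia
# Sketch: Pearson X^2, likelihood-ratio G^2, and dofs for an r x c table.
function independence_stats(counts::AbstractMatrix{<:Real})
    n = sum(counts)
    # Expected counts under independence: (row total * column total) / n.
    mu = sum(counts; dims = 2) * sum(counts; dims = 1) ./ n
    X2 = sum((counts .- mu) .^ 2 ./ mu)
    # Assumed convention: a cell with zero count contributes 0 to G^2.
    G2 = 2 * sum(c > 0 ? c * log(c / m) : 0.0 for (c, m) in zip(counts, mu))
    df = (size(counts, 1) - 1) * (size(counts, 2) - 1)
    return (X2 = X2, G2 = G2, df = df)
end

stats = independence_stats([2 4 13 3; 2 6 22 4; 0 1 15 8; 0 3 13 8])
println(stats)
```

With Distributions.jl loaded, the p-value would be `ccdf(Chisq(stats.df), stats.X2)`; that step is left out here so the sketch needs no packages.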

## Ordinal data
- test statistic $M^2 = (n-1)R^2$, where $R$ is the sample correlation coefficient between the row and column scores
	- $M^2 \sim \chi^2_1$ for large n 
- like $X^2$ and $G^2$, $M^2$ does not differentiate between response and explanatory, i.e. we can use whichever and get the same value for $M^2$ 
	- $u_i$ are row scores, $v_j$ are column scores
$$R = \frac{cov(u, v)}{\sigma_u \sigma_v}
= \frac{\sum_{i,j} (u_i - E[u])(v_j - E[v]) \cdot \hat{\pi}_{ij}}
{\sqrt{
	\left( \sum_{i} (u_i - E[u])^2 \cdot \hat{\pi}_{i,:} \right)
	\left( \sum_{j} (v_j - E[v])^2 \cdot \hat{\pi}_{:,j} \right)
}} 
$$
- assigning scores, eg midpoints of numerical categories 
- for very ==unbalanced== (ie one category has very few samples, another has many samples) data, $R$ cannot be large 
- ==ordinal tests often have greater power== bc $M^2$ follows $\chi^2_1$ whereas $X^2, G^2 \sim \chi^2_{(r-1)(c-1)}$
	- $\chi^2$ shifts right and broadens for larger dofs, so the upper tail of $\chi^2_1$ falls off more sharply than that of $\chi^2_{(r-1)(c-1)}$ 
	- ie for the same numerical value of the statistic, the p-value from $M^2$ is smaller than the p-value from $X^2$ or $G^2$ 
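
The correlation formula above can be written compactly with matrix products. This is a sketch only: the name `linear_trend_M2` is hypothetical, and the scores are the ones chosen in exercise 2.21 of these notes.

```julia
# Sketch: correlation R between row scores u and column scores v, and M^2.
function linear_trend_M2(counts::AbstractMatrix{<:Real},
                         u::AbstractVector, v::AbstractVector)
    n = sum(counts)
    p = counts ./ n                     # joint sample proportions
    p_row = vec(sum(p; dims = 2))       # marginal row proportions
    p_col = vec(sum(p; dims = 1))       # marginal column proportions
    du = u .- sum(u .* p_row)           # centred row scores
    dv = v .- sum(v .* p_col)           # centred column scores
    cov_uv = du' * p * dv               # sum_ij du_i * dv_j * p_ij
    R = cov_uv / sqrt(sum(du .^ 2 .* p_row) * sum(dv .^ 2 .* p_col))
    return (R = R, M2 = (n - 1) * R^2)
end

data = [2 4 13 3; 2 6 22 4; 0 1 15 8; 0 3 13 8]
trend = linear_trend_M2(data, [3, 10, 20, 35], [1, 3, 4, 5])
println(trend)
```

Swapping rows and columns (and the two score vectors) leaves $M^2$ unchanged, which is the symmetry noted above.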

## Fisher's exact test
- an exact frequentist method for fixed row and column marginals 
- probabilities of cell counts given by hypergeometric distribution  
	- discrete, so use mid-P value (half the probability of the observed result + the probability of more extreme results)
$$
P(n_{11}) = 
\frac{\begin{pmatrix} n_{1,:} \\ n_{11} \end{pmatrix} 
\begin{pmatrix} n_{2,:} \\ n_{:,1} - n_{11} \end{pmatrix} }
{\begin{pmatrix} n \\ n_{:,1} \end{pmatrix} } 
$$
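
A minimal sketch of the hypergeometric probability and the mid-P value, using Base Julia's `binomial`. The function names `fisher_pmf` and `fisher_upper` are assumptions, and only the upper tail is handled. The margins are those of exercise 2.23 in these notes.

```julia
# Hypergeometric probability of n11 given the fixed margins of a 2x2 table.
# row1, row2: row totals; col1: total of the first column.
# Integer binomials are fine for small tables but can overflow for large ones.
fisher_pmf(n11, row1, row2, col1) =
    binomial(row1, n11) * binomial(row2, col1 - n11) / binomial(row1 + row2, col1)

# Upper-tail p-value and its mid-P variant (half weight on the observed table).
function fisher_upper(n11, row1, row2, col1)
    hi = min(row1, col1)   # largest feasible value of n11
    tail = sum(fisher_pmf(k, row1, row2, col1) for k in n11:hi)
    midp = tail - fisher_pmf(n11, row1, row2, col1) / 2
    return (p = tail, midp = midp)
end

# Surgery row total 23, radiation row total 18, 36 controlled cases, n11 = 21.
println(fisher_pmf(21, 23, 18, 36))
println(fisher_upper(21, 23, 18, 36))
```

The mid-P value is always smaller than the ordinary tail probability, which offsets some of the conservativeness caused by discreteness.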

### Association in 3-way tables
| Victim | Defendant | Death Penalty | No Death Penalty |
| ------ | --------- | ------------- | ---------------- |
| White  | White     | 53            | 414              |
|        | Black     | 11            | 37               |
| Black  | White     | 0             | 16               |
|        | Black     | 4             | 139              |
| Total  | White     | 53            | 430              |
|        | Black     | 15            | 176                 |

Marginal odds ratios
- victim x defendant = $\frac{(53+414)(4+139)}{(11+37)(0+16)} = 87$ , ie sum over death penalty yes/no
	- "odds that a white defendant had white victims is ==87x== the odds that a black defendant had white victims", ie $\frac{467/16}{48/143}$ 
-  defendant x death penalty = $\frac{53 \cdot 176}{15 \cdot 430} = 1.45$
	- "odds that a white defendant receives the death penalty is ==45%==  higher than the odds of a black defendant receiving the death penalty"
Conditional odds ratios
- (defendant x death penalty | victim = white) = $\frac{53 \cdot 37}{11 \cdot 414}=0.43$
	- "given white victim, odds that a white defendant receives the death penalty is 43% that of a black defendant"
- (victim x defendant | death penalty = no) = $\frac{414\cdot 139}{37 \cdot 16} = 97$ 
	-   "given no death penalty, odds that a white defendant had white victim is 97x that of a black defendant"

Conditional independence vs marginal independence
- associations can change when we marginalize/condition on different variables
- ==homogeneous association== => all conditional odds ratios are equal, $\theta_1 = \theta_2 = \cdots$
	- a special case is ==conditional independence== => all conditional odds ratios = 1 
- conditional independence does not imply marginal independence
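
The marginal and conditional odds ratios for the death penalty table can be reproduced with a 3-way array. This is a sketch; the array layout `counts[defendant, verdict, victim]` and the helper `odds_ratio` are choices made for this example.

```julia
# Odds ratio of a 2x2 table.
odds_ratio(t) = (t[1, 1] * t[2, 2]) / (t[1, 2] * t[2, 1])

# counts[defendant, verdict, victim]
#   defendant: 1 = White, 2 = Black
#   verdict:   1 = death penalty, 2 = no death penalty
#   victim:    1 = White, 2 = Black
counts = cat([53 414; 11 37], [0 16; 4 139]; dims = 3)

# Conditional odds ratios: one 2x2 partial table per victim race.
conditional = [odds_ratio(counts[:, :, k]) for k in 1:2]

# Marginal odds ratio: sum over victim race, then drop the singleton dimension.
marginal = odds_ratio(dropdims(sum(counts; dims = 3); dims = 3))

println("conditional: ", conditional)
println("marginal:    ", marginal)
```

Both conditional odds ratios are below 1 while the marginal one is above 1, which shows the association changing direction once victim race is ignored.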

## Exercises
2.11 Find a 95% confidence interval for the population odds ratio.
| Vote in 2008 | Obama | Romney |
| ------------ | ----- | ------ |
| Obama        | 802   | 53     |
| McCain       | 34    | 494    |

To compute the odds ratio, it doesn't matter which variable we choose as response/explanatory (rows or columns).
$$
\hat{\theta} = \frac{802(494)}{53(34)} = 219.9
$$
- $var(\log \hat{\theta}) = \sum_{i,j} \frac{1}{n_{ij}} = 0.05155$
- $\log \hat{\theta} = 5.393$, which estimates $\log \theta$
- $\mathcal{I}(\theta) = \exp( 5.393 \pm 1.96 \sqrt{0.05155} ) = [140.9, 343.1]$, where 1.96 is the 0.975 quantile of the standard normal 

---

2.17 

| Race  | Democrat | Republican | Independent |
| ----- | -------- | ---------- | ----------- |
| White | 871      | 821        | 336         |
| Black | 347      | 42         | 83         |

1. Test the null hypothesis of independence between political party and race.
2. Use standardized residuals to describe evidence
3. Partition chi-squared into two components, the first of which compares the races on the Dem/Rep choice. 

Chi-squared statistic
The estimated row and column marginal probabilities, which determine the expected counts under $H_0$, are
- $\pi_{1,:} = (\sum^3_{j=1} n_{1,j}) / n = 0.8112$
- $\pi_{2,:} = 1 - \pi_{1,:} = 0.1888$
- $\pi_{:,1} = 0.4872$
- $\pi_{:,2} = 0.3452$
- $\pi_{:,3} = 0.1676$
The observed probabilities are 
```julia
2×3 Matrix{Float64}:
 0.3484  0.3284  0.1344
 0.1388  0.0168  0.0332
```

Then, $X^2 = 184.3$. Under $H_0$, $X^2 \sim \chi^2_2$ approximately. Thus, the p-value $P(Z \geq X^2) = e^{-184.3/2} \approx 10^{-40}$, where $Z \sim \chi^2_2$, is effectively 0. 
We have $G^2 = 213.9$, so $P(Z \geq G^2)$ is effectively 0 as well.

Standardized residuals
```julia
2×3 Matrix{Float64}:
 -11.9668   12.9995  -0.532628
  11.9668  -12.9995   0.532628
```
- Fewer Whites were Democrats and fewer Blacks were Republicans than expected under $H_0$, with similar magnitude of lack of fit 
- counts of Independents were close to their expected values for both races 

The first partition is 
| Race  | Democrat | Republican |
| ----- | -------- | ---------- |
| White | 871      | 821        |
| Black | 347      | 42         |
 
$X^2$ for this partition is 185.4.

The second partition is 
| Race  | Democrat or Republican | Independent |
| ----- | ---------------------- | ----------- |
| White | 1692                   | 336         |
| Black | 389                    | 83          |

$X^2$ for this partition is 0.28.

Together, $185.45 + 0.28 = 185.74$, which differs slightly from the total $X^2$ of 184.3.
The p-values for the first and second partitions are effectively 0 and 0.59. Thus, race is strongly associated with the Democrat vs Republican choice, but there is no evidence that race is associated with choosing Independent over one of the two major parties.

However, because $X^2$ does not partition exactly, we should instead use $G^2$, whose components sum to the full-table $G^2 = 213.9$:
- $G^2_1 = 213.6$
- $G^2_2 = 0.28$
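
The partitioning claim can be checked numerically. The sketch below recomputes both statistics for the full table and the two subtables; the helper name `x2_g2` is hypothetical, and it assumes no zero cells.

```julia
# Sketch: Pearson X^2 and likelihood-ratio G^2 for a table of counts.
function x2_g2(counts)
    mu = sum(counts; dims = 2) * sum(counts; dims = 1) ./ sum(counts)
    X2 = sum((counts .- mu) .^ 2 ./ mu)
    G2 = 2 * sum(counts .* log.(counts ./ mu))   # assumes no zero cells
    return (X2 = X2, G2 = G2)
end

full = [871 821 336; 347 42 83]
part1 = full[:, 1:2]                              # Democrat vs Republican
part2 = hcat(sum(part1; dims = 2), full[:, 3])    # (Dem or Rep) vs Independent

s_full, s1, s2 = x2_g2(full), x2_g2(part1), x2_g2(part2)
println("G2: ", s1.G2, " + ", s2.G2, " vs full ", s_full.G2)
println("X2: ", s1.X2, " + ", s2.X2, " vs full ", s_full.X2)
```

The $G^2$ components add up to the full-table value up to floating-point error, while the $X^2$ components overshoot the full-table $X^2$.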

---

2.21 
| Income | Very dissatisfied | Little dissatisfied | Moderately satisfied | Very satisfied |
| ------ | ---- | ------ | ---------- | -------------- |
| <5     | 2    | 4      | 13         | 3              |
| 5-15   | 2    | 6      | 22         | 4              |
| 15-25  | 0    | 1      | 15         | 8              |
| >25    | 0    | 3      | 13         | 8              |

$X^2 = 11.5$ with $(4-1)(4-1) = 9$ dofs, giving p-value $1 - P(Z \leq 11.5) = 0.24$ for $Z \sim \chi^2_9$

Standardized residuals
```julia
4×4 Matrix{Float64}:
  1.44062    0.730528  -0.160626  -1.07918
  0.752542   0.871575   0.600508  -1.77257
 -1.11714   -1.52114    0.219809   1.50979
 -1.11714   -0.157359  -0.732696   1.50979
```
Strongest associations (largest absolute residuals, though none exceeds 2):
- income 5-15 unlikely to be very satisfied 
- income <5 likely to be very dissatisfied
- income 15-25 unlikely to be a little dissatisfied
- income 15-25 and >25 likely to be very satisfied 

Sample correlation
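- row (income) scores: $u = [3, 10, 20, 35]$
- column (satisfaction) scores: $v = [1, 3, 4, 5]$
- $M^2 = 7.04$ on 1 dof, so the p-value is 0.008, which gives much stronger evidence of a trend than $X^2$ (p-value 0.24) due to higher power when exploiting ordinality in data 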

```julia
julia> using Distributions  # provides Chisq and cdf, used for the p-value below

julia> data = [2 4 13 3; 2 6 22 4; 0 1 15 8; 0 3 13 8]
4×4 Matrix{Int64}:
 2  4  13  3
 2  6  22  4
 0  1  15  8
 0  3  13  8

julia> u = [3, 10, 20, 35]; v = [1, 3, 4, 5]
4-element Vector{Int64}:
 1
 3
 4
 5

julia> N = sum(data)
104

julia> p_rows = (data * ones(4)) ./ N; p_cols = (data' * ones(4)) ./ N;

julia> u_avg = u' * p_rows; v_avg = v' * p_cols;

julia> u_var = sum(((u .- u_avg) .^ 2) .* p_rows); 

julia> v_var = sum(((v .- v_avg) .^ 2) .* p_cols);

julia> numer = ((u .- u_avg) * (v .- v_avg)') .* (data ./ N)
4×4 Matrix{Float64}:
  0.776851   0.507845   -0.0490246  -0.40351
  0.376888   0.36957    -0.0402502  -0.261016
 -0.0       -0.0317852   0.0141617   0.269387
 -0.0       -0.515566    0.06636     1.45652

julia> numer = sum(numer)
2.5364275147928996

julia> denom = sqrt(u_var*v_var)
9.698507695372635

julia> M2 = (N-1)*(numer/denom)^2
7.04485902193063

julia> p = 1 - cdf(Chisq(1), M2) 
0.007949308126710797
```

---

2.23 
| Treatment | Cancer Controlled | Cancer not controlled |
| --------- | ----------------- | --------------------- |
| Surgery   | 21                | 2                     |
| Radiation | 15                | 3                     |

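Fisher's exact test uses the hypergeometric distribution. Let $n$ be the number of surgery patients whose cancer was controlled; the row totals (23 surgery, 18 radiation), the number of controlled cases (36), and the grand total $N = 41$ are treated as fixed.

$$
\begin{align} 
P(n=21) &= 
\frac{
	\begin{pmatrix} \# \text{surgery} \\ n \end{pmatrix}
	\begin{pmatrix} \# \text{radiation} \\ \# \text{controlled} - n \end{pmatrix}
}{\begin{pmatrix} N \\ \# \text{controlled} \end{pmatrix}} \\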
&= \frac{
	\begin{pmatrix} 23 \\ 21 \end{pmatrix} 
	\begin{pmatrix} 18 \\ 15 \end{pmatrix}
}{\begin{pmatrix} 41 \\ 36 \end{pmatrix}} =
0.2755
\end{align} 
$$
The more extreme possibilities are $n=\{22, 23\}$. 
- $P(n=22) = 0.094$
- $P(n=23) = 0.011$
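- Thus, the one-sided p-value, ie the probability of a result at least as extreme as the observed one, is $\sum^{23}_{i=21} P(n=i) = 0.3808$

The two-sided test also includes the outcomes in the other tail that are no more probable than the observed one, $n = \{18, 19\}$; $n = 20$ is excluded because $P(n=20) = 0.3616 > P(n=21)$. 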
$$\sum^{23}_{i=18, \ i \neq 20} P(n=i) = 0.6384$$
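
The two-sided sum can be checked by enumerating every feasible table. This sketch assumes the common convention of summing the probabilities of all outcomes no more likely than the observed one; the small relative tolerance is an assumption added to guard against floating-point ties.

```julia
row1, row2, col1, observed = 23, 18, 36, 21   # margins and observed n

# Hypergeometric probability of n = k controlled cases in the surgery row.
pmf(k) = binomial(row1, k) * binomial(row2, col1 - k) / binomial(row1 + row2, col1)

support = max(0, col1 - row2):min(row1, col1)   # feasible values of n
probs = [pmf(k) for k in support]
p_obs = pmf(observed)

# Two-sided p-value: total probability of outcomes no more likely than observed.
p_two_sided = sum(p for p in probs if p <= p_obs * (1 + 1e-7))

println(collect(support), " => ", round.(probs; digits = 4))
println("two-sided p-value: ", p_two_sided)
```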

---

2.25 
| Victim | Defendant | Death | No Death |
| ------ | --------- | ----- | -------- |
| White  | White     | 19    | 132      |
|        | Black     | 11    | 52       |
| Black  | White     | 0     | 9        |
|        | Black     | 6     | 97       |

Conditional odds ratio for defendant race x death penalty 
- conditioned on victim race, we just take the 2x2 partial table for a given victim race
- $\theta \mid \text{white victim} = \frac{19(52)}{11(132)} = 0.6804$
- $\theta \mid \text{black victim} = 0$

Marginal odds ratio between defendant race and death penalty
- marginalize victim race by summing the two partial tables
- $\theta = \frac{(19+0)(52+97)}{(132+9)(11+6)}  = 1.181$

Yes, this is Simpson's paradox: the marginal association ($\theta > 1$) is in the opposite direction to the conditional associations ($\theta < 1$ for both victim races). 