# Lecture 7b - Chi-Squared Test for Independence

The chi-squared test can also be used to test whether two categorical variables are independent.

## Example - Do the herbs growing in my back yard prevent COVID?

Suppose we wished to test whether a couple of common "herbs" help prevent COVID.

We take a group of people and randomly assign them to one of three groups: Herb 1, Herb 2, and Placebo.

After the experiment, we collect the data and place them in a <b><i>contingency table</i></b>:

| &nbsp; |  Herb 1 | Herb 2 | Placebo | Row Total |
| :- | :-: | - | - | -: |
| Sick |  20 | 30 | 30 | <b>80</b> |
| Not Sick |  100 | 110 | 90 | <b>300</b> |
| <b>Column Total</b> | <b>120</b> | <b>140</b> | <b>120</b> | <b>380</b> |

Note that there are six groups here, so $k = 6$.

We can set this up as a hypothesis testing problem:

\begin{align*}
    H_0 &: \text{Herbs do nothing} \\ 
    H_1 &: \text{Herbs do something (could be good or bad)} 
\end{align*}

Let's convert these hypotheses to something more formal so that we can use the chi-squared test.

"Herbs do nothing" means that it doesn't matter which treatment group you're in: the probability that you are sick (or not sick) is independent of your group.

In other words, the probability is completely captured by the row totals:

\begin{align*}
    p_{\text{sick}} &= 80/380 \approx 0.21\\
    p_{\text{not sick}} &= 300/380 \approx 0.79
\end{align*}

Formally, with these probabilities estimated from the pooled data,

\begin{align*}
    H_0 &: p_{\text{sick}} = 0.21 \ \ \text{and} \ \ p_{\text{not sick}} = 0.79 \ \ \text{in every group} \\ 
    H_1 &: \text{Herbs do something}
\end{align*}

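Thus, if $H_0$ were true, we would <b><u>expect</u></b> each group's counts to split according to these probabilities. Each expected count is the column total times the corresponding row probability; for example, the expected number of sick people in the Herb 1 group is $120 \times 80/380 \approx 25.3$. The following table summarizes the expected counts: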

| &nbsp; |  Herb 1 | Herb 2 | Placebo | Row Total |
| :- | :-: | - | - | -: |
| Expected Sick |  25.26 | 29.47 | 25.26 | <b>80</b> |
| Expected Not Sick |  94.74 | 110.53 | 94.74 | <b>300</b> |
| <b>Column Total</b> | <b>120</b> | <b>140</b> | <b>120</b> | <b>380</b> |

Calculating $Q$ yields,

\begin{align*}
    Q &= \sum_{i=1}^{k} \frac{(\text{observed}  - \text{expected})^2}{\text{expected}}\\
    &= \frac{(20 - 25.26)^2}{25.26} + \frac{(30 - 29.47)^2}{29.47} + \frac{(30 - 25.26)^2}{25.26} + \frac{(100 - 94.74)^2}{94.74} + \frac{(110 - 110.53)^2}{110.53} + \frac{(90 - 94.74)^2}{94.74}\\
    &\approx 2.53
\end{align*}
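The expected counts and $Q$ can be reproduced in a few lines of NumPy. This is a minimal sketch, not part of the original calculation: the variable names (`observed`, `expected`, `Q`) are our own, and the code uses unrounded expected counts, so individual entries may differ from the table in the last digit.

```python
import numpy as np

# Observed counts: rows = (Sick, Not Sick), columns = (Herb 1, Herb 2, Placebo)
observed = np.array([[20, 30, 30],
                     [100, 110, 90]])

n = observed.sum()                   # grand total
row_totals = observed.sum(axis=1)    # totals for Sick / Not Sick
col_totals = observed.sum(axis=0)    # totals for each treatment group

# Under H0, expected count = (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / n

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
Q = float(((observed - expected) ** 2 / expected).sum())

print(np.round(expected, 2))
print('Q:', round(Q, 2))
```

Using the outer product of the row and column totals avoids typing each expected count by hand, which is where rounding slips tend to creep in.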

For an $r \times c$ contingency table like this one, $Q$ has $\text{df} = (r - 1)(c - 1)$ degrees of freedom. Here $r = 2$ rows and $c = 3$ columns, so $\text{df} = (2 - 1)(3 - 1) = 2$.

We can then calculate the p-value using a $\chi^2$ table or using code:

In [1]:
import numpy as np
from scipy.stats import chi2

# p-value = P(chi-squared with 2 df > 2.53), i.e. the upper-tail probability
p_value = 1 - chi2.cdf(2.53, 2)
print('p-value:', p_value)

p-value: 0.28223929614052334
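As an aside, the same upper-tail probability can be obtained from SciPy's survival function `chi2.sf`, which computes $1 - \text{cdf}$ directly and tends to be more accurate when the p-value is very small. The sketch below is our own illustration; it simply checks that the two approaches agree for this example.

```python
from scipy.stats import chi2

Q, df = 2.53, 2

# Survival function: P(chi-squared with df degrees of freedom > Q)
p_value_sf = chi2.sf(Q, df)

# Same quantity via the cdf, as in the cell above
p_value_cdf = 1 - chi2.cdf(Q, df)

print('p-value (sf):     ', p_value_sf)
print('p-value (1 - cdf):', p_value_cdf)
```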


Therefore, we would only reject the null at $\alpha$ levels greater than $0.282$.

For example, suppose we were testing at $\alpha = 0.10$. We could calculate the chi-squared <b><i>critical value</i></b>:

In [3]:
alpha = 0.1
# Critical value: the (1 - alpha) quantile of the chi-squared distribution with 2 df
Q_crit = chi2.ppf(1 - alpha, 2)

print('Chi-Squared Critical Value:', Q_crit)

Chi-Squared Critical Value: 4.605170185988092


Since our statistic is much lower than the critical value, i.e. $2.53 < 4.61$, we cannot reject the null hypothesis.

Therefore, we do not have enough evidence to suggest that the herbs do something (beneficial or detrimental) for COVID.
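The whole calculation can be cross-checked with `scipy.stats.chi2_contingency`, which takes the table of observed counts and returns the statistic, the p-value, the degrees of freedom, and the expected counts. This is a hedged sketch rather than the lecture's own method. Because SciPy works with unrounded values throughout, its p-value differs slightly from the one computed from the rounded statistic $2.53$.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (Sick, Not Sick), columns = (Herb 1, Herb 2, Placebo)
observed = np.array([[20, 30, 30],
                     [100, 110, 90]])

# correction=False disables the Yates continuity correction
# (SciPy would only apply it when df = 1, so it has no effect here)
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print('Q:', round(stat, 2))
print('df:', dof)
print('p-value:', round(p_value, 3))
```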

## Example - Hand and Foot Length

Suppose we suspect that a person's foot length is related to their hand length.

We sample a bunch of people and collect the following data:

| &nbsp; |  Right foot longer | Left foot longer | Both feet same | Row Total |
| :- | :-: | - | - | -: |
| Right hand longer |  11 | 3 | 8 | <b>22</b> |
| Left hand longer |  2 | 9 | 14 | <b>25</b> |
| Both hands same |  12 | 13 | 28 | <b>53</b> |
| <b>Column Total</b> | <b>25</b> | <b>25</b> | <b>50</b> | <b>100</b> |

Note that there are nine groups here, so $k = 9$.

We can set this up as a hypothesis testing problem:

\begin{align*}
    H_0 &: \text{Foot and hand length are independent} \\ 
    H_1 &: \text{Foot and hand length are NOT independent} 
\end{align*}

Note that if foot and hand length are independent, knowing the value of one variable tells us nothing about the probabilities of the other.

Thus, the probabilities for hand length and for foot length can be estimated from the marginals (i.e. the row and column totals divided by the sample size):

\begin{align*}
    p_{\text{right hand longer}} &= 0.22\\
    p_{\text{left hand longer}} &= 0.25\\
    p_{\text{both hands same}} &= 0.53
\end{align*}

and

\begin{align*}
    p_{\text{right foot longer}} &= 0.25\\
    p_{\text{left foot longer}} &= 0.25\\
    p_{\text{both feet same}} &= 0.50
\end{align*}

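Under $H_0$, the probability that a person falls into one of the nine groups is the product of the corresponding marginal probabilities, and the expected count is that probability times the sample size. For example, the expected number of people with both a longer right hand and a longer right foot is $100 \times 0.22 \times 0.25 = 5.5$. We can then construct the table of <b><u>expected</u></b> observations: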

| &nbsp; |  Right foot longer | Left foot longer | Both feet same | Row Total |
| :- | :-: | - | - | -: |
| Right hand longer |  5.5 | 5.5 | 11 | <b>22</b> |
| Left hand longer |  6.25 | 6.25 | 12.5 | <b>25</b> |
| Both hands same |  13.25 | 13.25 | 26.5 | <b>53</b> |
| <b>Column Total</b> | <b>25</b> | <b>25</b> | <b>50</b> | <b>100</b> |
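The expected table is the outer product of the two marginal probability vectors, scaled by the sample size. Here is a minimal NumPy sketch of that step; the names `p_hand` and `p_foot` are our own.

```python
import numpy as np

# Observed counts: rows = hand (right longer, left longer, same),
# columns = foot (right longer, left longer, same)
observed = np.array([[11, 3, 8],
                     [2, 9, 14],
                     [12, 13, 28]])

n = observed.sum()                     # sample size
p_hand = observed.sum(axis=1) / n      # marginal probabilities for hand length
p_foot = observed.sum(axis=0) / n      # marginal probabilities for foot length

# Under independence, P(hand = i, foot = j) = p_hand[i] * p_foot[j]
expected = n * np.outer(p_hand, p_foot)

print(expected)
```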

Calculating $Q$ yields,

\begin{align*}
    Q &= \sum_{i=1}^{k} \frac{(\text{observed}  - \text{expected})^2}{\text{expected}}\\
    &= \frac{(11 - 5.5)^2}{5.5} + \frac{(3 - 5.5)^2}{5.5} + \frac{(8 - 11)^2}{11} + \frac{(2 - 6.25)^2}{6.25} + \frac{(9 - 6.25)^2}{6.25} + \frac{(14 - 12.5)^2}{12.5} + \frac{(12 - 13.25)^2}{13.25} + \frac{(13 - 13.25)^2}{13.25} + \frac{(28 - 26.5)^2}{26.5}\\
    &\approx 11.942
\end{align*}

Again, $Q$ has $\text{df} = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4$ degrees of freedom.

We can then calculate the p-value using a $\chi^2$ table or using code:

In [3]:
p_value = 1 - chi2.cdf(11.942, 4)
print('p-value:', p_value)

p-value: 0.017787820214710037


Thus, for a significance level of $\alpha = 0.05$, we reject the null hypothesis that foot and hand length are independent.

In other words, they are (probably) not independent.
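Both examples follow the same recipe: compute expected counts from the marginals, sum the scaled squared deviations, and compare against a $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom. The sketch below wraps those steps in a small helper. The function name `chi2_independence` and its return values are hypothetical, chosen for illustration only.

```python
import numpy as np
from scipy.stats import chi2


def chi2_independence(observed, alpha=0.05):
    """Return (Q, df, p_value, reject) for an r x c table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    n = observed.sum()

    # Expected counts under independence
    expected = np.outer(row_totals, col_totals) / n

    # Test statistic and its degrees of freedom
    Q = float(((observed - expected) ** 2 / expected).sum())
    r, c = observed.shape
    df = (r - 1) * (c - 1)

    # Upper-tail probability; reject H0 when it falls below alpha
    p_value = float(chi2.sf(Q, df))
    return Q, df, p_value, p_value < alpha


hand_foot = [[11, 3, 8],
             [2, 9, 14],
             [12, 13, 28]]

Q, df, p_value, reject = chi2_independence(hand_foot, alpha=0.05)
print(f'Q = {Q:.3f}, df = {df}, p-value = {p_value:.4f}, reject H0: {reject}')
```

Passing the herb table to the same function with `alpha=0.10` should reproduce the earlier conclusion of not rejecting the null.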