# Testing for Independence in Populations Classified According to Two Characteristics

## Motivating Example

The management of a certain hotel is interested in whether all its guests are treated the same regardless of the prices of their rooms. They randomly chose 155 recent guests and questioned them about the service they had received at the hotel. The following summary data resulted:

|Service ranking | Economy | Standard | Luxury |
|----------------|---------|----------|--------|
|Excellent  | 30 | 21 | 9 |
|Good       | 36 | 29 | 8 |
|Fair       | 12 | 8  | 2 |

We want to determine if service ranking is independent of the price of the room in which the guest stayed. We refer to such tables as **contingency tables**.

## Test for Independence

Suppose that we have a large population classified according to two distinct categories $X$ and $Y$. Suppose that the possible values for $X$ are $1,2,\dots,r$ and the possible values for $Y$ are $1,2,\dots,s$.

Let $P_i = P\{X=i\}$, $Q_j = P\{Y=j\}$, and $P_{ij} = P\{X=i\text{ and }Y=j\}$ for $i=1,2,\dots,r$ and $j=1,2,\dots,s$. We want to determine if there is any validity to the claim that $P_{ij} = P_iQ_j$ for $i=1,2,\dots,r$ and $j=1,2,\dots,s$.

That is, we want to determine whether the categories $X$ and $Y$ are independent of each other, or whether knowing the value of one changes the probabilities of the other's values.

Recall that $A$ and $B$ are independent events if $$P\{A\text{ and }B\} = P\{A\}P\{B\}.$$

In particular, we are interested in testing $$H_0: P_{ij} = P_iQ_j\text{ for all $i$ and $j$}$$ against the alternative $$H_1: P_{ij}\neq P_iQ_j\text{ for some choice of $i$ and $j$}.$$

We can do this by computing the following test statistic: $$TS = \sum_{i=1}^r\sum_{j=1}^s\frac{(N_{ij} - e_{ij})^2}{e_{ij}}$$ where $N_{ij}$ is the observed number of sample members with $X=i$ and $Y=j$, and $e_{ij} = nP_iQ_j$ is the expected number of such sample members when $H_0$ is true, for $i=1,2,\dots,r$ and $j=1,2,\dots,s$. If the $e_{ij}$ were known, then for a large sample size $n$ and under $H_0$, $TS$ would have an approximately chi-squared distribution with $rs-1$ degrees of freedom. (This is not an obvious result and we will not justify it here.)

One issue with the above approach is that we do not know the values of $e_{ij}$ since the values of $P_i$ and $Q_j$ are not part of the null hypothesis (only the relationships between them). So without knowing $e_{ij},$ we must replace it in our test statistic by an estimate, namely: $$\hat{e}_{ij} = \frac{N_iM_j}{n}$$ where $n$ is the total number of samples, $N_i$ is the number of samples with associated value $i$ for the $X$ characteristic and $M_j$ is the number of samples with associated value $j$ for the $Y$ characteristic.
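The estimate follows from replacing $P_i$ by the sample proportion $N_i/n$ and $Q_j$ by $M_j/n$ in $nP_iQ_j$. As a minimal sketch, the estimated expected counts for the hotel table from the motivating example can be computed with NumPy; the variable names below are illustrative and not part of the original notebook.

```python
import numpy as np

# Observed counts from the hotel table
# (rows: Excellent, Good, Fair; columns: Economy, Standard, Luxury).
observed = np.array([[30, 21, 9],
                     [36, 29, 8],
                     [12,  8, 2]])

n = observed.sum()                 # total sample size
row_totals = observed.sum(axis=1)  # N_i, one total per service ranking
col_totals = observed.sum(axis=0)  # M_j, one total per room type

# Estimated expected counts: e_hat[i, j] = N_i * M_j / n
expected_hat = np.outer(row_totals, col_totals) / n
print(expected_hat.round(3))
```

By construction, the row and column totals of the estimated expected counts agree with those of the observed table.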

So we will *actually* use the test statistic: $$TS = \sum_{i=1}^r\sum_{j=1}^s\frac{(N_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}.$$ For a large sample size $n$, and when $H_0$ is true, $TS$ will have an approximately chi-squared distribution with $(r-1)(s-1)$ degrees of freedom. (This is not an obvious result and we will not justify it here.)
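One informal way to remember where $(r-1)(s-1)$ comes from (a counting heuristic, not a proof): with all cell probabilities specified there are $rs-1$ degrees of freedom, and one degree of freedom is given up for each parameter estimated from the data. Because $\sum_i P_i = 1$ and $\sum_j Q_j = 1$, only $r-1$ of the $P_i$ and $s-1$ of the $Q_j$ need to be estimated, which leaves $$(rs-1)-(r-1)-(s-1) = rs-r-s+1 = (r-1)(s-1).$$ For a $3\times 3$ table such as the hotel data, this gives $(3-1)(3-1)=4$.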

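A significance-level-$\alpha$ test of the null hypothesis $H_0$ against the alternative $H_1$ is therefore to reject $H_0$ if $$TS\geq \chi^2_{(r-1)(s-1),\alpha}$$ and not to reject $H_0$ otherwise, where $\chi^2_{k,\alpha}$ denotes the value that a chi-squared random variable with $k$ degrees of freedom exceeds with probability $\alpha$. This is referred to as a **chi-squared test of independence**.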
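The whole procedure can be sketched directly from the formulas. The function `independence_test` below is a hypothetical helper written for illustration, not a library routine; it assumes the observed counts are given as a two-dimensional table and uses `scipy.stats.chi2.ppf` for the critical value.

```python
import numpy as np
from scipy import stats

def independence_test(observed, alpha=0.05):
    """Chi-squared test of independence computed from the formulas in the text."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    # Estimated expected counts: N_i * M_j / n
    expected_hat = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    # Test statistic: sum over all cells of (N_ij - e_hat_ij)^2 / e_hat_ij
    ts = ((observed - expected_hat) ** 2 / expected_hat).sum()
    r, s = observed.shape
    dof = (r - 1) * (s - 1)
    # Critical value: the point exceeded with probability alpha
    critical = stats.chi2.ppf(1 - alpha, dof)
    return ts, dof, critical, bool(ts >= critical)

# Hotel table from the motivating example
ts, dof, critical, reject = independence_test([[30, 21, 9],
                                               [36, 29, 8],
                                               [12,  8, 2]])
print(f"TS = {ts:.4f}, dof = {dof}, critical value = {critical:.4f}, reject H0: {reject}")
```

Writing the computation out this way is mainly useful for checking one's understanding; in practice a library function such as `scipy.stats.chi2_contingency` does the same work.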

## Back to the Example

Putting our data into a pandas dataframe, we can compute the relevant statistics and carry out the test at the 5 percent significance level.

In [1]:
import pandas as pd
import scipy as sp
import scipy.stats  # explicitly load the stats submodule so that sp.stats is available

In [2]:
data =  pd.DataFrame([[30, 21, 9],
                      [36, 29, 8],
                      [12,  8, 2]])

result = sp.stats.chi2_contingency(data)
print(f'The test statistic value: {round(result[0],4)}')
print(f'The p-value: {round(result[1],4)}')
print(f'The degrees of freedom: {result[2]}')
print(f'The 95th percentile: {round(sp.stats.chi2.ppf(0.95,result[2]),4)}')

The test statistic value: 0.9467
The p-value: 0.9178
The degrees of freedom: 4
The 95th percentile: 9.4877


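Since the test statistic (0.9467) is well below the critical value (9.4877), or equivalently since the p-value (0.9178) exceeds 0.05, we do not reject the null hypothesis at the 5 percent significance level. The data give no evidence that service ranking depends on room price.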
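To see why the statistic is so small, it can help to compare the observed counts with the estimated expected counts, which `chi2_contingency` returns as the fourth element of its result. The sketch below repeats the hotel data with row and column labels added for readability; the labels are taken from the table in the motivating example.

```python
import pandas as pd
import scipy.stats

data = pd.DataFrame([[30, 21, 9],
                     [36, 29, 8],
                     [12,  8, 2]],
                    index=["Excellent", "Good", "Fair"],
                    columns=["Economy", "Standard", "Luxury"])

result = scipy.stats.chi2_contingency(data)

# result[3] holds the estimated expected counts N_i * M_j / n
expected = pd.DataFrame(result[3], index=data.index, columns=data.columns)
print(expected.round(2))
```

Every observed count lies close to its estimated expected count, which is what independence would predict.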

## Examples

1. Suppose that a random sample of 300 people was chosen from a population and classified by gender and political affiliation, with the following data resulting:

|    | Democrat | Republican | Independent | 
| ---| ----     | ------     | ------      |
|Women | 68 | 56 | 32|
|Men   | 52 | 72 | 20 |

What conclusion can be drawn? Use the 5 percent level of significance.

## Problem 1

Let $P_1$ and $P_2$ represent the proportions of the population that are women and men, respectively. Let $Q_1$, $Q_2$, and $Q_3$ represent the proportions of the population that are Democrats, Republicans, and Independents, respectively. Then we test $$H_0: P_{ij} = P_iQ_j\text{ for all $i$ and $j$}$$ against the alternative $$H_1: P_{ij}\neq P_iQ_j\text{ for some choice of $i$ and $j$}.$$ (Note that $P_{ij} = P\{X=i\text{ and }Y=j\}.$)

In [3]:
data = pd.DataFrame([[68, 56, 32],
                     [52, 72, 20]])
                     
result = sp.stats.chi2_contingency(data)
print(f'The test statistic value: {round(result[0],4)}')
print(f'The p-value: {round(result[1],4)}')
print(f'The degrees of freedom: {result[2]}')
print(f'The 95th percentile: {round(sp.stats.chi2.ppf(0.95,result[2]),4)}')

The test statistic value: 6.4329
The p-value: 0.0401
The degrees of freedom: 2
The 95th percentile: 5.9915


Since the test statistic (6.4329) exceeds the critical value (5.9915), or equivalently since the p-value (0.0401) is below 0.05, we reject at the 5 percent level of significance the null hypothesis that gender and political affiliation are independent.
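As an optional follow-up, the test statistic can be split into its per-cell terms $(N_{ij}-\hat{e}_{ij})^2/\hat{e}_{ij}$ to see which cells depart most from independence. This is a minimal sketch using the same data with labels added; the name `contributions` is illustrative.

```python
import pandas as pd
import scipy.stats

data = pd.DataFrame([[68, 56, 32],
                     [52, 72, 20]],
                    index=["Women", "Men"],
                    columns=["Democrat", "Republican", "Independent"])

result = scipy.stats.chi2_contingency(data)
expected = result[3]  # estimated expected counts

# Per-cell terms of the test statistic
contributions = (data - expected) ** 2 / expected
print(contributions.round(3))
print(f"Sum of contributions: {contributions.to_numpy().sum():.4f}")
```

The per-cell terms add up to the test statistic reported above, and in this table the largest terms come from the Republican column.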

2.  A public health scientist wanted to learn about the relationship between the marital status of patients being treated for depression and the severity of their conditions. The scientist chose a random sample of 159 patients who had been treated for depression at a mental health clinic and had these patients classified according to the severity of their depression (severe, normal, or mild) and according to their marital status. The following data resulted.

|        | Married | Single | Widowed or Divorced | 
| ---    | ----    | ----   | --------            |
|Severe | 22 | 16 | 19 |
|Normal | 33 | 29 | 14 | 
|Mild   | 14 | 9  | 3  |

Determine the p-value of the test of the hypothesis that the depressive state of the clinic’s patients is independent of their marital status.

## Problem 2

In [4]:
data = pd.DataFrame([[22, 16, 19],
                     [33, 29, 14],
                     [14,  9,  3]])

result = sp.stats.chi2_contingency(data)
print(f'The test statistic value: {round(result[0],4)}')
print(f'The p-value: {round(result[1],4)}')
print(f'The degrees of freedom: {result[2]}')
print(f'The 95th percentile: {round(sp.stats.chi2.ppf(0.95,result[2]),4)}')

The test statistic value: 6.8281
The p-value: 0.1453
The degrees of freedom: 4
The 95th percentile: 9.4877


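The p-value of the test is approximately 0.1453. Since this exceeds 0.05, the hypothesis of independence would not be rejected at the 5 percent level of significance.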
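The p-value is the probability that a chi-squared random variable with $(r-1)(s-1)$ degrees of freedom is at least as large as the observed test statistic. As a short check, assuming the same data as above, it can be recomputed from the statistic with the survival function `scipy.stats.chi2.sf`:

```python
import scipy.stats

observed = [[22, 16, 19],
            [33, 29, 14],
            [14,  9,  3]]

statistic, p_value, dof, _ = scipy.stats.chi2_contingency(observed)

# Upper-tail probability P{chi-squared with dof degrees of freedom >= statistic}
tail_prob = scipy.stats.chi2.sf(statistic, dof)

print(f"p-value from chi2_contingency: {p_value:.4f}")
print(f"upper-tail probability:        {tail_prob:.4f}")
```

The two printed values agree, since `chi2_contingency` computes its p-value from this same upper tail.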

3. A random sample of 160 patients at a health maintenance organization yielded the following information about their smoking status and blood cholesterol counts:

| Smoking Status | Low BCC | Moderate BCC | High BCC | 
| -------        | -----   | --------     | ------   |
|Heavy | 6|14|24|
|Light | 12|23|15|
|Nonsmoker | 23|32|11|

Would the hypothesis of independence between blood cholesterol count and smoking status be rejected at the 5 percent level of significance?

## Problem 3

In [5]:
data = pd.DataFrame([[ 6, 14, 24],
                     [12, 23, 15],
                     [23, 32, 11]])

result = sp.stats.chi2_contingency(data)
print(f'The test statistic value: {round(result[0],4)}')
print(f'The p-value: {round(result[1],4)}')
print(f'The degrees of freedom: {result[2]}')
print(f'The 95th percentile: {round(sp.stats.chi2.ppf(0.95,result[2]),4)}')

The test statistic value: 18.708
The p-value: 0.0009
The degrees of freedom: 4
The 95th percentile: 9.4877


Since the test statistic (18.708) exceeds the critical value (9.4877), or equivalently since the p-value (0.0009) is below 0.05, we reject at the 5 percent level of significance the null hypothesis that blood cholesterol count and smoking status are independent.