# Chi-Squared Distributions and Goodness-of-Fit Tests

## Chi-Squared Distribution

If $Z_1,Z_2,\dots, Z_n$ are independent standard normal random variables, then the random variable $$\sum_{i=1}^nZ_i^2$$ is said to be a **chi-squared random variable** with $n$ degrees of freedom. It has an expected value of $n$.
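As a quick numerical check of this definition, the following sketch simulates sums of squared standard normals and compares them with SciPy's chi-squared distribution. The choice of $n = 5$, the number of simulated draws, and the seed are arbitrary assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
n = 5                            # degrees of freedom (arbitrary choice)
draws = 100_000                  # number of simulated chi-squared values

# Each row holds n independent standard normals; summing the squares across
# a row gives one draw of a chi-squared random variable with n dof.
z = rng.standard_normal(size=(draws, n))
samples = (z ** 2).sum(axis=1)

sample_mean = samples.mean()
sample_q95 = np.quantile(samples, 0.95)
exact_q95 = stats.chi2.ppf(0.95, n)

print(f'Simulated mean: {sample_mean:.3f} (theory: {n})')
print(f'Simulated 95th percentile: {sample_q95:.3f} (SciPy: {exact_q95:.3f})')
```

The simulated mean should land close to $n$, and the simulated percentile close to the value SciPy reports.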

## Sample Variance

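The proof of this result is beyond the scope of this class, but if $X_1,X_2,\dots,X_n$ are independent normal random variables with a common mean and common variance $\sigma^2$, and $$S^2 = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2$$ is the sample variance, then $(n-1)S^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom.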
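A small simulation can make the sample-variance result plausible. The sketch below assumes normal samples with variance $\sigma^2$ and looks at the scaled quantity $(n-1)S^2/\sigma^2$, which should behave like a chi-squared random variable with $n-1$ degrees of freedom (mean $n-1$, variance $2(n-1)$). The sample size, mean, standard deviation, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the run is reproducible
n = 10                           # sample size (arbitrary choice)
sigma = 2.0                      # true standard deviation (arbitrary choice)
draws = 50_000                   # number of simulated samples

# Each row is one sample of n independent normal observations.
x = rng.normal(loc=3.0, scale=sigma, size=(draws, n))

s2 = x.var(axis=1, ddof=1)           # sample variance with the n - 1 divisor
scaled = (n - 1) * s2 / sigma ** 2   # quantity claimed to be chi-squared(n - 1)

scaled_mean = scaled.mean()
scaled_var = scaled.var()

print(f'Mean of scaled variance: {scaled_mean:.3f} (theory: {n - 1})')
print(f'Variance of scaled variance: {scaled_var:.3f} (theory: {2 * (n - 1)})')
```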

## A Motivating Example

To determine whether accidents are more likely to occur on certain days of the week, data have been collected on all the accidents requiring medical attention that occurred over the last 12 months at an automobile plant in northern California. The data yielded a total of 250 accidents, with the number occurring on each day of the week being as follows:
 
| Weekday   | Count |
|-----------|-------|
| Monday    | 62    |
| Tuesday   | 47    |
| Wednesday | 44    |
| Thursday  | 45    |
| Friday    | 52    |

How can we test the hypothesis that accidents are equally likely to occur on any day of the work week?

### Chi-Squared Goodness-of-Fit Tests

Suppose that you have a large population where each member has a value associated with them, say, from $1$ to $k$ (the values don't have to be numerical; they can be *categorical*). Let $P_1, P_2,\dots, P_k$ be the true proportions of the population that have values $1,2,\dots,k$. Suppose that we believe that the $k$ different values occur with probabilities $p_1, p_2, \dots, p_k$, respectively. Then we are interested in testing $$H_0: P_1 = p_1,P_2=p_2,\dots,P_k=p_k$$ against the alternative $$H_1: P_i\neq p_i\text{ for some $i$, $i=1,2,\dots,k$}.$$

We can do this by drawing a sample of size $n$ and computing the following test statistic: $$TS = \sum_{i=1}^k\frac{(N_i - e_i)^2}{e_i}$$ where $N_i$ is the observed number of sample members with value $i$ and $e_i = np_i$ is the number expected under $H_0$, for $i=1,2,\dots,k$. For a large sample size $n$, $TS$ will have an approximately chi-squared distribution with $k-1$ degrees of freedom when $H_0$ is true. (This is not an obvious result and we will not justify it here.)

Letting $\chi^2_{k,\alpha}$ denote the $100(1-\alpha)$th percentile of a chi-squared random variable with $k$ degrees of freedom, a significance-level-$\alpha$ test of the null hypothesis $H_0$ against the alternative $H_1$ rejects $H_0$ if $$TS\geq \chi^2_{k-1,\alpha}$$ and does not reject $H_0$ otherwise. This is referred to as a **chi-squared goodness-of-fit test**.
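The decision rule can be sketched as a small helper. The function name `goodness_of_fit` and the three-category counts used to exercise it are hypothetical and made up purely for illustration; only `scipy.stats.chi2.ppf` is an existing library call.

```python
import numpy as np
from scipy import stats

def goodness_of_fit(counts, probs, alpha=0.05):
    """Return (TS, critical value, reject?) for a chi-squared goodness-of-fit test.

    Hypothetical helper: `counts` are the observed N_i, `probs` the hypothesized p_i.
    """
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = counts.sum()                      # sample size
    expected = n * probs                  # e_i = n * p_i
    ts = ((counts - expected) ** 2 / expected).sum()
    # 100(1 - alpha)th percentile of chi-squared with k - 1 degrees of freedom
    critical = stats.chi2.ppf(1 - alpha, df=len(counts) - 1)
    return ts, critical, ts >= critical

# Made-up counts for a three-category illustration (not data from these notes)
ts, critical, reject = goodness_of_fit([30, 50, 20], [0.25, 0.5, 0.25])
print(f'TS = {ts:.4f}, critical value = {critical:.4f}, reject H0: {reject}')
```

Returning the critical value alongside the statistic keeps the comparison $TS\geq \chi^2_{k-1,\alpha}$ visible rather than hiding it behind a single yes-or-no answer.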

## Hypothesis test rejection area

![Chi-squared Rejection](chi_squared_24.png)

## Back to the Example

In our earlier example we wanted to test the hypothesis that accidents are equally likely to occur on any weekday. Numbering the weekdays $1,2,\dots,5$ from Monday to Friday, we denote by $P_i$ the proportion of accidents that occur on day $i$ with $i=1,\dots,5$. So our null hypothesis is $$H_0: P_i=\frac{1}{5},\text{ $i=1,\dots,5$}$$ and our alternative hypothesis is $$H_1: P_i\neq \frac{1}{5}\text{ for some $i$, $i=1,\dots,5$.}$$

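Under $H_0$, the expected number of accidents on each day is $e_i = 250\cdot\frac{1}{5} = 50$. We compute the test statistic $$TS = \frac{(62-50)^2}{50} + \frac{(47-50)^2}{50} + \frac{(44-50)^2}{50} + \frac{(45-50)^2}{50} + \frac{(52-50)^2}{50} = 4.36.$$ Since there are $k=5$ categories, we compare against a chi-squared distribution with $4$ degrees of freedom. Because $\chi^2_{4,0.05} \approx 9.49$ and $4.36 < 9.49$, we do not reject the null hypothesis at the 5 percent level of significance.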
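For readers who want to check the arithmetic by hand, the sum can be expanded term by term, since every term shares the denominator $50$:

$$TS = \frac{12^2 + (-3)^2 + (-6)^2 + (-5)^2 + 2^2}{50} = \frac{144 + 9 + 36 + 25 + 4}{50} = \frac{218}{50} = 4.36.$$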

## Computations 

In [1]:
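import pandas as pd
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so that sp.stats resolves

data = pd.Series([62,47,44,45,52])  # observed accidents, Monday through Friday
TS = ((data-50)**2/50).sum()  # expected count under H0 is 250/5 = 50 per day
chi2_95_4 = round(sp.stats.chi2.ppf(0.95,4),4)  # critical value with 5 - 1 = 4 dof
p = round(1 - sp.stats.chi2.cdf(TS,4),4)  # upper-tail probability beyond TS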

print(f'The test statistic value: {TS}')
print(f'The 95th percentile for a chi-squared dist with 4 dof: {chi2_95_4}')
print(f'The p-value for the test statistic: {p}')

The test statistic value: 4.36
The 95th percentile for a chi-squared dist with 4 dof: 9.4877
The p-value for the test statistic: 0.3595
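As a cross-check, SciPy also provides `scipy.stats.chisquare`, which computes the same statistic and p-value in one call. The sketch below relies on its default behavior of assuming equal expected frequencies when `f_exp` is not given, which matches the null hypothesis here.

```python
from scipy import stats

observed = [62, 47, 44, 45, 52]   # accidents, Monday through Friday

# With no f_exp argument, chisquare assumes all categories are equally likely,
# so each expected count is the mean of the observed counts (here 50).
result = stats.chisquare(observed)

print(f'The test statistic value: {result.statistic:.4f}')
print(f'The p-value for the test statistic: {result.pvalue:.4f}')
```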


## Examples

1. In a certain county, it has been historically accepted that 52 percent of the patients who go to hospital emergency rooms are in stable condition, 32 percent are in serious condition, and 16 percent are in critical condition. However, a particular county hospital feels that its percentages are different. To prove its claim, the hospital has randomly selected a sample of 300 patients who have visited its emergency room in the past 6 months. The numbers falling in each grouping are as follows:
 
| Condition | Count |
|-----------|-------|
| Stable    | 148   |
| Serious   | 92    |
| Critical  | 60    |

Do these data prove the claim of the hospital?

## Problem 1

Let $P_1$, $P_2$, and $P_3$ be the proportions of cases where the patients are in stable, serious, and critical condition, respectively. Then $$H_0: P_1 = 0.52, P_2 = 0.32, P_3 = 0.16.$$ The alternative hypothesis, $H_1$, is that at least one of these equalities does not hold. With $k=3$ categories, the test statistic is compared against a chi-squared distribution with $2$ degrees of freedom.

In [2]:
n = 300
probs = pd.Series([0.52, 0.32, 0.16])
counts = pd.Series([148, 92, 60])
TS = round(((counts - n*probs)**2/(n*probs)).sum(),4)
p = round(1 - sp.stats.chi2.cdf(TS,2),4)

print(f'The test statistic value: {TS}')
print(f'The p-value for the test statistic: {p}')

The test statistic value: 3.5769
The p-value for the test statistic: 0.1672


The computations suggest that we would not reject $H_0$ at the 5 percent significance level.
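When the hypothesized probabilities are not all equal, the same computation can be sketched with `scipy.stats.chisquare` by passing the expected counts through its `f_exp` parameter. The variable names below are chosen for this illustration only.

```python
import numpy as np
from scipy import stats

n = 300
probs = np.array([0.52, 0.32, 0.16])   # hypothesized proportions under H0
counts = np.array([148, 92, 60])       # stable, serious, critical

# Expected counts e_i = n * p_i are 156, 96, and 48.
result = stats.chisquare(counts, f_exp=n * probs)

print(f'The test statistic value: {result.statistic:.4f}')
print(f'The p-value for the test statistic: {result.pvalue:.4f}')
```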

2. Consider an experiment having six possible outcomes whose probabilities are hypothesized to be 0.1, 0.1, 0.05, 0.4, 0.2, and 0.15. This is to be tested by performing 60 independent replications of the experiment. If the six outcomes occur 4, 3, 7, 17, 16, and 13 times, respectively, should the hypothesis be rejected? Use the 5 percent level of significance.

## Problem 2

In [3]:
n = 60
probs = pd.Series([0.1, 0.1, 0.05, 0.4, 0.2, 0.15])
counts = pd.Series([4, 3, 7, 17, 16, 13])
TS = round(((counts - n*probs)**2/(n*probs)).sum(),4)
p = round(1 - sp.stats.chi2.cdf(TS,5),4)

print(f'The test statistic value: {TS}')
print(f'The p-value for the test statistic: {p}')

The test statistic value: 12.6528
The p-value for the test statistic: 0.0269


We reject the null hypothesis at the 5 percent significance level.
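To see where the rejection comes from, it can help to look at each outcome's individual contribution $(N_i - e_i)^2/e_i$ to the test statistic. The following sketch breaks the sum apart; the variable names are illustrative.

```python
import numpy as np

n = 60
probs = np.array([0.1, 0.1, 0.05, 0.4, 0.2, 0.15])
counts = np.array([4, 3, 7, 17, 16, 13])

expected = n * probs                                 # e_i = n * p_i
contributions = (counts - expected) ** 2 / expected  # one term of TS per outcome
largest = int(contributions.argmax()) + 1            # 1-based index of the largest term

for i, (obs, e, c) in enumerate(zip(counts, expected, contributions), start=1):
    print(f'Outcome {i}: observed {obs}, expected {e:.0f}, contribution {c:.4f}')
print(f'Total: {contributions.sum():.4f}; largest contribution from outcome {largest}')
```

The third outcome, which was expected 3 times but observed 7 times, accounts for the largest share of the statistic.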

## Fraud Detection

Here is a sequence of 300 coin flips. We would like to study whether the sequence was the result of 300 actual coin flips, or not.

TTHHTHTTHTTTHTTTHTTTHTTHTHHTHHTHTHHTTTHHTHTHTTHTHH
TTHTHHTHTTTHHTTHHTTHHHTHHTHTTHTHTTHHTHHHTTHTHTTTHH
TTHTHTHTHTHTTHTHTHHHTTHTHTHHTHHHTHTHTTHTTHHTHTHTHT
THHTTHTHTTHHHTHTHTHTTHTTHHTTHTHHTHHHTTHHTHTTHTHTHT
HTHTHTHHHTHTHTHTHHTHHTHTHTTHTTTHHTHTTTHTHHTHHHHTTT
HHTHTHTHTHHHTTHHTHTTTHTHHTHTHTHHTHTTHTTHTHHTHTHTTT

A first attempt might be to run a chi-squared goodness-of-fit test on the numbers of heads and tails.

In [4]:
flips = 'TTHHTHTTHTTTHTTTHTTTHTTHTHHTHHTHTHHTTTHHTHTHTTHTHHTTHTHHTHTTTHHTTHHTTHHHTHHTHTTHTHTTHHTHHHTTHTHTTTHHTTHTHTHTHTHTTHTHTHHHTTHTHTHHTHHHTHTHTTHTTHHTHTHTHTTHHTTHTHTTHHHTHTHTHTTHTTHHTTHTHHTHHHTTHHTHTTHTHTHTHTHTHTHHHTHTHTHTHHTHHTHTHTTHTTTHHTHTTTHTHHTHHHHTTTHHTHTHTHTHHHTTHHTHTTTHTHHTHTHTHHTHTTHTTHTHHTHTHTTT'

heads = flips.count('H')  # number of heads among the 300 flips
tails = 300 - heads
TS = round((heads-150)**2/150 + (tails-150)**2/150,4)
p = round(1 - sp.stats.chi2.cdf(TS,1),4)

print(f'The test statistic value: {TS}')
print(f'The p-value for the test statistic: {p}')

The test statistic value: 0.0533
The p-value for the test statistic: 0.8174


With a p-value of about 0.82, the observed numbers of heads and tails are entirely consistent with a fair coin. But can we do more?

We will now conduct a chi-squared goodness-of-fit test on consecutive pairs of results. For a fair coin flipped independently, we expect $HH$, $HT$, $TH$, and $TT$ to each occur with probability $1/4$. (Note: consecutive pairs overlap, so they are not independent of one another. The assumptions behind the chi-squared goodness-of-fit test are therefore only approximately satisfied, and the p-value should be read as a rough guide.)

In [5]:
# Collect every overlapping pair of consecutive flips (299 pairs in 300 flips)
pairs = [flips[i:i+2] for i in range(len(flips) - 1)]
hh = pairs.count('HH')
ht = pairs.count('HT')
th = pairs.count('TH')
tt = len(pairs) - hh - ht - th
expected = len(pairs)/4  # each pair has probability 1/4 under the null hypothesis
TS = round(sum((count - expected)**2/expected for count in (hh, ht, th, tt)),4)
p = round(1 - sp.stats.chi2.cdf(TS,3),12)

print(f'The test statistic value: {TS}')
print(f'The p-value for the test statistic: {p}')

The test statistic value: 39.796
The p-value for the test statistic: 1.1771e-08


With a p-value on the order of $10^{-8}$, this result suggests that the observed mix of consecutive pairs is extremely unlikely to have come from 300 genuine flips of a fair coin.
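Because consecutive pairs overlap, one way to double-check this conclusion without leaning on the chi-squared approximation is to simulate many sequences of 300 fair coin flips and compute the same pair statistic for each. The sketch below does this; the helper name `pair_statistic`, the number of simulations, and the seed are assumptions made for illustration, and the threshold 39.796 is the statistic reported above.

```python
import numpy as np

def pair_statistic(flips):
    """Pair-count statistic for a 0/1 array of flips (1 = heads, 0 = tails)."""
    # Encode each overlapping pair as 2*first + second: TT=0, TH=1, HT=2, HH=3
    codes = 2 * flips[:-1] + flips[1:]
    counts = np.bincount(codes, minlength=4)
    expected = (len(flips) - 1) / 4       # 299/4 when there are 300 flips
    return ((counts - expected) ** 2 / expected).sum()

rng = np.random.default_rng(0)            # fixed seed so the run is reproducible
n_sims = 2000                             # number of simulated sequences
observed_ts = 39.796                      # statistic from the sequence above

# Statistic for each simulated sequence of 300 fair, independent flips
sims = np.array([pair_statistic(rng.integers(0, 2, size=300))
                 for _ in range(n_sims)])
frac = (sims >= observed_ts).mean()

print(f'Largest simulated statistic: {sims.max():.3f}')
print(f'Fraction of simulations at or above {observed_ts}: {frac}')
```

If genuinely random sequences almost never produce a statistic this large, the simulation supports the same conclusion as the chi-squared p-value.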