# Chi-square tests for one-way tables

In a one-way table, observations are classified according to a single categorical variable.

For example, recording the blood type of each person in a sample of $100$ individuals might reveal:

In [32]:
import pandas as pd

In [33]:
blood_type_cols = ['O', 'A', 'B' , 'AB']
data = pd.DataFrame(columns = blood_type_cols)
data.loc['Observed Count'] = [43,40,12,5]

In [34]:
data

| | O | A | B | AB |
|---|---|---|---|---|
| Observed Count | 43 | 40 | 12 | 5 |


Suppose someone contends that these four blood types are equally likely. We want to test whether this sample gives strong evidence against that claim.

In [35]:
expected_count = data.iloc[0].sum()/4
data.loc['Expected Count'] = expected_count

In [36]:
data

| | O | A | B | AB |
|---|---|---|---|---|
| Observed Count | 43 | 40 | 12 | 5 |
| Expected Count | 25 | 25 | 25 | 25 |


Calculating the $\chi^2$ statistic:

In [42]:
data.loc['(Observed - Expected)^2/Expected'] = (data.loc['Observed Count'] - data.loc['Expected Count'])**2 / data.loc['Expected Count']

In [47]:
data

| | O | A | B | AB |
|---|---|---|---|---|
| Observed Count | 43 | 40 | 12 | 5 |
| Expected Count | 25 | 25 | 25 | 25 |
| (Observed - Expected)^2/Expected | 12.96 | 9.00 | 6.76 | 16.00 |


In [50]:
chi_square = data.loc['(Observed - Expected)^2/Expected'].sum()
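The statistic above can be turned into a p-value by referring it to a chi-squared distribution with $4 - 1 = 3$ degrees of freedom. The following is a minimal sketch, assuming SciPy is installed; the variable names are illustrative and not part of the original notebook. When `f_exp` is omitted, `scipy.stats.chisquare` assumes equal expected counts, which matches the null hypothesis here.

```python
from scipy import stats

observed = [43, 40, 12, 5]  # O, A, B, AB counts from the table above

# With f_exp omitted, chisquare uses equal expected counts (25 each here)
# and k - 1 = 3 degrees of freedom.
result = stats.chisquare(observed)
chi_square_stat = result.statistic
p_value = result.pvalue

# The same p-value, taken from the upper tail of the chi-squared distribution
p_manual = stats.chi2.sf(chi_square_stat, df=len(observed) - 1)

print(chi_square_stat, p_value)
```

The p-value is well below $0.001$, so this sample gives strong evidence against the claim that the four blood types are equally likely.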

**Pearson's chi-squared test**

Pearson's chi-squared test uses a measure of goodness of fit which is the sum of differences between observed and expected outcome frequencies (that is, counts of observations), each squared and divided by the expectation:

$${\displaystyle \chi ^{2}=\sum _{i=1}^{n}{\frac {(O_{i}-E_{i})^{2}}{E_{i}}}}$$
where:

- $n$ = the number of bins,
- $O_i$ = the observed count for bin $i$, and
- $E_i$ = the expected count for bin $i$, asserted by the null hypothesis.

The expected frequency is calculated by:

$${\displaystyle E_{i}\,=\,{\bigg (}F(X_{u})\,-\,F(X_{l}){\bigg )}\,N}, $$
where:

- $F$ = the cumulative distribution function for the probability distribution being tested,
- $X_u$ = the upper limit for class $i$,
- $X_l$ = the lower limit for class $i$, and
- $N$ = the sample size.

The resulting value can be compared with a chi-squared distribution to determine the goodness of fit. The chi-squared distribution has $(k − c)$ degrees of freedom, where $k$ is the number of non-empty cells and $c$ is the number of estimated parameters (including location and scale parameters and shape parameters) for the distribution plus one.

In our blood-type case, no parameters are estimated, so the degrees of freedom equal the number of cells minus $1$: $4 - 1 = 3$.

The expected count formula may look confusing at first, so consider the age $X$ of a certain population, assumed to be normally distributed. In this scenario, the expected number of people out of $100$ who fall in the age bin $18-35$ is $[P(X\leq 35) - P(X\leq 18)] \cdot 100$.

So we can use a $\chi^2$ test to test the null hypothesis that the data come from a specific parametric distribution (e.g. binomial, Poisson, normal).
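The age example can be sketched numerically. The mean and standard deviation below are hypothetical values chosen only for illustration (the text does not specify them); the expected count follows the formula $E_i = (F(X_u) - F(X_l))\,N$ with $F$ taken as the normal CDF.

```python
from scipy.stats import norm

# Hypothetical parameters for the age distribution (assumptions, not from the text)
mean_age = 40.0
sd_age = 15.0
sample_size = 100

lower, upper = 18, 35  # limits of the age bin

# E_i = (F(X_u) - F(X_l)) * N
bin_probability = (norm.cdf(upper, loc=mean_age, scale=sd_age)
                   - norm.cdf(lower, loc=mean_age, scale=sd_age))
expected_in_bin = bin_probability * sample_size

print(round(expected_in_bin, 2))
```

With these assumed parameters, roughly $30$ of the $100$ people would be expected to fall in the bin; different parameters give a different expected count.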


For example, to test the hypothesis that a random sample of $100$ people has been drawn from a population in which men and women are equal in frequency, the observed number of men and women would be compared to the theoretical frequencies of $50$ men and $50$ women. If there were $44$ men in the sample and $56$ women, then

$${\displaystyle \chi ^{2}={(44-50)^{2} \over 50}+{(56-50)^{2} \over 50}=1.44} $$

If the null hypothesis is true (i.e., men and women are chosen with equal probability in the sample), the test statistic will be drawn from a chi-squared distribution with one degree of freedom. Though one might expect two degrees of freedom (one each for the men and women), we must take into account that the total number of men and women is constrained $(100)$, and thus there is only one degree of freedom $(2 − 1)$. In other words, if the male count is known the female count is determined, and vice versa.

Consulting the chi-squared distribution for $1$ degree of freedom shows that the probability of observing this difference (or a more extreme one) if men and women are equally numerous in the population is approximately $0.23$. This probability is higher than the conventional criteria for statistical significance ($0.001$ to $0.05$), so normally we would not reject the null hypothesis that the number of men in the population is the same as the number of women (i.e. we would consider our sample within the range of what we'd expect for a $50/50$ male/female ratio).
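The arithmetic in this example can be checked with `scipy.stats.chisquare`. This is a small sketch assuming SciPy is installed; the variable names are illustrative.

```python
from scipy import stats

observed = [44, 56]   # men, women in the sample
expected = [50, 50]   # counts under the 50/50 null hypothesis

# Both lists sum to 100, as chisquare requires
stat, p_value = stats.chisquare(observed, f_exp=expected)

print(round(stat, 2), round(p_value, 2))
```

The statistic is $1.44$ and the p-value is about $0.23$, matching the values quoted in the text.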

Note the assumption that the mechanism that has generated the sample is random, in the sense of independent random selection with the same probability, here $0.5$ for both males and females. If, for example, each of the 44 males selected brought a male buddy, and each of the 56 females brought a female buddy, each ${\textstyle {(O_{i}-E_{i})}^{2}}$ will increase by a factor of $4$, while each ${\textstyle E_{i}}$ will increase by a factor of $2$. The value of the statistic will double to $2.88$. Knowing this underlying mechanism, we should of course be counting pairs. In general, the mechanism, if not defensibly random, will not be known. The distribution to which the test statistic should be referred may, accordingly, be very different from chi-squared.
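The effect of the "buddy" mechanism can be illustrated by doubling every count. The sketch below (assuming SciPy is installed; names are illustrative) shows that the statistic doubles even though no new independent information was collected.

```python
from scipy import stats

# Original sample: 44 men and 56 women against a 50/50 expectation
stat_single, p_single = stats.chisquare([44, 56], f_exp=[50, 50])

# Each person brings a same-sex buddy: every count doubles
stat_pairs, p_pairs = stats.chisquare([88, 112], f_exp=[100, 100])

print(stat_single, stat_pairs)
print(p_single, p_pairs)
```

The doubled statistic yields a smaller p-value, which would overstate the evidence if the dependence between buddies were ignored.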

In [54]:
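# Chi-square one-way test in Python

from scipy import stats  # import the stats submodule explicitly

observed_values = [18, 21, 16, 7, 15]
expected_values = [22, 19, 44, 8, 16]

# Note: the observed counts sum to 77 and the expected counts to 109.
# Recent SciPy releases may raise a ValueError when the two totals differ,
# so the result shown below likely comes from an older release.
stats.chisquare(observed_values, f_exp=expected_values)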


Power_divergenceResult(statistic=18.94348086124402, pvalue=0.0008062955548480186)

# Chi-square tests for two-way (or higher) tables (chi-square tests of independence)

Suppose there is a city of $1,000,000$ residents with four neighborhoods: A, B, C, and D. A random sample of $650$ residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as:

In [55]:
import numpy as np
import pandas as pd
import scipy.stats as stats

cols = ['A', 'B', 'C', 'D']
data = pd.DataFrame(columns=cols)

data.loc['White Collar'] = [90, 60, 104, 95]
data.loc['Blue Collar'] = [30, 50, 51, 20]
data.loc['No collar'] = [30, 40, 45, 35]

data

| | A | B | C | D |
|---|---|---|---|---|
| White Collar | 90 | 60 | 104 | 95 |
| Blue Collar | 30 | 50 | 51 | 20 |
| No collar | 30 | 40 | 45 | 35 |


Of the $650$ people sampled, $150$ live in neighborhood A, so we take $150/650$ as an estimate of the proportion of the whole $1,000,000$ who live in neighborhood A. Similarly, we take $349/650$ to estimate the proportion of the $1,000,000$ who are white-collar workers. Under the null hypothesis of independence, we should "expect" the number of white-collar workers in neighborhood A to be

$${\displaystyle 150\times {\frac {349}{650}}\approx 80.54}$$

Then in that "cell" of the table, we have

$${\displaystyle {\frac {\left({\text{observed}}-{\text{expected}}\right)^{2}}{\text{expected}}}={\frac {\left(90-80.54\right)^{2}}{80.54}}\approx 1.11}$$

The sum of these quantities over all of the cells is the test statistic; in this case, $\approx 24.6$. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is

$${\displaystyle ({\text{number of rows}}-1)({\text{number of columns}}-1)=(3-1)(4-1)=6}$$

If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.
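The steps above (expected counts from the row and column totals, the per-cell contributions, and the degrees of freedom) can be sketched by hand with NumPy. This is an illustrative calculation, assuming NumPy and SciPy are installed; the variable names are not from the original notebook.

```python
import numpy as np
from scipy import stats

# Rows: white collar, blue collar, no collar; columns: neighborhoods A-D
observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]], dtype=float)

n = observed.sum()                 # 650 residents sampled
row_totals = observed.sum(axis=1)  # totals per occupation
col_totals = observed.sum(axis=0)  # totals per neighborhood

# Expected count per cell under independence: row total * column total / n
expected = np.outer(row_totals, col_totals) / n

# Sum of (observed - expected)^2 / expected over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# (number of rows - 1) * (number of columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution
p_value = stats.chi2.sf(chi2_stat, dof)

print(round(expected[0, 0], 2), round(chi2_stat, 2), dof, p_value)
```

The expected count for white-collar workers in neighborhood A comes out to about $80.54$ and the statistic to about $24.57$ with $6$ degrees of freedom, in line with the figures in the text.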

In [62]:
# chi2_contingency returns the test statistic, p-value, degrees of freedom, and expected counts
V, p, dof, expected = stats.chi2_contingency(data)
print('P value for effect of area on proportion of each collar:')
print(p)
print('\nDegrees of freedom:', dof)
print('\nExpected numbers if area did not affect proportion of each collar:')
print(expected)

P value for effect of area on proportion of each collar:
0.0004098425861096696

Degrees of freedom: 6

Expected numbers if area did not affect proportion of each collar:
[[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]


A related issue is a test of homogeneity. Suppose that instead of giving every resident of each of the four neighborhoods an equal chance of inclusion in the sample, we decide in advance how many residents of each neighborhood to include. Then each resident has the same chance of being chosen as do all residents of the same neighborhood, but residents of different neighborhoods would have different probabilities of being chosen if the four sample sizes are not proportional to the populations of the four neighborhoods. In such a case, we would be testing "homogeneity" rather than "independence". The question is whether the proportions of blue-collar, white-collar, and no-collar workers in the four neighborhoods are the same. However, the test is done in the same way.
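For the homogeneity question, the quantities being compared are the occupation proportions within each neighborhood. A minimal sketch of those column-wise proportions, assuming NumPy is installed and reusing the counts from the table above (names are illustrative):

```python
import numpy as np

# Rows: white collar, blue collar, no collar; columns: neighborhoods A-D
observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]], dtype=float)

# Divide each column by its total so every column sums to 1
col_totals = observed.sum(axis=0)
proportions = observed / col_totals

print(np.round(proportions, 3))
```

Under homogeneity, each row of `proportions` would be roughly constant across the four neighborhoods; the chi-square statistic measures how far the table departs from that pattern.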

**Assumptions of the chi-square test**

The chi-square test for independence is appropriate when the following conditions are met:

- The sampling method is simple random sampling.
- The variables under study are each categorical.
- If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least $5$.
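The last condition can be checked before running the test. The sketch below assumes SciPy is installed and uses `scipy.stats.contingency.expected_freq` to compute the expected counts for the neighborhood table; the variable names are illustrative.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

# Expected counts under independence, computed from the marginal totals
expected = expected_freq(observed)

# The rule of thumb asks for an expected count of at least 5 in every cell
meets_rule = bool((expected >= 5).all())

print(expected.min(), meets_rule)
```

Here the smallest expected count is about $34.6$, so the condition is comfortably satisfied for this table.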