In [1]:
import numpy as np
from scipy import stats

# One Variable $\chi^2$ Test (Goodness of fit)

### Dataset
A researcher wants to test whether the distribution of blood types in a sample of 100 individuals matches the distribution expected under the ABO blood group system. The expected proportions, listed in the following order, are:

    [O,A,B,AB]

In [2]:
expected = np.array([ .45, .4, .11, .04])

The observed distribution of blood types among the 100 individuals is:

In [3]:
observed = np.array([.48,.32,.16,.04])

### Research Problem
Is the distribution of blood types in the population significantly different from the expected distribution based on the ABO blood group system?

### Statistical Hypotheses
$$
H_0: P_O = .45; P_A = .40; P_B = .11; P_{AB} = .04 \\
H_1: H_0 ~is~ false
$$

In [4]:
alpha = 0.05
df = 3  # k - 1 = 4 - 1, where k is the number of categories

In [5]:
Chi_critical = stats.chi2.ppf(1 - alpha, df)
Chi_critical

7.814727903251179
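As a sanity check on what the critical value means, the sketch below confirms that the area to the right of it under the $\chi^2$ distribution equals `alpha`. The names `upper_tail` and `critical_isf` are illustrative and not part of the original notebook; `stats.chi2.sf` and `stats.chi2.isf` are the survival function and its inverse.

```python
from scipy import stats

alpha = 0.05
df = 3

# Critical value as computed above: the 95th percentile of chi-square(3).
critical = stats.chi2.ppf(1 - alpha, df)

# Area in the upper tail beyond the critical value; should equal alpha.
upper_tail = stats.chi2.sf(critical, df)

# The inverse survival function gives the same critical value directly.
critical_isf = stats.chi2.isf(alpha, df)

print(critical, upper_tail, critical_isf)
```

Any observed statistic at or beyond this value therefore has an upper-tail probability of at most .05 under $H_0$.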

### Decision Rule
Reject $H_0$ at the .05 level of significance if $\chi^2 \geq 7.815$.

### Calculation

$$
f_e = 
(expected~proportion)(total~ sample~ size)
$$
![image.png](attachment:image.png)
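The formula image is not reproduced here, so the following is an assumption about what it showed: the usual Pearson statistic, which is also what the code cells below compute. Written out with the expected frequencies (45, 40, 11, 4) and observed frequencies (48, 32, 16, 4) for this dataset:

$$
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}
= \frac{(48-45)^2}{45} + \frac{(32-40)^2}{40} + \frac{(16-11)^2}{11} + \frac{(4-4)^2}{4}
\approx 0.20 + 1.60 + 2.27 + 0 = 4.07
$$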

In [6]:
f_e = expected * 100
f_o = observed * 100

In [7]:
Chi_2 = np.sum((f_o - f_e)**2/f_e)

In [8]:
Chi_2 >= Chi_critical,Chi_2

(False, 4.072727272727273)
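The manual calculation can be cross-checked against `scipy.stats.chisquare`, which takes observed and expected frequencies and returns the statistic together with a p-value. This is a minimal sketch; the counts are entered as integers rather than derived from the proportions, and the names `statistic`, `p_value`, and `reject` are illustrative.

```python
import numpy as np
from scipy import stats

# Observed and expected counts for [O, A, B, AB] in a sample of 100.
f_o = np.array([48, 32, 16, 4])
f_e = np.array([45, 40, 11, 4])

# Pearson chi-square goodness-of-fit test with k - 1 = 3 degrees of freedom.
result = stats.chisquare(f_obs=f_o, f_exp=f_e)
statistic, p_value = result.statistic, result.pvalue

# Equivalent decision rule: reject H0 when the p-value is at most alpha.
alpha = 0.05
reject = p_value <= alpha

print(statistic, p_value, reject)
```

Comparing the p-value with `alpha` leads to the same decision as comparing the statistic with the critical value.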

### Decision
Retain $H_0$ at the .05 level of significance because $\chi^2 = 4.07$ is less than the critical value of 7.815.

### Interpretation
There is no evidence that the distribution of blood types in the population differs from the distribution expected under the ABO blood group system.

# Two Variable $\chi^2$ Test

### Dataset
Imagine you have conducted a survey of 1000 people, asking them whether they prefer tea or coffee and whether they are early birds or night owls. The results are as follows:

|  | Early birds | Night owls |
|----------|------------|------------|
| Tea      | 400        | 100        |
| Coffee | 300        | 200        |


In [9]:
data = np.array([[400, 100], [300, 200]])

### Research Problem
Is there a relationship between preferred beverage (tea or coffee) and whether someone is an early bird or a night owl?

### Statistical Hypotheses
$H_0$: There is no relationship between preferred beverage and whether someone is an early bird or a night owl (the two variables are independent).

$H_1$: $H_0$ is false


In [10]:
r, c = data.shape  # shape is (rows, columns)

In [11]:
alpha = 0.05
df = (c-1)*(r-1)

In [12]:
Chi_critical = stats.chi2.ppf(1-alpha,df)
Chi_critical

3.841458820694124

### Decision Rule
Reject $H_0$ at the .05 level of significance if $\chi^2 \geq 3.841$.

### Calculation
![image.png](attachment:image.png)
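The formula image is not reproduced here; the cells below compute each expected frequency from the marginal totals, which corresponds to the standard expression under independence:

$$
f_e = \frac{(row~total)(column~total)}{grand~total}
$$

For example, the expected frequency for tea drinkers who are early birds is $(500)(700)/1000 = 350$. The statistic is then $\chi^2 = \sum (f_o - f_e)^2 / f_e$, summed over all four cells.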


In [13]:
col_tot = np.sum(data,axis=0)
row_tot = np.sum(data,axis=1)
grand_tot = np.sum(col_tot)

In [14]:
f_e = row_tot.reshape(-1,1) * col_tot / grand_tot

In [15]:
Chi_2 = np.sum((data - f_e)**2/f_e)

In [16]:
Chi_2 >= Chi_critical,Chi_2

(True, 47.61904761904762)
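The result can be cross-checked with `scipy.stats.chi2_contingency`. One caveat: for a 2×2 table this function applies Yates' continuity correction by default, so `correction=False` is passed here to reproduce the uncorrected statistic computed above. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Rows: tea, coffee. Columns: early birds, night owls.
data = np.array([[400, 100], [300, 200]])

# correction=False disables Yates' continuity correction so the statistic
# matches the manual Pearson calculation.
chi2_stat, p_value, dof, expected_freq = stats.chi2_contingency(data, correction=False)

print(chi2_stat, p_value, dof)
print(expected_freq)
```

With the default `correction=True`, the reported statistic would be somewhat smaller than the manual value.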

### Decision
Reject $H_0$ at the .05 level of significance because $\chi^2 = 47.62$ exceeds the critical value of 3.841.

### Interpretation
There is evidence of a relationship between preferred beverage and whether someone is an early bird or a night owl. In this sample, 80% of tea drinkers (400 of 500) were early birds, compared with 60% of coffee drinkers (300 of 500).