# Hypothesis Testing (Chi-Square Test)

The **Chi-Square Test** is a statistical method for categorical data. It compares the observed frequencies with the frequencies expected under the null hypothesis, either to check whether a single categorical variable follows a specified distribution or to determine whether two categorical variables are associated.

---

## Types of Chi-Square Tests
1. **Chi-Square Goodness of Fit Test**  
   - Used to determine whether the distribution of a single categorical variable matches an expected distribution.

2. **Chi-Square Test of Independence**  
   - Used to test whether two categorical variables are independent of each other.

---

## Hypotheses in Chi-Square Test
- **Null Hypothesis (H₀):** There is no association between the variables (or the observed distribution fits the expected distribution).  
- **Alternative Hypothesis (H₁):** There is an association between the variables (or the observed distribution does not fit the expected distribution).

---

## Formula for Chi-Square Statistic
$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$

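Where:  
- $O_i$ = Observed frequency in category (or cell) $i$  
- $E_i$ = Expected frequency in category (or cell) $i$ under the null hypothesis

The sum runs over all categories (or over all cells of a contingency table).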
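
The formula maps almost directly onto NumPy array operations. The following is a minimal sketch, not a library function: the helper name `chi_square_statistic` is hypothetical, and the counts in the usage line are made up purely for illustration.

```python
import numpy as np


def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    if observed.shape != expected.shape:
        raise ValueError("observed and expected must have the same shape")
    if np.any(expected <= 0):
        raise ValueError("expected frequencies must be positive")
    return float(np.sum((observed - expected) ** 2 / expected))


# Illustrative (made-up) counts: 3 categories, 60 observations, equal expected counts
stat = chi_square_statistic([25, 15, 20], [20, 20, 20])
print(stat)
```

The same function works for a contingency table, because the sum is taken over every element of the arrays regardless of their shape.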

---


## Example 1: Chi-Square Goodness of Fit Test

A die is rolled 120 times to check whether it is fair, and the following results are obtained:  
- **Face 1**: 22 times  
- **Face 2**: 17 times  
- **Face 3**: 20 times  
- **Face 4**: 26 times  
- **Face 5**: 22 times  
- **Face 6**: 13 times  

### Hypotheses:
- **Null Hypothesis (H₀)**: The die is fair (each face appears with equal probability, i.e., $1/6$).  
- **Alternative Hypothesis (H₁)**: The die is not fair (the observed frequencies deviate significantly from the expected frequencies).

---

### Step 1: Calculate the Expected Frequency  
For a fair die, each face is equally likely.  

$$
\text{Expected Frequency} = \frac{120}{6} = 20
$$  

---

### Step 2: Compute the Chi-Square Statistic  
The formula for the chi-square statistic is:  

$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$  

where $O_i$ is the observed frequency, and $E_i$ is the expected frequency.

Using the data:  
- $O_1 = 22$, $O_2 = 17$, $O_3 = 20$, $O_4 = 26$, $O_5 = 22$, $O_6 = 13$  
- $E_i = 20$ for all faces

Now, calculate the contributions for each face:

$$
\frac{(22 - 20)^2}{20} = \frac{4}{20} = 0.2
$$

$$
\frac{(17 - 20)^2}{20} = \frac{9}{20} = 0.45
$$

$$
\frac{(20 - 20)^2}{20} = \frac{0}{20} = 0
$$

$$
\frac{(26 - 20)^2}{20} = \frac{36}{20} = 1.8
$$

$$
\frac{(22 - 20)^2}{20} = \frac{4}{20} = 0.2
$$

$$
\frac{(13 - 20)^2}{20} = \frac{49}{20} = 2.45
$$

Sum of all contributions:

$$
\chi^2 = 0.2 + 0.45 + 0 + 1.8 + 0.2 + 2.45 = 5.1
$$

---

### Step 3: Determine the Critical Value  
For a 5% level of significance ($\alpha = 0.05$) and $k - 1 = 6 - 1 = 5$ degrees of freedom, the critical value from the chi-square table is:  

$$
\chi^2_{0.05, 5} = 11.07
$$

---

### Step 4: Compare the Test Statistic with the Critical Value  

$$
\chi^2 = 5.1 < 11.07
$$

---

### Step 5: Conclusion  
Since the test statistic $5.1$ is less than the critical value $11.07$, we **fail to reject the null hypothesis**. At the 5% level, there is insufficient evidence to conclude that the die is not fair.


In [103]:
import numpy as np
import scipy.stats as st

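In [104]:
# Observed counts for faces 1-6
observed = np.array([22, 17, 20, 26, 22, 13])
size = len(observed)
# Under H0 each face is expected n / k times
e = observed.sum() / size
expected = np.full(size, e)
alpha = 0.05
df = size - 1  # degrees of freedom: k - 1

In [105]:
# Chi-square statistic: sum of (O - E)^2 / E
chi_cal = np.sum(np.square(observed - expected) / expected)
chi_cal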

5.1000000000000005

In [106]:
chi_table = st.chi2.ppf(1-alpha, df)
chi_table

11.070497693516351

In [107]:
if chi_cal > chi_table:
    print('Reject H0')          # evidence that the die is not fair
else:
    print('Fail to reject H0')  # insufficient evidence that the die is not fair

Fail to reject H0
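
As a cross-check on the manual computation, SciPy's `scipy.stats.chisquare` performs the goodness-of-fit test in a single call and also returns a p-value. The sketch below assumes the same die counts and the 5% significance level used above; comparing the p-value with $\alpha$ is equivalent to comparing the statistic with the critical value.

```python
import numpy as np
from scipy import stats

# Observed counts for faces 1-6 (same data as the worked example)
observed = np.array([22, 17, 20, 26, 22, 13])
# Under H0 every face is expected n / k times
expected = np.full(observed.size, observed.sum() / observed.size)

# chisquare uses k - 1 degrees of freedom by default
statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"chi2 = {statistic:.2f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```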


## Example 2: Chi-Square Test of Independence

A study was conducted to investigate whether there is a relationship between gender and the preferred genre of music. A sample of 235 people was selected, and the collected data is shown below:

|          | **Pop** | **Hip Hop** | **Classical** | **Rock** |
|----------|---------|-------------|---------------|----------|
| **Male** |   40    |      45     |      25       |    10    |
| **Female** | 35    |      30     |      20       |    30    |

---

### Hypotheses:
- **Null Hypothesis (H₀)**: Gender and music preference are independent (there is no association between them).  
- **Alternative Hypothesis (H₁)**: Gender and music preference are associated (they are not independent).

---

### Step 1: Compute the Row and Column Totals  
The expected frequency of each cell is built from the row and column totals:  

$$
E_{ij} = \frac{(R_i \times C_j)}{n}
$$  

where $E_{ij}$ is the expected frequency for the $i^{th}$ row and $j^{th}$ column, $R_i$ is the total for row $i$, $C_j$ is the total for column $j$, and $n$ is the overall sample size.

#### Row Totals:
- Male: $40 + 45 + 25 + 10 = 120$  
- Female: $35 + 30 + 20 + 30 = 115$  

#### Column Totals:
- Pop: $40 + 35 = 75$  
- Hip Hop: $45 + 30 = 75$  
- Classical: $25 + 20 = 45$  
- Rock: $10 + 30 = 40$  

Total sample size: $n = 235$
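
The expected-frequency formula $E_{ij} = R_i C_j / n$ is an outer product of the row totals and column totals divided by the sample size. A minimal sketch with NumPy, using the observed table from this example (the variable names are illustrative):

```python
import numpy as np

# Observed contingency table: rows = Male, Female; columns = Pop, Hip Hop, Classical, Rock
observed = np.array([[40, 45, 25, 10],
                     [35, 30, 20, 30]])

row_totals = observed.sum(axis=1)  # R_i
col_totals = observed.sum(axis=0)  # C_j
n = observed.sum()                 # overall sample size

# E_ij = R_i * C_j / n for every cell at once
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 1))
```

By construction, the expected table has the same row and column totals as the observed table.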

---

### Step 2: Compute the Expected Frequencies  
For example, the expected frequency for **Male-Pop** is:  

$$
E_{Male, Pop} = \frac{(120 \times 75)}{235} = 38.3
$$  

The full table of expected frequencies is:

|          | **Pop** | **Hip Hop** | **Classical** | **Rock** |
|----------|---------|-------------|---------------|----------|
| **Male** | 38.3    | 38.3        | 23.0          | 20.4     |
| **Female** | 36.7  | 36.7        | 22.0          | 19.6     |

---

### Step 3: Compute the Chi-Square Statistic  
The formula for the chi-square statistic is:  

$$
\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$  

Now, calculate the contributions for each cell:

$$
\frac{(40 - 38.3)^2}{38.3} = 0.08
$$

$$
\frac{(45 - 38.3)^2}{38.3} = 1.17
$$

$$
\frac{(25 - 23.0)^2}{23.0} = 0.17
$$

$$
\frac{(10 - 20.4)^2}{20.4} = 5.30
$$

$$
\frac{(35 - 36.7)^2}{36.7} = 0.08
$$

$$
\frac{(30 - 36.7)^2}{36.7} = 1.22
$$

$$
\frac{(20 - 22.0)^2}{22.0} = 0.18
$$

$$
\frac{(30 - 19.6)^2}{19.6} = 5.52
$$

Sum of all contributions:

$$
\chi^2 = 0.08 + 1.17 + 0.17 + 5.30 + 0.08 + 1.22 + 0.18 + 5.52 = 13.72
$$

With unrounded expected frequencies the statistic is $\chi^2 \approx 13.79$; the difference comes only from rounding and does not affect the conclusion.

---

### Step 4: Determine the Critical Value  
For a 5% level of significance ($\alpha = 0.05$) and $(2 - 1)(4 - 1) = 3$ degrees of freedom, the critical value from the chi-square table is:

$$
\chi^2_{0.05, 3} = 7.81
$$

---

### Step 5: Compare the Test Statistic with the Critical Value  

$$
\chi^2 = 13.72 > 7.81
$$

---

### Step 6: Conclusion  
Since the test statistic $13.72$ is greater than the critical value $7.81$, we **reject the null hypothesis**. There is significant evidence at the 5% level to conclude that there is an association between gender and music preference.

---

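In [108]:
observed_row1 = np.array([40, 45, 25, 10])  # Male
observed_row2 = np.array([35, 30, 20, 30])  # Female

# Flattened row by row so it lines up with the expected frequencies computed below
observed = np.concatenate([observed_row1, observed_row2])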


In [109]:
sample_size = np.sum(observed)
sample_size

235

In [110]:
row1_sum = np.sum(observed_row1)
row2_sum = np.sum(observed_row2)

row_sum = np.array([row1_sum, row2_sum])
row_sum

array([120, 115])

In [111]:
col_sum = observed_row1+observed_row2
col_sum

array([75, 75, 45, 40])

In [112]:
expected = []

for i in row_sum:
    for j in col_sum:
        expected.append((i*j)/sample_size)
        
expected

[38.297872340425535,
 38.297872340425535,
 22.97872340425532,
 20.425531914893618,
 36.702127659574465,
 36.702127659574465,
 22.02127659574468,
 19.574468085106382]

In [113]:
len(observed), len(expected)

(8, 8)

In [114]:
chi_cal = np.sum(np.square(observed-expected)/expected)
chi_cal

13.788747987117553

In [115]:
alpha = 0.05

In [116]:
no_rows = len(row_sum)  # 2 genders
no_cols = len(col_sum)  # 4 genres

no_rows, no_cols

(2, 4)

In [117]:
df = (no_rows-1) * (no_cols-1)
df

3

In [118]:
chi_table = st.chi2.ppf(1-alpha, df)
chi_table

7.814727903251179

In [119]:
if chi_cal > chi_table:
    print('Reject H0')  # evidence of an association between gender and music preference
else:
    print('Fail to reject H0')

Reject H0
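
The whole test of independence can also be cross-checked with `scipy.stats.chi2_contingency`, which computes the expected frequencies, the statistic, the degrees of freedom, and a p-value from the observed table. This is a sketch under the same data and $\alpha = 0.05$ as above. SciPy's continuity correction applies only when there is one degree of freedom (a 2×2 table), so it has no effect on this 2×4 table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table: rows = Male, Female; columns = Pop, Hip Hop, Classical, Rock
observed = np.array([[40, 45, 25, 10],
                     [35, 30, 20, 30]])

# The result unpacks as (statistic, p-value, degrees of freedom, expected table)
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.4f}, reject H0: {reject_h0}")
```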
