In [1]:

import numpy as np
import scipy.stats as st
import pandas as pd

A die is rolled 120 times and the following results are obtained:
Face 1: 22 times
Face 2: 17 times
Face 3: 20 times
Face 4: 26 times
Face 5: 22 times
Face 6: 13 times
Test at a 5% level of significance whether the die is fair.

In [2]:
oi = np.array([22,17,20,26,22,13])
ei = np.array([20,20,20,20,20,20])

df = 6-1          # degrees of freedom: 6 faces - 1
chi_tbl = 11.070  # chi-square critical value at the 5% level for df = 5

In [3]:
chi_squar = sum((oi-ei)**2/ei)
chi_squar

np.float64(5.1000000000000005)
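The notebook stops at the statistic without stating a conclusion. As a cross-check, the sketch below reruns the die test with `scipy.stats.chisquare` and compares the result with a critical value from `scipy.stats.chi2.ppf` instead of a printed table. The names `alpha`, `dof`, `critical`, and `reject_h0` are illustrative and not part of the original cells.

```python
import numpy as np
import scipy.stats as st

oi = np.array([22, 17, 20, 26, 22, 13])   # observed counts per face
ei = np.full(6, oi.sum() / 6)              # fair die: 120 / 6 = 20 expected per face

alpha = 0.05                               # 5% significance level
dof = len(oi) - 1                          # 6 faces - 1
critical = st.chi2.ppf(1 - alpha, dof)     # upper-tail critical value

stat, p_value = st.chisquare(oi, f_exp=ei)
reject_h0 = stat > critical

print(f"chi2 = {stat:.3f}, critical = {critical:.3f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Because the statistic (5.1) is below the critical value (about 11.07), the data give no evidence at the 5% level that the die is unfair.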

A study was conducted to investigate whether there is a relationship between
gender and the preferred genre of music. A sample of 235 people was selected, and
the data collected is shown below. Test at a 5% level of significance whether there is
a significant association between gender and music preference.

In [4]:
df = pd.DataFrame(
    {"pop": [40, 35], "hip hop": [45, 30], "classical": [25, 20], "rock": [10, 30]},
    index=["male", "female"],
)
df

        pop  hip hop  classical  rock
male     40       45         25    10
female   35       30         20    30


In [5]:
no_sample = 235
male = np.array(df.loc["male"])
female = np.array(df.loc["female"])
print(male,female)


[40 45 25 10] [35 30 20 30]


In [6]:
sum_male = np.sum(male)
sum_female = np.sum(female)
sum_row= np.array([sum_male,sum_female])
sum_row

array([120, 115])

In [7]:
col_sum = male+female
col_sum

array([75, 75, 45, 40])

In [8]:
expected_value = []
for i in sum_row:
    for j in col_sum:
        expected_value.append(i*j/no_sample)
expected_value

[np.float64(38.297872340425535),
 np.float64(38.297872340425535),
 np.float64(22.97872340425532),
 np.float64(20.425531914893618),
 np.float64(36.702127659574465),
 np.float64(36.702127659574465),
 np.float64(22.02127659574468),
 np.float64(19.574468085106382)]
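The nested loop above computes each expected count as (row total × column total) / grand total. A minimal vectorized sketch of the same calculation uses `np.outer`, which keeps the 2 × 4 shape of the table; the names `observed`, `row_totals`, `col_totals`, and `expected` are illustrative.

```python
import numpy as np

# Observed counts: rows = male, female; columns = pop, hip hop, classical, rock
observed = np.array([[40, 45, 25, 10],
                     [35, 30, 20, 30]])

row_totals = observed.sum(axis=1)          # [120, 115]
col_totals = observed.sum(axis=0)          # [75, 75, 45, 40]
grand_total = observed.sum()               # 235

# Outer product gives every row total times every column total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(3))
```

Flattening the result with `expected.ravel()` gives the same eight values, in the same order, as the loop.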

In [9]:
observation_value = np.hstack((male,female))
observation_value

array([40, 45, 25, 10, 35, 30, 20, 30])

In [10]:
chi_squar_test = sum((observation_value-expected_value)**2/expected_value)
chi_squar_test

np.float64(13.788747987117553)
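To finish the test of independence, the statistic needs a critical value with (rows - 1) × (columns - 1) = 3 degrees of freedom. The sketch below cross-checks the manual result with `scipy.stats.chi2_contingency` and states the decision; the variable names are illustrative, and `correction=False` is passed only to make explicit that no continuity correction is wanted.

```python
import numpy as np
import scipy.stats as st

# Observed counts: rows = male, female; columns = pop, hip hop, classical, rock
observed = np.array([[40, 45, 25, 10],
                     [35, 30, 20, 30]])

stat, p_value, dof, expected = st.chi2_contingency(observed, correction=False)
critical = st.chi2.ppf(0.95, dof)          # 5% level, dof = (2 - 1) * (4 - 1)

print(f"chi2 = {stat:.3f}, dof = {dof}, critical = {critical:.3f}")
print("reject H0" if stat > critical else "fail to reject H0")
```

The statistic (about 13.79) exceeds the critical value (about 7.81), so the null hypothesis of independence is rejected: the sample suggests an association between gender and music preference at the 5% level.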

## Note on Chi-Square Test

The **Chi-Square Test** is a statistical method for categorical data that compares observed frequencies with the frequencies expected under a null hypothesis. It is commonly applied in two scenarios:

1. **Goodness-of-Fit Test**: Checks if an observed frequency distribution differs from a theoretical distribution.  
    - Example: Testing if a die is fair by comparing observed and expected frequencies for each face.

2. **Test of Independence**: Assesses whether two categorical variables are independent of each other.  
    - Example: Testing if gender and music preference are related.

### Key Steps in Performing a Chi-Square Test

- **State the hypotheses**:  
  - Null hypothesis (\(H_0\)): No association between variables or the observed distribution fits the expected distribution.
  - Alternative hypothesis (\(H_1\)): There is an association or the distributions differ.

- **Calculate the expected frequencies** based on the assumption that \(H_0\) is true.

- **Compute the test statistic**:  
  \[
  \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
  \]
  where \(O_i\) = observed frequency, \(E_i\) = expected frequency.

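- **Compare the calculated \(\chi^2\) value** with the critical value from the chi-square distribution table at the desired significance level (e.g., 5%) and the appropriate degrees of freedom: \(k - 1\) for a goodness-of-fit test with \(k\) categories, and \((r - 1)(c - 1)\) for an \(r \times c\) contingency table.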

- **Draw a conclusion**:  
  - If \(\chi^2_{calculated} > \chi^2_{table}\), reject \(H_0\).
  - Otherwise, do not reject \(H_0\).
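The decision step can be sketched as a small function. `chi_square_decision` is a hypothetical helper, not something defined in the notebook; it looks up the critical value with `scipy.stats.chi2.ppf` and also reports the p-value from `scipy.stats.chi2.sf`, since rejecting when the p-value is below the significance level is equivalent to comparing against the table value.

```python
import scipy.stats as st

def chi_square_decision(stat, dof, alpha=0.05):
    """Hypothetical helper: apply the decision rule to a chi-square statistic."""
    critical = st.chi2.ppf(1 - alpha, dof)   # table value for this alpha and dof
    p_value = st.chi2.sf(stat, dof)          # upper-tail probability of the statistic
    return {
        "critical": float(critical),
        "p_value": float(p_value),
        "reject_h0": bool(stat > critical),
    }

# Statistics computed earlier in the notebook
die = chi_square_decision(5.1, dof=5)
music = chi_square_decision(13.788747987117553, dof=3)

print("die:  ", die)
print("music:", music)
```

With the two statistics from the examples, the helper fails to reject \(H_0\) for the die and rejects it for the gender and music data.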

### Assumptions

- The data are frequencies (counts) in categories.
- Observations are independent.
- The expected frequency in each cell should be at least 5; with smaller expected counts the chi-square approximation becomes unreliable.
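The expected-frequency assumption is easy to check before running the test. The sketch below uses `scipy.stats.contingency.expected_freq` to compute the expected counts for a contingency table; `expected_counts_ok` is a hypothetical helper, and the threshold of 5 is the rule of thumb stated above.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def expected_counts_ok(observed, minimum=5):
    """Hypothetical helper: check that every expected cell count is >= minimum."""
    expected = expected_freq(np.asarray(observed))
    return bool((expected >= minimum).all()), float(expected.min())

# Gender vs. music preference table from the example
music = np.array([[40, 45, 25, 10],
                  [35, 30, 20, 30]])
music_ok, music_min = expected_counts_ok(music)

# A small made-up table where the rule of thumb fails
small = np.array([[3, 2],
                  [1, 4]])
small_ok, small_min = expected_counts_ok(small)

print(music_ok, round(music_min, 3))
print(small_ok, round(small_min, 3))
```

For the music table the smallest expected count is about 19.57, so the assumption holds; the small table is invented purely to show a failing case.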

The chi-square test is widely used in research to analyze categorical data and test hypotheses about distributions and relationships between variables.