<a href="https://colab.research.google.com/github/ms624atyale/Pandas_Stats_Data_Analysis_2025/blob/main/13_ChiSquareTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color ='red'> 💦🔥 **Chi-square tests** (카이제곱검정)

## [_1. Go to Row totals; Column totals; Total numbers_](https://github.com/ms624atyale/Data_Analysis/blob/main/RowTotals_ColumnTotals_TotalN.png)

## [_2. Go to Calculating expected frequencies_](https://github.com/ms624atyale/Data_Analysis/blob/main/Fomula_ExpectedFrequencies.png)

- Goodness-of-fit test: whether your sample data are likely to come from a specific theoretical distribution.
- Test of independence (독립성 검정): whether two categorical variables are associated.
- Two-way contingency tables (2차원 분할표) are the most common. Analyses with multi-dimensional (다차원) contingency tables, e.g., three-way (3차원) or four-way (4차원), are also possible in principle, but in practice people usually switch to log-linear models or regression models (e.g., logistic regression).
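
As a brief illustration of the first bullet (goodness of fit), the sketch below uses `scipy.stats.chisquare`. The die-roll counts are hypothetical numbers made up for this example; they are not part of the course data.

```python
from scipy.stats import chisquare

# Hypothetical data (made up for illustration): counts of each face in 120 die rolls
observed = [18, 22, 16, 25, 19, 20]

# With no f_exp argument, chisquare assumes equal expected counts (120 / 6 = 20 per face)
statistic, p_value = chisquare(observed)

print("Chi-square statistic:", statistic)
print("p-value:", p_value)
```

A large p-value here would mean the counts are compatible with a fair die. The rest of this notebook focuses on the test of independence.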

|                     |dependent v1|dependent v2|dependent v3||
|:--:                 |:--:|:--:|:--:|:--:|
|independent variable1|r1c1|r1c2|r1c3|r1 total|
|independent variable2|r2c1|r2c2|r2c3|r2 total|
|independent variable3|r3c1|r3c2|r3c3|r3 total|
|                     |c1 total|c2 total|c3 total| Total |



Contingency table (분할표) of observed frequencies (관측도수): people's preferences for different SNS (Instagram, Youtube, or Facebook) by age group

|            |Instagram|Youtube|Facebook||
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|125|119|56|300|
|In their 30s|268|147|85|500|
|In their 40s|210|75|315|600|
|            |603|341|456|1400|
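
The row totals, column totals, and grand total in the table above can be reproduced with pandas. This is a minimal sketch; the variable names (`observed`, `with_margins`, and so on) are chosen here for illustration.

```python
import pandas as pd

observed = pd.DataFrame(
    [[125, 119, 56], [268, 147, 85], [210, 75, 315]],
    index=["In their 20s", "In their 30s", "In their 40s"],
    columns=["Instagram", "Youtube", "Facebook"],
)

row_totals = observed.sum(axis=1)        # one total per age group
column_totals = observed.sum(axis=0)     # one total per SNS
grand_total = observed.to_numpy().sum()  # total number of respondents

# Append the margins so the frame looks like the table above
with_margins = observed.copy()
with_margins["Total"] = row_totals
with_margins.loc["Total"] = with_margins.sum(axis=0)

print(with_margins)
```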

How do we calculate the expected frequency (기대도수)?

**Formula**

    Expected Frequency (row i, column j) = row i total * column j total / Grand Total

    where i is the row number and j is the column number.

#**[Refer to sample .xlsx used as base for chi-square formula](https://github.com/ms624atyale/Pandas_Stats_Data_Analysis_2025/blob/main/Excel4Chi_SquareTest.xlsx)**


Estimating Expected Frequency

|            |Instagram|Youtube|Facebook||
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|300 × 603 / 1400|300 × 341 / 1400|300 × 456 / 1400|300|
|In their 30s|500 × 603 / 1400|500 × 341 / 1400|500 × 456 / 1400|500|
|In their 40s|600 × 603 / 1400|600 × 341 / 1400|600 × 456 / 1400|600|
|            |603|341|456|1400|

|            |Instagram|Youtube|Facebook||
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|129.2|73.1|97.7|300|
|In their 30s|215.4|121.8|162.9|500|
|In their 40s|258.4|146.1|195.4|600|
|            |603|341|456|1400|
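
The expected frequencies can also be computed in one step with NumPy. The sketch below multiplies every row total by every column total with `np.outer` and divides by the grand total; the variable names are illustrative.

```python
import numpy as np

observed = np.array([[125, 119, 56], [268, 147, 85], [210, 75, 315]])

row_totals = observed.sum(axis=1)     # totals per age group
column_totals = observed.sum(axis=0)  # totals per SNS
grand_total = observed.sum()

# expected[i, j] = row i total * column j total / grand total
expected = np.outer(row_totals, column_totals) / grand_total

print(np.round(expected, 1))
```

Note that the expected table keeps the same row and column totals as the observed table; only the counts inside the table are redistributed.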


1️⃣ Formula of chi^2

    chi^2 = (OFr1c1-EFr1c1)^2/EFr1c1 + (OFr1c2-EFr1c2)^2/EFr1c2 + ... + (OFr3c3-EFr3c3)^2/EFr3c3

    where OF is the observed frequency and EF is the expected frequency of a cell (e.g., r1c1 = row 1, column 1).

    ⤵️ Formula of chi-square applied:

    chi^2 = 0.14 + 28.87 + 17.81 + 12.87 + 5.22 + 37.22 + 9.08 + 34.63 + 73.16 ≈ 218.99 (terms shown rounded to two decimals)

2️⃣ Degrees of Freedom (df) = (Total number of rows - 1) * (Total number of columns - 1)

    df = (3-1) * (3-1) = 4
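
The same calculation can be carried out cell by cell in Python. This is a minimal sketch of the two formulas above; it uses `scipy.stats.chi2.sf` (the upper-tail probability of the chi-square distribution) to turn the statistic into a p-value, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[125, 119, 56], [268, 147, 85], [210, 75, 315]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# One (OF - EF)^2 / EF term per cell
contributions = (observed - expected) ** 2 / expected
chi2_statistic = contributions.sum()

# df = (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
degrees_of_freedom = (n_rows - 1) * (n_cols - 1)

# Probability of a statistic at least this large if the variables were independent
p_value = chi2.sf(chi2_statistic, degrees_of_freedom)

print(np.round(contributions, 2))
print("Chi-square statistic:", chi2_statistic)
print("Degrees of freedom:", degrees_of_freedom)
print("p-value:", p_value)
```

Printing `contributions` also shows which cells drive the result: the largest terms come from the cells where observed and expected counts are furthest apart.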

- For a given number of degrees of freedom, a **higher chi-square** statistic indicates a **greater discrepancy** between _observed and expected frequencies_ and therefore a **lower** p-value.

- A **lower p-value** means that a discrepancy this large would be **less likely to arise by random chance** if the variables were independent, which is stronger evidence that the variables are indeed associated.
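
To see the relationship between the statistic and the p-value, the sketch below evaluates `scipy.stats.chi2.sf` for a few arbitrarily chosen statistics at df = 4. The list of statistics is illustrative only.

```python
from scipy.stats import chi2

degrees_of_freedom = 4
statistics = [1, 5, 9.49, 20, 50]  # arbitrary values chosen for illustration

# Upper-tail probability for each statistic at the same df
p_values = [chi2.sf(s, degrees_of_freedom) for s in statistics]

for s, p in zip(statistics, p_values):
    print(f"chi-square = {s:>5}: p-value = {p:.4g}")
```

With df = 4, a statistic of about 9.49 corresponds to a p-value of about 0.05, so any larger statistic falls below the usual alpha of 0.05.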





## **<font color = 'red'> 🌱 🐣 Chi-Square Tests using Python codes**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows = age groups, columns = SNS platforms
observed_freq = np.array([[125, 119, 56], [268, 147, 85], [210, 75, 315]])

df = pd.DataFrame(observed_freq, index = ['In their 20s', 'In their 30s', 'In their 40s'], columns = ['Instagram', 'Youtube', 'Facebook'])

print(df)
print('\n')

# Perform chi-square test (returns statistic, p-value, df, and expected frequencies)
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

print('H0: Preference of SNS (e.g., Instagram, Youtube, Facebook) is similar among different age groups')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Define the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value < alpha:
    print("\n", "Preference of SNS (e.g., Instagram, Youtube, Facebook) is different among different age groups. (reject H0)")
else:
    print("\n", "Preference of SNS (e.g., Instagram, Youtube, Facebook) is similar among different age groups. (fail to reject H0)")

              Instagram  Youtube  Facebook
In their 20s        125      119        56
In their 30s        268      147        85
In their 40s        210       75       315


H0: Preference of SNS (e.g., Instagram, Youtube, Facebook) is similar among different age groups

 Chi-square statistic: 218.98992136307027

 p-value: 3.0923258542677724e-46

 Degrees of freedom: 4

 Expected frequencies:
 [[129.21428571  73.07142857  97.71428571]
 [215.35714286 121.78571429 162.85714286]
 [258.42857143 146.14285714 195.42857143]]

 Preference of SNS (e.g., Instagram, Youtube, Facebook) is different among different age groups. (reject H0)


### 1️⃣ **Testing English native speakers' speaking fluency with chi-square tests**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Define the contingency table: rows = age groups, columns = types of disfluency
observed_freq = np.array([[30, 85],
                          [95, 68]])

df = pd.DataFrame(observed_freq, index = ['Male/Female after Puberty', 'Kids under 5'], columns = ['Interruptions', 'Fillers'])

print('For your information, interruptions in speaking are like breaks (중단) or disruptions (방해).')
print('For your information, filler words (채움말) in speaking are like uh, um, well, like, so, OK, right, you know, I mean, sort of, kind of, basically, actually, let me see, let me think, etc.')
print('\n')

print(df)
print('\n')

# Perform chi-square test
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

# Print the results

print('H0: The type of disfluency (interruptions vs. fillers) is independent of age group.')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Define the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value < alpha:
    print("\n", "Different age groups differ in speaking fluency. (reject H0)")
else:
    print("\n", "Different age groups do not differ in speaking fluency. (fail to reject H0)")

For your information, interruptions in speaking are like breaks (중단) or disruptions (방해).
For your information, filler words (채움말) in speaking are like uh, um, well, like, so, OK, right, you know, I mean, sort of, kind of, basically, actually, let me see, let me think, etc.


                           Interruptions  Fillers
Male/Female after Puberty             30       85
Kids under 5                          95       68


H0: The type of disfluency (interruptions vs. fillers) is independent of age group.

 Chi-square statistic: 26.957080592820397

 p-value: 2.0802369135102012e-07

 Degrees of freedom: 1

 Expected frequencies:
 [[51.70863309 63.29136691]
 [73.29136691 89.70863309]]

 Different age groups differ in speaking fluency. (reject H0)
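
One detail worth knowing for 2×2 tables such as this one (df = 1): by default, `chi2_contingency` applies Yates' continuity correction, which shrinks each |observed - expected| difference by 0.5 before squaring. The sketch below compares the default with `correction=False` on the same table; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed_freq = np.array([[30, 85],
                          [95, 68]])

# Default behaviour: Yates' continuity correction is applied because df = 1
corrected = chi2_contingency(observed_freq)

# Plain Pearson chi-square, as in the hand-calculation formula
uncorrected = chi2_contingency(observed_freq, correction=False)

chi2_corrected, p_corrected = corrected[0], corrected[1]
chi2_uncorrected, p_uncorrected = uncorrected[0], uncorrected[1]

print("With Yates' correction:   ", chi2_corrected, p_corrected)
print("Without Yates' correction:", chi2_uncorrected, p_uncorrected)
```

The corrected statistic is always the smaller (more conservative) of the two, so a hand calculation with the plain formula will not match the default output for a 2×2 table.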


### 2️⃣ **Testing speakers' use of grammatical function words between native speakers of English and non-native speakers of English with chi-square tests**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows = speaker groups, columns = grammatical categories
observed_freq = np.array([[125, 59, 80], [100, 12, 50]])

df = pd.DataFrame(observed_freq, index = ['Native after puberty', 'Non-native after puberty'], columns = ['Pronouns', 'Relative Clause', 'Subordinate Clause'])

print(df)
print('\n')

# Perform chi-square test
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

print('H0: The frequency of grammatical function word use is similar between native and non-native after-puberty speakers of English')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Define the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value < alpha:
    print("\n", "The use of grammatical function words varies depending on whether the speaker is a native or non-native user of English. (reject H0)")
else:
    print("\n", "The use of grammatical function words does not vary depending on whether the speaker is a native or non-native user of English. (fail to reject H0)")

                          Pronouns  Relative Clause  Subordinate Clause
Native after puberty           125               59                  80
Non-native after puberty       100               12                  50


H0: The frequency of grammatical function word use is similar between native and non-native after-puberty speakers of English

 Chi-square statistic: 17.38783849894961

 p-value: 0.00016760186379635635

 Degrees of freedom: 2

 Expected frequencies:
 [[139.43661972  44.          80.56338028]
 [ 85.56338028  27.          49.43661972]]

 The use of grammatical function words varies depending on whether the speaker is a native or non-native user of English. (reject H0)


### 3️⃣ **Influence of different types of smoking on the occurrence of smoking-related cancers**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows = type of smoking, columns = type of cancer
observed_freq = np.array([[100, 150], [120, 160]])

df = pd.DataFrame(observed_freq, index = ['Cigarette', 'e-Cigarette'], columns = ['Lung Cancer', 'Bladder Cancer'])

print(df)
print('\n')

# Perform chi-square test
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

print('H0: The distribution of lung and bladder cancer is similar between cigarette and e-cigarette users')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Define the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value < alpha:
    print("\n", "The incidence rates (발병률) of lung cancer and bladder cancer are different between cigarette users and e-cigarette users. (reject H0)")
else:
    print("\n", "The incidence rates (발병률) of lung cancer and bladder cancer are not different between cigarette users and e-cigarette users. (fail to reject H0)")

             Lung Cancer  Bladder Cancer
Cigarette            100             150
e-Cigarette          120             160


H0: The distribution of lung and bladder cancer is similar between cigarette and e-cigarette users

 Chi-square statistic: 0.3341892019271051

 p-value: 0.5632026843691589

 Degrees of freedom: 1

 Expected frequencies:
 [[103.77358491 146.22641509]
 [116.22641509 163.77358491]]

 The incidence rates (발병률) of lung cancer and bladder cancer are not different between cigarette users and e-cigarette users. (fail to reject H0)
