## Chi-Square Test

The chi-square test of independence applies when you have two categorical variables measured on a single population. It determines whether there is a statistically significant association between the two variables.


<b>Null Hypothesis:</b> There is no relationship between the two categorical variables.







#### Note: If the p-value from the test is smaller than the significance level (0.05 here), reject the null hypothesis.
<hr>

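Formula for calculating the chi-square statistic:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count and $E_i$ is the expected count in cell $i$ of the contingency table.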
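The statistic compares each observed count with the count expected if the two variables were independent. A minimal sketch of that calculation follows, assuming a small table of hypothetical counts; the function name `chi_square_statistic` is illustrative and not part of any library.

```python
import numpy as np

def chi_square_statistic(observed):
    """Pearson chi-square statistic for a 2-D table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    grand_total = observed.sum()
    # Expected count under independence: (row total * column total) / grand total
    expected = row_totals * col_totals / grand_total
    # Sum of (observed - expected)^2 / expected over all cells
    statistic = ((observed - expected) ** 2 / expected).sum()
    return statistic, expected

# Hypothetical counts, for illustration only
example_table = [[20, 30], [30, 20]]
statistic, expected = chi_square_statistic(example_table)
print(statistic)
print(expected)
```

Every expected count in this hypothetical table is 25, so each cell contributes (5² / 25) = 1 to the statistic.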

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

In [4]:
df = sns.load_dataset('tips')
df.head()

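|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |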


In [5]:
# Chi-square test to check whether the sex and smoker features are related

# Contingency table of observed counts: rows are sex, columns are smoker
df_table = pd.crosstab(df.sex, df.smoker)
df_table

| sex \ smoker | Yes | No |
|---|---|---|
| Male | 60 | 97 |
| Female | 33 | 54 |


In [6]:
df_table.values

array([[60, 97],
       [33, 54]], dtype=int64)

In [7]:
Observed_values = df_table.values

In [9]:
stats.chi2_contingency(Observed_values)

(0.008763290531773594,
 0.925417020494423,
 1,
 array([[59.84016393, 97.15983607],
        [33.15983607, 53.84016393]]))
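The returned values are, in order, the test statistic, the p-value, the degrees of freedom, and the array of expected counts. For a 2×2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is not the plain Pearson value, and the corrected value may differ between SciPy versions. The sketch below, which hard-codes the counts from the table above, turns the correction off and checks the result against a manual calculation.

```python
import numpy as np
import scipy.stats as stats

observed = np.array([[60, 97],
                     [33, 54]])

# Expected counts under independence, built from the table margins
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Plain Pearson statistic, without any continuity correction
pearson_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(np.allclose(expected, expected_manual))
print(np.isclose(stat, pearson_manual))
print(dof)
```

With or without the correction, the statistic here is far below the level needed to reject the null hypothesis.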

In [68]:
# Unpack: statistic, p-value, degrees of freedom (named ddof here), expected counts
chi2_stat, p, ddof, Expected_values = stats.chi2_contingency(Observed_values)

In [69]:
alpha = 0.05

In [70]:
# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value = stats.chi2.ppf(q=1 - alpha, df=ddof)
critical_value

3.841458820694124
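The critical value and the p-value are two views of the same distribution: `chi2.ppf` gives the quantile that leaves `alpha` in the upper tail, and `chi2.sf` gives the upper-tail area beyond a given statistic. A short sketch of that relationship, using the statistic reported above as a hard-coded input:

```python
import scipy.stats as stats

alpha = 0.05
dof = 1

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value = stats.chi2.ppf(1 - alpha, df=dof)

# The survival function (1 - CDF) at the critical value recovers alpha
tail_area = stats.chi2.sf(critical_value, df=dof)

# The p-value is the survival function evaluated at the observed statistic
chi2_stat = 0.008763290531773594  # statistic reported above
p_value = stats.chi2.sf(chi2_stat, df=dof)

print(critical_value, tail_area, p_value)
```

Because `chi2.sf` is decreasing, the statistic exceeds the critical value exactly when the p-value falls below `alpha`.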

In [71]:
# p-value
p

0.925417020494423

In [72]:
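# Method 1: compare the test statistic with the critical value
if chi2_stat >= critical_value:
    print("Reject H0: there is a significant relationship between the two categorical variables")
else:
    print("Fail to reject H0: no significant evidence of a relationship between the two categorical variables")

# Method 2: compare the p-value with the significance level
if p <= alpha:
    print("Reject H0: there is a significant relationship between the two categorical variables")
else:
    print("Fail to reject H0: no significant evidence of a relationship between the two categorical variables")

Fail to reject H0: no significant evidence of a relationship between the two categorical variables
Fail to reject H0: no significant evidence of a relationship between the two categorical variables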
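Both decision rules can be wrapped in one reusable function. The following is a sketch only: `chi_square_decision` is a hypothetical helper name, and the second table contains made-up counts chosen to show a rejection.

```python
import scipy.stats as stats

def chi_square_decision(observed, alpha=0.05):
    """Return (reject_by_critical_value, reject_by_p_value) for a table of counts."""
    chi2_stat, p, dof, _ = stats.chi2_contingency(observed)
    critical_value = stats.chi2.ppf(1 - alpha, df=dof)
    reject_by_critical = bool(chi2_stat >= critical_value)
    reject_by_p = bool(p <= alpha)
    return reject_by_critical, reject_by_p

tips_counts = [[60, 97], [33, 54]]   # sex x smoker counts from the table above
strong_counts = [[50, 5], [5, 50]]   # hypothetical counts with a clear association

tips_decision = chi_square_decision(tips_counts)
strong_decision = chi_square_decision(strong_counts)
print(tips_decision)
print(strong_decision)
```

The two flags agree in both cases, which is why either rule can be used on its own.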


In [73]:
# Either method (critical value or p-value) can be used; both lead to the same decision

### Consider another example

In [75]:
np.random.seed(10)
# Simulate 1000 voters; race and party are drawn independently of each other
vote_race = np.random.choice(a=['black', 'white', 'asian', 'other'],
                             p=[0.1, 0.2, 0.3, 0.4],
                             size=1000)
voter_party = np.random.choice(a=['democrat', 'independent', 'republican'],
                               p=[0.4, 0.2, 0.4],
                               size=1000)

In [76]:
df = pd.DataFrame({'race': vote_race,
                   'party': voter_party})
df.head()

|   | race | party |
|---|---|---|
| 0 | other | democrat |
| 1 | black | republican |
| 2 | other | independent |
| 3 | other | republican |
| 4 | asian | democrat |


In [79]:
voter_tab = pd.crosstab(df.race, df.party)
voter_tab

| race \ party | democrat | independent | republican |
|---|---|---|---|
| asian | 125 | 50 | 113 |
| black | 42 | 19 | 51 |
| other | 149 | 79 | 168 |
| white | 81 | 38 | 85 |


In [80]:
observed_values = voter_tab.values

In [81]:
chi2_stat, p, ddof, expected_values = stats.chi2_contingency(observed_values)

In [82]:
alpha = 0.05

In [83]:
critical = stats.chi2.ppf(1 - alpha, ddof)

In [84]:
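# Method 1: compare the test statistic with the critical value
if chi2_stat >= critical:
    print("Reject H0: there is a significant relationship between the two categorical variables")
else:
    print("Fail to reject H0: no significant evidence of a relationship between the two categorical variables")

# Method 2: compare the p-value with the significance level
if p <= alpha:
    print("Reject H0: there is a significant relationship between the two categorical variables")
else:
    print("Fail to reject H0: no significant evidence of a relationship between the two categorical variables")

Fail to reject H0: no significant evidence of a relationship between the two categorical variables
Fail to reject H0: no significant evidence of a relationship between the two categorical variables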
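This outcome is what the simulation should produce, since race and party were drawn independently of each other. For contrast, here is a minimal sketch in which party preference depends on group membership. The group labels, probabilities, and seed are all hypothetical, chosen only to make the dependence obvious.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(0)
n = 1000

# Hypothetical setup: two groups whose party preferences differ markedly
group = rng.choice(["A", "B"], size=n)
parties = ["democrat", "independent", "republican"]
party_probs = {"A": [0.7, 0.1, 0.2], "B": [0.2, 0.1, 0.7]}

# Each voter's party is drawn from the distribution of that voter's group
party = np.array([rng.choice(parties, p=party_probs[g]) for g in group])

dependent_tab = pd.crosstab(group, party)
chi2_stat, p, dof, expected = stats.chi2_contingency(dependent_tab.values)

print(dependent_tab)
print(p <= 0.05)
```

With preferences this different between the groups, the test rejects the null hypothesis at the 0.05 level.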


In [86]:
p, chi2_stat, critical

(0.7819369111879675, 3.2109942430787157, 12.591587243743977)
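These three numbers can be reproduced from the contingency table alone. The sketch below hard-codes the observed counts from the race × party crosstab above and shows where the degrees of freedom, the critical value, and the p-value come from.

```python
import numpy as np
import scipy.stats as stats

# Observed counts copied from the race x party crosstab above
observed = np.array([[125, 50, 113],
                     [42, 19, 51],
                     [149, 79, 168],
                     [81, 38, 85]])

chi2_stat, p, dof, expected = stats.chi2_contingency(observed)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
rows, cols = observed.shape
manual_dof = (rows - 1) * (cols - 1)

# Critical value at alpha = 0.05 and the p-value as an upper-tail area
critical = stats.chi2.ppf(0.95, df=dof)
p_from_sf = stats.chi2.sf(chi2_stat, df=dof)

print(dof, manual_dof)
print(chi2_stat, critical)
print(p, p_from_sf)
```

The table has 4 rows and 3 columns, giving 6 degrees of freedom, so no continuity correction is involved in this example.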