In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import scipy.stats as stats
%matplotlib inline

### Test for Independence

**Assumptions for the chi-square test:**

* Both variables are categorical.
* Each cell in the contingency table has an expected frequency of at least 5.
* The observations come from a random sample of the population.
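The expected-frequency assumption can be checked in code before interpreting a test. The sketch below is only an illustration: the helper name `expected_counts_ok` and both tables are hypothetical and are not part of the examples in this notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

def expected_counts_ok(table, minimum=5):
    """Return True if every expected cell count is at least `minimum`."""
    # chi2_contingency returns (statistic, p-value, dof, expected counts)
    _, _, _, expected = chi2_contingency(np.asarray(table))
    return bool((expected >= minimum).all())

# Hypothetical tables, chosen only to show both outcomes
large_table = [[25, 35], [30, 20]]   # all expected counts are well above 5
small_table = [[3, 2], [1, 4]]       # expected counts fall below 5

print(expected_counts_ok(large_table))
print(expected_counts_ok(small_table))
```

When the check fails, the chi-square approximation may be unreliable, so the resulting p-value should be treated with caution.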

**Example 1:** A retailer wants to check if product preference (electronics or clothing) is independent of customer gender (male or female).


In [2]:
# Create data
data = {'Gender': ['Male', 'Female'], 'Electronics': [40, 70], 'Clothing': [60, 30]}
df = pd.DataFrame(data).set_index('Gender')
df

Gender,Electronics,Clothing
Male,40,60
Female,70,30


Testing the null hypothesis

>$H_0:$ Product preference is independent of gender

against the alternative hypothesis

>$H_1:$ Product preference depends on gender

In [3]:
# Test the hypothesis using the p-value
from scipy.stats import chi2_contingency
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2_stat, p_val, dof, expected = chi2_contingency(df)
p_val

3.7579211012441855e-05

In [4]:
expected

array([[55., 45.],
       [55., 45.]])

In [5]:
dof

1

Since the p-value (about 3.76e-05) is less than the significance level alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that product preference depends on gender.

In [6]:
# Test the hypothesis using the critical value
from scipy.stats import chi2
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)
critical_value

3.841458820694124

In [8]:
chi2_stat

16.98989898989899

Since chi2_stat (about 16.99) is greater than the critical value (about 3.84), we reject the null hypothesis. This agrees with the decision from the p-value approach.
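To make the numbers in Example 1 less of a black box, the following sketch recomputes the expected counts, the test statistic, and the p-value by hand. It assumes the default behaviour of `chi2_contingency`, which applies Yates' continuity correction to 2x2 tables; the variable names (`expected_manual`, `stat_yates`, and so on) are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[40, 60],   # Male: Electronics, Clothing
                     [70, 30]])  # Female: Electronics, Clothing

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / observed.sum()

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
stat_yates = (((np.abs(observed - expected_manual) - 0.5) ** 2) / expected_manual).sum()

# Uncorrected Pearson statistic, for comparison
stat_plain = (((observed - expected_manual) ** 2) / expected_manual).sum()

# Degrees of freedom, upper-tail p-value, and critical value at alpha = 0.05
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(stat_yates, dof_manual)
critical = chi2.ppf(1 - 0.05, dof_manual)

# Reference values from SciPy (correction=True is the default)
stat_scipy, p_scipy, _, _ = chi2_contingency(observed)

print(stat_yates, stat_plain, p_manual, stat_yates > critical)
```

The corrected statistic reproduces the 16.99 shown above, while the uncorrected Pearson statistic is about 18.18; passing `correction=False` to `chi2_contingency` would return the latter. Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision.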

**Example 2:** A researcher wants to check if smoking habits (smoker, non-smoker) are independent of education level (high school, undergraduate, postgraduate).

In [9]:
# Data
data = {'Education': ['High School', 'Undergraduate', 'Postgraduate'],
        'Smoker': [30, 40, 20], 'Non-Smoker': [70, 60, 80]}
df = pd.DataFrame(data).set_index('Education')
df

Education,Smoker,Non-Smoker
High School,30,70
Undergraduate,40,60
Postgraduate,20,80


Testing the null hypothesis

>$H_0:$ Smoking habits are independent of education level

against the alternate hypothesis

>$H_1:$ Smoking habits depend on education level

In [10]:
# Test the hypothesis using the p-value
chi2_stat, p_value, dof, expected = chi2_contingency(df)
p_value

0.008549309479686051

In [11]:
expected

array([[30., 70.],
       [30., 70.],
       [30., 70.]])

Since the p-value (about 0.0085) is less than alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that smoking habits depend on education level.
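Every example in this notebook follows the same steps: build the table, run `chi2_contingency`, and compare the p-value with alpha. A small helper can wrap that procedure. The function name `independence_decision` and its return format are assumptions made for this sketch, not part of SciPy.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def independence_decision(table, alpha=0.05):
    """Run a chi-square test of independence and report the decision."""
    stat, p_value, dof, _ = chi2_contingency(table)
    return {
        'chi2': stat,
        'p_value': p_value,
        'dof': dof,
        'reject_h0': bool(p_value < alpha),  # True means evidence of dependence
    }

# Same counts as Example 2
smoking = pd.DataFrame(
    {'Smoker': [30, 40, 20], 'Non-Smoker': [70, 60, 80]},
    index=['High School', 'Undergraduate', 'Postgraduate'],
)

result = independence_decision(smoking)
print(result['reject_h0'], result['dof'])
```

Returning a dictionary keeps the statistic, p-value, and degrees of freedom available for reporting alongside the decision.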

**Example 3:** A company wants to determine if internet usage frequency (daily, rarely) is independent of age group (under 18, 18-35, over 35).

In [13]:
import pandas as pd
data = {'Age Group': ['Under 18', '18-35', 'Over 35'], 'Daily': [120, 200, 90], 'Rarely': [30, 50, 80]}
df = pd.DataFrame(data).set_index('Age Group')
df

Age Group,Daily,Rarely
Under 18,120,30
18-35,200,50
Over 35,90,80


Testing the null hypothesis

>$H_0:$ Internet usage frequency is independent of age group

against the alternate hypothesis

>$H_1:$ Internet usage frequency depends on age group

In [14]:
# Test the hypothesis using the p-value
chi2_stat, p_value, dof, expected = chi2_contingency(df)
p_value

4.0361504913600605e-10

In [15]:
expected

array([[107.89473684,  42.10526316],
       [179.8245614 ,  70.1754386 ],
       [122.28070175,  47.71929825]])

In [16]:
dof

2

Since the p-value (about 4.04e-10) is less than alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that internet usage frequency depends on age group.

**Example 4:** The HR department wants to determine if there’s an association between employee department (Sales, Engineering, HR) and job satisfaction level (Satisfied, Dissatisfied). They want to know if satisfaction levels vary significantly across departments.

In [17]:
#Data
import pandas as pd
data = {'Department': ['Sales', 'Engineering', 'HR'], 'Satisfied': [90, 110, 80], 'Dissatisfied': [30, 40, 20]}
df = pd.DataFrame(data).set_index('Department')


In [18]:
df

Department,Satisfied,Dissatisfied
Sales,90,30
Engineering,110,40
HR,80,20


Testing the null hypothesis

>$H_0:$ Job satisfaction level is independent of employee department.

against the alternate hypothesis

>$H_1:$ Job satisfaction level depends on employee department.


In [19]:
# Test the hypothesis using the p-value
chi2_stat, p_value, dof, expected = chi2_contingency(df)
p_value

0.47408794626106643

In [20]:
expected

array([[ 90.81081081,  29.18918919],
       [113.51351351,  36.48648649],
       [ 75.67567568,  24.32432432]])

In [21]:
dof

2

Since the p-value (about 0.474) is greater than alpha = 0.05, we fail to reject the null hypothesis. The data do not provide sufficient evidence that job satisfaction depends on employee department. This is not proof of independence, only a lack of evidence against it.

**Example 5:** Researchers want to investigate whether blood type (A, B, AB, O) is associated with disease presence (Yes, No). They hypothesize that disease occurrence may vary among different blood types.

In [22]:
import pandas as pd
data = {'Blood Type': ['A', 'B', 'AB', 'O'], 'Disease Yes': [45, 35, 20, 50], 'Disease No': [155, 65, 30, 100]}
df = pd.DataFrame(data).set_index('Blood Type')


In [23]:
df

Blood Type,Disease Yes,Disease No
A,45,155
B,35,65
AB,20,30
O,50,100


Testing the null hypothesis

>$H_0:$ Disease presence is independent of blood type.

against the alternate hypothesis

>$H_1:$  Disease presence depends on blood type.

In [24]:
# Test the hypothesis using the p-value
chi2_stat, p_value, dof, expected = chi2_contingency(df)
p_value

0.021081101112631517

In [25]:
expected

array([[ 60., 140.],
       [ 30.,  70.],
       [ 15.,  35.],
       [ 45., 105.]])

In [26]:
dof

3

Since the p-value (about 0.021) is less than alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that disease presence depends on blood type.
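The degrees of freedom reported in each example follow from the shape of the contingency table: (rows - 1) * (columns - 1). As a quick check, the sketch below applies that formula to the blood type table from Example 5 and compares it with the value SciPy returns. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from Example 5: columns are Disease Yes, Disease No
blood = np.array([[45, 155],   # A
                  [35, 65],    # B
                  [20, 30],    # AB
                  [50, 100]])  # O

# Degrees of freedom for a test of independence
n_rows, n_cols = blood.shape
dof_formula = (n_rows - 1) * (n_cols - 1)

_, _, dof_scipy, _ = chi2_contingency(blood)
print(dof_formula, dof_scipy)
```

The same formula gives 1 for the 2x2 table in Example 1 and 2 for the 3x2 tables in Examples 2 to 4.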

**Example 6:**  A consulting firm wants to determine whether education level (High School, Bachelor’s, Master’s, PhD) has an impact on career progression (Promoted, Not Promoted) across different departments and genders.

In [27]:
#data
data = {
    'Department': ['Sales', 'Sales', 'Engineering', 'HR', 'Marketing', 'IT', 'Finance'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'Education': ['High School', 'Bachelor\'s', 'Master\'s', 'PhD', 'High School', 'Bachelor\'s', 'Master\'s'],
    'Promoted': [20, 35, 50, 15, 30, 45, 40],
    'Not Promoted': [60, 65, 50, 25, 50, 55, 40]
}

df = pd.DataFrame(data)
df

,Department,Gender,Education,Promoted,Not Promoted
0,Sales,Male,High School,20,60
1,Sales,Female,Bachelor's,35,65
2,Engineering,Male,Master's,50,50
3,HR,Female,PhD,15,25
4,Marketing,Female,High School,30,50
5,IT,Male,Bachelor's,45,55
6,Finance,Male,Master's,40,40


Testing the null hypothesis

>$H_0:$ Education level and career progression are independent

against the alternate hypothesis

>$H_1:$ Education level and career progression are dependent

In [28]:
# Test the hypothesis using the p-value
# Note: Education values repeat, so each of the seven rows is treated as its own group
df = df.drop(columns=['Department', 'Gender'])
contingency_table = df.set_index('Education')
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
p_value

0.008378023801285984

In [29]:
expected

array([[32.4137931 , 47.5862069 ],
       [40.51724138, 59.48275862],
       [40.51724138, 59.48275862],
       [16.20689655, 23.79310345],
       [32.4137931 , 47.5862069 ],
       [40.51724138, 59.48275862],
       [32.4137931 , 47.5862069 ]])

In [30]:
dof

6

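Since the p-value (about 0.0084) is less than alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that education level and career progression are dependent. Note that the table has seven rows because education levels repeat across department and gender groups, which is why dof = 6; the test as run compares those seven groups rather than the four education levels.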
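If the question is strictly about the four education levels, one option is to sum the counts over rows that share an education level before running the test. The sketch below does this with `groupby`. It assumes that pooling across departments and genders is acceptable for the analysis, which the original example does not state; the name `by_education` is introduced here.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Same data as Example 6
data = {
    'Department': ['Sales', 'Sales', 'Engineering', 'HR', 'Marketing', 'IT', 'Finance'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'Education': ['High School', "Bachelor's", "Master's", 'PhD',
                  'High School', "Bachelor's", "Master's"],
    'Promoted': [20, 35, 50, 15, 30, 45, 40],
    'Not Promoted': [60, 65, 50, 25, 50, 55, 40],
}
df = pd.DataFrame(data)

# Assumption: pool counts across departments and genders, one row per education level
by_education = df.groupby('Education')[['Promoted', 'Not Promoted']].sum()

chi2_stat, p_value, dof, expected = chi2_contingency(by_education)
print(by_education)
print(dof, p_value)
```

With four education levels and two outcomes, the degrees of freedom drop from 6 to 3, and the hypotheses stated above then match the table being tested.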

**Example 7:** A company wants to determine if customer preferences between Brand A and Brand B are independent of the city in which they live. They survey customers in three cities (City 1, City 2, City 3) and record their preferences for either Brand A or Brand B.

In [31]:
#Data
data = {
    'City': ['City 1', 'City 2', 'City 3'],
    'Brand A': [150, 250, 300],
    'Brand B': [200, 150, 100]
}

df = pd.DataFrame(data)
df

,City,Brand A,Brand B
0,City 1,150,200
1,City 2,250,150
2,City 3,300,100


Testing the null hypothesis

>$H_0:$  Customer preference for Brand A or Brand B is independent of the city.

against the alternate hypothesis

>$H_1:$ Customer preference for Brand A or Brand B depends on the city.

In [32]:
# Test the hypothesis using the p-value
# Drop the label column so that only the counts are passed to the test
contingency_table = df.drop(columns=['City'])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
p_value

1.85813982196859e-18

In [33]:
expected

array([[213.04347826, 136.95652174],
       [243.47826087, 156.52173913],
       [243.47826087, 156.52173913]])

In [34]:
dof

2

Since the p-value (about 1.86e-18) is less than alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that customer preference for Brand A or Brand B depends on the city.

**Example 8:** A health organization conducts a survey across three different age groups (18-29, 30-49, and 50+) to determine if there is an association between age and preference for different types of exercise (e.g., Gym, Yoga, and Outdoor activities). The data is collected as follows:

In [35]:
#Data
data = {
    'Age Group': ['18-29', '30-49', '50+'],
    'Gym': [120, 100, 40],
    'Yoga': [60, 110, 90],
    'Outdoor': [80, 70, 70]
}

df = pd.DataFrame(data)
df

,Age Group,Gym,Yoga,Outdoor
0,18-29,120,60,80
1,30-49,100,110,70
2,50+,40,90,70


Testing the null hypothesis

>$H_0:$ Age group and exercise preference are independent

against the alternate hypothesis

>$H_1:$ Age group and exercise preference are dependent

In [36]:
# Test the hypothesis using the p-value
# Drop the label column so that only the counts are passed to the test
contingency_table = df.drop(columns=['Age Group'])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
p_value

7.084492437489176e-09

In [37]:
expected

array([[91.35135135, 91.35135135, 77.2972973 ],
       [98.37837838, 98.37837838, 83.24324324],
       [70.27027027, 70.27027027, 59.45945946]])

In [38]:
dof

4

Since the p-value (about 7.08e-09) is less than alpha = 0.05, we reject the null hypothesis. There is statistically significant evidence that age group and exercise preference are dependent.
