### Chi-square Test

Scenario: Tests whether two categorical variables are associated. The null hypothesis is that they are independent.

Example: Checking if gender and preferred product category are independent.

Dataset: retail_sales_dataset.csv with columns Gender, Product Category

In [3]:
from scipy.stats import chi2_contingency
import pandas as pd

# Example 1 - Product preferences data
preferences = pd.read_csv('retail_sales_dataset.csv')
data = pd.crosstab(preferences['Gender'], preferences['Product Category'])
chi2, p_value, _, _ = chi2_contingency(data)
print("Chi-square test for independence in preferences:", p_value)

Chi-square test for independence in preferences: 0.43304287262068974
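To see what `pd.crosstab` hands to `chi2_contingency`, here is a minimal sketch on a tiny made-up frame. The six rows and the category names are invented for illustration and are not taken from retail_sales_dataset.csv.

```python
import pandas as pd

# Hypothetical rows, invented purely to illustrate the reshaping step
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Product Category': ['Beauty', 'Clothing', 'Beauty',
                         'Electronics', 'Clothing', 'Beauty'],
})

# One row per gender, one column per category, each cell a count of rows
toy_table = pd.crosstab(toy['Gender'], toy['Product Category'])
print(toy_table)
```

Each cell counts how many rows share that gender and category, which is the table of observed frequencies the test works on.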


### Chi-square test for exercise preference

In [4]:
import pandas as pd
from scipy.stats import chi2_contingency

# Create a contingency table
data = {
    'Exercise Type': ['Yoga', 'Running', 'Weightlifting'],
    'Male': [30, 40, 25],
    'Female': [20, 35, 30]
}

df = pd.DataFrame(data)
contingency_table = df.set_index('Exercise Type')

# Conduct the Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-square Statistic:", chi2_stat)
print("p-value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between gender and exercise preference.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between gender and exercise preference.")


Chi-square Statistic: 2.2392344497607652
p-value: 0.3264047103162396
Degrees of Freedom: 2
Expected Frequencies:
 [[26.38888889 23.61111111]
 [39.58333333 35.41666667]
 [29.02777778 25.97222222]]
Fail to reject the null hypothesis: There is no significant association between gender and exercise preference.


### Chi-square for heart dataset

Create the Contingency Table: We arrange hypothetical counts into a table with one row per chest pain type and one column each for patients with and without heart disease.

Chi-square Test: We use the chi2_contingency function, which returns the Chi-square statistic, the p-value, the degrees of freedom, and the frequencies expected under independence given the observed counts.

Interpretation: We compare the p-value to our significance level (0.05). If the p-value is smaller, we reject the null hypothesis that chest pain type and heart disease presence are independent; otherwise we fail to reject it.
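An equivalent way to reach the same decision is to compare the statistic with a critical value from the Chi-square distribution. The sketch below is illustrative only. It reuses the statistic and degrees of freedom printed by the exercise preference example, and it assumes `scipy.stats.chi2` for the distribution functions.

```python
from scipy.stats import chi2

alpha = 0.05
chi2_stat = 2.2392344497607652  # statistic printed by the exercise example
dof = 2                         # degrees of freedom printed by the exercise example

# Critical value: the statistic must exceed this to reject at level alpha
critical_value = chi2.ppf(1 - alpha, dof)

# Upper-tail probability of the statistic, i.e. the p-value
p_value = chi2.sf(chi2_stat, dof)

print("Critical value:", critical_value)
print("p-value:", p_value)
print("Reject null hypothesis:", chi2_stat > critical_value)
```

The two rules always agree: the statistic exceeds the critical value exactly when the p-value is below alpha.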

In [5]:
import pandas as pd
from scipy.stats import chi2_contingency

# Create a contingency table
data = {
    'Chest Pain Type': ['Typical angina', 'Atypical angina', 'Non-anginal pain', 'Asymptomatic'],
    'Heart Disease (Yes)': [30, 25, 15, 5],
    'Heart Disease (No)': [10, 15, 35, 40]
}

df = pd.DataFrame(data)
contingency_table = df.set_index('Chest Pain Type')

# Conduct the Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-square Statistic:", chi2_stat)
print("p-value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between chest pain type and heart disease presence.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between chest pain type and heart disease presence.")


Chi-square Statistic: 45.070601851851855
p-value: 8.938482781557331e-10
Degrees of Freedom: 3
Expected Frequencies:
 [[17.14285714 22.85714286]
 [17.14285714 22.85714286]
 [21.42857143 28.57142857]
 [19.28571429 25.71428571]]
Reject the null hypothesis: There is a significant association between chest pain type and heart disease presence.
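The numbers reported by `chi2_contingency` for the chest pain table can be reproduced by hand, which makes it clearer what the function computes. This is a minimal sketch using NumPy on the same hypothetical counts. The `*_manual` variable names are introduced here for illustration.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the chest pain example; columns are Yes, No
observed = np.array([[30, 10],
                     [25, 15],
                     [15, 35],
                     [5, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof_manual)

# Compare against the library result
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print("Manual statistic:", chi2_manual, "library:", chi2_stat)
print("Manual p-value:", p_manual, "library:", p_value)
print("Expected frequencies match:", np.allclose(expected_manual, expected))
```

For a 2x2 table the manual statistic would not match by default, because `chi2_contingency` applies Yates' continuity correction when there is one degree of freedom unless `correction=False` is passed.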
