<a href="https://colab.research.google.com/github/naenumtou/statisticalModel/blob/main/chi_squareTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Import libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats import chi2

In [None]:
#Import dataset
df = pd.read_csv('https://raw.githubusercontent.com/naenumtou/statisticalModel/main/datasets/titanic.csv')
df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [None]:
#Keep only categorical variables
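df = df[['Sex', 'Embarked', 'PassengerId']].copy() #Copy so the assignment below does not trigger SettingWithCopyWarning
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0]) #Fill missing ports with the most frequent value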

In [None]:
#Count observation
dfGroup = df.groupby(['Sex','Embarked'])['PassengerId'].count()
dfGroup

Sex     Embarked
female  C            73
        Q            36
        S           205
male    C            95
        Q            41
        S           441
Name: PassengerId, dtype: int64

In [None]:
#Pivot table
dfPivot = df.pivot_table(index = 'Sex', columns = 'Embarked', values = 'PassengerId', aggfunc = 'count')
dfPivot

Sex / Embarked,C,Q,S
female,73,36,205
male,95,41,441
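
The same contingency table can also be built with `pd.crosstab`, which counts co-occurrences directly and so does not need `PassengerId` as a column to count. The sketch below is a minimal illustration, not the notebook's method: `counts`, `demo`, `table` and `tableTotal` are hypothetical names, and the rows are rebuilt from the grouped counts shown above only so the example runs without downloading the dataset.

```python
import pandas as pd

# Cell counts copied from the grouped output above; the rows are rebuilt from
# them only so that this sketch runs without downloading the dataset.
counts = {
    ('female', 'C'): 73, ('female', 'Q'): 36, ('female', 'S'): 205,
    ('male', 'C'): 95, ('male', 'Q'): 41, ('male', 'S'): 441,
}
rows = [(sex, port) for (sex, port), n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=['Sex', 'Embarked'])  # hypothetical stand-in for df

# crosstab counts how often each (Sex, Embarked) pair occurs
table = pd.crosstab(demo['Sex'], demo['Embarked'])

# margins=True appends the row and column totals in the same call
tableTotal = pd.crosstab(demo['Sex'], demo['Embarked'],
                         margins=True, margins_name='Total')
print(tableTotal)
```

With the real data, `pd.crosstab(df['Sex'], df['Embarked'])` should give the same counts as the pivot table, provided the missing `Embarked` values have been filled first.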


In [None]:
#Create sum of rows and columns
dfSum = dfPivot.copy()
dfSum['Total'] = dfSum.sum(axis = 1)
dfSum.loc['Total'] = dfSum.sum()
dfSum

Sex / Embarked,C,Q,S,Total
female,73,36,205,314
male,95,41,441,577
Total,168,77,646,891
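
The row and column totals are what the chi-squared test uses to work out the counts it would expect if the two variables were independent: each expected cell is (row total × column total) / grand total. The sketch below is a minimal illustration using the observed counts from the table above; the variable names are hypothetical. It also compares the result with the expected table that `chi2_contingency` returns as its fourth value, and checks a commonly cited rule of thumb that every expected count should be at least 5.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above (rows: female, male; columns: C, Q, S)
observed = np.array([[73, 36, 205],
                     [95, 41, 441]])

row_totals = observed.sum(axis=1)   # [314, 577]
col_totals = observed.sum(axis=0)   # [168, 77, 646]
n = observed.sum()                  # 891

# Expected count per cell under independence
expected_manual = np.outer(row_totals, col_totals) / n

# chi2_contingency returns the same table as its fourth value
_, _, _, expected_scipy = chi2_contingency(observed)

print(np.round(expected_manual, 2))
print('Matches SciPy:', np.allclose(expected_manual, expected_scipy))
print('Smallest expected count:', round(expected_manual.min(), 2))
```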


In [None]:
#Chi-Squared Test
chi2_test, p, dof, _ = chi2_contingency(dfPivot)
print(f'Chi-Squared test-statistic: {chi2_test:.4f}')
print(f'P-value: {p:.4f}')

Chi-Squared test-statistic: 12.9170
P-value: 0.0016
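
To see where the test statistic comes from, it can be recomputed by hand: sum (observed − expected)² / expected over all cells, with degrees of freedom (rows − 1) × (columns − 1). This is a minimal sketch under the assumption that no continuity correction is applied, which is the case here because SciPy's correction only takes effect when there is one degree of freedom. `stat_manual` and `dof_manual` are hypothetical names.

```python
import numpy as np

# Observed counts from the pivot table (rows: female, male; columns: C, Q, S)
observed = np.array([[73, 36, 205],
                     [95, 41, 441]], dtype=float)

# Expected counts under independence: (row total * column total) / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-squared statistic: squared deviations scaled by the expected counts
stat_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f'Manual chi-squared statistic: {stat_manual:.4f}')
print(f'Degrees of freedom: {dof_manual}')
```

The result should agree with the statistic printed by `chi2_contingency` above.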


# Hypothesis testing
The null hypothesis (H0) of the chi-squared test is that there is **no relationship** between the two variables, that is, they are **independent** of each other.

---
# Critical value
If the critical value is **less than or equal to** the test statistic, **reject** H0: the data suggest that the variables are related (dependent on each other).

---
If the critical value is **greater than** the test statistic, **fail to reject** H0: there is not enough evidence of a relationship between the variables. This does not prove that they are independent.

# P-value
If the p-value is **less than or equal to** alpha *(0.05)*, **reject** H0: the data suggest that the variables are related (dependent on each other).

---
If the p-value is **greater than** alpha *(0.05)*, **fail to reject** H0: there is not enough evidence of a relationship between the variables. This does not prove that they are independent.
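
The two rules are two views of the same comparison, so they always lead to the same decision: the p-value is the upper-tail probability of the test statistic, and the critical value is the point whose upper-tail probability equals alpha. The sketch below illustrates this with the statistic and degrees of freedom reported above, typed in by hand; the variable names are hypothetical.

```python
from scipy.stats import chi2

# Values reported by the test above, entered by hand for this sketch
stat = 12.9170
dof = 2
alpha = 0.05

# Critical value: the point with upper-tail probability alpha
critical = chi2.ppf(1 - alpha, dof)

# P-value: the upper-tail probability of the observed statistic
p_value = chi2.sf(stat, dof)

reject_by_critical = critical <= stat
reject_by_p = p_value <= alpha

print(f'Critical value: {critical:.4f}, p-value: {p_value:.4f}')
print(f'Reject H0 by critical value: {reject_by_critical}')
print(f'Reject H0 by p-value: {reject_by_p}')
```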

In [None]:
#Interpret result by test-statistic
prob = 0.95 #Confidence level (1 - alpha)
critical = chi2.ppf(prob, dof) #Upper-tail critical value for the given degrees of freedom
print(f'The critical value: {critical:.4f}')
#The chi-squared statistic is never negative, so abs() is not needed
if critical <= chi2_test:
  print(f'Critical value: {critical:.4f} <= Chi-Squared test-statistic: {chi2_test:.4f}')
  print('Dependent (Relationship)')
else:
  print('Independent (No relationship)')

The critical value: 5.9915
Critical value: 5.9915 <= Chi-Squared test-statistic: 12.9170
Dependent (Relationship)


In [None]:
#Interpret result by P-value
alpha = 1.0 - prob
if p <= alpha:
  print(f'P-value: {p:.4f} <= Alpha: {alpha:.4f}')
  print('Dependent (Relationship)')
else:
  print('Independent (No relationship)')

P-value: 0.0016 <= Alpha: 0.0500
Dependent (Relationship)
