## Introduction

**The chi-square test is a statistical test used to measure the relationship between two categorical variables.
It is based on hypothesis testing, so solving a problem with the chi-square test closely follows the steps of a hypothesis test.**

Because the chi-square test is one type of hypothesis test, every chi-square problem is solved with the same sequence of steps as any other hypothesis test.

To recap, the steps of a hypothesis test are:

1. Define the null hypothesis (H0).
2. Define the alternative hypothesis (H1).
3. Design the test statistic (T).
4. Using T, H0, and H1, compute the p-value.
5. If the p-value is greater than 5%, we fail to reject H0.
6. Otherwise, we reject H0 and accept H1.

We follow all of these steps when solving a problem with the chi-square test.
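For the chi-square test, the test statistic in the steps above is usually Pearson's chi-square statistic. The formulas below are given for reference; the symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count under H0), $r$ and $c$ (numbers of rows and columns of the contingency table), $N$ (total count), and $\alpha$ (significance level) are notation introduced here and do not appear elsewhere in this notebook.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{\Big(\sum_{j'} O_{ij'}\Big)\Big(\sum_{i'} O_{i'j}\Big)}{N}
$$

$$
\text{df} = (r-1)(c-1),
\qquad
p = P\big(\chi^2_{\text{df}} \ge \chi^2_{\text{observed}}\big),
\qquad
\text{reject } H_0 \text{ if } p < \alpha
$$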

First we define the hypotheses, and then we see how to perform the chi-square test using a library.

Null Hypothesis (H0): There is no relationship between the variables.

Alternative Hypothesis (H1): There is a relationship between the variables.

Choose a significance level (SL), e.g. SL = 0.05, which corresponds to a 95% confidence level.

If the p-value from the test is greater than 0.05, the test result lies in the acceptance region and we fail to reject the null hypothesis: the data give no evidence of a relationship between the feature variable and the target variable.

If the p-value from the test is less than 0.05, the test result lies in the rejection (critical) region, so we reject the null hypothesis in favor of the alternative hypothesis: there is a relationship between the feature variable and the target variable.
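The decision rule above can be sketched on a small 2x2 table without any third-party library. This is only an illustration: the counts are invented, the helper name `chi_square_2x2` is hypothetical, and no continuity correction is applied. It relies on the fact that, with 1 degree of freedom, the chi-square upper-tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent (H0)
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # A 2x2 table has (2-1)*(2-1) = 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, invented for illustration only
observed = [[30, 10],
            [20, 40]]
stat, p_value = chi_square_2x2(observed)

alpha = 0.05  # significance level (SL)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi2 = {stat:.3f}, p = {p_value:.2e}, decision: {decision}")
```

For larger tables the same logic applies, but the degrees of freedom change, so in practice a library routine is the safer choice.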

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.feature_selection import chi2,SelectKBest

In [2]:
df_loan = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Loan_Dataset/loan_data_set.csv")

In [3]:
df_loan.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [4]:
df_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [5]:
df_loan.shape

(614, 13)

In [6]:
df_loan.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [7]:
# Drop every row that has at least one missing value
df_loan.dropna(inplace=True)

In [8]:
df_loan

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 5 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | LP002978 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 610 | LP002979 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 612 | LP002984 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


In [9]:
df_loan.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [10]:
# Loan_ID is only a row identifier, so remove it before the analysis
df_loan.drop(columns=['Loan_ID'], inplace=True)

In [11]:
df_loan

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 5 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 610 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 611 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 612 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


In [12]:
# Renumber the rows from 0 now that rows with missing values are gone
df_loan.reset_index(drop=True, inplace=True)

In [13]:
df_loan

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 1 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 2 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 3 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 4 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 475 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 476 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 477 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 478 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


In [15]:
# Recode Credit_History from 0.0/1.0 to the labels 'N'/'Y' so it is treated as categorical
df_loan['Credit_History'] = df_loan['Credit_History'].apply(lambda x: 'N' if x == 0 else 'Y')

In [16]:
df_loan['Credit_History']

0      Y
1      Y
2      Y
3      Y
4      Y
      ..
475    Y
476    Y
477    Y
478    Y
479    N
Name: Credit_History, Length: 480, dtype: object

In [23]:
df_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             480 non-null    object 
 1   Married            480 non-null    object 
 2   Dependents         480 non-null    object 
 3   Education          480 non-null    object 
 4   Self_Employed      480 non-null    object 
 5   ApplicantIncome    480 non-null    int64  
 6   CoapplicantIncome  480 non-null    float64
 7   LoanAmount         480 non-null    float64
 8   Loan_Amount_Term   480 non-null    float64
 9   Credit_History     480 non-null    object 
 10  Property_Area      480 non-null    object 
 11  Loan_Status        480 non-null    object 
dtypes: float64(3), int64(1), object(8)
memory usage: 45.1+ KB


In [19]:
# All categorical (object-dtype) columns, including the target Loan_Status
cat_cols = df_loan.select_dtypes(include='object').columns
cat_cols

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [25]:
# Categorical feature columns only (the target Loan_Status is excluded)
cat_col = df_loan.select_dtypes(include='object').drop('Loan_Status', axis=1).columns
cat_col

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Credit_History', 'Property_Area'],
      dtype='object')

In [26]:
# Convert the object columns to the pandas category dtype
df_loan[cat_cols] = df_loan[cat_cols].astype('category')

In [27]:
df_loan[cat_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Gender          480 non-null    category
 1   Married         480 non-null    category
 2   Dependents      480 non-null    category
 3   Education       480 non-null    category
 4   Self_Employed   480 non-null    category
 5   Credit_History  480 non-null    category
 6   Property_Area   480 non-null    category
 7   Loan_Status     480 non-null    category
dtypes: category(8)
memory usage: 4.9 KB


In [28]:
# Replace each category label with its integer code
df_loan[cat_cols] = df_loan[cat_cols].apply(lambda x: x.cat.codes)

df_loan.head()

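| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1 | 2 | 1 |
| 2 | 1 | 1 | 0 | 1 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1 | 2 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1 | 2 | 1 |
| 4 | 1 | 1 | 2 | 0 | 1 | 5417 | 4196.0 | 267.0 | 360.0 | 1 | 2 | 1 |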
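The integer codes in this output follow the sorted order of each column's labels, which is why `Male` becomes 1 and `N` becomes 0. The dependency-free sketch below mimics that mapping for string labels. It is an illustration only: `category_codes` is a hypothetical helper and not part of pandas, and it assumes the categories are inferred from the data rather than set explicitly.

```python
def category_codes(values):
    """Map each label to its position in the sorted list of distinct labels."""
    categories = sorted(set(values))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[v] for v in values], categories

# Labels taken from the Gender and Loan_Status columns shown earlier
gender_codes, gender_categories = category_codes(["Male", "Female", "Male"])
status_codes, status_categories = category_codes(["N", "Y", "Y"])

print(gender_categories, gender_codes)
print(status_categories, status_codes)
```

Keep in mind that these codes are arbitrary identifiers: the order `Female < Male` carries no meaning beyond alphabetical sorting.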


In [29]:
X = df_loan[cat_col]

y = df_loan['Loan_Status']

In [31]:
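# Let's use the sklearn chi2 function

# k=7 keeps all seven categorical features, so every feature gets a score
cs = SelectKBest(score_func=chi2, k=7)

cs.fit(X, y)

# Collect the chi-square score and p-value of each feature
feature_score = pd.DataFrame({'Score': cs.scores_, 'P_Value': cs.pvalues_}, index=X.columns)

# Show the six highest-scoring features
feature_score.nlargest(n=6, columns='Score')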

| | Score | P_Value |
|---|---|---|
| Credit_History | 19.617746 | 9e-06 |
| Married | 2.132101 | 0.144243 |
| Education | 1.793838 | 0.180459 |
| Dependents | 0.806228 | 0.369238 |
| Self_Employed | 0.49892 | 0.479975 |
| Gender | 0.357829 | 0.549714 |


Features whose p-values are higher than the significance level show no statistically significant relationship with the target variable.

The p-value for Gender (0.5497) is higher than the SL, so we fail to reject H0 and conclude that there is no evidence that Gender is related to Loan_Status.

The p-value for Credit_History (about 9.5e-06) is far lower than the SL, so we reject H0 in favor of H1 and conclude that Credit_History is related to Loan_Status.
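To make the scores less of a black box, here is a simplified, dependency-free sketch of the computation that, to the best of our understanding, `sklearn.feature_selection.chi2` performs for a single feature: it sums the feature values within each class, compares those sums with what the class proportions alone would predict, and uses (number of classes - 1) degrees of freedom. The helper `chi2_score` and the toy data are hypothetical, and the real implementation is vectorised and may differ in detail.

```python
def chi2_score(feature, target):
    """Chi-square score of one non-negative feature against class labels."""
    classes = sorted(set(target))
    n_samples = len(target)
    feature_total = sum(feature)
    score = 0.0
    for label in classes:
        # Observed: sum of the feature values among samples of this class
        observed = sum(f for f, t in zip(feature, target) if t == label)
        # Expected: the class's share of samples times the feature total
        class_prob = sum(1 for t in target if t == label) / n_samples
        expected = class_prob * feature_total
        score += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(classes) - 1
    return score, degrees_of_freedom

# Toy binary feature and binary target, invented for illustration
feature = [1, 1, 1, 0, 0, 1, 0, 0]
target = [1, 1, 1, 1, 0, 0, 0, 0]
score, dof = chi2_score(feature, target)
print(score, dof)
```

Because the feature values themselves are summed, the score of a label-encoded column with more than two levels (such as Dependents or Property_Area) depends on which integer each level happened to receive. A contingency-table test, for example `scipy.stats.chi2_contingency` applied to a `pd.crosstab` of the two columns, is a common alternative in that situation.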

In [32]:
cs.scores_

array([ 0.35782918,  2.13210084,  0.80622846,  1.79383793,  0.49891951,
       19.61774587,  0.27761776])

In [33]:
cs.pvalues_

array([5.49714334e-01, 1.44242943e-01, 3.69237661e-01, 1.80459248e-01,
       4.79975265e-01, 9.45865863e-06, 5.98266886e-01])
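The two arrays are linked: since Loan_Status has two classes, each score is compared with a chi-square distribution with 1 degree of freedom, whose upper-tail probability is `erfc(sqrt(score / 2))`. The sketch below recomputes two of the p-values from the scores printed above using only the standard library. The helper name `p_value_1df` is hypothetical, and the 1-degree-of-freedom reading is our interpretation, which the printed numbers appear to support.

```python
import math

def p_value_1df(score):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(score / 2))

# Scores copied from the cs.scores_ output above
scores = {"Gender": 0.35782918, "Credit_History": 19.61774587}
p_values = {name: p_value_1df(s) for name, s in scores.items()}

for name, p in p_values.items():
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name}: p = {p:.6g} -> {verdict}")
```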