# Hypothesis tests

In this notebook, we perform hypothesis tests to validate certain speculations about employee attrition.

In [7]:
# Load the required packages
import json
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, chi2_contingency
import plotly.io as pio
pio.renderers.default = "vscode"

In [2]:
# Load the data
df = pd.read_csv('./../../../data/cleaned_data.csv')

In [3]:
# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']

In [4]:
# Segregate employees into attrition ('Yes') and non-attrition ('No') groups
attr = df[df['Attrition'] == 'Yes']
nattr = df[df['Attrition'] == 'No']

Following are the speculations we are going to consider for our analysis:
1. There is a difference between the mean salaries of people who leave the company and people who stay.
2. There is a difference between the mean percentage hike for the two groups mentioned above.
3. Frequent travelling is associated with attrition.
4. Overtime is associated with attrition.

## Claim 1 - Difference in monthly salary

In [10]:
fig = px.histogram(df, x='MonthlyIncome', color='Attrition', histnorm='probability', marginal='rug')
fig.show()


At lower incomes the probability of attrition is higher, while the trend reverses at the higher range of salaries. Above about 11k, the probability drops sharply. Beyond 14k it falls to nearly 0, before small values reappear around 19k to 20k.


For the given case, the null hypothesis and the alternate hypothesis can be framed as -  
$H_0$: The difference between the mean salaries of people who leave and people who stay is 0.  
$H_1$: There is a difference in the mean salaries.

In [None]:
tstat, tpvalue = ttest_ind(attr['MonthlyIncome'], nattr['MonthlyIncome'], equal_var=False)
print(f"T Statistic for the test is {tstat}, and the p-value is {tpvalue}")


T Statistic for the test is -7.482621586644742, and the p-value is 4.433588628286071e-13


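Choosing an alpha of 5%, the p-value of the test is far smaller than 0.05 and hence the null hypothesis is rejected. This signifies that there is a difference between the mean salaries of people who leave the company and people who stay. Since the leavers are passed as the first sample, the negative t statistic indicates that their mean monthly income is the lower of the two.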
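Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which does not assume equal variances in the two groups. As a rough sketch of what that computes, the statistic can be reproduced by hand. The samples below are synthetic stand-ins with arbitrary means and spreads (an assumption for illustration, not the notebook's data), and `welch_t` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (illustrative values, NOT the real data)
rng = np.random.default_rng(0)
left = rng.normal(loc=4800, scale=3600, size=237)
stayed = rng.normal(loc=6800, scale=4800, size=1233)

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error contributed by each sample
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

t_manual, dof_manual = welch_t(left, stayed)
t_scipy, p_scipy = ttest_ind(left, stayed, equal_var=False)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, dof = {dof_manual:.1f}")
```

The sign of the statistic follows the argument order: it is negative when the first sample has the smaller mean.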

## Claim 2 - Difference in percentage of hike in salary

In [None]:
fig = px.histogram(df, x='PercentSalaryHike', color='Attrition', marginal='rug', histnorm='probability')
fig.show()


There seems to be no significant difference in the distribution of salary hikes between the two groups. People do not seem to weigh recent salary hikes when considering a switch.

For the given case, the null hypothesis and the alternate hypothesis can be framed as -  
$H_0$: There is no difference between the mean percent salary hike for the two groups of interest.  
$H_1$: There is a difference between the mean percent salary hike for the two groups of interest.

In [None]:
tstat, tpvalue = ttest_ind(attr['PercentSalaryHike'], nattr['PercentSalaryHike'], equal_var=False)
print(f"T Statistic for the test is {tstat}, and the p-value is {tpvalue}")

T Statistic for the test is -0.5042445352368612, and the p-value is 0.6144300580581377


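Again choosing an alpha of 5%, the p-value (about 0.61) is greater than 0.05, so we fail to reject the null hypothesis. The data provide no evidence of a difference in mean salary hike between people who leave the company and people who stay. Note that this is weaker than showing the two means are equal.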
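A large p-value is often easier to interpret next to a confidence interval for the difference in means: if the interval straddles 0, the two-sided test at the same alpha does not reject. The sketch below builds a 95% Welch interval. The data are synthetic with arbitrary parameters (an assumption, not the notebook's data), and `welch_ci` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_ind

def welch_ci(a, b, alpha=0.05):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    diff = a.mean() - b.mean()
    margin = t_dist.ppf(1 - alpha / 2, dof) * se
    return diff - margin, diff + margin

# Synthetic samples with nearly identical means (illustrative values only)
rng = np.random.default_rng(1)
group_a = rng.normal(15.1, 3.7, size=237)
group_b = rng.normal(15.2, 3.6, size=1233)

low, high = welch_ci(group_a, group_b)
_, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"95% CI for the difference: ({low:.3f}, {high:.3f}), p-value = {p_value:.3f}")
```

The width of the interval also shows how large a difference the data could still be hiding, which the p-value alone does not convey.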

## Claim 3 - Frequent travelling

To perform the chi-squared test, we first need the contingency table, which holds the count of each business travel category for each value of the target variable.

In [None]:
travel_contingency = pd.crosstab(df['Attrition'], df['BusinessTravel'])
travel_contingency

| Attrition \ BusinessTravel | Non-Travel | Travel_Frequently | Travel_Rarely |
|---|---|---|---|
| No | 138 | 208 | 887 |
| Yes | 12 | 69 | 156 |


For the case at hand, the null hypothesis and the alternate hypothesis can be framed as -  
$H_0$: There is no relationship between attrition and business travel.  
$H_1$: There is a relationship between attrition and business travel.

In [None]:
stat, p, dof, expected = chi2_contingency(travel_contingency.values.tolist())
print(f"The chi-squared test statistics is {stat} with p-value as {p}")

The chi-squared test statistics is 24.182413685655174 with p-value as 5.608614476449931e-06


Considering an alpha of 5%, the p-value is far smaller than 0.05 and hence the null hypothesis is rejected. In words, there is a relationship between attrition and business travel.
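The chi-squared test says that a relationship exists but not what it looks like. One way to read the direction is to compute the attrition rate within each travel category, straight from the counts in the contingency table above. This is a minimal sketch; the names `travel_counts`, `attrition_rate` and `chi2_travel` are introduced here for illustration only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the contingency table above (rows: Attrition, columns: BusinessTravel)
travel_counts = pd.DataFrame(
    {
        "Non-Travel": [138, 12],
        "Travel_Frequently": [208, 69],
        "Travel_Rarely": [887, 156],
    },
    index=["No", "Yes"],
)

# Share of employees in each travel category who left the company
attrition_rate = travel_counts.loc["Yes"] / travel_counts.sum(axis=0)
print(attrition_rate.round(3))

# Re-run the test on the hard-coded counts; `expected_travel` holds the counts
# that would be expected if attrition and travel were independent
chi2_travel, p_travel, dof_travel, expected_travel = chi2_contingency(travel_counts.to_numpy())
print(f"chi2 = {chi2_travel:.3f}, dof = {dof_travel}")
```

Comparing `expected_travel` with the observed counts shows which cells contribute most to the statistic.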

## Claim 4 - Overtime

We start with the contingency table for this case.

In [None]:
time_contingency = pd.crosstab(df['Attrition'], df['OverTime'])
time_contingency

| Attrition \ OverTime | No | Yes |
|---|---|---|
| No | 944 | 289 |
| Yes | 110 | 127 |


For this scenario, the null hypothesis and the alternate hypothesis can be framed as -  
$H_0$: There is no relationship between attrition and overtime.  
$H_1$: There is a relationship between attrition and overtime.

In [None]:
stat, p, dof, expected = chi2_contingency(time_contingency.values.tolist())
print(f"The chi-squared test statistics is {stat} with p-value as {p}")

The chi-squared test statistics is 87.56429365828768 with p-value as 8.15842372153832e-21


Again considering an alpha of 5%, the p-value is far smaller than 0.05 and hence the null hypothesis is rejected. This means that there is a relationship between attrition and overtime.
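For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic reported above is the corrected one. The sketch below compares it with the uncorrected statistic and adds the attrition rates and the odds ratio as a simple measure of how strong the relationship is. It uses the counts from the table above; the variable names are introduced here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the table above: rows = Attrition (No, Yes), columns = OverTime (No, Yes)
overtime_counts = np.array([[944, 289],
                            [110, 127]])

# Default behaviour for a 2x2 table: Yates' continuity correction is applied
stat_yates, p_yates, _, _ = chi2_contingency(overtime_counts)
# The same test without the correction, for comparison
stat_plain, p_plain, _, _ = chi2_contingency(overtime_counts, correction=False)

# Attrition rate among employees without and with overtime
rate_no_overtime = overtime_counts[1, 0] / overtime_counts[:, 0].sum()
rate_overtime = overtime_counts[1, 1] / overtime_counts[:, 1].sum()

# Odds of leaving with overtime relative to the odds of leaving without it
odds_ratio = (overtime_counts[1, 1] * overtime_counts[0, 0]) / (
    overtime_counts[1, 0] * overtime_counts[0, 1]
)

print(f"chi2 (Yates) = {stat_yates:.3f}, chi2 (uncorrected) = {stat_plain:.3f}")
print(f"attrition rate: no overtime = {rate_no_overtime:.3f}, overtime = {rate_overtime:.3f}")
print(f"odds ratio = {odds_ratio:.2f}")
```

With counts this large the correction changes the statistic only slightly, and the conclusion is the same either way.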

In [None]:
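# Scratch cell: check that plotly figures render, using the built-in iris dataset.
# A separate name is used so the attrition DataFrame `df` is not overwritten.
import plotly.express as px

iris_df = px.data.iris()
fig = px.scatter(iris_df, x="sepal_width", y="sepal_length", color="species", size="sepal_length")
fig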