<a href="https://colab.research.google.com/github/j-yxn/attrition_analysis/blob/main/attrition_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Observational Study using Hypothesis Testing**

Drawing on material from STAT 2120: Intro to Statistical Analysis, this mini-project applies a one-sided two-proportion z-test to determine whether the turnover rate is significantly higher among employees who work overtime than among those who do not.

Understanding what drives employee turnover matters because attrition is a significant cost to organizations.

The dataset is from Kaggle: https://www.kaggle.com/datasets/ajinkyachintawar/employee-attrition-and-retention-analytics-dataset?resource=download

Credits to Ajinkya Chintawar.

In [10]:
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ajinkyachintawar/employee-attrition-and-retention-analytics-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'employee-attrition-and-retention-analytics-dataset' dataset.
Path to dataset files: /kaggle/input/employee-attrition-and-retention-analytics-dataset


In [12]:
# load the data; run the download cell above first so that `path` is defined
df = pd.read_csv(f"{path}/HR-Employee-Attrition.csv")

# raw attrition values (expected to be 'Yes' or 'No')
print(df['Attrition'])

# convert Attrition to binary: 1 for 'Yes', 0 for every other value
df['Attrition_Binary'] = (df['Attrition'] == 'Yes').astype(int)

# verify the conversion
print(f"\n{df['Attrition_Binary']}")

0          Yes
1           No
2          Yes
3           No
4           No
         ...  
1471     15:16
1472     12:15
1473     14:15
1474     10:09
1475     16:24
Name: Attrition, Length: 1476, dtype: object

0       1
1       0
2       1
3       0
4       0
       ..
1471    0
1472    0
1473    0
1474    0
1475    0
Name: Attrition_Binary, Length: 1476, dtype: int64
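
The conversion above maps every value other than `Yes` to 0, and the printed tail of the `Attrition` column shows entries such as `15:16` that are neither `Yes` nor `No`. A minimal sketch of a stricter mapping follows. The `toy` frame is hypothetical and only stands in for the real CSV, and dropping the unexpected rows is one possible choice, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV; the last value mimics
# the unexpected entries visible at the end of the printed column.
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "15:16"]})

# Map only the two expected labels; anything else becomes NaN instead of 0.
toy["Attrition_Binary"] = toy["Attrition"].map({"Yes": 1, "No": 0})

# Rows whose label is neither 'Yes' nor 'No'.
unexpected = toy[toy["Attrition_Binary"].isna()]
print(f"Unexpected Attrition values: {unexpected['Attrition'].tolist()}")

# One option (an assumption, not the notebook's approach): drop those rows
# and restore an integer dtype.
clean = toy.dropna(subset=["Attrition_Binary"]).astype({"Attrition_Binary": int})
print(clean)
```

Inspecting `unexpected` before deciding how to handle such rows makes it explicit whether they affect either overtime group.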


## Hypotheses:

$p_1 =$ *The true proportion of employees who leave the company among those who work overtime.*

$p_2 =$ *The true proportion of employees who leave the company among those who do not work overtime.*

**Null Hypothesis**: $p_1 = p_2$

**Alternative Hypothesis**: $p_1 > p_2$
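
For reference, the standard pooled form of the two-proportion z-statistic is sketched below. The symbols $x_i$ (number of employees who left in group $i$), $n_i$ (group size), $\hat{p}_i$, and $\hat{p}$ are introduced here for illustration, and it is assumed that `proportions_ztest` uses this pooled form with its default settings.

$$
\hat{p}_i = \frac{x_i}{n_i}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

Because the alternative is $p_1 > p_2$, the p-value is the upper-tail probability $P(Z > z)$ for a standard normal $Z$.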


In [13]:
# split the binary attrition indicator into the two groups (overtime / no overtime)
group_overtime = df.loc[df['OverTime'] == 'Yes', 'Attrition_Binary']
group_no_overtime = df.loc[df['OverTime'] == 'No', 'Attrition_Binary']

# attrition rate in each group (the mean of a 0/1 column is the proportion of 1s)
overtime_AttRate = group_overtime.mean()
no_overtime_AttRate = group_no_overtime.mean()

# print both rates as percentages with two decimal places
print(f"Attrition Rate w/ Overtime: {overtime_AttRate : .2%}")
print(f"Attrition Rate w/ No Overtime: {no_overtime_AttRate : .2%}")

Attrition Rate w/ Overtime:  30.53%
Attrition Rate w/ No Overtime:  10.44%


In [14]:
# count the number of 'successes' (employees who quit) in each group
# summing the 0/1 indicator counts the 'Yes' values, i.e. the employees who quit
successes = [group_overtime.sum(), group_no_overtime.sum()]
n_samples = [len(group_overtime), len(group_no_overtime)]

# one-sided two-proportion z-test (overtime vs. no overtime)
# alternative="larger" tests whether the first group's proportion exceeds the second's
z_stat, p_val = proportions_ztest(count=successes, nobs=n_samples, alternative="larger")

# print the z-statistic and p-value
print(f"Z-Statistic: {z_stat : .3f}")
print(f"P-Value: {p_val : .25f}")

# conclude using the standard significance level of 0.05
if p_val < 0.05:
  print("\nResult: Statistically Significant")
  print("The null hypothesis is rejected. Employees who do overtime are more likely to quit than individuals who do not.")
else:
  print("\nResult: Not Significant")
  print("We fail to reject the null hypothesis. The observed difference could be due to random chance.")

Z-Statistic:  9.436
P-Value:  0.0000000000000000000019308

Result: Statistically Significant
The null hypothesis is rejected. Employees who do overtime are more likely to quit than individuals who do not.
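
As a cross-check on the library call, the one-sided pooled z-test can be sketched by hand with only the standard library. The function name `two_proportion_ztest` and the counts passed to it are hypothetical and chosen for illustration; they are not the counts from this dataset.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """One-sided pooled z-test of H0: p1 = p2 against Ha: p1 > p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Pooled proportion: the estimate of the common rate under H0.
    p_pool = (x1 + x2) / (n1 + n2)
    # Standard error of the difference in proportions under H0.
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Upper-tail probability of the standard normal, P(Z > z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not the notebook's data).
z, p = two_proportion_ztest(x1=30, n1=100, x2=10, n2=100)
print(f"z = {z:.3f}, one-sided p = {p:.6f}")
```

Passing `group_overtime.sum()`, `len(group_overtime)`, `group_no_overtime.sum()`, and `len(group_no_overtime)` to this function should give values close to those reported by `proportions_ztest`, assuming the library uses the pooled variance by default.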
