<font size = 6, color = "cyan"><b> Independent Project I </b></font>

<font size = 5, color = "pink"><b> HR Analytics: Employee Attrition </b></font>

<em><u>This project comprises two parts:</u></em>

_1. Statistical Insights & Predictions Using Machine Learning_

_2. Dashboard Visualization using Microsoft Power BI_

<font size = 4, color = "gold"><b> About the Dataset </b></font>
* The dataset titled [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) was obtained through Kaggle
* It's a fictional dataset containing <u>1470 records and 35 attributes</u> (e.g., job satisfaction, job role, education, performance rating, years with current manager, etc.), with <u>Attrition as the target class</u> (a binary Yes/No variable)

<font size = 4, color = "gold"><b> Data Cleaning & Preprocessing</b></font>
<li>The dataset was cleaned using Microsoft Excel</li>
<li>The following data preprocessing operations were performed:</li>
        
        - Dropped Insignificant Features (Chi-Square Test) & Features With High Correlation/Weak Correlation 
        - One-Hot Encoding of Nominal Features, Ordinal Encoding of Ordinal Features, & Label Encoding
        - Feature Scaling (Z-Score Normalization)
        - Feature Selection (RFE for the Logistic Regression Model)
        - Handling Class Imbalance Using SMOTE-ENN

The dataset produced by dropping features and encoding has already been saved to a file, so we can read it directly and continue from there. For more details on the data preprocessing steps, see the <font color = 'green'><em>1) Data Preprocessing, Chi-Square Analysis, & Logistic Regression Model</em></font> notebook.

<font size = 4, color = "gold"><b> Data Preprocessing</b></font>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

In [2]:
dataset_file_path = r'C:\Users\user\OneDrive\Desktop\Post-Grad Stuff\Analytics Practice\Datasets (Original Format)\preprocessed_attrition_dataset.csv'
df = pd.read_csv(filepath_or_buffer = dataset_file_path)
display(df)

Unnamed: 0,Age,Attrition,BusinessTravel,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,OverTime,StockOptionLevel,...,JobRole_Laboratory_Technician,JobRole_Manager,JobRole_Manufacturing_Director,JobRole_Research_Director,JobRole_Research_Scientist,JobRole_Sales_Executive,JobRole_Sales_Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single
0,41,1,1,2,3,2,4,8,1,0,...,0,0,0,0,0,1,0,0,0,1
1,49,0,2,3,2,2,2,1,0,1,...,0,0,0,0,1,0,0,0,1,0
2,37,1,1,4,2,1,3,6,1,0,...,1,0,0,0,0,0,0,0,0,1
3,33,0,2,4,3,1,3,1,1,0,...,0,0,0,0,1,0,0,0,1,0
4,27,0,1,1,3,1,2,9,0,1,...,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,0,2,3,4,2,4,4,0,1,...,1,0,0,0,0,0,0,0,1,0
1466,39,0,1,4,2,3,1,4,0,1,...,0,0,0,0,0,0,0,0,1,0
1467,27,0,1,2,4,2,2,1,1,1,...,0,0,1,0,0,0,0,0,1,0
1468,49,0,2,4,2,2,2,2,0,0,...,0,0,0,0,0,1,0,0,1,0


In [None]:
df.info()                       # provides insights on non-null count, dtype of each column
df.nunique()                    # shows number of unique values for each column
df.describe().drop('count')     # data summary (mean, standard deviation, min, max, and percentiles for each numerical column); the count row is dropped

Unnamed: 0,Age,Attrition,BusinessTravel,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,OverTime,StockOptionLevel,...,JobRole_Laboratory_Technician,JobRole_Manager,JobRole_Manufacturing_Director,JobRole_Research_Director,JobRole_Research_Scientist,JobRole_Sales_Executive,JobRole_Sales_Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single
mean,36.92381,0.161224,1.086395,2.721769,2.729932,2.063946,2.728571,2.693197,0.282993,0.793878,...,0.17619,0.069388,0.098639,0.054422,0.198639,0.221769,0.056463,0.222449,0.457823,0.319728
std,9.135373,0.367863,0.53217,1.093082,0.711561,1.10694,1.102846,2.498009,0.450606,0.852077,...,0.381112,0.254199,0.298279,0.226925,0.399112,0.415578,0.230891,0.416033,0.498387,0.46653
min,18.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,0.0,1.0,2.0,2.0,1.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,36.0,0.0,1.0,3.0,3.0,2.0,3.0,2.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,43.0,0.0,1.0,4.0,3.0,3.0,4.0,4.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
max,60.0,1.0,2.0,4.0,4.0,5.0,4.0,9.0,1.0,3.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
X = df.drop(columns = ['Attrition'])
y = df['Attrition']

<font color = "gray"><b>Feature Scaling using Z-Score Normalization</b></font>

In [6]:
zscore_norm = StandardScaler()
X_rescaled = zscore_norm.fit_transform(X)
X_rescaled_df = pd.DataFrame(X_rescaled, columns = X.columns)      # back to a dataframe so the column names are kept for resampling
X_rescaled_df

Unnamed: 0,Age,BusinessTravel,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,OverTime,StockOptionLevel,TotalWorkingYears,...,JobRole_Laboratory_Technician,JobRole_Manager,JobRole_Manufacturing_Director,JobRole_Research_Director,JobRole_Research_Scientist,JobRole_Sales_Executive,JobRole_Sales_Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single
0,0.446350,-0.162399,-0.660531,0.379672,-0.057788,1.153254,2.125136,1.591746,-0.932014,-0.421642,...,-0.462464,-0.273059,-0.330808,-0.239904,-0.497873,1.873287,-0.244625,-0.534873,-0.918921,1.458650
1,1.322365,1.717339,0.254625,-1.026167,-0.057788,-0.660853,-0.678049,-0.628241,0.241988,-0.164511,...,-0.462464,-0.273059,-0.330808,-0.239904,2.008543,-0.533821,-0.244625,-0.534873,1.088232,-0.685565
2,0.008343,-0.162399,1.169781,-1.026167,-0.961486,0.246200,1.324226,1.591746,-0.932014,-0.550208,...,2.162331,-0.273059,-0.330808,-0.239904,-0.497873,-0.533821,-0.244625,-0.534873,-0.918921,1.458650
3,-0.429664,1.717339,1.169781,0.379672,-0.961486,0.246200,-0.678049,1.591746,-0.932014,-0.421642,...,-0.462464,-0.273059,-0.330808,-0.239904,2.008543,-0.533821,-0.244625,-0.534873,1.088232,-0.685565
4,-1.086676,-0.162399,-1.575686,0.379672,-0.961486,-0.660853,2.525591,-0.628241,0.241988,-0.678774,...,2.162331,-0.273059,-0.330808,-0.239904,-0.497873,-0.533821,-0.244625,-0.534873,1.088232,-0.685565
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,-0.101159,1.717339,0.254625,1.785511,-0.057788,1.153254,0.523316,-0.628241,0.241988,0.735447,...,2.162331,-0.273059,-0.330808,-0.239904,-0.497873,-0.533821,-0.244625,-0.534873,1.088232,-0.685565
1466,0.227347,-0.162399,1.169781,-1.026167,0.845911,-1.567907,0.523316,-0.628241,0.241988,-0.293077,...,-0.462464,-0.273059,-0.330808,-0.239904,-0.497873,-0.533821,-0.244625,-0.534873,1.088232,-0.685565
1467,-1.086676,-0.162399,-0.660531,1.785511,-0.057788,-0.660853,-0.678049,1.591746,0.241988,-0.678774,...,-0.462464,-0.273059,3.022901,-0.239904,-0.497873,-0.533821,-0.244625,-0.534873,1.088232,-0.685565
1468,1.322365,1.717339,1.169781,-1.026167,-0.057788,-0.660853,-0.277594,-0.628241,-0.932014,0.735447,...,-0.462464,-0.273059,-0.330808,-0.239904,-0.497873,1.873287,-0.244625,-0.534873,1.088232,-0.685565
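As a sanity check on the scaled table above, the first `Age` value can be reproduced by hand. This is a minimal sketch that uses only the summary statistics printed earlier. Note that `df.describe()` reports the sample standard deviation (ddof = 1), whereas `StandardScaler` divides by the population standard deviation (ddof = 0), so the printed `std` is adjusted first. The variable names are illustrative and do not appear in the notebook.

```python
import math

# Summary statistics for Age, as printed by df.describe() above
n_rows = 1470                # number of records in the dataset
age_mean = 36.92381          # mean of Age
age_std_sample = 9.135373    # describe() reports the sample std (ddof = 1)

# StandardScaler divides by the population std (ddof = 0)
age_std_pop = age_std_sample * math.sqrt((n_rows - 1) / n_rows)

# z = (x - mean) / std for the first employee (Age = 41)
z_first = (41 - age_mean) / age_std_pop
print(round(z_first, 4))     # close to the 0.446350 shown in the scaled table
```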


<font size = 4, color = "gold"><b> I. Statistical Insights & Predictions Using Machine Learning </b></font>

<font size = 4, color = "green"><b> Experimenting with Machine Learning Models </b></font>

In [7]:
from sklearn.feature_selection import SelectFromModel           # LASSO 
from sklearn.feature_selection import RFE                       # RFE

from imblearn.combine import SMOTEENN

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, confusion_matrix, classification_report

<font size = 4, color = "gray"><b> 2) Decision Trees </b></font>

<font size = 4, color = "gray"> <em>
<li></li>
<li></li>
<li></li>
<li></li>
</em></font>

<font color = "gray">
<b>Assumptions: </b>
<li><u>:</u> </li>
</font>

<font color = "gray"><u><em>No Feature Selection or Class Imbalance Handling</em></u></font>

In [None]:
# split data into 80% training and 20% testing
# X_train, X_test, y_train, y_test = train_test_split(X_rescaled_df, newer_y, test_size = 0.20, random_state = 23)

# logRegModel = LogisticRegression(class_weight = 'balanced', random_state = 23)   # class_weight useful for imbalanced classes (like in our case)
# logRegModel.fit(X_train, y_train)
# prediction = logRegModel.predict(X_test)

# logReg_accuracy = accuracy_score(y_test, prediction)
# logReg_f1score = f1_score(y_test, prediction)
# logReg_precision = precision_score(y_test, prediction)
# logReg_recall = recall_score(y_test, prediction)

# logReg_confMatrix = confusion_matrix(y_test, prediction)

# print("Overall Logistic Regression Model Performance:")
# print(f'- Accuracy: {logReg_accuracy * 100:.2f} %')
# print(f'- Precision: {logReg_precision * 100:.2f} %')
# print(f'- Recall: {logReg_recall * 100:.2f} %')
# print(f'- F1-score: {logReg_f1score * 100:.2f} %\n')

# print(f'- Confusion Matrix: \n{logReg_confMatrix}')

# plt.figure(figsize=(8, 6))
# sns.heatmap(logReg_confMatrix, annot = True, fmt = 'd', cmap='Blues', cbar = True,
#             xticklabels = ['Predicted 0', 'Predicted 1'],
#             yticklabels = ['Actual 0', 'Actual 1'])
# plt.xlabel('Predicted Label')
# plt.ylabel('True Label')
# plt.title('Logistic Regression Model Confusion Matrix')
# plt.show()

As noticed, the model yields ________
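The commented-out cell above is still the logistic regression template. A minimal sketch of the same baseline with a decision tree might look like the following. It runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained) with roughly the same class balance as the attrition dataset; in the notebook, `X_demo` and `y_demo` would be replaced by `X_rescaled_df` and `y`. The `_demo` and `tree_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and the Attrition labels (about 16% positives)
X_demo, y_demo = make_classification(n_samples = 1470, n_features = 20, n_informative = 6,
                                     weights = [0.84, 0.16], random_state = 23)

# split data into 80% training and 20% testing, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size = 0.20,
                                                    stratify = y_demo, random_state = 23)

# class_weight = 'balanced' reweights the classes inversely to their frequency
tree_model = DecisionTreeClassifier(class_weight = 'balanced', random_state = 23)
tree_model.fit(X_train, y_train)
prediction = tree_model.predict(X_test)

tree_confMatrix = confusion_matrix(y_test, prediction)

print("Overall Decision Tree Model Performance:")
print(f'- Accuracy: {accuracy_score(y_test, prediction) * 100:.2f} %')
print(f'- Precision: {precision_score(y_test, prediction, zero_division = 0) * 100:.2f} %')
print(f'- Recall: {recall_score(y_test, prediction) * 100:.2f} %')
print(f'- F1-score: {f1_score(y_test, prediction) * 100:.2f} %\n')
print(f'- Confusion Matrix: \n{tree_confMatrix}')
```

The scores printed by this sketch describe the synthetic data only and say nothing about the attrition dataset.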


<font color = "gray"><b>Feature Selection</b></font>


<font color = "gray"><u>Lasso (L1 Regularization)</u>
<li><em>Shrinks the coefficients of irrelevant features to exactly 0</em></li> 
<li><em>Fast & efficient for large datasets</em></li> 
<li><em>Can remove too many features if the penalty lambda (α) is too strong. In scikit-learn the penalty is set through C, the inverse of lambda, so a smaller C removes more features --> in our case I've tried C = 0.001, 0.01, 0.05, 0.1, and 0.5: 0.001 & below selected no features, 0.01 selected 1, 0.05 selected 26 features, 0.1 selected 31 features, and 0.5 selected 36 features</em></li> 
<li><em>Struggles when features are highly correlated</em></li> 
</font>
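To illustrate how the strength of the L1 penalty drives the number of surviving features, here is a minimal sketch on synthetic data (an assumption; the `_demo` names are hypothetical and are not the notebook's variables). In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty and fewer non-zero coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized stand-in for the scaled feature matrix
X_raw, y_demo = make_classification(n_samples = 500, n_features = 30, n_informative = 8, random_state = 23)
X_demo = StandardScaler().fit_transform(X_raw)

# Count how many coefficients stay non-zero for each value of C
n_selected = {}
for C in [0.001, 0.01, 0.05, 0.1, 0.5]:
    lasso = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C)
    lasso.fit(X_demo, y_demo)
    n_selected[C] = int(np.count_nonzero(lasso.coef_))

print(n_selected)    # maps each C to the number of features with a non-zero coefficient
```

The exact counts depend on the data, but the count for the smallest `C` should not exceed the count for the largest one.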

In [9]:
# lasso = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = 0.1)      # other solvers expect l2 regularization
# lasso.fit(X_rescaled, newer_y)
# lasso_model = SelectFromModel(lasso, prefit = True)

# X_selected_lasso = lasso_model.transform(X_rescaled)
# print("Selected Features Shape:", X_selected_lasso.shape)

# X_Ltrain, X_Ltest, y_Ltrain, y_Ltest = train_test_split(X_selected_lasso, newer_y, test_size = 0.20, random_state = 23)

# logLRegModel = LogisticRegression(class_weight = 'balanced', random_state = 23)   # class_weight useful for imbalanced classes (like in our case)
# logLRegModel.fit(X_Ltrain, y_Ltrain)
# predictionL = logLRegModel.predict(X_Ltest)

# logLReg_accuracy = accuracy_score(y_Ltest, predictionL)
# logLReg_f1score = f1_score(y_Ltest, predictionL)
# logLReg_precision = precision_score(y_Ltest, predictionL)
# logLReg_recall = recall_score(y_Ltest, predictionL)

# logLReg_confMatrix = confusion_matrix(y_Ltest, predictionL)

# print("Overall Logistic Regression Model Performance:")
# print(f'- Accuracy: {logLReg_accuracy * 100:.2f} %')
# print(f'- Precision: {logLReg_precision * 100:.2f} %')
# print(f'- Recall: {logLReg_recall * 100:.2f} %')
# print(f'- F1-score: {logLReg_f1score * 100:.2f} %')

<font color = "gray"><u>Recursive Feature Elimination (RFE)</u>
<li><em>Works well with non-linearly related features</em></li> 
<li><em>Can be used with any model (Logistic Regression, SVM, RF, etc.)</em></li> 
<li><em> Computationally expensive if dataset is large</em></li> 
<li><em>Needs model retraining for every iteration</em></li> 
</font>
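The elimination loop described above can be sketched on synthetic data as follows (an assumption; `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's scaled features and labels). With the default `step = 1`, RFE refits the estimator and drops the weakest remaining feature once per iteration until `n_features_to_select` features are left.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 30 candidate features
X_demo, y_demo = make_classification(n_samples = 500, n_features = 30, n_informative = 8, random_state = 23)

# Keep the 22 features that survive the recursive elimination
rfe = RFE(LogisticRegression(max_iter = 1000), n_features_to_select = 22)
X_selected = rfe.fit_transform(X_demo, y_demo)

print("Selected Features Shape:", X_selected.shape)
print("Kept feature indices:", rfe.get_support(indices = True))
print("Elimination ranking (1 = kept):", rfe.ranking_)
```

`rfe.ranking_` is useful when experimenting with different values of `n_features_to_select`, since it shows the order in which the features were dropped.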

In [10]:
# log_reg_rfe = LogisticRegression()                              # default solver yielded the same higher scores as liblinear solver
# rfe = RFE(log_reg_rfe, n_features_to_select = 22)               # select top 22 features (experimented with different values, 22 yielded the best F1 & Recall Scores)
# X_selected_rfe = rfe.fit_transform(X_rescaled, newer_y)
# print("Selected Features Shape:", X_selected_rfe.shape)

# X_Rtrain, X_Rtest, y_Rtrain, y_Rtest = train_test_split(X_selected_rfe, newer_y, test_size = 0.20, random_state = 23)
# logRegModel_R = LogisticRegression(class_weight = 'balanced', random_state = 23)   # class_weight useful for imbalanced classes (like in our case)
# logRegModel_R.fit(X_Rtrain, y_Rtrain)
# predictionR = logRegModel_R.predict(X_Rtest)

# logRReg_accuracy = accuracy_score(y_Rtest, predictionR)
# logRReg_f1score = f1_score(y_Rtest, predictionR)
# logRReg_precision = precision_score(y_Rtest, predictionR)
# logRReg_recall = recall_score(y_Rtest, predictionR)

# logRReg_confMatrix = confusion_matrix(y_Rtest, predictionR)

# print("Overall Logistic Regression Model Performance:")
# print(f'- Accuracy: {logRReg_accuracy * 100:.2f} %')
# print(f'- Precision: {logRReg_precision * 100:.2f} %')
# print(f'- Recall: {logRReg_recall * 100:.2f} %')
# print(f'- F1-score: {logRReg_f1score * 100:.2f} %')

<font color = "gray"><b>Handling Class Imbalance</b></font>

In [11]:
# Before Handling Class Imbalance
y.value_counts()

0    1233
1     237
Name: Attrition, dtype: int64
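The counts above already show why accuracy alone is misleading here. A short worked example, using only the two counts printed by `y.value_counts()` (the variable names are illustrative):

```python
# Class counts reported by y.value_counts() above
stayed, left = 1233, 237
total = stayed + left

attrition_rate = left / total          # share of the positive class
baseline_accuracy = stayed / total     # accuracy of always predicting "No"
baseline_recall = 0 / left             # that baseline never flags a single leaver

print(f'Attrition rate: {attrition_rate * 100:.2f} %')                       # 16.12 %
print(f'Majority-class baseline accuracy: {baseline_accuracy * 100:.2f} %')  # 83.88 %
print(f'Majority-class baseline recall: {baseline_recall * 100:.2f} %')      # 0.00 %
```

A model that always predicts "No" would therefore look accurate while missing every employee who leaves, which is why recall and F1-score are tracked in the cells that follow.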

<font color = "gray"><u>SMOTE-ENN</u> (Without Feature Selection) -- <em> combines SMOTE with Edited Nearest Neighbours</em></font>

In [12]:
# smote_een = SMOTEENN(sampling_strategy = 'all', random_state = 23)
# X_smote_resampled, y_smote_resampled = smote_een.fit_resample(X_rescaled_df, newer_y)
# print(y_smote_resampled.value_counts())

# print('\nX_smote_resampled: ', X_smote_resampled.shape)          # 1946 records, 36 attributes

# X_Smote_train, X_Smote_test, y_Smote_train, y_Smote_test = train_test_split(X_smote_resampled, y_smote_resampled, test_size = 0.20, random_state = 23)
# logSmoteRegModel = LogisticRegression(class_weight = 'balanced', random_state = 23)   # class_weight useful for imbalanced classes (like in our case)

# logSmoteRegModel.fit(X_Smote_train, y_Smote_train)
# predictionSmote = logSmoteRegModel.predict(X_Smote_test)

# logSmoteReg_recall = recall_score(y_Smote_test, predictionSmote)
# logSmoteReg_f1score = f1_score(y_Smote_test, predictionSmote)

# print("Overall Logistic Regression Model Performance:\n")
# print(f'- Recall: {logSmoteReg_recall * 100:.2f} %')
# print(f'- F1-score: {logSmoteReg_f1score * 100:.2f} %')
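For intuition, SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours; the Edited Nearest Neighbours step then removes samples whose class disagrees with most of their neighbours. The sketch below shows only the interpolation idea in plain NumPy on made-up points. It is a simplified illustration, not imbalanced-learn's implementation, and the function name `smote_like_sample` is hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k = 2, rng = None):
    """Create one synthetic point between a random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))                           # pick a minority sample at random
    dists = np.linalg.norm(minority - minority[i], axis = 1)  # distances to all minority samples
    neighbours = np.argsort(dists)[1:k + 1]                   # skip position 0 (the point itself)
    j = rng.choice(neighbours)                                # pick one of the k neighbours
    lam = rng.random()                                        # interpolation factor in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

# Made-up 2-D minority-class points, for illustration only
minority_demo = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
rng = np.random.default_rng(23)
synthetic = np.array([smote_like_sample(minority_demo, k = 2, rng = rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples never leave the region the minority class already occupies.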

<font color = "gray">SMOTE-ENN is yielding a higher recall and F1-Score than Undersampling and Oversampling. Let's test it out with RFE Feature Selection</font>

<font color = "gray"><em><u>SMOTE-ENN:</u> With RFE Feature Selection</em></font>

In [13]:
# smote_een_FS = SMOTEENN(sampling_strategy = 'all', random_state = 23)
# X_smoteFS_resampled, y_smoteFS_resampled = smote_een_FS.fit_resample(X_selected_rfe, newer_y)
# print(y_smoteFS_resampled.value_counts())

# print('\nX_smoteFS_resampled: ', X_smoteFS_resampled.shape)          # 2466 records, 36 attributes

# X_SmoteFS_train, X_SmoteFS_test, y_SmoteFS_train, y_SmoteFS_test = train_test_split(X_smoteFS_resampled, y_smoteFS_resampled, test_size = 0.20, random_state = 23)
# logSmoteFSRegModel = LogisticRegression(class_weight = 'balanced', random_state = 23)   # class_weight useful for imbalanced classes (like in our case)

# logSmoteFSRegModel.fit(X_SmoteFS_train, y_SmoteFS_train)
# predictionSmoteFS = logSmoteFSRegModel.predict(X_SmoteFS_test)

# logSmoteFSReg_recall = recall_score(y_SmoteFS_test, predictionSmoteFS)
# logSmoteFSReg_f1score = f1_score(y_SmoteFS_test, predictionSmoteFS)

# print("\nOverall Logistic Regression Model Performance:")
# print(f'- Recall: {logSmoteFSReg_recall * 100:.2f} %')
# print(f'- F1-score: {logSmoteFSReg_f1score * 100:.2f} %')

<font color = "gray">Okay, so we're getting a slightly lower recall and F1-Score when resampled with RFE feature selection than without --> proceed with the RFE-selected features: getting very close performance with fewer features is a good trade-off </font>

<font color = "gray"><b>Parameter Fine Tuning</b></font>

<font color = "gray"><em>Let's go with the undersampled model. I know it significantly reduces the sample from 1K+ to < 500, but we're just playing around</em></font>

In [14]:
# param_grid = {
#     'penalty': ['l2'],
#     'C': np.logspace(-4, 4, 20),
#     'solver': ['liblinear','newton-cg', 'saga'],
#     'max_iter': [100, 1000, 2500, 5000]
# }

# gridSearch_log = GridSearchCV(estimator = logSmoteFSRegModel, param_grid = param_grid, cv = 5, scoring = 'f1')
# gridSearch_log.fit(X_SmoteFS_train, y_SmoteFS_train)

In [15]:
# print(f'Best Score: {gridSearch_log.best_score_ * 100:.2f} %')
# print(f'Best Parameters:\n{gridSearch_log.best_params_}')
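A quick way to see what this search costs is to count the combinations in the grid. The sketch below rebuilds the same `param_grid` and also checks where the `C` value used in the tuned model falls within `np.logspace(-4, 4, 20)`. The names `n_combinations`, `n_fits`, `tuned_C`, and `position` are illustrative.

```python
import numpy as np

param_grid = {
    'penalty': ['l2'],
    'C': np.logspace(-4, 4, 20),
    'solver': ['liblinear', 'newton-cg', 'saga'],
    'max_iter': [100, 1000, 2500, 5000],
}

# Number of parameter combinations and total model fits with cv = 5
n_combinations = int(np.prod([len(values) for values in param_grid.values()]))
n_fits = n_combinations * 5
print(n_combinations, n_fits)    # 240 1200

# Position of the tuned C value within the logarithmic grid (0-based)
tuned_C = 0.08858667904100823
position = int(np.argmin(np.abs(param_grid['C'] - tuned_C)))
print(position, param_grid['C'][position])
```

Since `max_iter` only caps the number of solver iterations, trimming that list is one simple way to shrink the search if it runs too long.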

<font color = "gray"><u>Hypertuned Model</u></font>

In [16]:
# X_SmoteFS_train, X_SmoteFS_test, y_SmoteFS_train, y_SmoteFS_test = train_test_split(X_smoteFS_resampled, y_smoteFS_resampled, test_size = 0.20, random_state = 23)
# logSmoteFSRegModel = LogisticRegression(C = 0.08858667904100823, max_iter = 100, penalty = 'l2', solver = 'newton-cg', random_state = 23)

# logSmoteFSRegModel.fit(X_SmoteFS_train, y_SmoteFS_train)
# predictionSmoteFS = logSmoteFSRegModel.predict(X_SmoteFS_test)

# logSmoteFSReg_accuracy = accuracy_score(y_SmoteFS_test, predictionSmoteFS)
# logSmoteFSReg_f1score = f1_score(y_SmoteFS_test, predictionSmoteFS)
# logSmoteFSReg_precision = precision_score(y_SmoteFS_test, predictionSmoteFS)
# logSmoteFSReg_recall = recall_score(y_SmoteFS_test, predictionSmoteFS)
# logSmoteFSReg_confMatrix = confusion_matrix(y_SmoteFS_test, predictionSmoteFS)


# print("Overall Logistic Regression Model Performance:")
# print(f'- Accuracy: {logSmoteFSReg_accuracy * 100:.2f} %')
# print(f'- Precision: {logSmoteFSReg_precision * 100:.2f} %')
# print(f'- Recall: {logSmoteFSReg_recall * 100:.2f} %')
# print(f'- F1-score: {logSmoteFSReg_f1score * 100:.2f} %')
# print(f'- Confusion Matrix: \n{logSmoteFSReg_confMatrix}')


# plt.figure(figsize=(8, 6))
# sns.heatmap(logSmoteFSReg_confMatrix, annot = True, fmt = 'd', cmap ='Blues', cbar = True,
#             xticklabels = ['Predicted 0', 'Predicted 1'],
#             yticklabels = ['Actual 0', 'Actual 1'])
# plt.xlabel('Predicted Label')
# plt.ylabel('True Label')
# plt.title('Logistic Regression Model With Hyperparameter Tuning Confusion Matrix')
# plt.show()

To recap, these were the evaluation metrics with no parameter tuning, class imbalance handling, or feature selection:
*   Accuracy --  %
*   Precision --  %
*   Recall --  %
*   F1-Score --  %

After handling the class imbalance in the dataset using SMOTE-ENN and some parameter tuning:
*   Accuracy:  %
*   Precision:  %
*   Recall:  %
*   F1-score: %

<font color = "gray"><b><u>Conclusion:</u></b>
In HR, it is very important to identify employees who are at risk of leaving the organization, so that talent can be retained and costs reduced. Hence, a high recall score is very important. This model ____</font>