# Predicting and Explaining Employee Turnover at Hilton

Our objective is to train and evaluate a model that predicts whether employees intend to stay at or leave Hilton. We also want to understand which factors explain that intention.

## 1. Notebook Styling and Package Management

In [None]:
import numpy as np # Library for numerical operations
import pandas as pd # Library for data handling
import sklearn # The machine learning library we will be using in this entire course
from sklearn import tree # Utilities for visualizing decision trees
from sklearn.metrics import * # Metric functions used below (f1_score, log_loss, ...)
from sklearn.tree import DecisionTreeClassifier # Importing Decision Tree Classifier
from sklearn.ensemble import RandomForestClassifier # Importing Random Forest Classifier
from sklearn.preprocessing import MinMaxScaler, LabelEncoder # Feature scaling and label encoding
from sklearn.ensemble import GradientBoostingClassifier # Importing Gradient Boosting Classifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # Hyperparameter search utilities
from xgboost import XGBClassifier # Importing the XGBoost Classifier 
import matplotlib.pyplot as plt # Importing the package for plotting
plt.style.use('fivethirtyeight') # Use the styling from FiveThirtyEight Website
import seaborn as sns # Importing another package for plotting

## 2. Load Data

In [None]:
trainInput = pd.read_csv("Data/hilton_2024_train.csv") 
testInput = pd.read_csv("Data/hilton_2024_test.csv")

In [None]:
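trainData = trainInput.drop(columns = 'Stay')
testData = testInput.drop(columns = 'Stay')

# Fit the encoder once on the training labels and reuse it for the test labels,
# so both sets share the same class-to-integer mapping
labelEncoder = LabelEncoder().fit(trainInput.Stay)
trainLabels = labelEncoder.transform(trainInput.Stay)
testLabels = labelEncoder.transform(testInput.Stay)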
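
To make the label encoding concrete, here is a minimal plain-Python sketch of what `LabelEncoder` does: it sorts the distinct class names and numbers them from 0. The label values `"Stay"` and `"Leave"` are assumptions for illustration; the actual contents of the `Stay` column are not shown in this notebook.

```python
# Hypothetical label values; the real contents of the Stay column may differ.
train_stay = ["Stay", "Leave", "Stay", "Stay", "Leave"]
test_stay = ["Stay", "Stay", "Leave"]

# LabelEncoder sorts the distinct classes and assigns integer codes from 0.
classes = sorted(set(train_stay))
mapping = {label: code for code, label in enumerate(classes)}

# Reusing one mapping for both sets keeps the codes consistent.
train_codes = [mapping[value] for value in train_stay]
test_codes = [mapping[value] for value in test_stay]

print(mapping)
print(train_codes)
print(test_codes)
```

Under this assumption the class that sorts last alphabetically receives code 1, which is the class later treated as positive.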

In [None]:
trainData.info()

## 3. Train an XGBoost Classifier

In [None]:
clf = XGBClassifier(random_state = 1)
clf.fit(trainData, trainLabels)

## 4. Evaluate the Classifier Using the Testing Data (Recommended)

We first import helper functions from a custom module called custom_functions. The file custom_functions.py must be in the current working directory.

In [None]:
from custom_functions import plot_conf_mat, plot_roc_curve, plot_feature_importance, calculateMetricsAndPrint

#### 4.1. Confusion Matrix:

In [None]:
plot_conf_mat(clf, # The classifier object
              testData, # The held-out test features loaded from hilton_2024_test.csv
              testLabels # Actual labels
             )
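
As a rough illustration of what the confusion matrix contains, the sketch below counts the four cells by hand on toy labels (not the Hilton data). It assumes 1 is the positive class and uses the same layout as `sklearn.metrics.confusion_matrix`: rows are actual labels, columns are predicted labels.

```python
# Toy labels for illustration only; 1 is treated as the positive class.
actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

# Rows = actual (0, 1), columns = predicted (0, 1).
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```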

#### 4.2. Accuracy, Precision, Recall, AUC, and F1:

In [None]:
# Manual ratio of two counts, presumably read off the confusion matrix above
1312/(1312+252)

In [None]:
predictedProbabilities = clf.predict_proba(testData)
predictedLabels = clf.predict(testData) 
calculateMetricsAndPrint(predictedLabels, predictedProbabilities, testLabels)
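
The source of `calculateMetricsAndPrint` is not shown here, so the following is only a sketch of the standard definitions it presumably reports, computed from hypothetical confusion-matrix counts.

```python
# Hypothetical counts for illustration; not taken from the Hilton test set.
tp, fn, fp, tn = 4, 1, 1, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```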

In [None]:
# Note: for single-label classification, micro-averaged F1 equals accuracy
print("F1 Score (micro):", f1_score(testLabels, predictedLabels, average='micro'))
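
The choice of `average` matters. The sketch below, on toy labels, compares F1 for the positive class alone with micro-averaged F1, which pools true positives, false positives, and false negatives over both classes. The helper names are hypothetical.

```python
# Toy labels for illustration only.
actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
pairs = list(zip(actual, predicted))

def counts(cls):
    """Return (tp, fp, fn) when cls is treated as the positive class."""
    tp = sum(a == cls and p == cls for a, p in pairs)
    fp = sum(a != cls and p == cls for a, p in pairs)
    fn = sum(a == cls and p != cls for a, p in pairs)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

binary_f1 = f1_from_counts(*counts(1))                 # positive class only
pooled = [sum(c) for c in zip(counts(0), counts(1))]   # pool counts over both classes
micro_f1 = f1_from_counts(*pooled)
accuracy = sum(a == p for a, p in pairs) / len(pairs)

print(binary_f1, micro_f1, accuracy)
```

On this toy data the micro-averaged value coincides with accuracy, while the positive-class F1 differs, so it is worth stating which one is being reported.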

#### 4.3. ROC Curve:

In [None]:
predictedProbabilities

In [None]:
positiveProbabilities = predictedProbabilities[:,1]

In [None]:
plot_roc_curve(testLabels, # Actual labels
               positiveProbabilities, # Prediction scores for the positive class
               pos_label = 1 # Indicate the label that corresponds to the positive class
              )
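
For intuition about what the ROC curve plots, here is a minimal sketch that sweeps the decision threshold over toy scores and records one (false positive rate, true positive rate) point per step. It assumes no tied scores; with ties, equal scores should be processed as a group. The area under the curve is then approximated with the trapezoid rule.

```python
# Toy labels and positive-class scores for illustration only (no tied scores).
actual = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

n_pos = sum(actual)
n_neg = len(actual) - n_pos

# Lower the threshold one observation at a time, from highest score to lowest.
points = [(0.0, 0.0)]
tp = fp = 0
for label, _ in sorted(zip(actual, scores), key=lambda pair: -pair[1]):
    if label == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)

# Trapezoid rule over consecutive ROC points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(auc)
```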

#### 4.4. Log-Loss

In [None]:
temp = pd.DataFrame(positiveProbabilities)
temp.columns = ["prob_one"]
temp["Labels"] = testLabels

sns.histplot(data=temp, x="prob_one", hue="Labels")
plt.show()

In [None]:
log_loss(testLabels, positiveProbabilities)
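
Log-loss is the average negative log-probability assigned to the true label, so confident wrong predictions are penalized heavily. The sketch below computes the binary case by hand on toy values; the function name and the clipping constant `eps` are assumptions for illustration, not the exact scikit-learn implementation.

```python
import math

# Toy labels and predicted probabilities of class 1, for illustration only.
actual = [1, 0, 1, 1]
prob_one = [0.9, 0.2, 0.6, 0.3]

def binary_log_loss(labels, probs, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

loss = binary_log_loss(actual, prob_one)
# A confident but wrong prediction contributes far more than an uncertain one.
confident_wrong = binary_log_loss([1], [0.01])
uncertain = binary_log_loss([1], [0.5])
print(loss, confident_wrong, uncertain)
```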

## 5. Apply the Model to Kaggle Data:

In [None]:
kaggleTest = pd.read_csv("Data/hilton_2024_kaggle.csv") 

In [None]:
kaggleTest['score'] = clf.predict_proba(kaggleTest.drop(columns = 'unique_id'))[:,1]
kaggleTest[['unique_id','score']].to_csv("Data/Kaggle_Submission.csv", index = False)
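
Before uploading, it can help to sanity-check the submission file. The sketch below runs three basic checks on an in-memory stand-in for the CSV. The ids and scores are made up, and the assumption that the competition expects exactly the `unique_id` and `score` columns comes only from the cell above.

```python
import csv
import io

# Hypothetical submission content; real ids and scores will differ.
submission_text = "unique_id,score\n101,0.83\n102,0.12\n103,0.57\n"
rows = list(csv.DictReader(io.StringIO(submission_text)))

ids = [row["unique_id"] for row in rows]
scores = [float(row["score"]) for row in rows]

has_expected_columns = set(rows[0].keys()) == {"unique_id", "score"}
ids_unique = len(ids) == len(set(ids))                  # one row per id
scores_valid = all(0.0 <= s <= 1.0 for s in scores)     # scores are probabilities

print(has_expected_columns, ids_unique, scores_valid)
```

To check the real file, the same logic could be applied after opening `Data/Kaggle_Submission.csv` instead of the in-memory string.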

Please submit to https://www.kaggle.com/t/72810a50d9ea40a287e0fdb347a07db9

Each team should make at least three submissions per week from the launch until we close the competition. 