# HR - Attrition Analytics

Human resources are critical to any organization. Organizations spend a huge amount of time and money hiring and nurturing their employees, so it is a significant loss when employees leave, especially key ones. If HR can predict whether employees are at risk of leaving the company, it can identify attrition risks, provide the support needed to retain those employees, or hire preventively to minimize the impact on the organization.

This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics

Fields in the dataset include:

- Employee satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary
- Whether the employee has left


### Loading Datasets and exploring metadata

In [None]:
import pandas as pd
import numpy as np

In [None]:
hr_df = pd.read_csv('HR_comma_sep.csv')

In [None]:
hr_df.sample(10)

In [None]:
hr_df.info()

### EDA

- How does satisfaction level influence an employee's decision to leave?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
sl_left = hr_df[hr_df.left == 1]['satisfaction_level']
sl_not_left = hr_df[hr_df.left == 0]['satisfaction_level']

In [None]:
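# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sn.histplot(sl_left, kde=True, stat='density', label='Left')
sn.histplot(sl_not_left, kde=True, stat='density', label='Not Left')
plt.legend();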

### Participant Exercise: 1

- How does the last evaluation influence an employee's decision to leave?
- How does time_spend_company influence an employee's decision to leave?
    - Hint: Use Count Plot (Refer to: https://seaborn.pydata.org/generated/seaborn.countplot.html)

### Encoding Categorical Features

#### Exploring Categorical Features

In [None]:
hr_df.dept.unique()

In [None]:
hr_df.salary.unique()

- OHE - One Hot Encoding 
- Dummy Variable Creation

In [None]:
salary_dict = { 'low' : 1,
                'medium': 2,
                'high': 4}

In [None]:
hr_df['salary'] = hr_df.salary.map(salary_dict)

In [None]:
hr_df.sample(10)

In [None]:
hr_encoded_df = pd.get_dummies( hr_df,
                                columns = ['dept'] )

In [None]:
hr_encoded_df.head(5)


In [None]:
hr_encoded_df.info()
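
One-hot encoding and dummy-variable creation differ only in whether one level is dropped. The sketch below uses a small made-up `dept` column (the values are illustrative, not taken from the dataset) to show both forms with `pd.get_dummies`:

```python
import pandas as pd

# Toy frame with a single categorical column (illustrative values only)
demo_df = pd.DataFrame({"dept": ["sales", "IT", "sales", "hr"]})

# One-hot encoding: one indicator column per category
ohe_df = pd.get_dummies(demo_df, columns=["dept"])

# Dummy variables: drop the first category to avoid a redundant column
dummy_df = pd.get_dummies(demo_df, columns=["dept"], drop_first=True)

print(ohe_df)
print(dummy_df)
```

With `drop_first=True`, a row with all indicators equal to zero represents the dropped category. For linear models such as logistic regression this removes the exact linear dependence between the indicator columns.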

### Setting X and Y Variables

In [None]:
hr_encoded_df.columns

In [None]:
X_features = list(hr_encoded_df.columns)

In [None]:
X_features

In [None]:
X_features.remove('left')

In [None]:
X_features

### Building a model using only one variable
#### Setting X & y variable

In [None]:
X = hr_encoded_df[['satisfaction_level']]
y = hr_encoded_df.left

In [None]:
X[0:2]

In [None]:
y[0:2]

### Split Dataset into train and test

- Train: 80%
- Test: 20%
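
A minimal sketch of an 80/20 split on made-up data, to show what `test_size` does to the shapes. The `stratify` argument is an optional addition that this notebook does not use; it keeps the class ratio the same in both parts, which can help when the target is imbalanced:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 rows, 2 features, 20% positives (illustrative only)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,       # 20% of the rows go to the test set
    random_state=100,    # fixed seed so the split is reproducible
    stratify=y_demo,     # assumption: preserve the class ratio in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```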

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, \
y_train, y_test = train_test_split( X,
                                    y,
                                    test_size = 0.2,
                                    random_state = 100 )

In [None]:
X_train.shape

In [None]:
X_test.shape

### Build a Model: V1

Logistic Regression Model - Sigmoid function

<img src="Logistic.png" alt="ML Algorithms" width="600"/>

<img src="Logistic2.png" alt="Logistic Regression" width="800"/>
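
The sigmoid function maps a linear score `z = b0 + b1 * x` to a probability between 0 and 1. A minimal sketch, assuming an intercept of 0.89 and a slope of -3.71 (the rounded values this notebook uses for the single-variable model); the `sigmoid` helper is defined here for illustration and is not a scikit-learn function:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients (rounded values from the single-variable model)
b0, b1 = 0.89, -3.71

satisfaction = np.array([0.1, 0.5, 0.9])
prob_left = sigmoid(b0 + b1 * satisfaction)

for sl, p in zip(satisfaction, prob_left):
    print(f"satisfaction={sl:.1f} -> P(left)={p:.3f}")
```

Because the slope is negative, the predicted probability of leaving falls as satisfaction rises.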

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg_v1 = LogisticRegression(random_state = 100, 
                               max_iter = 1000)

In [None]:
logreg_v1.fit( X_train, y_train )

### Finding Parameters

In [None]:
logreg_v1.intercept_

In [None]:
logreg_v1.coef_

In [None]:
# This model has a single feature, so pair the coefficient with the columns of X
dict( zip( X.columns, np.round(logreg_v1.coef_[0], 2) ) )

In [None]:
sl_list = np.arange(0, 1, 0.05)

In [None]:
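b0 = logreg_v1.intercept_[0]   # fitted intercept (about 0.89 in the original run)
b1 = logreg_v1.coef_[0][0]     # fitted satisfaction_level coefficient (about -3.71 in the original run)
sl_probs = [(1.0 / (1.0 + np.exp(-(b0 + b1 * x)))) for x in sl_list]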

In [None]:
sl_probs_df = pd.DataFrame({'SL': sl_list, 'Prob_Left': sl_probs })

In [None]:
sl_probs_df

In [None]:
sn.lineplot(data=sl_probs_df, x="SL", y="Prob_Left");

### Building a model using all variables
#### Setting X & y variable

In [None]:
X = hr_encoded_df[X_features]
y = hr_encoded_df.left

In [None]:
X[0:2]

In [None]:
y[0:2]

### Split Dataset into train and test

- Train: 80%
- Test: 20%

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, \
y_train, y_test = train_test_split( X,
                                    y,
                                    test_size = 0.2,
                                    random_state = 100 )

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg_v1 = LogisticRegression(random_state = 100, 
                               max_iter = 1000)

In [None]:
logreg_v1.fit( X_train, y_train )

### Finding Parameters

In [None]:
logreg_v1.intercept_

In [None]:
logreg_v1.coef_

In [None]:
dict( zip( X_features, np.round(logreg_v1.coef_[0], 2) ) )

### Predict on Test Set

- p(y = 1) >= 0.5: predict y = 1 (L, Left)
- p(y = 1) < 0.5: predict y = 0 (NL, Not Left)

In [None]:
logreg_v1.predict_proba( X_test )

In [None]:
pred_logreg_v1 = logreg_v1.predict( X_test )

In [None]:
y_logreg_v1 = pd.DataFrame( { "actual": y_test,
                              "predicted": pred_logreg_v1 } )

In [None]:
y_logreg_v1.sample(10, random_state = 20)
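
The link between `predict_proba` and `predict` can be checked on synthetic data. The sketch below (all names and data are illustrative) thresholds the class-1 probability at 0.5 and compares the result with `predict`, then applies a lower, hypothetical threshold of 0.3 to flag more employees as at risk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds P(y = 1)
p_left = model.predict_proba(X_demo)[:, 1]

# Thresholding at 0.5 reproduces predict(); exact ties at 0.5 are
# vanishingly rare with continuous data
default_pred = (p_left > 0.5).astype(int)
print(np.array_equal(default_pred, model.predict(X_demo)))

# A lower threshold (hypothetical choice) trades precision for recall
lower_threshold_pred = (p_left >= 0.3).astype(int)
print(default_pred.sum(), lower_threshold_pred.sum())
```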

### Evaluating the model

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_logreg_v1.actual, y_logreg_v1.predicted)

### Building Confusion Matrix

<img src="confusion_matrix.png" alt="Confusion Matrix" width="600"/>

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm = confusion_matrix(y_logreg_v1.actual, y_logreg_v1.predicted, labels = [1,0])

In [None]:
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, 
                                 display_labels=['Left', 'Not Left'])

In [None]:
cm_plot.plot();

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_logreg_v1.actual, y_logreg_v1.predicted))
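
The figures in the classification report can be derived by hand from the confusion matrix. A small sketch on made-up labels, using the same `labels = [1, 0]` ordering as above so that the first row and column refer to employees who left:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 3 employees left, 7 stayed (illustrative only)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0])

cm_demo = confusion_matrix(y_true, y_pred, labels=[1, 0])

# With labels=[1, 0]: row 0 = actual Left, row 1 = actual Not Left
tp, fn = cm_demo[0]
fp, tn = cm_demo[1]

accuracy = (tp + tn) / cm_demo.sum()
recall = tp / (tp + fn)        # share of leavers the model caught
precision = tp / (tp + fp)     # share of predicted leavers who actually left

print(cm_demo)
print(accuracy, recall, precision)
```

In this toy case accuracy is 0.7 while recall is only 1/3, which is why accuracy alone can be misleading when the goal is to catch employees who leave.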

### KNN Model

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
sn.lmplot( data = hr_df.sample(20),
           x = 'satisfaction_level',
           y = 'last_evaluation',
           hue = 'left',
           fit_reg = False,
           height = 6);   # 'size' was renamed to 'height' in seaborn 0.9

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn_v1 = KNeighborsClassifier(n_neighbors = 10, weights='uniform')

In [None]:
knn_v1.fit(X_train, y_train)

In [None]:
knn_pred = knn_v1.predict(X_test)

In [None]:
accuracy_score(y_test, knn_pred)

In [None]:
cm = confusion_matrix(y_test, knn_pred, labels = [1,0])
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, 
                                 display_labels=['Left', 'Not Left'])
cm_plot.plot();

In [None]:
print(classification_report(y_test, knn_pred))

In [None]:
from sklearn.metrics import recall_score

In [None]:
recall_score(y_test, knn_pred)
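
The `weights` argument of `KNeighborsClassifier` decides how the neighbors vote. With `'uniform'` each neighbor counts equally; with `'distance'` closer neighbors count more (weight 1/distance). A tiny hand-built sketch (the points are illustrative) where the two settings disagree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three training points on a line: one close point of class 1,
# two farther points of class 0 (illustrative only)
X_demo = np.array([[1.0], [4.0], [5.0]])
y_demo = np.array([1, 0, 0])
query = np.array([[0.0]])

knn_uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_demo, y_demo)
knn_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_demo, y_demo)

# Uniform: two votes for class 0 against one vote for class 1
uniform_pred = knn_uniform.predict(query)[0]

# Distance: class 1 gets 1/1 = 1.0, class 0 gets 1/4 + 1/5 = 0.45
distance_pred = knn_distance.predict(query)[0]

print(uniform_pred, distance_pred)
```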

### Participant Exercise: 2

- Grid Search
- Find the optimal hyperparameters
    - n_neighbors [5 to 20]
    - weights: ['uniform', 'distance']

### Building a Decision Tree Model

<img src="decisiontree.png" alt="decision tree"/>

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_v2 = DecisionTreeClassifier( max_depth = 5 )
# max_depth is a hyperparameter: it is set before training, not learned from the data

In [None]:
tree_v2.fit( X_train, y_train )

In [None]:
y_tree_pred = tree_v2.predict( X_test )

In [None]:
accuracy_score(y_test, y_tree_pred)

### Participant Exercise: 3

- Build the confusion matrix
- Calculate total accuracy, recall score

### Participant Exercise: 4

Grid Search for Decision Tree

- Search for max_depth from 5 to 15
- cv = 10
- scoring = 'recall'

### Feature Importance

In [None]:
tree_v2.feature_importances_

In [None]:
features_df = pd.DataFrame( { "features": X_features,
                              "importance": tree_v2.feature_importances_ } )

In [None]:
features_df.sort_values("importance", ascending = False)

### Visualizing Decision Tree

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize = (50, 12))
plot_tree(tree_v2,
          feature_names = X_features,
          class_names = ['Not Left', 'Left'],
          filled = True,
          fontsize = 10);
plt.savefig('tree.png')

In [None]:
from IPython import display

In [None]:
display.Image("tree.png")

In [None]:
params = { "max_depth": range(5, 30)}
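
A parameter grid like the one above is typically passed to `GridSearchCV`, which fits one model per candidate value and cross-validation fold. A minimal sketch on synthetic data from `make_classification` (not the HR dataset), assuming `cv = 10` and `scoring = 'recall'` as in Participant Exercise 4:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500,
                                     n_features=8,
                                     random_state=100)

params = {"max_depth": range(5, 30)}

grid = GridSearchCV(DecisionTreeClassifier(random_state=100),
                    param_grid=params,
                    cv=10,             # 10-fold cross-validation
                    scoring="recall")  # pick the depth with the best mean recall
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

After fitting, `grid.best_estimator_` is a tree refit on all the supplied data with the selected `max_depth`, and `grid.cv_results_` holds the per-candidate scores.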