# Classification Problem

**Author:** Manaranjan Pradhan<br>
**Email ID:** manaranjan@gmail.com<br>
**LinkedIn:** https://www.linkedin.com/in/manaranjanpradhan/<br>
**Website:** www.manaranjanp.com


## 1. HR - Attrition Analytics

Human resources are critical to any organization. Organizations spend a huge amount of time and money hiring and nurturing their employees, so it is a significant loss when employees leave, especially key ones. If HR can predict whether employees are at risk of leaving the company, it can identify attrition risks early and either provide the support needed to retain those employees or hire preventively to minimize the impact on the organization.

## 2. Data Set

This dataset is taken from Kaggle: https://www.kaggle.com/datasets/jacksonchou/hr-data-for-analytics


### 2.1 Dependent variable

- **left** : 0 if the employee did not leave, 1 if the employee left the company

### 2.2 Independent variables

- **satisfaction_level** : how satisfied the employee is (0 least satisfied, 1 most satisfied)
- **last_evaluation** : the employee's evaluation score for the last month (0 bad, 1 excellent)
- **number_project** : number of projects the employee worked on
- **average_montly_hours** : average number of hours the employee works per month
- **time_spend_company** : number of years the employee has spent in the company
- **Work_accident** : 0 if the employee did not have an accident, 1 if the employee had at least one
- **promotion_last_5years** : 0 if the employee had no promotion in the last 5 years, 1 if the employee had at least one
- **dept** : department in which the employee works
- **salary** : salary bracket (low, medium or high)

## 3. Loading Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

sn.set_palette("tab10")

In [None]:
hr_df = pd.read_csv('HR_comma_sep.csv')

In [None]:
hr_df.sample(10)

In [None]:
hr_df.average_montly_hours.min(), hr_df.average_montly_hours.max()

In [None]:
hr_df.info()

In [None]:
hr_df.left.value_counts()

## 3. EDA

**Question 1**: How does satisfaction level influence an employee's decision to leave?

In [None]:
plt.figure(figsize=(10, 5))
sn.histplot(data = hr_df,
            x = 'satisfaction_level',
            hue = 'left',
            bins = np.linspace(0.0, 1.0, 11),  # 10 equal bins covering [0, 1], including 1.0
            multiple="stack");

**Question 2**: How does time spent in the company influence an employee's decision to leave?

In [None]:
plt.figure(figsize=(10, 4))
sn.countplot(data = hr_df,
             x = 'time_spend_company',
             hue = 'left');

**Question 3**: Attrition patterns across different departments.

In [None]:
hr_df.dept.unique()

In [None]:
pd.crosstab(hr_df.dept,
            hr_df.left,
            normalize='index')

In [None]:
pd.crosstab(hr_df.Work_accident,
            hr_df.left,
            normalize='index')

## 4. Building a Classification Model

First, we will build a model that predicts *left* from *satisfaction level* alone.

### Setting X and y Values

In [None]:
X = hr_df[['satisfaction_level']]
y = hr_df.left

In [None]:
X[0:2]

In [None]:
y[0:2]

In [None]:
plt.figure(figsize=(10, 4))
sn.scatterplot(data = hr_df.sample(100, random_state = 78),
               x = 'satisfaction_level',
               y = 'left');

## KNN Classification

In [None]:
hr_df.head(5)

In [None]:
sn.scatterplot(data = hr_df.sample(20, random_state = 20),
               x = 'satisfaction_level',
               y = 'last_evaluation',
               hue = 'left');

In [None]:
X_features = ['satisfaction_level',
              'last_evaluation',
              'average_montly_hours',
              'time_spend_company']

X = hr_df[X_features]

y = hr_df.left

### Split Dataset into train and test

- Train: 80%
- Test: 20%

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 100)

In [None]:
X_train.shape

In [None]:
X_test.shape
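
If the two classes of `left` turn out to be imbalanced, a plain random split can leave the train and test sets with slightly different shares of leavers. The sketch below shows one possible variation, the `stratify` argument of `train_test_split`, on hypothetical labels. The arrays `X_demo` and `y_demo` and the 24% share are made up for illustration and are not taken from the HR dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 240 of 1000 employees left (24%).
y_demo = np.array([1] * 240 + [0] * 760)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    train_size=0.8,
    random_state=100,
    stratify=y_demo,  # keep the class ratio the same in both splits
)

print(f"share of leavers in train: {y_tr.mean():.3f}")
print(f"share of leavers in test:  {y_te.mean():.3f}")
```

In the notebook itself this would amount to passing `stratify = y` to the existing call, assuming that behaviour is wanted.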

### Build a KNN Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn_v1 = KNeighborsClassifier(n_neighbors = 10)

In [None]:
knn_v1.fit(X_train, y_train)
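
KNN classifies by distance, so a feature measured in the hundreds (such as monthly hours) can dominate features that lie between 0 and 1 (such as satisfaction level). The model above is fitted on unscaled features. A minimal sketch of one common remedy, standardizing inside a pipeline, is shown below on synthetic stand-in data. The data, the value ranges, and the names `knn_raw` and `knn_scaled` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the HR dataset): one feature in [0, 1]
# and one in the hundreds, on very different scales.
rng = np.random.default_rng(0)
n = 500
satisfaction = rng.uniform(0.0, 1.0, n)
hours = rng.uniform(96, 310, n)
X_demo = np.column_stack([satisfaction, hours])
y_demo = (satisfaction < 0.4).astype(int)  # label depends only on satisfaction

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=100
)

# KNN on raw features: distances are driven mostly by the large-scale column.
knn_raw = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

# KNN on standardized features: the scaler is fitted on the training data only.
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=10)
).fit(X_tr, y_tr)

acc_raw = knn_raw.score(X_te, y_te)
acc_scaled = knn_scaled.score(X_te, y_te)
print(f"accuracy without scaling: {acc_raw:.3f}")
print(f"accuracy with scaling:    {acc_scaled:.3f}")
```

Putting the scaler inside the pipeline means the test rows never influence the scaling statistics, which keeps the evaluation honest.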

## 5. Evaluating Classification Models

- How many employees who left (y = 1) are correctly classified?
- How many employees who did not leave (y = 0) are correctly classified?

**Four Scenarios**:

| Actual | Predicted | Implications |
|---------|----------|--------------|
| Left | Left | **Correct Classification** |
| Left | Not Left | **Misclassification**: This failure has a higher cost, as the model fails to identify some employees who are likely to leave. |
| Not Left | Left | **Misclassification**: This failure may not be as costly as the previous one. |
| Not Left | Not Left | **Correct Classification** |


## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

### What are different accuracy measures?

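**Total Accuracy** is the share of correct classifications across the complete test sample. Here TP, TN, FP and FN denote the counts of true positives, true negatives, false positives and false negatives, with *left* (y = 1) as the positive class.

$\text{Total Accuracy} = ({\frac {TP + TN}{TP+FP+FN+TN}})$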

**Precision** is how many of the cases the model identifies as positive are actually positive, and is defined as

$Precision = ({\frac {TP}{TP+FP}})$

**True Positive Rate (TPR) or Recall or Sensitivity** is how many of the actual positives in the test set are correctly identified by the model, and is defined as

$TPR = ({\frac {TP}{TP+FN}})$

**True Negative Rate (TNR) or Specificity** is how many of the actual negatives in the test set are correctly identified as negative, and is defined as

$TNR = ({\frac {TN}{FP+TN}})$

**F-Score (F-Measure)** is another measure used in binary classification. It combines precision and recall as their harmonic mean and is given by

$F1\text{-}score = ({\frac {2 \times Precision \times Recall}{Precision + Recall}})$

The *classification_report* function in *sklearn.metrics* gives a detailed report of precision, recall and f1-score for each class.
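
To make the definitions above concrete, the following sketch computes each measure by hand from a set of hypothetical confusion-matrix counts. The counts are invented for illustration and are not results from the models in this notebook.

```python
# Hypothetical confusion-matrix counts (not results from this notebook).
TP, FN = 80, 20  # actual "left": predicted left / predicted not left
FP, TN = 10, 90  # actual "not left": predicted left / predicted not left

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # true positive rate, sensitivity
specificity = TN / (FP + TN)     # true negative rate
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:    {accuracy:.3f}")
print(f"precision:   {precision:.3f}")
print(f"recall:      {recall:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"f1-score:    {f1:.3f}")
```

With these counts the model looks fairly accurate overall, yet it still misses 20 of the 100 employees who actually left, which is the costlier kind of error in the attrition setting.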

### Evaluate the model

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, accuracy_score

In [None]:
y_knn_pred = knn_v1.predict(X_test)

In [None]:
y_knn_df = pd.DataFrame({"actual" : y_test, "predicted": y_knn_pred})

In [None]:
y_knn_df.sample(10, random_state = 20)

In [None]:
cm_knn = confusion_matrix(y_knn_df.actual, y_knn_df.predicted, labels = [1, 0])

In [None]:
cm_knn

In [None]:
cm_knn_plot = ConfusionMatrixDisplay(cm_knn, display_labels=['Left', 'Not Left'])

In [None]:
cm_knn_plot.plot();
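
Because the matrix above is built with `labels = [1, 0]`, the positive class (*left*) occupies the first row and first column. The small sketch below, on hypothetical label lists, shows how the four cells map to TP, FN, FP and TN under that ordering. The names `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = left, 0 = not left).
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes,
# both in the order given by `labels`.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 0])

# With labels=[1, 0] the flattened matrix reads TP, FN, FP, TN.
tp, fn, fp, tn = cm_demo.ravel()

print(cm_demo)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```

Note that with the default label order (`[0, 1]`) the same `ravel()` call would instead return TN, FP, FN, TP.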

In [None]:
recall_score(y_knn_df.actual, y_knn_df.predicted)

In [None]:
print(classification_report(y_knn_df.actual, y_knn_df.predicted))

### Which is the best K i.e. n_neighbors?

In [None]:
k_vals = list(range(9, 16, 1))

In [None]:
k_vals

In [None]:
from sklearn.metrics import recall_score

In [None]:
for k in k_vals:
    knn_v1 = KNeighborsClassifier(n_neighbors = k)
    knn_v1.fit(X_train, y_train)
    y_knn_pred = knn_v1.predict(X_test)
    y_knn_df = pd.DataFrame({"actual" : y_test,
                             "predicted": y_knn_pred})
    recall_val = recall_score(y_knn_df.actual,
                              y_knn_df.predicted)
    print(f"for n_neighbors: {k} - recall score: {round(recall_val, 3)}")
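
The loop above compares values of `n_neighbors` using recall on the test set, which means the test set is also being used to pick the model. One alternative, sketched below under stated assumptions, is to choose K by cross-validation on the training data only and keep the test set for a final check. The data here is synthetic (generated with `make_classification`), and `search`, `best_k` and `test_recall` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with four features (not the HR dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    random_state=100,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=100
)

# 5-fold cross-validation over the same K range as the loop, scored by recall.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(9, 16))},
    scoring="recall",
    cv=5,
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
# The test set is touched only once, after K has been chosen.
test_recall = recall_score(y_te, search.predict(X_te))

print(f"best n_neighbors from cross-validation: {best_k}")
print(f"recall on the held-out test set: {test_recall:.3f}")
```

By default `GridSearchCV` refits the best configuration on the full training data, so `search.predict` uses the selected K.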

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# random_state makes the fitted tree reproducible across runs
tree = DecisionTreeClassifier(max_depth = 5, random_state = 100)

In [None]:
tree.fit( X_train, y_train )

In [None]:
y_tree_pred = tree.predict( X_test )

In [None]:
accuracy_score(y_test, y_tree_pred)

### Feature Importance

In [None]:
tree.feature_importances_

In [None]:
recall_score(y_test, y_tree_pred)

In [None]:
features_df = pd.DataFrame( { "features": X_features,
                              "importance": tree.feature_importances_ } )

In [None]:
features_df = features_df.sort_values("importance", ascending = False)

In [None]:
features_df['cumsum'] = features_df.importance.cumsum()

In [None]:
features_df

### Visualizing Decision Tree

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize = (50, 12))
plot_tree(tree,
          feature_names = X_features,
          class_names = ['Not Left', 'Left'],
          filled = True,
          fontsize = 10);
plt.savefig('tree.png')