# Classification Using Decision Trees

**Author:** Manaranjan Pradhan</br>
**Email ID:** manaranjan@gmail.com</br>
**LinkedIn:** https://www.linkedin.com/in/manaranjanpradhan/</br>
**Website:** www.manaranjanp.com


## 1. HR - Attrition Analytics

Human resources are critical to any organization. Organizations spend a huge amount of time and money to hire and nurture their employees. It is a huge loss for companies if employees leave, especially the key resources. So if HR can predict whether employees are at risk of leaving the company, it can identify the attrition risks and either provide the necessary support to retain those employees or do preventive hiring to minimize the impact on the organization.

## 2. Data Set

This dataset is taken from Kaggle: https://www.kaggle.com/datasets/jacksonchou/hr-data-for-analytics


### 2.1 Dependent variable

**left** : 0 if the employee did not leave, 1 if the employee left the company

### 2.2 Independent variables

- **satisfaction_level** : how satisfied the employee is (0 least satisfied, 1 most satisfied)
- **last_evaluation** : the employee's evaluation for the last month (0 bad, 1 excellent)
- **number_project** : number of projects the employee worked on
- **average_montly_hours** : average number of hours the employee spends at work per month
- **time_spend_company** : number of years the employee has spent in the company
- **Work_accident** : 0 if the employee did not have an accident, 1 if the employee had at least one
- **promotion_last_5years** : 0 if the employee did not have any promotion in the last 5 years, 1 if the employee had at least one
- **dept** : department in which the employee works
- **salary** : salary bracket (high, medium or low)

## 3. Loading Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

sn.set_palette("tab10")

In [None]:
hr_df = pd.read_csv('HR_comma_sep.csv')

In [None]:
hr_df.sample(10)

In [None]:
hr_df.info()

## 4. EDA

In [None]:
sn.scatterplot(data = hr_df.sample(20, random_state = 48),
               x = 'satisfaction_level',
               y = 'last_evaluation',
               hue = 'left');

## 5. Finding Decision Rules or Boundaries

- Decision rules or boundaries in decision trees are determined based on the feature values and their corresponding thresholds to create partitions that maximize the separation of different classes.

- Decision trees make splits in the data based on features that provide the most information gain or reduction in impurity.
 
- Impurity estimation is a concept used in decision tree classification models to measure the homogeneity or impurity of a set.

There are several impurity estimation metrics commonly used in decision tree algorithms, including:

- Gini impurity: It measures the probability of misclassifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution at that node. A lower Gini impurity indicates a more homogeneous node.

$$Gini(p) = 1 - \sum_{i=1}^{c} p_i^2$$

- Entropy: It measures the level of disorder or uncertainty in a set of samples. Entropy is at its maximum when all class labels are equally probable and decreases as the distribution becomes more skewed. The goal is to minimize entropy at each node. For a binary target, where $p$ is the proportion of one class:

$$Entropy(p) = -p \log_2(p) - (1-p) \log_2(1-p)$$
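The two formulas above can be sketched in plain Python. The function names `gini_impurity` and `entropy` are illustrative helpers, not part of any library, and the class proportions in the loop are made-up values chosen to show a pure node, an evenly mixed node, and a skewed node.

```python
import math

def gini_impurity(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Entropy in bits; classes with a proportion of 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# Illustrative proportions: pure node, 50/50 node, skewed node
for props in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1]):
    print(props, round(gini_impurity(props), 3), round(entropy(props), 3))
```

Both measures are zero for the pure node and largest for the 50/50 node, which is why either can be used to score candidate splits.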


### 5.1. Total Gini Impurity and Information Gain

- For a binary target, Gini impurity ranges between 0 and 0.5.
- 0.0 represents a perfectly pure node (all samples belong to the same class).
- 0.5 indicates maximum impurity (samples are evenly distributed across the two classes).

In [None]:
gini_total = 1 - pow(7/20, 2) - pow(13/20, 2)
gini_total

#### If decision rule is satisfaction_level < 0.2

In [None]:
sn.scatterplot(data = hr_df.sample(20, random_state = 48),
               x = 'satisfaction_level',
               y = 'last_evaluation',
               hue = 'left');
plt.axvline(x = 0.2, color = 'r', label = 'cut off');

In [None]:
gini_1 = 1 - pow(2/2, 2) - pow(0/2, 2)
gini_1

In [None]:
gini_2 = 1 - pow(5/18, 2) - pow(13/18, 2)
gini_2

### 5.2. Information gain

- Information gain is a concept used in decision trees to measure the reduction in entropy or impurity achieved by splitting the data based on a specific attribute. 

- Information gain is calculated by comparing the impurity (entropy or Gini) of the parent node before the split with the weighted average impurity of the child nodes after the split.

Let's compute the impurity after the split:

$$Gini_{childnodes} = Gini_{leftchild} \times P(left_{child}) + Gini_{rightchild} \times P(right_{child})$$

So, the reduction in impurity, i.e. the information gain, is 

$$\text{Information Gain} = Gini_{parent} - Gini_{childnodes}$$



In [None]:
gini_childnodes = ((2/20)*gini_1 + (18/20)*gini_2)
gini_childnodes

In [None]:
information_gain = gini_total - gini_childnodes
round(information_gain, 2)
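The step-by-step calculation above can also be wrapped in a small helper that works directly from class counts. This is a minimal sketch: `gini_from_counts` and `gini_gain` are hypothetical helper names, and the counts are the ones used in the cells above (7 and 13 in the parent, 2 and 0 in one child, 5 and 13 in the other).

```python
def gini_from_counts(counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(parent_counts, children_counts):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini_from_counts(child)
                   for child in children_counts)
    return gini_from_counts(parent_counts) - weighted

# Counts taken from the satisfaction_level < 0.2 split above
gain_02 = gini_gain([7, 13], [[2, 0], [5, 13]])
print(round(gain_02, 2))  # 0.09
```

Working from counts rather than hand-typed fractions makes it easier to try other candidate rules without repeating the arithmetic.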

#### If decision rule is satisfaction_level < 0.45

In [None]:
sn.scatterplot(data = hr_df.sample(20, random_state = 48),
               x = 'satisfaction_level',
               y = 'last_evaluation',
               hue = 'left');
plt.axvline(x = 0.45, color = 'r', label = 'cut off');

In [None]:
gini_1 = 1 - pow(5/7, 2) - pow(2/7, 2)
gini_1

In [None]:
gini_2 = 1 - pow(2/13, 2) - pow(11/13, 2)
gini_2

In [None]:
information_gain = gini_total - ((7/20)*gini_1 + (13/20)*gini_2)
round(information_gain, 2)

#### Note:

- The decision rule *satisfaction_level < 0.45* is better than *satisfaction_level < 0.2* because it yields a higher information gain (about 0.14 versus 0.09).
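Instead of counting points on the plot by hand, the same comparison can be sketched directly on a DataFrame. The helper names `gini_of` and `gain_for_threshold` are hypothetical, and `toy_df` holds made-up values that only mimic the column names of the HR data, so the gains printed here differ from the ones computed above.

```python
import pandas as pd

def gini_of(labels):
    """Gini impurity of a Series of class labels."""
    props = labels.value_counts(normalize=True)
    return 1.0 - (props ** 2).sum()

def gain_for_threshold(df, feature, threshold, target="left"):
    """Gini-based gain of the rule `feature < threshold`."""
    mask = df[feature] < threshold
    below, above = df[mask], df[~mask]
    weighted = (len(below) * gini_of(below[target])
                + len(above) * gini_of(above[target])) / len(df)
    return gini_of(df[target]) - weighted

# Made-up sample for illustration only (not taken from HR_comma_sep.csv)
toy_df = pd.DataFrame({
    "satisfaction_level": [0.10, 0.15, 0.40, 0.42, 0.60, 0.70, 0.80, 0.90],
    "left":               [1,    1,    1,    1,    0,    0,    0,    1],
})

gains = {t: gain_for_threshold(toy_df, "satisfaction_level", t)
         for t in (0.2, 0.45)}
print(gains)
```

A decision tree learner performs essentially this search over many candidate thresholds and features, keeping the rule with the largest gain at each node.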

### Ex1: Participant Exercise

1. Find the information gain for each of the following decision rules:
    - last_evaluation < 0.55
    - last_evaluation < 0.85
    
2. Which of the above decision rules or boundaries has the maximum information gain?    

## 6. Encode Categorical Features

In [None]:
hr_encoded_df = pd.get_dummies(hr_df, columns=['dept', 'salary'])

In [None]:
hr_encoded_df.head(10)
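To see what the encoding step does, here is a minimal sketch on a made-up frame. `toy_df` and its values are illustrative only; the column names simply mirror two columns of the HR data.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({"salary": ["low", "medium", "high", "low"],
                       "number_project": [2, 5, 4, 3]})

# Each category of `salary` becomes its own indicator column
encoded = pd.get_dummies(toy_df, columns=["salary"])
print(list(encoded.columns))
```

The original `salary` column is replaced by one indicator column per category (`salary_high`, `salary_low`, `salary_medium`), while numeric columns are left unchanged.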

## 7. Building a Decision Tree


### 7.1. Setting X and y Values

In [None]:
X_features = list(hr_encoded_df.columns)
X_features.remove('left')

In [None]:
X_features

In [None]:
X = hr_encoded_df[X_features]
y = hr_encoded_df.left

### 7.2. Split Dataset into Train and Test

- Train: 80%
- Test: 20%

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, \
y_train, y_test = train_test_split( X,
                                    y,
                                    test_size = 0.2,
                                    random_state = 100 )

In [None]:
X_train.shape

In [None]:
X_test.shape

### 7.3. Build Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier(max_depth=5, criterion='gini')

In [None]:
tree.fit( X_train, y_train )

### 7.4. Predicting on Test Data

In [None]:
y_pred = tree.predict(X_test)

In [None]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred})

In [None]:
y_df.sample(10, random_state=20)

### 7.5. Measuring Accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)

### 7.6. Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm = confusion_matrix(y_df.actual, y_df.predicted, labels=[1,0])

In [None]:
cm

In [None]:
cm_plot = ConfusionMatrixDisplay(cm, display_labels=['Left', 'Not Left'])

In [None]:
cm_plot.plot();

### 7.7. Finding Accuracy Metrics

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

In [None]:
print(classification_report(y_df.actual, y_df.predicted))
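The precision and recall in the report can be traced back to the confusion matrix. The sketch below uses made-up `actual` and `predicted` lists (not model output) and the same `labels=[1, 0]` ordering as section 7.6, so the first row and first column correspond to employees who left.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

# With labels=[1, 0], row 0 is actual "left" and column 0 is predicted "left"
cm = confusion_matrix(actual, predicted, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]

precision = tp / (tp + fp)   # of those predicted to leave, how many did
recall = tp / (tp + fn)      # of those who left, how many were caught
print(precision, recall)
print(precision_score(actual, predicted), recall_score(actual, predicted))
```

Both print statements should agree, which confirms how the matrix cells map onto the metrics for the positive class.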

### 7.8. ROC AUC

In [None]:
pred_probs = tree.predict_proba(X_test)
y_df['pred_probs'] = pred_probs[:, 1]

In [None]:
from sklearn.metrics import roc_auc_score, RocCurveDisplay

In [None]:
roc_auc_score(y_df.actual, y_df.pred_probs)

In [None]:
RocCurveDisplay.from_predictions(y_df.actual, y_df.pred_probs);

## 8. Visualize the Decision Tree

One of the benefits of a decision tree is that its rules can be visualized.

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize = (50, 12))
plot_tree(tree,
          feature_names = X_features,
          class_names = ['Not Left', 'Left'],
          filled = True,
          fontsize = 10);
plt.savefig('tree.png')

In [None]:
from IPython import display

In [None]:
display.Image("tree.png")
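When the plotted tree is too wide to read comfortably, the rules can also be printed as indented text with `sklearn.tree.export_text`. The sketch below uses a small synthetic dataset and a separate model so that it runs on its own; `X_demo`, `y_demo`, `demo_tree` and the feature names `f0` to `f3` are placeholders, not part of the HR data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the HR features (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=0)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

The same call should work for the model built in section 7.3 by passing `tree` and `feature_names = X_features`.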

## 9. Feature Importance 

In [None]:
tree.feature_importances_

In [None]:
importance_df = pd.DataFrame({'feature': X_features,
                              'importance': tree.feature_importances_})

In [None]:
importance_df = importance_df.sort_values('importance', ascending = False)
importance_df

In [None]:
importance_df['cummulative_imp'] = importance_df.importance.cumsum()
importance_df

### Ex2: Participant Exercise

1. Build decision trees for the following combinations of hyperparameters and measure the ROC AUC of each model.
    - max_depth is 10 and criterion is gini
    - max_depth is 5 and criterion is entropy
    

### Ex3: Participant Exercise

1. Build a decision tree with the top 5 features based on the feature importance found in section 9 and measure model performance using ROC AUC and the confusion matrix. 

## 10. Benefits of Decision Trees

Decision trees offer several benefits over other machine learning models, including:

- Interpretable and Easy to Understand: Decision trees provide a transparent and intuitive representation of the decision-making process. The tree structure, along with the decision rules at each node, can be easily visualized and interpreted, making it easier to explain the model's predictions to stakeholders.

- Handling Nonlinear Relationships: Decision trees can effectively capture nonlinear relationships between features and the target variable. They can learn complex decision boundaries without requiring explicit transformations or interactions between variables.

- Feature Importance: Decision trees can provide insights into feature importance. By measuring how much the splits on each feature reduce impurity across the tree, decision trees can rank the features by their predictive power, enabling feature selection and dimensionality reduction.
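The point about nonlinear relationships can be illustrated with a small synthetic example. In this sketch the target is 1 only when a single feature falls inside a middle band, a pattern that a straight-line boundary cannot separate. All names (`X_demo`, `y_demo`, `tree_acc`, `linear_acc`) and the data are made up, and logistic regression is used here only as a convenient linear baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: class 1 only inside the band -0.5 < x < 0.5
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(400, 1))
y_demo = (np.abs(X_demo[:, 0]) < 0.5).astype(int)

# Two thresholds are enough for a tree to isolate the band
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_acc = demo_tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# A linear model can only place a single cut on this feature
linear_acc = LogisticRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

print("tree:", tree_acc, "linear:", linear_acc)
```

The tree recovers the band with two splits, while the linear model is limited to one cut and misclassifies a large share of the points. Accuracy is measured on the training data here purely to show what each model can represent, not how well it generalizes.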