<a href="https://colab.research.google.com/github/mohammad0alfares/MachineLearningNotebooks/blob/master/DecisionTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the Source GitHub Repository
We need to clone a GitHub repository containing the source files used throughout this tutorial.

In [0]:
!rm -rf ./MachineLearning
!git clone https://github.com/mkjubran/MachineLearning.git

# Decision Trees
**Introduction**

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.$^{[1]}$

[1] https://scikit-learn.org/0.15/modules/tree.html#tree
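To make "simple decision rules" concrete, here is a minimal sketch on made-up data (the `hours` and `passed` values are assumptions for illustration, not the tutorial dataset). It fits a small tree and prints the learned rules with scikit-learn's `export_text`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up for illustration): study hours and pass (1) / fail (0)
hours = [[1], [2], [3], [4], [5], [6]]
passed = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(hours, passed)

# Print the learned decision rules as plain text
rules = export_text(clf, feature_names=["hours"])
print(rules)
```

On this toy data a single threshold on `hours` is enough to separate the two classes, so the printed tree has one split.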




**Readings and Resources**

[1] https://scikit-learn.org/0.15/modules/tree.html#tree

[2] https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

[3] https://towardsdatascience.com/boosting-the-accuracy-of-your-machine-learning-models-f878d6a2d185

# Case #1: Studying Hours and Passing Exams

In this section we will use a **decision tree** to infer whether a student will pass or fail an exam based on the number of hours the student spends preparing for it. The dataset covers a small number of students and records, for each one, the number of study hours and whether the student passed (1) or failed (0) the exam. This is a binary classification problem that can be solved using a **decision tree**, as shown next.

**Implementation**

Read the input data (number of study hours and exam pass or fail) from the CSV file (HoursPassExam.csv). Use the pandas library (https://pandas.pydata.org/) to read the data from the file.

In [0]:
import pandas as pd
df = pd.read_csv("./MachineLearning/3_decision_tree/HoursPassExam.csv")
df.head()

To get some information about the loaded dataset, including the number of entries and the type of each field, use the pandas info method

In [0]:
df.info()

To view the dataset, we will use a scatter plot from the matplotlib library:

In [0]:
import matplotlib.pyplot as plt
plt.scatter(df['hours'],df['pass'],color = 'red', marker = '+')

As can be seen, the output (y) is binary: 0 for failing the exam and 1 for passing it. Also, the chance of passing the exam increases as the number of study hours increases. Let us divide the dataset into training and testing datasets.

In [0]:
from sklearn.model_selection import train_test_split
x = df[['hours']]
y = df['pass']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print('size of test dataset = {}, size of training data = {}, percentage = {}%'.format(len(x_test),len(x_train),len(x_test)*100/(len(x_test) + len(x_train))))

In [0]:
print (type(x))
print (type(y))
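Because `train_test_split` shuffles the rows, each run of the split above can give different training and testing sets, and therefore different accuracies. The sketch below uses a hypothetical stand-in dataset (`toy`) to show two optional arguments: `random_state`, which makes the split repeatable, and `stratify`, which keeps the pass/fail ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the tutorial data: 20 students
toy = pd.DataFrame({
    "hours": list(range(1, 21)),
    "pass": [0] * 10 + [1] * 10,
})
x_toy = toy[["hours"]]
y_toy = toy["pass"]

# random_state fixes the shuffle; stratify keeps the pass/fail ratio in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(len(x_tr), len(x_te))
print(y_te.value_counts().to_dict())
```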

Next we will train the **decision tree** model and compute its accuracy

In [0]:
from sklearn import tree
model_dt = tree.DecisionTreeClassifier()
model_dt.fit(x_train,y_train)
ACC_train_dt = model_dt.score(x_train,y_train)
ACC_test_dt = model_dt.score(x_test,y_test)
print(ACC_train_dt)
print(ACC_test_dt)
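A decision tree with default settings keeps splitting until its leaves are pure, so its training accuracy is usually very high and says little about how well it generalizes. The following sketch uses synthetic noisy data (an assumption for illustration, not the tutorial dataset) to compare an unconstrained tree with a tree limited to one split via `max_depth=1`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (an assumption for illustration, not the tutorial dataset)
rng = np.random.default_rng(0)
hours_syn = rng.uniform(0, 10, size=(200, 1))
# Pass probability grows with hours, so the labels overlap near the middle
pass_syn = (hours_syn[:, 0] + rng.normal(0, 2, size=200) > 5).astype(int)

full_tree = DecisionTreeClassifier(random_state=0).fit(hours_syn, pass_syn)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(hours_syn, pass_syn)

print("full tree: depth =", full_tree.get_depth(),
      "train acc =", full_tree.score(hours_syn, pass_syn))
print("stump:     depth =", stump.get_depth(),
      "train acc =", stump.score(hours_syn, pass_syn))
```

The deeper tree fits the training data at least as well as the single-split tree; whether that helps on unseen data is what the testing accuracy is meant to reveal.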

Let us try to compare **DT** with logistic regression

In [0]:
## logistic regression
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
model_lr.fit(x_train, y_train)
ACC_train_lr = model_lr.score(x_train, y_train)
ACC_test_lr = model_lr.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)'])
t.add_row(['Training', ACC_train_lr*100, ACC_train_dt*100])
t.add_row(['Testing', ACC_test_lr*100, ACC_test_dt*100])
print(t)

**Comment on the training and testing accuracies of DT as compared to logistic regression.**

# Case #2: HR Analysis

In this section, we will analyze the data of employees of a company. This data includes some information about the employees who are working at the company and those who left the company. Our objective is to predict whether an existing employee would leave the company based on his/her current status. This will help us decide to offer the employee some incentives to keep him/her in the company. This could also be used to plan early to hire new employees. We will try to solve this problem using **decision trees**.

**Implementation**

Read the input data from the CSV file (HR_comma_sep.csv). The dataset was downloaded from Kaggle: https://www.kaggle.com/giripujar/hr-analytics. Use the pandas library (https://pandas.pydata.org/) to read the data from the file.

In [0]:
import pandas as pd
HR = pd.read_csv('./MachineLearning/3_decision_tree/HR_comma_sep.csv')
HR.head()

To get some information about the loaded dataset, use the pandas info method

In [0]:
HR.info()

Before applying a classifier to the data, we will explore and analyze the data to determine the features that influence the decision of the employees to remain in or leave the company.

In [0]:
left = HR[HR.left==1] ## employees who left the company 
No_left= left.shape[0]
remain = HR[HR.left==0] ## employees who remain at the company 
No_remain = remain.shape[0]
Per_left = No_left / (No_left + No_remain)

print('No_left = {}, No_remain = {} , Percentage of left = {} %'.format(No_left,No_remain,Per_left*100))


About $23\%$ of the employees left the company. Now, let us check which features most affect the decision of employees to leave or remain in the company. To do this, we will compute the average of each numeric feature separately for employees who remained and employees who left.

In [0]:
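# numeric_only=True skips the text columns (Department, salary), which cannot be averaged
HR.groupby('left').mean(numeric_only=True)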

We may conclude the following from the table above: \\
1- Employees who remain in the company have a higher satisfaction_level, so it is a good indicator for our classifier (good feature) \\
2- The last_evaluation, number of projects, and time_spend_company averages are almost independent of whether employees remain in or leave the company \\
3- The average_montly_hours of employees who left the company is higher than that of those who remained, which could be an indicator (good feature) \\
4- The promotion_last_5years average of employees remaining in the company is much higher than that of those who left (good feature) \\
5- Work_accident is also an indicator, so it is a good feature.




Let us also check how informative the categorical features are.

In [0]:
pd.crosstab(HR.salary,HR.left)

The salary table shows that employees with high salaries are more likely to stay in the company, so salary is a good feature. To visualize this, we make a bar plot as follows:

In [0]:
pd.crosstab(HR.salary,HR.left).plot(kind='bar')
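Raw counts can be hard to compare when the salary groups have very different sizes. As a possible complement to the tables above, `pd.crosstab` accepts `normalize='index'`, which converts each row to fractions. The sketch below uses a small made-up table (`toy_hr`) with the same column names as the HR dataset.

```python
import pandas as pd

# Small made-up table with the same column names as the HR dataset
toy_hr = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1,     1,     0,     0,     1,        0,        0,      0],
})

# normalize='index' divides each row by its total, giving per-salary fractions
rates = pd.crosstab(toy_hr.salary, toy_hr.left, normalize="index")
print(rates)
```

Each row now sums to 1, so the column for `left == 1` can be read directly as the fraction of employees in that salary group who left.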

We also need to investigate the department feature as follows

In [0]:
pd.crosstab(HR.Department,HR.left).plot(kind='bar')

The department has only a minor effect on the decision of employees to stay or leave the company. It does not look like a major factor, and thus we will ignore this feature.

Based on the above analysis, we will create the following table, which includes only the good (important, major) features affecting employees' decisions to stay or leave the company

In [0]:
HR_GF = HR[['left','satisfaction_level','average_montly_hours','Work_accident','promotion_last_5years', 'salary']]
HR_GF.head()

Let us plot this data for better visualization

In [0]:
import matplotlib.pyplot as plt
HR_GF_0 = HR_GF[HR_GF['left'] == 0]  # employees who remained
HR_GF_1 = HR_GF[HR_GF['left'] == 1]  # employees who left

# features plotted against satisfaction_level, one per column of the figure
y_features = ['average_montly_hours', 'Work_accident', 'promotion_last_5years', 'salary']
# (data, color, marker) for each row of the figure
groups = [(HR_GF_0, 'blue', '+'), (HR_GF_1, 'orange', 's')]

fig, axes = plt.subplots(2, 4,figsize = (20,10))
# top row: employees who remained; bottom row: employees who left
for row, (group, color, marker) in enumerate(groups):
    for col, feature in enumerate(y_features):
        axes[row,col].scatter(group['satisfaction_level'], group[feature], color = color, marker = marker)
        axes[row,col].set_xlabel('satisfaction_level')
        axes[row,col].set_ylabel(feature)

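Comparing the top row (employees who remained) with the bottom row (employees who left) shows that the two groups occupy visibly different regions of the plots, which suggests that the dataset is separable with respect to the left feature.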

**For the decision tree algorithm**, there is no need to apply one-hot encoding to categorical features. However, we still need to convert them to numbers. We will use the label encoder from the sklearn library to encode the categorical feature (salary) as follows

In [0]:
from sklearn.preprocessing import LabelEncoder
le_salary = LabelEncoder()
HR_GF_LE = HR_GF.copy()
HR_GF_LE['salary'] = le_salary.fit_transform(HR_GF_LE['salary'])
HR_GF_LE.head()
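It helps to check which number the encoder assigned to each category. `LabelEncoder` sorts the labels, so the codes follow alphabetical order rather than salary rank. A minimal sketch on a short made-up list (the `le_demo` and `codes` names are illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["low", "medium", "high", "low"])

# classes_ lists the original labels; the position of each label is its code
print(le_demo.classes_.tolist())  # ['high', 'low', 'medium']
print(codes.tolist())             # [1, 2, 0, 1]
```

Here "high" becomes 0, "low" becomes 1, and "medium" becomes 2. The same lookup on the fitted encoder (`le_salary.classes_`) shows the mapping used for the HR table.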

Let us define input (x) and output (y) of the model

In [0]:
x = HR_GF_LE.drop('left',axis=1)
y = HR_GF_LE.left

Before classification, we need to split the dataset into testing and training parts

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print('size of test dataset = {}, size of training data = {}, percentage = {}%'.format(len(x_test),len(x_train),len(x_test)*100/(len(x_test) + len(x_train))))

Now, we are ready to apply the **decision tree**

In [0]:
from sklearn import tree
HRTree = tree.DecisionTreeClassifier()
HRTree.fit(x_train,y_train)
ACC_train_rf = HRTree.score(x_train,y_train)
ACC_test_rf = HRTree.score(x_test,y_test)
print(ACC_train_rf)
print(ACC_test_rf)
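A fitted tree exposes `feature_importances_`, which gives one way to check whether the features chosen during the exploratory analysis are the ones the tree actually relies on. The sketch below runs on synthetic data (an assumption for illustration) in which only the first of three features determines the label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first of three features determines the label
rng = np.random.default_rng(0)
X_syn = rng.random((200, 3))
y_syn = (X_syn[:, 0] > 0.5).astype(int)

tree_syn = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# One importance value per feature; the values sum to 1
for name, importance in zip(["f0", "f1", "f2"], tree_syn.feature_importances_):
    print(name, round(float(importance), 3))
```

For the HR model, the same attribute on `HRTree` can be paired with `x_train.columns` to label each value.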

Let us try to compare DT with logistic regression

In [0]:
## one-hot encode salary for logistic regression (training)
## prefix='salary' keeps every column name a string, which recent sklearn versions require
dm = pd.get_dummies(x_train.salary, prefix='salary')
x_train_lr = pd.concat([x_train,dm],axis=1)
x_train_lr = x_train_lr.drop(['salary','salary_2'],axis=1)
## one-hot encode salary for logistic regression (testing)
dm = pd.get_dummies(x_test.salary, prefix='salary')
x_test_lr = pd.concat([x_test,dm],axis=1)
x_test_lr = x_test_lr.drop(['salary','salary_2'],axis=1)


## logistic regression
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
model_lr.fit(x_train_lr, y_train)
ACC_train_lr = model_lr.score(x_train_lr, y_train)
ACC_test_lr = model_lr.score(x_test_lr, y_test)

## Decision Trees
from sklearn.tree import DecisionTreeClassifier 
model_dt = DecisionTreeClassifier()
model_dt.fit(x_train, y_train)
ACC_train_dt = model_dt.score(x_train, y_train)
ACC_test_dt = model_dt.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)'])
t.add_row(['Training', ACC_train_lr*100, ACC_train_dt*100])
t.add_row(['Testing', ACC_test_lr*100, ACC_test_dt*100])
print(t)
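The logistic regression branch above one-hot encodes salary and drops one of the resulting columns, since that column is implied by the others. As a small illustration of the same idea on made-up values, `pd.get_dummies` can do both steps at once with `drop_first=True`:

```python
import pandas as pd

salary_demo = pd.Series(["low", "medium", "high", "low"], name="salary")

# One column per category; drop_first removes the first (alphabetical) category
dummies = pd.get_dummies(salary_demo, prefix="salary", drop_first=True)
print(dummies.columns.tolist())  # ['salary_low', 'salary_medium']
```

A row with zeros in both remaining columns corresponds to the dropped category ("high" in this example).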

**Comment on the training and testing accuracies in the table above.**

# Case #3: Recognition of Handwritten Digits

In this section, we will try to recognize handwritten digits using a **decision tree**. We will use a standard dataset available through the sklearn library, loaded with the load_digits function.$^{[1][2]}$

[1] https://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction

[2] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html


First, we load the dataset as follows

In [0]:
from sklearn.datasets import load_digits
digits = load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data. Let us explore the content of the digits dataset

In [0]:
dir(digits)

The digits.data array contains the features that will be used to classify the digit samples

In [0]:
print(digits.data)

The digits.images array contains the images of the digit samples. They can be viewed using the following code

In [0]:
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
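The two arrays hold the same pixel values in different shapes: each 8x8 image in `digits.images` corresponds to one 64-value row of `digits.data`. A short check (the `digits_demo` and `flat` names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()
print(digits_demo.images.shape)  # (1797, 8, 8)
print(digits_demo.data.shape)    # (1797, 64)

# Flattening each 8x8 image row by row reproduces the feature matrix
flat = digits_demo.images.reshape(len(digits_demo.images), -1)
print(np.array_equal(flat, digits_demo.data))
```

So the classifier sees each digit as a vector of 64 pixel intensities, with no explicit notion of which pixels are neighbors.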

The ground truth of the dataset is stored in digits.target

In [0]:
print(digits.target)

Let us use Principal Component Analysis (PCA) to view the digits dataset. We will plot a projection onto the first two principal axes

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
proj = pca.fit_transform(digits.data)
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap="Paired")
plt.colorbar()
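A two-dimensional projection discards part of the information in the 64 original features. One way to gauge how much is kept is the `explained_variance_ratio_` attribute of the fitted PCA object, sketched here with separate illustrative variable names:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits_demo = load_digits()
pca_demo = PCA(n_components=2)
proj_demo = pca_demo.fit_transform(digits_demo.data)

# Fraction of the total variance captured by each of the two components
print(pca_demo.explained_variance_ratio_)
print("total:", pca_demo.explained_variance_ratio_.sum())
```

A total well below 1 means the scatter plot is only a partial view, so classes that overlap in the plot may still be separable in the full feature space.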

After exploring the content of the digits dataset, we will design a classifier using **decision trees**. First, we define the input feature vector (x) and the ground truth (y)

In [0]:
x = digits.data
y = digits.target

Then we split the dataset into testing and training parts

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print('size of test dataset = {}, size of training data = {}, percentage = {}%'.format(len(x_test),len(x_train),len(x_test)*100/(len(x_test) + len(x_train))))


Here we will train the **decision tree** model

In [0]:
from sklearn import tree
model_dt = tree.DecisionTreeClassifier()
model_dt.fit(x_train,y_train)
ACC_train_dt = model_dt.score(x_train, y_train)
ACC_test_dt = model_dt.score(x_test, y_test)
print(ACC_train_dt)
print(ACC_test_dt)

Let us try to compare **DT** with other ML techniques

In [0]:
## logistic regression
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
model_lr.fit(x_train, y_train)
ACC_train_lr = model_lr.score(x_train, y_train)
ACC_test_lr = model_lr.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)'])
t.add_row(['Training', ACC_train_lr*100, ACC_train_dt*100])
t.add_row(['Testing', ACC_test_lr*100, ACC_test_dt*100])
print(t)

**Comment on the training and testing accuracies.**

To predict the classes of the test samples and store them in y_pred, run

In [0]:
y_pred = model_dt.predict(x_test)

Sometimes we wish to know where the model failed. This can be examined using the confusion matrix (discussed in more detail in logistic regression).

In [0]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)

import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(cm,annot=True)
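# confusion_matrix(y_test, y_pred) puts true labels on rows (y axis) and predictions on columns (x axis)
plt.xlabel('predicted')
plt.ylabel('ground truth')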
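In the matrix returned by `confusion_matrix(y_true, y_pred)`, row i counts the samples whose true class is i and column j counts the samples predicted as class j. Dividing the diagonal by the row sums therefore gives the fraction of each class that was classified correctly. A tiny made-up example with three classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny made-up example with three classes
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes
per_class = cm_demo.diagonal() / cm_demo.sum(axis=1)
print(cm_demo)
print(per_class.tolist())  # [0.5, 1.0, 0.5]
```

Applied to the digits matrix `cm`, the same two lines show which digits the tree recognizes least reliably.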

Let us try to find out how the **decision tree** classified a specific digit. In the next code, we will visualize the images together with their target and predicted values.

In [0]:
import numpy as np
fig = plt.figure(figsize=(12, 24))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

digit_visualize = 5 # the digit that we want to visualize

cnt = 0
i = 0
while (i < 128) and (i < len(y_test)):
    if y_test[i] == digit_visualize:
        # locate the test sample in the original dataset to retrieve its image
        Idx = np.where(np.all(digits.data == x_test[i,:], axis = -1))
        ax = fig.add_subplot(16, 8, cnt + 1, xticks=[], yticks=[])
        # Idx[0][0] takes the first matching row, which also works if there are duplicates
        ax.imshow(digits.images[Idx[0][0]], cmap=plt.cm.binary, interpolation='nearest')
        # label the image with the target value (left) and the predicted value (right)
        ax.text(0, 7, str(y_test[i]))
        ax.text(6.5, 7, str(y_pred[i]))
        cnt += 1
    i += 1

Let us try to find out in more detail where the **decision tree** failed. In the next code, we will visualize the misclassified images together with their target and predicted values.

In [0]:
fig = plt.figure(figsize=(12, 24))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

cnt = 0
i = 0
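while (i < 128) and (i < len(y_test)):
    if y_test[i] != y_pred[i]:
        # locate the test sample in the original dataset to retrieve its image
        Idx = np.where(np.all(digits.data == x_test[i,:], axis = -1))
        ax = fig.add_subplot(16, 8, cnt + 1, xticks=[], yticks=[])
        # Idx[0][0] takes the first matching row, which also works if there are duplicates
        ax.imshow(digits.images[Idx[0][0]], cmap=plt.cm.binary, interpolation='nearest')
        # label the image with the target value (left) and the predicted value (right)
        ax.text(0, 7, str(y_test[i]))
        ax.text(6.5, 7, str(y_pred[i]))
        cnt += 1
    i += 1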
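The loops above recover each test image by searching `digits.data` for a matching row. One possible alternative, not part of the original notebook, is to split an array of row indices together with the data, so that the image of every test sample can be looked up directly. The variable names below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits_demo = load_digits()
indices = np.arange(len(digits_demo.data))

# Splitting the index array together with x and y keeps the three aligned
x_tr, x_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    digits_demo.data, digits_demo.target, indices, test_size=0.2, random_state=0
)

# The image of the i-th test sample is digits_demo.images[idx_te[i]]
print(np.array_equal(digits_demo.data[idx_te], x_te))
```

This avoids the row-by-row comparison and stays unambiguous even if two samples happen to have identical pixel values.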

# Exercises

**1) Iris identification**: use decision tree classification on the iris dataset available in the sklearn library. You may import the dataset using "from sklearn.datasets import load_iris".