# 1. Decision trees

# Introduction

- A decision tree is a supervised machine learning model (that is, the training data specifies both the input and the corresponding output) in which the data is repeatedly split according to the value of some feature.
- A tree is made up of two kinds of entities: decision nodes and leaves.
- The decision nodes are where the data is split; the leaves are the decisions, or final outcomes.

- As an example, suppose you want to predict whether a person is fit, given information such as their age, eating habits, and physical activity.
- The decision nodes here are questions like
  - ‘What is the person’s age?’,
  - ‘Does the person exercise?’,
  - ‘Does the person eat a lot of pizza?’
- The leaves are the outcomes: either ‘fit’ or ‘unfit’.
- This is a binary classification problem (a yes/no type of problem).
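
The fit/unfit example can be written out by hand as nested questions. The sketch below is only an illustration: the function name, the order of the questions, and the age threshold of 30 are assumptions, not values learned from data.

```python
def predict_fitness(age, exercises, eats_lots_of_pizza):
    """Walk a hand-written decision tree and return 'fit' or 'unfit'."""
    # Decision node: age (the threshold 30 is an illustrative assumption).
    if age < 30:
        # Decision node: eating habits.
        if eats_lots_of_pizza:
            return "unfit"  # leaf
        return "fit"        # leaf
    # Decision node: physical activity.
    if exercises:
        return "fit"        # leaf
    return "unfit"          # leaf


print(predict_fitness(age=25, exercises=False, eats_lots_of_pizza=True))
print(predict_fitness(age=45, exercises=True, eats_lots_of_pizza=True))
```

Each `if` is a decision node and each `return` is a leaf. A learning algorithm chooses the questions and thresholds from the training data instead of having them written by hand.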

- There are two main types of decision trees:

- Classification trees (categorical outcomes)
  - The example above is a classification tree: the outcome is a category such as ‘fit’ or ‘unfit’.
  - Here the decision variable is categorical.

- Regression trees (continuous outcomes)
  - Here the decision, or outcome, variable is continuous, e.g. a number like 123.
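
For the regression case, scikit-learn provides `DecisionTreeRegressor` alongside `DecisionTreeClassifier`. The following is a minimal sketch on synthetic data; the data, the `max_depth` value, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y is roughly 2 * x.
rng = np.random.RandomState(0)
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=X.shape[0])

# A shallow regression tree; each leaf predicts a continuous value.
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)

# Each prediction is the mean target of the training samples in the leaf reached.
pred = reg.predict(np.array([[2.2], [7.7]]))
print(pred)
```

Unlike a classification tree, the output is a number rather than a class label.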

# 1.1 Introduction to Ensemble Methods

- Ensemble modeling is a powerful way to improve the performance of your model.
- It usually pays off to apply ensemble learning on top of the individual models you build.
- Time and again, people have used ensemble models in competitions such as Kaggle and benefited from them.
- Ensemble learning is a broad topic, limited only by your own imagination.
- Let’s start with an example to understand the basics of ensemble learning.
- The example shows how we use ensemble modeling every day without realizing it.

## Example:
- I want to invest in a company, XYZ, but I am not sure about its performance.
- So I look for advice on whether its stock price will increase by more than 6% per annum.
- I decide to approach several experts with diverse domain experience:

- **1. Employee of Company XYZ:**
  - This person knows how the company works internally and has insider information about the firm.
  - But he lacks a broader perspective on how competitors are innovating, how the technology is evolving, and what impact this evolution will have on Company XYZ’s product.
  - **In the past, he has been right 70% of the time.**

- **2. Financial Advisor of Company XYZ:**
  - This person has a broader perspective on how the company’s strategy will fare in this competitive environment.
  - However, he lacks a view of how the company’s internal policies are faring.
  - **In the past, he has been right 75% of the time.**

- **3. Stock Market Trader:**
  - This person has observed the company’s stock price over the past 3 years.
  - He knows the seasonal trends and how the overall market is performing.
  - He has also developed a strong intuition for how stocks might vary over time.
  - **In the past, he has been right 70% of the time.**

- **4. Employee of a competitor:**
  - This person knows how the competitor firms work internally and is aware of certain changes that are yet to be introduced.
  - He lacks insight into the company in focus and into the external factors that relate the competitor’s growth to that of the company in question.
  - **In the past, he has been right 60% of the time.**

- **5. Market Research team in the same segment:**
  - This team analyzes customers’ preference for Company XYZ’s product over others and how this is changing with time.
  - Because the team deals with the customer side, it is unaware of the changes Company XYZ will make to align with its own goals.
  - **In the past, they have been right 75% of the time.**

- **6. Social Media Expert:**
  - This person can help us understand how Company XYZ has positioned its products in the market.
  - He can also tell us how customer sentiment towards the company is changing over time. He is unaware of any details beyond digital marketing.
  - **In the past, he has been right 65% of the time.**

- **Ensemble learning is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model.**
- In the above example, **the way we combine all the predictions together is termed ensemble learning.**
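
One simple way to combine the six opinions is a majority vote. The sketch below computes how often such a vote would be right, under two strong assumptions that are not part of the example itself: the experts err independently of one another, and a 3-3 tie is broken by a fair coin flip. The function name is hypothetical.

```python
from itertools import product


def majority_vote_accuracy(accuracies):
    """Probability that a majority vote is right, assuming independent voters.

    Ties are assumed to be broken by a fair coin flip.
    """
    n = len(accuracies)
    total = 0.0
    # Enumerate every combination of right (True) / wrong (False) experts.
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for is_right, acc in zip(outcome, accuracies):
            prob *= acc if is_right else 1.0 - acc
        n_right = sum(outcome)
        if 2 * n_right > n:        # a clear majority is right
            total += prob
        elif 2 * n_right == n:     # tie: right half of the time
            total += 0.5 * prob
    return total


# Past accuracies of the six experts from the example.
experts = [0.70, 0.75, 0.70, 0.60, 0.75, 0.65]
combined = majority_vote_accuracy(experts)
print(f"Majority-vote accuracy: {combined:.3f}")
```

Under these assumptions the combined vote is right more often than the best single expert (75%). In practice experts, like models, are rarely fully independent, so the real gain is usually smaller.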

# Title: Balance Scale Weight & Distance Database
- **Number of Instances:** 625 (49 balanced, 288 left, 288 right)
- **Number of Attributes:** 4 (numeric) + class name = 5
- **Attribute Information:**
  - **Class Name (target variable):** 3 values
    - L [balance scale tips to the left]
    - B [balance scale is balanced]
    - R [balance scale tips to the right]
  - Left-Weight: 5 values (1, 2, 3, 4, 5)
  - Left-Distance: 5 values (1, 2, 3, 4, 5)
  - Right-Weight: 5 values (1, 2, 3, 4, 5)
  - Right-Distance: 5 values (1, 2, 3, 4, 5)

- **Missing Attribute Values:** None
- **Class Distribution:**
  - 46.08 percent are L
  - 7.84 percent are B
  - 46.08 percent are R
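
The class counts can be reproduced without downloading anything. The sketch assumes the rule given in the UCI description of this dataset: the scale tips towards the side with the larger weight × distance product and is balanced when the two products are equal. The function name is hypothetical.

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'B' or 'R' by comparing the two weight * distance products."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"


# Every combination of the four attributes, each taking the values 1..5.
values = range(1, 6)
counts = Counter(balance_class(*row) for row in product(values, repeat=4))
total = sum(counts.values())

for label in ("L", "B", "R"):
    print(label, counts[label], f"{100 * counts[label] / total:.2f}%")
```

This enumerates all 625 attribute combinations and yields 288 L, 49 B and 288 R, matching the instance counts and percentages listed above.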

- Assumptions we make when using a decision tree:
  - At the beginning, the whole training set is considered as the root.
  - For information gain, attributes are assumed to be categorical; for the Gini index, attributes are assumed to be continuous.
  - Records are distributed recursively on the basis of attribute values.
  - Statistical methods are used to decide the order in which attributes are placed as the root or as internal nodes.

## Pseudocode:
1. Find the best attribute and place it at the root node of the tree.
2. Split the training set into subsets such that every record in a subset has the same value for that attribute.
3. Repeat steps 1 and 2 on each subset until leaf nodes are found in all branches of the tree.
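
The pseudocode can be sketched as a short recursive function. This is a simplified illustration for categorical attributes, not the algorithm sklearn uses: "best" is taken to mean the lowest weighted entropy after the split, and all function names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split yields the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)


def build_tree(rows, labels, attributes):
    # Leaf: all labels agree, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)        # step 1
    remaining = [a for a in attributes if a != attr]
    subsets = {}
    for row, label in zip(rows, labels):                   # step 2
        sub_rows, sub_labels = subsets.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    # Step 3: repeat on each subset.
    return {attr: {value: build_tree(r, l, remaining)
                   for value, (r, l) in subsets.items()}}


def predict(tree, row):
    # Follow the branches until a leaf (a plain label) is reached.
    # Note: raises KeyError for attribute values never seen in training.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


# Toy data (hypothetical), echoing the fit/unfit example.
rows = [
    {"exercises": "yes", "pizza": "no"},
    {"exercises": "yes", "pizza": "yes"},
    {"exercises": "no", "pizza": "no"},
    {"exercises": "no", "pizza": "yes"},
]
labels = ["fit", "fit", "fit", "unfit"]
tree = build_tree(rows, labels, ["exercises", "pizza"])
print(tree)
```

The tree is stored as nested dictionaries: each dictionary is a decision node keyed by an attribute, and each plain string is a leaf.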

### While implementing the decision tree we will go through the following two phases:

- **Building Phase**
  - Preprocess the dataset.
  - Split the dataset into training and test sets using the Python sklearn package.
  - Train the classifier.
- **Operational Phase**
  - Make predictions.
  - Calculate the accuracy.

## sklearn :
- In Python, sklearn (scikit-learn) is a machine learning package that includes many ML algorithms.
- Here, we use some of its functions and classes, namely train_test_split, DecisionTreeClassifier and accuracy_score.


## NumPy :
- NumPy is a numerical Python package that provides fast mathematical functions for calculations.
- It is used to hold data in NumPy arrays and to manipulate them.


## Pandas :
- Pandas is used to read and write files in different formats.
- Data manipulation is easy with DataFrames.

## Terms used in code :
- The Gini index and information gain are both used to select, from the n attributes of the dataset, the attribute to place at the root node or at an internal node.
- **Gini Index** is a metric that measures how often a randomly chosen element would be incorrectly classified.
- This means an attribute with a lower Gini index should be preferred.
- Sklearn supports the “gini” criterion for the Gini index, and it is the default value of the criterion parameter.
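
For a set of class labels with class proportions p_i, the Gini index is commonly computed as 1 − Σ p_i². A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a collection of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini_impurity(["L", "L", "L", "L"]))  # pure node -> 0.0
print(gini_impurity(["L", "R", "L", "R"]))  # evenly mixed, two classes -> 0.5
```

A pure node scores 0, and the score grows as the classes become more mixed, which is why splits that produce a lower Gini index are preferred.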


- **Entropy**
  - Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the greater the information content.
- **Information Gain**
  - The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets. Information gain is a measure of this change in entropy.
- Sklearn supports the “entropy” criterion for information gain; to use it, we have to specify it explicitly.
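
Both quantities are easy to compute by hand. The sketch below uses the usual definitions: entropy is −Σ p_i log2 p_i, and information gain is the parent's entropy minus the size-weighted entropy of the child subsets. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["L", "L", "R", "R"]
print(information_gain(parent, [["L", "L"], ["R", "R"]]))  # perfect split -> 1.0
print(information_gain(parent, [["L", "R"], ["L", "R"]]))  # uninformative split -> 0.0
```

A split that separates the classes completely removes all of the uncertainty, while a split that leaves each subset as mixed as the parent gains nothing.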
- **Accuracy score**
  - The accuracy score is the fraction of samples that the trained classifier predicts correctly.
- **Confusion Matrix**
  - The confusion matrix is used to understand the behavior of the trained classifier on the test or validation dataset.

In [0]:
# Importing the required packages 
import numpy as np 
import pandas as pd 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

In [0]:
# Loading the Balance Scale dataset from the UCI repository (the file has no header row)
balance_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data', sep=',', header=None)

In [0]:
balance_data.head()

   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5


In [0]:
# Printing the dataset shape
print("Dataset Length: ", len(balance_data))
print("Dataset Shape: ", balance_data.shape)

Dataset Length:  625
Dataset Shape:  (625, 5)


In [0]:
# Separating the features (columns 1 to 4) from the target variable (column 0)
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]

In [0]:
# Splitting the dataset into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

In [0]:
# Creating a decision tree classifier that uses the Gini index
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)

In [0]:
# Performing training 
clf_gini.fit(X_train, y_train) 

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=100,
            splitter='best')

In [0]:
# Creating a decision tree classifier that uses entropy
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)

In [0]:
# Performing training 
clf_entropy.fit(X_train, y_train) 

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=100,
            splitter='best')

In [0]:
# Prediction on the test set with the Gini index classifier
y_pred = clf_gini.predict(X_test) 
print("Predicted values:") 
print(y_pred)

Predicted values:
['R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'L'
 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L'
 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L'
 'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R'
 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R']


In [0]:
# Prediction on the test set with the entropy classifier
y_pred_2 = clf_entropy.predict(X_test) 
print("Predicted values:") 
print(y_pred_2)

Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
 'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']


In [0]:
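# Confusion matrix for the Gini index classifier
# (rows: true class, columns: predicted class, in the order B, L, R)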
print(confusion_matrix(y_test, y_pred))

[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]


In [0]:
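# Confusion matrix for the entropy classifier
# (rows: true class, columns: predicted class, in the order B, L, R)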
print(confusion_matrix(y_test, y_pred_2))

[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]


In [0]:
# Accuracy of the Gini index classifier
accuracy_score(y_test,y_pred)

0.7340425531914894

In [0]:
# Accuracy of the entropy classifier
accuracy_score(y_test,y_pred_2)

0.7074468085106383
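
The accuracy scores follow directly from the confusion matrices: accuracy is the sum of the diagonal divided by the total number of test samples. The sketch below redoes the calculation for the Gini index classifier using the matrix printed earlier, and also derives per-class recall. The variable names are assumptions; the label order B, L, R is the sorted order that `confusion_matrix` uses by default.

```python
import numpy as np

# Confusion matrix printed earlier for the Gini index classifier
# (rows: true class, columns: predicted class, label order B, L, R).
cm_gini = np.array([[0, 6, 7],
                    [0, 67, 18],
                    [0, 19, 71]])

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm_gini) / cm_gini.sum()
print(f"accuracy = {accuracy:.4f}")

# Recall per class: diagonal entry over the row total (number of true samples).
recall = np.diag(cm_gini) / cm_gini.sum(axis=1)
for label, value in zip("BLR", recall):
    print(f"recall[{label}] = {value:.3f}")
```

The recall for class B is 0: neither tree ever predicts the rare "balanced" class, which the overall accuracy alone does not reveal.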

# In-class lab: write a program using the Decision Tree Classification Algorithm
Data set name: credit.csv. Using the dataset:
1. Apply the Decision Tree Classification Algorithm (restricting the depth of the tree to 5)
2. Use both the entropy and the Gini index methods
3. Compute all the evaluation metrics

# Take-home assignment

Data set name: Heart.csv. Using the dataset:
1. Apply the Decision Tree Classification Algorithm
2. Use both the entropy and the Gini index methods
3. Compute all the evaluation metrics