# **Lab 7: Decision Trees**

## Workings Of Decision Tree
* At the root node, the decision tree selects a feature to split the data into two major groups.
* So after the root node we have two decision rules and two subtrees.
* The data is divided into two groups again in each subtree.
* This process continues until the training examples in each group belong together, for example because they all share the same class.
* At the end of the tree we are left with leaf nodes, which represent the class or the continuous value that we are trying to predict.
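
The steps above can be sketched with a small hand-written tree. This is only an illustration: the feature names are borrowed from the iris dataset used later in this lab, but the thresholds and the `toy_tree` / `predict_one` names are made up for the example and are not learned from data.

```python
# A hand-written toy tree. The thresholds are illustrative assumptions,
# not values learned by any algorithm.
toy_tree = {
    "feature": "petal length (cm)",   # root node: first decision rule
    "threshold": 2.5,
    "left": {"leaf": "setosa"},       # leaf node: holds a class
    "right": {                        # sub tree: splits the data again
        "feature": "petal width (cm)",
        "threshold": 1.7,
        "left": {"leaf": "versicolor"},
        "right": {"leaf": "virginica"},
    },
}


def predict_one(node, sample):
    """Walk from the root to a leaf, following one decision rule per node."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


sample = {"petal length (cm)": 1.4, "petal width (cm)": 0.2}
print(predict_one(toy_tree, sample))
```

Training a real decision tree amounts to choosing the features and thresholds automatically, which is what the splitting criteria in the next section are for.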

## Criteria To Split The Data
The objective of a decision tree is to split the data so that we end up with groups whose members are as similar as possible, that is, groups with low randomness/impurity. To achieve this, every split in the decision tree must reduce the randomness.
A decision tree uses the 'entropy' or 'gini' selection criterion to split the data.
Note: We are going to use the sklearn library to test classification. 'entropy' and 'gini' are the selection criteria for its classifier.
### Entropy
To find the feature that best reduces the randomness after a split, we can compare the randomness before and after the split for every feature, and then choose the feature that provides the largest reduction. Formally, the randomness in the data is known as 'Entropy', and the difference between the entropy before and after a split is known as 'Information Gain'. Since a split produces several branches of different sizes, the entropy after the split is the average of the child entropies, weighted by the number of samples in each child. The information gain formula can therefore be written as:

```
    Information Gain = Entropy(Parent Decision Node) - (Weighted Average Entropy(Child Nodes))
```

In the entropy formula below, 'i' ranges over the target classes.

   ![entropy_formula](https://raw.githubusercontent.com/satishgunjal/images/master/entropy_formula.png)

So in case of 'Entropy', decision tree will split the data using the feature that provides the highest information gain.
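
As a rough sketch of the calculation, the snippet below computes entropy and information gain for a list of class labels. It assumes the usual Shannon entropy with a base-2 logarithm, and that child nodes are weighted by their share of the parent's samples. The helper names `entropy` and `information_gain` and the toy labels are hypothetical and are not part of sklearn.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data: ten samples, two equally frequent classes.
parent = ["a"] * 5 + ["b"] * 5
perfect_split = [["a"] * 5, ["b"] * 5]                             # each child is pure
useless_split = [["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]]  # still 50/50

print("gain of perfect split:", information_gain(parent, perfect_split))
print("gain of useless split:", information_gain(parent, useless_split))
```

The perfect split removes all of the parent's uncertainty, while the useless split leaves the class mix unchanged in both children and so gains nothing.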

### Gini
With Gini impurity, we pick a random data point in our dataset and then classify it randomly according to the class distribution in the dataset. Gini impurity is the probability that this random classification is incorrect. We determine the quality of a split by weighting the impurity of each branch by the number of elements it contains and subtracting the result from the impurity before the split. The resulting value is called the 'Gini Gain'. This is what’s used to pick the best split in a decision tree: the higher the Gini Gain, the better the split.

In the Gini formula below, 'i' ranges over the target classes.

   ![gini_formula](https://raw.githubusercontent.com/satishgunjal/images/master/gini_formula.png)

So in case of 'gini', decision tree will split the data using the feature that provides the highest gini gain.
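
The same kind of sketch works for Gini. It assumes the common definition of Gini impurity as one minus the sum of squared class proportions, with branches weighted by their share of the samples. The helper names `gini` and `gini_gain` and the toy labels are hypothetical and are not part of sklearn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: probability of misclassifying a random sample
    when labelling it at random according to the class distribution."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the branches."""
    total = len(parent)
    weighted = sum(len(child) / total * gini(child) for child in children)
    return gini(parent) - weighted


# Toy data: ten samples, two equally frequent classes.
parent = ["a"] * 5 + ["b"] * 5
perfect_split = [["a"] * 5, ["b"] * 5]                             # each branch is pure
useless_split = [["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]]  # still 50/50

print("gini of parent:", gini(parent))
print("gain of perfect split:", gini_gain(parent, perfect_split))
print("gain of useless split:", gini_gain(parent, useless_split))
```

Note that no logarithm is needed here, only squares and sums.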

### So Which Should We Use?
Gini impurity is faster to compute because it does not require logarithms, though in practice neither metric reliably produces a more accurate tree than the other. We will use entropy in this lab.

# Advantages Of Decision Tree
* Simple to understand and to interpret. Trees can be visualized.
* Requires little data preparation. Other techniques often require data normalization, creation of dummy variables and removal of blank values. Note, however, that this module does not support missing values.
* Able to handle both numerical and categorical data.
* Able to handle multi-output problems.
* Uses a white box model. Results are easy to interpret.
* Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

# Disadvantages Of Decision Tree
* Decision-tree learners can create over-complex trees that do not generalize well to unseen data. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
* Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
* Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
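
To make the overfitting point concrete, the sketch below compares an unconstrained tree with one limited by `max_depth` and `min_samples_leaf`, which are real parameters of sklearn's `DecisionTreeClassifier`. The synthetic dataset from `make_classification` and the specific parameter values are assumptions chosen only for illustration; they are not recommendations and are unrelated to the iris data used later.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomly reassigns some labels).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

# No limits: the tree keeps splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Limited depth and leaf size: a simpler tree that cannot memorise the noise.
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_demo, y_demo)

print("unconstrained: depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("constrained:   depth", small_tree.get_depth(), "leaves", small_tree.get_n_leaves())
```

Whether the simpler tree actually predicts better has to be checked on held-out data; the depth and leaf counts only show how much the constraints reduce the tree's complexity.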

# Classification Problem Example
For the classification exercise we are going to use sklearn's iris plant dataset.
The objective is to classify iris flowers into three species (setosa, versicolor or virginica) from measurements of the length and width of their sepals and petals.

## Understanding the IRIS dataset
* iris.DESCR > Complete description of the dataset
* iris.data > Data to learn from. Each sample is an array of 4 feature values. There are 150 samples in total
* iris.feature_names > Array of all 4 feature names ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
* iris.filename > CSV file name
* iris.target > The classification labels. Every sample has one classification label (0, 1 or 2). Here 0 stands for setosa, 1 for versicolor and 2 for virginica
* iris.target_names > The meaning of the labels. It's an array >> ['setosa', 'versicolor', 'virginica']

From the above details it's clear that X = 'iris.data' and y = 'iris.target'.

![Iris_setosa](https://raw.githubusercontent.com/satishgunjal/images/master/iris_species.png)

<sub><sup>Image from [Machine Learning in R for beginners](https://www.datacamp.com/community/tutorials/machine-learning-in-r)</sup></sub>

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import tree
import matplotlib.pyplot as plt
%matplotlib inline
import graphviz


### **Question: List the purpose of each library in comments in the code above**

## Load The Data

In [None]:
iris = datasets.load_iris()
print('Dataset structure= ', dir(iris))

Dataset structure=  ['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']


In [None]:
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
df['flower_species'] = df.target.apply(lambda x : iris.target_names[x]) # Each value from 'target' is used as index to get corresponding value from 'target_names' 

print('Unique target values=',df['target'].unique())

df.sample(5)

Unique target values= [0 1 2]


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | flower_species |
|---|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica |
| 30 | 4.8 | 3.1 | 1.6 | 0.2 | 0 | setosa |
| 25 | 5.0 | 3.0 | 1.6 | 0.2 | 0 | setosa |
| 32 | 5.2 | 4.1 | 1.5 | 0.1 | 0 | setosa |
| 94 | 5.6 | 2.7 | 4.2 | 1.3 | 1 | versicolor |


## **Question: What is the target column and why is it created?**

## **Question: List the mapping of the different target labels to the flower species.**

## **Question: From the above result, write the number of instances in the dataset.**

In [None]:
#Print the top five samples of species 1
##TO DO

In [None]:
#Print the top five samples of species 2
##TO DO

In [None]:
#Print the top five samples of species 3
##TO DO

### View summary of dataset

In [None]:
#Print the overall summary of the dataset using pandas info method
##TO DO

In [None]:
##Check the distribution of species in the data using the pandas value_counts method.
#TO DO

## **Question: What is the class distribution of each species of flower in the dataset? Is the data evenly distributed or not?**

In [None]:
#Check whether the data contains any missing values, using the pandas DataFrame isnull method.
##TO DO

## **Question: Do you find any missing values in the data?**

## Build Machine Learning Model

In [None]:
#Let's create the feature matrix X and the labels y
X = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = df[['target']]

print('X shape=', X.shape)
print('y shape=', y.shape)

### Create Test And Train Dataset
* We will split the dataset, so that we can use one set of data for training the model and one set of data for testing the model
* We will keep 20% of data for testing and 80% of data for training the model

In [None]:
X_train,X_test, y_train, y_test = #TO DO
print('X_train dimension= ', X_train.shape)
print('X_test dimension= ', X_test.shape)
print('y_train dimension= ', y_train.shape)
print('y_test dimension= ', y_test.shape)

X_train dimension=  (120, 4)
X_test dimension=  (30, 4)
y_train dimension=  (120, 1)
y_test dimension=  (30, 1)


Now let's train the model using a Decision Tree

In [None]:
"""
To obtain deterministic behaviour during fitting, always set a value for the 'random_state' parameter.
Also note that the default criterion for splitting the data is 'gini'.
You are requested to use the 'entropy' criterion in this lab.
"""
cls = tree.DecisionTreeClassifier()  # TO DO: pass the required arguments
#Fit the model using the fit method.

### Testing The Model
* For testing we are going to use the test data only
* Question: Predict the species of the 10th, 20th and 29th test examples from the test data

In [None]:
#TO DO

### Model Score
Check the model score using test data

In [None]:
#Check model accuracy using the score method.
#TO DO

## Visualize The Decision Tree
We will use the plot_tree() function from sklearn to plot the tree, and then export the tree in Graphviz format using the export_graphviz exporter. The result will be saved in the file iris_decision_tree.pdf.

In [None]:
plt.figure(figsize=(12,8))
tree.plot_tree(cls) 

In [None]:
dot_data = tree.export_graphviz(cls, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris_decision_tree") 

In [None]:
dot_data = tree.export_graphviz(cls, out_file=None, 
                      feature_names=iris.feature_names,  
                      class_names=iris.target_names,  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 