# Decision Tree Classification in Python

## Topics: 
* Decision Tree Classification
* Attribute selection measures
* How to build and optimize a Decision Tree Classifier using the Python Scikit-learn package.

    * Decision Tree Algorithm
    * How does the Decision Tree algorithm work?
    * Attribute Selection Measures
    * Information Gain
    * Gain Ratio
    * Gini Index
    * Optimizing Decision Tree Performance
    * Classifier Building in Scikit-learn
    * Pros and Cons



# Business Problem

* As a marketing manager, you want a set of customers who are most likely to purchase your product. 
    * Finding this audience is how you save on your marketing budget.


* As a loan manager, you need to identify risky loan applications to achieve a lower loan default rate. 
    * This process of classifying customers into potential and non-potential customers, or loan applications into safe and risky ones, is known as a classification problem.


* Classification is a two-step process: 
    * Learning step
        * In the learning step, the model is developed based on given training data. 
        
    * Prediction step 
        * In the prediction step, the model is used to predict the response for given data.


### A decision tree is one of the easiest and most popular classification algorithms to understand and interpret. It can be used for both classification and regression problems.


# Decision Tree Algorithm


A decision tree is a flowchart-like tree structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome.


The topmost node in a decision tree is known as the root node. 


The tree learns to partition the data on the basis of attribute values.


It partitions the data recursively, a process called recursive partitioning.


This flowchart-like structure helps you in decision making. Its visualization resembles a flowchart diagram, which closely mimics human-level thinking.


That is why decision trees are easy to understand and interpret.

<img src='img/dt1.JPG'>
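
To make the structure described above concrete, here is a minimal sketch of a tree held in memory and used for one prediction. The `Node` class, the `predict` helper, and the small hand-built tree about customer purchases are all hypothetical illustrations, not part of Scikit-learn or of the figure above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a decision tree.

    An internal node tests `feature`; each key in `branches` is a decision
    rule (a feature value) that leads to a child node. A leaf has `label` set.
    """
    feature: Optional[str] = None
    branches: Dict[str, "Node"] = field(default_factory=dict)
    label: Optional[str] = None


def predict(node: Node, record: Dict[str, str]) -> str:
    """Walk from the root to a leaf, following the branch that matches the record."""
    while node.label is None:
        node = node.branches[record[node.feature]]
    return node.label


# Hypothetical hand-built tree: the root tests "income", one child tests "student".
tree = Node(
    feature="income",
    branches={
        "high": Node(label="buys"),
        "low": Node(
            feature="student",
            branches={
                "yes": Node(label="buys"),
                "no": Node(label="does not buy"),
            },
        ),
    },
)

print(predict(tree, {"income": "low", "student": "yes"}))
```

Each prediction is a single path from the root to a leaf, which is why the decision behind any individual outcome can be read off directly.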

A decision tree is a white-box type of ML algorithm.


It exposes its internal decision-making logic, which is not available in black-box algorithms such as neural networks.


Its training time is typically shorter than that of a neural network.


The time complexity of decision trees is a function of the number of records and number of attributes in the given data. 


The decision tree is a distribution-free or non-parametric method, which does not depend upon probability distribution assumptions. 


Decision trees can handle high-dimensional data with good accuracy.

# How does the Decision Tree algorithm work?
The basic idea behind any decision tree algorithm is as follows:

1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.


2. Make that attribute a decision node and break the dataset into smaller subsets.


3. Build the tree by repeating this process recursively for each child until one of the following conditions is met:


    * All the tuples belong to the same class.

    * There are no more remaining attributes.

    * There are no more instances.

<img src='img/dt2.JPG'>
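
The three steps above can be sketched as a short recursive function. This is an illustrative sketch, not the Scikit-learn implementation: it assumes categorical attributes, uses information gain (covered later in this notebook) as the selection measure, and falls back to the majority class when no attributes remain. The function names and the tiny weather-style dataset are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy after splitting on attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    after = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - after


def build_tree(rows, labels, attributes):
    """Return a nested dict {attribute: {value: subtree}} or a class label (leaf)."""
    # Stop: every tuple has the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left, so fall back to the majority class (an assumption).
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step 1: select the best attribute with the selection measure.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]

    # Step 2: make it a decision node and break the data into subsets.
    subsets = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = subsets.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)

    # Step 3: recurse on each child. Only values seen in the data get a branch,
    # so an empty subset ("no more instances") cannot occur in this sketch.
    return {
        best: {
            value: build_tree(sub_rows, sub_labels, remaining)
            for value, (sub_rows, sub_labels) in subsets.items()
        }
    }


# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data the root splits on `outlook`; the `sunny` branch is already pure and becomes a leaf, while the `rainy` branch needs one more split on `windy`.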

# Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting criterion that partitions the data in the best possible manner.


It is also known as a splitting rule because it helps determine the breakpoints for tuples at a given node.


An ASM ranks each feature (or attribute) by how well it explains the given dataset.


The attribute with the best score is selected as the splitting attribute.


In the case of a continuous-valued attribute, split points for the branches also need to be defined.


The most popular selection measures are Information Gain, Gain Ratio, and Gini Index.
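
The text does not say how split points for a continuous attribute are chosen. One common approach, assumed here for illustration, is to take the midpoints between adjacent sorted values as candidate thresholds and keep the one with the lowest weighted impurity. The sketch below uses entropy as the impurity; `best_split_point` and the age data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split_point(values, labels):
    """Return (threshold, weighted_entropy) for the best split `value <= threshold`.

    Returns None when all values are identical, since no split is possible.
    """
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    best = None
    for threshold in candidates:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical applicant ages and loan outcomes.
ages = [22, 25, 31, 38, 45, 52]
outcome = ["risky", "risky", "risky", "safe", "safe", "safe"]

threshold, score = best_split_point(ages, outcome)
print(threshold, score)
```

Here the classes separate cleanly between ages 31 and 38, so the midpoint between them yields two pure partitions.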

## Reference Material for more details
http://www.ijoart.org/docs/Construction-of-Decision-Tree--Attribute-Selection-Measures.pdf

## Information Gain
Claude Shannon introduced the concept of entropy in information theory, where it measures the impurity of the input set.


In physics and mathematics, entropy refers to the randomness or impurity in a system.


In information theory, it refers to the impurity in a group of examples. Information gain is the decrease in entropy. 


Information gain is the difference between the entropy before the split and the weighted average entropy after the dataset is split on the values of a given attribute. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.

<img src='img/dt3.JPG'>

* Info(D) is the average amount of information needed to identify the class label of a tuple in D.
* |Dj|/|D| acts as the weight of the jth partition.
* InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.


The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
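
The quantities above can be computed in a few lines. The sketch below assumes the standard definitions: Info(D) is the sum of -p_i * log2(p_i) over the class proportions, InfoA(D) is the sum of |Dj|/|D| * Info(Dj) over the partitions, and Gain(A) = Info(D) - InfoA(D). The function names and the six-customer dataset are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Info(D): entropy in bits, the sum of -p_i * log2(p_i) over class proportions."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_after_split(values, labels):
    """InfoA(D): entropy of each partition Dj, weighted by |Dj| / |D|."""
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(p) / len(labels) * info(p) for p in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - InfoA(D)."""
    return info(labels) - info_after_split(values, labels)


# Hypothetical data: whether a customer buys, plus two candidate attributes.
buys = ["yes", "yes", "yes", "no", "no", "no"]
student = ["yes", "yes", "yes", "no", "no", "no"]  # separates the classes perfectly
region = ["north", "south", "north", "south", "north", "south"]  # nearly uninformative

print("Gain(student):", gain(student, buys))
print("Gain(region): ", gain(region, buys))
```

Because `student` splits the data into pure partitions, its gain equals the full entropy of the dataset, so it would be chosen over `region` as the splitting attribute.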