You will learn about decision trees and move on to random forests, which are a collection of multiple decision trees. A collection of multiple models is called an **ensemble**.

With high interpretability and an intuitive algorithm, decision trees mimic the human decision-making process and excel in dealing with categorical data. Unlike linear models such as logistic regression or linear SVMs, decision trees do not assume a linear relationship between the independent variables and the target variable. Rather, they can be used to **model highly nonlinear data**.

With decision trees, you can easily explain all the factors leading to a particular decision/prediction. Hence, they are easily understood by business people.

Random forests, being collections of multiple trees, are one of the most successful and popular models in machine learning.



### Decision Trees

As the name suggests, a decision tree uses a tree-like model to make predictions. It resembles an upside-down tree. It is also very similar to how you make decisions in real life: you ask a series of questions to arrive at a decision.

A decision tree splits the data into multiple sets. Then, each of these sets is further split into subsets to arrive at a decision. 

Decision trees naturally represent the way we make decisions. Think of a machine learning model as a decision-making engine that makes a decision on any given input object (data point). Imagine a doctor deciding (diagnosing) whether a patient is suffering from a particular condition given the patient data; an insurance company deciding whether a claim on a particular insurance policy needs to be paid out given the policy and claim data; or a company deciding which role an applicant is eligible to apply for, based on the applicant's track record and other details. Solutions to each of these can be thought of as machine learning models trying to mimic human decision-making.

Refer to the figure below. The Heart dataset consists of data about various cardiac parameters, along with an indicator column that says whether the person has heart disease or not.

![image-3.png](attachment:image-3.png)

For the Heart dataset, the leaf nodes (bottom) are labelled 1 (no heart disease) or 2 (has heart disease). The decision tree model predicts that if a person has thal of type 3 (normal), pain.type other than {1,2,3}, and a number of blood vessels coloured by fluoroscopy (flouroscopy.coloured) greater than 0.5, then the person has heart disease. This example represents the path left->right->right starting from the top (the root). In general, in a decision tree:

* The leaf nodes represent the final decisions
* Each intermediate node represents a simple test on one of the attributes.
* The path from the root to a leaf corresponds to a conjunction of the tests at each of the nodes on the path. We say a test data point "follows a path" on a decision tree if it passes all the tests on that path. The branches out of an intermediate node are exclusive: the test data point can follow exactly one branch out of every intermediate node it encounters.
* The prediction by a decision tree on a data point is the one corresponding to the leaf at which the path followed by the data point ends.
* There could be multiple leaves representing the same class (decision). In the heart disease example, a person does not have heart disease if: (thal=3 and pain.type in {1,2,3}) or (thal=3 and pain.type not in {1,2,3} and flouroscopy.coloured<0.5) or (thal!=3 and flouroscopy.coloured<0.5 and exercise.angina=0 and age>=51). Note that when the thal is normal, by and large the heart is normal.
* In this sense, the thal type is a major indicator of heart disease (this is apparent from the length of the leader lines from the root node). The last condition (the one leaf labelled 1 on the right branch) may seem a little counter-intuitive: an abnormal thal (right branch) is probably expected beyond age 51 and so is not considered heart disease, whereas below age 51 it would be. In general, the decision of the tree is the value y if the OR of the conditions corresponding to the paths from the root to each of the leaves with value y is true for the given data point.
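The OR-of-paths reading of the heart example can be sketched directly as a rule. This is only an illustration of the three "no heart disease" paths quoted above, not the fitted model itself: the function and argument names are hypothetical Python-friendly stand-ins for the dataset columns, and the sketch assumes that every leaf is labelled either 1 or 2, so any point that matches none of the class-1 paths ends at a class-2 leaf.

```python
def predict_heart(thal, pain_type, vessels_coloured, exercise_angina, age):
    """Return 1 (no heart disease) or 2 (has heart disease).

    Each parenthesised clause is one root-to-leaf path (a conjunction of tests);
    the prediction is 1 if ANY of those paths is satisfied (an OR of conjunctions).
    """
    no_disease = (
        (thal == 3 and pain_type in {1, 2, 3})
        or (thal == 3 and pain_type not in {1, 2, 3} and vessels_coloured < 0.5)
        or (thal != 3 and vessels_coloured < 0.5
            and exercise_angina == 0 and age >= 51)
    )
    # Assumption: leaves are labelled only 1 or 2, so every other path ends in 2.
    return 1 if no_disease else 2


# The left->right->right path from the text: normal thal, pain.type outside
# {1,2,3}, and more than 0.5 coloured vessels. The values 4 and 1 are illustrative.
print(predict_heart(thal=3, pain_type=4, vessels_coloured=1,
                    exercise_angina=0, age=60))  # 2
```

Because the branches out of each node are exclusive, at most one of the three clauses can be true for any data point.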

We generally assume, at least for explanation, that the decision trees we consider are binary: every intermediate node has exactly two children. This is not a restriction, since any more general tree can be converted into an equivalent binary tree. In practice, however, splits on attributes that have many distinct values (for example, a continuous-valued attribute) are usually implemented as binary splits, while splits on attributes with few distinct values may be implemented as multi-way splits.

![image-2.png](attachment:image-2.png)



### Regression with Decision Trees

There are cases where you cannot directly apply linear regression to solve a regression problem. Linear regression fits a single model to the entire data set, whereas you may want to divide the data set into multiple subsets and apply linear regression to each subset separately.

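In regression problems, a decision tree also splits the data into multiple subsets. The difference between decision tree classification and decision tree regression is that in regression, each leaf represents a numeric prediction rather than a class label. In the variant described here, each leaf holds a linear regression model fitted to its subset; many common implementations, including sklearn's DecisionTreeRegressor, by default predict the average of the target values in the leaf instead.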
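The idea of splitting the data and fitting a separate linear regression to each subset can be sketched with a single split on one feature. This is a minimal illustration under stated assumptions, not a full tree learner: the helper names `fit_line`, `best_split`, and `predict` are hypothetical, the data is synthetic, and the split is chosen by minimising the total squared error of the two fitted lines.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b. Returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    b = mean_y - a * mean_x
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def best_split(xs, ys, min_leaf=2):
    """Try each threshold between consecutive sorted x values and keep the
    one whose two per-subset lines have the lowest total squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left, right = pairs[:i], pairs[i:]
        if left[-1][0] == right[0][0]:
            continue  # no threshold separates equal x values
        threshold = (left[-1][0] + right[0][0]) / 2
        a_l, b_l, sse_l = fit_line(*zip(*left))
        a_r, b_r, sse_r = fit_line(*zip(*right))
        total = sse_l + sse_r
        if best is None or total < best[0]:
            best = (total, threshold, (a_l, b_l), (a_r, b_r))
    if best is None:
        return None
    return best[1], best[2], best[3]


def predict(model, x):
    """Route x to the left or right subset, then apply that subset's line."""
    threshold, left_line, right_line = model
    a, b = left_line if x <= threshold else right_line
    return a * x + b


# Synthetic data with two regimes: y = 2x for x < 5, and y = 30 - x otherwise.
xs = list(range(10))
ys = [2 * x if x < 5 else 30 - x for x in xs]
model = best_split(xs, ys)
print(model)
print(predict(model, 3), predict(model, 8))
```

A single straight line cannot follow both regimes of this data, while one split followed by two per-subset fits can.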


### Homogeneity Measures

* Gini Index
* Information Gain / Entropy-based
* Splitting by R-squared
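As a rough sketch of the first two measures, the functions below compute the Gini index and entropy of a list of class labels, and the information gain of a candidate split. The function names are hypothetical, and the Gini index is assumed to be in its impurity form, 1 minus the sum of squared class proportions, so that lower values mean a more homogeneous set.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of the labels belonging to each class."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini(labels):
    """Gini impurity: 0 for a pure set, larger for a more mixed set."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure set."""
    return sum(p * log2(1 / p) for p in class_proportions(labels))


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = [1, 1, 2, 2]
print(gini(mixed))     # 0.5
print(entropy(mixed))  # 1.0
# A split that separates the two classes perfectly recovers all the entropy.
print(information_gain(mixed, [[1, 1], [2, 2]]))  # 1.0
```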

### Tree Truncation

Decision trees have a strong tendency to overfit the data. Practical uses of decision trees must therefore incorporate some regularization measures to ensure that the tree being built does not become more complex than necessary and start to overfit. There are broadly two ways of regularizing decision trees:

* Truncate the decision tree during the training (growing) process preventing it from degenerating into one with one leaf for every data point in the training dataset. One or more stopping criteria are used to decide if the decision tree needs to be grown further. 
* Let the tree grow to any complexity, then add a post-processing step in which the tree is pruned in a bottom-up fashion, starting from the leaves. It is more common to use pruning strategies to avoid overfitting in practical implementations.

Some popular stopping criteria and pruning strategies are described in the following subsections.

#### Decision Tree Stopping Criteria (Truncation)

There are several ways to truncate decision trees before they start to overfit. 

* Minimum Size of the Partition for a Split: Stop partitioning further when the current partition is small enough. 
* Minimum Change in Homogeneity Measure: Do not partition further when even the best split causes an insignificant change in the purity measure (difference between the current purity and the purity of the partitions created by the split).
* Limit on Tree Depth: If the current node is farther away from the root than a threshold, then stop partitioning further. 
* Minimum Size of the Partition at a Leaf: If any of the partitions from a split has fewer data points than this threshold, then do not consider the split. Notice the subtle difference between this condition and the minimum size required for a split.
* Maximum Number of Leaves in the Tree: If the current number of bottom-most nodes in the tree exceeds this limit, then stop partitioning.
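The criteria above can be sketched as a single check that a tree-growing procedure might run before accepting a split. All names here (`StoppingCriteria`, `stop_reason`, and their fields) are hypothetical, and the exact boundary conventions are assumptions: for example, this sketch refuses to split a node that is already at the depth limit or when the tree has already reached the leaf limit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StoppingCriteria:
    min_split_size: int = 2   # minimum partition size needed to attempt a split
    min_gain: float = 0.0     # minimum improvement in the homogeneity measure
    max_depth: int = 10       # limit on distance from the root
    min_leaf_size: int = 1    # minimum size of each partition a split creates
    max_leaves: int = 1000    # limit on the number of leaves in the tree


def stop_reason(n_samples: int, depth: int, n_leaves: int, best_gain: float,
                child_sizes: Sequence[int],
                criteria: StoppingCriteria) -> Optional[str]:
    """Return why the node should stay a leaf, or None if the split may go ahead."""
    if n_samples < criteria.min_split_size:
        return "partition too small to split"
    if best_gain < criteria.min_gain:
        return "best split changes homogeneity too little"
    if depth >= criteria.max_depth:
        return "depth limit reached"
    if min(child_sizes) < criteria.min_leaf_size:
        return "split would create a leaf that is too small"
    if n_leaves >= criteria.max_leaves:
        return "leaf limit reached"
    return None


criteria = StoppingCriteria(min_split_size=10, min_gain=0.01, max_depth=3,
                            min_leaf_size=4, max_leaves=8)
# 12 points is enough to attempt a split, but one side would get only 2 points.
print(stop_reason(n_samples=12, depth=1, n_leaves=2, best_gain=0.2,
                  child_sizes=(10, 2), criteria=criteria))
```

The usage example shows the difference between the two size conditions: the node passes the minimum size for a split but fails the minimum size at a leaf.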

#### Decision Tree (Post)-Pruning

One popular approach to pruning is to use a validation set, that is, a set of labelled data points typically kept aside from the original training dataset. This method, called reduced-error pruning, considers every one of the test (non-leaf) nodes for pruning. Pruning a node means removing the entire subtree below the node, making it a leaf, and assigning it the majority class (or the average of the values in the case of regression) among the training data points that pass through that node. A node in the tree is pruned only if the decision tree obtained after the pruning has an accuracy on the validation dataset that is no worse than that of the tree prior to pruning. This ensures that parts of the tree that were added due to accidental irregularities in the data are removed, as these irregularities are not likely to repeat.
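A minimal sketch of reduced-error pruning on a small hand-built classification tree follows. The `Node` class and the function names are hypothetical, and the sketch assumes each node already stores the majority class of the training points that pass through it, so that turning the node into a leaf only requires dropping its children.

```python
class Node:
    def __init__(self, label, feature=None, threshold=None, left=None, right=None):
        self.label = label          # majority training class at this node
        self.feature = feature      # index of the attribute tested here
        self.threshold = threshold  # go left if x[feature] <= threshold
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None


def predict(node, x):
    """Follow the tests from the given node down to a leaf."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def accuracy(root, X, y):
    return sum(predict(root, x) == t for x, t in zip(X, y)) / len(y)


def reduced_error_prune(root, node, X_val, y_val):
    """Bottom-up: prune the subtrees first, then try turning this node into a leaf."""
    if node.is_leaf():
        return
    reduced_error_prune(root, node.left, X_val, y_val)
    reduced_error_prune(root, node.right, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = (node.left, node.right)
    node.left = node.right = None              # tentatively prune
    if accuracy(root, X_val, y_val) < before:  # worse on validation: undo
        node.left, node.right = saved


# Hand-built tree: the split under the right child is assumed to be fitted to noise.
noisy = Node("B", feature=1, threshold=2, left=Node("B"), right=Node("A"))
root = Node("A", feature=0, threshold=5, left=Node("A"), right=noisy)

X_val = [(1, 0), (2, 3), (7, 1), (8, 3), (9, 4)]
y_val = ["A", "A", "B", "B", "B"]

reduced_error_prune(root, root, X_val, y_val)
print(root.is_leaf(), root.right.is_leaf())  # False True
```

In this example the noisy split is removed because validation accuracy does not drop without it, while the root split is kept because removing it would misclassify every "B" point.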

Though there are various ways to truncate or prune trees, the DecisionTreeClassifier class in sklearn provides the following hyperparameters, which you can control:

* **criterion (Gini/IG or entropy):** It defines the function used to measure the quality of a split. Sklearn supports “gini” for the Gini index and “entropy” for information gain. By default, it takes the value “gini”.

* **max_features:** It defines the number of features to consider when looking for the best split. It accepts an integer, a float, a string, or None.

    1. If an integer is given, then that many features are considered at each split.
    2. If a float is given, then it is treated as a fraction, and max(1, int(max_features * n_features)) features are considered at each split.
    3. If “sqrt” is given, then max_features=sqrt(n_features). Older sklearn versions also accepted “auto” with the same meaning for this class; that value was later deprecated.
    4. If “log2” is given, then max_features=log2(n_features).
    5. If None, then max_features=n_features. By default, it takes the value None.
    
    
* **max_depth:** It denotes the maximum depth of the tree. It can take any integer value or None. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.

* **min_samples_split:** It specifies the minimum number of samples required to split an internal node. If an integer is given, then it is used as the minimum number. If a float is given, then it is treated as a fraction, and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. By default, it takes the value 2.

* **min_samples_leaf:** It specifies the minimum number of samples required to be at a leaf node. If an integer is given, then it is used as the minimum number. If a float is given, then it is treated as a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each leaf. By default, it takes the value 1.
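The hyperparameters above can be passed straight to the constructor. The snippet below is a small sketch rather than a recommended configuration: the iris dataset, the train/test split, and the specific values are arbitrary choices made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset bundled with sklearn (an assumption of this sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(
    criterion="entropy",   # split quality measured by information gain
    max_features=None,     # consider all features at every split
    max_depth=3,           # truncate the tree at depth 3
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
)
clf.fit(X_train, y_train)

print("depth:", clf.get_depth())
print("leaves:", clf.get_n_leaves())
print("test accuracy:", clf.score(X_test, y_test))
```

Tightening `max_depth`, `min_samples_split`, or `min_samples_leaf` yields a smaller tree, which is the truncation form of regularization discussed earlier.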
