#### Decision Tree

A **decision tree**, as the term suggests, uses a tree-like model to make predictions. It resembles an upside-down tree and uses a similar process that you do to make decisions in real life, i.e., by asking a series of questions to arrive at a decision.

![image.png](attachment:image.png)

#### Building a decision tree

Constructing a decision tree involves the following steps:

1. Recursively partitioning the data into smaller subsets using binary splits
2. Selecting the best rule from a variable/attribute for the split
3. Applying the split based on the rules obtained from the attributes
4. Repeating the process for the subsets obtained
5. Continuing the process until the stopping criterion is reached
6. Assigning the majority class/average value as the prediction

![image-2.png](attachment:image-2.png)

The decision tree building process is a top-down approach. The top-down approach refers to the process of starting from the top with the whole data and gradually splitting the data into smaller subsets. 

The process is also called greedy because, at each node, it picks the split that gives the best immediate result and does not take into account what will happen in the next two or three steps. This means that the process is not holistic in nature, as it only aims to gain an immediate result that is derived after splitting the data at a particular node based on a certain rule of the attribute. One consequence is that the entire structure of the tree can change with small variations in the input data, which, in turn, changes the way you split and the final decisions altogether.
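The greedy, top-down procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes numeric features, Gini impurity as the split criterion, midpoints between sorted values as candidate thresholds and a `max_depth` stopping rule. The names `best_split`, `build_tree` and `predict` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Greedy step: the (feature, threshold) pair with the lowest weighted Gini.

    Returns None when no candidate split improves on the unsplit node.
    """
    n = len(rows)
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Assumption: candidate thresholds are midpoints of adjacent values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Split recursively until no split helps or max_depth is reached."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    feature, threshold = split
    left = [(row, y) for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [(row, y) for row, y in zip(rows, labels) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf by answering one question per node."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

toy_tree = build_tree([[1.0], [2.0], [3.0], [4.0]], ["a", "a", "b", "b"])
print(toy_tree)
```

Because `best_split` only looks one step ahead, it returns `None` whenever no single split lowers the impurity, even if a combination of two splits would. That is the greedy limitation described above.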

### Tree Models Over Linear Models

1. There are certain cases where you cannot directly apply linear regression to solve a regression problem. Linear regression fits only one model to the entire data set; in such cases, you may instead want to divide the data set into multiple subsets and apply a decision tree algorithm to handle the non-linearity.

2. Predictions made by a decision tree are easily interpretable.

3. A decision tree is versatile in nature. It does not assume anything specific about the nature of the attributes in a data set. It can seamlessly handle all kinds of data such as numeric, categorical, strings, Boolean, etc.

4. A decision tree is scale-invariant. It does not require normalisation, as it only has to compare the values within an attribute, and it handles multicollinearity better.

5. Decision trees often give us an idea of the relative importance of the explanatory attributes that are used for prediction.

6. They are highly efficient and fast algorithms.

7. They can identify complex relationships and work well in certain cases where you cannot fit a single linear relationship between the target and feature variables. This is where regression with decision trees comes into the picture.


In regression problems, a decision tree splits the data into multiple subsets. The difference between decision tree classification and decision tree regression is that in regression, each leaf represents the average of all the values as the prediction as opposed to a class label in classification trees. For classification problems, the prediction is assigned to a leaf node using majority voting but for regression, it is done by taking the average value.
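The difference between the two kinds of leaf can be shown with a short sketch. The function name `leaf_prediction` and its `task` argument are hypothetical; the input is simply the list of training targets that end up in one leaf.

```python
from collections import Counter
from statistics import mean

def leaf_prediction(targets, task):
    """Return the value a leaf predicts from the training targets that reach it."""
    if task == "classification":
        # Majority vote: the most frequent class label in the leaf.
        return Counter(targets).most_common(1)[0][0]
    if task == "regression":
        # Average of the target values in the leaf.
        return mean(targets)
    raise ValueError(f"unknown task: {task}")

print(leaf_prediction(["yes", "no", "yes"], "classification"))
print(leaf_prediction([10.0, 12.0, 17.0], "regression"))
```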

The disadvantages of decision trees include:

1. Decision-tree learners can create over-complex trees that do not generalise well to unseen data. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

2. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

3. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

4. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

5. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

**Splitting and Homogeneity**

This brings us to the following questions. Given many attributes, how do you decide which rules obtained from the attributes to choose in order to split the data set? From a single feature, you can get many rules, and you may use any of these to make the split. Do you randomly select these and split the data set, or should there be a selection criterion for choosing one over the other? What are you trying to achieve with the split?

If a partition contains data points with identical labels (for example, label 1), then you can classify the entire partition as that particular label (label 1). However, this is an oversimplified example. In real-world data sets, you will almost never have completely homogeneous data sets (or even nodes) after splitting the data. Hence, it is important that you try to split the nodes such that the resulting nodes are as homogeneous as possible. One important thing to remember is that homogeneity here always refers to the homogeneity of the response (target) variable.

For example, consider the heart disease example, in which you want to classify whether a person has heart disease or not. If one of the nodes is labelled ‘Blood Pressure’, try to split it with a rule such that all the data points that pass the rule have one label and those that do not pass the rule have a different label. Thus, you need to ensure that the response variable's homogeneity in the resultant splits is as high as possible.

**For classification purposes, a data set is completely homogeneous if it contains only a single class label. For regression purposes, a data set is completely homogeneous if its variance is as small as possible. You will understand regression trees better in the upcoming segments.**

A tree can be split based on different rules of an attribute, and these attributes can be categorical or continuous in nature. If an attribute is nominal categorical, then there are $ 2^{k-1} - 1 $ possible binary splits for this attribute, where k is the number of distinct categories (levels) of the attribute. In this case, each possible subset of categories is examined to determine the best split.
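The count $ 2^{k-1} - 1 $ can be checked by enumerating the splits directly. The sketch below is illustrative only (the function name `nominal_binary_splits` is hypothetical): it keeps one category fixed on the left side so that mirror-image splits are not counted twice.

```python
from itertools import combinations

def nominal_binary_splits(categories):
    """List every way to divide a set of categories into two non-empty groups."""
    cats = sorted(set(categories))
    first, rest = cats[0], cats[1:]
    splits = []
    # Fixing the first category on the left avoids counting each split twice.
    # The right group must keep at least one category, hence size < len(rest).
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(cats) - left
            splits.append((left, right))
    return splits

for left, right in nominal_binary_splits(["red", "green", "blue"]):
    print(sorted(left), "|", sorted(right))
```

With three categories this yields 3 splits and with four categories 7, matching the formula.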

If an attribute is ordinal categorical or continuous in nature with n different values, there are **n - 1** different possible splits for it. The values of the attribute are sorted from the smallest to the largest, and candidate splits based on the individual values are examined to determine the best split point, which maximizes the homogeneity at a node.

Different algorithms use various other techniques for handling continuous features, such as calculating percentiles or midpoints of the sorted values. This process is known as discretization.
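As a small illustration of the midpoint technique, the following sketch lists the candidate thresholds for one continuous attribute. The function name `candidate_thresholds` is hypothetical, and using midpoints is only one of the conventions mentioned above.

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct values: n distinct values give n - 1 candidates."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

print(candidate_thresholds([3, 1, 2, 2, 5]))
```

Duplicates are removed first, so the four distinct values here produce three candidate split points.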


Various methods, such as the classification error, Gini index and entropy, can be used to quantify homogeneity.

**Impurity Measures**

![image-3.png](attachment:image-3.png)

In practice, classification error does not perform well. So, we generally prefer using either the Gini index or entropy over it.

***The Gini index is the probability of a randomly chosen data point being classified incorrectly.***
A Gini index of 0 indicates that all the data points belong to a single class. For a two-class problem, a Gini index of 0.5 (the maximum) indicates that the data points are equally distributed between the two classes.
**Hence, the higher the homogeneity, the lower the Gini index.**


**Entropy**

***Entropy quantifies the degree of disorder in the given data.*** For a two-class problem, its value varies from 0 to 1. Entropy and the Gini index behave similarly. If a data set is completely homogeneous, then the entropy of such a data set will be 0, i.e., there is no disorder in the data. If a data set contains an equal distribution of both the classes, then the entropy of that data set will be 1, i.e., there is complete disorder in the data. Hence, like the Gini index, the higher the homogeneity, the lower the entropy.
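Both measures can be computed from the class proportions in a node. The sketch below assumes the standard definitions, Gini $ = 1 - \sum_i p_i^2 $ and entropy $ = -\sum_i p_i \log_2 p_i $, where $ p_i $ is the proportion of class i. The function names are illustrative.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Proportion of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits; only classes that are present contribute, so log2(0) never occurs."""
    return sum(-p * log2(p) for p in class_proportions(labels))

pure = ["yes"] * 4
balanced = ["yes", "yes", "no", "no"]
print(gini(pure), entropy(pure))
print(gini(balanced), entropy(balanced))
```

The pure node scores 0 on both measures, while the evenly mixed two-class node reaches the two-class maximum of each (0.5 for Gini, 1 for entropy).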

#### How do you identify the attribute that results in the best split?

To find the attribute that gives the best split, we need to calculate the difference in impurity before and after the split. If the split is meaningful, the post-split impurity (for example, entropy) will be lower than that of the parent.

The change in impurity or the purity gain is given by the difference of impurity post-split from impurity pre-split, i.e.,
 $ \Delta \text{Impurity} = \text{Impurity(pre-split)} - \text{Impurity(post-split)} $
 
The post-split impurity is calculated as the weighted average of the impurities of the two child nodes, where each child is weighted by the fraction of data points it receives. The split that results in the maximum gain is chosen as the best split.

When entropy is used as the impurity measure, this gain is called the information gain and is calculated as:

$ Gain = D - D_{A} $

where D is the entropy of the parent set (data before splitting) and
$ D_{A} $ is the weighted average entropy of the partitions obtained after splitting on attribute A. Note that a reduction in entropy implies information gain.

We always try to maximise information gain by achieving maximum homogeneity and this is possible only when the value of entropy decreases from the parent set after splitting.

In case of a classification problem, you always try to maximise purity gain or reduce the impurity at a node after every split and this process is repeated till you reach the leaf node for the final prediction. 
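The gain calculation can be sketched as follows, assuming entropy as the impurity measure and child nodes weighted by their share of the parent's data points. The function names and the toy labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The first split separates the classes completely and recovers the full parent entropy as gain; the second leaves both children as mixed as the parent, so the gain is zero and the split would not be chosen.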

**Disadvantages**

The following is a summary of the disadvantages of decision trees:

1. They tend to overfit the data. If allowed to grow with no check on its complexity, a decision tree will keep splitting until it has correctly classified (or rather, memorised) all the data points in the training set.
2. They tend to be quite unstable, which is an implication of overfitting. A few changes in the data can considerably change a tree.

#### Decision Tree Regression

![image-4.png](attachment:image-4.png)

A higher value of MSE means that the data values are dispersed widely around the mean, whereas a lower value of MSE means that they are dispersed closely around the mean. The latter is usually the preferred case while building a regression tree.

 

The regression tree building process can be summarised as follows:

1. Calculate the MSE of the target variable.
2. Split the data set based on different rules obtained from the attributes and calculate the MSE for each of these nodes.
3. The weighted MSE of the resulting nodes is subtracted from the MSE before the split. This result is called the MSE reduction.
4. The attribute with the largest MSE reduction is chosen for the decision node.
5. The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until you get significantly low MSE and the node becomes as homogeneous as possible.
6. Finally, when no further splitting is required, assign this as the leaf node and calculate the average as the final prediction when the number of instances is more than one at a leaf node.

So, you need to split the data such that the weighted MSE of the partitions obtained after splitting is lower than that obtained with the original or parent data set. In other words, the fit of the model should be as ‘good’ as possible after splitting. As you can see, the process is surprisingly similar to what you did for classification using trees.
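For a single numeric feature, the search for the split with the largest MSE reduction can be sketched as below. This is a simplified illustration: the function names are hypothetical, and candidate thresholds are assumed to be midpoints between adjacent sorted values.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_mse_split(x, y):
    """Return (threshold, mse_reduction) for one feature, or None if x has one distinct value."""
    n = len(y)
    parent_mse = mse(y)
    distinct = sorted(set(x))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        # Weighted MSE of the two partitions obtained after splitting.
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / n
        reduction = parent_mse - weighted
        if best is None or reduction > best[1]:
            best = (threshold, reduction)
    return best

print(best_mse_split([1, 2, 3, 4], [10.0, 10.0, 20.0, 20.0]))
```

In this toy data the targets jump between x = 2 and x = 3, so the threshold 2.5 removes all of the parent's MSE. A full regression tree would repeat this search over every attribute and recurse on each partition.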

#### Ensemble

An ensemble refers to a group of things viewed as a whole rather than individually. In an ensemble, a collection of models is used to make predictions, rather than individual models. Arguably, the most popular in the family of ensemble models is the random forest, which is an ensemble made by the combination of a large number of decision trees.

In principle, ensembles can be made by combining all types of models. An ensemble can have a logistic regression, a neural network, and a few decision trees working in unison.

**Important Aspects of Ensembling:** Diversity and Acceptability

1. Diversity ensures that the models serve complementary purposes, which means that the individual models make predictions independent of each other, i.e., there is no intended correlation between the outputs of the different models.

2. Acceptability implies that each model is at least better than a random model. This is a pretty lenient criterion for each model to be accepted into the ensemble, i.e., it has to be at least better than a random guesser.

There are a number of ways in which you can bring diversity among the models you plan to include in your ensemble.

1. Use different subsets of training data
2. Use different training hyperparameters
3. Use different types of classifiers, e.g., logistic regression, decision trees and neural networks working in unison
4. Use different features

**Some Popular Ensembles**

Some of the popular approaches to ensembling are **voting, stacking and blending, boosting and bagging**.

Voting combines the output of different algorithms by taking a vote. In the case of a classification model, if the majority of the classifiers predict a particular class, then the output of the model would be the same class. In the case of a regression problem, the model output is the average of all the predictions made by the individual models. In this way, every classifier/regressor has an equal say in the final prediction.
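A minimal sketch of voting, assuming each model has already produced one prediction per sample. The function names `majority_vote` and `average_predictions` are hypothetical.

```python
from collections import Counter
from statistics import mean

def majority_vote(model_predictions):
    """Classification: one list of labels per model; returns the winning label per sample."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*model_predictions)]

def average_predictions(model_predictions):
    """Regression: one list of numbers per model; returns the mean prediction per sample."""
    return [mean(sample) for sample in zip(*model_predictions)]

labels = majority_vote([
    ["a", "b", "a"],  # model 1
    ["a", "b", "b"],  # model 2
    ["b", "b", "a"],  # model 3
])
print(labels)
print(average_predictions([[1.0, 2.0], [3.0, 4.0]]))
```

Every model contributes exactly one vote per sample, which is what gives each classifier/regressor an equal say.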

 ![image.png](attachment:image.png)

Another approach to carry out manual ensembling is to pass the outputs of the individual models to a level-2 classifier/regressor as derived meta features, which will decide what weights should be given to each of the model outputs in the final prediction. In this way, the outputs of the individual models are combined with different weightages in the final prediction. This is the high-level approach behind stacking and blending.

![image-2.png](attachment:image-2.png)

Boosting is one of the most popular approaches to ensembling. It can be used with any technique and combines the weak learners into strong learners by creating sequential models such that the final model has higher accuracy than the individual models. The example shown below gives an intuition for how adaptive boosting works.

![image-3.png](attachment:image-3.png)

Bagging creates different training subsets from the sample training data with replacement, and an algorithm with the same set of hyperparameters is built on these different subsets of data. In this way, the same algorithm with a similar set of hyperparameters is exposed to different parts of data, resulting in a slight difference between the individual models. The predictions of these individual models are combined by taking the average of all the values for regression or a majority vote for a classification problem.
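The bagging procedure can be sketched independently of the base algorithm. In the illustration below, `fit` is any function that takes training data and returns a prediction function; the one-nearest-neighbour base learner, the function names and the toy data are assumptions made only to keep the example self-contained.

```python
import random
from collections import Counter

def bagging_fit(X, y, fit, n_models=25, seed=0):
    """Train n_models copies of the same learner, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Classification: combine the individual predictions by majority vote."""
    votes = [model(row) for model in models]
    return Counter(votes).most_common(1)[0][0]

def fit_nearest_neighbour(X, y):
    """Hypothetical base learner: predict the label of the closest training point."""
    def predict(row):
        nearest = min(range(len(X)), key=lambda i: abs(X[i][0] - row[0]))
        return y[nearest]
    return predict

X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = ["a", "a", "a", "b", "b", "b"]
models = bagging_fit(X, y, fit_nearest_neighbour)
print(bagging_predict(models, [0.5]), bagging_predict(models, [11.5]))
```

For a regression problem, the only change would be to average the individual predictions instead of taking a vote.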

 

Bagging works well with high variance algorithms and is easy to parallelise. By high variance, we mean algorithms which change a lot with slight changes in the data as a result of which these algorithms very easily overfit if not controlled. If you recall, decision trees are very prone to overfitting if we don't tune the hyperparameters well. Hence, bagging works very well for high-variance models like decision trees.

 

However, it has some disadvantages as well. In this approach, you cannot really see the individual trees one by one and figure out what is going on behind the ensemble, as it is a combination of n trees working together. This leads to a loss of interpretability. Also, it does not work well when any of the features dominate, because all the trees then look similar and hence the property of diversity in ensembles is lost. Bagging can also be computationally expensive, so whether to apply it depends on the case.

 

So far, you have seen how ensembles handle classification problems. But remember that ensembles work well for regression problems as well. The working is almost the same, except for the aggregation step: the average of the predictions is taken for regression, instead of the majority vote used for classification ensembles.

#### Introduction to Random Forests

Bagging chooses random samples of observations from a data set. Each of these samples is then used to train one tree in the forest. However, keep in mind that bagging is only a sampling technique and is not specific to random forests.

Bootstrapping refers to creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the given data set uniformly and with replacement. A bootstrap sample typically contains about 40–70% of the distinct observations in the data set (roughly 63% when the sample is as large as the data set itself). Aggregation implies combining the results of different models present in the ensemble.
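When a bootstrap sample has the same size as the data set, the expected share of distinct observations in it is $ 1 - (1 - 1/n)^n $, which approaches $ 1 - 1/e \approx 0.632 $ for large n. The short simulation below checks this; the sample size and seed are arbitrary choices.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)

unique_fraction = len(set(sample)) / n
print(f"distinct observations in the sample: {unique_fraction:.1%}")
print(f"left out of the sample:              {1 - unique_fraction:.1%}")
```

The observations that are left out of a given sample are the ones that tree never sees during training.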

A random forest selects **a random sample of data points (bootstrap sample) to build each tree and a random sample of features while splitting a node**. Randomly selecting features ensures that each tree is diverse and that a few prominent features do not dominate all the trees, which would make them somewhat similar.

**Advantages of Blackbox Models Over Tree and Linear Models**

1. Diversity: Diversity arises because each tree is created with a subset of the attributes/features/variables, i.e., not all the attributes are considered while making each tree; the choice of the attributes is random. This ensures that the trees are largely independent of each other.
 
2. Stability: Stability arises because the answers given by a large number of trees average out. A random forest has a lower model variance than an ordinary individual tree.

3. Immunity to the curse of dimensionality: Since each tree does not consider all the features, the feature space (the number of features that a model has to consider) reduces. This makes an algorithm immune to the curse of dimensionality. Also, a large feature space causes computational and complexity issues.
 
4. Parallelization: You need a number of trees to make a forest. Since two trees are independently built on different data and attributes, they can be built separately. This implies that you can make full use of your multi-core CPU to build random forests. Suppose there are 4 cores and 100 trees to be built; each core can build 25 trees to make a forest.

5. Testing/training data and the OOB (out-of-bag) error: You should always avoid violating the fundamental tenet of learning: ‘Not testing a model on what it has been trained on’. While building individual trees, you can choose a random subset of the observations to train them. If you have 10,000 observations, each tree may only be built from 7,000 (70%) randomly chosen observations. The OOB error is the mean prediction error on each training sample xᵢ, using only the trees that do not have xᵢ in their bootstrap sample. This is very similar to a cross-validation (CV) error, where you measure the performance on a subset of data that the model has not seen before.

In fact, it has been shown empirically that an OOB estimate can be as accurate as using a test data set of a size equal to the training set.

Thus, the OOB error removes the need for set-aside test data (though you can still work with test data like you have been doing, at the cost of eating into the training data).
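The OOB calculation can be sketched for a classification forest as follows. The inputs are assumptions made for illustration: `bootstrap_indices[t]` holds the row indices tree t was trained on, and `tree_predictions[t][i]` is tree t's predicted label for row i. The tiny hand-made forest is not real output from any library.

```python
from collections import Counter

def oob_error(y, bootstrap_indices, tree_predictions):
    """Share of rows misclassified by a vote among the trees that never saw them."""
    wrong = scored = 0
    for i, truth in enumerate(y):
        votes = [
            preds[i]
            for in_bag, preds in zip(bootstrap_indices, tree_predictions)
            if i not in in_bag  # only trees for which row i is out-of-bag
        ]
        if not votes:
            continue  # row i was in every bootstrap sample, so it cannot be scored
        scored += 1
        if Counter(votes).most_common(1)[0][0] != truth:
            wrong += 1
    return wrong / scored if scored else float("nan")

y = ["a", "b", "a", "b"]
bags = [[0, 0, 1, 1], [1, 2, 2, 3], [0, 2, 3, 3]]
preds = [
    ["a", "b", "a", "a"],  # tree 0
    ["a", "b", "a", "b"],  # tree 1
    ["a", "a", "a", "b"],  # tree 2
]
print(oob_error(y, bags, preds))
```

Each row is scored only by trees that did not train on it, so no separate test set is touched.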

**Feature Importance in Random Forests**

The importance of a feature in a random forest, sometimes called ‘Gini importance’ or ‘mean decrease impurity’, is defined as the total decrease in node impurity produced by splits on that feature, averaged over all the trees of the ensemble. Each decrease is weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node.

For each variable, the Gini decrease is accumulated every time that variable is chosen to split a node, and these decreases are summed across every tree of the forest. The sum is divided by the number of trees in the forest to give an average.
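The accumulation can be sketched with a hypothetical representation of a fitted forest, in which each tree is a list of split records and the first record is the root. The record fields and the toy numbers are assumptions for illustration. Libraries may also rescale the result, which this sketch does not do.

```python
def mean_decrease_impurity(forest, n_features):
    """Average, over trees, of the weighted impurity decreases credited to each feature."""
    totals = [0.0] * n_features
    for tree in forest:
        n_root = tree[0]["n"]  # assumption: the first record describes the root node
        for node in tree:
            children = (
                node["n_left"] * node["imp_left"] + node["n_right"] * node["imp_right"]
            ) / node["n"]
            decrease = node["impurity"] - children
            # Weight by the proportion of samples that reach this node.
            totals[node["feature"]] += node["n"] / n_root * decrease
    return [total / len(forest) for total in totals]

forest = [
    [  # tree 1: splits on feature 0 at the root, then on feature 1
        {"feature": 0, "n": 10, "impurity": 0.48,
         "n_left": 5, "imp_left": 0.0, "n_right": 5, "imp_right": 0.32},
        {"feature": 1, "n": 5, "impurity": 0.32,
         "n_left": 4, "imp_left": 0.0, "n_right": 1, "imp_right": 0.0},
    ],
    [  # tree 2: a single split on feature 1
        {"feature": 1, "n": 10, "impurity": 0.48,
         "n_left": 6, "imp_left": 0.0, "n_right": 4, "imp_right": 0.0},
    ],
]
print(mean_decrease_impurity(forest, n_features=2))
```

In this toy forest, feature 1 ends up with twice the importance of feature 0, because it removes all the impurity at the root of the second tree.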