### <span style="color:black"><u>**Decision Tree Regression**</u><a name="DecisionTree"></a></span>


**Intuition**

* [Decision Tree](https://gdcoder.com/decision-tree-regressor-explained-in-depth/) regression is a type of predictive model that uses recursive splitting to make decisions based on binary conditions. It typically uses mean squared error (or, less commonly, mean absolute error) as the metric for finding the optimal split at each node
* Because of this, decision tree regression can handle data that is non-linear in nature
* The decision tree has a root node (at the top of the tree), internal nodes (with arrows pointing to them and away from them) and leaf nodes (aka terminal nodes, which only have arrows pointing to them)
* Features don't need to be scaled, unlike previously seen algorithms such as Ridge, Lasso and ElasticNet
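
To illustrate the point about scaling, here is a minimal sketch on synthetic data (the data, seed and `max_depth=3` are assumptions made for illustration, not part of the life expectancy dataset). It fits one tree on raw features and one on standardised features and checks whether the predictions agree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (illustrative data only)
X = np.column_stack([rng.uniform(0, 1, 150), rng.uniform(0, 1000, 150)])
y = 3 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.1, 150)

X_scaled = StandardScaler().fit_transform(X)

# Same settings and seed, only the feature scale differs
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

# Splits depend on the ordering of feature values, which scaling preserves
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```

The cutoff values themselves differ between the two trees, but they partition the training observations in the same way, so the leaf means match.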

**The Algorithm**

----

Credit: An Introduction to Statistical Learning: With Applications in R
* The general idea of decision tree regression is that we want to divide the feature space, that is, the set of all possible values for $X_1, X_2, ..., X_p$, into a total of $J$ distinct and non-overlapping regions $R_1, R_2, ..., R_J$
* If multiple observations fall into a single region $R_j$ that represents a leaf node, the model prediction will simply be the mean of all the training observations in $R_j$
* The regions that divide the predictor space are high dimensional rectangles, or boxes, which keeps the model easy to interpret
* Ultimately, the goal behind decision tree regression is to find the boxes $R_1, R_2, ..., R_J$ that minimise the residual sum of squares, given by

$$\sum_{j=1}^{J}\sum_{i \in R_j}(y_i - \hat{y}_{R_j})^2$$

in which the $\hat{y}_{R_j}$ term is the mean response of all the training observations that fall into the $j^{th}$ box.
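
As a rough check of the formula, the sketch below fits a small tree on synthetic data (an assumption for illustration) and computes the residual sum of squares region by region, using `apply` to find the leaf each observation falls into. It then compares the result with the RSS computed directly from the tree's predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))            # synthetic predictors
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, 200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

leaf_ids = tree.apply(X)    # index of the leaf (region R_j) for each observation
rss = 0.0
for leaf in np.unique(leaf_ids):
    in_region = leaf_ids == leaf
    region_mean = y[in_region].mean()            # the \hat{y}_{R_j} term
    rss += np.sum((y[in_region] - region_mean) ** 2)

# The tree predicts the leaf mean, so both routes should give the same RSS
rss_from_predictions = np.sum((y - tree.predict(X)) ** 2)
print(rss, rss_from_predictions)
```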

* To begin the process of recursive binary splitting, we sweep through all predictors $X_1, X_2, ..., X_p$ in our predictor space. For each $X_j$ we look for a cutoff point $s$ such that segmenting the predictor space into the two regions $\{X|X_j < s\}$ and $\{X|X_j \geq s\}$ results in the largest possible reduction in the residual sum of squares. These all become candidates for the root node
* The predictor and cutoff combination ($X_j$, $s$) that results in the lowest residual sum of squares is then chosen as the root node
* To determine the remaining nodes, we repeat the process within each resulting region: we keep choosing the feature and cutoff $s$ that minimise the mean squared error among the observations that satisfy the conditions of the nodes above
* Once we have reached a leaf node, the output is just the average of all the data points at that particular leaf node
* Once we have built our tree, we can run a new observation down the tree to get an output value, formally represented as

$$f(X) = \sum_{m=1}^M c_m 1_{(X \in R_m)}$$

whereby we segment our feature space into regions $R_1, ..., R_M$, $c_m$ is the mean response of the training observations in $R_m$, and the prediction is based on the leaf node our observation lands in.
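
The search for a single split can be sketched in plain NumPy. This is a simplified illustration, not how Scikit-Learn implements it: the helper names `rss`, `best_split` and `predict_one` are hypothetical, the toy data is made up, and candidate cutoffs are taken as midpoints between consecutive distinct feature values.

```python
import numpy as np

def rss(y):
    """Residual sum of squares of y around its own mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Return (feature index j, cutoff s, total RSS) of the best single split."""
    best = (None, None, rss(y))
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutoffs: midpoints between consecutive distinct values
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            total = rss(y[left]) + rss(y[~left])
            if total < best[2]:
                best = (j, s, total)
    return best

def predict_one(x, j, s, X, y):
    """Predict with a one-split tree: the mean of the region x falls into."""
    left = X[:, j] < s
    return y[left].mean() if x[j] < s else y[~left].mean()

# Toy data: the response jumps when the first feature crosses roughly 6.5
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0],
              [10.0, 4.0], [11.0, 7.0], [12.0, 6.0]])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

j, s, total = best_split(X, y)
low_pred = predict_one(np.array([2.5, 0.0]), j, s, X, y)
print(j, s, total, low_pred)
```

A full tree would call `best_split` again on each of the two resulting regions until a stopping rule is met.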

---
#### <span style="color:black"><b><u>Hyperparameters for Decision Tree Models</u></b></span>

* To find the ideal tree that aims to have low bias and low variance we can use cross validation techniques such as grid search and randomised search built into the Scikit-Learn library 
* Here we can try many hyperparameter combos and choose the one that had the best cross validated $R^2$ score. We can go for other metrics too.
* Some hyperparameters have been listed below:

**min_samples_split**
* The `min_samples_split` hyperparameter tells our decision tree model the minimum number of samples a node must contain before it can be split again
* For instance, if we set this parameter to 10 and a node ends up holding only 3 observations, that node cannot be split any further and becomes a leaf node
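
A small sketch on synthetic data (assumed for illustration) shows the effect: after fitting with `min_samples_split=10`, every node that was actually split holds at least 10 samples. The node sizes are read from the fitted `tree_` object, where leaves are marked by `children_left == -1`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))            # synthetic predictor
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)

tree = DecisionTreeRegressor(min_samples_split=10, random_state=0).fit(X, y)

t = tree.tree_
is_internal = t.children_left != -1              # leaves have children_left == -1
internal_sizes = t.n_node_samples[is_internal]   # samples at nodes that were split
leaf_sizes = t.n_node_samples[~is_internal]      # samples at terminal nodes
print(internal_sizes.min(), leaf_sizes.max())
```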

**min_samples_leaf**
* In tree models, the leaf node (terminal node) is the node where there are no arrows pointing away from it. No further splitting can be done
* In other words, it is the base of the tree that makes the final decision about an observation in our data
* So this parameter tells the model how many observations in a particular leaf node we would like to have as a minimum. 
* For example, setting this parameter to 5 would mean that we can't have a situation where only 2 observations are part of a particular leaf node
* Though it must be noted that sometimes the rules set on `min_samples_split` and `min_samples_leaf` can't both be adhered to. 
* For instance if we set the `min_samples_split` parameter to 10, we are telling our model that we need at least 10 samples to potentially do another split
* But if we have say 11 observations at an internal node and then set `min_samples_leaf` to 8, any split, such as (7 and 4) or (6 and 5), would violate the rule we placed on `min_samples_leaf`, because at least one child would hold fewer than 8 observations
* So while we had enough samples to make the split, there would be an insufficient number of observations at a resulting leaf node
* In these situations, the split will not be performed
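
The 11-observation example can be reproduced with a toy dataset (the values are made up for illustration). With `min_samples_leaf=8` no split is possible, while relaxing it to 5 allows a (5 and 6) split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(11, dtype=float).reshape(-1, 1)    # 11 observations
y = np.array([1, 2, 1, 2, 1, 9, 8, 9, 8, 9, 8], dtype=float)

# 11 >= min_samples_split, but no split can leave 8 observations on both sides
blocked = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=8).fit(X, y)

# With min_samples_leaf=5 a (5 and 6) split satisfies both rules
allowed = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5).fit(X, y)

print(blocked.get_n_leaves(), allowed.get_n_leaves())
```

In the blocked case the tree is a single node, so every prediction is just the mean of `y`.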

**max_depth**
* The `max_depth` parameter is a hyperparameter that allows us to determine how deep we want our tree to be
* If we create a very large and complex tree that is very deep, we are likely to get excellent results on the training set
* However, the model will have a hard time generalising to out of sample data, and will likely perform poorly on the testing set as it has learned too much of the noise that exists in the training data, without adequately capturing the underlying signal
* This results in a model that is overfit to the training data, which is ultimately one of the drawbacks of decision trees
* Though it is important to also consider the case where the tree isn't deep enough and isn't given the opportunity to recognise the relationships in the data 
* This results in the model being underfit, that is, poor accuracy on both the training and testing sets because the model has high bias
* If we want to display our tree and interpret it, we would opt for a lower `max_depth`
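
The trade-off can be explored with a quick sketch on synthetic data (a noisy sine curve, assumed for illustration). It records the training and testing $R^2$ for a few depths, where `None` lets the tree grow without a depth limit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.4, 400)    # signal plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for depth in [1, 3, 6, None]:    # None: no depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (training R^2, testing R^2) for this depth
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, tree.get_depth(), scores[depth])
```

The unrestricted tree fits the training set almost perfectly, and the gap between its training and testing scores is the overfitting described above.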

**max_features**
* This represents how many predictor variables we would like to take into account when making decisions on particular nodes (e.g. selecting a root node)
* We can use the square root of the number of features, the log in base 2 or just the total number of features. Fractions can be done too
* Randomly selecting features (ie not using all of them at once) forms a large part of random forest models, which I'll go into after first fitting a decision tree model to the data
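
To see how each option translates into a number of features, the sketch below fits trees on 16 synthetic features (an assumption for illustration) and reads the fitted `max_features_` attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 16))                   # 16 synthetic features
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

resolved = {}
for option in ["sqrt", "log2", 0.5, 3, None]:
    tree = DecisionTreeRegressor(max_features=option, random_state=0).fit(X, y)
    resolved[option] = tree.max_features_        # features considered at each split
print(resolved)
```

With 16 features, the square root and log base 2 options both resolve to 4, the fraction 0.5 to 8, and `None` to all 16.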

**criterion**
* The `criterion` hyperparameter in decision tree regression models is the metric we want our regressor to use when determining our nodes
* Popular options include mean squared error, where splits are chosen to minimise the average of the squared residuals
* Mean absolute error is also sometimes used, where splits are chosen to minimise the average of the absolute residuals
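
The difference between the two criteria is easiest to see on a single node. The sketch below uses made-up values with one outlier and plain NumPy rather than Scikit-Learn. It assumes, as I understand the usual convention, that a node scored by squared error predicts the mean while a node scored by absolute error predicts the median.

```python
import numpy as np

y_node = np.array([70.0, 71.0, 72.0, 73.0, 40.0])   # one outlier at 40

mean_pred = y_node.mean()            # value that minimises squared error
median_pred = np.median(y_node)      # value that minimises absolute error

mse_impurity = np.mean((y_node - mean_pred) ** 2)
mae_impurity = np.mean(np.abs(y_node - median_pred))
print(mean_pred, median_pred, mse_impurity, mae_impurity)
```

The outlier drags the mean well below the bulk of the data, whereas the median stays at 71, which is why absolute error is less sensitive to extreme observations.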




In [None]:
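import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Create our parameter grid
# This will then help us find the best hyperparameters for our model based off the cross validated R^2 score
param_grid = dict()
param_grid['min_samples_split'] = np.arange(10, 40, 1)
param_grid['min_samples_leaf'] = np.arange(3, 30, 1)
param_grid['max_depth'] = np.arange(3, 12, 1)
# scikit-learn >= 1.0 names these "squared_error" and "absolute_error"; older releases used "mse" and "mae"
param_grid['criterion'] = ["squared_error", "absolute_error"]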

dt_regressor = DecisionTreeRegressor()

# Try 200 different combinations and use 10 fold cross validation on each, hence 2000 different tree models are built
dt_random = RandomizedSearchCV(estimator = dt_regressor, 
                               param_distributions = param_grid, 
                               n_iter = 200, 
                               cv = 10, 
                               verbose = 2, 
                               random_state = 3, 
                               n_jobs = -1)

dt_random.fit(X_train, y_train)

In [None]:
# Best parameters
print(f"Best Parameters: {dt_random.best_params_}")

# Best R^2 score
print(f"Best Score: {dt_random.best_score_}")

So after evaluating 200 candidate combinations (2,000 fits in total with 10 fold cross validation), the parameter combination that resulted in the highest cross validated $R^2$ score was:

* min_samples_split = 10
* min_samples_leaf = 3
* max_depth = 9
* criterion = mse (mean squared error)
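
As an aside, the fitted search object can hand back the winning model directly, and the metric can be swapped through the `scoring` argument. The sketch below runs on synthetic data with a smaller search (10 candidates, 5 folds) purely to keep it quick. The data and these settings are assumptions, not the ones used above.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 3))            # synthetic predictors
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 1.0, 200)

param_grid = {
    "min_samples_split": np.arange(10, 40),
    "min_samples_leaf": np.arange(3, 30),
    "max_depth": np.arange(3, 12),
}
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    scoring="neg_mean_squared_error",   # another metric instead of the default R^2
    random_state=3,
)
search.fit(X, y)

# With the default refit=True, the best candidate is refit on all of X, y
best_tree = search.best_estimator_
print(search.best_params_, search.best_score_)
```

Using `best_estimator_` avoids retyping the winning parameters by hand, at the cost of making the chosen values less visible in the notebook.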

In [None]:
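# Use these parameters for our decision tree model
# ('squared_error' is the scikit-learn >= 1.0 name for the criterion called 'mse' in older releases)
dt_final = DecisionTreeRegressor(min_samples_split = 10, 
                                 min_samples_leaf = 3, 
                                 max_depth = 9, 
                                 criterion = 'squared_error')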
dt_final.fit(X_train, y_train)

In [None]:
# Make a Prediction

# Year: 2060
# Country Infant Mortality: 0.03
# Health Expenditure %: 76
# Employment to population ratio (15+): 78.9
# Developed Country
# Mean Schooling of 12 years
# 99.99% have access to electricity
# GDP Per Capita: 100,028

print("Predicted life expectancy:")
dt_final.predict([[2060, 0.03, 76, 78.9, 1, 12, 99.99, np.log(100_028)]])[0]

How well did it perform on test data that has not been seen by the model?

In [None]:
print(f"R^2 on the testing set = {dt_final.score(X_test, y_test)}")

* So we have found a model that has performed really well on data it had never seen before
* Essentially, the predictor variables in our model explain 95.6% of the variation of the Life expectancy variable around its mean in the testing set (an $R^2$ of 0.956)
* So far this has outperformed all of our linear models in terms of $R^2$
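
For reference, the $R^2$ returned by `score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks this by hand on synthetic data (assumed for illustration, so the value will differ from the one reported above).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(300, 2))            # synthetic predictors
y = 2 * X[:, 0] + np.cos(X[:, 1]) + rng.normal(0, 0.5, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
pred = tree.predict(X_te)

ss_res = np.sum((y_te - pred) ** 2)              # sum of squares left unexplained
ss_tot = np.sum((y_te - y_te.mean()) ** 2)       # sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, tree.score(X_te, y_te))
```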

-----
**Conclusion**

- What is nice about decision tree regression models is that we do not need to adhere to the Gauss-Markov assumptions we had in OLS regression, as we are now modelling the phenomenon in a non-parametric way
- However, decision trees are notorious for overfitting the training dataset (especially if `max_depth` is set to a very high number), which means we run the risk of creating a model with low bias but high variance: it fits the training set closely yet performs poorly on the testing set. In our case, though, the coefficient of determination on the testing set was high, so our tree was able to predict out of sample data well.
- Though perhaps the biggest disadvantage of the decision trees is the fact that [they can't extrapolate](http://freerangestats.info/blog/2016/12/10/extrapolation) the same way that linear models or a neural network can. 
- For example, if the final split hypothetically was whether 'ln(GDP_cap)' > 9, an observation with ln(GDP_cap) = 9.01 and another with ln(GDP_cap) = 12 will be put into the exact same leaf, despite their GDP/Capita differing by $154,570.27. Such a large increase in GDP/Capita might actually have a big impact on life expectancy, but the model has not recognised that
- I'll save this model for future use, though bagging and boosting techniques will likely be a better option moving forward
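
The extrapolation issue can be seen in a small sketch on a noiseless linear trend (synthetic data, assumed for illustration). The tree predicts the same value for every input beyond the training range, while a linear model keeps following the trend. The last lines reproduce the GDP/Capita gap quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 101).reshape(-1, 1)       # training range is [0, 10]
y = 2 * X[:, 0]                                  # noiseless linear trend

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
line = LinearRegression().fit(X, y)

X_new = np.array([[10.0], [20.0], [50.0]])       # at and beyond the training range
tree_pred = tree.predict(X_new)                  # all land in the right-most leaf
line_pred = line.predict(X_new)                  # continues the trend
print(tree_pred, line_pred)

# Gap in GDP/Capita between ln(GDP_cap) = 12 and ln(GDP_cap) = 9.01
gdp_gap = np.exp(12) - np.exp(9.01)
print(round(gdp_gap, 2))
```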

In [None]:
# Make a prediction

# Year: 2047
# Country Infant Mortality: 6.48
# Health Expenditure %: 56
# Employment to population ratio (15+): 48.71
# Developed Country
# Mean Schooling of 12 years
# 99.34% have access to electricity
# GDP Per Capita: 62,124

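# Assuming the model was trained on ln(GDP per capita), as in the earlier prediction cell, we log-transform GDP here too
print(f"Predicted life expectancy = {dt_final.predict([[2047, 6.48, 56, 48.71, 1, 12, 99.34, np.log(62_124)]])[0]}")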

# Save our model
joblib.dump(dt_final, './models/decisiontree.joblib')
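
Loading the model back later is the mirror image of saving it. The sketch below does a round trip with a small stand-in model and a temporary directory (both assumptions for illustration, rather than `dt_final` and `./models/`) and checks that the restored tree predicts identically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 2))            # synthetic predictors
y = X[:, 0] + X[:, 1]
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "decisiontree.joblib")
    joblib.dump(model, path)                     # save to disk
    restored = joblib.load(path)                 # load it back

same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Note that `joblib.dump` expects the target directory to exist already, so `./models/` would need to be created before the save cell runs.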