# DECISION TREE

- is another method we can use for finding the relationship between a target and one or more predictors
- decision trees can be used for both categorical and continuous targets, so they cover both classification and regression tasks (today we'll be focusing on regression trees)
- the main idea is to create a tree of decisions that best partitions the data
- creating a tree involves deciding which features to split on, what conditions to use for each split, and when to stop splitting
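
To make the idea of "best partition" concrete, here is a minimal sketch of how a single split could be chosen for one numeric feature. It assumes the split criterion is the residual sum of squares around each side's mean, which is a common choice for regression trees. The helper names `rss` and `best_split` and the toy data are made up for illustration and are not part of this notebook or of scikit-learn.

```python
def rss(values):
    """Residual sum of squares of `values` around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_rss) for the best rule `x <= threshold`."""
    pairs = sorted(zip(x, y))
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        # candidate threshold: midpoint between two neighbouring feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss


# Toy data (hypothetical): the target jumps between x = 3 and x = 4.
x_toy = [1, 2, 3, 4, 5, 6]
y_toy = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]
threshold, split_rss = best_split(x_toy, y_toy)
print(threshold, round(split_rss, 2))
```

A full tree repeats this search over every feature, applies the best split, and then recurses into each side until a stopping rule such as a maximum depth is reached.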



![title](pictures/decision_tree.png)






In [1]:
import pandas as pd
import numpy as np
import json
import graphviz
import matplotlib.pyplot as plt
from sklearn import tree

pd.set_option("display.max_rows",6)

%matplotlib inline

In [2]:
df_data = pd.read_csv('varsom_ml_preproc.csv', index_col=0)



# RANDOM FORESTS

- use the bagging (bootstrap aggregating) algorithm

- **bagging** is an ensemble learning method where we build each model using the same algorithm, but train each learner on a different bootstrap sample of the data (a sample drawn with replacement)
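
The bagging idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the base learner is a depth-1 regression tree (a "stump") on a single feature, and the helper names `fit_stump` and `bagged_stumps` and the toy data are hypothetical. A real random forest also samples a random subset of features at each split, which is left out here.

```python
import random


def fit_stump(x, y):
    """Fit a depth-1 regression tree on one feature; return a predict function."""
    pairs = sorted(zip(x, y))
    overall_mean = sum(y) / len(y)
    best = (float("inf"), None, overall_mean, overall_mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if err < best[0]:
            best = (err, (pairs[i - 1][0] + pairs[i][0]) / 2, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # the sample had a single distinct x value
        return lambda value: overall_mean
    return lambda value: left_mean if value <= threshold else right_mean


def bagged_stumps(x, y, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    n = len(x)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda value: sum(m(value) for m in models) / len(models)


# Toy data (hypothetical): low targets for x <= 4, high targets for x >= 5.
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [10.0, 11.0, 10.5, 9.5, 20.0, 21.0, 19.5, 20.5]
predict = bagged_stumps(x_toy, y_toy)
print(predict(2), predict(7))
```

Because every learner sees a slightly different sample, their individual errors differ, and averaging them tends to reduce the variance of the final prediction.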


# GRADIENT BOOSTED TREES

- use a boosting algorithm (gradient boosting; AdaBoost is another well-known member of the boosting family)

- **boosting** is an ensemble method that builds the learners one after another, with each new learner focusing on the areas where the current ensemble is not performing well.
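
A minimal sketch of boosting for regression, assuming squared-error loss: each new learner is fitted to the residuals left by the ensemble so far, and its prediction is added with a small learning rate. This is the residual-fitting form used by gradient boosting, not AdaBoost's sample-reweighting scheme. The helper names `fit_stump` and `boosted_stumps`, the parameter defaults, and the toy data are hypothetical.

```python
def fit_stump(x, y):
    """Fit a depth-1 regression tree on one feature; return a predict function."""
    pairs = sorted(zip(x, y))
    overall_mean = sum(y) / len(y)
    best = (float("inf"), None, overall_mean, overall_mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if err < best[0]:
            best = (err, (pairs[i - 1][0] + pairs[i][0]) / 2, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # the data had a single distinct x value
        return lambda value: overall_mean
    return lambda value: left_mean if value <= threshold else right_mean


def boosted_stumps(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    residuals = [t - base for t in y]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # the next learner only sees what the ensemble still gets wrong
        residuals = [r - learning_rate * stump(v) for v, r in zip(x, residuals)]
    return lambda value: base + learning_rate * sum(s(value) for s in stumps)


# Toy data (hypothetical): low targets for x <= 4, high targets for x >= 5.
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [10.0, 11.0, 10.5, 9.5, 20.0, 21.0, 19.5, 20.5]
predict = boosted_stumps(x_toy, y_toy)

mean_y = sum(y_toy) / len(y_toy)
rss_mean = sum((t - mean_y) ** 2 for t in y_toy)
rss_boosted = sum((t - predict(v)) ** 2 for v, t in zip(x_toy, y_toy))
print(round(rss_mean, 2), round(rss_boosted, 2))
```

The learning rate trades speed for stability: smaller values need more rounds but make each individual learner less able to overfit.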



### We can start by creating one decision tree of depth 3 using all features 

In [None]:
from sklearn.tree import DecisionTreeRegressor

dec_tree = DecisionTreeRegressor(random_state=222, max_depth = 3)

dec_tree.fit(X_train, y_train) # we're using the same data as in last linear model

predictions_dt = dec_tree.predict(X_test)

In [None]:
# to visualize the tree we need the graphviz package, which must be installed first

import sys
!conda install --yes --prefix {sys.prefix} python-graphviz

In [None]:
# Visualize the tree

from sklearn import tree
import graphviz 

dot_data = tree.export_graphviz(dec_tree, out_file=None, 
                         feature_names=boston_data.drop('MEDV', axis=1).columns,   
                         filled=True, rounded=True,  
                         special_characters=True)  

graph = graphviz.Source(dot_data)  
graph 

Again, we can see that the first splits are on RM and LSTAT, which suggests that these are the most important variables.



We need to evaluate our model:

In [None]:
print("RSS for decision tree model is {0:.2f}".format(RSS(y_test, predictions_dt)))

In [None]:
print('Decision tree R^2: %.4f' % dec_tree.score(X_test, y_test)) 
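
For reference, the two metrics printed above can be sketched in plain Python. This assumes that the notebook's `RSS` helper (defined earlier, not shown here) is the usual sum of squared residuals. The `score` method of scikit-learn regressors returns the coefficient of determination $R^{2}$. The lowercase names `rss` and `r_squared` and the toy numbers are hypothetical.

```python
def rss(y_true, y_pred):
    """Residual sum of squares: sum of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, where TSS is the squared deviation from the mean."""
    mean = sum(y_true) / len(y_true)
    tss = sum((t - mean) ** 2 for t in y_true)
    return 1 - rss(y_true, y_pred) / tss


# Toy values (hypothetical), just to show the two metrics side by side.
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 8.0]
print(rss(y_true_toy, y_pred_toy), r_squared(y_true_toy, y_pred_toy))
```

RSS depends on the scale of the target and the number of test rows, so it is only comparable between models evaluated on the same test set. $R^{2}$ is normalized, with 1 meaning a perfect fit and 0 meaning no better than predicting the mean.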

So, the linear model is performing better than the decision tree model.

But we selected a tree of depth 3. Could we get a better model by choosing a different depth?

In [None]:
def RSS_new(f, y, X):
    """Residual sum of squares of fitted model f on features X and target y."""
    return np.sum((y - f.predict(X))**2)

depths = range(1, 10)

# fit one tree per candidate depth and compute its RSS on the test set
tree_models = [DecisionTreeRegressor(random_state=222, max_depth=d).fit(X_train, y_train) for d in depths]
tree_RSS = [RSS_new(f, y_test, X_test) for f in tree_models]


plt.plot(depths, tree_RSS, color = 'red')
plt.xlabel('Tree depth')
plt.ylabel('RSS')

In [None]:
# so let's create a tree with depth = 6

dec_tree = DecisionTreeRegressor(random_state=222, max_depth = 6)

dec_tree.fit(X_train, y_train) # we're using the same data as in last linear model

predictions_dt = dec_tree.predict(X_test)

In [None]:
# Visualize the tree

from sklearn import tree
import graphviz 

dot_data = tree.export_graphviz(dec_tree, out_file=None, 
                         feature_names=boston_data.drop('MEDV', axis=1).columns,   
                         filled=True, rounded=True,  
                         special_characters=True)  

graph = graphviz.Source(dot_data)  
graph 

In [None]:
print("RSS for decision tree model is {0:.2f}".format(RSS(y_test, predictions_dt)))

In [None]:
print('Decision tree R^2: %.4f' % dec_tree.score(X_test, y_test)) 

Now we see a slight improvement in both RSS and $R^{2}$, but does such a small improvement justify using a much more complex model?


Maybe we'll get better results with **random forests** and/or **gradient boosted trees**.

### Random forest 

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state = 422, max_depth = 6)
forest_reg.fit(X_train, y_train)

predictions_rf = forest_reg.predict(X_test)

In [None]:
print("RSS for random forest model is {0:.2f}".format(RSS(y_test, predictions_rf)))

In [None]:
print('Random forest R^2: %.4f' % forest_reg.score(X_test, y_test)) 

### Gradient boosting

In [None]:
# create a gradient boosting regressor with random state 222 and max depth 6 

# Print RSS and R^2 for your model


In [None]:
# solution

from sklearn.ensemble import GradientBoostingRegressor

grad_boost = GradientBoostingRegressor(random_state = 222, max_depth = 6)

grad_boost.fit(X_train, y_train)

predictions_gb = grad_boost.predict(X_test)

print("RSS for gradient boosted tree model is {0:.2f}".format(RSS(y_test, predictions_gb)))
print()
print('Gradient boosted tree R^2: %.4f' % grad_boost.score(X_test, y_test)) 

Again, there is a very small difference between random forests and boosted trees.

We can use the random forest model.

The last thing we can check is the importance of the variables: if some of the features are not as useful as the others in explaining the variability in our target variable, we can exclude them in order to simplify our model.

In [None]:
# take the labels from the training data so that they line up with feature_importances_
# (assumes X_train is a pandas DataFrame; the target MEDV is not a feature and must not appear here)
feature_labels = np.array(X_train.columns)
importance = forest_reg.feature_importances_
feature_indexes_by_importance = importance.argsort()  # ascending: least important first
for index in feature_indexes_by_importance:
    print('{}-{:.2f}%'.format(feature_labels[index], (importance[index] *100.0)))

In [None]:
# excluding the variables whose importance is less than 1%
from sklearn.model_selection import train_test_split

X = boston_data[['CRIM', 'RM', 'TAX', 'PTRATIO', 'LSTAT', 'AGE', 'B', 'INDUS']]
y = boston_data["MEDV"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 222)

In [None]:
grad_boost = GradientBoostingRegressor(random_state = 222, max_depth = 6)

grad_boost.fit(X_train, y_train)

predictions_gb = grad_boost.predict(X_test)

In [None]:
print("RSS for gradient boosted tree model is {0:.2f}".format(RSS(y_test, predictions_gb)))

In [None]:
print('Gradient boosted tree R^2: %.4f' % grad_boost.score(X_test, y_test)) 