# Decision Tree Regression

## Purpose of this notebook

This notebook uses a small fictitious dataset to explain how decision trees work.
The focus is on the **visualization of the regression tree with regard to the data points and the tree structure**. The model clearly overfits and no train-test split is performed, but both are irrelevant for this purpose.

Feel free to alter some hyperparameters and try new ones to see how they affect the tree structure. The available options are listed in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).

## Import the required libraries

In [None]:
# Import packages
import numpy as np  
import pandas as pd 
import matplotlib.pyplot as plt 

## The Data

We will create a fictitious dataset about video games that contains the type of game, the production cost, and the profit for 14 games.

In [None]:
# Data description: Name, Production Cost, Profit.
dataset = np.array( 
[['Asset Flip', 100, 1000], 
['Text Based', 500, 3000], 
['Visual Novel', 1500, 5000], 
['2D Pixel Art', 3500, 8000], 
['2D Vector Art', 5000, 6500], 
['Strategy', 6000, 7000], 
['First Person Shooter', 8000, 15000], 
['Simulator', 9500, 20000], 
['Racing', 12000, 21000], 
['RPG', 14000, 25000],  
['Sandbox', 15500, 27000], 
['Open-World', 16500, 30000], 
['MMOFPS', 25000, 52000], 
['MMORPG', 30000, 80000] 
]) 

# print the dataset 
print(dataset)

## Define feature

We will select the second column (production cost) as the feature we will feed into our Decision Tree.

In [None]:
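# select all rows from the production cost column; the 1:2 slice keeps X two-dimensional
# (NumPy stores the mixed-type dataset as strings, so convert the values to int)
X = dataset[:, 1:2].astype(int)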
  
# print X 
print(X) 

## Define the target

We will define the elements in the third column (profit) as our target variable.

In [None]:
# select all rows from the profit column to represent the target values
y = dataset[:, 2].astype(int)
  
# print y 
print(y) 
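
The two selections above differ in one detail that matters to scikit-learn: the `1:2` slice keeps the feature as a two-dimensional column, while the plain index `2` yields a one-dimensional vector. The following minimal sketch illustrates this on a three-row excerpt and also shows why `.astype(int)` is needed. The names `sample`, `X_sample`, and `y_sample` are made up for this illustration.

```python
import numpy as np

# three-row excerpt of the dataset, for illustration only
sample = np.array(
    [['Asset Flip', 100, 1000],
     ['Text Based', 500, 3000],
     ['Visual Novel', 1500, 5000]])

# mixing strings and numbers makes NumPy store every entry as a string
print(sample.dtype.kind)  # 'U' denotes a unicode string dtype

# slicing with 1:2 keeps two dimensions (rows x 1 column) for the feature
X_sample = sample[:, 1:2].astype(int)

# indexing with a single integer drops the column axis for the target
y_sample = sample[:, 2].astype(int)

print(X_sample.shape, y_sample.shape)
```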

## Modelling

We will train our `DecisionTreeRegressor` on the complete dataset. In a second step we will use the fitted model to predict the profit for a new instance. 

In [None]:
# import the regressor 
from sklearn.tree import DecisionTreeRegressor  
  
# create a regressor object 
dec_tree = DecisionTreeRegressor()  
  
# fit the regressor with X and y data
dec_tree.fit(X, y) 

You can try changing some parameters of the decision tree created above, such as `max_depth` or `min_samples_leaf`, to see how they affect the results below. Note that `max_features` has no effect here, because the model has only a single feature.
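
As a starting point for such experiments, the sketch below refits the tree with a few `max_depth` values and reports the resulting number of leaves and the depth. It repeats the cost and profit values from the dataset so that it runs on its own. The names `costs`, `profits`, and `leaf_counts` are chosen for this illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# production cost and profit values copied from the dataset above
costs = np.array([100, 500, 1500, 3500, 5000, 6000, 8000, 9500,
                  12000, 14000, 15500, 16500, 25000, 30000]).reshape(-1, 1)
profits = np.array([1000, 3000, 5000, 8000, 6500, 7000, 15000, 20000,
                    21000, 25000, 27000, 30000, 52000, 80000])

leaf_counts = {}
for depth in [1, 2, 3, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth).fit(costs, profits)
    leaf_counts[depth] = tree.get_n_leaves()
    print(f"max_depth={depth}: {tree.get_n_leaves()} leaves, "
          f"depth {tree.get_depth()}")
```

A shallower tree groups several games into one leaf and predicts their mean profit, which leads to fewer and wider steps in the prediction plot.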

In [None]:
# predicting a new value
# test the output by changing values, like 3750
y_pred = dec_tree.predict(np.array([[3750]]))

# print the predicted profit (predict returns an array, so take its first element)
print(f"Predicted Profit: {y_pred[0]:.0f}\n")
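
To see where such a prediction comes from, the sketch below looks up the leaf that a new production cost falls into and lists the training games that share it. It assumes the same cost and profit values as the dataset, repeated here under the illustrative names `costs` and `profits`, and relies on the estimator's `apply()` method, which returns the leaf index for each sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# production cost and profit values copied from the dataset above
costs = np.array([100, 500, 1500, 3500, 5000, 6000, 8000, 9500,
                  12000, 14000, 15500, 16500, 25000, 30000]).reshape(-1, 1)
profits = np.array([1000, 3000, 5000, 8000, 6500, 7000, 15000, 20000,
                    21000, 25000, 27000, 30000, 52000, 80000])

tree = DecisionTreeRegressor().fit(costs, profits)

# leaf index of the new instance and of every training instance
new_cost = np.array([[3750]])
leaf_of_new = tree.apply(new_cost)[0]
leaf_of_training = tree.apply(costs)

# training games that ended up in the same leaf as the new instance
same_leaf = leaf_of_training == leaf_of_new
print("Training costs in that leaf:", costs[same_leaf].ravel())
print("Their profits:", profits[same_leaf])
print("Prediction:", tree.predict(new_cost)[0])
```

The prediction is the mean profit of the training games in that leaf. In a fully grown tree the leaf holds a single game, so the prediction equals that game's profit.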

## Visualising the result

We can visualise our decision tree and the results in various ways. The first plot shows the profit our model predicts for each production cost.

In [None]:
# use np.arange to create a range of values from the min value of X to the max value of X
# with a difference of 0.01 between two consecutive values
# (X.min() and X.max() return scalars, unlike the built-in min/max on a 2-D array)
X_grid = np.arange(X.min(), X.max(), 0.01)

# use .reshape to turn X_grid into a len(X_grid) x 1 array, i.e. to make
# a column out of the X_grid values
X_grid = X_grid.reshape(-1, 1)
  
# scatter plot for original data 
plt.scatter(X, y, color = 'red') 
  
# plot predicted data 
plt.plot(X_grid, dec_tree.predict(X_grid), color = 'blue')  
  
# specify title and labels
plt.title('Profit to Production Cost (Decision Tree Regression)')  
plt.xlabel('Production Cost') 
plt.ylabel('Profit') 
  
# show the plot 
plt.show() 

In this plot, we have *Profit* on the vertical axis and *Production Cost* on the horizontal axis. The plot jumps up (and occasionally down) in steps, and remains horizontal in between steps. In other words, the plot is a collection of horizontal line segments connected by vertical line segments. Because the tree is fully grown, each horizontal segment passes through exactly one data point.
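
The positions of the vertical jumps can be read from the fitted tree itself. The sketch below collects the split thresholds from the estimator's `tree_` attribute and compares them with the midpoints between neighbouring production costs, which is where a fully grown tree on this data is expected to place its splits. The data is repeated under the illustrative names `costs` and `profits` so that the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# production cost and profit values copied from the dataset above
costs = np.array([100, 500, 1500, 3500, 5000, 6000, 8000, 9500,
                  12000, 14000, 15500, 16500, 25000, 30000]).reshape(-1, 1)
profits = np.array([1000, 3000, 5000, 8000, 6500, 7000, 15000, 20000,
                    21000, 25000, 27000, 30000, 52000, 80000])

tree = DecisionTreeRegressor().fit(costs, profits)

# split nodes have two different children; leaves have both children set to -1
is_split = tree.tree_.children_left != tree.tree_.children_right
thresholds = np.sort(tree.tree_.threshold[is_split])

# midpoints between neighbouring production costs
sorted_costs = np.sort(costs.ravel())
midpoints = (sorted_costs[:-1] + sorted_costs[1:]) / 2

print("Split thresholds:", thresholds)
print("Match the midpoints:", np.allclose(thresholds, midpoints))
```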

In the second plot, you can see a visual representation of our decision tree from the root to the leaves. We will use scikit-learn's [`plot_tree()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) function for this.

In [None]:
# import necessary library  
from sklearn.tree import plot_tree
  
fig = plt.figure(figsize=(25, 20))
dectree_plot = plot_tree(dec_tree, feature_names=['Production Cost'], filled=True)

# You can export the graphic with the following command
# plt.savefig('decision_tree')

Reading the tree from the root to the leaves:
- the root node splits on the *Production Cost* value and gives rise to two branches
- one child node of the root tests whether the production cost is less than or equal to 7000.0, while the other child tests against a higher threshold
- the subtree of the first child keeps branching for several more levels, while the subtree of the second child ends after just one more level
- the maximum depth of the tree is 6
- the total number of leaf nodes is 14, one for each of the 14 games, so each leaf captures only one data point
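
The depth and the number of leaves can also be checked programmatically rather than read from the figure. The sketch below uses the estimator's `get_depth()` and `get_n_leaves()` methods and prints a plain-text version of the tree with `export_text()`. The data is repeated under the illustrative names `costs` and `profits` so that the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# production cost and profit values copied from the dataset above
costs = np.array([100, 500, 1500, 3500, 5000, 6000, 8000, 9500,
                  12000, 14000, 15500, 16500, 25000, 30000]).reshape(-1, 1)
profits = np.array([1000, 3000, 5000, 8000, 6500, 7000, 15000, 20000,
                    21000, 25000, 27000, 30000, 52000, 80000])

tree = DecisionTreeRegressor().fit(costs, profits)

# summary figures of the fitted tree
print("Maximum depth:", tree.get_depth())
print("Number of leaves:", tree.get_n_leaves())

# indented text rendering of every split and leaf value
tree_text = export_text(tree, feature_names=['Production Cost'])
print(tree_text)
```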