See the scikit-learn documentation for [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html), [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), and [`GradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings

# 1. Decision Tree Regressor

The dataset ``ram_price.csv`` contains the price information of some historical computer memory: Random Access Memory (RAM). 

- The relationship between ``price`` and ``date`` is highly skewed (a simple scatterplot shows this), so we log-transform ``price`` to get a better fit with the linear regression model.

- A tree model is insensitive to monotonic transformations of the *features*. Transforming the *target* can change where the splits fall, but we use ``log_price`` for both models so that their results are directly comparable.
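
The effect of the log transformation on the linear fit can be illustrated with a minimal sketch. The data below are synthetic (an assumed exponential decay with multiplicative noise), not the RAM dataset, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: a price that decays exponentially with the year, with multiplicative noise
rng = np.random.default_rng(0)
year = np.linspace(1960, 2000, 200).reshape(-1, 1)     # 2D feature matrix
price = np.exp(20 - 0.5 * (year.ravel() - 1960)) * rng.lognormal(0, 0.3, size=200)

# Fit the same linear model to the raw target and to the log-transformed target
r2_raw = LinearRegression().fit(year, price).score(year, price)
r2_log = LinearRegression().fit(year, np.log(price)).score(year, np.log(price))

print("R2 on raw price: {:.2f}".format(r2_raw))
print("R2 on log price: {:.2f}".format(r2_log))
```

On data like this, the log-transformed target is close to a straight line in ``year``, which is the shape a linear model can capture.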

In [None]:
ram = pd.read_csv("ram_price.csv")

ram['log_price'] = np.log(ram['price'])    # Add a new column named log_price 

display(ram.shape, ram.describe())

**Split the Data into Train/Test**

Arbitrarily split the data based on ``date`` so that we use historical data (``date < 2000``) to forecast RAM prices after the year 2000.

In [None]:
data_train = ram[ram['date'] < 2000]       
data_test = ram[ram['date'] >= 2000]

X_train = data_train[['date']]       # Features must be 2D (a DataFrame, not a Series)
y_train = data_train['log_price']

X_test = data_test[['date']]
y_test = data_test['log_price']

display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

**Visualization**

Here we visualize the train and test data in different colors.

In [None]:
plt.plot(data_train['date'], data_train['log_price'], linewidth = 3, label="Train data")                 # Train
plt.plot(data_test['date'], data_test['log_price'], color='lightblue', linewidth = 3, label="Test data") # Test
plt.xlabel('Year')
plt.ylabel("(Log) Price/Mbyte")
plt.legend();

**Model Training & Evaluation**

Now, let's train ``Linear Regression`` and ``Decision Tree Regressor``, with ``log_price`` as the target variable. Check the train & test ``R2`` accordingly.

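- The ``Regression Tree`` model overfits badly: its test ``R2`` is negative, because a tree cannot extrapolate beyond the range of ``date`` values seen in training.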
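
One reason a tree does poorly on a time-based split is that every test point falls to the right of all learned thresholds and therefore lands in the same leaf. The following is a minimal sketch on synthetic data (a hypothetical noiseless downward trend), not on the RAM dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a steadily falling trend, split by time like the RAM data
x = np.arange(0, 100).reshape(-1, 1)
y = -0.1 * x.ravel()

x_past, y_past = x[:70], y[:70]         # "train": earlier time points
x_future, y_future = x[70:], y[70:]     # "test": later time points

tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(x_past, y_past)
line = LinearRegression().fit(x_past, y_past)

# Every future point falls into the same (right-most) leaf, so the tree predicts one constant
n_distinct = np.unique(tree.predict(x_future)).size
tree_r2 = tree.score(x_future, y_future)
line_r2 = line.score(x_future, y_future)

print("Distinct tree predictions on future data:", n_distinct)
print("Tree test R2:   {:.2f}".format(tree_r2))
print("Linear test R2: {:.2f}".format(line_r2))
```

A constant prediction that differs from the test-set mean always yields a negative ``R2``, while the linear model continues the trend.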

In [None]:
# Simple linear regression
lr = LinearRegression().fit(X_train, y_train)

# For reproducible result, fix random_state in tree models
tr = DecisionTreeRegressor(max_depth=3, random_state = 1).fit(X_train, y_train)  

print("Linear Regression Train & Test R2: {:.2f}, {:.2f}".format(lr.score(X_train,y_train),lr.score(X_test,y_test)))
print("Regression Tree Train & Test R2: {:.2f}, {:.2f}".format(tr.score(X_train,y_train),tr.score(X_test,y_test)))  # negative test R2!

**Visualize the Tree model**

- Note that the ``impurity`` of each node is measured by ``squared_error`` (i.e., MSE) in regression trees. With ``filled = True``, the node color reflects the node's predicted value.

In [None]:
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(18,10))  

plot_tree(decision_tree = tr, 
          feature_names = X_train.columns.to_list(),    # A plain list works regardless of scikit-learn version
          filled = True,   
          fontsize = 10)  

fig.suptitle('Regression Tree for RAM Log_Price', fontsize = 15);   

In the root node, (1) what is the predicted ``log_price`` for instances falling into this node? (2) What is the impurity of this node?


In [None]:
print("1. Predicted log_price in the root node: ", y_train.mean())
print("2. MSE measures the impurity: ", np.mean((y_train - y_train.mean())**2))
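
The same two quantities can also be read directly from a fitted tree. The sketch below uses the ``tree_.value`` and ``tree_.impurity`` arrays of a tree trained on hypothetical toy data (not the RAM dataset) and compares them with the manual calculation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for (date, log_price)
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 3.0 - 0.5 * X_toy.ravel() + rng.normal(0, 0.2, size=50)

toy_tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_toy, y_toy)

# Node 0 is the root; tree_.value holds each node's prediction, tree_.impurity its MSE
root_value = toy_tree.tree_.value[0].item()
root_mse = toy_tree.tree_.impurity[0]
manual_mse = np.mean((y_toy - y_toy.mean()) ** 2)

print("Root prediction:", root_value, "vs. target mean:", y_toy.mean())
print("Root impurity:  ", root_mse, "vs. manual MSE: ", manual_mse)
```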

<font color=red>***Exercise 1: Your Code Here***</font>  


**Visualize the Actual and Predicted log_price (the model)**

- To display the predicted ``log_price`` for both the train and test sets, first predict on all features ``ram[['date']]``.
- Plot the predicted values with a dashed line, using the format string ``"--"`` in `plt.plot()`.

In [None]:
# Make predictions on all data (both train and test)
X_all = ram[['date']]
pred_tr = tr.predict(X_all) 
pred_lr = lr.predict(X_all)

In [None]:
# Plot the actual log_price on y axis  
# Train
# Test

# Plot the predicted log_price on y axis  
# Predicted by tree
# Predicted by lr

plt.xlabel('Year')
plt.ylabel("(Log) Price/Mbyte")
plt.legend();
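
As a hedged illustration of the plotting pattern (solid lines for actual values, dashed lines for predictions), here is a minimal sketch on synthetic data. The data, the ``_demo`` model names, and the axis labels are all hypothetical, and this is only one possible layout.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the RAM data: a noisy linear trend in the log target
rng = np.random.default_rng(0)
year = np.linspace(1960, 2010, 120)
log_y = 18 - 0.45 * (year - 1960) + rng.normal(0, 0.5, size=year.size)

is_train = year < 2000                   # Same time-based split as above
X_demo = year.reshape(-1, 1)
lr_demo = LinearRegression().fit(X_demo[is_train], log_y[is_train])
tr_demo = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_demo[is_train], log_y[is_train])

fig, ax = plt.subplots()
# Actual values: solid lines
ax.plot(year[is_train], log_y[is_train], linewidth=3, label="Train data")
ax.plot(year[~is_train], log_y[~is_train], color="lightblue", linewidth=3, label="Test data")
# Predicted values over the full range: dashed lines via the "--" format string
ax.plot(year, tr_demo.predict(X_demo), "--", label="Tree prediction")
ax.plot(year, lr_demo.predict(X_demo), "--", label="Linear prediction")
ax.set_xlabel("Year")
ax.set_ylabel("(Log) target")
ax.legend()
```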

# 2. Ensemble of Trees  


Let's use the ``boston_house_prices.csv`` data again. Because feature scaling doesn't matter for tree models, we skip the preprocessing steps here.

In [None]:
# Load the data
df = pd.read_csv('boston_house_prices.csv')

# Separate data into train and test
X = df.drop(columns = 'MEDV')
y = df['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)    
display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## 2.1 Build Three Models Individually

Let's compare 3 tree models: ``DecisionTreeRegressor``, ``RandomForestRegressor`` and ``GradientBoostingRegressor``, all with default settings.  

- You will see that ``DecisionTreeRegressor`` has overfitted, and that ``GradientBoostingRegressor`` performs best on the test data.

In [None]:
tr = DecisionTreeRegressor(random_state = 0)       # max_depth = None, min_samples_leaf = 1 (default)   
tr.fit(X_train, y_train)

rf = RandomForestRegressor(random_state = 0)       # n_estimators = 100, max_depth = None, min_samples_leaf=1 (default)
rf.fit(X_train, y_train)

gb = GradientBoostingRegressor(random_state = 0)   # n_estimators = 100, learning_rate=0.1, max_depth = 3 (default)
gb.fit(X_train, y_train)

print("tr Train vs. Test R2: {:.2f} vs. {:.2f}".format(tr.score(X_train, y_train), tr.score(X_test, y_test)))
print("rf Train vs. Test R2: {:.2f} vs. {:.2f}".format(rf.score(X_train, y_train), rf.score(X_test, y_test)))
print("gb Train vs. Test R2: {:.2f} vs. {:.2f}".format(gb.score(X_train, y_train), gb.score(X_test, y_test)))

Check out all the trees we've trained! You can even use the ``plot_tree()`` function to visualize them one by one.

In [None]:
#rf.estimators_   # A list with 100 estimators, use rf.estimators_[0] to select the first tree
#gb.estimators_   # A 2D array with 100 estimators (100,1), use gb.estimators_[0,0] to select the first tree
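
To make the structure of ``estimators_`` concrete, the following minimal sketch fits two small ensembles on synthetic data (hypothetical, generated with ``make_regression``) and checks that a random forest's prediction equals the average of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Hypothetical synthetic data, kept small so the example runs quickly
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_syn, y_syn)
gb_demo = GradientBoostingRegressor(n_estimators=10, random_state=0).fit(X_syn, y_syn)

# Random forest: a plain list of fitted trees
n_rf_trees = len(rf_demo.estimators_)
print("Number of trees in the forest:", n_rf_trees)

# Gradient boosting: a 2D array of trees, indexed as [stage, output]
gb_shape = gb_demo.estimators_.shape
print("Shape of gb estimators_:", gb_shape)

# The forest's prediction is the average of its trees' predictions
tree_preds = np.stack([t.predict(X_syn) for t in rf_demo.estimators_])
manual_avg = tree_preds.mean(axis=0)
matches = np.allclose(manual_avg, rf_demo.predict(X_syn))
print("Average of trees matches rf.predict:", matches)
```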

## 2.2  Model and Parameter Comparison with GridSearchCV

Let's use ``GridSearchCV`` to compare the three models and their parameters.

In [None]:
# Build a pipeline with only one step 

from sklearn.pipeline import Pipeline

pipe = Pipeline([('regressor', DecisionTreeRegressor())])    # Put any model in this pipeline

In [None]:
# Construct the param_grid using a list of dictionaries
# Note: for max_features and max_samples, a float is a fraction while an int is an absolute count,
# so use 1.0 (not 1) to mean 100%.

# Tune max_depth, min_samples_leaf, max_features for the tree (3 * 4 * 3 = 36)
param1 = {'regressor': [DecisionTreeRegressor()],
          'regressor__max_depth':[3,5,7],
          'regressor__min_samples_leaf':[10,20,30,40],
          'regressor__max_features':[0.5, 0.8, 1.0]}       # Fraction of features randomly considered for splitting

# Tune n_estimators, max_samples, max_features, max_depth for random forest (3 * 3 * 4 * 3 = 108)
param2 = {'regressor':[RandomForestRegressor()],
          'regressor__max_depth':[3,5,7], 
          'regressor__n_estimators':[50,100,200],           
          'regressor__max_samples':[0.1, 0.5, 0.8, 1.0],   # Bootstrapped sample size as a fraction of the training data
          'regressor__max_features':[0.5, 0.8, 1.0]}       # Fraction of features randomly considered for splitting

# Tune n_estimators, learning_rate, subsample, max_features for gradient boosting (3 * 3 * 4 * 3 = 108)
param3 = {'regressor':[GradientBoostingRegressor()],
          'regressor__n_estimators':[50,100,200],           
          'regressor__learning_rate':[0.1, 1, 10],         # Learning rate 
          'regressor__subsample':[0.1, 0.5, 0.8, 1.0],     # Fraction of data randomly sampled for each stage
          'regressor__max_features':[0.5, 0.8, 1.0]}       # Fraction of features randomly considered for splitting

params = [param1, param2, param3]    # Wrap multiple dictionaries in a list
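
The number of candidate settings in a list of grids can be checked without fitting anything, using ``ParameterGrid``. The sketch below rebuilds grids of the same sizes under hypothetical names (``grid_tree``, ``grid_rf``, ``grid_gb``). It writes the fractions as floats, since scikit-learn reads an integer ``max_features`` or ``max_samples`` as an absolute count.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor

# Hypothetical grids with the same sizes as the ones in this section
grid_tree = {'regressor': [DecisionTreeRegressor()],
             'regressor__max_depth': [3, 5, 7],
             'regressor__min_samples_leaf': [10, 20, 30, 40],
             'regressor__max_features': [0.5, 0.8, 1.0]}

grid_rf = {'regressor': [RandomForestRegressor()],
           'regressor__max_depth': [3, 5, 7],
           'regressor__n_estimators': [50, 100, 200],
           'regressor__max_samples': [0.1, 0.5, 0.8, 1.0],
           'regressor__max_features': [0.5, 0.8, 1.0]}

grid_gb = {'regressor': [GradientBoostingRegressor()],
           'regressor__n_estimators': [50, 100, 200],
           'regressor__learning_rate': [0.1, 1, 10],
           'regressor__subsample': [0.1, 0.5, 0.8, 1.0],
           'regressor__max_features': [0.5, 0.8, 1.0]}

all_grids = [grid_tree, grid_rf, grid_gb]

# ParameterGrid enumerates every combination; len() counts them without training
n_per_grid = [len(ParameterGrid(g)) for g in all_grids]
n_total = len(ParameterGrid(all_grids))

print("Candidates per grid:", n_per_grid)
print("Total candidates:   ", n_total)
```

Multiplying the total by the number of cross-validation folds gives the number of fits a grid search would perform, before the final refit.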

<font color=red>***Exercise 2: Your Code Here***</font>  

With the above parameter grid, fit the GridSearchCV object on the train set and answer the questions below:

- What is the best model, and what are its parameters?
- What is the cross-validation score for the best model/parameters?
- How many models were built in order to find the best model and parameters? (*Hint: check the `params` key of `cv_results_`*)
- Check the generalization performance of the best model, refit on the full training set.
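
The general workflow can be sketched on a much smaller problem. The example below is an illustration only: it uses synthetic data and a deliberately tiny, hypothetical grid (``small_params``) rather than the housing data and the full grid, so its results say nothing about the answers to the questions above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data in place of the housing data
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

pipe_demo = Pipeline([('regressor', DecisionTreeRegressor())])

# A deliberately small grid (2 + 2 = 4 candidates) so the search finishes quickly
small_params = [
    {'regressor': [DecisionTreeRegressor(random_state=0)],
     'regressor__max_depth': [3, 5]},
    {'regressor': [RandomForestRegressor(n_estimators=20, random_state=0)],
     'regressor__max_depth': [3, 5]},
]

# With scoring left at its default, candidates are ranked by the estimator's own score (R2 here)
grid_demo = GridSearchCV(pipe_demo, param_grid=small_params, cv=5)
grid_demo.fit(Xs_train, ys_train)

n_candidates = len(grid_demo.cv_results_['params'])
print("Best parameters:", grid_demo.best_params_)
print("Best CV score: {:.2f}".format(grid_demo.best_score_))
print("Fits during CV:", n_candidates * 5, "(plus one final refit on the full training set)")
print("Test R2: {:.2f}".format(grid_demo.score(Xs_test, ys_test)))
```

Because ``refit`` defaults to ``True``, the fitted search object can be scored on the test set directly, which uses the best candidate retrained on all of the training data.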

In [None]:
# Fit the GridSearchCV object on train set, with 5-fold cross-validation



In [None]:
# No. of models trained in cross validation (GridSearch)

len(grid.cv_results_['params']) * 5

In [None]:
# Best model's performance on test set

grid.score(X_test, y_test)     

# Alternatively, uncomment the lines below
#best = grid.best_estimator_
#best.score(X_test, y_test)