Data Mining [H02C6a] - Spring 2019

# Gradient Boosting

<img src='../img/gbt_pic.png'>
Gradient boosting is arguably one of the most commonly used machine learning algorithms.

In this exercise, you will implement a very naive version of gradient boosted trees for solving regression tasks.

## Introduction

**Question 1.1:** What is boosting? What other ensembling techniques are you familiar with?

**Question 1.2:** What is the idea behind gradient boosting? What does <i>'gradient'</i> in its title refer to?

Hint: if you do not know anything about gradient boosting yet, try reading the following article:
http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
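
As a small numerical illustration of the hint, the sketch below assumes the loss $L(F) = \frac{1}{2}\sum_i (y_i - F_i)^2$ (a common choice for regression; the notebook itself does not name a loss). Under that assumption, the negative gradient of the loss with respect to the current predictions $F$ is the vector of residuals $y - F$. The helper names `half_squared_error` and `numerical_gradient` are hypothetical and exist only for this check.

```python
import numpy as np

def half_squared_error(y, F):
    """Loss L(F) = 0.5 * sum_i (y_i - F_i)^2 (assumed loss for this sketch)."""
    return 0.5 * np.sum((y - F) ** 2)

def numerical_gradient(loss, y, F, eps=1e-6):
    """Central finite-difference estimate of dL/dF_i for every prediction F_i."""
    grad = np.zeros_like(F)
    for i in range(len(F)):
        step = np.zeros_like(F)
        step[i] = eps
        grad[i] = (loss(y, F + step) - loss(y, F - step)) / (2 * eps)
    return grad

y = np.array([3.0, -1.0, 2.5, 0.0])
F = np.full_like(y, y.mean())      # constant initial prediction

neg_grad = -numerical_gradient(half_squared_error, y, F)
residuals = y - F

# The two vectors agree up to finite-difference error.
print(np.allclose(neg_grad, residuals, atol=1e-5))
```

Fitting each new tree to the residuals can therefore be read as taking a gradient descent step in the space of predictions.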

## Gradient Boosting from scratch

**Question 2.1** Complete the skeleton code below to obtain a very basic implementation of the gradient tree boosting algorithm.

In [1]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

In [2]:
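class myGBR():
    """Naive gradient boosted regression trees (fits trees to residuals)."""

    def __init__(self):
        return

    def fit(self, X, y, max_num_iter = 10, max_depth = 3, learning_rate = 0.5):
        self.learning_rate = learning_rate

        # Initial estimate: the mean of the targets, used for every sample
        self.init_est = np.mean(y)
        y_pred = np.full(len(y), self.init_est)

        self.trees = []

        for i in range(max_num_iter):
            # For squared-error loss, the negative gradient is the residual
            residuals = y - y_pred

            # Fit a shallow tree to the current residuals
            dt = DecisionTreeRegressor(max_depth=max_depth)
            dt.fit(X, residuals)
            self.trees.append(dt)

            # Move the training predictions a shrunken step towards the targets
            y_pred += self.learning_rate * dt.predict(X)

    def predict(self, X):
        # Initial estimate plus the shrunken sum of all tree predictions
        tree_sum = np.sum(np.array([dt.predict(X) for dt in self.trees]), axis=0)
        y_pred = self.init_est + self.learning_rate * tree_sum
        return y_pred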
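
One way to sanity-check the residual-fitting loop is to track the training error after every round. The sketch below repeats the same loop as a standalone function (`boosting_train_mse` is a hypothetical helper, and the noisy sine data is made up for illustration). With mean-valued leaves and a learning rate between 0 and 1, the training MSE should not increase from one round to the next.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_train_mse(X, y, num_iter=20, max_depth=2, learning_rate=0.5):
    """Run the residual-fitting loop and record the training MSE after each round."""
    y_pred = np.full(len(y), np.mean(y))          # constant initial estimate
    history = [np.mean((y - y_pred) ** 2)]
    for _ in range(num_iter):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y - y_pred)                   # fit the current residuals
        y_pred = y_pred + learning_rate * tree.predict(X)
        history.append(np.mean((y - y_pred) ** 2))
    return history

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=200)

history = boosting_train_mse(X_toy, y_toy)
print(history[0], history[-1])   # training MSE before and after boosting
```

A falling training error only shows that the loop is wired correctly; it says nothing about generalisation, which is what the test-set experiments are for.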

## Experiments

It is now time to experiment with your learner. We will use the standard California housing prices dataset, which can be loaded directly from scikit-learn.

In [3]:
# Read the data
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = data.data
y = data.target


In [4]:
# Split the data in train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [5]:
def mse(y, y_pred):
    return np.mean((y-y_pred)**2)

Let's see how well our model performs on the test set.

In [6]:
gbr = myGBR()
gbr.fit(X_train, y_train, max_num_iter = 50, max_depth = 5, learning_rate = 0.1)

y_pred = gbr.predict(X_test)
mse(y_test, y_pred)

0.2710749414345778

**Question 3.1** How does that compare to other algorithms (e.g., single regression tree, Random Forest, ...)?

In [7]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
dt.fit(X_train,y_train)

y_pred = dt.predict(X_test)
mse(y_test, y_pred)

0.5240489018608054

In [8]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)
mse(y_test, y_pred)



0.28272101961023033
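
Another natural point of comparison is scikit-learn's own `GradientBoostingRegressor`. The sketch below uses the same hyperparameters as the `myGBR` run above, but on synthetic Friedman data so that it runs without a download; that data, and the variable names `X_syn`, `Xs_train`, and so on, are assumptions made for this example. In the notebook you would reuse the existing `X_train`, `X_test`, `y_train`, `y_test` split instead.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); swap in the California housing split in the notebook.
X_syn, y_syn = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

# Same number of trees, depth and learning rate as the myGBR experiment.
ref = GradientBoostingRegressor(n_estimators=50, max_depth=5,
                                learning_rate=0.1, random_state=0)
ref.fit(Xs_train, ys_train)

ref_mse = np.mean((ys_test - ref.predict(Xs_test)) ** 2)
baseline_mse = np.mean((ys_test - ys_train.mean()) ** 2)   # always predict the training mean
print(ref_mse, baseline_mse)
```

Including the predict-the-mean baseline gives a scale for judging how much any of the models actually learned.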

**Question 3.2** What are the effects of different model parameters (learning rate, number of iterations, ...) on the performance?

In [None]:
# Your code here
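
One possible way to start on Question 3.2, offered as a sketch and not as the intended solution: scikit-learn's `GradientBoostingRegressor` has a `staged_predict` method that yields the predictions after each added tree, so a single fit per learning rate gives the whole error curve over the number of iterations. The synthetic Friedman data and the chosen grid of learning rates are assumptions for illustration; the same loop could be pointed at the California housing split.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption) so the sketch runs without a download.
X_syn, y_syn = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

results = {}
for lr in [0.01, 0.1, 0.5]:
    model = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                      learning_rate=lr, random_state=0)
    model.fit(Xs_train, ys_train)
    # staged_predict yields the test predictions after 1, 2, ..., n_estimators trees.
    test_mse = [np.mean((ys_test - p) ** 2) for p in model.staged_predict(Xs_test)]
    results[lr] = test_mse
    best_iter = int(np.argmin(test_mse)) + 1
    print(f"learning_rate={lr}: best test MSE {min(test_mse):.3f} at {best_iter} trees")
```

Plotting each list in `results` against the iteration number makes the interaction between learning rate and number of trees easier to see than the single best value.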