![banner](../img/cdips_2017_logo.png)

# Gradient Boosting Regression - All targets
With single-target GBR under our belts (see the [single target notebook](05_A - Gradient Boosted Regression Trees - Single Target.ipynb)), 
we can now evaluate performance on all 5 targets by fitting a separate GBR model
to each target.  We will end up with one model per target.

If you wish to select features by importance, as in the
[single GBR notebook](05_A - Gradient Boosted Regression Trees - Single Target.ipynb), you can do so by setting up a
scikit-learn [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). 
A Pipeline makes it easy to apply the same sequence of transformations to multiple targets.
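
As a rough illustration of the Pipeline idea, the sketch below builds one pipeline per target, each chaining an importance-based selector with a GBR model. It assumes the feature selection in the single-target notebook can be approximated with `SelectFromModel`; the synthetic data, the names `X_demo`, `y_demo`, and `make_target_pipeline`, and the small hyperparameter values are all hypothetical and chosen only so that the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the spectra (hypothetical): 60 samples, 30 features, 2 targets.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 30))
y_demo = np.column_stack([
    X_demo[:, 0] + 0.1 * rng.normal(size=60),
    X_demo[:, 1] - X_demo[:, 2],
])

def make_target_pipeline():
    # Step 1: keep features whose importance is at least the median importance.
    # Step 2: fit the final GBR model on the selected features only.
    return Pipeline([
        ("select", SelectFromModel(
            GradientBoostingRegressor(n_estimators=30, random_state=0),
            threshold="median")),
        ("gbr", GradientBoostingRegressor(n_estimators=30, random_state=0)),
    ])

# One independent pipeline per target column.
pipelines = [make_target_pipeline() for _ in range(y_demo.shape[1])]
for idx, pipe in enumerate(pipelines):
    pipe.fit(X_demo, y_demo[:, idx])

# Stack the per-target predictions into an (n_samples, n_targets) array.
preds = np.column_stack([pipe.predict(X_demo) for pipe in pipelines])
print(preds.shape)
```

Because each pipeline is fitted on its own target, the selected features can differ from target to target.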

For simplicity, this notebook takes the first 100 principal components
to reduce the number of features.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn as skl
import numpy as np

import seaborn as sns
sns.set(font_scale=2)

import scripts.load_data as load

%matplotlib inline

In [None]:
import sklearn.preprocessing
import sklearn.decomposition
import sklearn.ensemble
import sklearn.model_selection
import sklearn.feature_selection
import sklearn.metrics

from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

from time import time

In [None]:
X,y=load.load_training_spectra()
pca = skl.decomposition.PCA(n_components=100)
X_transformed = pca.fit_transform(X)


Two functions for different scoring 
metrics are in the cell below.
More information on scoring metrics 
can be found on [scikit-learn's site](http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score).

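`scoreGBR` returns a single $R^2$ computed 
over all targets: the per-target scores are combined with 
`multioutput='variance_weighted'`, so each target is weighted by its variance.  
For a single target, $R^2$ is defined as 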

\begin{equation}
R^2(y,\hat{y}) = 
1-\frac{\sum (y_i - \hat{y}_i)^2}
       {\sum (y_i - \bar{y})^2}
\end{equation}

where $\hat{y}_i$ is the predicted 
value, $y_i$ is the true value, 
and $\bar{y} = \frac{1}{n}\sum{y_i}$.  All 
sums are over $i$ from $1$ to $n$.
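
To make the definition concrete, here is a small sketch that computes $R^2$ per target directly from the formula and then combines the targets with variance weighting. The toy arrays `y_true` and `y_pred` are made up for illustration; the comparison against `r2_score` is only a sanity check.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy data (hypothetical): 4 samples, 2 targets on very different scales.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [4.0, 30.0]])
y_pred = np.array([[1.1, 12.0], [1.9, 18.0], [3.2, 50.0], [3.8, 35.0]])

# Residual and total sums of squares, one value per target column.
ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
r2_per_target = 1.0 - ss_res / ss_tot

# Variance weighting: average the per-target scores, weighted by ss_tot.
r2_weighted = np.sum(ss_tot * r2_per_target) / np.sum(ss_tot)

# Both results should agree with scikit-learn.
print(np.allclose(r2_per_target, r2_score(y_true, y_pred, multioutput="raw_values")))
print(np.isclose(r2_weighted, r2_score(y_true, y_pred, multioutput="variance_weighted")))
```

With variance weighting, targets with a larger spread contribute more to the combined score.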

`MCRMSE` returns the [Kaggle scoring metric](https://www.kaggle.com/c/afsis-soil-properties#evaluation): 
mean columnwise root mean squared error, 
the average over the 5 targets of each target's RMSE. 

\begin{equation}
MCRMSE = \frac{1}{5} \sum_{j=1}^5 \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{ij} - \hat{y}_{ij})^2}
\end{equation}
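
The formula can be checked on plain arrays. The sketch below is a minimal NumPy version that works for any number of target columns; the function name `mcrmse` and the toy arrays are hypothetical and are not part of this notebook's code.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE for arrays of shape (n_samples, n_targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # RMSE of each column (target), then the mean over columns.
    rmse_per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return rmse_per_column.mean()

# Toy check: every error is 3 in the first column and 1 in the second,
# so the column RMSEs are 3 and 1 and their mean is 2.
y_true_demo = np.zeros((2, 2))
y_pred_demo = np.array([[3.0, 1.0], [3.0, 1.0]])
print(mcrmse(y_true_demo, y_pred_demo))
```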

In [None]:
def trainGBR(GBR_models, X_train, y_train):
    # Fit one model per target column of y_train (a DataFrame).
    for output_idx, GBR_model in enumerate(GBR_models):
        GBR_model.fit(X_train, y_train.iloc[:, output_idx])

def scoreGBR(GBR_models, X_test, y_test):
    # Collect each model's predictions into one column per target.
    y_pred = np.zeros(y_test.shape)
    for output_idx, GBR_model in enumerate(GBR_models):
        y_pred[:, output_idx] = GBR_model.predict(X_test)

    # Single R^2 over all targets, weighting each target by its variance.
    return sklearn.metrics.r2_score(y_test, y_pred, multioutput='variance_weighted')

# Kaggle scoring metric: mean columnwise root mean squared error
def MCRMSE(GBR_models, X_test, y_test):
    rmse = np.zeros(len(GBR_models))
    for output_idx, GBR_model in enumerate(GBR_models):
        y_true = y_test.iloc[:, output_idx]
        y_hat = GBR_model.predict(X_test)
        # RMSE for this target.
        rmse[output_idx] = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_hat))

    # Average the per-target RMSEs.
    return np.mean(rmse)

The set of hyperparameters below achieves reasonable performance on all 5 targets.
If you are feeling ambitious, you may be able to get better models by tuning these
for each individual target, so that each target gets its own optimal set of
hyperparameters.
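
One possible way to tune each target separately is a randomized search per target column. The sketch below assumes `RandomizedSearchCV` with the `sp_randint` and `sp_uniform` distributions imported above; the synthetic data, the search ranges, and the tiny `n_iter` and `cv` values are hypothetical and chosen only so that the example runs quickly, not as recommended settings.

```python
import numpy as np
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical): 60 samples, 8 features, 2 targets.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 8))
y_demo = np.column_stack([
    X_demo[:, 0] + 0.1 * rng.normal(size=60),
    X_demo[:, 1] * X_demo[:, 2],
])

# Illustrative search ranges only.
param_distributions = {
    "n_estimators": sp_randint(20, 60),
    "max_depth": sp_randint(2, 5),
    "min_samples_leaf": sp_randint(1, 6),
    "max_features": sp_uniform(0.5, 0.5),     # uniform on [0.5, 1.0]
    "learning_rate": sp_uniform(0.01, 0.19),  # uniform on [0.01, 0.2]
}

# Run an independent search for each target and keep its best parameters.
best_params = []
for idx in range(y_demo.shape[1]):
    search = RandomizedSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_distributions=param_distributions,
        n_iter=3,
        cv=3,
        scoring="neg_mean_squared_error",
        random_state=0,
    )
    search.fit(X_demo, y_demo[:, idx])
    best_params.append(search.best_params_)

print(len(best_params))
```

Each entry of `best_params` could then be passed to its own `GradientBoostingRegressor(**params)` in place of a single shared parameter dictionary.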

In [None]:
num_outputs = y.shape[1]
params = {'n_estimators':5000,
          'max_depth':3,
          'min_samples_split':15,
          'min_samples_leaf':3,
          'max_features':0.8,
          'learning_rate':0.01}

GBR_models = [skl.ensemble.GradientBoostingRegressor(**params) for _ in range(num_outputs)]

In [None]:
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(X_transformed,y,test_size=0.2)

In [None]:
start=time()
trainGBR(GBR_models, X_train, y_train)
print("GradientBoostingRegressor took %.2f seconds"
      % (time() - start))

In [None]:
scoreGBR(GBR_models,X_test,y_test)

In [None]:
scoreGBR(GBR_models,X_train,y_train)

In [None]:
MCRMSE(GBR_models, X_test, y_test)

In [None]:
MCRMSE(GBR_models, X_train, y_train)