## Model Building in XGBoost

This is a great article for tunning XGboost: http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [1]:
import os
windows=False
if (windows):
    mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'
    os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
    
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
import time
import csv
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 8
import math

_start_time = time.time()

def tic():
    global _start_time 
    _start_time = time.time()

def tac():
    t_sec = round(time.time() - _start_time)
    (t_min, t_sec) = divmod(t_sec,60)
    (t_hour,t_min) = divmod(t_min,60) 
    print('Time passed: {}hour:{}min:{}sec'.format(t_hour,t_min,t_sec))

In [3]:
use_validation=True
scale_numericals=False

In [4]:
#now we load our modified train and test set
tic()
sufix=""
if (use_validation): sufix += "_holdout"
if (scale_numericals): sufix += "_scaled"
print ("sufix: "+sufix)

train = pd.read_csv("./input-data/train_modified"+sufix+".csv",
                    dtype = {'Canal_ID': 'int8',
                            'log_target':  'float64',
                            'Log_Target_mean_lag1': 'float64',
                            'Log_Target_mean_lag2': 'float64',
                            'Log_Target_mean_lag3': 'float64',
                            'Log_Target_mean_lag4': 'float64',
                            'Log_Target_mean_lag5': 'float64',
                            'Lags_sum': 'float64',
                            'pairs_mean':  'float64',
                            'brand': 'int8',
                            'prodtype_cluster': 'int32',
                            'Qty_Ruta_SAK_Bin': 'int32',
                            'ZipCode': 'uint32',
                            'week_ct': 'int8',
                            'NombreCliente': 'int32',
                            'Producto_ID_clust_ID':'int32',
                            'Ruta_SAK_clust_ID':'int32',
                            'Agencia_ID_clust_ID':'int32',
                            'Cliente_ID_clust_ID':'int32'},
                   )
                  
val = pd.read_csv("./input-data/val_modified"+sufix+".csv",
                    dtype = {'Canal_ID': 'int8',
                            'log_target':  'float64',
                            'Log_Target_mean_lag1': 'float64',
                            'Log_Target_mean_lag2': 'float64',
                            'Log_Target_mean_lag3': 'float64',
                            'Log_Target_mean_lag4': 'float64',
                            'Log_Target_mean_lag5': 'float64',
                            'Lags_sum': 'float64',
                            'pairs_mean':  'float64',
                            'brand': 'int8',
                            'prodtype_cluster': 'int32',
                            'Qty_Ruta_SAK_Bin': 'int32',
                            'ZipCode': 'uint32',
                            'week_ct': 'int8',
                            'NombreCliente': 'int32',
                            'Producto_ID_clust_ID':'int32',
                            'Ruta_SAK_clust_ID':'int32',
                            'Agencia_ID_clust_ID':'int32',
                            'Cliente_ID_clust_ID':'int32'},
                   ) 
    
test = pd.read_csv("./input-data/test_modified"+sufix+".csv",
                    dtype = {'id': 'uint32',
                            'Canal_ID': 'int8',
                            'Log_Target_mean_lag1': 'float64',
                            'Log_Target_mean_lag2': 'float64',
                            'Log_Target_mean_lag3': 'float64',
                            'Log_Target_mean_lag4': 'float64',
                            'Log_Target_mean_lag5': 'float64',
                            'Lags_sum': 'float64',
                            'pairs_mean':  'float64',
                            'brand': 'int8',
                            'prodtype_cluster': 'int32',
                            'Qty_Ruta_SAK_Bin': 'int32',
                            'ZipCode': 'uint32',
                            'week_ct': 'int8',
                            'NombreCliente': 'int32',
                            'Producto_ID_clust_ID':'int32',
                            'Ruta_SAK_clust_ID':'int32',
                            'Agencia_ID_clust_ID':'int32',
                            'Cliente_ID_clust_ID':'int32'},
                      )
tac()

sufix: _holdout
Time passed: 0hour:0min:51sec


In [5]:
#Define target and ID columns:
target = 'log_target'
IDcol = 'id'

## Train multiple models per client cluster

Ok, so we said on our prior step (Models wiht scikit-learn) that we need to deal with the data set high variance. Let's do this first:

Looking at the plot below, created on the clustering-by-demand on the feature engineering notebook, we see that some client clusters behave very differntly from others. So this explain why our model is failing on predicting accurately for all of them.
We are going then to create a wrapper function to create as many models as Client Clusters by demand are (Cliente_ID_clust_ID). The scores should be bettter individually, and the concatenation of all 400 models should yield a better overall RSMLE than our baseline 0.47.

![Image of Variables vs Hypothesis](./input-data/h2o-clustByDem_Cliente_ID_400.png)

In [6]:
import xgboost as xgb
from sklearn import cross_validation, metrics
from sklearn.grid_search import GridSearchCV

def modelfit(alg, ctrain, cval, ctest, predictors, target, IDcol):
    
    #Fit the algorithm on the data
    watchlist = [(cval[predictors], cval[target])]
    alg.fit(ctrain[predictors], ctrain[target], eval_set=watchlist, eval_metric='rmse', early_stopping_rounds=50, verbose=False)
    
    alg.evals_result()

    #Predict training set:
    ctrain["predictions"] = alg.predict(ctrain[predictors])
    ctrain["predictions"] = np.maximum(ctrain["predictions"], 0)

    
    #Predict validation (holdout) set:
    cval["predictions"] = alg.predict(cval[predictors])
    cval["predictions"] = np.maximum(cval["predictions"], 0)# we make all negative numbers = 0 since there cannot be a negative demand

    
    #Predict on testing data: we need to revert it back to target by applying expm1
    ctest[target] = alg.predict(ctest[predictors])
    ctest[target] = np.maximum(ctest[target], 0) # we make all negative numbers = 0 since there cannot be a negative demand
    
    print ('RMSLE VAL: ', np.sqrt(metrics.mean_squared_error(cval[target].values, cval["predictions"].values)))
    
    return ctrain[[target,"predictions"]], cval[[target,"predictions"]], ctest[[IDcol,target]]
    

In [7]:
def clusters_fit (alg, dtrain, dval, dtest, predictors, target, IDcol, filepath):
    
    train_predictions = pd.DataFrame(index=[target,"predictions"])
    val_predictions = pd.DataFrame(index=[target,"predictions"])
    test_predictions = pd.DataFrame(index=[IDcol,target])
    
    clusters_list = train.Cliente_ID_clust_ID.drop_duplicates().get_values()
    
    for cluster in clusters_list:
        
        #we get the cluster train,val, test data
        ctrain = dtrain.loc[dtrain["Cliente_ID_clust_ID"] == cluster]
        cval   = dval.loc[dval["Cliente_ID_clust_ID"] == cluster]
        ctest  = dtest.loc[dtest["Cliente_ID_clust_ID"] == cluster]
        
        #we train the cluster
        ctrain, cval, ctest = modelfit(model, ctrain, cval, ctest, predictors, target, IDcol)
        
        #concatenate each cluster result
        train_predictions = pd.concat([train_predictions,ctrain],ignore_index=True)
        val_predictions = pd.concat([val_predictions,cval],ignore_index=True)
        test_predictions = pd.concat([test_predictions,ctest],ignore_index=True)
        
    
    #Print model report:
    print ("\nModel Report")
    print ('RMSLE TRAIN: ', np.sqrt(metrics.mean_squared_error(dtrain[target].values, train_predictions[target])))
    print ('RMSLE VAL: ', np.sqrt(metrics.mean_squared_error(dval[target].values, val_predictions[target])))
    
    #Predict on testing data: we need to revert it back to target by applying expm1
    test_predictions[target] = np.expm1(test_predictions[target])
    test_predictions[target] = np.maximum(test_predictions[target], 0) # we make all negative numbers = 0 since there cannot be a negative demand
  
    
    print ('NUM ROWS PREDICTED: ', test_predictions.shape[0] )
    print ('NUM NEGATIVES PREDICTED: ', test_predictions[target][test_predictions[target] < 0].count())
    print ('MIN TARGET PREDICTED: ', test_predictions[target].min())
    print ('MEAN TARGET PREDICTED: ', test_predictions[target].mean())
    print ('MAX TARGET PREDICTED: ', test_predictions[target].max())
    
    #Export submission file:
    submission = test_predictions.copy()
    submission[IDcol] = submission[IDcol].astype(int)
    submission.rename(columns={target: 'Demanda_uni_equil'}, inplace=True)
    submission.to_csv("./Submissions/"+filename, index=False)
        

### Alg10 - XGB-1

Looking at the behavior of the random forest sckit-learn model (our best so far with score of 0.47), let's borrow some parameters from it, and see if we have improved

In [8]:
predictors = ['Canal_ID', 'Log_Target_mean_lag1', 'Log_Target_mean_lag2', 'Log_Target_mean_lag3', 'Log_Target_mean_lag4', 
              'Log_Target_mean_lag5','Lags_sum', 'brand', 'prodtype_cluster', 'Qty_Ruta_SAK_Bin', 'ZipCode', 'Producto_ID_clust_ID']


model = xgb.XGBRegressor(n_estimators = 50, objective="reg:linear", learning_rate= 0.1, max_depth=10,
                         subsample=0.85,colsample_bytree=0.7)

tic()
clusters_fit(model, train, val, test, predictors, target, IDcol, 'alg10.csv')
tac()


RMSLE VAL:  0.504476809774
RMSLE VAL:  0.411533214317
RMSLE VAL:  0.43973570085
RMSLE VAL:  0.411024145431
RMSLE VAL:  0.538882143992
RMSLE VAL:  0.457453228549
RMSLE VAL:  0.454718086982
RMSLE VAL:  0.524101808507
RMSLE VAL:  0.480459504838
RMSLE VAL:  0.559216601161
RMSLE VAL:  0.463009332224
RMSLE VAL:  0.539271713866
RMSLE VAL:  0.450717040962
RMSLE VAL:  0.533560039429
RMSLE VAL:  0.657766196204
RMSLE VAL:  0.427734832854
RMSLE VAL:  0.47998707625
RMSLE VAL:  0.534321323108
RMSLE VAL:  0.451141760502
RMSLE VAL:  0.510919086283
RMSLE VAL:  0.452260572799
RMSLE VAL:  0.710465530112
RMSLE VAL:  0.474813833156
RMSLE VAL:  0.467305759094
RMSLE VAL:  0.342479978786
RMSLE VAL:  0.481005957126
RMSLE VAL:  0.591958551938
RMSLE VAL:  0.485771621916
RMSLE VAL:  0.533540321496
RMSLE VAL:  0.464081499086
RMSLE VAL:  0.474319668149
RMSLE VAL:  0.455628824345
RMSLE VAL:  0.462235740914
RMSLE VAL:  0.717501410431
RMSLE VAL:  0.787615656434
RMSLE VAL:  0.560629010909
RMSLE VAL:  0.546398959646
RMS

ValueError: Found arrays with inconsistent numbers of samples: [10406868 10406870]

Great improvement from past algos. The most important thing that I see here is that the Feature importance map
is very different from the H2O models. LB Scores between XGB and H2O are similar, so this is a great case for ensembling!

Let's try more estimators