Simple XGBoost using transformed features based on [notebook][1]
------------------------------------------------

No additional features are added, none are dropped.

 - continuous features are transformed
 - categorical features are factorized
 - hyperparameters are tuned using GridSearchCV

  [1]: https://www.kaggle.com/snmateen/allstate-claims-severity/simple-eda-feature-transformations

In [None]:
# import relevant modules for a start
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
%matplotlib inline

In [None]:
# disable warnings!
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load training and test datasets
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

In [None]:
y = np.log(train['loss'])
train.drop('loss', axis = 1, inplace= True)
train['dftype'] = 'train'
test['dftype'] = 'test'
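# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
full = pd.concat([train, test], ignore_index=True)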
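
The combine-then-split pattern used here can be sketched on toy data. The frames and the doubled column below are hypothetical stand-ins, not the competition data; the point is only that a marker column lets shared preprocessing run once and the rows be separated again afterwards.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv
toy_train = pd.DataFrame({'id': [1, 2, 3], 'cont1': [0.1, 0.5, 0.9]})
toy_test = pd.DataFrame({'id': [4, 5], 'cont1': [0.3, 0.7]})

# Mark the origin of each row before stacking the frames
toy_train['dftype'] = 'train'
toy_test['dftype'] = 'test'
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# Any shared preprocessing is applied once to the combined frame
toy_full['t_cont1'] = toy_full['cont1'] * 2

# The marker column recovers the original partition
train_part = toy_full[toy_full['dftype'] == 'train']
test_part = toy_full[toy_full['dftype'] == 'test']
print(len(train_part), len(test_part))
```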

Feature Transformation:
----------------------

Based on the EDA of the continuous features, let's transform them. Refer to the EDA notebook for more details.
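
As a rough illustration of why a Box-Cox transform can help, the sketch below applies `scipy.stats.boxcox` to a synthetic right-skewed sample (an assumption made for the example, not one of the `cont` features) and compares the skewness before and after.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample; stands in for a skewed continuous feature
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=5000)

# boxcox requires strictly positive input, hence the +1 shift
transformed, fitted_lambda = stats.boxcox(skewed + 1)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print('lambda:', fitted_lambda)
print('skew before:', skew_before, 'skew after:', skew_after)
```

With the second return value, `boxcox` reports the lambda it fitted by maximum likelihood, which is useful if the same transform has to be reapplied to new data.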

In [None]:
feature_transformation = {  'cont1': 'boxcox'
                          , 'cont2': 'np.tan'
                          , 'cont3': 'none'
                          , 'cont4': 'boxcox'
                          , 'cont5': 'boxcox'
                          , 'cont6': 'boxcox'
                          , 'cont7': 'boxcox'
                          , 'cont8': 'boxcox'
                          , 'cont9': 'boxcox'
                          , 'cont10': 'boxcox'
                          , 'cont11': 'boxcox'
                          , 'cont12': 'boxcox'
                          , 'cont13': 'abs_mean_shift'
                          , 'cont14': 'abs_mean_shift'
                         }

In [None]:
# function - absolute of mean shifted data (which will be later used in function transformer)
def abs_mean_shift(data):
    return np.abs(data - np.mean(data))
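
A small worked example of what this function does, using made-up numbers: values that lie equally far above and below the mean are mapped to the same result, so the feature is folded around its mean.

```python
import numpy as np

def abs_mean_shift(data):
    # distance of each value from the mean of the data
    return np.abs(data - np.mean(data))

sample = np.array([1.0, 2.0, 4.0, 5.0])  # mean is 3.0
folded = abs_mean_shift(sample)
print(folded)  # [2. 1. 1. 2.]
```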

In [None]:
# import modules specific to preprocessing
from scipy import stats
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics

Apply to each feature the transformation that the EDA phase identified as best suited.

In [None]:
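# loop through the dictionary (created above) and transform the features
# map transformation names to callables rather than calling eval() on strings
transform_functions = {'np.tan': np.tan, 'abs_mean_shift': abs_mean_shift}
transformed_continous_features = []
for k, v in feature_transformation.items():
    print('processing feature: {0}, with transformation: {1}'.format(k, v))
    transformed_feature = 't_' + k
    transformed_continous_features.append(transformed_feature)
    if v == 'boxcox':
        # boxcox needs strictly positive input, hence the +1 shift
        xt, _ = stats.boxcox(full[k].values + 1)
    elif v == 'none':
        xt = full[k].values
    else:
        # pass a 2-D column (the scikit-learn convention) and flatten the result
        xt = FunctionTransformer(transform_functions[v]).fit_transform(full[[k]].values).ravel()
    full[transformed_feature] = xt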

Factorizing the categorical variables
-------------------------------------

There are 116 categorical variables that need to be factorized.
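
For reference, here is what `pd.factorize` returns on a made-up column. With `sort = True` the integer codes follow the sorted order of the distinct values, and factorizing the combined train and test frame gives both sets the same code for the same category.

```python
import pandas as pd

# Made-up categorical column, similar in spirit to the 'cat' features
toy_column = pd.Series(['B', 'A', 'C', 'A'])

codes, uniques = pd.factorize(toy_column, sort=True)
print(codes)          # [1 0 2 0]
print(list(uniques))  # ['A', 'B', 'C']
```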

In [None]:
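factored_categorical_features = []
print("Factorizing feature: ")
# skip the helper column 'dftype', which is also of object dtype but is not a feature
categorical_columns = full.select_dtypes(include = ['object']).columns.drop('dftype')
for column_name in categorical_columns:
    print(column_name)
    factored_feature = 'f_' + column_name
    factored_categorical_features.append(factored_feature)
    # pd.factorize already returns a 1-D array of integer codes
    full[factored_feature] = pd.factorize(full[column_name], sort = True)[0]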

In [None]:
# build a new list so that transformed_continous_features is not modified in place
final_features = transformed_continous_features + factored_categorical_features

Loading the preprocessing and training algorithm modules

In [None]:
from sklearn import preprocessing 
import xgboost as xgb

Standardize the features, then separate the train and test rows again

In [None]:
%%time
X_full = pd.DataFrame(preprocessing.StandardScaler().fit_transform(full[final_features])
                      , columns = final_features
                      , index = full.index)

In [None]:
X_train = X_full[full.dftype == 'train']
X_test = X_full[full.dftype == 'test']
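
A minimal sketch of what the scaling step does, on a hypothetical two-column frame: after `StandardScaler`, each column has mean 0 and unit (population) standard deviation, and wrapping the result in a `DataFrame` restores the column names and index that `fit_transform` drops.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical frame with one transformed and one factorized feature
toy = pd.DataFrame({'t_cont1': [1.0, 2.0, 3.0, 4.0],
                    'f_cat1': [0, 1, 0, 1]})

# fit_transform returns a plain NumPy array, so names and index are reattached
scaled = pd.DataFrame(preprocessing.StandardScaler().fit_transform(toy),
                      columns=toy.columns,
                      index=toy.index)

print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```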

In [None]:
%%time
# custom evaluation function which accounts for the log / exp transformation of the target
def meaerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', metrics.mean_absolute_error(np.exp(preds),np.exp(labels))

# creating train and test Data Matrix for the xgboost training
dtrain = xgb.DMatrix(data = X_train, label = y)
dtest = xgb.DMatrix(data = X_test)
watchlist = [(dtrain, 'train')]
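
Because the model is trained on `log(loss)`, an error computed directly on its predictions is in log units. The sketch below, with made-up loss values, shows the difference between the mean absolute error in log space and the one obtained after mapping both sides back with `np.exp`, which is what the custom evaluation function reports.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted losses on the original scale
true_loss = np.array([100.0, 1000.0, 5000.0])
predicted_loss = np.array([110.0, 900.0, 5500.0])

# What the model sees and produces: log-transformed values
log_labels = np.log(true_loss)
log_preds = np.log(predicted_loss)

mae_log_space = metrics.mean_absolute_error(log_labels, log_preds)
mae_original = metrics.mean_absolute_error(np.exp(log_labels), np.exp(log_preds))

print('MAE in log space:', mae_log_space)
print('MAE on the original scale:', mae_original)  # (10 + 100 + 500) / 3
```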

In [None]:
%%time
# These are the parameters picked by the grid search hyperparameter tuning / optimization
params = {"eta" : 0.01
          , "verbosity": 0   # replaces the deprecated "silent" flag
          , "booster": "gbtree"
          , "max_depth" : 10
          , "min_child_weight" : 9
          , "subsample" : 1
          , "colsample_bytree" : 0.2}

best = xgb.train(params
              , dtrain
              , 4500
              , watchlist
              , early_stopping_rounds = 25
              , feval=meaerror
              , verbose_eval=False
             )
print(best.attributes())
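
The grid search itself is not shown in this notebook. As a rough sketch of the mechanics only, the example below runs `GridSearchCV` on synthetic data with scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost. The data, the grid values and the estimator are all assumptions made for illustration; they are not the search that produced the parameters above, and the XGBoost parameter names (`eta`, `min_child_weight`, `colsample_bytree`) differ from the ones used here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data; a stand-in for X_train and y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Hypothetical grid; every combination is evaluated with 3-fold cross-validation
param_grid = {'max_depth': [2, 4], 'subsample': [0.8, 1.0]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=0),
    param_grid,
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best combination
```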

Plotting feature importance
---------------------------

In [None]:
plt.figure(figsize=(10,30))
xgb.plot_importance(best, ax = plt.subplot(111))

In [None]:
dtest = xgb.DMatrix(data = X_test)
prediction = np.exp(best.predict(dtest))

# build the submission in one step, with the id column first
submission = pd.DataFrame({'id': test['id'].values, 'loss': prediction})
submission.to_csv('xgboost_tuned_hp.csv', index=False)