# Allstate Claims Severity
![Allstate](https://www.allstate.com/resources/Allstate/images/financial/global/allstate-logo-header-170x45.png)

How severe is an insurance claim?

When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve its claims service for the more than 16 million households it protects.

Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

Each row in this dataset represents an insurance claim. You must predict the value for the 'loss' column. Variables prefaced with 'cat' are categorical, while those prefaced with 'cont' are continuous.
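The naming convention above makes it easy to split the columns by type. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative assumptions, not taken from `train.csv`); only the column-name prefixes matter:

```python
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2],
    "cat1": ["A", "B"],
    "cat2": ["B", "B"],
    "cont1": [0.72, 0.44],
    "loss": [2213.18, 1283.60],
})

# Select columns by the prefix convention described above.
cat_cols = [c for c in toy.columns if c.startswith("cat")]
cont_cols = [c for c in toy.columns if c.startswith("cont")]

print("Categorical:", cat_cols)
print("Continuous:", cont_cols)
```

The same two list comprehensions should work on the real training frame, assuming every feature column follows the `cat`/`cont` prefix rule.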

In [20]:
# Import relevant libraries
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt

In [26]:
# Read the train and test claims CSV files into pandas DataFrames
claims_train = pd.read_csv("data/train.csv")
claims_test = pd.read_csv("data/test.csv")
claims_train.dtypes[0:5]

cat1    object
cat2    object
cat3    object
cat4    object
dtype: object

In [27]:
# Convert categorical features from object to str for later use as categorical features in model selection
for i in range(1,117):
    col_name = 'cat' + str(i)
    claims_train[col_name] = claims_train[col_name].astype('str')
    claims_test[col_name] = claims_test[col_name].astype('str')

In [28]:
# Print out the size of the dataset and a sample of five rows
print("Row count: ", claims_train.shape[0])
print("Column count: ", claims_train.shape[1])
claims_train.head(5)

Row count:  188318
Column count:  132


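| | id | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 | cont14 | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | B | A | B | A | A | A | A | B | ... | 0.718367 | 0.33506 | 0.3026 | 0.67135 | 0.8351 | 0.569745 | 0.594646 | 0.822493 | 0.714843 | 2213.18 |
| 1 | 2 | A | B | A | A | A | A | A | A | B | ... | 0.438917 | 0.436585 | 0.60087 | 0.35127 | 0.43919 | 0.338312 | 0.366307 | 0.611431 | 0.304496 | 1283.6 |
| 2 | 5 | A | B | A | A | B | A | A | A | B | ... | 0.289648 | 0.315545 | 0.2732 | 0.26076 | 0.32446 | 0.381398 | 0.373424 | 0.195709 | 0.774425 | 3005.09 |
| 3 | 10 | B | B | A | B | A | A | A | A | B | ... | 0.440945 | 0.391128 | 0.31796 | 0.32128 | 0.44467 | 0.327915 | 0.32157 | 0.605077 | 0.602642 | 939.85 |
| 4 | 11 | A | B | A | B | A | A | A | A | B | ... | 0.178193 | 0.247408 | 0.24564 | 0.22089 | 0.2123 | 0.204687 | 0.202213 | 0.246011 | 0.432606 | 2763.85 |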


In [45]:
from sklearn.preprocessing import LabelEncoder
X = pd.concat([claims_train,claims_test])
le_dict = {}
for i in range(1,117):
    col_name = 'cat'+str(i)
    le_dict[col_name] = LabelEncoder()
    X.loc[:,col_name] = le_dict[col_name].fit_transform(X[col_name])

In [46]:
X.head()

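| | cat1 | cat10 | cat100 | cat101 | cat102 | cat103 | cat104 | cat105 | cat106 | cat107 | ... | cont2 | cont3 | cont4 | cont5 | cont6 | cont7 | cont8 | cont9 | id | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 6 | 0 | 0 | 8 | 4 | 6 | 9 | ... | 0.245921 | 0.187583 | 0.789639 | 0.310061 | 0.718367 | 0.33506 | 0.3026 | 0.67135 | 1 | 2213.18 |
| 1 | 0 | 1 | 11 | 5 | 0 | 0 | 4 | 4 | 8 | 10 | ... | 0.737068 | 0.592681 | 0.614134 | 0.885834 | 0.438917 | 0.436585 | 0.60087 | 0.35127 | 2 | 1283.6 |
| 2 | 0 | 1 | 11 | 14 | 0 | 1 | 4 | 5 | 7 | 5 | ... | 0.358319 | 0.484196 | 0.236924 | 0.397069 | 0.289648 | 0.315545 | 0.2732 | 0.26076 | 5 | 3005.09 |
| 3 | 1 | 0 | 8 | 3 | 0 | 0 | 4 | 4 | 8 | 10 | ... | 0.555782 | 0.527991 | 0.373816 | 0.422268 | 0.440945 | 0.391128 | 0.31796 | 0.32128 | 10 | 939.85 |
| 4 | 0 | 1 | 5 | 9 | 0 | 0 | 3 | 4 | 10 | 6 | ... | 0.15999 | 0.527991 | 0.473202 | 0.704268 | 0.178193 | 0.247408 | 0.24564 | 0.22089 | 11 | 2763.85 |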
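One likely reason for fitting each `LabelEncoder` on the concatenated train and test frames is that a level appearing only in the test set would otherwise be unknown to the encoder. The sketch below illustrates this on two made-up series (`train_col` and `test_col` are hypothetical names, not columns from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data: the level "C" occurs only in the test column.
train_col = pd.Series(["A", "B", "A"])
test_col = pd.Series(["B", "C"])

# An encoder fitted on train alone cannot transform the unseen level.
train_only = LabelEncoder().fit(train_col)
try:
    train_only.transform(test_col)
    unseen_failed = False
except ValueError:
    unseen_failed = True

# Fitting on the concatenation gives every level an integer code.
combined = LabelEncoder().fit(pd.concat([train_col, test_col]))
codes = combined.transform(test_col)

print("Train-only encoder failed on unseen level:", unseen_failed)
print("Classes:", list(combined.classes_))
print("Test codes:", list(codes))
```

Note that the integer codes follow the sorted order of the levels, so they impose an arbitrary ordering on the categories; tree-based models usually tolerate this, but it is a design choice rather than a neutral encoding.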


## XGBoost Model

In [49]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.preprocessing import LabelEncoder
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 12, 4

#columns_to_use = []

# Join the features from train and test together
X = pd.concat([claims_train,claims_test])

# XGBoost doesn't (yet) handle categorical features automatically, so we need to change
# them to columns of integer values.
# See http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
le_dict = {}
for i in range(1,117):
    col_name = 'cat'+str(i)
    le_dict[col_name] = LabelEncoder()
    X.loc[:,col_name] = le_dict[col_name].fit_transform(X[col_name])

# Prepare the inputs for the model
y = 'loss'
IDcol = 'id'

def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=10, early_stopping_rounds=50,dump=False,model_name='xgb'):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[y].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='rmse', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain[y],eval_metric='rmse')
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
        
    #Print model report:
    print("\nModel Report\n", str(alg))
    print("MSE : %.4g" % metrics.mean_squared_error(dtrain[y].values, dtrain_predictions))
    print("RMSE : %.4g" % np.sqrt(abs(metrics.mean_squared_error(dtrain[y].values, dtrain_predictions))))
    print("R^2 : %f" % metrics.r2_score(dtrain[y].values, dtrain_predictions))
                    
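    # Plot split-count feature importances; the scores run along the x-axis of the horizontal bar chart.
    # Assumes a recent xgboost release, where get_booster() replaced the older booster() accessor.
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=True)
    feat_imp.plot(kind='barh', title='Feature Importances')
    plt.xlabel('Feature Importance Score')
    
    if dump:
        # dump_model writes the trees to a text file and returns None, so there is nothing to print
        alg.get_booster().dump_model("Model-%s.txt" % model_name)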
        

#Choose all predictors except target & IDcols
predictors = [x for x in X.columns if x not in [y, IDcol]]
xgb1 = XGBRegressor(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'reg:linear',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
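# Fit on the labeled training rows only: the test rows in X have no 'loss' value (NaN after the concat)
train_rows = X[X[y].notnull()]
modelfit(xgb1, train_rows, predictors)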

ImportError: No module named 'xgboost'
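The model report in `modelfit` prints MSE, RMSE, and R^2 on the training data. As a quick sanity check of how these three numbers relate, here is a minimal sketch on made-up targets and predictions (the arrays are illustrative assumptions, not results from this dataset); it needs only NumPy and scikit-learn, not xgboost:

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, purely illustrative (not Allstate data)
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])

mse = metrics.mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                                # same units as the target
r2 = metrics.r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print("MSE : %.4g" % mse)
print("RMSE : %.4g" % rmse)
print("R^2 : %f" % r2)
```

Because these scores are computed on the same rows the model was fitted on, they are likely to be optimistic; the cross-validated RMSE from `xgb.cv` is the better guide to out-of-sample error.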