# Allstate Claims Severity Prediction Project

## Introduction

For this project, we take part in an expired Kaggle competition that was sponsored by Allstate. As an insurance provider, Allstate wanted better insight into how severe an insured person's claim would be, based on a set of features. I'll let Kaggle explain the rest:

"Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience."

Our model will be scored with the Mean Absolute Error metric. Let's begin!
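
For reference, MAE is the average of the absolute differences between the predicted and the actual loss. Here is a minimal sketch with made-up numbers (the three values below are hypothetical and not taken from the Allstate data), computing it by hand and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted claim losses (illustrative values only)
actual = np.array([2000.0, 1500.0, 3000.0])
predicted = np.array([1800.0, 1700.0, 2500.0])

# MAE by hand: the mean of |actual - predicted|
manual_mae = np.mean(np.abs(actual - predicted))

# The same quantity via scikit-learn, which is the function used later in this notebook
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

Because the errors are not squared, one badly mispredicted claim influences MAE less than it would influence a squared-error metric.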

### Preprocessing

In [14]:
# As always, we import our necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [15]:
# First things first: we load in our datasets
train = pd.read_csv('allstate_train.csv')
test = pd.read_csv('allstate_test.csv')

In [16]:
# Let's get an idea of what our datasets look like
train

Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
0,1,A,B,A,B,A,A,A,A,B,...,0.718367,0.335060,0.30260,0.67135,0.83510,0.569745,0.594646,0.822493,0.714843,2213.18
1,2,A,B,A,A,A,A,A,A,B,...,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,1283.60
2,5,A,B,A,A,B,A,A,A,B,...,0.289648,0.315545,0.27320,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,3005.09
3,10,B,B,A,B,A,A,A,A,B,...,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.321570,0.605077,0.602642,939.85
4,11,A,B,A,B,A,A,A,A,B,...,0.178193,0.247408,0.24564,0.22089,0.21230,0.204687,0.202213,0.246011,0.432606,2763.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188313,587620,A,B,A,A,A,A,A,A,B,...,0.242437,0.289949,0.24564,0.30859,0.32935,0.223038,0.220003,0.333292,0.208216,1198.62
188314,587624,A,A,A,A,A,B,A,A,A,...,0.334270,0.382000,0.63475,0.40455,0.47779,0.307628,0.301921,0.318646,0.305872,1108.34
188315,587630,A,B,A,A,A,A,A,B,B,...,0.345883,0.370534,0.24564,0.45808,0.47779,0.445614,0.443374,0.339244,0.503888,5762.64
188316,587632,A,B,A,A,A,A,A,A,B,...,0.704364,0.562866,0.34987,0.44767,0.53881,0.863052,0.852865,0.654753,0.721707,1562.87


According to the competition rules, we are trying to predict the loss column. The "cat" columns are categorical variables, while the "cont" columns are continuous. As always, the first thing we should do is check how dirty our dataset currently is.

In [17]:
# We have 132 columns. Ample room to have null values. Let's check the Null value situation for both datasets
train.isnull().sum().value_counts()

0    132
dtype: int64

In [18]:
# And for our test set?
test.isnull().sum().value_counts()

0    131
dtype: int64

Nice, no null values. Note that the difference in column numbers is because the test set excludes the "loss" column. In any event, we can proceed to feature engineering.

### Categorical Variables

I always like to understand how to handle the categorical variables of any dataset. Luckily, because we don't have any null values, we will not have to perform any imputation.

In [19]:
# First, let's find out whether the categorical columns take any values other than A and B
# We first isolate the categorical columns
categorical_columns = [c for c in train.columns if train[c].dtype == 'O']

In [20]:
# We then iterate over these columns and print their value counts to see whether any column has a 'C' or 'D' value
for c in categorical_columns:
    freq = train[c].value_counts()
    print(f"The distribution of the {c} column is\n{freq}\n")

The distribution of the cat1 column is
A    141550
B     46768
Name: cat1, dtype: int64

The distribution of the cat2 column is
A    106721
B     81597
Name: cat2, dtype: int64

The distribution of the cat3 column is
A    177993
B     10325
Name: cat3, dtype: int64

The distribution of the cat4 column is
A    128395
B     59923
Name: cat4, dtype: int64

The distribution of the cat5 column is
A    123737
B     64581
Name: cat5, dtype: int64

The distribution of the cat6 column is
A    131693
B     56625
Name: cat6, dtype: int64

The distribution of the cat7 column is
A    183744
B      4574
Name: cat7, dtype: int64

The distribution of the cat8 column is
A    177274
B     11044
Name: cat8, dtype: int64

The distribution of the cat9 column is
A    113122
B     75196
Name: cat9, dtype: int64

The distribution of the cat10 column is
A    160213
B     28105
Name: cat10, dtype: int64

The distribution of the cat11 column is
A    168186
B     20132
Name: cat11, dtype: int64

The distribution 

Notice that in the later columns, after cat73, we start to see more values than just A and B. In fact, in some of the columns, namely the last column, we have 326 different possible values. This leaves us with an interesting dilemma. If we were to perform dummy encoding, that last column alone would add 326 new columns on top of the 200+ columns produced by the earlier categorical features.

We will perform dummy variable creation for now. However, we might want to come back and think of a better way to handle these high-cardinality columns.
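
One possible "better way", sketched here only as an idea and not used anywhere else in this notebook, is to collapse rarely seen values into a single bucket before creating dummy variables. The helper name `collapse_rare_levels`, the `OTHER` label, and the `min_count` threshold are all hypothetical choices, and the toy column merely stands in for a feature such as cat116:

```python
import pandas as pd

def collapse_rare_levels(series, min_count, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the original value where it is not rare; otherwise substitute the bucket label
    return series.where(~series.isin(rare), other_label)

# Toy column standing in for a high-cardinality categorical feature
toy = pd.Series(["A", "A", "A", "B", "B", "C", "D"], name="toy_cat")

collapsed = collapse_rare_levels(toy, min_count=2)
dummies = pd.get_dummies(collapsed, prefix="toy_cat")

print(list(dummies.columns))
```

On the toy column, the four original values shrink to three dummy columns because C and D each appear only once. Whether this helps on the real data would need to be checked against the cross-validation MAE.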

In [21]:
# We'll create a function to handle this
def categorical_conversion(dataset):
    for c in categorical_columns:  # We iterate over our categorical columns
        cats = pd.get_dummies(dataset[c], prefix=f'{c}')  # We then create the dummy variable columns 
        dataset = pd.concat([dataset, cats], axis=1)  # We then append these columns to the end of the dataframe
        dataset.drop(c, axis=1, inplace=True)  # We then drop the original column
    return dataset

In [22]:
# We then apply our function to our dataframe
new_train = categorical_conversion(train)

In [23]:
# Let's take a look to see if our conversion worked
new_train.head()

Unnamed: 0,id,cont1,cont2,cont3,cont4,cont5,cont6,cont7,cont8,cont9,...,cat116_P,cat116_Q,cat116_R,cat116_S,cat116_T,cat116_U,cat116_V,cat116_W,cat116_X,cat116_Y
0,1,0.7263,0.245921,0.187583,0.789639,0.310061,0.718367,0.33506,0.3026,0.67135,...,0,0,0,0,0,0,0,0,0,0
1,2,0.330514,0.737068,0.592681,0.614134,0.885834,0.438917,0.436585,0.60087,0.35127,...,0,0,0,0,0,0,0,0,0,0
2,5,0.261841,0.358319,0.484196,0.236924,0.397069,0.289648,0.315545,0.2732,0.26076,...,0,0,0,0,0,0,0,0,0,0
3,10,0.321594,0.555782,0.527991,0.373816,0.422268,0.440945,0.391128,0.31796,0.32128,...,0,0,0,0,0,0,0,0,0,0
4,11,0.273204,0.15999,0.527991,0.473202,0.704268,0.178193,0.247408,0.24564,0.22089,...,0,0,0,0,0,0,0,0,0,0


### Feature Matrix and Target Column Construction for Training and Test Sets

In [24]:
# We now proceed to define our feature matrix and our target column.
X = new_train.drop(["id", "loss"], axis=1)
y = new_train["loss"]

However, one thing to note is that some of the categorical values captured in our training matrix are not shared with our test matrix. As a result, the final training feature matrix and test feature matrix will have different dimensions and column names. For example, a value such as cat116_U may appear in our training matrix but never in our test matrix.

As such, we should make sure that the features we select are shared between the test and training sets so that our training and prediction pipelines remain consistent.

In [25]:
# We proceed by applying the same transformations to our test set that we just applied to our training set.
# We first isolate the id column of our test data. This will help us when we go to make our submission
test_ids = test["id"]

# We then put our test dataframe through our categorical feature conversion process
new_test = categorical_conversion(test)

# We then drop the id column
test_X = new_test.drop("id", axis=1)

In [26]:
# We then create a shared feature matrix so that our training and test matrices have the same dimensions
shared_columns = [column for column in test_X.columns if column in X.columns]

In [27]:
# We then filter both dataframes to contain the same columns
test_X = test_X[shared_columns]
X = X[shared_columns]
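
Intersecting the column lists discards any dummy column that appears in only one of the two sets. An alternative, shown here as a sketch on toy data rather than as what this notebook does, is to keep every training column and reindex the test matrix to match, filling dummy columns the test set never produced with zeros. The frames `toy_train` and `toy_test` are hypothetical:

```python
import pandas as pd

# Toy frames: value "C" appears only in training, value "D" only in test
toy_train = pd.DataFrame({"cat": ["A", "B", "C"], "cont": [0.1, 0.2, 0.3]})
toy_test = pd.DataFrame({"cat": ["A", "B", "D"], "cont": [0.4, 0.5, 0.6]})

train_enc = pd.get_dummies(toy_train, columns=["cat"])
test_enc = pd.get_dummies(toy_test, columns=["cat"])

# Keep every training column, in training order. Columns missing from the test
# set are added as all-zero columns, and test-only columns are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

With this approach the model still sees training-only values such as cat_C during fitting, while the test matrix keeps the exact column order the model expects.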

In [28]:
# We now proceed to split our training set and create a cross-validation set to help us assess whether we are overfitting or underfitting
threshold = int(len(X) * .7)
train_X, cv_X = X[:threshold], X[threshold:]
train_y, cv_y = y[:threshold], y[threshold:]
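
The split above takes the first 70% of the rows in file order. If the rows happen to be ordered in some meaningful way, a shuffled split may give a more representative cross-validation set. Here is a small sketch using scikit-learn's `train_test_split` on toy data. The toy frames and the `random_state` value are arbitrary assumptions, and this is not the split used for the results reported below:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix X and the target column y
toy_X = pd.DataFrame({"cont1": np.linspace(0, 1, 10)})
toy_y = pd.Series(np.arange(10, dtype=float), name="loss")

# Shuffle the rows, then hold out 30% of them for cross-validation
toy_train_X, toy_cv_X, toy_train_y, toy_cv_y = train_test_split(
    toy_X, toy_y, test_size=0.3, shuffle=True, random_state=42
)

print(toy_train_X.shape, toy_cv_X.shape)
```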

In [29]:
# Let's preview our matrices to confirm that they share the same columns
train_X.head()

Unnamed: 0,cont1,cont2,cont3,cont4,cont5,cont6,cont7,cont8,cont9,cont10,...,cat116_MU,cat116_MV,cat116_MW,cat116_O,cat116_Q,cat116_R,cat116_S,cat116_T,cat116_U,cat116_Y
0,0.7263,0.245921,0.187583,0.789639,0.310061,0.718367,0.33506,0.3026,0.67135,0.8351,...,0,0,0,0,0,0,0,0,0,0
1,0.330514,0.737068,0.592681,0.614134,0.885834,0.438917,0.436585,0.60087,0.35127,0.43919,...,0,0,0,0,0,0,0,0,0,0
2,0.261841,0.358319,0.484196,0.236924,0.397069,0.289648,0.315545,0.2732,0.26076,0.32446,...,0,0,0,0,0,0,0,0,0,0
3,0.321594,0.555782,0.527991,0.373816,0.422268,0.440945,0.391128,0.31796,0.32128,0.44467,...,0,0,0,0,0,0,0,0,0,0
4,0.273204,0.15999,0.527991,0.473202,0.704268,0.178193,0.247408,0.24564,0.22089,0.2123,...,0,0,0,0,0,0,0,0,0,0


In [30]:
test_X.head()

Unnamed: 0,cont1,cont2,cont3,cont4,cont5,cont6,cont7,cont8,cont9,cont10,...,cat116_MU,cat116_MV,cat116_MW,cat116_O,cat116_Q,cat116_R,cat116_S,cat116_T,cat116_U,cat116_Y
0,0.321594,0.299102,0.246911,0.402922,0.281143,0.466591,0.317681,0.61229,0.34365,0.38016,...,0,0,0,0,0,0,0,0,0,0
1,0.634734,0.620805,0.65431,0.946616,0.836443,0.482425,0.44376,0.7133,0.5189,0.60401,...,0,0,0,0,0,0,0,0,0,0
2,0.290813,0.737068,0.711159,0.412789,0.718531,0.212308,0.325779,0.29758,0.34365,0.30529,...,0,0,0,0,0,0,0,0,0,0
3,0.268622,0.681761,0.592681,0.354893,0.397069,0.36993,0.342355,0.40028,0.33237,0.3148,...,0,0,0,0,0,0,0,0,0,0
4,0.553846,0.299102,0.26357,0.696873,0.302678,0.398862,0.391833,0.23688,0.43731,0.50556,...,0,0,0,0,0,0,0,0,0,0


Both matrices have 1,079 columns. Our transformation worked! We now proceed to actually building, training, and applying our models.

### Model 1: Linear Regression

In [31]:
# We first create our model and then fit it to the training data
lr = LinearRegression()
lr.fit(train_X, train_y)

LinearRegression()

In [32]:
# We then generate predictions on the training set and cv set to help us determine if our model is over/underfitting
# We will use a function to help us accomplish this task for each of our four models
def cross_validate_score(model):
    train_preds = model.predict(train_X)
    cv_preds = model.predict(cv_X)
    train_mae = mean_absolute_error(train_y, train_preds)
    cv_mae = mean_absolute_error(cv_y, cv_preds)
    print(f"The model in this section had a training MAE of {round(train_mae, 5)}.")
    print(f"The model in this section had a cross-validation MAE of {round(cv_mae, 5)}.")

In [33]:
cross_validate_score(lr)

The model in this section had a training MAE of 1287.55729.
The model in this section had a cross-validation MAE of 120574716.41838.


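Conclusion: Linear Regression generalizes terribly. The training MAE is reasonable, but the cross-validation MAE explodes, which suggests that the unregularized coefficients on our many sparse dummy columns are unstable. Let's modify this model slightly by adding regularization.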

### Model 2: Ridge Regression

In [34]:
# We first import the ridge regression class
from sklearn.linear_model import Ridge

In [35]:
rr = Ridge(alpha=0.7)
rr.fit(train_X, train_y)

Ridge(alpha=0.7)

In [36]:
cross_validate_score(rr)

The model in this section had a training MAE of 1289.71248.
The model in this section had a cross-validation MAE of 1303.04345.
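
The alpha of 0.7 above is a single hand-picked value. One way to compare a few candidates is scikit-learn's `RidgeCV`, sketched below on synthetic data. The candidate list, the synthetic features, and the 5-fold setting are assumptions made for illustration, so the selected alpha says nothing about the Allstate data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data standing in for the real feature matrix
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5))
true_coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
toy_y = toy_X @ true_coefs + rng.normal(scale=0.1, size=200)

# Candidate regularization strengths, scored by MAE to match the competition metric
candidate_alphas = [0.1, 0.7, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=candidate_alphas, scoring="neg_mean_absolute_error", cv=5)
ridge_cv.fit(toy_X, toy_y)

print(ridge_cv.alpha_)
```

After fitting, `ridge_cv.alpha_` holds the candidate with the best average score across the folds.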


### Model 3: XGBoost

In [52]:
xgb = XGBRegressor(objective="reg:squarederror", n_estimators=100, max_depth=4, reg_alpha=10)
xgb.fit(train_X, train_y)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=4, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=10, reg_lambda=1, ...)

In [53]:
cross_validate_score(xgb)

The model in this section had a training MAE of 1172.29201.
The model in this section had a cross-validation MAE of 1206.93484.


Conclusion: A far more balanced model than our plain linear regression model, and our best cross-validation MAE so far.

### Model 4: Random Forest

For our random forest model, I wanted to run GridSearchCV to find appropriate hyperparameters. However, due to the size of the Allstate dataset, my computer, while by no means a pea-shooter, simply could not train the large number of models efficiently. As a result, I had to settle for a single RandomForestRegressor model with n_estimators equal to 100 to get a result in a reasonable time (it took about 10 minutes to train).

Random forests can be slow to train on a dataset this large and this wide, and now I know why. Due to my lack of patience, I won't be performing a grid search to try to address the overfitting problem.
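
A cheaper option than an exhaustive grid, sketched here under stated assumptions and not run on the real data, is to sample only a few hyperparameter combinations with `RandomizedSearchCV` and fit them on a random subsample of the rows. The synthetic data, the subsample size, the parameter ranges, and `n_iter=4` are all hypothetical choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training matrix and the loss column
rng = np.random.default_rng(0)
toy_X = rng.uniform(size=(300, 4))
toy_y = 1000 * toy_X[:, 0] + 500 * toy_X[:, 1] + rng.normal(scale=50, size=300)

# Search on a random subsample of the rows to keep each fit cheap
sample_idx = rng.choice(len(toy_X), size=150, replace=False)

param_distributions = {
    "min_samples_leaf": [1, 4, 16],
    "max_depth": [4, 8, None],
    "max_features": [0.5, 1.0],
}

# Only n_iter combinations are tried, instead of all 18 in the full grid
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(toy_X[sample_idx], toy_y[sample_idx])

print(search.best_params_)
```

The best combination found on the subsample could then be refit once on the full training set, which costs one full-size fit rather than one per candidate.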

In [39]:
# We then instantiate our RandomForestRegressor. We adjust the min_samples_leaf hyperparameter to help "regularize" our model
rfr = RandomForestRegressor(n_estimators=100, min_samples_leaf=4)

In [40]:
rfr.fit(train_X, train_y)

RandomForestRegressor(min_samples_leaf=4)

In [41]:
# Let's run this through our MAE function
cross_validate_score(rfr)

The model in this section had a training MAE of 728.38175.
The model in this section had a cross-validation MAE of 1228.22401.


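Conclusion: Our lowest training MAE yet! However, the cross-validation MAE (1228.22) is slightly worse than XGBoost's (1206.93), and the large gap between the two scores shows that the model is definitely still overfitting. This is something we will look to correct going forward.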

### Generating our Kaggle Submissions

The time has now finally come for us to apply our models to the test set. With our test set properly converted, we will export the predictions to a csv file. We will then submit these to Kaggle and report the results in the conclusion section.

In [64]:
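def kaggle_submission(model, file_name):
    # Predict the loss for every row of the test feature matrix
    loss = model.predict(test_X)
    # Index the predictions by the saved test ids and name the single column 'loss'
    submission_df = pd.DataFrame({"loss": loss}, index=test_ids)
    # The index is named 'id', so the csv contains the two columns id and loss
    submission_df.to_csv(file_name)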

In [68]:
# Model 2
kaggle_submission(rr, "ridge_reg.csv")

In [66]:
# Model 3
kaggle_submission(xgb, "xgb.csv")

In [67]:
# Model 4
kaggle_submission(rfr, "random_forest.csv")

## Conclusion

After a lot of hard work and fine-tuning, let's dig into the results and see how our models scored. My hypothesis was that our XGB model would perform the best, while our ridge regression model would perform the worst. Here are our results:
1. XGBoost: 1194 (good enough for about 2203/3047 on the leaderboard)
2. Random Forest: 1223
3. Ridge Regression: 1293

This is the order in which I expected the models to come in. I'm happy that our XGBoost model performed as well as it did. However, I would definitely like to make some improvements here. I will probably come back at some point and perform GridSearchCV on this model. Another idea I had was to check whether there are any outliers in our training data. Handling them might improve performance slightly.
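
As a starting point for the outlier check, here is a rough sketch that flags values outside the usual 1.5 × IQR fences. The helper name `iqr_outlier_mask`, the multiplier `k`, and the toy loss values are hypothetical. On the real data this would be applied to `train["loss"]`, and whether to drop, cap, or keep the flagged rows is a separate decision:

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy losses (made up) with one extreme claim
toy_loss = pd.Series([900.0, 1100.0, 1200.0, 1300.0, 1500.0, 2200.0, 60000.0])

mask = iqr_outlier_mask(toy_loss)
print(toy_loss[mask].tolist())
```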