Accurately predicting claims severity (the dollar loss amount resulting from an insurance claim) is very important to an insurance company for several reasons:

1. The longer a claim stays open, the more it will cost to eventually close. Closing claims earlier helps to reduce claim costs overall, and allows for efficient allocation of resources. 
2. Insurance companies are required to set aside funds that will eventually pay for claims. The amount of these funds should not be too high or too low. If they are too high, then that is money that could be better used elsewhere in the company. If the amount is too low, the company will eventually have to raise the amount to pay claims, and this movement may send the wrong signal to investors.

Allstate made data available for a Kaggle competition in 2016. Because the data is sensitive, both for the privacy of policyholders and for the business interests of Allstate, the columns are anonymized and no data dictionary is provided. Allstate has provided 116 categorical variables with no labeling. The meaning of some of these can probably be guessed: for example, several of them take only the values "A" or "B", and one of these likely identifies "male" and "female".

In addition, Allstate provided 14 continuous variables. It appears that these variables have already been transformed, as the values for each input are all between 0 and 1. One of these likely represents age of the driver, for example. 

The goal of the algorithm is to use the provided data to accurately predict claim severity, which is a continuous number. Because the variables are anonymized, it is difficult to guess which ones may have greater or lesser value in predicting claim severity. That being said, I believe in general that claim severity can be predicted using detailed claim data.

In [1]:
import pandas as pd

In [4]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\Users\dsingh\Documents\Data Science\Allstate_Kaggle\train.csv')

In [5]:
df.head()

Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
0,1,A,B,A,B,A,A,A,A,B,...,0.718367,0.33506,0.3026,0.67135,0.8351,0.569745,0.594646,0.822493,0.714843,2213.18
1,2,A,B,A,A,A,A,A,A,B,...,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,1283.6
2,5,A,B,A,A,B,A,A,A,B,...,0.289648,0.315545,0.2732,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,3005.09
3,10,B,B,A,B,A,A,A,A,B,...,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.32157,0.605077,0.602642,939.85
4,11,A,B,A,B,A,A,A,A,B,...,0.178193,0.247408,0.24564,0.22089,0.2123,0.204687,0.202213,0.246011,0.432606,2763.85


The "loss" amount in the final column is what we are trying to predict.
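The observations made earlier about the columns (unlabeled categoricals, some with only two levels, and continuous variables scaled to between 0 and 1) can be checked directly. The following is a minimal sketch, assuming the `cat*` / `cont*` naming seen in the header above; the helper name `summarize_columns` is my own, and the small frame is built from four of the rows shown in `df.head()` rather than the full file.

```python
import pandas as pd


def summarize_columns(frame):
    """Count categorical/continuous columns and check the 0-1 scaling."""
    cat_cols = [c for c in frame.columns if c.startswith("cat")]
    cont_cols = [c for c in frame.columns if c.startswith("cont")]
    # True only if every continuous value lies in [0, 1]
    in_unit_interval = bool(
        ((frame[cont_cols] >= 0) & (frame[cont_cols] <= 1)).all().all()
    )
    # Categoricals with exactly two observed levels (e.g. "A"/"B")
    levels = frame[cat_cols].nunique()
    return {
        "n_categorical": len(cat_cols),
        "n_continuous": len(cont_cols),
        "cont_in_unit_interval": in_unit_interval,
        "binary_categoricals": levels[levels == 2].index.tolist(),
    }


# Subset of the rows and columns shown in df.head(), for illustration only
sample = pd.DataFrame({
    "id": [1, 2, 5, 10],
    "cat1": ["A", "A", "A", "B"],
    "cat2": ["B", "B", "B", "B"],
    "cont6": [0.718367, 0.438917, 0.289648, 0.440945],
    "cont7": [0.33506, 0.436585, 0.315545, 0.391128],
    "loss": [2213.18, 1283.60, 3005.09, 939.85],
})
summary = summarize_columns(sample)
```

On the full training file, `summarize_columns(df)` would be expected to report 116 categorical and 14 continuous columns if the description above is right.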

# DATA I WISH I HAD

Other than the obvious (a data dictionary), there are a few items I would have liked to have tested. Examples:
1. Date the accident occurred and date the claim was filed (I would have used the delta between these as a variable)
2. A description of the claim from the initial filing of the claim (several sentences describing what happened, in the policyholder's words). I would have liked to search for key words that may have been significant indicators of claim severity.
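
If those two items had been available, the features might have been built roughly as follows. This is only a sketch: the column names `accident_date`, `filed_date` and `description`, the keyword list, and the two example claims are all hypothetical, since the Allstate data contains none of them.

```python
import pandas as pd


def add_claim_features(claims, keywords=("injury", "totaled", "hospital")):
    """Add a reporting-lag column and one flag column per keyword."""
    out = claims.copy()
    accident = pd.to_datetime(out["accident_date"])
    filed = pd.to_datetime(out["filed_date"])
    # Delta between accident and filing, in whole days
    out["report_lag_days"] = (filed - accident).dt.days
    # Case-insensitive plain-text keyword match on the description
    description = out["description"].fillna("").str.lower()
    for word in keywords:
        out[f"mentions_{word}"] = description.str.contains(word, regex=False)
    return out


# Hypothetical claims, invented purely to exercise the function
example = pd.DataFrame({
    "accident_date": ["2016-01-04", "2016-02-10"],
    "filed_date": ["2016-01-06", "2016-03-01"],
    "description": [
        "Rear-ended at a stop light, minor bumper damage.",
        "Car was totaled; driver taken to hospital.",
    ],
})
features = add_claim_features(example)
```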

The data I have is probably focused on details about the car(s) involved, the claimant(s) involved, and the type of loss (whether damage to a car, injury to a person, both, etc.). This is likely sufficient to come up with some good predictions.

We also don't know what time period the data is from. I am assuming that there is no time series element involved, even though inflation certainly influences claim severity over time. Because the time period is not explicitly stated, this is another unknown.

# QUESTIONS I HAVE

I am unclear on whether or not I can use multiple ML algorithms. I am assuming that I can, but I don't know what that might look like and what the protocols are. For example, can I use a classifier such as logistic regression (or a clustering algorithm) to group my data into buckets, and then run linear regression within each bucket? Would that ever make sense to do? In general: how can I make this more nuanced than just using lasso and some gridsearch?
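
To make the question concrete, here is one way the bucket-then-regress idea could look. It is a sketch under several assumptions of mine: scikit-learn is available, there are only two buckets, the cut point is the median loss, and the data is synthetic (random numbers, not the Allstate data). The function names `fit_two_stage` and `predict_two_stage` are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression


def fit_two_stage(X, y):
    """Stage 1: classify rows as low/high severity. Stage 2: one regression per bucket."""
    threshold = np.median(y)              # assumed cut point between buckets
    bucket = (y > threshold).astype(int)  # 0 = low severity, 1 = high severity
    classifier = LogisticRegression(max_iter=1000).fit(X, bucket)
    regressors = {
        b: LinearRegression().fit(X[bucket == b], y[bucket == b]) for b in (0, 1)
    }
    return classifier, regressors


def predict_two_stage(classifier, regressors, X):
    """Route each row to the regression fitted for its predicted bucket."""
    predicted_bucket = classifier.predict(X)
    predictions = np.empty(len(X))
    for b, model in regressors.items():
        mask = predicted_bucket == b
        if mask.any():
            predictions[mask] = model.predict(X[mask])
    return predictions


# Synthetic stand-in data, generated only so the sketch runs end to end
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 1000 + 3000 * X_demo[:, 0] + rng.normal(0, 50, size=200)
clf, regs = fit_two_stage(X_demo, y_demo)
demo_predictions = predict_two_stage(clf, regs, X_demo)
```

Whether this does any better than a single lasso model is something I would have to check on held-out data; a row routed to the wrong bucket gets the wrong regression, so the two stages can also make things worse.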

Also: see below in "Domain Knowledge" section. Any thoughts on how I can apply domain knowledge given the limitations? 

# OUTCOMES

The output will be a set of claim severity predictions (a loss dollar amount prediction for each row of data). It is difficult to know how accurate my model has to be in order to be considered a "success"; however, personal automobile insurance is an extremely competitive field. Any gain, even if not huge, is good.

The target audience should be expecting the same output as I am expecting to produce: claim severity predictions. The target audience in this case is probably quite analytically sophisticated, given that Allstate already has a data science team in place.

That said: it is difficult to know how successful the algorithm needs to be in order to achieve a gain, because we don't know how successful Allstate already is at anticipating claim severity with other methods already in place. 
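
Without Allstate's internal benchmark, one possible yardstick is a naive baseline: predict the same loss for every claim and see how far off that is. The sketch below assumes mean absolute error as the metric and the training median as the naive prediction, both of which are my choices. It uses the five loss values shown in `df.head()`, split arbitrarily into "train" and "test" purely for illustration.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Average absolute dollar difference between actual and predicted loss."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def baseline_mae(train_loss, test_loss):
    """MAE obtained by predicting the training median for every test row."""
    median_loss = float(np.median(train_loss))
    return mean_absolute_error(test_loss, np.full(len(test_loss), median_loss))


# Loss values from df.head(); the train/test split here is arbitrary
train_loss = [2213.18, 1283.60, 3005.09]
test_loss = [939.85, 2763.85]
naive_error = baseline_mae(train_loss, test_loss)
```

Any model worth keeping should at least beat this number on the same held-out rows.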

# DOMAIN KNOWLEDGE

I am an actuary working in the insurance industry, so I do have plenty of domain knowledge. I could have used this knowledge to create new variables (by combining some variables that we have), but not knowing what the variables are makes it a lot harder to apply domain knowledge. 