
## Project:  Allstate Claims Severity
#### Author:   Joshep Downs, James Peng, Megan Pera, Diana Rodenberger 
#### Purpose:  Predicting cost and severity of claims for AllState
#### Created:  12/6/2016 
#### Submitted: 12/6/2016 

### Team name in Kaggle: UCB_207_1

## Link to Leaderboard
https://www.kaggle.com/c/allstate-claims-severity/leaderboard


In [17]:
%matplotlib inline
import unittest

# General libraries.
import re, os, sys
import numpy as np
import pandas as pd
from itertools import compress
from sklearn.metrics import mean_absolute_error
from sklearn import linear_model
from sklearn.linear_model import Ridge
from datetime import datetime
from sklearn.feature_selection import mutual_info_regression


This notebook takes the datasets that Diana created and tries to pare down the number of variables by selecting relevant variables using mutual_info_regression. 

### Load and split train data

In [5]:
X = pd.read_csv("~/Downloads/AllstateChallenge-master/data_out/X_dummies_train.csv")
y = pd.read_csv("~/Downloads/AllstateChallenge-master/data_out/y_train.csv", header=None)
id = pd.read_csv("~/Downloads/AllstateChallenge-master/data_out/id_train.csv") 

In [13]:
# Set variables to hold dev and training data
dev_data, dev_labels, dev_id = X[168318:], y[168318:], id[168318:]
train_data, train_labels, train_id = X[:168318], y[:168318], id[:168318]

In [15]:
print(train_data.shape)
print(train_labels.shape)
print(dev_data.shape)
print(dev_labels.shape)

(168318, 1190)
(168318, 1)
(20000, 1190)
(20000, 1)


### Run a baseline Ridge regression for comparison

In [18]:
# Run a Ridge regression and evaluate MAE
lr1 = linear_model.Ridge(alpha=0.00001, normalize=True)
lr1.fit(train_data, train_labels)

#use same linear model previously fit with training data
dev_log_pred = lr1.predict(dev_data)

mae = mean_absolute_error(dev_labels, dev_log_pred)
print('mean_absolute_error on test data {0}'.format(mae))


mean_absolute_error on test data 0.1900172851793714


### Experimenting with mutual info regression

In [31]:
full = train_data
full['loss'] = train_labels
mini = full.sample(1000, random_state=1)

# Pull out target vector
loss_array = np.asarray(mini.loc[:,('loss')]) 

# Prepare mini matrix
mini.drop('loss', axis=1, inplace=True)
col = list(mini.columns)
mini_train_matrix = mini.as_matrix(col) # variable df, as a matrix

# Running mutual_info_regression
feature_info = mutual_info_regression(mini_train_matrix,loss_array)

In [55]:
# Returns an array of estimated mutual information between each feature and the loss target
# the kernel dies when you do the whole dataset; not ideal
# Writing a function that runs mutual_info_regression on a random chunk of data at a time

# Pulling loss back into the data set for random sampling purposes
full = train_data
full['loss'] = train_labels

pd.options.mode.chained_assignment = None
def find_features(n):
    '''
    This function returns a list of features to keep in the data set for regression.
    Randomly samples and calculates mutual_info_regression n times. 
    '''
    i = 0
    features = []
    
    while i < n:
        # Pulling a random chunk of data from X 
        full = train_data
        full['loss'] = train_labels
        mini = full.sample(1000)
        
        # Pull out target vector
        loss_array = np.asarray(mini.loc[:,('loss')]) 
        
        # Prepare mini matrix
        mini.drop('loss', axis=1, inplace=True)
        col = list(mini.columns)
        mini_train_matrix = mini.as_matrix(col) # variable df, as a matrix

        # Running mutual_info_regression
        feature_info = mutual_info_regression(mini_train_matrix,loss_array)
        
        # Finding features that return more than 0 information 
        keep = feature_info > 0
        
        if i == 0:
            features = keep
        else:
            features2 = list(compress(col, keep))
            features = set(features).intersection(features2)
        i += 1
        
    return features

In [56]:
# Running 5 times and timing the process
t1 = datetime.today().timestamp() # start timer
keep = find_features(5)
t2 = datetime.today().timestamp() # end timer
print("This took",t2-t1,"seconds")

This took 92.34055280685425 seconds


In [57]:
# Pull out variable names that contribute information
names = []
for item in keep:
    delim_pos=int(np.core.defchararray.find(item,'_'))
    var=item[:delim_pos]
    names.append(var)
vars = np.unique(np.asarray(names))

# Include all categories of relevant variables
col = list(train_data.columns)
final = []
for i in range(0,len(col)):
    for v in vars:
        if col[i][:delim_pos] == v:
            final.append(col[i])
        else:
            pass

print("Mutual info regression cuts down the number of variables from",len(col), "to", len(final))

Mutual info regression cuts down the number of variables from 1191 to 613


In [58]:
# pare down data sets to only use the 885 relevant variables
dev_set = dev_data[final]
train_set = train_data[final]

print(dev_set.shape)
print(train_set.shape)

(20000, 613)
(168318, 613)


### Run a Ridge regression with optimized data set

In [59]:
# Run a Ridge regression and evaluate MAE
pd.options.mode.chained_assignment = None

lr2 = linear_model.Ridge(alpha=0.00001, normalize=True)
lr2.fit(train_set, train_labels)

#use same linear model previously fit with training data
dev_log_pred = lr2.predict(dev_set)

mae = mean_absolute_error(dev_labels, dev_log_pred)
print('mean_absolute_error on test data {0}'.format(mae))


mean_absolute_error on test data 0.21792750669700212


###### In conclusion, using mutual_info_regression takes a long time and does not improve the mae above what is gained using L2 regression. 