# Allstate insurance data set

### The Allstate Corporation is one of the largest insurance providers in the United States and one of the largest that is publicly held.

Accurately predicting claim severity in the event of an accident is getting increasingly difficult, and prompt payment is very important for customer satisfaction. The aim of this analysis is to create a machine learning model that accurately predicts claim severity.

### Importing necessary modules for data analysis and visualization

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

I am now ready to load the data into pandas for exploration.

In [3]:
dataset = pd.read_csv('~/Documents/All_state_dataset/train.csv', sep = ',', low_memory = False)
dataset.head(10)

Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
0,1,A,B,A,B,A,A,A,A,B,...,0.718367,0.33506,0.3026,0.67135,0.8351,0.569745,0.594646,0.822493,0.714843,2213.18
1,2,A,B,A,A,A,A,A,A,B,...,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,1283.6
2,5,A,B,A,A,B,A,A,A,B,...,0.289648,0.315545,0.2732,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,3005.09
3,10,B,B,A,B,A,A,A,A,B,...,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.32157,0.605077,0.602642,939.85
4,11,A,B,A,B,A,A,A,A,B,...,0.178193,0.247408,0.24564,0.22089,0.2123,0.204687,0.202213,0.246011,0.432606,2763.85
5,13,A,B,A,A,A,A,A,A,B,...,0.364464,0.401162,0.26847,0.46226,0.50556,0.366788,0.359249,0.345247,0.726792,5142.87
6,14,A,A,A,A,B,A,A,A,A,...,0.381515,0.363768,0.24564,0.40455,0.47225,0.334828,0.352251,0.342239,0.382931,1132.22
7,20,A,B,A,B,A,A,A,A,B,...,0.867021,0.583389,0.90267,0.84847,0.80218,0.644013,0.785706,0.859764,0.242416,3585.75
8,23,A,B,B,B,B,A,A,A,B,...,0.628534,0.384099,0.61229,0.38249,0.51111,0.682315,0.669033,0.756454,0.361191,10280.2
9,24,A,B,A,A,B,B,A,A,B,...,0.713343,0.469223,0.3026,0.67135,0.8351,0.863052,0.879347,0.822493,0.294523,6184.59


Viewing the properties of the dataset:

In [4]:
dataset.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188318 entries, 0 to 188317
Columns: 132 entries, id to loss
dtypes: float64(15), int64(1), object(116)
memory usage: 982.0 MB


I passed memory_usage = 'deep' to the .info() function to get the accurate memory usage of the dataset, because by default pandas only estimates the memory used by object columns.
The dataset occupies 982 MB of memory, which is why it took quite some time to read into pandas. Aggregating over the data as it is would be possible, but I will first try to optimize how the data is stored in each column to reduce the memory footprint and, with it, the processing time.
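
To illustrate the difference between the default estimate and the 'deep' measurement, here is a minimal sketch on a small made-up frame. The name `toy` and its values are assumptions for illustration only and are not part of the Allstate data.

```python
import pandas as pd

# Hypothetical toy frame: one object (string) column and one float column
toy = pd.DataFrame({
    "cat1": pd.Series(["A", "B", "A", "B"] * 1000, dtype="object"),
    "cont1": pd.Series([0.1, 0.2, 0.3, 0.4] * 1000, dtype="float64"),
})

# Shallow: object columns are counted as 8-byte references only
shallow = toy.memory_usage(deep=False).sum()
# Deep: the Python string objects behind those references are counted too
deep = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

The float column contributes the same amount to both numbers; the gap comes entirely from the object column.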

Looking closely, 16 columns are numerical while 116 columns are of object dtype. pandas stores object columns as references to individual Python objects, which typically takes up more memory. But let's see precisely how much space each data type is occupying.

In [5]:
for dtype in ['float','int','object']:
    selected_dtype = dataset.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))

Average memory usage for float columns: 1.35 MB
Average memory usage for int columns: 0.72 MB
Average memory usage for object columns: 8.20 MB


Clearly the object columns are using up a lot of space, and there are 116 of them, so practically most of the memory is consumed here. Next I'll try to optimize the storage of the object columns to see if I can reduce memory usage.

## Optimizing object columns using categoricals

Before jumping right into optimizing, one thing to note is that converting to categoricals only works effectively on columns where the number of unique values is less than 50% of the total number of values in the column.
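
As a rough illustration of why the share of unique values matters, the sketch below compares the memory of a low-cardinality and a high-cardinality string column before and after conversion. The series names and values are made up for this example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 unique labels over 100,000 rows
few_labels = pd.Series(["A", "B", "C", "D"] * 25_000, dtype="object")
# Hypothetical high-cardinality column: every value is unique
many_labels = pd.Series([f"id_{i}" for i in range(100_000)], dtype="object")

object_bytes = few_labels.memory_usage(deep=True)
category_bytes = few_labels.astype("category").memory_usage(deep=True)
print(f"low cardinality : object={object_bytes}, category={category_bytes}")

# With all-unique values the category still has to store every string once,
# plus one integer code per row, so little or nothing is saved.
unique_object_bytes = many_labels.memory_usage(deep=True)
unique_category_bytes = many_labels.astype("category").memory_usage(deep=True)
print(f"high cardinality: object={unique_object_bytes}, category={unique_category_bytes}")
```

A categorical keeps each distinct label once and replaces the column values with small integer codes, which is where the saving in the first case comes from.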

I will check if the stated condition above is met before optimizing.

In [6]:
converted_obj = pd.DataFrame()
# Select the object columns once instead of on every iteration
object_columns = dataset.select_dtypes(include=['object'])

for col in object_columns.columns:
    unique_values = len(object_columns[col].unique())
    total_values = len(object_columns[col])
    # Convert only when fewer than 50% of the values are unique
    if unique_values / total_values < 0.5:
        converted_obj[col] = object_columns[col].astype('category')
    else:
        converted_obj[col] = object_columns[col]

Now I will check whether there is any memory reduction.

In [7]:
converted_obj.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188318 entries, 0 to 188317
Columns: 116 entries, cat1 to cat116
dtypes: category(116)
memory usage: 21.3 MB


Great, all the object columns together now take up only 21.3 MB of memory. Next I will create a copy of the dataset called optimized_dataset, which I'm going to perform the analysis on from now on.

In [8]:
optimized_dataset = dataset.copy()
optimized_dataset[converted_obj.columns] = converted_obj
optimized_dataset.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188318 entries, 0 to 188317
Columns: 132 entries, id to loss
dtypes: category(116), float64(15), int64(1)
memory usage: 44.3 MB


In [9]:
optimized_dataset.head()

Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
0,1,A,B,A,B,A,A,A,A,B,...,0.718367,0.33506,0.3026,0.67135,0.8351,0.569745,0.594646,0.822493,0.714843,2213.18
1,2,A,B,A,A,A,A,A,A,B,...,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,1283.6
2,5,A,B,A,A,B,A,A,A,B,...,0.289648,0.315545,0.2732,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,3005.09
3,10,B,B,A,B,A,A,A,A,B,...,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.32157,0.605077,0.602642,939.85
4,11,A,B,A,B,A,A,A,A,B,...,0.178193,0.247408,0.24564,0.22089,0.2123,0.204687,0.202213,0.246011,0.432606,2763.85


# Feature Selection

First I will encode the categorical variables using LabelEncoder.

In [22]:
from sklearn.preprocessing import LabelEncoder

In [23]:
for cat in converted_obj.columns:
    encode = LabelEncoder()
    encode.fit(optimized_dataset[cat].unique())
    optimized_dataset[cat] = encode.transform(optimized_dataset[cat])
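
Because the columns are already stored as pandas categoricals, the same integer codes could probably also be read directly from the `.cat.codes` accessor. The sketch below shows this on a small made-up series (the name `toy` is an assumption for illustration); it is an alternative to the loop above, not a replacement the analysis depends on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, converted to category like the columns above
toy = pd.Series(["B", "A", "C", "A"], dtype="object").astype("category")

# pandas assigns codes in sorted category order: A -> 0, B -> 1, C -> 2
codes_from_pandas = toy.cat.codes

# LabelEncoder also sorts the labels, so it yields the same mapping here
codes_from_sklearn = LabelEncoder().fit_transform(toy.astype(object))

print(list(codes_from_pandas))
print(list(codes_from_sklearn))
```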

In [28]:
optimized_dataset.head()

Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
0,1,0,1,0,1,0,0,0,0,1,...,0.718367,0.33506,0.3026,0.67135,0.8351,0.569745,0.594646,0.822493,0.714843,2213.18
1,2,0,1,0,0,0,0,0,0,1,...,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,1283.6
2,5,0,1,0,0,1,0,0,0,1,...,0.289648,0.315545,0.2732,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,3005.09
3,10,1,1,0,1,0,0,0,0,1,...,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.32157,0.605077,0.602642,939.85
4,11,0,1,0,1,0,0,0,0,1,...,0.178193,0.247408,0.24564,0.22089,0.2123,0.204687,0.202213,0.246011,0.432606,2763.85


### In this next stage I am going to decide which features will be useful in training my machine learning model

For this selection I am going to use a univariate selector: SelectKBest with the chi-square score function from the scikit-learn library.

In [29]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [30]:
y = optimized_dataset['loss']       # Selecting Target variable
X = optimized_dataset.iloc[:,0:-1]   # Selecting independent variables

In [31]:
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)

ValueError: Unknown label type: (array([2213.18, 1283.6 , 3005.09, ..., 5762.64, 1562.87, 4751.72]),)
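
The ValueError above is most likely raised because scikit-learn's chi2 is a score function for classification: it treats y as a set of class labels, while loss is a continuous value. One possible way around it is to use a regression score function such as f_regression (or mutual_info_regression) with SelectKBest. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and the choice of k are assumptions for this example and do not come from the analysis above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: 6 features, of which only columns 0 and 3 drive the target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + 0.1 * rng.normal(size=500)

# f_regression scores each feature with a univariate linear F-test,
# which accepts a continuous target
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

# Column indices of the k highest-scoring features
selected = selector.get_support(indices=True)
print(selected)
```

On the real data the same pattern would take X and y in place of the synthetic arrays. Since f_regression only captures linear relationships, mutual_info_regression may be worth comparing against it.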