# Policy Prediction



<b>Library Imports</b> :

In [1]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Note on the `axis` argument used by the Pandas functions below:
#   axis : {0 or 'index', 1 or 'columns'}, default 0
#   0 or 'index' for row-wise, 1 or 'columns' for column-wise
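To make the `axis` convention concrete, here is a minimal sketch on a small made-up frame. The `demo` data is hypothetical and not taken from the dataset; it only illustrates which direction each setting works in.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the axis argument.
demo = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})

# axis=0 ('index'): walk down the rows, giving one result per column.
per_column = demo.count(axis=0)

# axis=1 ('columns'): walk across the columns, giving one result per row.
per_row = demo.count(axis=1)

print(per_column)
print(per_row)
```

With `axis=0` the result has one entry per column, which is the form used for the missing-value checks later in this notebook.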

<b>Read Data</b> :

In [2]:
train = pd.read_csv("Data/train.csv")        # full training set
train = pd.read_csv("Data/train_short.csv")  # overrides the line above; keep only the file you need
test = pd.read_csv("Data/test.csv")
test_session = pd.read_csv("Data/test_session_history.csv")


<b>Explore Training Data</b>:

In [3]:
train.head()

,customer_ID,shopping_pt,record_type,day,time,state,location,group_size,homeowner,car_age,car_value,risk_factor,age_oldest,age_youngest,married_couple,C_previous,duration_previous,policy,cost
0,140813,9,1,2,13:02,PA,37390,1,0,13,e,,24,24,0,1.0,5.0,1,675
1,140815,6,1,3,13:41,FL,33166,2,0,2,e,2.0,52,21,0,1.0,6.0,4,653
2,140809,7,1,4,14:45,PA,39166,1,1,20,d,,51,51,0,3.0,1.0,1,620
3,140811,6,1,1,12:42,RI,38272,1,0,17,c,3.0,24,24,0,3.0,8.0,3,661
4,281617,7,1,2,9:47,IA,43903,1,0,12,e,,55,55,0,4.0,4.0,3,595


<b>Data Preparation and Cleaning</b> :

<u>`Step 1`</u> : Remove the default index added to the Data Frame.

Because customer_ID uniquely identifies each customer in the dataset, we can use it as the index in place of the default integer index.
The Pandas <b><i>set_index</i></b> function does this.

In [4]:
train.set_index('customer_ID', drop=True, inplace=True)
train.head()

customer_ID,shopping_pt,record_type,day,time,state,location,group_size,homeowner,car_age,car_value,risk_factor,age_oldest,age_youngest,married_couple,C_previous,duration_previous,policy,cost
140813,9,1,2,13:02,PA,37390,1,0,13,e,,24,24,0,1.0,5.0,1,675
140815,6,1,3,13:41,FL,33166,2,0,2,e,2.0,52,21,0,1.0,6.0,4,653
140809,7,1,4,14:45,PA,39166,1,1,20,d,,51,51,0,3.0,1.0,1,620
140811,6,1,1,12:42,RI,38272,1,0,17,c,3.0,24,24,0,3.0,8.0,3,661
281617,7,1,2,9:47,IA,43903,1,0,12,e,,55,55,0,4.0,4.0,3,595
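As a small illustration of the same step, the sketch below promotes an ID column to the index on a hypothetical three-row sample (the `sample` frame is made up, with values borrowed from the rows shown above). It also checks uniqueness first, since the step relies on customer_ID being unique, and uses the non-inplace form so the original frame is left untouched.

```python
import pandas as pd

# Hypothetical three-row sample mirroring a few columns of the training data.
sample = pd.DataFrame({
    "customer_ID": [140813, 140815, 140809],
    "state": ["PA", "FL", "PA"],
    "cost": [675, 653, 620],
})

# Confirm the key really is unique before promoting it to the index.
assert sample["customer_ID"].is_unique

# Non-inplace form: returns a new frame and leaves `sample` untouched.
indexed = sample.set_index("customer_ID")

# Rows can now be looked up directly by customer ID.
print(indexed.loc[140815, "cost"])
```

Assigning the result to a new name, rather than using `inplace=True`, makes it safe to re-run the cell without hitting an error because the column has already been moved into the index.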


<u>`Step 2`</u> : Check for missing values

We can use the Pandas <b><i>count</i></b> function.

The count function returns a Series with the number of non-NA/null observations along the requested axis. It also works with non-floating-point data, treating both NaN and None as missing.

In the output below, every column has 67663 non-null values except risk_factor, C_previous and duration_previous. These three columns report fewer because they contain missing (NaN) values.

In [5]:
train.count(axis=0, numeric_only=False)

shopping_pt          67663
record_type          67663
day                  67663
time                 67663
state                67663
location             67663
group_size           67663
homeowner            67663
car_age              67663
car_value            67663
risk_factor          43618
age_oldest           67663
age_youngest         67663
married_couple       67663
C_previous           67103
duration_previous    67103
policy               67663
cost                 67663
dtype: int64
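The counts above give the number of present values, so the number of missing values has to be worked out by subtraction (67663 - 43618 = 24045 for risk_factor, and 67663 - 67103 = 560 each for C_previous and duration_previous). A minimal sketch of getting the missing counts directly, using `isna().sum()` on a hypothetical sample frame rather than the real data:

```python
import pandas as pd

# Hypothetical sample with gaps, mimicking risk_factor and C_previous.
sample = pd.DataFrame({
    "risk_factor": [None, 2.0, None, 3.0],
    "C_previous": [1.0, 1.0, 3.0, None],
    "cost": [675, 653, 620, 661],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing = sample.isna().sum()

# count() and the missing count always add up to the number of rows.
assert ((sample.count() + missing) == len(sample)).all()

print(missing)
```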

An alternative to the <b><i>count</i></b> function is the <b><i>info</i></b> function, which prints a concise summary of the DataFrame, including the non-null count and dtype of every column, as shown below.

In [6]:
train.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67663 entries, 140813 to 123397
Data columns (total 18 columns):
shopping_pt          67663 non-null int64
record_type          67663 non-null int64
day                  67663 non-null int64
time                 67663 non-null object
state                67663 non-null object
location             67663 non-null int64
group_size           67663 non-null int64
homeowner            67663 non-null int64
car_age              67663 non-null int64
car_value            67663 non-null object
risk_factor          43618 non-null float64
age_oldest           67663 non-null int64
age_youngest         67663 non-null int64
married_couple       67663 non-null int64
C_previous           67103 non-null float64
duration_previous    67103 non-null float64
policy               67663 non-null int64
cost                 67663 non-null int64
dtypes: float64(3), int64(12), object(3)
memory usage: 9.8+ MB


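<u>`Step 3`</u> : Drop the NaN values

We can use the Pandas <b><i>dropna</i></b> function.

The <b><i>dropna</i></b> function returns a new object in which the rows (or columns, depending on `axis`) are removed where any, or all, of the values are missing, as selected by the `how` argument. Here we drop every row that has at least one missing value and store the result in train_new.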

In [8]:
train_new = train.dropna(axis=0, how='any')
train_new.head(10)

customer_ID,shopping_pt,record_type,day,time,state,location,group_size,homeowner,car_age,car_value,risk_factor,age_oldest,age_youngest,married_couple,C_previous,duration_previous,policy,cost
140815,6,1,3,13:41,FL,33166,2,0,2,e,2.0,52,21,0,1.0,6.0,4,653
140811,6,1,1,12:42,RI,38272,1,0,17,c,3.0,24,24,0,3.0,8.0,3,661
281623,9,1,2,9:51,MD,31741,1,0,18,e,3.0,35,35,0,1.0,3.0,1,669
140829,5,1,0,14:49,MD,31888,1,1,9,e,1.0,64,64,0,3.0,10.0,2,607
281613,5,1,2,15:44,FL,31168,1,1,14,d,1.0,68,68,0,3.0,3.0,1,550
140821,4,1,2,11:33,FL,33790,1,1,13,f,2.0,44,44,0,1.0,9.0,3,635
281607,3,1,4,10:04,AR,32254,1,1,18,f,3.0,72,72,0,2.0,1.0,3,622
140847,5,1,4,17:45,NV,43606,1,0,2,c,2.0,62,62,0,1.0,7.0,2,621
281663,4,1,4,16:48,WA,31696,1,0,8,f,4.0,25,25,0,4.0,2.0,4,620
140837,9,1,4,12:24,FL,30400,2,1,1,h,3.0,75,72,1,4.0,2.0,2,706
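Dropping every incomplete row is only one option. As a hedged sketch on a hypothetical sample frame (not the real data), the three most common ways to control `dropna` look like this:

```python
import pandas as pd

# Hypothetical sample with gaps in two columns.
sample = pd.DataFrame({
    "risk_factor": [None, 2.0, None, 3.0],
    "C_previous": [1.0, 1.0, None, None],
    "cost": [675, 653, 620, 661],
})

# how='any': drop a row if at least one value is missing.
complete = sample.dropna(how="any")

# subset: only look at the listed columns when deciding what to drop.
has_risk = sample.dropna(subset=["risk_factor"])

# thresh: keep rows that have at least this many non-missing values.
# (how and thresh are alternatives; pass one or the other, not both.)
mostly_complete = sample.dropna(thresh=2)

print(len(complete), len(has_risk), len(mostly_complete))
```

None of these calls modifies `sample`; each returns a new frame, which is why the result has to be assigned to a name before it can be inspected later.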


To verify that the rows with NaN values were dropped, we can check the count again on the new DataFrame.

In [9]:
train_new.count(axis=0, numeric_only=False)

shopping_pt          43349
record_type          43349
day                  43349
time                 43349
state                43349
location             43349
group_size           43349
homeowner            43349
car_age              43349
car_value            43349
risk_factor          43349
age_oldest           43349
age_youngest         43349
married_couple       43349
C_previous           43349
duration_previous    43349
policy               43349
cost                 43349
dtype: int64

This shows that all rows containing NaN values have been removed. The number of rows decreased from 67663 to 43349, i.e., about 36% of the records were incomplete.
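The share of incomplete records follows directly from the two row counts reported above. A short worked calculation (the variable names are illustrative):

```python
# Row counts taken from the count() outputs before and after dropna.
rows_before = 67663
rows_after = 43349

dropped = rows_before - rows_after
share_dropped = dropped / rows_before

print(f"{dropped} rows dropped ({share_dropped:.1%})")
# prints "24314 rows dropped (35.9%)"
```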




<u>`Step 4`</u> : Remove Duplicates

Every row has a set of features that represents the customer's profile. When the same combination of features repeats, the model can become biased, because it sees one or a few customer profiles more often than the others. The model should therefore be trained on a set of unique samples.

To do this, we first select the columns in which to look for duplicates.
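A minimal sketch of this idea on a hypothetical sample frame is shown below. The choice of `profile_columns` is an assumption made purely for illustration; which columns define a customer profile in the real data is left open here.

```python
import pandas as pd

# Hypothetical sample indexed by customer_ID, as in Step 1.
sample = pd.DataFrame({
    "customer_ID": [1, 2, 3],
    "state": ["PA", "PA", "FL"],
    "car_age": [13, 13, 2],
    "policy": [1, 1, 4],
}).set_index("customer_ID")

# Hypothetical choice of profile columns; the real list is up to the analyst.
profile_columns = ["state", "car_age"]

# duplicated() flags every repeat of an earlier row (keep='first' by default).
repeat_mask = sample.duplicated(subset=profile_columns)

# drop_duplicates() keeps the first occurrence of each profile.
unique_profiles = sample.drop_duplicates(subset=profile_columns)

print(repeat_mask.sum(), len(unique_profiles))
```

Both functions compare column values only and ignore the index, so customers with different IDs but identical profiles are still treated as duplicates.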

In [None]:
columns_to_search = train['']

In [None]:
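# customer_ID became the index in Step 1, so it is dropped via reset_index rather than del.
train.reset_index(drop=True, inplace=True)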
