Your task is to build a claim frequency model that takes as input the attributes of a single property (state, square footage and building age) and outputs the expected number of claims for that building _per unit of exposure_. We would like you to use 1,000 square feet as your unit of exposure. Therefore, if a building is 2,000 square feet, then it corresponds to 2 units of exposure. If your model outputs 3.23 for this building, it will mean that you predict that, on average, this building will generate 6.46 claims per year. 
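
The exposure arithmetic above can be written out as a small sketch. The function and constant names here are hypothetical, chosen only for illustration; the 1,000 square feet unit and the 3.23 example come from the task statement.

```python
SQFT_PER_EXPOSURE_UNIT = 1_000  # unit of exposure stated in the task


def exposure_units(sqft: float) -> float:
    """Convert square footage into units of exposure."""
    return sqft / SQFT_PER_EXPOSURE_UNIT


def expected_claims(rate_per_unit: float, sqft: float) -> float:
    """Expected yearly claims = model output (per unit) x exposure units."""
    return rate_per_unit * exposure_units(sqft)


# The 2,000 square foot building from the task statement: 2 units of exposure.
print(exposure_units(2_000))
print(expected_claims(3.23, 2_000))
```

The model itself only predicts the per-unit rate; multiplying by the exposure is what converts it back to a claim count for a specific building.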

Importing all three data files:
1) Importing the claim data requires pyarrow on the system; please refer to the environment file.

In [74]:
import pandas as pd
import numpy as np


In [21]:
# policies (Excel), properties (CSV) and claims (Parquet, read with pyarrow)
df_policy = pd.read_excel('../policies_10272021.xlsx')
df_property = pd.read_csv('../properties_10272021.csv')
df_claim = pd.read_parquet('../claims_10272021.parquet', engine='pyarrow')

In [22]:
df_policy.head()

Unnamed: 0,pol,start,end
0,11bb0dd0,4Jun.2018,4Jun.2019
1,96a3c554,25Jan2017,25Jan2018
2,35a90ece,26Oct2018,26Oct2019
3,6d034563,3Aug.2017,3Aug.2018
4,c70d089a,28Nov.2017,28Nov.2018


In [23]:
df_policy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1760 entries, 0 to 1759
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   pol     1760 non-null   object
 1   start   1760 non-null   object
 2   end     1760 non-null   object
dtypes: object(3)
memory usage: 41.4+ KB


The policy dataset has three columns (policy number, start date and end date). It has no null values.

In [24]:
df_property.head()

Unnamed: 0,prop_id,pol,state,sqft,age
0,7019,0152f838,OH,10876,67
1,8025,604b4377,AZ,87946,95
2,5766,b7007f67,FL,57978,13
3,6120,bae6a6d7,AZ,89202,53
4,1468,a5916966,AZ,35934,29


In [25]:
df_property.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8413 entries, 0 to 8412
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   prop_id  8413 non-null   int64 
 1   pol      8413 non-null   object
 2   state    8413 non-null   object
 3   sqft     8413 non-null   int64 
 4   age      8413 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 328.8+ KB


The property dataset has 5 columns (property id, policy number, state, sqft and age). It has no null values.

In [26]:
df_claim.head()

Unnamed: 0,pol,property,start_date,amount
0,0152f838,7019,25Oct2018,157866.849557
1,0152f838,7019,25Oct2018,918867.179064
2,0152f838,7019,25Oct2018,128602.395049
3,0152f838,7019,25Oct2018,447153.652892
4,0152f838,7019,25Oct2018,221691.94043


In [27]:
df_claim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62565 entries, 0 to 62564
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pol         62565 non-null  object 
 1   property    62565 non-null  int64  
 2   start_date  62565 non-null  object 
 3   amount      62565 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.9+ MB


df_claim has 4 columns (policy number, property id, start date and amount) and no null values.

The claim dataset is our main dataset, since it holds the claim details. I will add the other details to it from the policy and property datasets:


1) First, join the property dataset to the claim dataset.
2) Second, join the policy dataset to the result.
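
The first join step can be sketched on toy data as follows. The small frames below are made-up stand-ins for df_claim and df_property (only the column names follow the real data), and the `validate` and `indicator` arguments are an optional safety check that I am assuming is useful here, not something the original cells use.

```python
import pandas as pd

# Toy stand-ins for df_claim and df_property (values are illustrative only).
claims = pd.DataFrame({
    "pol": ["a1", "a1", "b2"],
    "property": [10, 10, 20],
    "amount": [100.0, 250.0, 75.0],
})
properties = pd.DataFrame({
    "prop_id": [10, 20, 30],
    "pol": ["a1", "b2", "c3"],
    "state": ["OH", "AZ", "FL"],
    "sqft": [10_000, 20_000, 30_000],
    "age": [5, 15, 25],
})

merged = pd.merge(
    claims,
    properties.drop(columns="pol"),  # avoids the pol_x / pol_y duplicates
    how="left",
    left_on="property",
    right_on="prop_id",
    validate="many_to_one",  # raises if prop_id is not unique on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Claims whose property id was not found in the property table.
unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns=["prop_id", "_merge"])
print(merged.shape, unmatched)
```

With a left join the number of claim rows is preserved, so a non-zero `unmatched` count would show up as missing state, sqft and age values rather than as dropped rows.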

In [17]:
df_property.columns

Index(['property', 'pol', 'state', 'sqft', 'age'], dtype='object')

In [39]:
df_merge=pd.merge(df_claim, df_property,  how='left', left_on=['property'], right_on = ['prop_id'])

In [40]:
df_merge.head()

Unnamed: 0,pol_x,property,start_date,amount,prop_id,pol_y,state,sqft,age
0,0152f838,7019,25Oct2018,157866.849557,7019,0152f838,OH,10876,67
1,0152f838,7019,25Oct2018,918867.179064,7019,0152f838,OH,10876,67
2,0152f838,7019,25Oct2018,128602.395049,7019,0152f838,OH,10876,67
3,0152f838,7019,25Oct2018,447153.652892,7019,0152f838,OH,10876,67
4,0152f838,7019,25Oct2018,221691.94043,7019,0152f838,OH,10876,67


In [41]:
df_merge=df_merge.drop(['pol_y','prop_id'],axis=1)

In [42]:
df_merge.shape

(62565, 7)

In [43]:
df_merge = pd.merge(df_merge, df_policy,  how='left', left_on=['pol_x','start_date'], right_on = ['pol','start'])

In [44]:
df_merge.head()

Unnamed: 0,pol_x,property,start_date,amount,state,sqft,age,pol,start,end
0,0152f838,7019,25Oct2018,157866.849557,OH,10876,67,0152f838,25Oct2018,25Oct2019
1,0152f838,7019,25Oct2018,918867.179064,OH,10876,67,0152f838,25Oct2018,25Oct2019
2,0152f838,7019,25Oct2018,128602.395049,OH,10876,67,0152f838,25Oct2018,25Oct2019
3,0152f838,7019,25Oct2018,447153.652892,OH,10876,67,0152f838,25Oct2018,25Oct2019
4,0152f838,7019,25Oct2018,221691.94043,OH,10876,67,0152f838,25Oct2018,25Oct2019
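
One thing to watch in this join: the policy preview shows two date spellings ('4Jun.2018' and '25Jan2017'), while the claim preview shows only the form without a dot. A join on the raw strings may therefore miss some policies. Because it is a left join, no claim rows are lost, and the policy columns are dropped later anyway, so the final counts are not affected. If the dates were needed, a minimal sketch of normalizing them first could look like this (the helper name `normalize_date` is hypothetical, and I am assuming these two spellings are the only ones present):

```python
import pandas as pd


def normalize_date(raw: pd.Series) -> pd.Series:
    """Strip the optional dot after the month and parse day-month-year strings."""
    cleaned = raw.str.replace(".", "", regex=False)
    # %b expects English month abbreviations (locale dependent).
    return pd.to_datetime(cleaned, format="%d%b%Y")


# Sample values copied from the policy preview above.
policy_start = pd.Series(["4Jun.2018", "25Jan2017", "3Aug.2017"])
parsed = normalize_date(policy_start)
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Applying the same helper to both `start_date` in the claims and `start` in the policies before merging would make the two key columns comparable.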


In [45]:
# dropping columns which are not required for further processing
df_merge=df_merge.drop(['pol_x','start_date','pol','start','end'],axis=1)

In [46]:
df_merge.head()

Unnamed: 0,property,amount,state,sqft,age
0,7019,157866.849557,OH,10876,67
1,7019,918867.179064,OH,10876,67
2,7019,128602.395049,OH,10876,67
3,7019,447153.652892,OH,10876,67
4,7019,221691.94043,OH,10876,67


In [47]:
# counting the claims for each property
df_merge.groupby(['property','state','sqft','age']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,amount
property,state,sqft,age,Unnamed: 4_level_1
0,AZ,88972,83,13
1,OH,43434,11,11
2,FL,66035,8,9
3,AZ,18145,59,12
4,AZ,74404,50,9
...,...,...,...,...
8408,OH,65374,39,18
8409,AZ,84879,45,2
8410,AZ,65986,25,8
8411,OH,16746,85,22


In [48]:
df_final=pd.DataFrame(df_merge.groupby(['property','state','sqft','age']).count())
df_final=df_final.reset_index()

In [49]:
df_final.head()

Unnamed: 0,property,state,sqft,age,amount
0,0,AZ,88972,83,13
1,1,OH,43434,11,11
2,2,FL,66035,8,9
3,3,AZ,18145,59,12
4,4,AZ,74404,50,9


In [53]:
# Calculating the exposure
df_final['exposure']=df_final['sqft']/1000

In [51]:
df_final.rename(columns={'amount': 'claims'}, inplace=True)

In [54]:
# freq, which will be our dependent variable in the modelling
df_final['freq']=df_final['claims']/df_final['exposure']

In [70]:
df_final['age_bins'] = pd.cut(x=df_final['age'], bins=[-np.inf,0, 10, 20, 30, 40, 50,60,70,80,90,100],labels=['0','10','20', '30', '40','50','60','70','80','90','100'])
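
To make the binning above easier to read: each label is the upper edge of its interval, and `pd.cut` closes intervals on the right by default, so an age of 83 falls in (80, 90] and is labelled '90'. A small sketch with the same edges (the variable names `edges`, `labels` and `ages` are only for illustration):

```python
import numpy as np
import pandas as pd

edges = [-np.inf, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = [str(upper) for upper in edges[1:]]  # '0', '10', ..., '100'

# Ages taken from the first five rows of df_final.
ages = pd.Series([83, 11, 8, 59, 50])
age_bins = pd.cut(ages, bins=edges, labels=labels)
print(age_bins.astype(str).tolist())

# Ages above the last edge fall outside every bin and become NaN.
too_old = pd.cut(pd.Series([101]), bins=edges, labels=labels)
print(too_old.isna().tolist())
```

An age of exactly 50 lands in the '50' bin, not the '60' bin, because the right edge is inclusive.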

In [71]:
df_final.head()

Unnamed: 0,property,state,sqft,age,claims,exposure,freq,age_bins
0,0,AZ,88972,83,13,88.972,0.146113,90
1,1,OH,43434,11,11,43.434,0.253258,20
2,2,FL,66035,8,9,66.035,0.136291,10
3,3,AZ,18145,59,12,18.145,0.661339,60
4,4,AZ,74404,50,9,74.404,0.120961,50


In [72]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8209 entries, 0 to 8208
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   property  8209 non-null   int64   
 1   state     8209 non-null   object  
 2   sqft      8209 non-null   int64   
 3   age       8209 non-null   int64   
 4   claims    8209 non-null   int64   
 5   exposure  8209 non-null   float64 
 6   freq      8209 non-null   float64 
 7   age_bins  8209 non-null   category
dtypes: category(1), float64(2), int64(4), object(1)
memory usage: 457.5+ KB
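
Note that df_final has 8209 rows while df_property has 8413. Since the counts were built from the claim table, properties that do not appear in it (presumably those with no claims) are absent here instead of being recorded with zero claims. If they should be kept, one possible approach is to start from the property table and fill missing counts with 0. This is a sketch on toy data under that assumption, not what the cells above do:

```python
import pandas as pd

# Toy stand-ins: property 2 has no claims at all.
properties = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "state": ["OH", "AZ", "FL"],
    "sqft": [10_000, 20_000, 40_000],
    "age": [5, 15, 25],
})
claims = pd.DataFrame({
    "property": [1, 1, 3],
    "amount": [100.0, 250.0, 75.0],
})

# Number of claims per property id.
claim_counts = claims.groupby("property").size().rename("claims").reset_index()

# Left join from the property table keeps properties without claims.
full = properties.merge(
    claim_counts, how="left", left_on="prop_id", right_on="property"
).drop(columns="property")
full["claims"] = full["claims"].fillna(0).astype(int)

full["exposure"] = full["sqft"] / 1000
full["freq"] = full["claims"] / full["exposure"]
print(full[["prop_id", "claims", "exposure", "freq"]])
```

Whether zero-claim properties belong in the training data depends on how the frequency model is set up, so I treat this as something to check rather than a required change.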


In [73]:
df_final.to_csv('Output_files/Data_pre.csv',index=False)