# Exploratory Data Analysis for Car Insurance Fraud

The following report uses an open-source dataset published by Oracle. The dataset includes more than 15,000 insurance claims. Each claim comes with variables that may be correlated with the target variable, which marks the claim as fraudulent or non-fraudulent.

In [1]:
# import all dependencies and data
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import seaborn as sns
import calendar

import utils

%matplotlib notebook
pd.set_option('mode.chained_assignment', None)  # silence SettingWithCopyWarning
sns.set_style("white")

data = pd.read_csv('claims.csv')
data = utils.clean_data(data)

Percentage of fraudulent claims, based on the target column FraudFound_P:

* Fraudulent claims are coded as 1
* Non-fraudulent claims are coded as 0

In [5]:
counted = data.groupby(["FraudFound_P"]).size()
print(counted / counted.sum() * 100)

FraudFound_P
0    93.990684
1     6.009316
dtype: float64
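
The same class shares can also be read off with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption for illustration, not part of claims.csv); only the column name FraudFound_P is taken from the report.

```python
import pandas as pd

# Toy stand-in for the claims table (hypothetical data); only the 0/1 label is needed.
toy = pd.DataFrame({"FraudFound_P": [0, 0, 0, 1]})

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
class_share = toy["FraudFound_P"].value_counts(normalize=True).sort_index() * 100
print(class_share)
```

With roughly 6% positive claims in the real data, the classes are strongly imbalanced, which is worth keeping in mind when reading the per-group percentages further down.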


In [8]:
# count claims for every (fraud label, age group) pair
counted = data.groupby(["FraudFound_P", "AgeOfPolicyHolder"]).size()

In [12]:
# share of each fraud label within an age group (level=1), so every age group sums to 100
(counted / counted.groupby(level=1).sum()) * 100

FraudFound_P  AgeOfPolicyHolder
0             16 to 17             89.932886
              18 to 20             85.714286
              21 to 25             84.761905
              26 to 30             94.693201
              31 to 35             93.525699
              36 to 40             94.120590
              41 to 50             94.862647
              51 to 65             95.054545
              over 65              94.035785
1             16 to 17             10.067114
              18 to 20             14.285714
              21 to 25             15.238095
              26 to 30              5.306799
              31 to 35              6.474301
              36 to 40              5.879410
              41 to 50              5.137353
              51 to 65              4.945455
              over 65               5.964215
dtype: float64
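
Because FraudFound_P is coded as 0/1, the fraud rate per age group (the rows for label 1 above) can also be obtained as a group mean. This is a minimal sketch on hypothetical toy rows, not the author's code; only the two column names come from the report.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from claims.csv.
toy = pd.DataFrame({
    "AgeOfPolicyHolder": ["21 to 25", "21 to 25",
                          "31 to 35", "31 to 35", "31 to 35", "31 to 35"],
    "FraudFound_P": [1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the fraud rate.
fraud_rate = toy.groupby("AgeOfPolicyHolder")["FraudFound_P"].mean() * 100
print(fraud_rate)
```

This form yields one number per age group instead of two complementary rows, which may be easier to compare or plot.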

In [3]:
utils.fraud_percentage(data, "AgeOfPolicyHolder")

FraudFound_P  AgeOfPolicyHolder
0             16 to 17              1.870594
              18 to 20              0.083758
              21 to 25              0.621205
              26 to 30              3.985482
              31 to 35             36.197390
              36 to 40             26.258114
              41 to 50             18.559363
              51 to 65              9.122636
              over 65               3.301459
1             16 to 17              3.275109
              18 to 20              0.218341
              21 to 25              1.746725
              26 to 30              3.493450
              31 to 35             39.192140
              36 to 40             25.655022
              41 to 50             15.720524
              51 to 65              7.423581
              over 65               3.275109
dtype: float64
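
The source of `utils.fraud_percentage` is not shown in this report. The percentages above sum to 100 within each fraud label, which suggests that the helper normalises the counts per label (level 0) rather than per age group. The sketch below is an assumption about that behaviour: `fraud_percentage_sketch` and the `toy` frame are hypothetical names introduced here for illustration.

```python
import pandas as pd

def fraud_percentage_sketch(df, column, target="FraudFound_P"):
    """Hypothetical stand-in for utils.fraud_percentage (real source not shown)."""
    counts = df.groupby([target, column]).size()
    # Total per fraud label, broadcast back onto every (label, value) row.
    label_totals = counts.groupby(level=0).transform("sum")
    # Each label's rows now sum to 100.
    return counts / label_totals * 100

# Hypothetical toy rows to exercise the function.
toy = pd.DataFrame({
    "FraudFound_P": [0, 0, 0, 0, 1, 1],
    "AgeOfPolicyHolder": ["31 to 35", "31 to 35", "31 to 35", "36 to 40",
                          "31 to 35", "36 to 40"],
})
result = fraud_percentage_sketch(toy, "AgeOfPolicyHolder")
print(result)
```

Read this way, the table answers a different question from the previous one: how the claims of each class are distributed across age groups, not how likely a claim in a given age group is to be fraudulent.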

In [4]:
utils.fraud_percentage(data, "Delay").plot(style='o')
plt.xlim(0,80)
sns.despine()
plt.show()
