# Auto Insurance Claims Fraud Detection - Feature Engineering

After completing the initial data cleansing in notebook `1-data-cleansing`, the next step is to manipulate the data and derive features.

## 1.2 Feature extraction

The following non-exhaustive list gives you some guidelines for __feature transformation__:

* __Imputing__ <br>
Fill missing values based on their value distribution, as some algorithms are sensitive to missing data
* __Imputed time-series quantization__ <br>
For time series data with measurements at different timestamps, quantize measurements to a common interval and impute corresponding values
* __Scaling / Normalizing / Centering__ <br>
Center data around zero and scale values to have a standard deviation of one to address algorithm sensitivity to differences in value ranges
* __Filtering__ <br>
Delete low-quality records if imputing values doesn't yield satisfactory results
* __Discretizing__ <br>
Convert continuous fields into discrete categories, as discrete age ranges may perform better than continuous values, particularly with simpler models or smaller datasets

The following non-exhaustive list gives you some guidelines for __feature creation__:

* __One-hot-encoding__ <br>
Transform categorical integer features into "one-hot" vectors, adding one column for each distinct category
* __Time-to-Frequency transformation__ <br>
Convert time-series or sequence data from the time domain to the frequency domain using techniques like the FFT (Fast Fourier Transform)
* __Month-From-Date__ <br>
Create an additional feature indicating the month independently from the date to capture seasonal aspects, and optionally further discretize into quarters
* __Aggregate-on-Target__ <br>
Aggregate fields with respect to the target variable or other relevant fields to improve performance. For example, count the number of data points per ZIP code or calculate the median of all values by geographical region

In [1]:
# imports
import numpy as np
import pandas as pd
#!pip install pandas-profiling
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000) 

In [2]:
def to_csv(df, path):
    # Prepend dtypes to the top of df
    df2 = df.copy()
    df2.loc[-1] = df2.dtypes
    df2.index = df2.index + 1
    df2.sort_index(inplace=True)
    # Then save it to a csv
    df2.to_csv(path, index=False)
    
def read_csv(path):
    # Read the dtypes stored in the first data row of the csv (read once, reused below)
    stored = pd.read_csv(path, nrows=1).iloc[0].to_dict()
    dtypes = {key: value for key, value in stored.items() if 'date' not in value}
    parse_dates = [key for key, value in stored.items() if 'date' in value]
    # Read the remaining rows with the dtypes from above, skipping the dtype row
    return pd.read_csv(path, dtype=dtypes, parse_dates=parse_dates, skiprows=[1])

In [3]:
# Load the data with dtypes
data = read_csv('data/insurance_claims_clean.csv')
print(data.shape)
print(data.dtypes)

(999, 36)
months_as_customer                      int64
age                                     int64
policy_bind_date               datetime64[ns]
policy_state                         category
policy_csl                           category
policy_deductable                       int64
policy_annual_premium                 float64
umbrella_limit                          int64
insured_zip                             int64
insured_sex                          category
insured_education_level              category
insured_occupation                   category
insured_hobbies                      category
insured_relationship                 category
capital-gains                           int64
capital-loss                            int64
incident_date                  datetime64[ns]
incident_type                        category
collision_type                       category
incident_severity                    category
authorities_contacted                category
incident_state          

In [4]:
pd.options.mode.chained_assignment = None  # default='warn'

In [5]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split
y = data['fraud_reported']
X = data.drop('fraud_reported', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Feature transformation:
__Imputing__ <br>
Some algorithms are very sensitive to missing values. Imputing fills empty fields based on the value distribution of the respective field.

NA. No missing values.
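
The dataset has no missing values, so nothing is imputed here. For reference, a minimal sketch of what imputing could look like, assuming small hypothetical frames in place of `X_train` and `X_test`. The fill values are learned from the training data only, so nothing leaks from the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_test
train = pd.DataFrame({'age': [25.0, np.nan, 40.0, 31.0],
                      'policy_state': ['OH', 'IN', None, 'OH']})
test = pd.DataFrame({'age': [np.nan, 52.0],
                     'policy_state': [None, 'IL']})

# Learn the fill values from the training data only
age_median = train['age'].median()            # median for a numeric field
state_mode = train['policy_state'].mode()[0]  # most frequent value for a categorical field
fill_values = {'age': age_median, 'policy_state': state_mode}

# Apply the same fill values to both sets
train = train.fillna(fill_values)
test = test.fillna(fill_values)
```

Median and mode are only one possible choice; sampling from the observed distribution would follow the description above more closely.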

__Imputed time-series quantization__ <br>
Time series often contain streams with measurements at different timestamps. Therefore, it is beneficial to quantize measurements to a common “heart beat” and impute the corresponding values. This can be done by sampling from the source time series distributions on the respective quantized time steps

NA. Data is not longitudinal time series.
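
Although this step does not apply to the claims data, the idea can be sketched on a hypothetical irregular series. The example below resamples to a one-minute interval and fills the empty interval by linear interpolation, which is a simpler stand-in for sampling from the source distribution.

```python
import pandas as pd

# Hypothetical irregular measurements (not part of the claims data)
ts = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(['2020-01-01 00:00:10',
                          '2020-01-01 00:01:05',
                          '2020-01-01 00:03:20']),
)

# Quantize to a common one-minute interval; intervals without data become NaN
quantized = ts.resample('1min').mean()

# Impute the gaps, here by linear interpolation between neighbouring intervals
imputed = quantized.interpolate(method='linear')
```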

__Filtering__ <br>
Sometimes imputing values doesn’t perform well. In that case, deleting low-quality records is the better strategy

NA.
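
No records are filtered in this notebook. As an illustration only, the sketch below drops rows that have too few populated fields. The toy records and the threshold of three populated fields are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing fields
records = pd.DataFrame({
    'age': [30, np.nan, 45],
    'total_claim_amount': [5000.0, np.nan, 7200.0],
    'incident_state': ['NY', None, 'SC'],
    'insured_zip': [466132, 468176, np.nan],
})

# Keep only rows with at least 3 of the 4 fields populated (assumed threshold)
min_present = 3
filtered = records.dropna(thresh=min_present)
```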

__Discretizing__ <br>
Continuous fields might confuse the model, e.g. a discrete set of age ranges sometimes performs better than continuous values, especially on smaller amounts of data and with simpler models
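
Before binning the real data, it helps to see how `pd.cut` treats the bin edges. The sketch below uses the same edges as the next cell on a few illustrative ages; the label names are hypothetical.

```python
import pandas as pd

ages = pd.Series([18, 19, 25, 26, 45, 64, 70])  # illustrative values
bins = [18, 25, 45, 64]
labels = ['young', 'adult', 'senior']  # hypothetical label names

# Intervals are right-closed by default: (18, 25], (25, 45], (45, 64]
age_bins = pd.cut(ages, bins=bins, labels=labels)

# 18 and 70 fall outside every interval and become NaN;
# include_lowest=True would pull 18 into the first bin
outside = ages[age_bins.isna()]
```

If the data can contain ages on or outside the outer edges, widen the bins or check for NaN after binning.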



In [6]:
# Bin age into young driver, adult, senior (ordered categorical)
# Intervals are right-closed, so ages outside (18, 64] become NaN
bins = [18,25,45,64]

X_train['age_bins'] = pd.cut(X_train['age'], bins)
X_test['age_bins'] = pd.cut(X_test['age'], bins)
# Categories (3, interval[int64]): [(18, 25] < (25, 45] < (45, 64]]

In [7]:
# Collapse all hobbies other than chess and cross-fit into a single 'other' category
keep_hobbies = ('chess', 'cross-fit')
X_train['insured_hobbies'] = X_train['insured_hobbies'].apply(lambda x: x if x in keep_hobbies else 'other')
X_test['insured_hobbies'] = X_test['insured_hobbies'].apply(lambda x: x if x in keep_hobbies else 'other')

In [8]:
# Collapse insured_occupation values other than exec-managerial into 'other'
# ToDo - test

__Scaling / Normalizing / Centering__ <br>
Some algorithms are very sensitive to differences in value ranges between individual fields. Therefore, it is best practice to center data around zero and scale values to a standard deviation of one.

Note: Fitting _must_ be done on the train data to avoid 'leaking' from test data.
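
A minimal sketch of leak-free scaling, assuming hypothetical toy values for two of the claim columns. The scaler learns mean and standard deviation from the training rows and reuses them for the test rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values for two of the claim columns
train = pd.DataFrame({'total_claim_amount': [1000.0, 3000.0, 5000.0],
                      'injury_claim': [100.0, 200.0, 300.0]})
test = pd.DataFrame({'total_claim_amount': [3000.0, 7000.0],
                     'injury_claim': [200.0, 400.0]})
cols = ['total_claim_amount', 'injury_claim']

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
train[cols] = scaler.fit_transform(train[cols])
# Reuse the training statistics for the test data
test[cols] = scaler.transform(test[cols])
```

Because the test set is scaled with the training statistics, its columns will generally not have exactly zero mean or unit variance.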

In [9]:
# For now try without scaling numerical features.

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

col_names = ['total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim']

#ct = ColumnTransformer([('sc', StandardScaler(), col_names)], remainder='passthrough')

#X_train = ct.fit_transform(X_train)
#X_test = ct.transform(X_test)

### Feature creation:
__One-hot-encoding__ <br>
Categorical integer features should be transformed into “one-hot” vectors. In relational terms, this adds one column for each distinct category.

Note: this can be done on the data before splitting into train and test to cover all categorical values.

In [11]:
# Avoiding mismatch between train and test after one-hot-encoding
X_train['train'] = 1
X_test['train'] = 0

# concat train and test set
combined = pd.concat([X_train, X_test], axis=0)

# one-hot-encode
combined = pd.get_dummies(combined, drop_first=True)

# split back into train and test, drop column train
# (drop returns a new frame, avoiding an in-place change on a slice of combined)
X_train = combined[combined['train']==1].drop(columns=['train'])
X_test = combined[combined['train']==0].drop(columns=['train'])

__Time-to-Frequency transformation__ <br>
Time-series (and sometimes also sequence) data is recorded in the time domain but can easily be transformed into the frequency domain, e.g. using the FFT (Fast Fourier Transform)

NA
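
The claims data contains no signal to transform, so the sketch below uses a hypothetical 5 Hz sine wave to show the technique with NumPy's FFT routines.

```python
import numpy as np

# Hypothetical signal: a 5 Hz sine wave sampled at 100 Hz for one second
sample_rate = 100
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)

# Real-input FFT and the frequency (in Hz) belonging to each output bin
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The magnitude spectrum peaks at the dominant frequency of the signal
dominant_freq = freqs[np.argmax(np.abs(spectrum))]
```

Features such as the dominant frequency or the energy in selected frequency bands can then be used as model inputs.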

__Month-From-Date__ <br>
Creating an additional feature containing the month independent of the full date captures seasonal aspects. Sometimes further discretization into quarters helps as well

In [12]:
# Create additional column with incident_month_of_year
X_train['incident_month_of_year'] = X_train['incident_date'].dt.month
X_test['incident_month_of_year'] = X_test['incident_date'].dt.month

In [13]:
# Create additional column with incident_day_of_week (Monday=0 ... Sunday=6)
X_train['incident_day_of_week'] = X_train['incident_date'].dt.dayofweek
X_test['incident_day_of_week'] = X_test['incident_date'].dt.dayofweek

In [14]:
# Create additional column to signal weekday (1) vs. weekend (0)
X_train['weekday'] = X_train['incident_day_of_week'].isin([0,1,2,3,4]).astype('int')
X_test['weekday'] = X_test['incident_day_of_week'].isin([0,1,2,3,4]).astype('int')

Note: this can be done on the data before splitting into train and test.
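
Since these features depend only on each row's own date, they can be wrapped in one helper and applied once before the split. A minimal sketch follows; the helper name `add_date_features` and the extra `incident_quarter` column are assumptions and not part of the notebook above.

```python
import pandas as pd

def add_date_features(df, date_col='incident_date'):
    """Return a copy of df with month, quarter, day-of-week and weekday flag."""
    out = df.copy()
    out['incident_month_of_year'] = out[date_col].dt.month
    out['incident_quarter'] = out[date_col].dt.quarter        # hypothetical extra feature
    out['incident_day_of_week'] = out[date_col].dt.dayofweek  # Monday=0 ... Sunday=6
    out['weekday'] = (out['incident_day_of_week'] < 5).astype('int')
    return out

# Illustrative dates: a Sunday in January and a Wednesday in February 2015
demo = pd.DataFrame({'incident_date': pd.to_datetime(['2015-01-25', '2015-02-18'])})
demo = add_date_features(demo)
```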

__Aggregate-on-Target__ <br>
Simply aggregating fields with respect to the target variable (or even other fields) can improve performance, e.g. count the number of data points per ZIP code or take the median of all values by geographical region

Not applied: no suitable aggregation has been identified yet.
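
No aggregate feature is built in this notebook. One possible starting point is the ZIP code count mentioned above, sketched here on hypothetical toy frames. The column name `zip_claim_count` is an assumption. Counts are computed on the training data only and then mapped onto both sets.

```python
import pandas as pd

# Hypothetical toy frames; ZIP code 999999 does not occur in the training data
train = pd.DataFrame({'insured_zip': [466132, 466132, 468176, 430632]})
test = pd.DataFrame({'insured_zip': [466132, 999999]})

# Count training records per ZIP code
zip_counts = train['insured_zip'].value_counts()

# Map the counts onto both sets; ZIP codes unseen in train get 0
train['zip_claim_count'] = train['insured_zip'].map(zip_counts)
test['zip_claim_count'] = test['insured_zip'].map(zip_counts).fillna(0).astype(int)
```

Aggregates that involve the target itself, such as a fraud rate per ZIP code, need more care, because they can leak label information into the features.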

### Store data

In [15]:
to_csv(X_train.join(y_train), 'data/insurance_claims_train_features.csv')
to_csv(X_test.join(y_test), 'data/insurance_claims_test_features.csv')

In [16]:
# Show the working directory and its contents (works on any OS)
import os
print(os.getcwd())
print(os.listdir())
