# Data Preprocessing

In this file, I will perform data cleaning and feature engineering. 

We also considered adding macroeconomic data (such as the unemployment rate and the S&P 500 index level), since we thought a weak economy might motivate more people to commit fraud. Because these features did not noticeably improve the predictions, we left them out of the final model, and they are not discussed further here.

## Set up

Load dependencies.

In [1]:
import numpy as np
import pandas as pd

Load datasets.

In [2]:
train = pd.read_csv('../data/raw/train.csv')
test = pd.read_csv('../data/raw/test.csv')

## Data Cleaning

Remove the observations whose target variable `fraud` is equal to -1.

In [3]:
# take an explicit copy so the in-place edits below act on an independent frame
train = train[train['fraud'] != -1].copy()

Treat values that match any of the following conditions as missing, to be imputed later.

- `age_of_driver > 100`
- `annual_income = -1`
- `zip_code = 0`

According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), the oldest living person is 115, as of 2018. I think it is reasonable to assume that any `age_of_driver > 100` in this dataset is a clerical error.

In [4]:
for df in [train, test]:
    df.loc[df['age_of_driver'] > 100, 'age_of_driver'] = np.nan
    df.loc[df['annual_income'] == -1, 'annual_income'] = np.nan
    df.loc[df['zip_code'] == 0, 'zip_code'] = np.nan
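
Once the sentinel values have been replaced with `NaN`, the share of missing values per column can be inspected before choosing an imputation strategy. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); the same expression could be applied to `train` and `test`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; all values are invented for illustration.
toy = pd.DataFrame({
    'age_of_driver': [35.0, np.nan, 52.0, 41.0],
    'annual_income': [40000.0, 38000.0, np.nan, np.nan],
    'zip_code': [15001.0, 15001.0, 20111.0, 80002.0],
})

# isna() yields a boolean frame; its column-wise mean is the fraction missing.
missing_share = toy.isna().mean()
print(missing_share)
```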

Next, we impute the missing values. Since only a very small percentage of values is missing, we simply use mean imputation for the continuous variables and mode imputation for the categorical ones.

In [5]:
for df in [train, test]:
    # mean imputation for continuous variables (the mean is truncated to an integer)
    for feature in ['age_of_driver', 'annual_income', 'claim_est_payout', 'age_of_vehicle']:
        feature_mean = df[feature].mean(skipna=True)
        # assign back instead of calling fillna(inplace=True) on a column selection,
        # which may not update the frame in recent pandas versions
        df[feature] = df[feature].fillna(int(feature_mean))

    # mode imputation for categorical variables
    for feature in ['marital_status', 'witness_present_ind', 'zip_code']:
        feature_mode = df[feature].mode(dropna=True)
        df[feature] = df[feature].fillna(feature_mode.iloc[0])
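
The cell above computes the fill values separately for each dataset. A possible variation, which is not what this notebook does, is to compute them once on the training data and reuse them for the test data, so both sets are filled with the same values. The sketch below illustrates this on toy frames; `fit_fill_values` is a hypothetical helper introduced only for this example.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, continuous, categorical):
    """Hypothetical helper: one fill value per column, taken from a reference frame."""
    # truncated mean for continuous columns, most frequent value for categorical ones
    fill = {col: int(df[col].mean()) for col in continuous}
    fill.update({col: df[col].mode().iloc[0] for col in categorical})
    return fill

# Toy data, invented for illustration.
toy_train = pd.DataFrame({
    'age_of_driver': [30.0, 40.0, np.nan, 50.0],
    'marital_status': [1.0, 1.0, 0.0, np.nan],
})
toy_test = pd.DataFrame({
    'age_of_driver': [np.nan, 60.0],
    'marital_status': [np.nan, 0.0],
})

# Fit on the training frame only, then apply the same values to both frames.
fill_values = fit_fill_values(toy_train, ['age_of_driver'], ['marital_status'])
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
```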

## Feature Engineering

Transform `zip_code` into `latitude` and `longitude` using the data from [UnitedStatesZipCodes.org](https://www.unitedstateszipcodes.org/zip-code-database/).

In [6]:
zip_code_database = pd.read_csv('../data/external/zip_code_database.csv')
latitude_and_longitude_lookup = {
    row.zip: (row.latitude, row.longitude) for row in zip_code_database.itertuples()
}

for df in [train, test]:
    df['latitude'] = df['zip_code'].apply(lambda x: latitude_and_longitude_lookup[x][0])
    df['longitude'] = df['zip_code'].apply(lambda x: latitude_and_longitude_lookup[x][1])
    df.drop(columns=['zip_code'], inplace=True)
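
The dictionary lookup above raises a `KeyError` if a ZIP code in the claims data is absent from the database. As a hedged alternative, the sketch below uses `Series.map` with an indexed lookup table, which leaves unmatched ZIP codes as `NaN` so they can be counted. The toy frames and their coordinates are invented; only the column names `zip`, `latitude`, and `longitude` come from the cell above, and the sketch assumes the database holds one row per ZIP code.

```python
import pandas as pd

# Toy stand-in for the ZIP code database; coordinates are made up.
zip_db = pd.DataFrame({
    'zip': [15001, 20111],
    'latitude': [40.6, 38.75],
    'longitude': [-80.3, -77.43],
})
# Toy stand-in for the claims data; 99999 is deliberately absent from zip_db.
claims = pd.DataFrame({'zip_code': [20111, 15001, 99999]})

# Index the lookup table by ZIP code (assumes unique ZIP codes).
coords = zip_db.set_index('zip')[['latitude', 'longitude']]

# map() aligns on the index and returns NaN for ZIP codes without a match.
claims['latitude'] = claims['zip_code'].map(coords['latitude'])
claims['longitude'] = claims['zip_code'].map(coords['longitude'])

# Number of claims whose ZIP code could not be resolved.
unmatched = int(claims['latitude'].isna().sum())
```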

I do not see how `claim_date` and `claim_day_of_week` would relate to insurance fraud, so I drop them to avoid adding noise to the model.

In [7]:
for df in [train, test]:
    df.drop(columns=['claim_date', 'claim_day_of_week'], inplace=True)
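
The decision to drop these columns rests on intuition. One simple check that could be run before the drop is to compare the fraud rate across days of the week. The sketch below shows the idea on invented toy data; it assumes `claim_day_of_week` holds day names and `fraud` is coded 0/1, which may differ from the actual dataset.

```python
import pandas as pd

# Toy data, invented for illustration; the encoding of both columns is an assumption.
toy = pd.DataFrame({
    'claim_day_of_week': ['Monday', 'Monday', 'Friday', 'Friday', 'Friday', 'Sunday'],
    'fraud': [0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 target within each group is the fraud rate for that day.
fraud_rate_by_day = toy.groupby('claim_day_of_week')['fraud'].mean()
print(fraud_rate_by_day)
```

Rates that are similar across days would be consistent with dropping the column, although a small sample can hide or exaggerate differences.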

## Export processed data

In [8]:
train.to_csv('../data/processed/train.csv', index=False)
test.to_csv('../data/processed/test.csv', index=False)