# Data Preprocessing

This file shows how I performed data cleaning and feature engineering. 

## Set up

Import libraries.

In [89]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

Load datasets.

In [90]:
df_train_full = pd.read_csv("../data/raw/train.csv")
df_test = pd.read_csv("../data/raw/test.csv")

The provided test set does not include the target variable, so we have to create an internal validation set to evaluate the model performance.

In [91]:
df_train, df_val = train_test_split(df_train_full, test_size=0.2, random_state=99)
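
As a side note, if the `fraud` classes are imbalanced (an assumption; the class ratio is not shown here), passing `stratify` to `train_test_split` keeps the ratio similar in both parts. A minimal sketch on toy data, where the column `x` and all values are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 positive and 40 negative rows (made-up values).
toy = pd.DataFrame({"fraud": [1] * 10 + [0] * 40, "x": range(50)})

# stratify=toy["fraud"] asks for the same class proportions in both parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=99, stratify=toy["fraud"]
)

print(toy_train["fraud"].mean(), toy_val["fraud"].mean())
```

This is only an option to consider; the split used in this notebook is the plain random one above.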

## Data Cleaning

Remove the observations whose target variable `fraud` is equal to -1.

In [92]:
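# .copy() makes the filtered frames independent, so the in-place edits below do not trigger SettingWithCopyWarning
df_train = df_train[df_train["fraud"] != -1].copy()
df_val = df_val[df_val["fraud"] != -1].copy()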

Treat values that match the following conditions as missing values, to be imputed later.

- `age_of_driver > 100`
- `annual_income == -1`
- `zip_code == 0`

According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), the oldest living person was 115 as of 2018. I think it is reasonable to assume that any `age_of_driver > 100` in this dataset is a clerical error.

In [93]:
for df in [df_train, df_val, df_test]:
    df.loc[df["age_of_driver"] > 100, "age_of_driver"] = np.nan
    df.loc[df["annual_income"] == -1, "annual_income"] = np.nan
    df.loc[df["zip_code"] == 0, "zip_code"] = np.nan
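
The same sentinel-to-missing step can be sketched on toy data to see its effect. The values below are made up, and the `invalid_rules` dictionary is a hypothetical helper rather than part of the notebook; it uses `Series.mask` instead of `.loc` assignment.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "age_of_driver": [35.0, 212.0, 58.0],
        "annual_income": [42000.0, -1.0, 38000.0],
    }
)

# Hypothetical helper: map each column to a rule that flags implausible entries.
invalid_rules = {
    "age_of_driver": lambda s: s > 100,
    "annual_income": lambda s: s == -1,
}

for column, is_invalid in invalid_rules.items():
    # Series.mask replaces the flagged entries with NaN and keeps the rest.
    toy[column] = toy[column].mask(is_invalid(toy[column]))

# Count how many values per column are now treated as missing.
n_missing = toy.isna().sum()
print(n_missing)
```

Counting the missing values afterwards is a quick way to confirm that only a small share of the data is affected.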

Next, we impute the missing values. Since only a very small percentage of the values are missing, we simply use mean imputation for the continuous variables and mode imputation for the categorical variables. Note that the mean and mode are computed from the training set only, to prevent data leakage.

In [94]:
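continuous_features = ["age_of_driver", "annual_income", "claim_est_payout", "age_of_vehicle"]
categorical_features = ["marital_status", "witness_present_ind", "zip_code"]

# compute the fill values once, from the training set only, before any frame is modified
# (the mean is truncated to an integer; Series.mean and Series.mode skip NaN by default)
feature_means = {f: int(df_train[f].mean()) for f in continuous_features}
feature_modes = {f: df_train[f].mode().iloc[0] for f in categorical_features}

for df in [df_train, df_val, df_test]:
    # mean imputation for continuous variables, mode imputation for categorical variables
    for feature, fill_value in {**feature_means, **feature_modes}.items():
        # assign the result back instead of using inplace=True on a column selection
        df[feature] = df[feature].fillna(fill_value)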
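
The idea of learning the fill values on the training set and reusing them everywhere else can be sketched on toy data. The frames and values below are made up for illustration, and unlike the cell above the mean is not truncated to an integer.

```python
import numpy as np
import pandas as pd

# Toy frames with made-up values; only the column names come from the dataset.
toy_train = pd.DataFrame(
    {
        "age_of_vehicle": [2.0, 4.0, np.nan, 6.0],
        "marital_status": [1.0, 1.0, 0.0, np.nan],
    }
)
toy_val = pd.DataFrame(
    {
        "age_of_vehicle": [np.nan, 10.0],
        "marital_status": [np.nan, 0.0],
    }
)

# Learn the fill values from the training frame only.
fill_values = {
    "age_of_vehicle": toy_train["age_of_vehicle"].mean(),  # NaN is skipped by default
    "marital_status": toy_train["marital_status"].mode().iloc[0],  # most frequent value
}

# Apply the same fill values to every frame, including the training frame itself.
toy_train_filled = toy_train.fillna(fill_values)
toy_val_filled = toy_val.fillna(fill_values)
print(toy_val_filled)
```

The validation frame never contributes to `fill_values`, which is the point of computing them on the training set.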

## Feature Engineering

Remove features that do not seem to be related to the target variable (based on common sense).

In [95]:
for df in [df_train, df_val, df_test]:
    df.drop(columns=["claim_date", "claim_day_of_week", "vehicle_color"], inplace=True)

`zip_code` has many unique values, so creating dummy variables for it would increase the dimensionality of the data too much. One idea is to transform it into `latitude` and `longitude` using the data from [UnitedStatesZipCodes.org](https://www.unitedstateszipcodes.org/zip-code-database/).

In [96]:
zip_code_database = pd.read_csv("../data/external/zip_code_database.csv")
latitude_and_longitude_lookup = {
    row.zip: (row.latitude, row.longitude) for row in zip_code_database.itertuples()
}

for df in [df_train, df_val, df_test]:
    df["latitude"] = df["zip_code"].apply(lambda x: latitude_and_longitude_lookup[x][0])
    df["longitude"] = df["zip_code"].apply(lambda x: latitude_and_longitude_lookup[x][1])
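
The dictionary lookup above raises a `KeyError` if a ZIP code is absent from the database. A possible variant, sketched here on toy data, uses `Series.map`, which yields `NaN` for unknown keys. The table contents and coordinates are illustrative only, and the mean-coordinate fallback is an assumption rather than what this notebook does.

```python
import pandas as pd

# Hypothetical lookup table with approximate, illustrative coordinates;
# the real one is read from zip_code_database.csv.
zip_table = pd.DataFrame(
    {
        "zip": [10001, 60601],
        "latitude": [40.75, 41.89],
        "longitude": [-73.99, -87.62],
    }
).set_index("zip")

toy = pd.DataFrame({"zip_code": [60601, 10001, 99999]})  # 99999 is not in the table

# Series.map with a Series looks values up by index and yields NaN for unknown keys.
toy["latitude"] = toy["zip_code"].map(zip_table["latitude"])
toy["longitude"] = toy["zip_code"].map(zip_table["longitude"])

# Count the ZIP codes that could not be matched before filling them.
n_unmatched = int(toy["latitude"].isna().sum())

# One possible fallback (an assumption): use the mean coordinate of the table.
toy["latitude"] = toy["latitude"].fillna(zip_table["latitude"].mean())
toy["longitude"] = toy["longitude"].fillna(zip_table["longitude"].mean())
print(n_unmatched)
```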

Another idea is to use [target encoding](https://maxhalford.github.io/blog/target-encoding/), but after a few experiments it seems to perform worse than transforming `zip_code` to `latitude` and `longitude`.

In [None]:
#from category_encoders.target_encoder import TargetEncoder
#
#target_encoder = TargetEncoder(cols=["zip_code"], smoothing=10)
#target_encoder.fit(df_train["zip_code"], df_train["fraud"])
#
#for df in [df_train, df_val, df_test]:
#    df["zip_code_target_encoded"] = target_encoder.transform(df["zip_code"])
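
To make the idea concrete, here is a minimal hand-rolled sketch of target encoding with additive smoothing in plain pandas. It is not necessarily the formula that `TargetEncoder` applies internally; the weight `m` and all data values are hypothetical.

```python
import pandas as pd

# Toy data with made-up values.
toy_train = pd.DataFrame(
    {
        "zip_code": [10001, 10001, 10001, 60601],
        "fraud": [1, 0, 0, 1],
    }
)
toy_val = pd.DataFrame({"zip_code": [10001, 60601, 99999]})  # 99999 is unseen

m = 2  # hypothetical smoothing weight: larger values pull estimates toward the global mean
global_mean = toy_train["fraud"].mean()

# Per-category row count and mean of the target, from the training data only.
stats = toy_train.groupby("zip_code")["fraud"].agg(["count", "mean"])

# Blend each category mean with the global mean, weighted by the category size.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Categories unseen in training fall back to the global mean.
toy_val["zip_code_target_encoded"] = toy_val["zip_code"].map(smoothed).fillna(global_mean)
print(toy_val)
```

Rare categories are pulled toward the global fraud rate, which limits overfitting to ZIP codes with very few claims.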

Now we can drop `zip_code`.

In [98]:
for df in [df_train, df_val, df_test]:
    df.drop(columns=["zip_code"], inplace=True)

## Export processed data

In [99]:
df_train.to_csv("../data/processed/train.csv", index=False)
df_val.to_csv("../data/processed/val.csv", index=False)
df_test.to_csv("../data/processed/test.csv", index=False)