# Data Preprocessing

This file shows how I performed data cleaning and feature engineering. 

## Set up

Import libraries.

In [1]:
import numpy as np
import pandas as pd
from category_encoders.target_encoder import TargetEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

Load datasets.

In [2]:
full_train_df = pd.read_csv("../data/raw/train.csv")
test_df = pd.read_csv("../data/raw/test.csv")

The provided test set does not include the target variable, so we create an internal validation set to evaluate model performance.

In [3]:
train_df, val_df = train_test_split(full_train_df, test_size=0.2, stratify=full_train_df["fraud"], random_state=30)
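
As a quick sanity check on what `stratify` does, here is a minimal sketch on a made-up data frame (the `toy_*` names and the 10% positive rate are assumptions for illustration, not properties of the real data). It shows that both parts of the split keep the same share of positive labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 positives out of 100 rows.
toy_df = pd.DataFrame({"x": range(100), "fraud": [1] * 10 + [0] * 90})

toy_train, toy_val = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["fraud"], random_state=30
)

# With stratification, both parts keep the 10% positive rate.
train_rate = toy_train["fraud"].mean()
val_rate = toy_val["fraud"].mean()
print(train_rate, val_rate)
```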

## Data Cleaning

Remove the observations whose target variable `fraud` is equal to -1.

In [4]:
train_df = train_df[train_df["fraud"] != -1].copy()
val_df = val_df[val_df["fraud"] != -1].copy()  # the validation labels must be valid too

For values that match the following conditions, treat them as missing values to be imputed later.

- `age_of_driver > 100`
- `annual_income == -1`
- `zip_code == 0`

According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), the oldest living person is 115, as of 2018. I think it is reasonable to assume that any `age_of_driver > 100` in this dataset is a clerical error.

In [5]:
for df in [train_df, val_df, test_df]:
    df.loc[df["age_of_driver"] > 100, "age_of_driver"] = np.nan
    df.loc[df["annual_income"] == -1, "annual_income"] = np.nan
    df.loc[df["zip_code"] == 0, "zip_code"] = np.nan
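
Before choosing an imputation strategy, it can help to look at how much of each column is actually missing. The sketch below uses a made-up frame (`toy_df` and its values are hypothetical) and applies the same sentinel-to-`NaN` replacement as above, then reports the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; floats are used so that NaN can be stored directly.
toy_df = pd.DataFrame({
    "age_of_driver": [35.0, 229.0, 41.0, 58.0],
    "annual_income": [52000.0, -1.0, 61000.0, -1.0],
})

# Same idea as the cell above: turn implausible values into NaN.
toy_df.loc[toy_df["age_of_driver"] > 100, "age_of_driver"] = np.nan
toy_df.loc[toy_df["annual_income"] == -1, "annual_income"] = np.nan

# Fraction of missing values in each column.
missing_share = toy_df.isna().mean()
print(missing_share)
```

Running `train_df.isna().mean()` at this point in the notebook would give the corresponding figures for the real data.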

Next, we impute the missing values. Since only a very small percentage of the values are missing, we simply use mean imputation for the continuous variables and mode imputation for the categorical variables.

In [6]:
continuous_features = ["age_of_driver", "annual_income", "claim_est_payout", "age_of_vehicle"]
categorical_features = ["marital_status", "witness_present_ind", "zip_code"]

# compute the fill values on the training set only, then reuse them for all three sets
# mean (truncated to an integer) for continuous variables, mode for categorical variables
fill_values = {f: int(train_df[f].mean(skipna=True)) for f in continuous_features}
fill_values.update({f: train_df[f].mode(dropna=True).values[0] for f in categorical_features})

for df in [train_df, val_df, test_df]:
    for feature, value in fill_values.items():
        # assign the result back; a chained inplace fillna may not modify df
        df[feature] = df[feature].fillna(value)

## Feature Engineering

Remove `claim_date`.

In [7]:
for df in [train_df, val_df, test_df]:
    df.drop(columns=["claim_date"], inplace=True)

Some numerical features are measured on very different scales, so they should be rescaled.

In [8]:
numerical_features = [
    "age_of_driver", "annual_income", "safty_rating", "past_num_of_claims", 
    "liab_prct", "claim_est_payout", "age_of_vehicle", "vehicle_price", "vehicle_weight"
]
scaler = MinMaxScaler()
scaler.fit(train_df[numerical_features])

for df in [train_df, val_df, test_df]:
    df[numerical_features] = scaler.transform(df[numerical_features])
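
To make the effect of fitting the scaler on the training set only more concrete, here is a small sketch with made-up numbers (all `toy_*` names and values are hypothetical). Training values map into [0, 1], while a validation value above the training maximum ends up greater than 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
toy_train_values = np.array([[20.0], [40.0], [60.0]])
toy_val_values = np.array([[30.0], [80.0]])

toy_scaler = MinMaxScaler()
toy_scaler.fit(toy_train_values)  # learns min=20 and max=60 from the training data

scaled_train = toy_scaler.transform(toy_train_values)  # 0.0, 0.5, 1.0
scaled_val = toy_scaler.transform(toy_val_values)      # 0.25, 1.5
print(scaled_train.ravel(), scaled_val.ravel())
```

Values outside [0, 1] in the validation or test set are therefore expected and not a sign that something went wrong.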

`zip_code` has many unique values, so creating dummy variables for it would increase the dimensionality of the data too much. One idea is to transform it into `latitude` and `longitude` using the data from [UnitedStatesZipCodes.org](https://www.unitedstateszipcodes.org/zip-code-database/).

In [9]:
zip_code_database = pd.read_csv("../data/external/zip_code_database.csv")
latitude_and_longitude_lookup = {
    row.zip: (row.latitude, row.longitude) for row in zip_code_database.itertuples()
}

for df in [train_df, val_df, test_df]:
    df["latitude"] = df["zip_code"].apply(lambda x: latitude_and_longitude_lookup[x][0])
    df["longitude"] = df["zip_code"].apply(lambda x: latitude_and_longitude_lookup[x][1])
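
The dictionary lookup above raises a `KeyError` if a ZIP code is not present in the database. A possible alternative, sketched here on made-up data (the `toy_*` frames and their coordinates are hypothetical, not taken from the real database), is to use `Series.map`, which leaves unmatched ZIP codes as `NaN` so they can be counted and handled explicitly.

```python
import pandas as pd

# Hypothetical miniature of the ZIP code database, using the same column names.
toy_zip_db = pd.DataFrame({
    "zip": [50001, 15012],
    "latitude": [41.0, 40.0],
    "longitude": [-93.0, -80.0],
})
# Hypothetical claims; the last ZIP code is not in the database.
toy_claims = pd.DataFrame({"zip_code": [50001, 15012, 99999]})

# Index the database by ZIP code and map each coordinate column.
coords = toy_zip_db.set_index("zip")
toy_claims["latitude"] = toy_claims["zip_code"].map(coords["latitude"])
toy_claims["longitude"] = toy_claims["zip_code"].map(coords["longitude"])

# Unmatched ZIP codes show up as NaN instead of raising an error.
n_unmatched = int(toy_claims["latitude"].isna().sum())
print(n_unmatched)
```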

Another idea is to use [target encoding](https://maxhalford.github.io/blog/target-encoding/).

In [10]:
target_encoder = TargetEncoder(cols=["zip_code"])
target_encoder.fit(train_df["zip_code"], train_df["fraud"])

for df in [train_df, val_df, test_df]:
    df["zip_code_target_encoded"] = target_encoder.transform(df["zip_code"])
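
To illustrate the idea behind target encoding, the sketch below computes a smoothed per-category mean of the target by hand. This is a simplified additive-smoothing variant on made-up data; the smoothing weight `m` and all `toy_*` names are assumptions, and the formula is not necessarily the one `TargetEncoder` uses internally.

```python
import pandas as pd

# Hypothetical training data.
toy_train = pd.DataFrame({
    "zip_code": [1, 1, 1, 2, 2, 3],
    "fraud":    [1, 0, 1, 0, 0, 1],
})

prior = toy_train["fraud"].mean()  # overall fraud rate: 3 / 6 = 0.5
stats = toy_train.groupby("zip_code")["fraud"].agg(["sum", "count"])

m = 2.0  # hypothetical smoothing weight: larger m pulls small groups toward the prior
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

# Apply the training-set encoding to new data; unseen categories get the prior.
toy_val = pd.DataFrame({"zip_code": [1, 3, 4]})
toy_val["zip_code_target_encoded"] = toy_val["zip_code"].map(encoding).fillna(prior)
print(toy_val)
```

As in the cell above, the encoding is learned from the training set only, so the validation and test targets never influence the encoded feature.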



Now we can drop `zip_code`.

In [11]:
for df in [train_df, val_df, test_df]:
    df.drop(columns=["zip_code"], inplace=True)

Create dummy variables for the other categorical features.

In [12]:
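# rebind each data frame; assigning to the loop variable `df` would leave them unchanged
train_df, val_df, test_df = (pd.get_dummies(df) for df in [train_df, val_df, test_df])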
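
When dummy variables are created separately for each data set, a category that is absent from one set produces no column there, so the sets can end up with different columns. One way to guard against this, sketched on made-up data (the `toy_*` frames and the column names `gender` and `living_status` are hypothetical), is to reindex the other sets to the training columns.

```python
import pandas as pd

# Hypothetical data: the validation set contains no "M" and no "Own".
toy_train = pd.DataFrame({"gender": ["M", "F", "M"], "living_status": ["Own", "Rent", "Own"]})
toy_val = pd.DataFrame({"gender": ["F", "F"], "living_status": ["Rent", "Rent"]})

train_dummies = pd.get_dummies(toy_train)

# Align to the training columns: missing columns are added and filled with 0,
# and columns that never appeared in training would be dropped.
val_dummies = pd.get_dummies(toy_val).reindex(columns=train_dummies.columns, fill_value=0)

print(list(val_dummies.columns))
```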

## Export processed data

In [13]:
train_df.to_csv("../data/processed/train.csv", index=False)
val_df.to_csv("../data/processed/val.csv", index=False)
test_df.to_csv("../data/processed/test.csv", index=False)