# Feature Engineering
### Features to engineer
- trans_date_trans_time
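    - split each value into year, month, day, hour, minute and second columns (trans_year, trans_month, trans_day, trans_hour, trans_min, trans_sec)
    - add seasonality indicators (e.g. day of week, weekend flag)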
- dob
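    - split each value into year, month and day columns
    - create age column: calculate it from the current datetime (note that the value then depends on when the notebook is run; using the transaction datetime as the reference would keep it reproducible)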
- lat, long, merch_lat, merch_long
    - create geohash column: use the python-geohash package to generate a geohash for each latitude/longitude pair
    - create x, y, z columns: x = cos(lat) * cos(long), y = cos(lat) * sin(long), z = sin(lat), with lat and long converted from degrees to radians first
    - clustering: k-means, DBSCAN or hierarchical clustering on the coordinates
    - reference: https://fritz.ai/working-with-geospatial-data-in-machine-learning/
    - Google Places API
- cc_num
    - create card_issuing_bank column by categorisation: the first few digits of a credit card number usually identify the issuing bank, but our dataset is synthetic, so we need to check that this rule applies
    - clustering?
    - otherwise, we might have to drop this feature, as it may lead to model overfitting
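
The datetime, dob and coordinate items above can be sketched as follows. This is a minimal illustration rather than the notebook's actual preprocessing: the helper names (`add_datetime_parts`, `add_age`, `add_unit_sphere_xyz`), the `trans_dayofweek` and `trans_is_weekend` columns, and the two sample rows are hypothetical. It also assumes age is measured at transaction time instead of the current datetime, so that the result does not change between runs.

```python
import numpy as np
import pandas as pd


def add_datetime_parts(df, col="trans_date_trans_time", prefix="trans"):
    """Split a datetime column into calendar/clock parts plus simple seasonality flags."""
    out = df.copy()
    ts = pd.to_datetime(out[col])
    out[f"{prefix}_year"] = ts.dt.year
    out[f"{prefix}_month"] = ts.dt.month
    out[f"{prefix}_day"] = ts.dt.day
    out[f"{prefix}_hour"] = ts.dt.hour
    out[f"{prefix}_min"] = ts.dt.minute
    out[f"{prefix}_sec"] = ts.dt.second
    # Hypothetical seasonality indicators: Monday=0 ... Sunday=6, and a weekend flag.
    out[f"{prefix}_dayofweek"] = ts.dt.dayofweek
    out[f"{prefix}_is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    return out


def add_age(df, dob_col="dob", ref_col="trans_date_trans_time"):
    """Age in whole years at the reference datetime (assumption: the transaction time)."""
    out = df.copy()
    dob = pd.to_datetime(out[dob_col])
    ref = pd.to_datetime(out[ref_col])
    # Subtract one year when the birthday has not yet occurred in the reference year.
    before_birthday = (ref.dt.month < dob.dt.month) | (
        (ref.dt.month == dob.dt.month) & (ref.dt.day < dob.dt.day)
    )
    out["age"] = ref.dt.year - dob.dt.year - before_birthday.astype(int)
    return out


def add_unit_sphere_xyz(df, lat_col="lat", long_col="long", prefix=""):
    """Project latitude/longitude in degrees onto the unit sphere as x, y, z columns."""
    out = df.copy()
    lat = np.radians(out[lat_col].astype(float))  # trig functions expect radians
    lon = np.radians(out[long_col].astype(float))
    out[f"{prefix}x"] = np.cos(lat) * np.cos(lon)
    out[f"{prefix}y"] = np.cos(lat) * np.sin(lon)
    out[f"{prefix}z"] = np.sin(lat)
    return out


# Two made-up rows, only to exercise the helpers.
sample = pd.DataFrame(
    {
        "trans_date_trans_time": ["2019-12-09 01:51:28", "2020-01-05 21:02:27"],
        "dob": ["1988-03-09", "1962-01-19"],
        "lat": [0.0, 90.0],
        "long": [0.0, 0.0],
    }
)
features = add_unit_sphere_xyz(add_age(add_datetime_parts(sample)))
print(features[["trans_year", "trans_hour", "age", "x", "y", "z"]])
```

The same projection could be applied to the merchant coordinates with `add_unit_sphere_xyz(df, "merch_lat", "merch_long", prefix="merch_")`. The x, y, z encoding avoids the discontinuity in raw longitude at ±180 degrees, which matters for distance-based methods such as the clustering options listed above.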

## Import Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## Load Datasets

#### Train set

In [2]:
train_df = pd.read_csv('../data/processed/train_under_nm.csv')
train_df.head()

Unnamed: 0,merchant_fraud_Abbott-Rogahn,merchant_fraud_Abbott-Steuber,merchant_fraud_Abernathy and Sons,merchant_fraud_Abshire PLC,"merchant_fraud_Adams, Kovacek and Kuhlman",merchant_fraud_Adams-Barrows,"merchant_fraud_Altenwerth, Cartwright and Koss",merchant_fraud_Altenwerth-Kilback,merchant_fraud_Ankunding LLC,merchant_fraud_Ankunding-Carroll,...,merch_lat,merch_long,trans_day,trans_hour,trans_min,trans_month,trans_sec,trans_year,unix_time,is_fraud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,32.838176,-101.76948,9.0,1.0,51.0,12.0,28.0,2019.0,0.638276,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,47.155217,-100.563004,4.0,1.0,6.0,7.0,5.0,2019.0,0.344267,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,36.29623,-82.774458,5.0,21.0,2.0,1.0,27.0,2020.0,0.689995,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,46.54699,-117.861457,6.0,21.0,19.0,4.0,23.0,2019.0,0.180255,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,37.750413,-75.811784,11.0,20.0,31.0,10.0,38.0,2019.0,0.529957,0.0


#### Validation set

In [3]:
validation_df = pd.read_csv('../data/processed/validation.csv')
validation_df.head()

Unnamed: 0,merchant_fraud_Abbott-Rogahn,merchant_fraud_Abbott-Steuber,merchant_fraud_Abernathy and Sons,merchant_fraud_Abshire PLC,"merchant_fraud_Adams, Kovacek and Kuhlman",merchant_fraud_Adams-Barrows,"merchant_fraud_Altenwerth, Cartwright and Koss",merchant_fraud_Altenwerth-Kilback,merchant_fraud_Ankunding LLC,merchant_fraud_Ankunding-Carroll,...,merch_lat,merch_long,trans_day,trans_hour,trans_min,trans_month,trans_sec,trans_year,unix_time,is_fraud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,40.078344,-83.744323,20.0,3.0,35.0,10.0,59.0,2019.0,0.545389,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,42.763597,-102.985891,1.0,2.0,23.0,4.0,28.0,2019.0,0.169485,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,42.990737,-78.61645,9.0,4.0,6.0,6.0,24.0,2019.0,0.297989,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,37.385643,-93.539664,2.0,15.0,57.0,11.0,47.0,2019.0,0.570533,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,39.340593,-121.292761,30.0,2.0,50.0,3.0,2.0,2020.0,0.844861,0.0


#### Test set

In [4]:
test_df = pd.read_csv('../data/processed/test.csv')
test_df.head()

Unnamed: 0,merchant_fraud_Abbott-Rogahn,merchant_fraud_Abbott-Steuber,merchant_fraud_Abernathy and Sons,merchant_fraud_Abshire PLC,"merchant_fraud_Adams, Kovacek and Kuhlman",merchant_fraud_Adams-Barrows,"merchant_fraud_Altenwerth, Cartwright and Koss",merchant_fraud_Altenwerth-Kilback,merchant_fraud_Ankunding LLC,merchant_fraud_Ankunding-Carroll,...,merch_lat,merch_long,trans_day,trans_hour,trans_min,trans_month,trans_sec,trans_year,unix_time,is_fraud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,35.511039,-86.376035,7.0,3.0,50.0,7.0,41.0,2020.0,1.029124,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,37.820026,-81.083742,8.0,14.0,5.0,12.0,29.0,2020.0,1.316427,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,38.497704,-77.890586,11.0,1.0,41.0,12.0,18.0,2020.0,1.321046,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,40.028643,-77.146722,11.0,12.0,29.0,7.0,1.0,2020.0,1.037235,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,38.044333,-85.847865,6.0,17.0,43.0,7.0,15.0,2020.0,1.028339,0.0
