In [12]:
import json 
import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()

%matplotlib inline

# Description

We have attached a sample dataset (only 10 rows in each) that simulates Brex transaction data. There are two tables for you to work with.

**Chargebacks** (chargebacks_table_analytics_challenge_sample.csv): A list of all transactions that have been charged back. This means the customer flagged these transactions as fraud and we gave the customer their money back. `chargeback_id` is the primary key in this table and `transaction_id` is the foreign key.

**Transactions** (transactions_table_analytics_challenge_sample.csv): A list of all transactions that were attempted on Brex cards. All fraudulent transactions would be included in this data set. `transaction_id` is the primary key in this table.

More details on the industry-specific fields are below:
- `mid`: merchant ID; a unique ID that references a specific merchant.
- `mcc`: merchant category code; a code that references a category of merchants. For example, restaurants may have one MCC while financial services may have a different MCC.
- `card_presence`: tells you whether the customer was physically present at the time of the transaction.
- `pan_entry_mode`: tells you how the transaction was processed (e.g. chip, magstripe, ecommerce).
- `fraud_score`: how risky a transaction may be. The higher the score, the more likely it is fraudulent.


Reference: https://www.kaggle.com/c/ieee-fraud-detection/overview
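
The key relationship described above (each chargeback points at one transaction through `transaction_id`) suggests one way to derive a label: left-join the chargebacks onto the transactions and mark the rows that found a match. The following is a minimal sketch on made-up toy rows, not the sample files themselves; the column name `is_flagged_fraud` is an assumption introduced for illustration. Note that the sample transactions file spells the key column `transaction_od`, so it would need renaming before a join like this.

```python
import pandas as pd

# Toy stand-ins for the two tables; the IDs and amounts are made up.
transactions = pd.DataFrame({
    "transaction_id": ["tx_a", "tx_b", "tx_c"],
    "amount": [3175, 2995, 7900],
})
chargebacks = pd.DataFrame({
    "chargeback_id": ["cb_1", "cb_2"],
    "transaction_id": ["tx_b", "tx_c"],
    "reason": ["fraud", "fraud"],
})

# A left join keeps every transaction; rows without a chargeback get NaN.
labeled = transactions.merge(
    chargebacks[["transaction_id", "chargeback_id"]],
    on="transaction_id",
    how="left",
)

# Hypothetical label: True when a chargeback exists for the transaction.
labeled["is_flagged_fraud"] = labeled["chargeback_id"].notna()
print(labeled)
```

This label only captures fraud that a customer has already flagged, so recent transactions may still be unlabeled fraud.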

In [2]:
chargebacks = pd.read_csv("chargebacks_table_analytics_challenge_sample.csv")
transactions = pd.read_csv("transactions_table_analytics_challenge_sample.csv")

In [3]:
chargebacks

Unnamed: 0,transaction_id,chargeback_id,amount,currency,reason,accrual_time
0,ntx_cjqem0h7l5eqo01yhiidesxqi,ncb_cjqy4sp3q184k01y9pyi10w2u,2995,USD,fraud,1/15/19 7:09 PM
1,ntx_cjql12p1j2t5l01xmslhyxdx9,ncb_cjr6o9w2ykfop0117w5s43gpj,7900,USD,fraud,1/21/19 6:36 PM
2,ntx_cjqnpm5s10kwb012q7pkv79ap,ncb_cjr6oh2epkfvq01xs68l3i2o4,3770,USD,fraud,1/21/19 6:42 PM
3,ntx_cjqnpm04a0l0i01071pdkwus3,ncb_cjr6oi98wkfxx012am2r0lal4,3770,USD,fraud,1/21/19 6:43 PM
4,ntx_cjqnplvbf0kvz011i1m72ga2x,ncb_cjr6ok2jrkfzs01xshkjykomu,3770,USD,fraud,1/21/19 6:44 PM
5,ntx_cjqnplqlz0kv5013lrkhvuc8b,ncb_cjr6op2ggkg6d012a3gt9iu0c,3770,USD,fraud,1/21/19 6:48 PM
6,ntx_cjqnplhi70kut011ia54g2937,ncb_cjr6or27ukg8601zlzww6t7k1,3770,USD,fraud,1/21/19 6:50 PM
7,ntx_cjqnplc9m0ktk012qwz2nrmh0,ncb_cjr6os2x3kgde0117b9z8by50,3770,USD,fraud,1/21/19 6:50 PM
8,ntx_cjqnpl7jb0kxt01071062vcuu,ncb_cjr6ous6xkgeg012amqgy8lsr,3770,USD,fraud,1/21/19 6:53 PM
9,ntx_cjqnpl2xm0ksv013l8iwgu2yr,ncb_cjr6owde8kgkk0117dld6ccx6,3770,USD,fraud,1/21/19 6:54 PM


In [4]:
transactions

Unnamed: 0,transaction_od,accrual_time,currency,amount,is_reccurring,fraud_score,card_id,transaction_data
0,ntx_cblgvoqla0oce012x89kpzfam,1/4/19 8:32 PM,USD,3175,False,33.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
1,ntx_cblgvoqla0oce012x89kpzfam,1/5/19 4:01 AM,USD,3176,False,44.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
2,ntx_cblgvoqla0oce012x89kpzfam,1/6/19 1:33 AM,USD,6351,False,38.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
3,ntx_cblgvoqla0oce012x89kpzfam,1/7/19 11:08 PM,USD,12703,False,32.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
4,ntx_cblgvoqla0oce012x89kpzfam,1/7/19 11:09 PM,USD,9527,False,31.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
5,ntx_cblgvoqla0oce012x89kpzfam,1/7/19 11:09 PM,USD,9527,False,28.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
6,ntx_cblgvoqla0oce012x89kpzfam,1/11/19 7:32 PM,USD,44456,False,42.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
7,ntx_cblgvoqla0oce012x89kpzfam,1/11/19 11:51 PM,USD,25624,False,41.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""..."
8,ntx_cblhjeush0og1011k9wdwrwtu,1/5/19 3:42 AM,USD,57106,False,66.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806609"", ""mcc"" : ""7513"", ""..."
9,ntx_cjqbet2v52js801xnzqvt068j,1/1/19 1:01 AM,USD,2395,False,17.0,ncard_cui3n8edp076m1xry4z0t7b9,"{""terminal_id"" : ""001 "", ""mcc"" : ""3509"", ""..."


In [11]:
# transaction_data holds a JSON-encoded string
transactions["transaction_data"][0]

'{"terminal_id" : "70806603", "mcc" : "7513", "mid" : "4445001800508", "merchant" : "U-HAUL-CTR-BAYSHORE #7086", "country" : "USA", "card_presence" : "present", "pan_entry_mode" : "manual"}'

In [28]:
# Expand transaction_data into separate columns
pd.concat([transactions, transactions["transaction_data"].apply(lambda x: pd.Series(json.loads(x)))], axis=1)

Unnamed: 0,transaction_od,accrual_time,currency,amount,is_reccurring,fraud_score,card_id,transaction_data,card_presence,country,mcc,merchant,mid,pan_entry_mode,terminal_id
0,ntx_cblgvoqla0oce012x89kpzfam,1/4/19 8:32 PM,USD,3175,False,33.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
1,ntx_cblgvoqla0oce012x89kpzfam,1/5/19 4:01 AM,USD,3176,False,44.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
2,ntx_cblgvoqla0oce012x89kpzfam,1/6/19 1:33 AM,USD,6351,False,38.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
3,ntx_cblgvoqla0oce012x89kpzfam,1/7/19 11:08 PM,USD,12703,False,32.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
4,ntx_cblgvoqla0oce012x89kpzfam,1/7/19 11:09 PM,USD,9527,False,31.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
5,ntx_cblgvoqla0oce012x89kpzfam,1/7/19 11:09 PM,USD,9527,False,28.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
6,ntx_cblgvoqla0oce012x89kpzfam,1/11/19 7:32 PM,USD,44456,False,42.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
7,ntx_cblgvoqla0oce012x89kpzfam,1/11/19 11:51 PM,USD,25624,False,41.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806603"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806603
8,ntx_cblhjeush0og1011k9wdwrwtu,1/5/19 3:42 AM,USD,57106,False,66.0,ncard_cmlzqr2sm07031xerv1xb4ov,"{""terminal_id"" : ""70806609"", ""mcc"" : ""7513"", ""...",present,USA,7513,U-HAUL-CTR-BAYSHORE #7086,4445001800508,manual,70806609
9,ntx_cjqbet2v52js801xnzqvt068j,1/1/19 1:01 AM,USD,2395,False,17.0,ncard_cui3n8edp076m1xry4z0t7b9,"{""terminal_id"" : ""001 "", ""mcc"" : ""3509"", ""...",present,USA,3509,MARRIOTT INDY 2552,30000105652001,manual,1
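
The cell above displays the expanded frame without storing it. As a possible alternative, the sketch below parses each JSON string once, builds all columns with `pd.json_normalize`, and keeps the result in a variable. The toy rows only mimic the shape of `transaction_data`, and the name `expanded` is an assumption. Codes such as `terminal_id` are kept as strings here so that values like `"001 "` are trimmed rather than turned into numbers.

```python
import json
import pandas as pd

# Toy frame mimicking the shape of transaction_data; values are illustrative.
transactions = pd.DataFrame({
    "amount": [3175, 2395],
    "transaction_data": [
        '{"terminal_id" : "70806603", "mcc" : "7513", "card_presence" : "present"}',
        '{"terminal_id" : "001 ", "mcc" : "3509", "card_presence" : "present"}',
    ],
})

# Parse each JSON string once, then build all columns in a single call.
records = transactions["transaction_data"].map(json.loads).tolist()
parsed = pd.json_normalize(records)
parsed.index = transactions.index  # align rows before concatenating

# Keep identifier-like codes as strings and strip stray whitespace.
parsed["terminal_id"] = parsed["terminal_id"].str.strip()

# Replace the raw JSON column with the parsed columns.
expanded = pd.concat(
    [transactions.drop(columns="transaction_data"), parsed], axis=1
)
print(expanded)
```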


# Fraud Detection Model

**Some key points for feature engineering:**

- Is the unit of fraud the transaction or the credit card? How do the fraud labels get assigned?
- It takes time for a customer to flag fraud, so some labels (Y) may be incorrect: isFraud != isFlaggedFraud.
- The dataset is probably imbalanced. Possible remedies: resampling, class weights, synthetic data, robust algorithms, or a different metric (AUC).
- This is time-series data. Possible methods: embeddings, RNNs. However, time itself is not the most important factor: the difficulty comes not from the nature of fraud changing radically over time, but from the customers in the dataset changing radically over time.
- More data can be helpful, e.g. card information, distance, the number of addresses associated with the payment card, days since the previous transaction, IP, digital signature, fraud history, etc.
- Feature importance: PCA, LGBM
- Drop categorical columns that can identify a customer (ID, address, email) to prevent overfitting, since the train and test sets probably share only a few customers.
- Add aggregate features by group; clustering customers (credit cards) can also help prediction.
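
The aggregate-feature and "time since previous transaction" ideas above could look roughly like this with pandas. It is a minimal sketch on toy rows that reuse the `card_id`, `accrual_time`, and `amount` columns from the transactions table; the derived feature names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows in the same shape as the transactions table.
tx = pd.DataFrame({
    "card_id": ["card_a", "card_a", "card_a", "card_b"],
    "accrual_time": [
        "1/4/19 8:32 PM", "1/5/19 4:01 AM", "1/7/19 11:08 PM", "1/1/19 1:01 AM",
    ],
    "amount": [3175, 3176, 12703, 2395],
})

# Parse timestamps and order each card's transactions chronologically.
tx["accrual_time"] = pd.to_datetime(tx["accrual_time"], format="%m/%d/%y %I:%M %p")
tx = tx.sort_values(["card_id", "accrual_time"]).reset_index(drop=True)

# Per-card aggregates broadcast back onto every row of that card.
tx["card_tx_count"] = tx.groupby("card_id")["amount"].transform("count")
tx["card_amount_mean"] = tx.groupby("card_id")["amount"].transform("mean")
tx["amount_to_card_mean"] = tx["amount"] / tx["card_amount_mean"]

# Hours since the same card's previous transaction (NaN for the first one).
gap = tx.groupby("card_id")["accrual_time"].diff()
tx["hours_since_prev_tx"] = gap.dt.total_seconds() / 3600
print(tx)
```

Aggregates computed over the whole table use each card's future transactions as well, so for a realistic evaluation they would need to be restricted to data available at transaction time.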


**Some key points for modeling:**

- Candidate models that accept categorical features: XGBoost, LGBM, (NN).
- Cross-validation: try multiple time-based splitting strategies.
- Blending multiple models is usually better.
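
One time-based cross-validation strategy is an expanding window: sort by time, cut the data into consecutive blocks, and validate on each block using only the earlier blocks for training. The sketch below implements that with NumPy and pandas only. The helper `expanding_time_folds` is hypothetical and written for illustration. It also computes a negative-to-positive ratio per fold, the kind of value that could be passed as a class weight (for example the `scale_pos_weight` parameter in XGBoost or LightGBM) to address imbalance.

```python
import numpy as np
import pandas as pd


def expanding_time_folds(times, n_folds=3):
    """Yield (train_idx, valid_idx) positions; validation is never earlier than training."""
    order = np.argsort(pd.to_datetime(times).to_numpy(), kind="stable")
    blocks = np.array_split(order, n_folds + 1)
    for k in range(1, n_folds + 1):
        # Train on all blocks before block k, validate on block k.
        yield np.concatenate(blocks[:k]), blocks[k]


# Toy data: eight daily transactions, two of them labeled as fraud.
times = pd.Series(pd.date_range("2019-01-01", periods=8, freq="D"))
y = np.array([0, 0, 1, 0, 0, 0, 1, 0])

folds = list(expanding_time_folds(times, n_folds=3))
for train_idx, valid_idx in folds:
    n_pos = int(y[train_idx].sum())
    n_neg = len(train_idx) - n_pos
    # Ratio of negatives to positives; undefined when a fold has no positives.
    pos_weight = n_neg / n_pos if n_pos else float("nan")
    print(len(train_idx), len(valid_idx), pos_weight)
```

Other time-based strategies, such as a fixed-size sliding window or grouping folds by month, fit the same loop by changing how the blocks are built.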