# Business Problem and Contextualization

We will develop a fraud detection solution for a company that specializes in this service. Its business model is as follows:
* The company receives 25% of the value of each transaction correctly labeled as fraud;
* The company receives 5% of the value of each transaction labeled as fraud that is actually legitimate;
* The company refunds 100% of the value of each transaction labeled as legitimate that is actually fraudulent.

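This is an aggressive business model that depends heavily on the solution's ability to minimize false negatives: a single missed fraud costs the full transaction value, four times what a correctly flagged fraud of the same value earns.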
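To make the payoff rules above concrete, here is a minimal sketch of how net revenue could be computed from a set of predictions. The function name `net_revenue` and its arguments are hypothetical, and the sketch assumes that correctly labeled legitimate transactions generate neither revenue nor cost, which the business model does not state explicitly.

```python
def net_revenue(amounts, y_true, y_pred):
    """Net revenue under the business model described above (1 = fraud, 0 = legitimate)."""
    total = 0.0
    for amount, actual, predicted in zip(amounts, y_true, y_pred):
        if predicted == 1 and actual == 1:
            total += 0.25 * amount   # true positive: 25% of the transaction value
        elif predicted == 1 and actual == 0:
            total += 0.05 * amount   # false positive: 5% of the transaction value
        elif predicted == 0 and actual == 1:
            total -= 1.00 * amount   # false negative: 100% is given back
        # true negative: assumed to contribute nothing (not stated in the source)
    return total


# Toy example with one transaction of each kind, all worth 1000
amounts = [1000.0, 1000.0, 1000.0, 1000.0]
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
example_total = net_revenue(amounts, y_true, y_pred)  # 250 + 50 - 1000 = -700.0
print(example_total)
```

In this toy case a single false negative outweighs the combined gain from the true positive and the false positive, which is why a metric of this kind may be more informative than accuracy when comparing models later on.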

The solution will be delivered as an API endpoint that receives a transaction, classifies it, and tells the client whether it is legitimate or fraudulent.

# 0. Imports

In [16]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split

# 1. Configs and Helper Functions 

# 2. Load Data 

In [3]:
df = pd.read_csv('data/PS_20174392719_1491204439457_log.csv')

In [4]:
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


## 2.1. Train Test Split

In [17]:
Counter(df['isFraud'])

Counter({0: 6354407, 1: 8213})

In [21]:
# isFraud is a 0/1 column, so its mean is the share of fraudulent rows
df['isFraud'].mean()

0.001290820448180152

Because the dataset is highly imbalanced (only ~0.13% of rows are fraudulent), we use a stratified train-test split to keep the same proportion of legitimate and fraudulent transactions in the train and test sets.

In [22]:
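# Separate the features from the target label
X = df.drop(['isFraud'], axis=1)
y = df['isFraud'].copy()

# stratify=y preserves the fraud/legitimate ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4, stratify=y)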

In [24]:
print('Proportion of fraudulent transactions on train set: ', Counter(y_train)[1] / (Counter(y_train)[0] + Counter(y_train)[1]))
print('Proportion of fraudulent transactions on test set: ', Counter(y_test)[1] / (Counter(y_test)[0] + Counter(y_test)[1]))

Proportion of fraudulent transactions on train set:  0.0012908885972289176
Proportion of fraudulent transactions on test set:  0.0012906820849992737
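
To illustrate what stratification does conceptually, here is a minimal pure-Python sketch on toy labels. It is not scikit-learn's implementation; the helper `stratified_split_indices` and the toy label list are hypothetical and only serve to show that splitting each class separately keeps the class proportions equal across the two sets.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size, seed=0):
    """Split indices into train/test, sampling the same fraction from each class."""
    rng = random.Random(seed)

    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy data: 10 fraudulent rows out of 1000 (1% fraud)
toy_labels = [1] * 10 + [0] * 990
toy_train, toy_test = stratified_split_indices(toy_labels, test_size=0.3)

train_rate = sum(toy_labels[i] for i in toy_train) / len(toy_train)
test_rate = sum(toy_labels[i] for i in toy_test) / len(toy_test)
print(train_rate, test_rate)
```

Both rates come out at the original 1% because the rare class is split on its own rather than left to chance, which is the same effect observed in the proportions printed above.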
