# Business Problem and Contextualization

We will develop a fraud detection solution for a company that specializes in this service. Its business model is the following:
* The company receives 25% of the value of each transaction correctly labeled as fraudulent;
* The company receives 5% of the value of each transaction labeled as fraudulent that is actually legitimate;
* The company refunds 100% of the value of each transaction labeled as legitimate that is actually fraudulent.

This is an aggressive business model that depends heavily on the solution's ability to minimize false negatives.
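The fee rules above can be turned into a small scoring function, which makes the cost of a false negative concrete. This is only a sketch: the function name `business_result` and the toy values are hypothetical, and it assumes that a legitimate transaction labeled as legitimate generates neither revenue nor cost, which the description does not state.

```python
def business_result(amounts, y_true, y_pred):
    """Net result of the fee model for a batch of classified transactions.

    amounts: transaction values
    y_true:  actual labels (1 = fraud, 0 = legitimate)
    y_pred:  predicted labels (1 = fraud, 0 = legitimate)
    """
    total = 0.0
    for amount, actual, predicted in zip(amounts, y_true, y_pred):
        if predicted == 1 and actual == 1:
            total += 0.25 * amount  # fraud correctly labeled: 25% fee
        elif predicted == 1 and actual == 0:
            total += 0.05 * amount  # legitimate labeled as fraud: 5% fee
        elif predicted == 0 and actual == 1:
            total -= 1.00 * amount  # missed fraud: 100% refunded
        # predicted == 0 and actual == 0: assumed to be worth nothing
    return total


# Toy batch with one transaction of each kind, all with the same value
amounts = [1000.0, 1000.0, 1000.0, 1000.0]
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]

result = business_result(amounts, y_true, y_pred)
print(result)
```

For equal transaction values, one missed fraud cancels the fees earned on four correctly detected frauds, which is why false negatives dominate the result.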

The solution will be delivered as an API endpoint that receives a transaction, classifies it, and tells the client whether it is legitimate or fraudulent.

The dataset comes from a mobile money service that provides a wallet and lets users transfer money to each other.

# 0. Imports

In [None]:
import pandas as pd
import plotly.express as px

from collections import Counter
from sklearn.model_selection import train_test_split

# 1. Configs and Helper Functions 

## 1.1. Configs

In [None]:
pd.options.display.float_format = '{:.4f}'.format

# 2. Load Data 

In [None]:
df = pd.read_csv('data/PS_20174392719_1491204439457_log.csv')

In [None]:
df.head()

## 2.1. Train Test Split

In [None]:
Counter(df['isFraud'])

In [None]:
# share of fraudulent rows (isFraud is 0/1, so its mean is the fraud rate)
df['isFraud'].mean()

As the dataset is imbalanced (only ~0.13% of rows are fraudulent), we need a stratified train-test split to keep the same proportion of legitimate and fraudulent transactions in the train and test sets.

In [None]:
X = df.drop(['isFraud'], axis=1)
y = df['isFraud'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4, stratify=y)

In [None]:
# the labels are 0/1, so the mean of each split is its fraud proportion
print('Proportion of fraudulent transactions on train set: ', y_train.mean())
print('Proportion of fraudulent transactions on test set: ', y_test.mean())

# 3. Data Description

In [None]:
# first, let's merge X_train and y_train for easier data manipulation and analysis
df = pd.concat([X_train, y_train], axis = 1)

df.head()

## 3.1. Columns content

* step: maps a unit of time in the real world. In this case 1 step is 1 hour of time. There are 744 steps in total (744 hours, i.e. 31 days);
* type: the type of the transaction being made. We will assume the following meaning for each:
  * CASH_IN: depositing money into the wallet;
  * CASH_OUT: withdrawing money from the wallet;
  * DEBIT: similar to a payment;
  * PAYMENT: paying an invoice;
  * TRANSFER: transferring money to another user;
* amount: amount of the transaction in local currency;
* nameOrig: source customer who made the transaction;
* oldbalanceOrg: balance of the source customer before the transaction;
* newbalanceOrig: balance of the source customer after the transaction;
* nameDest: destination customer who is the recipient of the transaction;
* oldbalanceDest: balance of the destination customer before the transaction. Important: for customers whose name starts with M (Merchants), this information is not filled in;
* newbalanceDest: balance of the destination customer after the transaction. Important: for customers whose name starts with M (Merchants), this information is not filled in;
* isFlaggedFraud: the business model aims to control massive transfers from one account to another and flags suspicious attempts. A suspicious attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction;
* isFraud: flags whether the transaction is fraudulent (1) or genuine (0). In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying the funds by transferring them to another account, which then cashes out of the system.
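Two of the descriptions above translate directly into column logic: the flagging rule (a transfer of more than 200,000 in a single transaction) and the merchant convention (names starting with M). The sketch below applies both to a tiny made-up frame; the rows and the column names `expected_flag` and `dest_is_merchant` are hypothetical, and whether the real `isFlaggedFraud` column follows the rule exactly is something to verify on the data rather than assume.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'type': ['TRANSFER', 'TRANSFER', 'DEBIT', 'PAYMENT'],
    'amount': [250000.0, 1500.0, 300000.0, 80.0],
    'nameDest': ['C1001', 'C1002', 'C1003', 'M2001'],
})

# Rule as described: a transfer of more than 200,000 in one transaction
toy['expected_flag'] = (
    (toy['type'] == 'TRANSFER') & (toy['amount'] > 200_000)
).astype(int)

# Merchants are identified by a destination name starting with 'M'
toy['dest_is_merchant'] = toy['nameDest'].str.startswith('M')

print(toy)
```

Comparing a column like `expected_flag` against `isFlaggedFraud` on the training set would show how closely the stored flag matches the stated rule.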

## 3.2. Data Dimension

In [None]:
print('Number of rows: ', df.shape[0])
print('Number of columns: ', df.shape[1])

## 3.3. Data Types

In [None]:
df.dtypes

No adjustments to the data types are needed.

## 3.4. Check NAs

In [None]:
df.isna().sum()

There are no NAs in the dataset.

## 3.5. Descriptive Statistics

In [None]:
num_attributes = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
cat_attributes = ['type', 'nameOrig', 'nameDest', 'isFlaggedFraud', 'isFraud']

### 3.5.1. Numerical attributes

In [None]:
df[num_attributes].describe()

In [None]:
df[num_attributes].skew()

In [None]:
df[num_attributes].kurtosis()

#### 3.5.1.1. amount

In [None]:
fig = px.histogram(df, x='amount', nbins=10)
fig.show()

The statistics and the histogram already show a strong asymmetry in the distribution of amount, so we will have to investigate this variable further.
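One common way to look more closely at a heavily right-skewed variable is to view it on a log scale. The sketch below uses `np.log1p` on a handful of made-up values (the series `toy_amount` is hypothetical, not taken from the dataset) and compares the skewness before and after; it is one possible starting point, not necessarily the treatment adopted later.

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed values standing in for 'amount'
toy_amount = pd.Series([10.0, 25.0, 40.0, 60.0, 120.0, 900.0, 15000.0, 2_000_000.0])

# log1p(x) = log(1 + x) compresses large values and is defined at zero
log_amount = np.log1p(toy_amount)

print('skew (raw):  ', toy_amount.skew())
print('skew (log1p):', log_amount.skew())
```

The same transformation could be applied to `df['amount']` before plotting the histogram, so that the bulk of the small transactions is no longer squeezed into a single bin.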

### 3.5.2. Categorical attributes

In [None]:
df[cat_attributes].head()

#### 3.5.2.1. type

In [None]:
Counter(df['type'])

In [None]:
fig = px.histogram(df, x='type')

fig.show()

In [None]:
fig = px.histogram(df, x='type', color='isFraud', barmode = 'group')

fig.show()

In [None]:
# number of transactions per type, split by fraud label
df.groupby(['type', 'isFraud'])['isFraud'].count()