# Goal

The main aim of these transformations is to prepare the dataset for model training without making too many assumptions about the 'real' data. This is why some of the transformations might appear unnecessary or even detrimental to the accuracy of the model (e.g. creating artificial time features), but they are expected to better match what the data coming through the pipeline could look like.

# 0. Setup

### Necessary imports

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

### Variable setup

In [None]:
TRANSACTIONS_FILE = "../data/transactions.csv"
# Set a sample rate below 1 (e.g. 0.1) for easier local exploration
SAMPLE_RATE = 1

# 1. Exploratory Data Analysis

In [None]:
df = pd.read_csv(TRANSACTIONS_FILE)

skip_index = int(1 / SAMPLE_RATE)
df = df.iloc[::skip_index, :]
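Note that the stride-based sampling above keeps every `skip_index`-th row, so the effective rate is `1 / int(1 / SAMPLE_RATE)`, which equals `SAMPLE_RATE` only when `1 / SAMPLE_RATE` is a whole number. The sketch below illustrates the difference on a toy frame and compares it with `DataFrame.sample`, which is offered here as a possible alternative and is not what this notebook uses. The frame, the rate of 0.4 and the seed are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the transactions table (illustration only).
toy = pd.DataFrame({"amount": range(100)})
sample_rate = 0.4  # hypothetical rate, chosen to show the rounding effect

# Stride sampling as in the cell above: int(1 / 0.4) == 2, so every 2nd row is kept.
skip_index = int(1 / sample_rate)
strided = toy.iloc[::skip_index, :]

# Random sampling keeps the requested fraction of rows; the seed makes it repeatable.
random_sample = toy.sample(frac=sample_rate, random_state=0)

print(len(strided), len(random_sample))
```

Stride sampling preserves row order, which matters if the rows are sorted by `step`; random sampling would need a `sort_index()` afterwards to restore that order.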

Let's check for any missing data first

In [None]:
df.isnull().values.any()

In [None]:
df.head()

While there are no NaN values in the dataset, many rows have a value of 0 in the oldbalanceDest and newbalanceDest columns. As the authors of the dataset explain, these values are unavailable for merchant accounts (names starting with M). These rows are dealt with later.
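One way to check how the zero balances relate to merchant accounts is to compute, per destination type, the share of rows where both destination balances are zero. The sketch below uses a small hand-made frame (the rows are illustrative, not taken from the dataset) with the column names used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from TRANSACTIONS_FILE.
toy = pd.DataFrame(
    {
        "nameDest": ["M100", "M200", "C300", "C400"],
        "oldbalanceDest": [0.0, 0.0, 500.0, 0.0],
        "newbalanceDest": [0.0, 0.0, 750.0, 0.0],
    }
)

# First letter of the account name: "M" for merchants, "C" for customers.
dest_type = toy["nameDest"].str[0]
# True where both destination balances are reported as zero.
both_zero = (toy["oldbalanceDest"] == 0) & (toy["newbalanceDest"] == 0)

# Mean of a boolean column per group is the share of True values in that group.
zero_share = both_zero.groupby(dest_type).mean()
print(zero_share)
```

On the real data, a share of 1.0 for "M" would be consistent with the authors' explanation, while a non-zero share for "C" would point to zero balances that need a separate look.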

Let's check which transaction types are the most likely to be fraudulent

In [None]:
df.loc[(df.isFraud == 1)].type.value_counts()

Fraudulent transactions occur with only two transaction types: 'TRANSFER', where money is sent to a customer / fraudster, and 'CASH_OUT', where money is sent to a merchant who pays the customer / fraudster in cash. This appears to reflect the assumed modus operandi: the fraud is committed by first transferring funds to another account, which then cashes them out.
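Raw counts of fraudulent rows depend on how common each transaction type is, so a per-type fraud rate can be a useful complement. The following is a minimal sketch on made-up rows, assuming the `type` and `isFraud` columns used above.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT", "PAYMENT", "DEBIT"],
        "isFraud": [1, 0, 1, 0, 0, 0],
    }
)

# Mean of a 0/1 column per group is the share of fraudulent rows in that group.
fraud_rate_by_type = (
    toy.groupby("type")["isFraud"].mean().sort_values(ascending=False)
)
print(fraud_rate_by_type)
```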

In [None]:
df_flagged = df[(df.isFlaggedFraud == 1)]
print(
    f"Number of isFlaggedFraud rows: {len(df_flagged)}\n"
    f"Amount range: {int(df_flagged.amount.min())}$ - {int(df_flagged.amount.max())}$"
)

df_flagged_wrong = df.loc[
    (df.type == "TRANSFER") & (df.amount > 200000) & (df.isFlaggedFraud == 0)
]
print(
    f"Number of transactions above 200,000$ without isFlaggedFraud: {len(df_flagged_wrong)}"
)

The meaning of the 'isFlaggedFraud' feature is unclear. As per the dataset description, isFlaggedFraud should be 1 when an attempt is made to 'TRANSFER' an 'amount' greater than 200,000. As shown above, there are cases where this does not hold. Since the flag is set in only 16 cases (0.00025%) with no apparent logic, it can be dropped without losing too much information.
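The mismatch can also be quantified by recomputing the flag from the documented rule and comparing it with the recorded column. This is a sketch on made-up rows; the 200,000 threshold comes from the dataset description quoted above.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "type": ["TRANSFER", "TRANSFER", "TRANSFER", "CASH_OUT"],
        "amount": [250_000.0, 300_000.0, 1_000.0, 400_000.0],
        "isFlaggedFraud": [1, 0, 0, 0],
    }
)

# Flag value the dataset description implies: TRANSFER with amount > 200,000.
expected_flag = (
    (toy["type"] == "TRANSFER") & (toy["amount"] > 200_000)
).astype(int)

# Rows where the recorded flag disagrees with the documented rule.
mismatches = toy[expected_flag != toy["isFlaggedFraud"]]
print(f"{len(mismatches)} of {len(toy)} rows disagree with the documented rule")
```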

### Merchant/Customers and their payments

In [None]:
df_merchant_origin = df[(df.nameOrig.str.startswith("M"))]
print(f"Number of transactions originated by merchants: {len(df_merchant_origin)}")

df_merchant_dest = df[(df.nameDest.str.startswith("M"))]
print(f"Number of transactions sent to merchants: {len(df_merchant_dest)}")

print(f"Transaction types to Merchants: \n{df_merchant_dest.type.value_counts()}")

Merchants occur only among destination accounts, with no transaction originating from a merchant. Merchants receive only 'PAYMENT' type transactions. 

As all of the transactions are originated by customers, the transaction type distribution for customer-originated transactions is the same as for the entire dataset.

In [None]:
df_customer_dest = df[(df.nameDest.str.startswith("C"))]
print(f"Number of transactions sent to customers: {len(df_customer_dest)}")

print(f"Transaction types to Customers: \n{df_customer_dest.type.value_counts()}")

## Time distribution

In [None]:
df.step.hist(bins=30)

In [None]:
# TODO: hourly distribution
df_fraud = df[(df.isFraud == 1)]

df_fraud.step.hist(bins=30)
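A possible starting point for the hourly distribution mentioned in the TODO, assuming each step corresponds to one hour: taking `step % 24` groups steps that fall on the same hour of a 24-hour cycle. Which clock hour step 1 maps to is not known, so the resulting values are positions within the cycle and not times of day. The rows and the `hour_of_cycle` column name are illustrative.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame(
    {"step": [1, 2, 25, 26, 49, 30], "isFraud": [0, 1, 0, 1, 0, 0]}
)

# Assumption: one step is one hour, so step % 24 is the position in a 24-hour cycle.
toy = toy.assign(hour_of_cycle=toy["step"] % 24)

# Count fraudulent transactions per position in the cycle.
fraud_by_hour = (
    toy.loc[toy["isFraud"] == 1, "hour_of_cycle"].value_counts().sort_index()
)
print(fraud_by_hour)
```

On the full dataset, `fraud_by_hour.plot.bar()` would give a per-hour view to set beside the step histograms above.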

While the dataset has several limitations (an inconsistent isFlaggedFraud flag, limited merchant behavior), it is still an interesting starting point for the project. Since a model that performs perfectly in real-world conditions is not the main aim, these limitations are acceptable.

# Data cleaning

In [None]:
# TODO: zero values for balance
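One possible treatment for the zero balances, not necessarily the one this project will adopt: since destination balances are unavailable for merchants, the zeros on merchant rows could be replaced with NaN so that they are not read as real empty balances. The sketch below uses made-up rows and the column names from this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "nameDest": ["M100", "C300", "C400"],
        "oldbalanceDest": [0.0, 500.0, 0.0],
        "newbalanceDest": [0.0, 750.0, 0.0],
    }
)

is_merchant = toy["nameDest"].str.startswith("M")
balance_cols = ["oldbalanceDest", "newbalanceDest"]

# Work on a copy so the original frame stays untouched.
cleaned = toy.copy()
# Mark merchant balances as missing; customer rows keep their values, zeros included.
cleaned.loc[is_merchant, balance_cols] = np.nan
print(cleaned)
```

Whether NaN is acceptable depends on the downstream model; an explicit indicator column is another option.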

# Feature engineering

## Convert time

One could expect time to play a major role in determining whether a transaction is fraudulent (e.g. a transaction at 03:00 on Tuesday is more suspicious than one at 18:00 on Friday). The generator provides data in 'steps', each corresponding to one hour passing. Unfortunately

In [None]:
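# isFlaggedFraud is dropped because the EDA above found no consistent logic behind it
columns_to_drop = ["isFlaggedFraud"]

In [None]:
# Keep the cleaned frame under the name used by the cells below
relevant = df.drop(columns=columns_to_drop)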

In [None]:
relevant.head()

## Create customer type

In [None]:
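# Only customers originate payments, so nameOrig always starts with "C";
# strip the one-letter prefix and keep the numeric part as an ID.
relevant["source_id"] = relevant["nameOrig"].str[1:].astype(int)
# Destinations can be customers ("C") or merchants ("M"), so keep the prefix as dest_type.
relevant["dest_id"] = relevant["nameDest"].str[1:].astype(int)
relevant["dest_type"] = relevant["nameDest"].str[0].astype(str)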

relevant = relevant.drop(columns=["nameOrig", "nameDest"])

In [None]:
relevant.head()

In [None]:
# Print both counts; a notebook cell only displays its last bare expression
print((relevant["oldbalanceDest"] == 0).sum())
print((relevant["dest_type"] == "M").sum())

In [None]:
270439 / len(relevant)

In [None]:
relevant[(relevant["oldbalanceDest"] < 0.01) & (relevant["newbalanceDest"] < 0.01)]

In [None]:
relevant.dtypes