 # The Problem

 ### Fraud Blocker Company
 Fraud Blocker Company specializes in detecting fraud in financial transactions made through mobile devices. Its service, __Fraud Blocker__, guarantees that fraudulent transactions are blocked.

 The product runs as a service and is monetized by performance: the client pays a fixed fee on each successful detection of fraud in its transactions.

 However, Fraud Blocker Company is expanding in Brazil and, to acquire customers more quickly, has adopted a very aggressive strategy. The strategy works as follows:

 1. The company will receive 25% of the value of each transaction that is correctly detected as fraud.
 2. The company will return 5% of the value of each transaction that is detected as fraud but is actually legitimate.
 3. The company will return 100% of the value to the customer for each transaction that is detected as legitimate but is actually a fraud.

 With this aggressive strategy, the company assumes the risk of failing to detect fraud and is paid for detecting it correctly.
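 The three rules can be turned into a single revenue figure for a batch of transactions. The sketch below is only an illustration: the function name `net_revenue` and the toy batch are hypothetical, and rule 2 is read as a 5% refund, as worded above.

```python
import numpy as np

# Rates taken from the three rules of the strategy.
FEE_ON_CAUGHT_FRAUD = 0.25      # rule 1: fraud detected as fraud
REFUND_ON_FALSE_ALARM = 0.05    # rule 2: legitimate detected as fraud
REFUND_ON_MISSED_FRAUD = 1.00   # rule 3: fraud detected as legitimate


def net_revenue(amount, is_fraud, predicted_fraud):
    """Return the company's net revenue over a batch of transactions."""
    amount = np.asarray(amount, dtype=float)
    is_fraud = np.asarray(is_fraud, dtype=bool)
    predicted_fraud = np.asarray(predicted_fraud, dtype=bool)

    caught = is_fraud & predicted_fraud
    false_alarm = ~is_fraud & predicted_fraud
    missed = is_fraud & ~predicted_fraud

    return (FEE_ON_CAUGHT_FRAUD * amount[caught].sum()
            - REFUND_ON_FALSE_ALARM * amount[false_alarm].sum()
            - REFUND_ON_MISSED_FRAUD * amount[missed].sum())


# Hypothetical batch: one caught fraud, one false alarm, one missed fraud
# and one legitimate transaction that was correctly left alone.
amounts = [1000.0, 200.0, 400.0, 50.0]
truth = [1, 0, 1, 0]
preds = [1, 1, 0, 0]

# 0.25 * 1000 - 0.05 * 200 - 1.00 * 400 = -160
revenue = net_revenue(amounts, truth, preds)
print(revenue)
```

 A single missed fraud wipes out the fee earned on a caught fraud more than twice its size, which is why missed frauds matter far more than false alarms under these rules.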

 For the client, hiring Fraud Blocker Company is an excellent deal. Although the 25% fee charged on success is very high, the client reduces its costs through correctly detected fraudulent transactions, and even the damage caused by an error in the anti-fraud service is covered by Fraud Blocker Company itself.

 For the company, besides attracting many customers with this risky strategy of guaranteeing reimbursement when fraud goes undetected, revenue depends only on the precision and accuracy of the models built by its Data Scientists: the more accurate the “Fraud Blocker” model, the higher the company's revenue. However, if the model has low precision, the company could suffer a huge loss.

 # Data Collection

 We will solve this problem using the Synthetic Financial Datasets For Fraud Detection dataset hosted on Kaggle.



 ## Content


 PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, which is the provider of the mobile financial service and currently runs it in more than 14 countries around the world.

 This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.



 ## Headers


 Here is a sample row, followed by an explanation of each header:
 1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

 step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (a 31-day simulation at one step per hour).

 type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

 amount - amount of the transaction in local currency.

 nameOrig - customer who started the transaction

 oldbalanceOrg - initial balance before the transaction

 newbalanceOrig - new balance after the transaction

 nameDest - customer who is the recipient of the transaction

 oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose name starts with M (Merchants).

 newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose name starts with M (Merchants).

 isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

 isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
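 To make the layout concrete, the sample row can be parsed with pandas. This is a minimal sketch that assumes the columns appear in the same order as the descriptions above; the `flag_rule` expression is only our reading of the isFlaggedFraud description, not the simulator's code.

```python
import io

import pandas as pd

# Column order assumed to follow the header descriptions.
COLUMNS = [
    'step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
    'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
    'isFlaggedFraud',
]

SAMPLE_ROW = ('1,PAYMENT,1060.31,C429214117,1089.0,28.69,'
              'M1591654462,0.0,0.0,0,0')

sample = pd.read_csv(io.StringIO(SAMPLE_ROW), names=COLUMNS)

# The origin balance drops by the amount paid (up to float rounding).
paid = sample['oldbalanceOrg'] - sample['newbalanceOrig']

# Flagging rule as described: a single TRANSFER of more than 200,000.
flag_rule = (sample['type'] == 'TRANSFER') & (sample['amount'] > 200_000)

print(sample.T)
```

 In this row the recipient is a merchant (its name starts with M), which is why both destination balances are 0.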

 # Load Modules and Data

In [1]:
import warnings
from itertools import product
from pathlib import Path

import numpy as np
import plotly.express as px
import seaborn as sns
from IPython.display import display
from pandasgui import show
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

from fraud_detection import pipelines

warnings.filterwarnings("ignore", category=DeprecationWarning)

df_subsample = pipelines.make_subsample()

 # Data Cleaning

 First, let's take a look at the data.

 Based on the description of the data set:

 - nameOrig and nameDest each carry two pieces of information: the customer id and the customer type. The type may be useful, so let's split it out.

 - there is no balance information for merchants, yet their balances are shown as 0. It is more accurate to replace these values with NaN.

 - step is the moment in time when a transaction occurs, where each step corresponds to one hour. The time elapsed since the start of the simulation is unlikely to help predict fraud. However, thieves might prefer certain hours of the day or days of the week, so we can turn this attribute into potentially useful features.

 - isFlaggedFraud is redundant with the amount information and can be discarded.

 Also, after a brief analysis:

 - let's fix the naming inconsistency between 'Org' and 'Orig' across columns.

 Finally, the int attributes actually represent categories and are better stored with a categorical dtype.

In [2]:
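def fix_representation(df):
    """Normalize column names, missing values, time features and dtypes."""
    df = df.copy()

    # Make the origin balance name consistent with newbalanceOrig.
    df = df.rename(columns={'oldbalanceOrg': 'oldbalanceOrig'})

    # The first character of each name encodes the client type
    # ('C' for customer, 'M' for merchant).
    for endp in 'Orig', 'Dest':
        df['clientType' + endp] = df['name' + endp].str[0]

    # Merchant balances are not tracked and show up as 0: mark them missing.
    for balance, endp in product(('old', 'new'), ('Orig', 'Dest')):
        merchants = df['clientType' + endp] == 'M'
        name = balance + 'balance' + endp
        df.loc[merchants, name] = df.loc[merchants, name].replace({0: np.nan})

    # Replace the absolute step (hours since start) with cyclic time features.
    df['hourOfDay'] = df.step % 24
    df['dayOfWeek'] = (df.step // 24) % 7
    df = df.drop(columns=['step', 'isFlaggedFraud'])

    # Every non-float column is categorical. astype with a mapping replaces
    # the columns, whereas assigning through .loc may keep the old dtype.
    categorical = df.columns[df.dtypes != float]
    df = df.astype({col: 'category' for col in categorical})

    return df.reset_index(drop=True)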


pipe = Pipeline([
    ('fix_representation', FunctionTransformer(fix_representation)),
])

 Let's check whether the balance changes are consistent with the amounts.

In [3]:
df_subsample = pipe.transform(pipelines.make_subsample())

df_subsample['changeOrig'] = (df_subsample.oldbalanceOrig -
                              df_subsample.newbalanceOrig)

df_subsample['changeDest'] = (df_subsample.newbalanceDest -
                              df_subsample.oldbalanceDest)

# Colour each point by whether the origin balance change equals the amount.
consistent = df_subsample['changeOrig'] == df_subsample.amount

px.scatter_matrix(
    data_frame=df_subsample,
    dimensions=['changeOrig', 'changeDest', 'amount'],
    color=np.where(consistent, 'Consistent', 'Inconsistent'))

 From changeOrig, there are some fixable inconsistencies:

 - some of the changes are shown as 0 even though the amount is greater than 0; these balances are better treated as unknown

 - some of the changes are the exact negative of the amount, probably because the old and new balances were swapped

In [4]:
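def fix_balance_change(df):
    """Repair balances whose change is inconsistent with the amount."""
    df = df.copy()

    # Temporary columns, oriented so that a consistent change equals +amount.
    df['changeOrig'] = df.oldbalanceOrig - df.newbalanceOrig
    df['changeDest'] = df.newbalanceDest - df.oldbalanceDest

    for endp in 'Orig', 'Dest':
        name = 'change' + endp
        cols = ['oldbalance' + endp, 'newbalance' + endp]

        # Money moved but the balance did not: treat both balances as unknown.
        rows = (df.amount > 0) & (df[name] == 0)
        df.loc[rows, cols] = np.nan

        # The change is (approximately) the negative of the amount: the old
        # and new balances were probably swapped, so swap them back.
        rows = abs((df[name] + df.amount) / (df.amount + 1)) < 1e-3
        df.loc[rows, cols] = df.loc[rows, cols[::-1]].values

    return df.drop(columns=['changeOrig', 'changeDest'])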


pipe = Pipeline([
    ('fix_representation', FunctionTransformer(fix_representation)),
    ('fix_balance_change', FunctionTransformer(fix_balance_change)),
])
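 The two balance fixes can be checked on a toy frame. This sketch re-applies the same rules to the origin columns only; the three rows are made up (one consistent, one with an unchanged balance, one with the balances swapped). The `+ 1` in the denominator keeps the ratio defined when the amount is 0.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: consistent, unchanged balance, swapped balances.
toy = pd.DataFrame({
    'amount': [100.0, 100.0, 100.0],
    'oldbalanceOrig': [500.0, 500.0, 400.0],
    'newbalanceOrig': [400.0, 500.0, 500.0],
})
cols = ['oldbalanceOrig', 'newbalanceOrig']
change = toy.oldbalanceOrig - toy.newbalanceOrig  # 100, 0, -100

# Rule 1: the amount moved but the balance did not -> balances unknown.
unchanged = (toy.amount > 0) & (change == 0)
toy.loc[unchanged, cols] = np.nan

# Rule 2: the change is the negative of the amount -> swap old and new.
negated = ((change + toy.amount) / (toy.amount + 1)).abs() < 1e-3
toy.loc[negated, cols] = toy.loc[negated, cols[::-1]].values

print(toy)
```

 After both rules, the first row is untouched, the second has missing balances, and the third reads 500 before and 400 after the transaction.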

 # Exploratory Data Analysis

 Let's take a look at how frequent the categories are in this data set.

In [5]:
df_subsample = pipe.transform(pipelines.make_subsample())

df_subsample.loc[:, df_subsample.dtypes != float].describe()

| | type | nameOrig | nameDest | isFraud | clientTypeOrig | clientTypeDest | hourOfDay | dayOfWeek |
|---|---|---|---|---|---|---|---|---|
| count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 |
| unique | 5 | 50000 | 48138 | 2 | 1 | 2 | 24 | 7 |
| top | CASH_OUT | C1000019422 | C142211154 | 0 | C | C | 19 | 0 |
| freq | 17422 | 1 | 4 | 49939 | 50000 | 32839 | 5013 | 12106 |


 From this we can see that there are as many transaction origins as samples, so nameOrig cannot be used to distinguish whether fraud occurred. nameDest has some repetitions, but still too few to justify the additional complexity of tracking each customer's behaviour individually.
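 The same check can be automated by comparing the number of distinct values in each column with the number of rows. This is a minimal sketch; the helper name `identifier_like_columns` and the threshold are our own choices, and the toy frame is made up.

```python
import pandas as pd


def identifier_like_columns(df, min_unique_ratio=0.9):
    """Return columns whose distinct-value count is close to the row count."""
    ratio = df.nunique() / len(df)
    return ratio[ratio >= min_unique_ratio].index.tolist()


# Hypothetical toy data: nameOrig is unique per row, nameDest nearly so.
toy = pd.DataFrame({
    'nameOrig': ['C1', 'C2', 'C3', 'C4', 'C5'],
    'nameDest': ['C9', 'C9', 'M1', 'M2', 'M3'],
    'type': ['PAYMENT', 'TRANSFER', 'PAYMENT', 'CASH_OUT', 'PAYMENT'],
})

id_like = identifier_like_columns(toy, min_unique_ratio=0.8)
print(id_like)
```

 With the counts reported above (50000 and 48138 distinct values in 50000 rows), both name columns would also pass the default threshold of 0.9.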

In [6]:
def discard_features(df):

    df = df.copy()

    df.drop(columns=['nameOrig', 'nameDest'], inplace=True)

    return df


pipe = Pipeline([
    ('fix_representation', FunctionTransformer(fix_representation)),
    ('fix_balance_change', FunctionTransformer(fix_balance_change)),
    ('discard_features', FunctionTransformer(discard_features)),
])

 How many samples are frauds?

In [7]:
df_subsample = pipe.transform(pipelines.make_subsample())

px.pie(data_frame=df_subsample, names='isFraud', values=None, color=None)

 Only a small fraction of the samples are frauds, so visualizations of this subsample would be dominated by legitimate transactions. Instead, we can stratify the data according to isFraud.
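 The `make_subsample` helper from `fraud_detection.pipelines` is not shown in this notebook. As an illustration only, the sketch below assumes that stratifying here means keeping an equal number of rows per class, by downsampling every class to the size of the rarest one; `balanced_subsample` is a hypothetical name.

```python
import pandas as pd


def balanced_subsample(df, label='isFraud', random_state=0):
    """Downsample every class to the size of the rarest one."""
    n_min = df[label].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label)
    ]
    # Shuffle so that the classes end up interleaved.
    return (pd.concat(parts)
            .sample(frac=1, random_state=random_state)
            .reset_index(drop=True))


# Hypothetical toy data: 8 legitimate rows and 2 frauds.
toy = pd.DataFrame({
    'amount': range(10),
    'isFraud': [0] * 8 + [1] * 2,
})

balanced = balanced_subsample(toy)
print(balanced['isFraud'].value_counts())
```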

In [8]:
df_stratified = pipe.transform(pipelines.make_subsample(stratified=True))

df_stratified.loc[:, df_stratified.dtypes != float].describe()

| | type | isFraud | clientTypeOrig | clientTypeDest | hourOfDay | dayOfWeek |
|---|---|---|---|---|---|---|
| count | 16426 | 16426 | 16426 | 16426 | 16426 | 16426 |
| unique | 5 | 2 | 1 | 2 | 24 | 7 |
| top | CASH_OUT | 0 | C | C | 19 | 0 |
| freq | 7031 | 8213 | 16426 | 13618 | 1171 | 3360 |


 Since every clientTypeOrig is a customer, this attribute contributes no information and can be dropped.
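 Columns like this one can also be found programmatically, since a column with a single distinct value cannot separate the classes. A small sketch, with `constant_columns` as a hypothetical helper and made-up data:

```python
import pandas as pd


def constant_columns(df):
    """Return the columns that take a single distinct value."""
    return df.columns[df.nunique(dropna=False) <= 1].tolist()


# Hypothetical toy data: every origin is a customer, destinations vary.
toy = pd.DataFrame({
    'clientTypeOrig': ['C', 'C', 'C'],
    'clientTypeDest': ['C', 'M', 'C'],
})

constant = constant_columns(toy)
print(constant)
```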

In [9]:
def drop_clientTypeOrig(df):

    return df.drop('clientTypeOrig', axis=1)


pipe = Pipeline([
    ('fix_representation', FunctionTransformer(fix_representation)),
    ('fix_balance_change', FunctionTransformer(fix_balance_change)),
    ('discard_features', FunctionTransformer(discard_features)),
    ('drop_clientTypeOrig', FunctionTransformer(drop_clientTypeOrig)),
])

 What about quantitative features?

In [10]:
df_stratified = pipe.transform(pipelines.make_subsample(stratified=True))

px.scatter_matrix(data_frame=df_stratified,
                  dimensions=[
                      'amount', 'oldbalanceOrig', 'newbalanceOrig',
                      'oldbalanceDest', 'newbalanceDest'
                  ],
                  color='isFraud',
                  height=1000)

 We can see that the behavior of fraudulent agents is to withdraw all the money, up to some limit (10M in this case).

In [11]:
def suspicious_withdraw_feat(df):

    df = df.copy().reset_index(drop=True)

    feat = abs(df.amount - df.oldbalanceOrig) / (df.amount + 1)

    df['suspicious_withdraw_feat'] = feat

    return df
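 As a worked example of this feature, the sketch below applies the same formula to three rows. The first two reuse the amount and origin balance of rows 0 and 2 of the final output of this notebook; the third is a hypothetical account emptied in one transaction, for which the feature is 0.

```python
import pandas as pd

toy = pd.DataFrame({
    'amount': [6546.29, 23556.05, 5000.0],
    'oldbalanceOrig': [416843.0, 52305.0, 5000.0],
})

# Distance between the amount and the available balance, relative to the
# amount. The + 1 avoids dividing by zero for zero-amount transactions.
toy['suspicious_withdraw_feat'] = (
    (toy.amount - toy.oldbalanceOrig).abs() / (toy.amount + 1))

print(toy.round(6))
```

 Values close to 0 mean that the transaction takes almost exactly the whole balance, which is the pattern seen for the fraudulent agents.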

To conclude the preprocessing, we should drop the target variable

In [12]:
pipe = Pipeline([
    ('fix_representation', FunctionTransformer(fix_representation)),
    ('fix_balance_change', FunctionTransformer(fix_balance_change)),
    ('discard_features', FunctionTransformer(discard_features)),
    ('drop_clientTypeOrig', FunctionTransformer(drop_clientTypeOrig)),
    ('suspicious_withdraw_feat',
     FunctionTransformer(suspicious_withdraw_feat)),
    ('droplabel', FunctionTransformer(lambda df: df.drop('isFraud', axis=1))),
])

In [13]:
df_stratified = pipe.transform(pipelines.make_subsample(stratified=True))

df_stratified.head()

| | type | amount | oldbalanceOrig | newbalanceOrig | oldbalanceDest | newbalanceDest | clientTypeDest | hourOfDay | dayOfWeek | suspicious_withdraw_feat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | 6546.29 | 416843.0 | 410296.71 | NaN | NaN | M | 8 | 3 | 62.666647 |
| 1 | CASH_OUT | 3126.84 | 104654.0 | 101527.16 | 692833.0 | 962587.3 | C | 16 | 0 | 32.459192 |
| 2 | PAYMENT | 23556.05 | 52305.0 | 28748.95 | NaN | NaN | M | 20 | 4 | 1.220397 |
| 3 | TRANSFER | 239163.19 | NaN | NaN | 0.0 | 239163.19 | C | 19 | 6 | NaN |
| 4 | CASH_OUT | 228663.06 | NaN | NaN | 1771853.88 | 2000516.94 | C | 19 | 1 | NaN |
