In [1]:
import pymongo
import pandas as pd
import numpy as np
import json

from collections import OrderedDict

# Data Cleaning

This notebook cleans the data that will be fed into a Decision Tree model.

The Fraudulent Transactions dataset (sourced from https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data) contains 6,362,620 transactions with the following features:  
* __step__ - maps a unit of time in the real world. In this case, 1 step is 1 hour. There are 744 steps in total, which is 31 days of hourly steps (the dataset description calls this a 30-day simulation).

* __type__ - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.

* __amount__ - amount of the transaction in local currency.

* __nameOrig__ - customer who started the transaction

* __oldbalanceOrg__ - initial balance before the transaction

* __newbalanceOrig__ - new balance after the transaction

* __nameDest__ - customer who is the recipient of the transaction

* __oldbalanceDest__ - recipient's initial balance before the transaction. Note that there is no information for recipients whose names start with M (merchants).

* __newbalanceDest__ - recipient's new balance after the transaction. Note that there is no information for recipients whose names start with M (merchants).

* __isFraud__ - marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.

* __isFlaggedFraud__ - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
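
As a quick illustration of two of the definitions above, the sketch below converts a `step` into a day and hour of the simulation and restates the `isFlaggedFraud` rule as it is described. The helper names `step_to_day_hour` and `should_be_flagged` are hypothetical. The sketch assumes that step 1 is the first hour of day 1 and that the rule applies only to TRANSFER transactions with an amount strictly greater than 200,000; the actual flag column may not follow this rule exactly.

```python
from typing import Tuple

FLAG_THRESHOLD = 200_000  # threshold taken from the feature description


def step_to_day_hour(step: int) -> Tuple[int, int]:
    """Map a 1-based step to (day, hour_of_day), assuming step 1 is hour 0 of day 1."""
    day_index, hour = divmod(step - 1, 24)
    return day_index + 1, hour


def should_be_flagged(tx_type: str, amount: float) -> bool:
    """Restate the described rule: a single TRANSFER of more than 200,000."""
    return tx_type == "TRANSFER" and amount > FLAG_THRESHOLD


print(step_to_day_hour(1))    # (1, 0)
print(step_to_day_hour(744))  # (31, 23)
```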

## Uploading data into MongoDB

`cd` into the directory containing Fraud.csv, then run:

mongoimport --type csv -d Fraud -c FraudData --headerline Fraud.csv
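
If `mongoimport` is not available, the same load could be done from Python. The following is a minimal sketch, not the method used in this notebook: `parse_row`, `batched`, and `import_csv` are hypothetical helpers, the column types are assumed from the example document shown in this notebook, and `collection` can be any object with an `insert_many` method, such as a pymongo collection.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List

# Column types are an assumption based on the example document.
INT_FIELDS = ("step", "isFraud", "isFlaggedFraud")
FLOAT_FIELDS = (
    "amount", "oldbalanceOrg", "newbalanceOrig",
    "oldbalanceDest", "newbalanceDest",
)


def parse_row(row: dict) -> dict:
    """Convert the string values of one CSV row to numeric types."""
    doc = dict(row)
    for field in INT_FIELDS:
        doc[field] = int(doc[field])
    for field in FLOAT_FIELDS:
        doc[field] = float(doc[field])
    return doc


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` items so memory use stays bounded."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_csv(lines: Iterable[str], collection, batch_size: int = 10_000) -> int:
    """Insert every CSV row into `collection` and return the number of rows inserted."""
    reader = csv.DictReader(lines)
    total = 0
    for batch in batched((parse_row(row) for row in reader), batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total
```

A call might look like `import_csv(open("Fraud.csv", newline=""), pymongo.MongoClient()["Fraud"]["FraudData"])`. Batching keeps the 6.3 million rows from being held in memory at once.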

## Connect to db

In [5]:
# Create client and load in Fraud database
client = pymongo.MongoClient()
db = client["Fraud"]
fraud_data = db["FraudData"]

### Example document for a transaction:

In [6]:
fraud_data.find_one()

{'_id': ObjectId('6259eb9bf3d4ffef6a8bdf11'),
 'step': 1,
 'type': 'PAYMENT',
 'amount': 7861.64,
 'nameOrig': 'C1912850431',
 'oldbalanceOrg': 176087.23,
 'newbalanceOrig': 168225.59,
 'nameDest': 'M633326333',
 'oldbalanceDest': 0.0,
 'newbalanceDest': 0.0,
 'isFraud': 0,
 'isFlaggedFraud': 0}

### Total documents:

In [7]:
fraud_data.count_documents({})

6362620

### Helper Functions

In [8]:
def cursor_df(cursor: pymongo.command_cursor.CommandCursor) -> pd.DataFrame:
    """
    Convert pymongo results cursor to Pandas DataFrame
    
    :param cursor: Pymongo results cursor
    :return: Pandas DataFrame with results from cursor
    """
    return pd.DataFrame(list(cursor))

The only transactions that are fraudulent are __CASH_OUT__ and __TRANSFER__ transactions.
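
One way to check this claim is to group the collection by `type` and sum `isFraud`. The pipeline below is a sketch of such a query, not the one used to reach the statement above. `fraud_counts_by_type` is a hypothetical plain-Python mirror of the same grouping, and the sample documents are made-up values for illustration only.

```python
from collections import defaultdict

# Aggregation pipeline that could be passed to fraud_data.aggregate(...)
fraud_by_type_pipeline = [
    {
        "$group": {
            "_id": "$type",
            "fraud_count": {"$sum": "$isFraud"},
            "total": {"$sum": 1},
        }
    },
    {"$sort": {"_id": 1}},
]


def fraud_counts_by_type(docs):
    """Plain-Python equivalent of the $group stage above."""
    counts = defaultdict(lambda: {"fraud_count": 0, "total": 0})
    for doc in docs:
        entry = counts[doc["type"]]
        entry["fraud_count"] += doc["isFraud"]
        entry["total"] += 1
    return dict(counts)


# Made-up sample documents, not real rows from the dataset
sample_docs = [
    {"type": "PAYMENT", "isFraud": 0},
    {"type": "TRANSFER", "isFraud": 1},
    {"type": "TRANSFER", "isFraud": 0},
    {"type": "CASH_OUT", "isFraud": 1},
]
print(fraud_counts_by_type(sample_docs))
```

On the real collection, any type whose `fraud_count` is 0 contains no fraudulent transactions.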

## Data Cleaning and Processing

### Updating merchant balances to null values

In [9]:
# Filter only for transactions where destination is a merchant
stage_filter_merchant = {
    "nameDest": {
        "$regex": "^M"
    }
}

# Set merchant old and new balances to null values
stage_set_null = {
    "$set": {
        "oldbalanceDest": None,
        "newbalanceDest": None,
    }
}

fraud_data.update_many(stage_filter_merchant, stage_set_null)

<pymongo.results.UpdateResult at 0x1222667c0>
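
The repr printed above does not say how many documents were touched. `UpdateResult` exposes `matched_count` and `modified_count`, and a small sketch for reporting them follows. `summarize_update` is a hypothetical helper, and the `SimpleNamespace` stand-in with made-up counts exists only so the snippet runs without a database.

```python
from types import SimpleNamespace


def summarize_update(result) -> str:
    """Format the matched and modified counts of an update result."""
    return f"matched={result.matched_count}, modified={result.modified_count}"


# Stand-in for a pymongo UpdateResult; the counts are made up
fake_result = SimpleNamespace(matched_count=3, modified_count=2)
print(summarize_update(fake_result))  # matched=3, modified=2
```

In the notebook this would be used as `summarize_update(fraud_data.update_many(stage_filter_merchant, stage_set_null))`. Re-running the update should report a `modified_count` of 0, since the balances are already null.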

In [10]:
# Filter only for transactions where destination is a merchant
stage_filter_merchant = {
    "$match": {
        "nameDest": {
            "$regex": "^M",
        }
    }
}

pipeline = [
    stage_filter_merchant
]

results = fraud_data.aggregate(pipeline)

cursor_df(results)

,_id,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,6259eb9bf3d4ffef6a8bdf11,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,,,0,0
1,6259eb9bf3d4ffef6a8bdf12,1,PAYMENT,7817.71,C90045638,53860.00,46042.29,M573487274,,,0,0
2,6259eb9bf3d4ffef6a8bdf15,1,PAYMENT,3099.97,C249177573,20771.00,17671.03,M2096539129,,,0,0
3,6259eb9bf3d4ffef6a8bdf16,1,PAYMENT,2560.74,C1648232591,5070.00,2509.26,M972865270,,,0,0
4,6259eb9bf3d4ffef6a8bdf17,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2151490,6259eca1f3d4ffef6aecf3d9,718,PAYMENT,8178.01,C1213413071,11742.00,3563.99,M1112540487,,,0,0
2151491,6259eca1f3d4ffef6aecf3dc,718,PAYMENT,17841.23,C1045048098,10182.00,0.00,M1878955882,,,0,0
2151492,6259eca1f3d4ffef6aecf3dd,718,PAYMENT,1022.91,C1203084509,12.00,0.00,M675916850,,,0,0
2151493,6259eca1f3d4ffef6aecf3de,718,PAYMENT,4109.57,C673558958,5521.00,1411.43,M1126011651,,,0,0


### Updating unknown payment balances

In [11]:
# Filter only for payment transactions where old and new origin balances are both 0 (treated as unknown)
stage_filter_unk_payments = {
    "type": {"$eq": "PAYMENT"},
    "oldbalanceOrg": {"$eq": 0},
    "newbalanceOrig": {"$eq": 0}
}

# Set payment old and new balances to null values
stage_set_null = {
    "$set": {
        "oldbalanceOrg": None,
        "newbalanceOrig": None,
    }
}

fraud_data.update_many(stage_filter_unk_payments, stage_set_null)

<pymongo.results.UpdateResult at 0x10e253740>

In [12]:
# Filter only for payment transactions where old and new origin balances are not known
stage_filter_unk_payments = {
    "$match": {
        "$and": [
            {"type": {"$eq": "PAYMENT"}},
            {"oldbalanceOrg": {"$eq": None}},
            {"newbalanceOrig": {"$eq": None}}
        ]
    }
}

pipeline = [
    stage_filter_unk_payments
]

results = fraud_data.aggregate(pipeline)

cursor_df(results)

,_id,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,6259eb9bf3d4ffef6a8bdf2e,1,PAYMENT,3448.92,C2103763750,,,M335107734,,,0,0
1,6259eb9bf3d4ffef6a8bdf2f,1,PAYMENT,9920.52,C764826684,,,M1940055334,,,0,0
2,6259eb9bf3d4ffef6a8bdf30,1,PAYMENT,5885.56,C840514538,,,M1804441305,,,0,0
3,6259eb9bf3d4ffef6a8bdf31,1,PAYMENT,4206.84,C215078753,,,M1757317128,,,0,0
4,6259eb9bf3d4ffef6a8bdf32,1,PAYMENT,5307.88,C1768242710,,,M1971783162,,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
774240,6259eca1f3d4ffef6aecdb05,709,PAYMENT,20380.42,C619971661,,,M1289697387,,,0,0
774241,6259eca1f3d4ffef6aecdb06,709,PAYMENT,6174.99,C1520545975,,,M2113450343,,,0,0
774242,6259eca1f3d4ffef6aecdb1e,709,PAYMENT,5387.46,C460753297,,,M573565290,,,0,0
774243,6259eca1f3d4ffef6aecdb21,709,PAYMENT,3837.05,C1751522910,,,M903863937,,,0,0


774,245 payment transactions were updated so that the old and new origin balances are null instead of 0.
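
The two cleaning rules in this section can be mirrored in plain Python, which is handy for spot-checking individual documents. This is a sketch under the assumption that the rules are exactly "destination name starts with M" and "PAYMENT with both origin balances equal to 0". `clean_document` is a hypothetical helper. The first sample is the example document shown earlier (without `_id`), and the second is made up.

```python
def clean_document(doc: dict) -> dict:
    """Return a copy of doc with unknown balances replaced by None."""
    cleaned = dict(doc)
    # Rule 1: merchant recipients carry no balance information
    if cleaned["nameDest"].startswith("M"):
        cleaned["oldbalanceDest"] = None
        cleaned["newbalanceDest"] = None
    # Rule 2: payments with both origin balances at 0 are treated as unknown
    if (
        cleaned["type"] == "PAYMENT"
        and cleaned["oldbalanceOrg"] == 0
        and cleaned["newbalanceOrig"] == 0
    ):
        cleaned["oldbalanceOrg"] = None
        cleaned["newbalanceOrig"] = None
    return cleaned


example_doc = {
    "step": 1, "type": "PAYMENT", "amount": 7861.64,
    "nameOrig": "C1912850431", "oldbalanceOrg": 176087.23,
    "newbalanceOrig": 168225.59, "nameDest": "M633326333",
    "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
    "isFraud": 0, "isFlaggedFraud": 0,
}
# Made-up document with unknown origin balances
unknown_doc = dict(example_doc, oldbalanceOrg=0.0, newbalanceOrig=0.0)

print(clean_document(example_doc))
print(clean_document(unknown_doc))
```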

## Building Model Dataset

Because there are no fraudulent "PAYMENT", "CASH_IN", or "DEBIT" transactions, adding these transactions to our training set would only add unnecessary noise. Therefore they are excluded, and the model is trained solely on "TRANSFER" and "CASH_OUT" transactions to predict whether they are fraudulent. With only 28 non-fraudulent transactions involving an account that originated a fraudulent transaction, it does not appear useful to engineer features relating to the transactional history of accounts.
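
The notebook does not show how the figure of 28 was obtained. One possible reading is sketched below: collect every account that originated a fraudulent transaction, then count the non-fraudulent transactions in which one of those accounts appears as origin or destination. `count_nonfraud_involving_fraud_origins` is a hypothetical helper, and the sample documents are made up.

```python
def count_nonfraud_involving_fraud_origins(docs) -> int:
    """Count non-fraud transactions that involve an account which originated fraud."""
    docs = list(docs)
    fraud_origins = {d["nameOrig"] for d in docs if d["isFraud"] == 1}
    return sum(
        1
        for d in docs
        if d["isFraud"] == 0
        and (d["nameOrig"] in fraud_origins or d["nameDest"] in fraud_origins)
    )


# Made-up sample: account C1 originates one fraudulent transfer
sample_docs = [
    {"nameOrig": "C1", "nameDest": "C2", "isFraud": 1},
    {"nameOrig": "C1", "nameDest": "C3", "isFraud": 0},  # same origin, not fraud
    {"nameOrig": "C4", "nameDest": "C1", "isFraud": 0},  # C1 as recipient
    {"nameOrig": "C5", "nameDest": "C6", "isFraud": 0},  # unrelated
]
print(count_nonfraud_involving_fraud_origins(sample_docs))  # 2
```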

In [13]:
# Filter only for transactions that are TRANSFER or CASH_OUT
stage_filter_trans = {
    "$match": {
        "type": {"$in": ["TRANSFER", "CASH_OUT"]},
    }
}

stage_project = {
    "$project": {
        "is_transfer": {
            "$cond": { 
                "if": {"$eq": ["$type", "TRANSFER"]}, "then": 1, "else": 0}
        },
        "amount": "$amount",
        "oldbalanceOrg": "$oldbalanceOrg",
        "newbalanceOrig": "$newbalanceOrig",
        "oldbalanceDest": "$oldbalanceDest",
        "newbalanceDest": "$newbalanceDest",
        "isFlaggedFraud": "$isFlaggedFraud",
        "isFraud": "$isFraud",
        "_id": 0
    }
}

pipeline = [
    stage_filter_trans,
    stage_project
]

results = fraud_data.aggregate(pipeline)

model_df = cursor_df(results)
model_df.to_csv('modeldata.csv', index=False)
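
To make the effect of the `$match` and `$project` stages concrete, the sketch below applies the same selection and the `is_transfer` encoding to a single document in plain Python. `to_model_row` is a hypothetical helper, and the sample documents are made up.

```python
MODEL_TYPES = {"TRANSFER", "CASH_OUT"}
KEPT_FIELDS = [
    "amount", "oldbalanceOrg", "newbalanceOrig",
    "oldbalanceDest", "newbalanceDest", "isFlaggedFraud", "isFraud",
]


def to_model_row(doc: dict):
    """Return the projected row, or None if the document is filtered out."""
    if doc["type"] not in MODEL_TYPES:
        return None
    row = {"is_transfer": 1 if doc["type"] == "TRANSFER" else 0}
    row.update({field: doc[field] for field in KEPT_FIELDS})
    return row


# Made-up sample documents
transfer_doc = {
    "step": 1, "type": "TRANSFER", "amount": 181.0,
    "nameOrig": "C1", "oldbalanceOrg": 181.0, "newbalanceOrig": 0.0,
    "nameDest": "C2", "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
    "isFraud": 1, "isFlaggedFraud": 0,
}
payment_doc = dict(transfer_doc, type="PAYMENT", isFraud=0)

print(to_model_row(transfer_doc))
print(to_model_row(payment_doc))  # None
```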

In [1]:
model_df
