# Generate Credit Card Transactions
**This notebook generates credit card transactions and randomly injects fraud chain attacks.**


---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Generate Transactions](#Generate-Transactions)
1. [Inject Fradulent Transactions](#Inject-Fradulent-Transactions)
1. [Save Generated Data](#Save-Generated-Data)

### Background
This notebook generates random credit card transactions for 10K users over a period of 5 months. In an ideal scenario, these historical transactions would be accumulated into a data lake/store for batch processing so as to derive insights and analytics about this data. Credit card numbers can be bought in bulk on the dark web through previous leaks or hacks of organizations that store this sensitive data. Fraudsters will buy these card lists and attempt to make as many transactions as possible with the stolen numbers until the card is blocked. These fraud chain attacks typically happen in a short time frame and can be easily spotted amongst historical transactions. This is because the velocity of transactions during the attack significantly differs from that of cardholder’s usual spending pattern. This notebook is optional to run. The generated data already exists in the `./data` folder for you to use. Re-run this notebook if you desire to re-populate fresh data or understand the whole process of how this dataset was generated.

### Setup

#### Prerequisites 

In [1]:
!pip install Faker

Collecting Faker
  Downloading Faker-8.2.1-py3-none-any.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 23.3 MB/s eta 0:00:01
[?25hCollecting text-unidecode==1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
[K     |████████████████████████████████| 78 kB 12.1 MB/s  eta 0:00:01
Installing collected packages: text-unidecode, Faker
Successfully installed Faker-8.2.1 text-unidecode-1.3


#### Imports 

In [2]:
from botocore.client import ClientError
from collections import defaultdict
from faker import Faker
import pandas as pd
import numpy as np
import sagemaker
import datetime
import hashlib
import random
import boto3
import math
import os

#### Seed for Reproducibility

In [3]:
faker = Faker()
faker.seed_locale('en_US', 0)

In [4]:
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
faker.seed_instance(SEED)

#### Constants 

In [5]:
# Define counts
TOTAL_UNIQUE_TRANSACTIONS = 100 * 1000
TOTAL_UNIQUE_CREDIT_CARDS = 2000
TOTAL_UNIQUE_CONSUMERS = 1000

BUCKET = sagemaker.Session().default_bucket()

### Generate Transactions

#### Generate Unique Credit Card Numbers 
<p> Credit card numbers are uniquely assigned to users. Since, there are 10K users, we would want to generate 10K unique card numbers.</p>

In [6]:
def generate_unique_credit_card_numbers(n: int) -> list:
    cc_ids = set()
    for _ in range(n):
        cc_id = faker.credit_card_number(card_type='visa')
        cc_ids.add(cc_id)
    return list(cc_ids) 

In [7]:
credit_card_numbers = generate_unique_credit_card_numbers(TOTAL_UNIQUE_CREDIT_CARDS)

In [8]:
assert len(credit_card_numbers) == TOTAL_UNIQUE_CREDIT_CARDS 
assert len(credit_card_numbers[0]) == 16 # validate if generated number is 16-digit

In [9]:
# inspect random sample of credit card numbers 
random.sample(credit_card_numbers, 5)

['4956661021345975',
 '4750764047485706',
 '4057120219501079',
 '4295021621227182',
 '4134120990603661']

#### Generate Consumer IDs
Since we are training a consumer fraud detection model, we need a set of unique consumer IDs that we can randomly associate with each credit card transaction. The relationship between Consumer IDs to Credit Cards is one-to-many; a single Consumer can be tied to one or more credit instruments.

In [10]:
# We use Faker's Basic Bank Account Number (bban) module as a proxy for Consumer ID
def generate_unique_consumer_ids(n: int) -> list:
    id_list = set()
    for _ in range(n):
        bban_id = faker.unique.bban()
        id_list.add(bban_id)
    return list(id_list) 

In [11]:
consumer_ids = generate_unique_consumer_ids(TOTAL_UNIQUE_CONSUMERS)

In [12]:
assert len(consumer_ids) == TOTAL_UNIQUE_CONSUMERS

In [13]:
random.sample(consumer_ids, 5)

['ZCRN56886001005293',
 'VBLY61361832497125',
 'OPEW57074941492054',
 'JOFR79281747618483',
 'TLOI08540195188595']

#### Generate Time Series
<p>Generate one hundered thousand (100,000) random timestamps spread across a period of one month (2021-03-01 to 2021-03-31) in sorted order.</p>
<b>Note:</b> The timestamps are NOT unique themselves. We can have 2 or more transactions occurring at the same time coming from different users. 

In [14]:
# Feature Store Group requires ISO-8601 string format: yyyy-MM-dd'T'HH:mm:ssZ
# when the EventTime required attribute is type String

ISO_8601_DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

In [15]:
def generate_timestamps(n: int) -> list:
    # Use timeframe of one month Mar 2021   
    start = datetime.datetime.strptime('2021-03-01T00:00:00Z', ISO_8601_DATETIME_FORMAT)
    end = datetime.datetime.strptime('2021-03-31T23:59:59Z', ISO_8601_DATETIME_FORMAT)

    timestamps = list()
    for _ in range(n):
        # We use .isoformat() to convert datetime to ISO-8601 Format, required by Feature Store
        iso_timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None).strftime(ISO_8601_DATETIME_FORMAT)
        timestamps.append(iso_timestamp)
    timestamps = sorted(timestamps)
    return timestamps

In [16]:
timestamps = generate_timestamps(TOTAL_UNIQUE_TRANSACTIONS)

In [17]:
assert len(timestamps) == TOTAL_UNIQUE_TRANSACTIONS

In [18]:
# inspect random sample of timestamps
random.sample(timestamps, 5)

['2021-03-02T13:13:15Z',
 '2021-03-16T11:38:49Z',
 '2021-03-22T19:19:45Z',
 '2021-03-23T20:43:56Z',
 '2021-03-14T13:41:00Z']

#### Generate Random Transaction Amounts 
<p>The transaction amounts are presumed to follow Pareto distribution, as it is logical for consumers to make many more smaller purchases than large ones. The break down of the distribution is shown in the table below.</p>


| Percentage        | Range (Amount in $)     |
| :-------------: | :----------: |
|  5\% | 0.01 to 1    |
| 7.5\%   | 1 to 10 |
| 52.5\%   | 10 to 100 |
| 25\%   | 100 to 1000 |
| 10\%   | 1000 to 10000 |

In [19]:
def get_random_transaction_amount(start: float, end: float) -> float:
    amt = round(np.random.uniform(start, end), 2)
    return amt

In [20]:
distribution_percentages = {0.05: (0.01, 1.01), 
                            0.075: (1, 11.01),
                            0.525: (10, 100.01),
                            0.25: (100, 1000.01),
                            0.10: (1000, 10000.01)}

In [21]:
amounts = []

for percentage, span in distribution_percentages.items():
    n = int(TOTAL_UNIQUE_TRANSACTIONS * percentage)
    start, end = span
    for _ in range(n):
        amounts.append(get_random_transaction_amount(start, end+1))
        
random.shuffle(amounts)

In [22]:
assert len(amounts) == TOTAL_UNIQUE_TRANSACTIONS

In [23]:
# inspect random sample of transaction amounts
random.sample(amounts, 5)

[6.46, 43.12, 35.43, 89.63, 40.24]

#### Generate Credit Card Transactions
<br>
<div style="text-align: justify">
Using the random credit card numbers, timestamps and transaction amounts generated in the above steps, 
we can generate random credit card transactions by combining them. The transaction id for the transaction is the md5
hash of the above mentioned entities.
</div>

In [24]:
def generate_transaction_id(timestamp: str, credit_card_number: str, transaction_amount: float) -> str:
    hashable = f'{timestamp}{credit_card_number}{transaction_amount}'
    hexdigest = hashlib.md5(hashable.encode('utf-8')).hexdigest()
    return hexdigest

In [25]:
# Use jitter when assigning consumer_ids

def apply_jitter(iterator, increment, max_inc):
    iterator = 0
    increment += 3
    if (increment > max_inc):
        increment = 1
    return iterator, increment

In [26]:
# Create transactions
transactions = []

MAX_INCREMENT = 99
iter = 0
incr = 1
for timestamp, amount in zip(timestamps, amounts):
    consumer_id = consumer_ids[iter % TOTAL_UNIQUE_CONSUMERS]
    credit_card_number = random.choice(credit_card_numbers)
    transaction_id = generate_transaction_id(timestamp, credit_card_number, amount)
    transactions.append({'tid': transaction_id, 
                         'event_time': timestamp, 
                         'cc_num': credit_card_number, 
                         'consumer_id': consumer_id,
                         'amount': amount, 
                         'fraud_label': 0})
    iter += incr
    if (iter > TOTAL_UNIQUE_CONSUMERS):
        iter, incr = apply_jitter(iter, incr, MAX_INCREMENT)
        

In [27]:
assert len(transactions) == TOTAL_UNIQUE_TRANSACTIONS

In [28]:
# inspect random sample of credit card transactions
random.sample(transactions, 3)

[{'tid': '48ca9eb9128875d7f815aca04727c7fa',
  'event_time': '2021-03-21T07:01:57Z',
  'cc_num': '4532702060862208',
  'consumer_id': 'VNGM63871524814184',
  'amount': 4825.38,
  'fraud_label': 0},
 {'tid': 'b0be99afc27afdb3d1de6cb566a33db3',
  'event_time': '2021-03-16T14:06:19Z',
  'cc_num': '4849092754344749',
  'consumer_id': 'KINB63239048235860',
  'amount': 7409.58,
  'fraud_label': 0},
 {'tid': '996776cf97fad1a8207136829bcb9337',
  'event_time': '2021-03-15T15:35:41Z',
  'cc_num': '4995025867432789',
  'consumer_id': 'JSOE81361660812329',
  'amount': 13.11,
  'fraud_label': 0}]

### Inject Fradulent Transactions
<p> A typical fraud chain looks like the one as shown in the image below.</p>
<img src="./images/fraud_pattern.png" />

In [29]:
FRAUD_RATIO = 0.02 # was 0.0025 # percentage of transactions that are fraudulent
NUMBER_OF_FRAUDULENT_TRANSACTIONS = int(FRAUD_RATIO * TOTAL_UNIQUE_TRANSACTIONS)
ATTACK_CHAIN_LENGTHS = [3, 4, 5, 6, 7, 8, 9, 10]

#### Create Transaction Chains 

In [30]:
visited = set()
chains = defaultdict(list)

In [31]:
def size(chains: dict) -> int:
    counts = {key: len(values)+1 for (key, values) in chains.items()}
    return sum(counts.values())

In [32]:
def create_attack_chain(i: int):
    chain_length = random.choice(ATTACK_CHAIN_LENGTHS)
    for j in range(1, chain_length):
        if i+j not in visited:
            if size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS:
                break
            chains[i].append(i+j)
            visited.add(i+j)

In [33]:
while size(chains) < NUMBER_OF_FRAUDULENT_TRANSACTIONS:
    i = random.choice(range(TOTAL_UNIQUE_TRANSACTIONS))
    if i not in visited:
        create_attack_chain(i)
        visited.add(i)

In [34]:
assert size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS

#### Modify Transactions with Fraud Chain Attacks 

In [35]:
def generate_timestamps_for_fraud_attacks(timestamp: str, chain_length: int) -> list:
    timestamps = []
    timestamp = datetime.datetime.strptime(timestamp, ISO_8601_DATETIME_FORMAT)
    for _ in range(chain_length):
        # interval in seconds between fraudulent attacks
        delta = random.randint(30, 120)
        current = timestamp + datetime.timedelta(seconds=delta)
        timestamps.append(current.strftime(ISO_8601_DATETIME_FORMAT))
        timestamp = current
    return timestamps 

In [36]:
def generate_amounts_for_fraud_attacks(chain_length: int) -> list:
    amounts = []
    for percentage, span in distribution_percentages.items():
        n = math.ceil(chain_length * percentage)
        start, end = span
        for _ in range(n):
            amounts.append(get_random_transaction_amount(start, end+1))
    return amounts[:chain_length]

In [37]:
for key, chain in chains.items():
    transaction = transactions[key]
    timestamp = transaction['event_time']
    consumer_id = transaction['consumer_id']
    cc_num = transaction['cc_num']
    amount = transaction['amount']
    transaction['fraud_label'] = 1
    inject_timestamps = generate_timestamps_for_fraud_attacks(timestamp, len(chain))
    inject_amounts = generate_amounts_for_fraud_attacks(len(chain))
    random.shuffle(inject_amounts)
    for i, idx in enumerate(chain):
            original_transaction = transactions[idx]
            inject_timestamp = inject_timestamps[i]
            original_transaction['event_time'] = inject_timestamp
            original_transaction['fraud_label'] = 1
            original_transaction['cc_num'] = cc_num
            original_transaction['consumer_id'] = consumer_id
            original_transaction['amount'] = inject_amounts[i]
            original_transaction['tid'] = generate_transaction_id(inject_timestamp, cc_num, amount)
            transactions[idx] = original_transaction

#### Transform Transactions to Pandas DataFrame

In [38]:
transactions_df = pd.DataFrame(transactions)

In [39]:
fraud_transactions = transactions_df[transactions_df.fraud_label.eq(1)]
fraud_transactions.head()

Unnamed: 0,tid,event_time,cc_num,consumer_id,amount,fraud_label
502,79edd618e084b87327fff6f776000c80,2021-03-01T03:56:49Z,4861936807957845,RDUX88890356530086,56.65,1
503,fca760f10fac7d69f176f46141684f33,2021-03-01T03:58:24Z,4861936807957845,RDUX88890356530086,8.36,1
504,98c5195e6bd9025993082f76abff3fd0,2021-03-01T03:59:17Z,4861936807957845,RDUX88890356530086,1.49,1
591,96211ab09ce5148d9a11aafb0c77c8aa,2021-03-01T04:40:54Z,4626452901098169,UUEZ89591751278982,74.98,1
592,3f3d1de725805126afb4e80e285b7265,2021-03-01T04:41:35Z,4626452901098169,UUEZ89591751278982,0.44,1


In [40]:
assert fraud_transactions.count()[0] == NUMBER_OF_FRAUDULENT_TRANSACTIONS

### Save Generated Data
<p> The generated raw transactions data will be used by the next step = SageMaker PySpark Processing Job to do aggregations on the raw data columns and derive new features which are useful for model training in the later steps.
The generated data is saved locally and then copied to S3 bucket.</p>

#### Save Transactions Data to Local Folder ./data and upload to S3

In [41]:
data_dir = os.path.join(os.getcwd(), 'data/raw')
os.makedirs(data_dir, exist_ok=True)

transactions_df.to_csv(f'{data_dir}/transactions.csv', index=False)

In [42]:
BASE_PREFIX = "sagemaker-featurestore-blog"

INPUT_KEY_PREFIX = os.path.join(BASE_PREFIX, 'raw')
print (INPUT_KEY_PREFIX)

sagemaker-featurestore-blog/raw


In [43]:
transactions_df.to_csv(f's3://{BUCKET}/{INPUT_KEY_PREFIX}/transactions.csv', index=False)