# Real time streaming

In this tutorial, we will look at how generate data for a streaming, as opposed to a batch, scenario. In other words, instead of sending a batch of data at a time, data can be streamed into the model whereupon predictions are generated. We'll do this by creating a Kafka topic.

Inspiration for this example was taken from [here](https://github.com/aws-samples/amazon-sagemaker-feature-store-streaming-aggregation).

## Generating data

We will first generate some synthetic data to have something to feed the model with. For this, we will use the Faker library, which can generate data according to various specifications.

We need to generate:

- credit card numbers
- user identities
- transactions, including information on:
    - time stamps
    - amounts
    - ATM withdrawals
    
We also need to decide on a percentage of fraudulent transactions, how long the fraudulent attack chains should be, among many other choices.

#### Prerequisites 

In [5]:
#!pip install Faker

#### Imports 

In [4]:
import os
import bisect
import datetime
import hashlib
import math
import random

from collections import defaultdict
from typing import Optional, Union, Any, Dict, List, TypeVar, Tuple

import pandas as pd
import numpy as np

from faker import Faker

In [5]:
# Seed for Reproducibility
faker = Faker()
faker.seed_locale('en_US', 0)

In [6]:
SEED = 12345
random.seed(SEED)
np.random.seed(SEED)
faker.seed_instance(SEED)

#### Constants 

In [8]:
TOTAL_UNIQUE_USERS = 2000
TOTAL_UNIQUE_TRANSACTIONS = 54000
CASH_WITHDRAWAL_CARDS_TOTAL = 1000 
TOTAL_UNIQUE_CASH_WITHDRAWALS = 12000 
ATM_WITHDRAWAL_SEQ_LENGTH = [3, 4, 5, 6, 7, 8, 9, 10]
NORMAL_ATM_RADIUS = 0.01
START_DATE = '2022-01-01 00:00:00'
END_DATE = '2022-03-01 00:00:00'
DATE_FORMAT = '%Y-%m-%d %H:%M:%S'

AMOUNT_DISTRIBUTION_PERCENTAGES = {
                                   0.05: (0.01, 1.01), 
                                   0.075: (1, 11.01),
                                   0.525: (10, 100.01),
                                   0.25: (100, 1000.01),
                                   0.099: (1000, 10000.01),
                                   0.001: (10000, 30000.01)
                                  }

CATEGORY_PERC_PRICE = {
                       "Grocery":              (0.5, 0.01, 100), 
                       "Restaurant/Cafeteria": (0.2, 1, 100),
                       "Health/Beauty":        (0.1, 10, 500.01),
                       "Domestic Transport":   (0.1, 10, 100.01),
                       "Clothing":             (0.05, 10, 2000.01),
                       "Electronics":          (0.02, 100, 10000.01),
                       "Sports/Outdoors":      (0.015, 10, 100.01),
                       "Holliday/Travel":      (0.014, 10, 100.01),              
                       "Jewelery":             (0.001, 10, 100.01)
                       }

FRAUD_RATIO = 0.0025 # percentage of transactions that are fraudulent
NUMBER_OF_FRAUDULENT_TRANSACTIONS = int(FRAUD_RATIO * TOTAL_UNIQUE_TRANSACTIONS)
ATTACK_CHAIN_LENGTHS = [3, 4, 5, 6, 7, 8, 9, 10]

### Generate credit card numbers

<p> Credit card numbers are uniquely assigned to users. For simplicity, we will only generate VISA card numbers.</p>

In [15]:
def generate_unique_credit_card_numbers(n: int) -> list:
    cc_ids = set()
    for _ in range(n):
        cc_id = faker.credit_card_number(card_type='visa')
        cc_ids.add(cc_id)
    return list(cc_ids) 

In [16]:
credit_card_numbers = generate_unique_credit_card_numbers(TOTAL_UNIQUE_USERS)

In [17]:
assert len(credit_card_numbers) == TOTAL_UNIQUE_USERS 
assert len(credit_card_numbers[0]) == 16 # validate if generated number is 16-digit

In [18]:
# inspect random sample of credit card numbers 
random.sample(credit_card_numbers, 5)

['4482905697043510',
 '4938596853771323',
 '4646509464458779',
 '4794708616246336',
 '4494091911429599']

In [19]:
delta_time_object = datetime.datetime.strptime(START_DATE, DATE_FORMAT)
delta_time_object + datetime.timedelta(days=-728)

credit_cards = []
for cc_num in credit_card_numbers:
    credit_cards.append({'cc_num': cc_num, 'provider': 'visa', 'expires': faker.credit_card_expire(start=delta_time_object, end="+5y", date_format="%m/%y")})

In [20]:
credit_cards_pdf = pd.DataFrame.from_records(credit_cards)
credit_cards_pdf.head()

Unnamed: 0,cc_num,provider,expires
0,4496737863050092,visa,10/23
1,4150911993593191,visa,03/22
2,4639235925252168,visa,04/24
3,4877781409330336,visa,03/22
4,4158309777853778,visa,05/22


### Generate user identity 

Users have a name, sex, birthdate, city, country, age, and of course a credit card number.

In [21]:
profiles = []
for credit_card in credit_cards:
    address = faker.local_latlng(country_code = 'US')
    age = 0 
    profile = None
    while age < 18 or age > 100:
        profile = faker.profile(fields=['name', 'sex', 'mail', 'birthdate'])
        dday = profile['birthdate']
        delta = datetime.datetime.now() - datetime.datetime(dday.year, dday.month, dday.day)
        age = int(delta.days / 365)
    if age >= 18 or age <= 100:
        profile['City'] = address[2]
        profile['Country'] = address[3]
        profile['cc_num'] = credit_card['cc_num']
        profile['age'] = age
        credit_card['age'] = age
        profiles.append(profile)

In [22]:
profiles_pdf = pd.DataFrame.from_records(profiles)
profiles_pdf.drop('age', axis=1, inplace=True)
profiles_pdf.head()

Unnamed: 0,name,sex,mail,birthdate,City,Country,cc_num
0,Douglas Gonzalez,M,hevans@yahoo.com,2003-01-14,Lebanon,US,4496737863050092
1,Melvin Williams,M,angela31@gmail.com,1971-11-08,South Whittier,US,4150911993593191
2,Brandon Carr,M,bryanfields@yahoo.com,1953-10-22,Fayetteville,US,4639235925252168
3,Joel Avery,M,jaime63@gmail.com,1953-05-14,Ken Caryl,US,4877781409330336
4,Alisha Hernandez,F,mirandapierce@gmail.com,1984-10-31,Independence,US,4158309777853778


### Generate timestamps

We create random timestamps in the relevant period.

In [27]:
def generate_timestamps(n: int) -> list:
    start = datetime.datetime.strptime(START_DATE, DATE_FORMAT)
    end = datetime.datetime.strptime(END_DATE, DATE_FORMAT)
    timestamps = list()
    for _ in range(n):
        timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None).strftime(DATE_FORMAT)
        timestamps.append(timestamp)
    timestamps = sorted(timestamps)
    return timestamps

In [28]:
timestamps = generate_timestamps(TOTAL_UNIQUE_TRANSACTIONS)

In [29]:
assert len(timestamps) == TOTAL_UNIQUE_TRANSACTIONS

In [30]:
# inspect random sample of timestamps
random.sample(timestamps, 5)

['2022-01-27 16:36:04',
 '2022-01-09 19:17:34',
 '2022-01-31 20:57:24',
 '2022-01-19 14:59:21',
 '2022-02-10 03:12:14']

### Generate Random Transaction Amounts 
<p>The transaction amounts are presumed to follow Pareto distribution, as it is logical for consumers to make many more smaller purchases than large ones. The break down of the distribution is shown in the table below.</p>


| Percentage        | Range (Amount in $)     |
| :-------------: | :----------: |
|  5\% | 0.01 to 1    |
| 7.5\%   | 1 to 10 |
| 52.5\%   | 10 to 100 |
| 25\%   | 100 to 1000 |
| 10\%   | 1000 to 10000 |

In [31]:
def get_random_transaction_amount(start: float, end: float) -> float:
    amt = round(np.random.uniform(start, end), 2)
    return amt

In [32]:
amounts = []
for percentage, span in AMOUNT_DISTRIBUTION_PERCENTAGES.items():
    n = int(TOTAL_UNIQUE_TRANSACTIONS * percentage)
    start, end = span
    for _ in range(n):
        amounts.append(get_random_transaction_amount(start, end+1))

categories = []        
for category, category_perc_price in CATEGORY_PERC_PRICE.items():
    percentage, min_price, max_price = category_perc_price
    n = int(TOTAL_UNIQUE_TRANSACTIONS * percentage)
    for _ in range(n):
        min_price_i = bisect.bisect_left(amounts, min_price)
        max_price_i = bisect.bisect_right(amounts, max_price, lo=min_price_i)
        categories.append({"category":category, "amount":random.choice(amounts[min_price_i:max_price_i])})
        
random.shuffle(categories)

In [33]:
len(categories)

54000

In [20]:
len(amounts)

54000

In [21]:
# inspect random sample of transaction amounts
random.sample(categories, 5)

[{'category': 'Grocery', 'amount': 16.4},
 {'category': 'Grocery', 'amount': 43.19},
 {'category': 'Grocery', 'amount': 37.13},
 {'category': 'Grocery', 'amount': 0.82},
 {'category': 'Grocery', 'amount': 80.37}]

### Generate Credit Card Transactions
<br>
<div style="text-align: justify">
Using the random credit card numbers, timestamps and transaction amounts generated in the above steps, 
we can generate random credit card transactions by combining them. The transaction id for the transaction is the md5
hash of the above mentioned entities.
</div>

In [34]:
def generate_transaction_id(timestamp: str, credit_card_number: str, transaction_amount: float) -> str:
    hashable = f'{timestamp}{credit_card_number}{transaction_amount}'
    hexdigest = hashlib.md5(hashable.encode('utf-8')).hexdigest()
    return hexdigest

In [35]:
transactions = []
for timestamp, category in zip(timestamps, categories):
    credit_card_number = random.choice(credit_card_numbers)
    point_of_tr = faker.local_latlng(country_code = 'US')
    transaction_id = generate_transaction_id(timestamp, credit_card_number, category['amount'])
    transactions.append({
                         'tid': transaction_id, 
                         'datetime': timestamp, 
                         'cc_num': credit_card_number, 
                         'category': category['category'], 
                         'amount': category['amount'],
                         'latitude': point_of_tr[0], 
                         'longitude': point_of_tr[1],
                         'city': point_of_tr[2],
                         'country': point_of_tr[3],
                         'fraud_label': 0
                        }
                       )

In [36]:
# inspect random sample of credit card transactions
random.sample(transactions, 1)

[{'tid': 'd5d43f75ca492242305b6fdb3a93eb32',
  'datetime': '2022-02-26 10:46:29',
  'cc_num': '4060717626727299',
  'category': 'Grocery',
  'amount': 93.38,
  'latitude': '34.21639',
  'longitude': '-119.0376',
  'city': 'Camarillo',
  'country': 'US',
  'fraud_label': 0}]

### Generate Transaction Chains

Here we generate transaction chains representing attacks, i.e. chains of fraudulent transactions. These will be the transactions that our classifier should recognize as fraudulent.

In [37]:
visited = set()
chains = defaultdict(list)

In [38]:
def size(chains: dict) -> int:
    counts = {key: len(values)+1 for (key, values) in chains.items()}
    return sum(counts.values())

In [39]:
def create_attack_chain(i: int):
    chain_length = random.choice(ATTACK_CHAIN_LENGTHS)
    for j in range(1, chain_length):
        if i+j not in visited:
            if size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS:
                break
            chains[i].append(i+j)
            visited.add(i+j)

In [40]:
while size(chains) < NUMBER_OF_FRAUDULENT_TRANSACTIONS:
    i = random.choice(range(TOTAL_UNIQUE_TRANSACTIONS))
    if i not in visited:
        create_attack_chain(i)
        visited.add(i)

### Generate ATM cash withdrawal

Here we generate transactions corresponding to ATM cash withdrawal events. Some of these will turn out to be fraudulent. The probability of a fraudulent transaction varies by user age, so that older customers have a higher risk of being subjected to a card scam.

In [42]:
cash_amounts = []
for percentage, span in AMOUNT_DISTRIBUTION_PERCENTAGES.items():
    n = int(TOTAL_UNIQUE_CASH_WITHDRAWALS * percentage)
    start, end = span
    for _ in range(n):
        cash_amounts.append(get_random_transaction_amount(start, end+1))

In [43]:
len(cash_amounts)

12000

In [44]:
def generate_atm_withdrawal(credit_card_number: str, cash_amounts: list, length: int, delta: int, radius: float = None, country_code = 'US') -> List[Dict]:
    atms = [] 
    start = datetime.datetime.strptime(START_DATE, DATE_FORMAT)
    end = datetime.datetime.strptime(END_DATE, DATE_FORMAT)
    timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None)
    point_of_tr = faker.local_latlng(country_code = country_code)
    latitude = point_of_tr[0] 
    longitude = point_of_tr[1]
    city = point_of_tr[2]
    for _ in range(length):
        current = timestamp + datetime.timedelta(hours=delta)
        if radius is not None:
            latitude = faker.coordinate(latitude, radius) 
            longitude = faker.coordinate(longitude, radius)  
        transaction_id = generate_transaction_id(timestamp, credit_card_number, category['amount'])
        atms.append({'tid': transaction_id, 
                     'datetime': current.strftime(DATE_FORMAT), 
                     'cc_num': credit_card_number, 
                     'category': 'Cash Withdrawal', 
                     'amount': random.sample(cash_amounts, 1)[0],
                     'latitude': latitude, 
                     'longitude': longitude,
                     'city': city,
                     'country': 'US',
                     'fraud_label': 0
                     })
        timestamp = current
    return atms

In [45]:
SUSCEPTIBLE_CARDS_DISTRIBUTION_BY_AGE = {
                                   0.055: (17, 24), 
                                   0.0015: (24, 34),
                                   0.0015: (34, 44),
                                   0.02: (44, 54),
                                   0.022: (54, 64),
                                   0.1: (64, 74),
                                   0.40: (74, 84),
                                   0.40: (84, 100),
                                  }
susseptible_cards = []
visited = []
for percentage, span in SUSCEPTIBLE_CARDS_DISTRIBUTION_BY_AGE.items():
    n = int(TOTAL_UNIQUE_CASH_WITHDRAWALS * percentage) ## TODO: here total expected fraud 
    start, end = span
    for _ in range(n):
        for card in credit_cards:
            if card['age'] > start and card['age'] < end:
                if card['cc_num'] not in visited:
                    current = card
                    visited.append(card['cc_num'])
                    break
                else:
                    current = None                    
        if current is not None:
            susseptible_cards.append(current)

In [46]:
normal_atm_withdrawals = []
atm_transactions = len(cash_amounts)
cash_withdrawal_cards = random.sample(susseptible_cards, CASH_WITHDRAWAL_CARDS_TOTAL//(CASH_WITHDRAWAL_CARDS_TOTAL//len(susseptible_cards)+1))
atm_count = 0
while atm_count < atm_transactions:
    for card in cash_withdrawal_cards:
        for ATM_WITHDRAWAL_SEQ in ATM_WITHDRAWAL_SEQ_LENGTH: 
            # interval in hours between normal cash withdrawals
            delta = random.randint(6, 168)
            atm_tr = generate_atm_withdrawal(credit_card_number = card['cc_num'], cash_amounts = cash_amounts, length=ATM_WITHDRAWAL_SEQ, delta=delta, radius = NORMAL_ATM_RADIUS)         
            normal_atm_withdrawals.append(atm_tr)
            atm_count += ATM_WITHDRAWAL_SEQ

In [47]:
len(normal_atm_withdrawals)

8000

### Modify Transactions with Fraud Chain Attacks 

Here we generate timestamps, amounts etc. for the transactions in the fraud attack chains.

In [49]:
def generate_timestamps_for_fraud_attacks(timestamp: str, chain_length: int) -> list:
    timestamps = []
    timestamp = datetime.datetime.strptime(timestamp, DATE_FORMAT)
    for _ in range(chain_length):
        # interval in seconds between fraudulent attacks
        delta = random.randint(30, 120)
        current = timestamp + datetime.timedelta(seconds=delta)
        timestamps.append(current.strftime(DATE_FORMAT))
        timestamp = current
    return timestamps 

In [50]:
def generate_amounts_for_fraud_attacks(chain_length: int) -> list:
    amounts = []
    for percentage, span in AMOUNT_DISTRIBUTION_PERCENTAGES.items():
        n = math.ceil(chain_length * percentage)
        start, end = span
        for _ in range(n):
            amounts.append(get_random_transaction_amount(start, end+1))
    return amounts[:chain_length]

In [51]:
for key, chain in chains.items():
    transaction = transactions[key]
    timestamp = transaction['datetime']
    cc_num = transaction['cc_num']
    amount = transaction['amount']
    transaction['fraud_label'] = 1
    inject_timestamps = generate_timestamps_for_fraud_attacks(timestamp, len(chain))
    inject_amounts = generate_amounts_for_fraud_attacks(len(chain))
    random.shuffle(inject_amounts)
    for i, idx in enumerate(chain):
        original_transaction = transactions[idx]
        inject_timestamp = inject_timestamps[i]
        original_transaction['datetime'] = inject_timestamp
        original_transaction['fraud_label'] = 1
        original_transaction['cc_num'] = cc_num
        original_transaction['amount'] = inject_amounts[i]
        original_transaction['category'] = [category for category, category_perc_price in CATEGORY_PERC_PRICE.items() if int(inject_amounts[i]) in range(int(category_perc_price[1]), int(category_perc_price[2]))][0]
        original_transaction['tid'] = generate_transaction_id(inject_timestamp, cc_num, amount)
        transactions[idx] = original_transaction

### Modify ATM normal cash withdrawals with fraudulent ones

Similar to the above, we populate the fraudulent ATM withdrawal transactions.

In [55]:
fraudulent_atm_tr_indxs = random.sample([i for i in range(0, len(normal_atm_withdrawals))], int(FRAUD_RATIO * len(normal_atm_withdrawals)))

In [56]:
normal_atm_withdrawals[0]

[{'tid': '4fda20a6721686fb3c49d3cf52da33da',
  'datetime': '2022-02-27 03:29:57',
  'cc_num': '4034693392225656',
  'category': 'Cash Withdrawal',
  'amount': 48.06,
  'latitude': Decimal('40.561921'),
  'longitude': Decimal('-74.276664'),
  'city': 'Woodbridge',
  'country': 'US',
  'fraud_label': 0},
 {'tid': '28e4a5ad897b59dd39479e99f1dd7176',
  'datetime': '2022-03-04 04:29:57',
  'cc_num': '4034693392225656',
  'category': 'Cash Withdrawal',
  'amount': 304.47,
  'latitude': Decimal('40.569933'),
  'longitude': Decimal('-74.267624'),
  'city': 'Woodbridge',
  'country': 'US',
  'fraud_label': 0},
 {'tid': '7e013e66361f26616e32c6421cb7f21e',
  'datetime': '2022-03-09 05:29:57',
  'cc_num': '4034693392225656',
  'category': 'Cash Withdrawal',
  'amount': 66.79,
  'latitude': Decimal('40.562181'),
  'longitude': Decimal('-74.262166'),
  'city': 'Woodbridge',
  'country': 'US',
  'fraud_label': 0}]

In [58]:
for fraudulent_atm_tr_indx in fraudulent_atm_tr_indxs:
    # interval in seconds between fraudulent attacks
    delta = random.randint(1, 5)
    atm_withdrawal = normal_atm_withdrawals[fraudulent_atm_tr_indx]
    pre_fraudulent_atm_tr = atm_withdrawal[0]
    fraudulent_atm_tr = generate_atm_withdrawal(credit_card_number = pre_fraudulent_atm_tr['cc_num'], cash_amounts = cash_amounts, length=1, delta=delta, radius = None)[0]
    fraudulent_atm_location = faker.location_on_land()
    while fraudulent_atm_location[3] == 'US':
        fraudulent_atm_location = faker.location_on_land()
    fraudulent_atm_tr['datetime'] = (datetime.datetime.strptime(pre_fraudulent_atm_tr['datetime'], DATE_FORMAT) + datetime.timedelta(hours=delta)).strftime(DATE_FORMAT)
    fraudulent_atm_tr['latitude'] = fraudulent_atm_location[0]
    fraudulent_atm_tr['longitude'] = fraudulent_atm_location[1]
    fraudulent_atm_tr['city'] = fraudulent_atm_location[2]
    fraudulent_atm_tr['country'] = fraudulent_atm_location[3]
    fraudulent_atm_tr['fraud_label'] = 1 
    atm_withdrawal.append(fraudulent_atm_tr)
    normal_atm_withdrawals[fraudulent_atm_tr_indx] = atm_withdrawal

In [59]:
for atm_withdrawal in normal_atm_withdrawals:
    for withdrawal in atm_withdrawal:
        transactions.append(withdrawal)  

### Write the generated data to file

In [None]:
import pandas as pd
from hops import hdfs
from hops import pandas_helper as pandas
                                            
transactions_pdf = pd.DataFrame.from_records(transactions)
transactions_pdf.head()

pandas.write_csv(hdfs.project_path() + "/Resources/transactions.csv", transactions_pdf, index=False)
pandas.write_csv(hdfs.project_path() + "/Resources/profiles.csv", profiles_pdf, index=False)
pandas.write_csv(hdfs.project_path() + "/Resources/credit_cards.csv", credit_cards_pdf, index=False)

## Create a Kafka Producer for the transactions

Now we are finally ready to create a Kafka Producer for our transaction data. It can write events to our Kafka Topic, "credit_card_transactions", from which the events can then be read by some downstream consumer.

In [65]:
import json

from confluent_kafka import Producer
from hops import kafka
from hops import tls
from hops import hdfs

## TODO (davit): gif how to create kafka topic from the UI


In [43]:
# change this according to your settings
KAFKA_BROKER_ADDRESS = "broker.kafka.service.consul:9091"
KAFKA_TOPIC_NAME = "credit_card_transactions"

In [44]:
config = {
    "bootstrap.servers": KAFKA_BROKER_ADDRESS,
    "security.protocol": kafka.get_security_protocol(),
    "ssl.ca.location": tls.get_ca_chain_location(),
    "ssl.certificate.location": tls.get_client_certificate_location(),
    "ssl.key.location": tls.get_client_key_location(),
    "group.id": "1"
}

producer = Producer(config)

In [None]:
i = 0
for transaction in transactions:
    transaction_label = {}
    _ = transaction.pop("fraud_label")
    if i % 1000 == 0:
        print(json.dumps(transaction))
    producer.produce(KAFKA_TOPIC_NAME, json.dumps(transaction))
    producer.flush()    
    i += 1

{"tid": "11df919988c134d97bbff2678eb68e22", "datetime": "2022-01-01 00:00:24", "cc_num": "4473593503484549", "category": "Health/Beauty", "amount": 62.95, "latitude": "38.70734", "longitude": "-77.02303", "city": "Fort Washington", "country": "US"}


  producer.produce(KAFKA_TOPIC_NAME, json.dumps(transaction))


{"tid": "c5191d37308c9624b460e94c6fe81b73", "datetime": "2022-01-02 02:21:04", "cc_num": "4336399961348201", "category": "Health/Beauty", "amount": 90.86, "latitude": "41.8542", "longitude": "-87.66561", "city": "Lower West Side", "country": "US"}
{"tid": "46f65698afa66f9eb1c1cee4fb16b59e", "datetime": "2022-01-03 04:51:37", "cc_num": "4219785543443381", "category": "Health/Beauty", "amount": 17.25, "latitude": "42.35843", "longitude": "-71.05977", "city": "Boston", "country": "US"}
{"tid": "c447dc4ebdd3e91a1e641983c2aa28e6", "datetime": "2022-01-04 07:11:58", "cc_num": "4089569454422049", "category": "Electronics", "amount": 620.79, "latitude": "47.76232", "longitude": "-122.2054", "city": "Bothell", "country": "US"}
{"tid": "27531978f0de94249478ebda69db9dad", "datetime": "2022-01-05 08:25:33", "cc_num": "4826919959712584", "category": "Grocery", "amount": 13.94, "latitude": "44.27804", "longitude": "-88.27205", "city": "Kaukauna", "country": "US"}
{"tid": "8f8a668392ac92f8fe3855e94dc

## Next Steps

In the next notebook, we'll look at how to do streaming aggregations using Spark structured streaming.