# Feature Pipeline using Synthetic Data

# 1) Generating synthetic credit card data

In [1]:
import pandas as pd
import datetime
import hopsworks
from sml import synthetic_data
import random
pd.options.mode.chained_assignment = None

In [5]:
# Use the last 24 hours as the time window for the synthetic data.
# Calling now() once keeps both ends of the window consistent.
now = datetime.datetime.now()
start_time = (now - datetime.timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
end_time = now.strftime("%Y-%m-%d %H:%M:%S")
print(f"Start time: {start_time}")
print(f"End time: {end_time}")

Start time: 2022-10-16 21:01:37
End time: 2022-10-17 21:01:37


In [6]:
# Configure the generator: a random fraud ratio between 0.1% and 0.5%,
# the population sizes, and the time window computed above.
synthetic_data.FRAUD_RATIO = random.uniform(0.001, 0.005)
synthetic_data.TOTAL_UNIQUE_USERS = 1000
synthetic_data.TOTAL_UNIQUE_TRANSACTIONS = 54000
synthetic_data.CASH_WITHRAWAL_CARDS_TOTAL = 2000
synthetic_data.TOTAL_UNIQUE_CASH_WITHDRAWALS = 1200
synthetic_data.START_DATE = start_time
synthetic_data.END_DATE = end_time

credit_cards = synthetic_data.generate_list_credit_card_numbers()
credit_cards_df = synthetic_data.create_credit_cards_as_df(credit_cards)
profiles_df = synthetic_data.create_profiles_as_df(credit_cards)
trans_df = synthetic_data.create_transactions_as_df(credit_cards)

# 2) Feature engineering

Fraudulent transactions can differ from regular ones in many ways. A typical red flag is an unusually large transaction volume or frequency within a few hours. Fraudsters may also target particular groups, such as elderly people. To help the model learn such patterns, you will create two types of additional features:

1. Features that combine data from different data sources. An example is the age of a customer at the time of a transaction, which combines the `birthdate` feature from `profiles.csv` with the `datetime` feature from `transactions.csv`.
2. Features that aggregate data from multiple time steps. An example is the transaction frequency of a credit card over a few hours, which is computed using a window function.

Let's start with the first category.
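As an illustration of the first category, here is a minimal sketch of how an age-at-transaction feature could be derived with a pandas merge. It is not the implementation in `cc_features`; the toy data, the join on `cc_num`, the helper name `add_age_at_transaction`, and the output column `age_at_transaction` are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data. Column names mirror those mentioned in the text;
# joining profiles to transactions on `cc_num` is an assumption.
profiles = pd.DataFrame({
    "cc_num": [1111, 2222],
    "birthdate": pd.to_datetime(["1950-06-01", "1990-01-15"]),
})
transactions = pd.DataFrame({
    "tid": ["a", "b", "c"],
    "cc_num": [1111, 2222, 1111],
    "datetime": pd.to_datetime(
        ["2022-10-16 21:01:38", "2022-10-16 21:01:42", "2022-10-17 08:00:00"]
    ),
})


def add_age_at_transaction(trans, profs):
    """Join birthdate onto each transaction and derive the owner's age in years."""
    merged = trans.merge(profs[["cc_num", "birthdate"]], on="cc_num", how="left")
    age_days = (merged["datetime"] - merged["birthdate"]).dt.days
    # 365.25 approximates the average year length, including leap years.
    merged["age_at_transaction"] = age_days / 365.25
    return merged.drop(columns=["birthdate"])


with_age = add_age_at_transaction(transactions, profiles)
print(with_age)
```

A left join keeps every transaction even when no matching profile exists; such rows would simply get a missing age.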

In [7]:
# Keep the labels in their own DataFrame, copying only the selected columns.
fraud_labels = trans_df[["tid", "cc_num", "datetime", "fraud_label"]].copy()
fraud_labels

Unnamed: 0,tid,cc_num,datetime,fraud_label
0,1a9f7aa0e2eb4d927da56e02287875b5,4176332408257688,2022-10-16 21:01:38,0
1,2b546af6798a257b236228bfff036231,4895940069843701,2022-10-16 21:01:42,0
2,29e47ed4b9abf632727107b0e1c6dc1e,4562180039969078,2022-10-16 21:01:42,0
3,089611cbb4cd6d6cecb27461a5093f03,4899899195688156,2022-10-16 21:01:43,0
4,c9d6198db1a305ed98888850a512786f,4890552643648087,2022-10-16 21:01:47,0
...,...,...,...,...
71318,031c63e933b9c92079794a7675df2837,4548269955223195,2022-11-23 02:02:02,0
71319,9c7dbefb6e349030471b48e5bfcffcfe,4548269955223195,2022-11-29 06:02:02,0
71320,197dd063fc62b6fe316e2d4df57c57a4,4548269955223195,2022-12-05 10:02:02,0
71321,4f8d625d953d194e69252c190fb019d8,4548269955223195,2022-12-11 14:02:02,0


In [8]:
from sml import cc_features

# Convert each datetime to an integer timestamp (see the output below).
fraud_labels["datetime"] = fraud_labels["datetime"].map(cc_features.date_to_timestamp)
fraud_labels

Unnamed: 0,tid,cc_num,datetime,fraud_label
0,1a9f7aa0e2eb4d927da56e02287875b5,4176332408257688,1665954098000,0
1,2b546af6798a257b236228bfff036231,4895940069843701,1665954102000,0
2,29e47ed4b9abf632727107b0e1c6dc1e,4562180039969078,1665954102000,0
3,089611cbb4cd6d6cecb27461a5093f03,4899899195688156,1665954103000,0
4,c9d6198db1a305ed98888850a512786f,4890552643648087,1665954107000,0
...,...,...,...,...
71318,031c63e933b9c92079794a7675df2837,4548269955223195,1669168922000,0
71319,9c7dbefb6e349030471b48e5bfcffcfe,4548269955223195,1669701722000,0
71320,197dd063fc62b6fe316e2d4df57c57a4,4548269955223195,1670234522000,0
71321,4f8d625d953d194e69252c190fb019d8,4548269955223195,1670767322000,0
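The 13-digit values in the output above are consistent with Unix epoch milliseconds, with the original datetimes interpreted as UTC. The internals of `cc_features.date_to_timestamp` are not shown here, so the following is only a sketch of an equivalent conversion using the standard library; the function name `to_epoch_millis` is hypothetical.

```python
import datetime


def to_epoch_millis(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Interpret a naive datetime string as UTC and return Unix epoch milliseconds."""
    dt = datetime.datetime.strptime(str(value), fmt)
    dt = dt.replace(tzinfo=datetime.timezone.utc)
    return int(dt.timestamp() * 1000)


# Matches row 0 of the output above.
print(to_epoch_millis("2022-10-16 21:01:38"))  # 1665954098000
```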


In [9]:
# The labels now live in `fraud_labels`, so drop them from the transactions.
trans_df.drop(columns=["fraud_label"], inplace=True)
trans_df

Unnamed: 0,tid,datetime,cc_num,category,amount,latitude,longitude,city,country
0,1a9f7aa0e2eb4d927da56e02287875b5,2022-10-16 21:01:38,4176332408257688,Electronics,456.97,36.610330,-88.314760,Murray,US
1,2b546af6798a257b236228bfff036231,2022-10-16 21:01:42,4895940069843701,Grocery,71.93,42.168080,-88.428140,Huntley,US
2,29e47ed4b9abf632727107b0e1c6dc1e,2022-10-16 21:01:42,4562180039969078,Restaurant/Cafeteria,10.87,36.208290,-115.983910,Pahrump,US
3,089611cbb4cd6d6cecb27461a5093f03,2022-10-16 21:01:43,4899899195688156,Grocery,79.64,41.503430,-74.010420,Newburgh,US
4,c9d6198db1a305ed98888850a512786f,2022-10-16 21:01:47,4890552643648087,Holliday/Travel,43.42,42.970860,-82.424910,Port Huron,US
...,...,...,...,...,...,...,...,...,...
71318,031c63e933b9c92079794a7675df2837,2022-11-23 02:02:02,4548269955223195,Cash Withdrawal,31.15,40.841212,-74.180420,Nutley,US
71319,9c7dbefb6e349030471b48e5bfcffcfe,2022-11-29 06:02:02,4548269955223195,Cash Withdrawal,147.79,40.833017,-74.178175,Nutley,US
71320,197dd063fc62b6fe316e2d4df57c57a4,2022-12-05 10:02:02,4548269955223195,Cash Withdrawal,49.92,40.832383,-74.171125,Nutley,US
71321,4f8d625d953d194e69252c190fb019d8,2022-12-11 14:02:02,4548269955223195,Cash Withdrawal,437.25,40.835260,-74.169988,Nutley,US


Next, you will create features that aggregate data from multiple time steps for each credit card.

You will start by computing the distance between consecutive transactions; let's call it `loc_delta`. Here you will use the Haversine distance to quantify the distance between two latitude/longitude coordinate pairs.
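To make the idea concrete, here is a minimal sketch of a Haversine-based `loc_delta`, computed per card between each transaction and the previous one. The helper names `haversine_km` and `add_loc_delta`, the choice of kilometres as the unit, and the convention of assigning 0 to a card's first transaction are assumptions for this example and may differ from what `cc_features` does.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_loc_delta(df):
    """Add the distance from each transaction to the card's previous transaction."""
    df = df.sort_values(["cc_num", "datetime"]).copy()
    # Previous coordinates within each card; NaN for a card's first transaction.
    prev = df.groupby("cc_num")[["latitude", "longitude"]].shift(1)
    df["loc_delta"] = haversine_km(
        prev["latitude"], prev["longitude"], df["latitude"], df["longitude"]
    ).fillna(0.0)
    return df


# Hypothetical toy data: card 1 moves one degree of longitude along the equator.
toy = pd.DataFrame({
    "cc_num": [1, 1, 2],
    "datetime": pd.to_datetime(
        ["2022-10-16 21:00:00", "2022-10-16 22:00:00", "2022-10-16 21:30:00"]
    ),
    "latitude": [0.0, 0.0, 10.0],
    "longitude": [0.0, 1.0, 10.0],
})
print(add_loc_delta(toy))
```

One degree of longitude at the equator is roughly 111 km, which gives a quick sanity check for the formula.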

Next, let's compute windowed aggregates. Here you will use 4-hour windows, but feel free to experiment with different window lengths by setting `WINDOW_LEN` below to a value of your choice.
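As a rough sketch of what a windowed aggregate looks like, the example below uses a time-based pandas rolling window per card to compute a transaction count and a mean amount. It is not the code behind `cc_features.aggregate_activity_by_hour`; the helper name `window_aggregates`, the toy data, and the output column names `trans_freq` and `trans_volume_mavg` are assumptions.

```python
import pandas as pd

WINDOW_LEN = 4  # window length in hours

# Hypothetical toy transactions for two cards.
toy = pd.DataFrame({
    "cc_num": [1111, 1111, 1111, 2222],
    "datetime": pd.to_datetime([
        "2022-10-16 21:00:00",
        "2022-10-16 23:00:00",
        "2022-10-17 02:00:00",
        "2022-10-16 22:00:00",
    ]),
    "amount": [10.0, 20.0, 30.0, 5.0],
})


def window_aggregates(df, window_len_hours):
    """Per-card rolling count and mean of `amount` over a trailing time window."""
    # Time-based rolling windows need the timestamps sorted within each card.
    df = df.sort_values(["cc_num", "datetime"])
    roller = (
        df.set_index("datetime")
        .groupby("cc_num")["amount"]
        .rolling(f"{window_len_hours}h")
    )
    out = pd.DataFrame({
        "trans_freq": roller.count(),
        "trans_volume_mavg": roller.mean(),
    })
    return out.reset_index()


print(window_aggregates(toy, WINDOW_LEN))
```

With the default settings, each window covers the `window_len_hours` leading up to and including the current transaction, so the third transaction of card 1111 no longer sees the first one, which happened five hours earlier.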

In [10]:
trans_df = cc_features.card_owner_age(trans_df, profiles_df)
trans_df = cc_features.expiry_days(trans_df, credit_cards_df)
trans_df = cc_features.activity_level(trans_df, 1)

In [11]:
WINDOW_LEN = 4
window_aggs_df = cc_features.aggregate_activity_by_hour(trans_df, WINDOW_LEN)

In [12]:
project = hopsworks.login()
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/2266


# 3) Add to the feature store

In [13]:
# Insert the transaction features without waiting for the backfill job to finish.
trans_fg = fs.get_feature_group(name="cc_trans_fraud", version=2)
trans_fg.insert(trans_df, write_options={"wait_for_job": False})

Uploading Dataframe: 0.00% |          | Rows 0/71323 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/2266/jobs/named/cc_trans_fraud_2_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fb15f413910>, None)

In [14]:
# Insert the windowed aggregates into the feature group for this window length.
window_aggs_fg = fs.get_feature_group(name=f"cc_trans_fraud_{WINDOW_LEN}h", version=2)
window_aggs_fg.insert(window_aggs_df, write_options={"wait_for_job": False})

Uploading Dataframe: 0.00% |          | Rows 0/71323 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/2266/jobs/named/cc_trans_fraud_4h_2_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fb1ecb92730>, None)

In [15]:
labels_fg = fs.get_feature_group(name="transactions_fraud_label", version=2)
labels_fg.insert(fraud_labels)

Uploading Dataframe: 0.00% |          | Rows 0/71323 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/2266/jobs/named/transactions_fraud_label_2_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fb15fa88c10>, None)