# Feature Pipeline using Synthetic Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/featurestoreorg/serverless-ml-course/blob/main/src/02-module/2_cc_feature_pipeline.ipynb)

**Note**: You may get an error when installing hopsworks on Colab; it is safe to ignore.

## 🗒️ This notebook is divided into 2 sections:
1. Reading the synthetic credit card data and engineering features,
2. Writing the Pandas DataFrames to the feature groups in the feature store.


In [2]:
#!pip install -U hopsworks --quiet
!pip install -U faker --quiet

In [3]:
import pandas as pd
import datetime
import hopsworks
from sml import synthetic_data
import random
pd.options.mode.chained_assignment = None

In [4]:
start_time = (datetime.datetime.now() - datetime.timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
print(start_time)

2024-03-21 12:48:11


In [5]:
#end_time = (datetime.datetime.now() - datetime.timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
end_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(end_time)

2024-03-22 12:48:13


In [6]:
synthetic_data.FRAUD_RATIO = random.uniform(0.001, 0.005)
synthetic_data.TOTAL_UNIQUE_USERS = 1000
synthetic_data.TOTAL_UNIQUE_TRANSACTIONS = 54000
synthetic_data.CASH_WITHRAWAL_CARDS_TOTAL = 2000
synthetic_data.TOTAL_UNIQUE_CASH_WITHDRAWALS = 200
synthetic_data.START_DATE=start_time
synthetic_data.END_DATE=end_time

credit_cards = synthetic_data.generate_list_credit_card_numbers()
credit_cards_df = synthetic_data.create_credit_cards_as_df(credit_cards)
profiles_df = synthetic_data.create_profiles_as_df(credit_cards)
trans_df = synthetic_data.create_transactions_as_df(credit_cards)

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

Fraudulent transactions can differ from regular ones in many ways. A typical red flag is, for instance, a large transaction volume or frequency within the span of a few hours. It could also be the case that elderly people in particular are targeted by fraudsters. To facilitate model learning, you will create additional features based on these patterns. In particular, you will create two types of features:
1. **Features that aggregate data from different data sources**. This could for instance be the age of a customer at the time of a transaction, which combines the `birthdate` feature from `profiles.csv` with the `datetime` feature from `transactions.csv`.
2. **Features that aggregate data from multiple time steps**. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.

Let's start with the first category.
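
As a rough illustration of the first category, the sketch below joins toy transaction and profile tables and derives the card owner's age at transaction time. The data is made up, the join key `cc_num` in the profiles table is an assumption, and this is not necessarily how `cc_features.card_owner_age` is implemented.

```python
import pandas as pd

# Hypothetical toy data; column names follow the ones mentioned in the text.
profiles = pd.DataFrame({
    "cc_num": [1111, 2222],
    "birthdate": pd.to_datetime(["1950-06-15", "1990-01-01"]),
})
transactions = pd.DataFrame({
    "tid": ["a", "b", "c"],
    "cc_num": [1111, 2222, 1111],
    "datetime": pd.to_datetime([
        "2024-03-21 12:48:12",
        "2024-03-21 13:00:00",
        "2024-03-22 09:30:00",
    ]),
})

# Attach the owner's profile to every transaction (assumes cc_num is the join key).
merged = transactions.merge(profiles, on="cc_num", how="left")

# Approximate age in years at the time of each transaction.
age_days = (merged["datetime"] - merged["birthdate"]).dt.days
merged["age_at_transaction"] = age_days / 365.25

print(merged[["tid", "cc_num", "age_at_transaction"]])
```

A left join keeps every transaction even if a profile is missing, in which case the age would be `NaN`.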

In [7]:
fraud_labels = trans_df[["tid", "cc_num", "datetime", "fraud_label"]].copy()
fraud_labels

Unnamed: 0,tid,cc_num,datetime,fraud_label
0,adc25e796202b5eb0ac786d4c93b6161,4118202503389390,2024-03-21 12:48:12,0
1,9db41ce9a9aaceee11ef6b496176e9a9,4786404303175968,2024-03-21 12:48:16,0
2,4d484a5e8d466d4b053d4ac3f0da68ef,4218697672505560,2024-03-21 12:48:16,0
3,2fd8246df1d45ec479f05debf66f558a,4070719751579600,2024-03-21 12:48:17,0
4,b4a1dc1234ecc463d4d5e4350cbce7fa,4076094619987013,2024-03-21 12:48:21,0
...,...,...,...,...
60081,741dd3652a7dfa360d5ba5905a89f867,4134652337490410,2024-04-09 12:27:52,0
60082,730872eed0ab5eb1e82664fec5076631,4134652337490410,2024-04-12 14:27:52,0
60083,74492f4f06c4fc9b9d9f84a9ed308fb3,4134652337490410,2024-04-15 16:27:52,0
60084,d001731d21feabbd262d3f037fdd4c2e,4134652337490410,2024-04-18 18:27:52,0


In [8]:
from sml import cc_features

fraud_labels["datetime"] = fraud_labels["datetime"].map(cc_features.date_to_timestamp)
fraud_labels

Unnamed: 0,tid,cc_num,datetime,fraud_label
0,adc25e796202b5eb0ac786d4c93b6161,4118202503389390,1711025292000,0
1,9db41ce9a9aaceee11ef6b496176e9a9,4786404303175968,1711025296000,0
2,4d484a5e8d466d4b053d4ac3f0da68ef,4218697672505560,1711025296000,0
3,2fd8246df1d45ec479f05debf66f558a,4070719751579600,1711025297000,0
4,b4a1dc1234ecc463d4d5e4350cbce7fa,4076094619987013,1711025301000,0
...,...,...,...,...
60081,741dd3652a7dfa360d5ba5905a89f867,4134652337490410,1712665672000,0
60082,730872eed0ab5eb1e82664fec5076631,4134652337490410,1712932072000,0
60083,74492f4f06c4fc9b9d9f84a9ed308fb3,4134652337490410,1713198472000,0
60084,d001731d21feabbd262d3f037fdd4c2e,4134652337490410,1713464872000,0
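
The output above shows the `datetime` column converted to integer milliseconds since the Unix epoch. A minimal sketch of such a conversion with pandas follows; the helper name `to_epoch_millis` is hypothetical, naive datetimes are assumed to be in UTC, and `cc_features.date_to_timestamp` may be implemented differently.

```python
import pandas as pd

def to_epoch_millis(series: pd.Series) -> pd.Series:
    """Convert datetimes (or datetime strings) to integer milliseconds since
    the Unix epoch, treating naive values as UTC (an assumption of this sketch)."""
    timestamps = pd.to_datetime(series)
    # Subtracting the epoch gives timedeltas; floor-dividing by 1 ms gives integers.
    return (timestamps - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)

example = pd.Series(["2024-03-21 12:48:12", "2024-03-21 12:48:16"])
millis = to_epoch_millis(example)
print(millis.tolist())
```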


In [9]:
trans_df

Unnamed: 0,tid,datetime,cc_num,category,amount,latitude,longitude,city,country,fraud_label
0,adc25e796202b5eb0ac786d4c93b6161,2024-03-21 12:48:12,4118202503389390,Electronics,456.97,26.684510,-80.667560,Belle Glade,US,0
1,9db41ce9a9aaceee11ef6b496176e9a9,2024-03-21 12:48:16,4786404303175968,Grocery,71.93,39.717340,-74.969330,Sicklerville,US,0
2,4d484a5e8d466d4b053d4ac3f0da68ef,2024-03-21 12:48:16,4218697672505560,Restaurant/Cafeteria,10.87,30.166880,-96.397740,Brenham,US,0
3,2fd8246df1d45ec479f05debf66f558a,2024-03-21 12:48:17,4070719751579600,Grocery,79.64,40.567540,-89.640660,Pekin,US,0
4,b4a1dc1234ecc463d4d5e4350cbce7fa,2024-03-21 12:48:21,4076094619987013,Holliday/Travel,43.42,36.610330,-88.314760,Murray,US,0
...,...,...,...,...,...,...,...,...,...,...
60081,741dd3652a7dfa360d5ba5905a89f867,2024-04-09 12:27:52,4134652337490410,Cash Withdrawal,62.36,34.951040,-120.410433,Santa Maria,US,0
60082,730872eed0ab5eb1e82664fec5076631,2024-04-12 14:27:52,4134652337490410,Cash Withdrawal,79.94,34.951507,-120.407653,Santa Maria,US,0
60083,74492f4f06c4fc9b9d9f84a9ed308fb3,2024-04-15 16:27:52,4134652337490410,Cash Withdrawal,2.69,34.961007,-120.410430,Santa Maria,US,0
60084,d001731d21feabbd262d3f037fdd4c2e,2024-04-18 18:27:52,4134652337490410,Cash Withdrawal,16.80,34.968168,-120.418600,Santa Maria,US,0


In [10]:
trans_df.drop(columns=["fraud_label"], inplace=True)

In [11]:
trans_df = cc_features.card_owner_age(trans_df, profiles_df)
trans_df = cc_features.expiry_days(trans_df, credit_cards_df)
trans_df = cc_features.activity_level(trans_df, 1)

In [12]:
window_len = 4
window_aggs_df = cc_features.aggregate_activity_by_hour(trans_df, window_len)

Next, you will create features that aggregate data from multiple time steps for each credit card.

You will start by computing the distance between consecutive transactions; let's call it `loc_delta`.
Here you will use the [Haversine distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html?highlight=haversine#sklearn.metrics.pairwise.haversine_distances) to quantify the distance between two points given by their longitude and latitude coordinates.
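
To make the idea concrete, here is a minimal sketch of a per-card `loc_delta` computed with a hand-written Haversine formula in NumPy. The helper `haversine_km`, the toy data, the kilometre unit, and the choice to set the first transaction of each card to `0.0` are all assumptions of this example rather than the behaviour of the `sml` helpers.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoots outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical toy transactions for two cards.
df = pd.DataFrame({
    "cc_num": [1, 1, 2],
    "datetime": pd.to_datetime(["2024-03-21 10:00", "2024-03-21 11:00", "2024-03-21 10:30"]),
    "latitude": [0.0, 0.0, 10.0],
    "longitude": [0.0, 90.0, 10.0],
})

# Order transactions per card, then look up each card's previous location.
df = df.sort_values(["cc_num", "datetime"]).reset_index(drop=True)
prev = df.groupby("cc_num")[["latitude", "longitude"]].shift(1)

# Distance to the previous transaction; the first one per card has none, so use 0.
df["loc_delta"] = haversine_km(
    prev["latitude"], prev["longitude"], df["latitude"], df["longitude"]
).fillna(0.0)

print(df[["cc_num", "datetime", "loc_delta"]])
```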

Next, let's compute windowed aggregates. Here you will use 4-hour windows, but feel free to experiment with different window lengths by setting `window_len` above to a value of your choice.
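
A windowed aggregate of this kind can be sketched with a time-based rolling window in pandas. The example below uses made-up data and computes a transaction count and mean amount per card over the trailing `window_len` hours; the chosen aggregates and column names are assumptions and may differ from what `cc_features.aggregate_activity_by_hour` produces.

```python
import pandas as pd

window_len = 4  # window length in hours

# Hypothetical toy transactions for two cards.
df = pd.DataFrame({
    "cc_num": [1, 1, 1, 2],
    "datetime": pd.to_datetime([
        "2024-03-21 10:00",
        "2024-03-21 11:00",
        "2024-03-21 16:00",
        "2024-03-21 10:30",
    ]),
    "amount": [10.0, 20.0, 30.0, 5.0],
})

# Time-based rolling windows need a sorted datetime index within each card.
df = df.sort_values(["cc_num", "datetime"])

rolled = (
    df.set_index("datetime")
      .groupby("cc_num")["amount"]
      .rolling(pd.Timedelta(hours=window_len))  # trailing window ending at each row
      .agg(["count", "mean"])
      .reset_index()
)

print(rolled)
```

Each row aggregates the transactions of the same card that fall in the `window_len` hours up to and including that transaction, so the 16:00 transaction of card 1 stands alone in its window.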

In [13]:
project = hopsworks.login()
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/547992
Connected. Call `.close()` to terminate connection gracefully.


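To create a feature group, you need to give it a name and specify a primary key. It is also good practice to provide a description of its contents and a version number; if the version is not defined, it is automatically incremented, starting at `1`. The cells below retrieve existing feature groups (version 2) with `get_feature_group` and insert the new data into them.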
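
If the feature group does not exist yet, one way to obtain it is the feature store's `get_or_create_feature_group` method. The sketch below wraps that call in a hypothetical helper; the description text, the primary key `cc_num`, and the event-time column `datetime` are assumptions for illustration and should be adapted to your own schema.

```python
def get_or_create_transactions_fg(fs):
    """Return the transactions feature group, creating it if it does not exist.

    `fs` is the feature store handle returned by `project.get_feature_store()`.
    """
    return fs.get_or_create_feature_group(
        name="cc_trans_fraud",
        version=2,
        description="Credit card transaction features",  # illustrative wording
        primary_key=["cc_num"],  # assumed key column
        event_time="datetime",   # assumed event-time column
    )
```

With a live connection, `trans_fg = get_or_create_transactions_fg(fs)` could then be followed by `trans_fg.insert(trans_df)` as in the cells below.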

In [14]:
trans_fg = fs.get_feature_group(name="cc_trans_fraud", version=2)
trans_fg.insert(trans_df, write_options={"wait_for_job" : False})

Uploading Dataframe: 100.00% |██████████| Rows 60086/60086 | Elapsed Time: 00:10 | Remaining Time: 00:00


Launching job: cc_trans_fraud_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/547992/jobs/named/cc_trans_fraud_2_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fe05cea3130>, None)

In [15]:
window_aggs_fg = fs.get_feature_group(name=f"cc_trans_fraud_{window_len}h", version=2)
window_aggs_fg.insert(window_aggs_df, write_options={"wait_for_job" : False})

Uploading Dataframe: 100.00% |██████████| Rows 60086/60086 | Elapsed Time: 00:07 | Remaining Time: 00:00


Launching job: cc_trans_fraud_4h_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/547992/jobs/named/cc_trans_fraud_4h_2_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fe01e529270>, None)

In [16]:

labels_fg = fs.get_feature_group(name="transactions_fraud_label", version=2)
labels_fg.insert(fraud_labels)

Uploading Dataframe: 100.00% |██████████| Rows 60086/60086 | Elapsed Time: 00:07 | Remaining Time: 00:00


Launching job: transactions_fraud_label_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/547992/jobs/named/transactions_fraud_label_2_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fe01e529360>, None)