In [None]:
!pip install -U 'git+https://github.com/logicalclocks/feature-store-api@master#egg=hsfs[python]&subdirectory=python' --quiet
!pip uninstall -y hopsworks
!pip install -U 'git+https://github.com/logicalclocks/hopsworks-api@main#egg=hopsworks&subdirectory=python' --quiet

In [None]:
import hopsworks

In [None]:
project = hopsworks.login()


💽 Loading the Data

The data we will use comes from three different CSV files:

    credit_cards.csv: credit card information such as expiration date and provider.
    transactions.csv: transaction information such as timestamp, location, and the amount. Importantly, the binary fraud_label variable tells us whether a transaction was fraudulent or not.
    profiles.csv: credit card user information such as birthdate and city of residence.

We can conceptualize these CSV files as originating from separate data sources. All three files have a credit card number column cc_num in common, which we can use for joins.

Let's go ahead and load the data.


In [None]:
import pandas as pd
credit_cards_df = pd.read_csv("https://repo.hops.works/dev/davit/card_fraud_data/credit_cards.csv")
credit_cards_df.head(3)

profiles_df = pd.read_csv("https://repo.hops.works/dev/davit/card_fraud_data/profiles.csv", parse_dates=["birthdate"])
profiles_df.head(3)

trans_df = pd.read_csv("https://repo.hops.works/dev/davit/card_fraud_data/transactions.csv", parse_dates=["datetime"])
trans_df.head(3)


🛠️ Feature Engineering

Fraudulent transactions can differ from regular ones in many ways. A typical red flag is an unusually large transaction volume or frequency within the span of a few hours. It could also be the case that elderly people in particular are targeted by fraudsters. To facilitate model learning, we will create additional features based on these patterns. In particular, we will create two types of features:

    Features that aggregate data from different data sources. This could for instance be the age of a customer at the time of a transaction, which combines the birthdate feature from profiles.csv with the datetime feature from transactions.csv.
    Features that aggregate data from multiple time steps. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.

Let's start with the first category.


In [None]:
import numpy as np

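# Compute age at transaction, in years (a year is approximated as 365.25 days).
# Recent pandas versions no longer accept the "Y" unit in np.timedelta64 for this division.
age_df = trans_df.merge(profiles_df, on="cc_num", how="left")
trans_df["age_at_transaction"] = (age_df["datetime"] - age_df["birthdate"]).dt.days / 365.25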

# Compute days until card expires.
card_expiry_df = trans_df.merge(credit_cards_df, on="cc_num", how="left")
card_expiry_df["expires"] = pd.to_datetime(card_expiry_df["expires"], format="%m/%y")
trans_df["days_until_card_expires"] = (card_expiry_df["expires"] - card_expiry_df["datetime"]) / np.timedelta64(1, "D")

trans_df[["age_at_transaction", "days_until_card_expires"]].head()
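The cross-source pattern above can be sketched on toy data. The frames and values below are made up for illustration, and the 365.25-day year is an approximation. The `validate="many_to_one"` argument is an addition not present in the tutorial: it makes the merge fail loudly if a card number appears more than once in the profiles, which would otherwise silently break the row alignment with the transactions frame.

```python
import pandas as pd

# Toy stand-ins for transactions.csv and profiles.csv (values are made up).
toy_trans = pd.DataFrame({
    "cc_num": [111, 222, 111],
    "datetime": pd.to_datetime(["2020-01-01", "2021-01-01", "2022-01-01"]),
})
toy_profiles = pd.DataFrame({
    "cc_num": [111, 222],
    "birthdate": pd.to_datetime(["2000-01-01", "1961-01-01"]),
})

# A left join keeps one row per transaction, in the original row order.
toy_joined = toy_trans.merge(
    toy_profiles, on="cc_num", how="left", validate="many_to_one"
)

# Age in years, approximating a year as 365.25 days.
toy_trans["age_at_transaction"] = (
    toy_joined["datetime"] - toy_joined["birthdate"]
).dt.days / 365.25

print(toy_trans)
```

Because the merged frame and the transactions frame share the same default integer index, the new column lines up row by row.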




Next, we create features that for each credit card aggregate data from multiple time steps.

We start by computing the distance between consecutive transactions, which we will call loc_delta. Here we use the Haversine distance to quantify the distance between two longitude and latitude coordinates.


In [None]:
# Do some simple preprocessing.
trans_df.sort_values("datetime", inplace=True)
# Convert degrees to radians in one vectorized call.
trans_df[["longitude", "latitude"]] = np.radians(trans_df[["longitude", "latitude"]])

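def haversine(long, lat):
    """Compute the Haversine central angle (in radians) between consecutive coordinates in (long, lat).

    Inputs are expected in radians. Multiply the result by the Earth's radius to obtain a distance.
    """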

    long_shifted = long.shift()
    lat_shifted = lat.shift()
    long_diff = long_shifted - long
    lat_diff = lat_shifted - lat

    a = np.sin(lat_diff/2.0)**2
    b = np.cos(lat) * np.cos(lat_shifted) * np.sin(long_diff/2.0)**2
    c = 2*np.arcsin(np.sqrt(a + b))

    return c


trans_df["loc_delta"] = trans_df.groupby("cc_num")\
    .apply(lambda x : haversine(x["longitude"], x["latitude"]))\
    .reset_index(level=0, drop=True)\
    .fillna(0)
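The function above works on a unit sphere, so loc_delta is an angle in radians rather than a length. As a minimal sketch, assuming a mean Earth radius of 6371 km (a constant not taken from the tutorial), the same formula can be turned into kilometres for a single pair of points. The helper name `haversine_km` is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not part of the tutorial


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = np.sin((lat2 - lat1) / 2.0) ** 2
    b = np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    central_angle = 2 * np.arcsin(np.sqrt(a + b))
    return EARTH_RADIUS_KM * central_angle


# One degree of longitude along the equator.
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(float(one_degree_km), 2))
```

For model training the scaling factor matters little, since loc_delta is min-max scaled later, but kilometres are easier to sanity-check by eye.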



Next we compute windowed aggregates. Here we will use 4-hour windows, but feel free to experiment with different window lengths by setting window_len below to a value of your choice.

In [None]:
window_len = "4h"
cc_group = trans_df[["cc_num", "amount", "datetime"]].groupby("cc_num").rolling(window_len, on="datetime")

# Moving average of transaction volume.
df_4h_mavg = pd.DataFrame(cc_group.mean())
df_4h_mavg.columns = ["trans_volume_mavg", "datetime"]
df_4h_mavg = df_4h_mavg.reset_index(level=["cc_num"])
df_4h_mavg = df_4h_mavg.drop(columns=["cc_num", "datetime"])
df_4h_mavg = df_4h_mavg.sort_index()

# Moving standard deviation of transaction volume.
df_4h_std = pd.DataFrame(cc_group.std())
df_4h_std.columns = ["trans_volume_mstd", "datetime"]
df_4h_std = df_4h_std.reset_index(level=["cc_num"])
df_4h_std = df_4h_std.drop(columns=["cc_num", "datetime"])
df_4h_std = df_4h_std.fillna(0)
df_4h_std = df_4h_std.sort_index()
window_aggs_df = df_4h_std.merge(df_4h_mavg,left_index=True, right_index=True)

# Moving count of transactions (transaction frequency).
df_4h_count = pd.DataFrame(cc_group.count())
df_4h_count.columns = ["trans_freq", "datetime"]
df_4h_count = df_4h_count.reset_index(level=["cc_num"])
df_4h_count = df_4h_count.drop(columns=["cc_num", "datetime"])
df_4h_count = df_4h_count.sort_index()
window_aggs_df = window_aggs_df.merge(df_4h_count,left_index=True, right_index=True)

# Moving average of location difference between consecutive transactions.
cc_group = trans_df[["cc_num", "loc_delta", "datetime"]].groupby("cc_num").rolling(window_len, on="datetime").mean()
df_4h_loc_delta_mavg = pd.DataFrame(cc_group)
df_4h_loc_delta_mavg.columns = ["loc_delta_mavg", "datetime"]
df_4h_loc_delta_mavg = df_4h_loc_delta_mavg.reset_index(level=["cc_num"])
df_4h_loc_delta_mavg = df_4h_loc_delta_mavg.drop(columns=["cc_num", "datetime"])
df_4h_loc_delta_mavg = df_4h_loc_delta_mavg.sort_index()
window_aggs_df = window_aggs_df.merge(df_4h_loc_delta_mavg,left_index=True, right_index=True)

window_aggs_df = window_aggs_df.merge(trans_df[["cc_num", "datetime"]].sort_index(),left_index=True, right_index=True)
window_aggs_df.tail()
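To see what a time-based window computes, here is a minimal sketch on toy data (the values are made up). It uses a slightly different but equivalent formulation: the timestamp is set as the index and only the amount column is rolled, which sidesteps the column renaming done above. With a "4h" window, each row aggregates that card's transactions in the four hours up to and including the row's own timestamp.

```python
import pandas as pd

toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 2],
    "datetime": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00",
        "2022-01-01 06:00", "2022-01-01 00:30",
    ]),
    "amount": [10.0, 20.0, 30.0, 5.0],
}).sort_values("datetime")

# Roll over time per card; the index must be a sorted datetime index.
rolled = toy.set_index("datetime").groupby("cc_num")["amount"].rolling("4h")

toy_aggs = pd.DataFrame({
    "trans_volume_mavg": rolled.mean(),
    "trans_volume_mstd": rolled.std().fillna(0),  # std of a single row is NaN
    "trans_freq": rolled.count(),
})
print(toy_aggs)
```

For card 1, the 01:00 row sees both the 00:00 and 01:00 transactions, while the 06:00 row sees only itself because the earlier ones fall outside the window.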


Convert the datetime columns to Unix epoch time in milliseconds.

In [None]:
trans_df.datetime = trans_df.datetime.values.astype(np.int64) // 10 ** 6
window_aggs_df.datetime = window_aggs_df.datetime.values.astype(np.int64) // 10 ** 6
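The integer cast above relies on the column being stored at nanosecond resolution. As a hedged alternative sketch (not the tutorial's method), subtracting the epoch and floor-dividing by a one-millisecond Timedelta gives the same result without depending on the storage resolution. The toy values are made up.

```python
import pandas as pd

toy_times = pd.Series(pd.to_datetime(["1970-01-01 00:00:01", "2022-01-01 00:00:00"]))

# Elapsed time since the Unix epoch, expressed in whole milliseconds.
epoch = pd.Timestamp("1970-01-01")
toy_epoch_ms = (toy_times - epoch) // pd.Timedelta(milliseconds=1)

print(toy_epoch_ms.tolist())
```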


🪄 Creating Feature Groups

A feature group can be seen as a collection of conceptually related features. In our case, we will create a feature group for the transaction data and a feature group for the windowed aggregations on the transaction data. Both will have cc_num as primary key, which will allow us to join them when creating a training dataset later.

Feature groups can also be used to define a namespace for features. For instance, in a real-life setting we would likely want to experiment with different window lengths. In that case, we can create feature groups with identical schema for each window length.

Before we can create a feature group we need to connect to our feature store.


In [None]:
fs = project.get_feature_store()

To create a feature group we need to give it a name and specify a primary key. It is also good practice to provide a description of its contents and a version number; if the version is not defined, it is incremented automatically, starting at 1.

In [None]:
trans_fg = fs.create_feature_group(
    name="transactions",
    version=1,
    description="Transaction data",
    primary_key=['cc_num'],
    event_time=['datetime'],
    online_enabled=True
)



Here we have also set online_enabled=True, which enables low latency access to the data. A full list of arguments can be found in the documentation.

At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent we populate it with its associated data using the save function.


In [None]:
trans_fg.save(trans_df)

We can now do the same thing for the feature group with our windowed aggregations.

In [None]:
window_aggs_fg = fs.create_feature_group(
    name=f"transactions_{window_len}_aggs",
    description=f"Aggregate transaction data over {window_len} windows.",
    primary_key=['cc_num'],
    event_time=['datetime']
)

In [None]:
window_aggs_fg.save(window_aggs_df)

Both feature groups are now accessible and searchable in the UI.


🔪 Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [None]:
# Load feature groups.
trans_fg = fs.get_feature_group('transactions', version=1)
window_aggs_fg = fs.get_feature_group('transactions_4h_aggs', version=1)

# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["cc_num"]), on="cc_num")

ds_query.show(5)



Recall that we computed the features in transactions_4h_aggs using 4-hour aggregates. If we had created multiple feature groups with identical schema for different window lengths and wanted to include them in the join, we would need to pass a prefix argument to the join to avoid feature name clashes. See the documentation for more details.

🤖 Transformation Functions

We will preprocess our data using min-max scaling on numerical features and label encoding on categorical features. To do this we simply define a mapping between our features and transformation functions. Transformation functions such as min-max scaling are then fitted only on the training data (and not the validation/test data), which prevents data leakage.


In [None]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}
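To illustrate what "fitted only on the training data" means, here is a minimal plain-pandas sketch. It is not how Hopsworks implements its built-in transformation functions; the toy frames and the helper name `transform` are made up for illustration. The scaling bounds and the category codes are learned from the training split and then applied unchanged to the validation split.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "amount": [0.0, 10.0, 4.0],
    "category": ["grocery", "travel", "grocery"],
})
toy_val = pd.DataFrame({
    "amount": [5.0, 20.0],
    "category": ["travel", "grocery"],
})

# "Fit" step: statistics come from the training split only.
amount_min = toy_train["amount"].min()
amount_max = toy_train["amount"].max()
category_codes = {
    name: code for code, name in enumerate(sorted(toy_train["category"].unique()))
}


def transform(df):
    """Apply the training-split statistics to any split."""
    out = df.copy()
    out["amount"] = (out["amount"] - amount_min) / (amount_max - amount_min)
    out["category"] = out["category"].map(category_codes)
    return out


train_scaled = transform(toy_train)
val_scaled = transform(toy_val)
print(val_scaled)
```

Note that a validation value above the training maximum is scaled to a number greater than 1. That is expected: rescaling with validation statistics would leak information about the held-out data into preprocessing.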


⚙️ Feature View Creation

A feature view defines a schema in the form of a query with filters, a model target feature/label, and additional transformation functions. To create a feature view we can use fs.create_feature_view().


In [None]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    labels=["fraud_label"],
    transformation_functions=transformation_functions
)


🏋️ Training Dataset Creation

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.

A training dataset may contain splits such as:

    Training set - the subset of training data used to train a model.
    Validation set - the subset of training data used to evaluate hyperparameters when training a model.
    Test set - the holdout subset of training data used to evaluate a model.

A training dataset is created using the feature_view.create_training_dataset() method.

With the feature view APIs we can also create training datasets based on event time filters by specifying start_time and end_time.


In [None]:
from datetime import datetime
date_format = "%Y-%m-%d %H:%M:%S"

# Create a training dataset based on an event time filter.
start_time = int(float(datetime.strptime("2022-01-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-02-28 23:59:59", date_format).timestamp()) * 1000)

td_version, td_job = feature_view.create_training_dataset(
    description = 'transactions_dataset_jan_feb',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",    
    write_options = {'wait_for_job': True},
    coalesce = True,
    start_time = start_time,
    end_time = end_time,
)
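The effect of an event time filter followed by an 80/20 split can be sketched locally in pandas. This only imitates the idea and is not the Hopsworks implementation; the toy events and the helper name `to_epoch_ms` are made up. One design point worth noting: calling `.timestamp()` on a naive datetime, as in the cell above, interprets it in the machine's local timezone, so the sketch pins UTC explicitly as an assumption.

```python
from datetime import datetime, timezone

import pandas as pd


def to_epoch_ms(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string as UTC and return epoch milliseconds."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


toy_start = to_epoch_ms("2022-01-01 00:00:01")
toy_end = to_epoch_ms("2022-02-28 23:59:59")

# Twelve toy events, one every six days starting 2021-12-28.
six_days_ms = 6 * 24 * 60 * 60 * 1000
first_event = to_epoch_ms("2021-12-28 00:00:00")
toy_events = pd.DataFrame({
    "datetime": [first_event + i * six_days_ms for i in range(12)],
    "amount": [float(i) for i in range(12)],
})

# Keep only events inside the [start, end] window.
in_window = toy_events[toy_events["datetime"].between(toy_start, toy_end)]

# Random 80/20 split of the filtered rows.
toy_train_split = in_window.sample(frac=0.8, random_state=42)
toy_validation_split = in_window.drop(toy_train_split.index)

print(len(in_window), len(toy_train_split), len(toy_validation_split))
```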

To view and explore data in the feature view we can retrieve batch data using the get_batch_data() method.

In [None]:
feature_view.get_batch_data().head(5)


🪝 Training Dataset Retrieval

To retrieve training data from storage (already materialised) or directly from feature groups, we can use the get_training_dataset_splits or get_training_dataset methods. If no version is provided, or the provided version does not exist yet, a new version of the training data is created according to the given arguments and returned as a dataframe. If the provided version already exists, the training data is read from storage or from the feature groups and returned as a dataframe. If a split is provided, only that split is read.


In [None]:
_, df = feature_view.get_training_dataset_splits({'train': 80, 'validation': 20}, version = td_version)

In [None]:
df['train']

In [None]:
df['validation']