# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Pipeline</span>

**Note**: This tutorial does not support Google Colab.

This is the first part of the quick start series of tutorials about Hopsworks Feature Store. As part of this first module, you will work with data related to credit card transactions. 
The objective of this tutorial is to demonstrate how to work with **on-demand transformation functions** in the **Hopsworks Feature Store** for online data, with the goal of training and deploying a model that can predict fraudulent transactions.


## 🗒️ This notebook is divided into 3 sections:
1. Loading the data and feature engineering.
2. Creating on-demand transformation functions.
3. Creating feature groups with on-demand transformations and uploading them to the Feature Store.

![tutorial-flow](../../../images/01_featuregroups.png)

First of all you will load the data and do some feature engineering on it.

# <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U hopsworks --quiet

In [None]:
from math import radians

import numpy as np
import pandas as pd

from features import transactions_fraud

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

# <span style='color:#ff5f27'> 📝 Feature Pipeline </span>

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

The data you will use comes from 2 different CSV files:

- `transactions.csv`: events containing information about when a credit card was used, such as a timestamp, location, and the amount spent. A boolean `fraud_label` variable (True/False) indicates whether a transaction was fraudulent.
- `profiles.csv`: credit card user information such as birthdate and city of residence.

In a production system, these CSV files would originate from separate data sources or tables, and probably separate data pipelines. **These files have a common credit card number column, `cc_num`, which you will use later to join features together from the different datasets.**

Now, you can go ahead and load the data.

In [None]:
# Read the profiles data from a CSV file
profiles_df = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_online/profiles.csv", 
    parse_dates=["birthdate"],
)

# Rename columns for clarity
profiles_df.columns = ["name", "gender", "mail", "birthdate", "City", "Country", "cc_num"]

# Display the first three rows of the DataFrame
profiles_df.head(3)

In [None]:
# Read the transactions data from a CSV file
trans_df = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_online/transactions.csv", 
    parse_dates=["datetime"],
)

# Display the first three rows of the DataFrame
trans_df.head(3)

In [None]:
# Filter transactions DataFrame to include only rows with category "Cash Withdrawal"
trans_df = trans_df[trans_df.category == "Cash Withdrawal"].reset_index(level=0, drop=True)

# Fill missing values in the 'country' column with "US"
trans_df["country"] = trans_df["country"].fillna("US")

# Add birthdate to trans_df; it is needed later to compute age_at_transaction
trans_df = trans_df.merge(profiles_df, on="cc_num")[['tid', 'datetime', 'cc_num', 'category', 'amount', 'latitude',
       'longitude', 'city', 'country', 'fraud_label', 'birthdate']]

# Filter profiles DataFrame to include only rows with credit card numbers present in the filtered transactions DataFrame
profiles_df = profiles_df[profiles_df.cc_num.isin(trans_df.cc_num.unique())].reset_index(level=0, drop=True)

In [None]:
# Sort the transactions DataFrame by 'datetime' and 'cc_num'
trans_df.sort_values(["datetime", "cc_num"], inplace=True)

---

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

Fraudulent transactions can differ from regular ones in many different ways. Typical red flags would for instance be a large transaction volume/frequency in the span of a few hours. It could also be the case that elderly people in particular are targeted by fraudsters. To facilitate model learning you will create additional features based on these patterns. In particular, you will create two types of features:

1. **Features that aggregate data from multiple time steps**. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.
2. **Features that aggregate data from different data sources**. This could for instance be the age of a customer at the time of a transaction, which combines the `birthdate` feature from `profiles.csv` with the `datetime` feature from `transactions.csv`.
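
To make the first feature type more concrete, here is a minimal sketch of a trailing-window transaction count in plain Python. The 4-hour window, the `trailing_counts` helper, and the sample timestamps are illustrative assumptions; the tutorial's own helper module may compute its features differently.

```python
from collections import deque
from datetime import datetime, timedelta


def trailing_counts(timestamps, window=timedelta(hours=4)):
    """Count, for each timestamp, the events in the trailing window (itself included).

    Assumes `timestamps` is sorted in ascending order and belongs to one card.
    """
    recent = deque()
    counts = []
    for ts in timestamps:
        recent.append(ts)
        # Drop events that are at least one full window older than the current one
        while ts - recent[0] >= window:
            recent.popleft()
        counts.append(len(recent))
    return counts


# Hypothetical transaction times for a single credit card
times = [
    datetime(2022, 1, 1, 10, 0),
    datetime(2022, 1, 1, 11, 30),
    datetime(2022, 1, 1, 13, 0),
    datetime(2022, 1, 1, 15, 0),
]
counts = trailing_counts(times)
print(counts)
```

In a real pipeline the same idea would typically be applied per `cc_num` after sorting by `datetime`.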

Let's start with the first category.

Start by computing the distance between consecutive transactions, which we will call `loc_delta`.
Here you will use the [Haversine distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html?highlight=haversine#sklearn.metrics.pairwise.haversine_distances) to quantify the distance between two longitude and latitude coordinates.
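
For intuition, the Haversine formula can be sketched with the standard library alone. This is an illustration under stated assumptions, not necessarily how `prepare_transactions_fraud` implements it: the `haversine_km` name and the mean Earth radius of 6371 km are choices made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


# Identical points are zero distance apart
same_point = haversine_km(40.0, -74.0, 40.0, -74.0)
# Two points on the equator, half the globe apart
half_equator = haversine_km(0.0, 0.0, 0.0, 180.0)
print(same_point, half_equator)
```

Applying such a function to each transaction and the previous transaction of the same card yields a distance feature like `loc_delta`.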

In [None]:
# Use the prepare_transactions_fraud function to process the trans_df DataFrame
trans_df = transactions_fraud.prepare_transactions_fraud(trans_df)

# Display the first three rows of the modified DataFrame
trans_df.head(3)

Next, you'll move on to the second category of features. Here, you'll calculate `age_at_transaction`, which can be treated as an on-demand feature.

### <span style="color:#ff5f27;"> ⚡️ On-Demand Transformation Functions </span>

On-demand features are features that can only be computed at the time of an inference request, based on certain parameters available at that moment. You can learn more in the documentation available [here](https://docs.hopsworks.ai/latest/user_guides/fs/feature_group/on_demand_transformations/).

To calculate the feature `age_at_transaction`, two parameters are needed: the transaction time and the date of birth of the person. The date of birth can be retrieved from an existing feature group, but the transaction time is only known when the inference request is made. As a result, the `age_at_transaction` feature is classified as an on-demand feature.
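
The split between stored and request-time inputs can be sketched in plain Python. Everything here is hypothetical and for illustration only: the `stored_birthdates` dictionary stands in for a feature group lookup, and `compute_age` and `handle_request` are made-up names, not Hopsworks APIs.

```python
from datetime import date

# Hypothetical stand-in for a feature that is already stored in a feature group
stored_birthdates = {"4444000011112222": date(1998, 3, 21)}


def compute_age(transaction_date, birthdate):
    """Approximate age in years as elapsed days divided by 365."""
    return (transaction_date - birthdate).days / 365


def handle_request(cc_num, transaction_date):
    # birthdate is looked up from storage; transaction_date arrives with the request
    birthdate = stored_birthdates[cc_num]
    return compute_age(transaction_date, birthdate)


age = handle_request("4444000011112222", date(2022, 1, 1))
print(round(age, 2))
```

Because `transaction_date` exists only once a request arrives, the age cannot be precomputed for future requests, which is what makes the feature on-demand.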

Hopsworks enables the creation of on-demand features through on-demand transformation functions. An on-demand transformation function is created by attaching a transformation function to a feature group within Hopsworks.

To create a transformation function, you need to use the `@hopsworks.udf` decorator. Let's start by importing the Hopsworks library.

In [None]:
import hopsworks

Now, let's create a transformation function that computes the on-demand feature `age_at_transaction`. Once the computation is complete, `birthdate` is dropped so that it is not included in this feature group, since it is already stored in another feature group.

In [None]:
@hopsworks.udf(return_type=float, drop=["birthdate"])
def age_at_transaction(datetime, birthdate):
    return (datetime - birthdate).dt.days / 365

Now, let's test the transformation function we've defined. To do this, you'll first need to establish a connection to Hopsworks.

In [None]:
project = hopsworks.login()

fs = project.get_feature_store()

In [None]:
age_at_transaction.output_column_names = "age_at_transaction"

test_df = pd.DataFrame({
    'transaction_time': pd.to_datetime(['2022-01-01', '2022-01-15']),
    'date_of_birth': pd.to_datetime(['1998-03-21', '2000-01-30'])
})

age_at_transaction.get_udf()(test_df['transaction_time'], test_df['date_of_birth'])

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/3.0/concepts/fs/feature_group/fg_overview/) can be seen as a collection of conceptually related features. In this case, you will create a feature group for the transaction data and a feature group for the credit card holder profiles. Both will have `cc_num` as primary key, which will allow you to join them when creating a dataset in the next tutorial.

Feature groups can also be used to define a namespace for features. For instance, in a real-life setting you would likely want to experiment with different window lengths. In that case, you can create feature groups with identical schema for each window length. 

In [None]:
fs.name

To create a feature group you need to give it a name and specify a primary key. It is also good practice to provide a description of its contents and a version number. If you omit the version, Hopsworks assigns one automatically, starting at `1`.

To add the on-demand feature `age_at_transaction` to a feature group, you must create an on-demand transformation function by attaching the previously defined `age_at_transaction` transformation function to the feature group. The features to be passed to the transformation function can either be explicitly specified as parameters or, if not provided, the function will automatically use features from the feature group that match the names of the function's arguments.

In [None]:
# Get or create the 'transactions_fraud_online_fg' feature group
trans_fg = fs.get_or_create_feature_group(
    name="transactions_fraud_online_fg",
    version=1,
    description="Transaction data",
    primary_key=['cc_num'],
    event_time='datetime',
    # Attaching the transformation function `age_at_transaction` to the feature group to create the on-demand feature `age_at_transaction`
    transformation_functions=[age_at_transaction],
    online_enabled=True,
)

Here you have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you need to populate it with its associated data using the `insert` function.

When inserting data into a feature group with an on-demand transformation function, you have to include all the features required for the transformation in the DataFrame being inserted. 

Hopsworks computes all on-demand features using the transformation function when data is inserted into the feature group, allowing for backfilling of on-demand features. This backfilling process reduces the computational effort required for creating training data, as these transformations do not need to be applied repeatedly.

In [None]:
# Insert data into feature group
trans_fg.insert(trans_df)
print('✅ Done!')

In [None]:
# Update feature descriptions
feature_descriptions = [
    {"name": "tid", "description": "Transaction id"},
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "amount", "description": "Dollar amount of the transaction"},
    {"name": "country", "description": "Country in which the transaction was made"},
    {"name": "fraud_label", "description": "Whether the transaction was fraudulent or not"},
    {"name": "loc_delta_t_minus_1", "description": "Distance between the locations of consecutive transactions"},
    {"name": "time_delta_t_minus_1", "description": "Time elapsed between consecutive transactions"},
    {"name": "age_at_transaction", "description": "Age of user at the time the transaction has been performed"},
]

for desc in feature_descriptions: 
    trans_fg.update_feature_description(desc["name"], desc["description"])

You can now check the UI to see that the on-demand feature `age_at_transaction` is present in the feature group along with the other features. On-demand features in a feature group can be used like any other feature when creating a feature view for model training and inference. You will see this in the following notebook.

![tutorial-flow](images/on_demand_example.png)

You can now do the same for the profile feature group.

In [None]:
# Get or create the 'profile_fraud_online_fg' feature group
profile_fg = fs.get_or_create_feature_group(
    name="profile_fraud_online_fg",
    version=1,
    description="Credit card holder demographic data",
    primary_key=['cc_num'],
    online_enabled=True,
)
# Insert data into feature group
profile_fg.insert(profiles_df)
print('✅ Done!')

In [None]:
# Update feature descriptions
feature_descriptions = [
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "gender", "description": "Gender of the credit card holder"},
]

for desc in feature_descriptions: 
    profile_fg.update_feature_description(desc["name"], desc["description"])

## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 Training Pipeline </span>

In the following notebook you will use these feature groups to create a dataset on which you can train a model.