# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 01: Feature Pipeline</span>

**Note**: This tutorial does not support Google Colab.

This is the first part of the quick start series of tutorials about the Hopsworks Feature Store. In this first module, you will work with data related to credit card transactions.
The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for streaming data, with the goal of training and saving a model that can predict fraudulent transactions. You will then try the model on batch data retrieved from the Feature Store.


## 🗒️ This notebook is divided into 3 sections:
1. Load the data and engineer features,
2. Connect to the Hopsworks Feature Store,
3. Create feature groups and upload them to the Feature Store.

![tutorial-flow](../../images/01_featuregroups.png)

First of all you will load the data and do some feature engineering on it.

## <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
import datetime
from pyspark.sql import SparkSession

from pyspark.sql.types import * 
from pyspark.sql.functions import * 

from pyspark.sql.functions import pandas_udf

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
8,application_1710149630761_0019,pyspark,idle,Link,Link


SparkSession available as 'spark'.


## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In this tutorial, a simulated data stream was created using the Kafka cluster internal to Hopsworks.

Hopsworks lets you access its internal Kafka through the storage connector API. See the [documentation](https://docs.hopsworks.ai/feature-store-api/3.7/generated/api/storage_connector_api/#kafka) for more information.

In [2]:
import hopsworks
from hsfs.core.storage_connector_api import StorageConnectorApi

project = hopsworks.login()

fs = project.get_feature_store()

sc_api = StorageConnectorApi()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://aff99120-da3e-11ee-8cd7-4f3734b3ce24.cloud.hopsworks.ai/p/3192
Connected. Call `.close()` to terminate connection gracefully.

Next, get the Kafka configuration needed to read from the Hopsworks internal Kafka using the storage connector API.

In [3]:
sc_api = StorageConnectorApi()
kafka_connector = sc_api.get_kafka_connector(feature_store_id=fs.id, external=False)
kafka_config = kafka_connector.spark_options()

## <span style="color:#ff5f27;"> 🗂 Reading from Kafka Stream </span>

After obtaining the Kafka configuration, you can use it along with the topic name to create a streaming dataframe.

In [5]:
KAFKA_TOPIC_NAME = "transactions_topic"

df_read = spark \
    .readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 100) \
    .option("subscribe", KAFKA_TOPIC_NAME) \
    .load()

To extract the required data from the streaming dataframe, the correct schema has to be defined and applied.

In [6]:
parse_schema = StructType([StructField("tid", StringType(), True),
                           StructField("datetime", TimestampType(), True),
                           StructField("cc_num", LongType(), True),
                           StructField("category", StringType(), True),
                           StructField("amount", DoubleType(), True),
                           StructField("latitude", DoubleType(), True),
                           StructField("longitude", DoubleType(), True),
                           StructField("city", StringType(), True),
                           StructField("country", StringType(), True),
                           StructField("fraud_label", StringType(), True),
                           ])

Extract the data from the streaming dataframe and cast each column to the required type.

In [7]:
# Deserialize the Kafka message values and build the streaming dataframe
streaming_df = df_read.selectExpr("CAST(value AS STRING)") \
    .select(from_json("value", parse_schema).alias("value")) \
    .select("value.tid",
            "value.datetime",
            "value.cc_num",
            "value.category",
            "value.amount",
            "value.latitude",
            "value.longitude",
            "value.city",
            "value.country",
            "value.fraud_label") \
    .selectExpr("CAST(tid as string)",
                "CAST(datetime as timestamp)",
                "CAST(cc_num as long)",
                "CAST(category as string)",
                "CAST(amount as double)",
                "CAST(latitude as double)",
                "CAST(longitude as double)",
                "CAST(city as string)",
                "CAST(country as string)",
                "CAST(fraud_label as string)"
                )
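
To make the parsing step more concrete, the plain-Python sketch below mimics what deserializing a single Kafka message value against a schema amounts to. The sample payload, the `FIELD_CASTS` mapping and the `parse_message` helper are hypothetical and cover only a few of the schema fields; Spark's `from_json` handles many more cases.

```python
import json
from datetime import datetime

# Subset of the schema above: field name -> function that casts the raw JSON value.
# (Illustrative only; Spark performs these conversions internally.)
FIELD_CASTS = {
    "tid": str,
    "datetime": datetime.fromisoformat,
    "cc_num": int,
    "amount": float,
}

def parse_message(value: bytes) -> dict:
    """Decode one Kafka message value and keep only the schema fields."""
    record = json.loads(value.decode("utf-8"))
    parsed = {}
    for name, cast in FIELD_CASTS.items():
        raw = record.get(name)
        # Fields missing from the payload become None, like nullable columns.
        parsed[name] = cast(raw) if raw is not None else None
    return parsed

# Hypothetical payload, made up for this example.
sample = b'{"tid": "abc123", "datetime": "2024-03-11 10:15:00", "cc_num": 4444, "extra": 1}'
row = parse_message(sample)
print(row)
```

Fields that are not part of the schema (`extra` here) are dropped, and schema fields absent from the payload (`amount` here) come back as `None`.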

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

Now that you have a streaming dataframe that contains the transaction data, you can use it to engineer features. To do this effectively, you also need the customer profiles.

So next, read the data from the `profile` feature group.

In [8]:
profile_fg = fs.get_or_create_feature_group(
    name="profile",
    version=1)

profile_df = profile_fg.read()

Fraudulent transactions can differ from regular ones in many different ways. Typical red flags would for instance be a large transaction volume/frequency in the span of a few hours. It could also be the case that elderly people in particular are targeted by fraudsters. To facilitate model learning you will create additional features based on these patterns. In particular, you will create two types of features:
1. **Features that aggregate data from different data sources**. This could for instance be the age of a customer at the time of a transaction, which combines the `birthdate` feature from `profiles` with the `datetime` feature from `transactions`.
2. **Features that aggregate data from multiple time steps**. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.

In [9]:
transaction_streaming_df = (
    streaming_df.join(profile_df.drop("city"), on="cc_num", how="left")
        # Parse the card expiration date; this assumes month/year text such as "07/25".
        # In Spark datetime patterns "MM" is the month, whereas "mm" means minutes.
        .withColumn("cc_expiration_date", to_timestamp("cc_expiration_date", "MM/yy"))
        # datediff(end, start) returns the number of days from start to end.
        .withColumn("age_at_transaction", datediff(col("datetime"), col("birthdate")))
        .withColumn("days_until_card_expires", datediff(col("cc_expiration_date"), col("datetime")))
        .select(["tid", "datetime", "cc_num", "category", "amount", "latitude", "longitude", "city", "country", "fraud_label", "age_at_transaction", "days_until_card_expires", "cc_expiration_date"])
)
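
As a rough illustration of the two date-difference features, here is a plain-Python sketch. Spark's `datediff(end, start)` returns the number of days from `start` to `end`, so the argument order determines the sign. The dates below are made up, the local `datediff` helper is a stand-in for the Spark function, and the expiry is assumed to be stored as month/year text.

```python
from datetime import date, datetime

def datediff(end: date, start: date) -> int:
    """Days from start to end, mirroring the argument order of Spark's datediff."""
    return (end - start).days

# Hypothetical values for a single transaction and card holder.
transaction_date = date(2024, 3, 11)
birthdate = date(1990, 3, 11)
# A card expiry such as "07/25" parsed as month/year; the day defaults to the 1st.
expiry = datetime.strptime("07/25", "%m/%y").date()

age_in_days = datediff(transaction_date, birthdate)
days_until_expiry = datediff(expiry, transaction_date)

print(age_in_days, days_until_expiry)
```

Note that the age comes out in days rather than years, and that swapping the two arguments flips the sign of the result.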

Next, you will create features that aggregate credit card data over a period of time.

For simplicity, you will compute the average, the standard deviation and the count of transaction amounts per credit card over 4-hour windows that slide forward every hour.

In [10]:
windowed_4h_aggregation_df = (
    streaming_df.withWatermark("datetime", "168 hours")
        .groupBy(window("datetime", "4 hours", "1 hour"), "cc_num")
        .agg(
            avg("amount").alias("avg_amt_per_4h"),
            stddev("amount").alias("stdev_amt_per_4h"),
            count("cc_num").alias("num_trans_per_4h"),
        )
        .na.fill({"stdev_amt_per_4h":0})
        .selectExpr(
            "cc_num",
            "current_timestamp() as event_time",
            "num_trans_per_4h",
            "avg_amt_per_4h",
            "stdev_amt_per_4h",
        )
)
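
The sliding-window aggregation can be hard to picture, so here is a small plain-Python sketch of the same idea: a 4-hour window that slides every hour, with each transaction counted in every window that contains it. The transactions and the helper names (`window_starts`, `aggregate`) are hypothetical, and the sketch leaves out watermarking and late data entirely.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

WINDOW = timedelta(hours=4)
SLIDE = timedelta(hours=1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_starts(ts):
    """Return the start of every sliding window that contains ts."""
    # Align the most recent window start to a multiple of SLIDE since the epoch.
    last_start = EPOCH + ((ts - EPOCH) // SLIDE) * SLIDE
    return [last_start - i * SLIDE for i in range(WINDOW // SLIDE)]

def aggregate(transactions):
    """Group (cc_num, timestamp, amount) tuples by card and window start."""
    groups = defaultdict(list)
    for cc_num, ts, amount in transactions:
        for start in window_starts(ts):
            groups[(cc_num, start)].append(amount)
    return {
        key: {
            "num_trans_per_4h": len(amounts),
            "avg_amt_per_4h": mean(amounts),
            # The sample standard deviation is undefined for one value; use 0 instead.
            "stdev_amt_per_4h": stdev(amounts) if len(amounts) > 1 else 0.0,
        }
        for key, amounts in groups.items()
    }

def at(hour, minute):
    return datetime(2024, 3, 11, hour, minute, tzinfo=timezone.utc)

# Hypothetical transactions for a single card.
transactions = [(1, at(10, 15), 10.0), (1, at(11, 30), 30.0), (1, at(15, 5), 50.0)]
result = aggregate(transactions)
print(result[(1, at(10, 0))])
```

With a 4-hour window and a 1-hour slide, every transaction falls into four overlapping windows, which is why a single card can produce several aggregate rows for the same transaction.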

## <span style="color:#ff5f27;">👮🏻‍♂️ Great Expectations </span>

In [11]:
import great_expectations as ge
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

# Set the expectation suite name to "transactions_suite"
expectation_suite_transactions = ge.core.ExpectationSuite(
    expectation_suite_name="transactions_suite"
)

In [12]:
# Check binary fraud_label column to be in set [0,1]
expectation_suite_transactions.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_distinct_values_to_be_in_set",
        kwargs={
            "column": "fraud_label",
            "value_set": [0, 1],
        }
    )
)

# Check amount column to be not negative
expectation_suite_transactions.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "amount",
            "min_value": 0.0,
        }
    )
)

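# Add a null-value expectation for each key column ('tid', 'datetime', 'cc_num').
# Note: with "mostly": 0.0, expect_column_values_to_be_null passes regardless of the data;
# expect_column_values_to_not_be_null would enforce that the values are present.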
for column in ['tid', 'datetime', 'cc_num']:
    expectation_suite_transactions.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_null",
            kwargs={
                "column": column,
                "mostly": 0.0,
            }
        )
    )

{"kwargs": {"column": "tid", "mostly": 0.0}, "expectation_type": "expect_column_values_to_be_null", "meta": {}}
{"kwargs": {"column": "datetime", "mostly": 0.0}, "expectation_type": "expect_column_values_to_be_null", "meta": {}}
{"kwargs": {"column": "cc_num", "mostly": 0.0}, "expectation_type": "expect_column_values_to_be_null", "meta": {}}
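
The role of the `mostly` argument can be sketched in a few lines of plain Python. This is a simplified model written for this tutorial, not Great Expectations code: `expectation_passes` is a hypothetical helper that treats `mostly` as the minimum fraction of values that must satisfy the check.

```python
def expectation_passes(values, predicate, mostly=1.0):
    """Return True if at least `mostly` of the values satisfy the predicate."""
    if not values:
        return True
    fraction_ok = sum(1 for v in values if predicate(v)) / len(values)
    return fraction_ok >= mostly

def is_null(value):
    return value is None

def is_not_null(value):
    return value is not None

# Hypothetical column with one missing transaction id.
tids = ["a1", "b2", None, "c3"]

# A "be null" check with mostly=0.0 accepts any column, nulls or not.
null_check = expectation_passes(tids, is_null, mostly=0.0)
# A "not be null" check with mostly=1.0 fails as soon as one value is missing.
not_null_check = expectation_passes(tids, is_not_null, mostly=1.0)

print(null_check, not_null_check)
```

Under this model, a threshold of `0.0` makes a check vacuous, while `1.0` requires every value to pass.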

## <span style="color:#ff5f27;"> 💾 Storing streaming dataframes in Hopsworks Feature Store </span>

### <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/3.0/concepts/fs/feature_group/fg_overview/) can be seen as a collection of conceptually related features. In this case, you will create a feature group for the transaction data and a feature group for the windowed aggregations on the transaction data. Both will have `cc_num` as primary key, which will allow you to join them when creating a dataset in the next tutorial.

Feature groups can also be used to define a namespace for features. For instance, in a real-life setting you would likely want to experiment with different window lengths. In that case, you can create feature groups with identical schema for each window length. 

Creating a feature group requires a connection to the Hopsworks Feature Store, which you already established above as `fs`.

In [13]:
trans_fg = fs.get_or_create_feature_group(
    name="transactions_fraud_streaming_fg",
    version=1,
    description="Transaction data",
    primary_key=["cc_num"],
    event_time="datetime",
    online_enabled=True,
    stream=True,
    expectation_suite=expectation_suite_transactions,
)

A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, you have only specified some metadata for the feature group. It does not store any data yet, nor does it have a schema defined for the data. To insert a streaming dataframe into a feature group, use the `insert_stream` function. You can find more details in the [documentation](https://docs.hopsworks.ai/feature-store-api/3.7/generated/api/feature_group_api/#insert_stream).

In [14]:
# Insert data into feature group
trans_fg_query = trans_fg.insert_stream(transaction_streaming_df)

Feature Group created successfully, explore it at 
https://aff99120-da3e-11ee-8cd7-4f3734b3ce24.cloud.hopsworks.ai/p/3192/fs/3140/fg/3118


The `insert_stream` function returns a `StreamingQuery` object, which can be used to check the status of the streaming query.


In [16]:
trans_fg_query.status

{'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

The `insert_stream` function writes the data to the online feature store. To materialize the data in the offline store, you need to run the materialization job, either manually or on a schedule using the `schedule` function. You can find more details in the [documentation](https://docs.hopsworks.ai/hopsworks-api/3.7/generated/api/jobs/#schedule).

In [15]:
trans_fg.materialization_job.schedule(cron_expression="0 10 * ? * * *", start_time=datetime.datetime.now(tz=datetime.timezone.utc))

Launching job: transactions_fraud_streaming_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://aff99120-da3e-11ee-8cd7-4f3734b3ce24.cloud.hopsworks.ai/p/3192/jobs/named/transactions_fraud_streaming_fg_1_offline_fg_materialization/executions
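
The cron expression passed to `schedule` is easier to read once its fields are labelled. The sketch below assumes Quartz-style field order (seconds first, optional year last), which the `?` placeholder suggests; `QUARTZ_FIELDS` and `describe_cron` are hypothetical helpers, not part of the Hopsworks API.

```python
# Assumed Quartz-style field order; the year field is optional.
QUARTZ_FIELDS = [
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
]

def describe_cron(expression: str) -> dict:
    """Label each whitespace-separated field of a 6- or 7-field cron expression."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))

fields = describe_cron("0 10 * ? * * *")
for name, value in fields.items():
    print(f"{name:>12}: {value}")
```

Read this way, the expression used in this notebook would trigger the job at second 0 of minute 10 of every hour.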

In [17]:
# Update feature descriptions
feature_descriptions = [
    {"name": "tid", "description": "Transaction id"},
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "category", "description": "Expense category"},
    {"name": "amount", "description": "Dollar amount of the transaction"},
    {"name": "latitude", "description": "Transaction location latitude"},
    {"name": "longitude", "description": "Transaction location longitude"},
    {"name": "city", "description": "City in which the transaction was made"},
    {"name": "country", "description": "Country in which the transaction was made"},
    {"name": "fraud_label", "description": "Whether the transaction was fraudulent or not"},
    {"name": "age_at_transaction", "description": "Age of the card holder when the transaction was made"},
    {"name": "days_until_card_expires", "description": "Card validity days left when the transaction was made"},
]

for desc in feature_descriptions: 
    trans_fg.update_feature_description(desc["name"], desc["description"])

<hsfs.feature_group.FeatureGroup object at 0x7f1b871bec80>

When the feature group is created, a URL that links directly to it is printed; there you can explore some aspects of your newly created feature group.

You can now do the same for the feature group that holds the window aggregations.

In [18]:
# Get or create the feature group for the windowed transaction aggregations
window_aggs_streaming_fg = fs.get_or_create_feature_group(
    name="transactions_aggs_fraud_streaming_fg",
    version=1,
    description="Aggregate transaction data over 4 hour windows.",
    primary_key=["cc_num"],
    event_time="event_time",
    online_enabled=True,
    stream=True,
)

In [19]:
window_aggs_streaming_fg_query = window_aggs_streaming_fg.insert_stream(windowed_4h_aggregation_df)

Feature Group created successfully, explore it at 
https://aff99120-da3e-11ee-8cd7-4f3734b3ce24.cloud.hopsworks.ai/p/3192/fs/3140/fg/3119

In [20]:
window_aggs_streaming_fg_query.status

{'message': 'Getting offsets from KafkaV2[Subscribe[transactions_topic]]', 'isDataAvailable': False, 'isTriggerActive': True}

In [21]:
window_aggs_streaming_fg.materialization_job.schedule(cron_expression="0 10 * ? * * *", start_time=datetime.datetime.now(tz=datetime.timezone.utc))

An error was encountered:
The Hopsworks Job failed, use the Hopsworks UI to access the job logs
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/theenv/lib/python3.10/site-packages/hsfs/core/job.py", line 124, in run
    self._wait_for_job(await_termination=await_termination)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.10/site-packages/hsfs/core/job.py", line 242, in _wait_for_job
    raise FeatureStoreException(
hsfs.client.exceptions.FeatureStoreException: The Hopsworks Job failed, use the Hopsworks UI to access the job logs



## <span style="color:#ff5f27;">⏭️ **Next:** Part 02: Training Pipeline </span>

In the following notebook you will use your feature groups to create a dataset you can train a model on.