# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Pipeline</span>

**Note**: This tutorial does not support Google Colab.

This is the first part of the quick start series of tutorials about the Hopsworks Feature Store. In this first module, you will work with data related to credit card transactions.
The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for batch data, with the goal of training and saving a model that can predict fraudulent transactions, and then trying that model on batch data retrieved from the Feature Store.


## 🗒️ This notebook is divided into 3 sections:
1. Load the data and do feature engineering,
2. Connect to the Hopsworks Feature Store,
3. Create feature groups and upload them to the Feature Store.

![tutorial-flow](../images/01_featuregroups.png)

First of all, you will load the data and do some feature engineering on it.

In [60]:
# Hosted notebook environments may not have the local features package
import os
from IPython import get_ipython


def need_download_modules():
    if 'google.colab' in str(get_ipython()):
        return True
    if 'HOPSWORKS_PROJECT_ID' in os.environ:
        return True
    return False

if need_download_modules():
    print("⚙️ Downloading modules...")
    os.system('mkdir -p synthetic_data')
    os.system('cd synthetic_data && wget https://raw.githubusercontent.com/manu-sj/hopsworks-tutorials/FSTORE-1107/advanced_tutorials/pyspark_streaming/synthetic_data/synthetic_data.py')
    os.system('cd synthetic_data && wget https://raw.githubusercontent.com/manu-sj/hopsworks-tutorials/FSTORE-1107/advanced_tutorials/pyspark_streaming/synthetic_data/create_transaction_stream.py')
    os.system('cd synthetic_data && wget https://raw.githubusercontent.com/manu-sj/hopsworks-tutorials/FSTORE-1107/advanced_tutorials/pyspark_streaming/synthetic_data/init_kafka.py')
    os.system('cd synthetic_data && wget https://raw.githubusercontent.com/manu-sj/hopsworks-tutorials/FSTORE-1107/advanced_tutorials/pyspark_streaming/synthetic_data/__init__.py')
    print('✅ Done!')
else:
    print("Local environment")

⚙️ Downloading modules...
0
0
0
0
0
✅ Done!
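
The four near-identical `wget` calls above differ only in the file name, so they could be driven from a list. The sketch below only builds the target paths and URLs and does no network access. The base URL and file names are copied from the cell above; `module_urls` is a hypothetical helper name, not part of any library.

```python
from pathlib import PurePosixPath

# Base URL and file names taken from the download cell above.
BASE_URL = (
    "https://raw.githubusercontent.com/manu-sj/hopsworks-tutorials/"
    "FSTORE-1107/advanced_tutorials/pyspark_streaming/synthetic_data"
)
MODULE_FILES = [
    "synthetic_data.py",
    "create_transaction_stream.py",
    "init_kafka.py",
    "__init__.py",
]


def module_urls(base_url=BASE_URL, files=MODULE_FILES, target_dir="synthetic_data"):
    """Map each local target path to the URL it would be downloaded from."""
    return {
        str(PurePosixPath(target_dir) / name): f"{base_url}/{name}"
        for name in files
    }


for path, url in module_urls().items():
    # To actually download, one option is urllib.request.urlretrieve(url, path).
    print(path, "<-", url)
```

Keeping the file list in one place means adding or removing a module is a one-line change.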

## <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
from pyspark.sql import SparkSession

from pyspark.sql.functions import (
    from_json,
    window,
    avg,
    count,
    stddev,
    explode,
    date_format,
    col,
    mean,
    pandas_udf,
    PandasUDFType)

from pyspark.sql.types import (
    LongType,
    DoubleType,
    StringType,
    TimestampType,
    StructType,
    StructField,
)

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
11,application_1709431794369_0071,pyspark,idle,Link,Link


SparkSession available as 'spark'.


## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [30]:
import hopsworks
from hsfs.core.storage_connector_api import StorageConnectorApi

project = hopsworks.login()
fs = project.get_feature_store()

Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://staging.cloud.hopsworks.ai/p/124
Connected. Call `.close()` to terminate connection gracefully.

In [31]:
sc_api = StorageConnectorApi()
kafka_connector = sc_api.get_kafka_connector(feature_store_id=fs.id, external=False)
kafka_config = kafka_connector.spark_options()

## <span style="color:#ff5f27;"> 🗂 Reading from Kafka Stream </span>

In [32]:
KAFKA_TOPIC_NAME = "transactions_topic"

In [45]:
# get data from the source
df_read = spark \
    .readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 100) \
    .option("subscribe", KAFKA_TOPIC_NAME) \
    .load()

In [46]:
parse_schema = StructType([StructField("tid", StringType(), True),
                           StructField("datetime", TimestampType(), True),
                           StructField("cc_num", LongType(), True),
                           StructField("category", StringType(), True),
                           StructField("amount", DoubleType(), True),
                           StructField("latitude", DoubleType(), True),
                           StructField("longitude", DoubleType(), True),
                           StructField("city", StringType(), True),
                           StructField("country", StringType(), True),
                           StructField("fraud_label", StringType(), True),
                           ])

In [47]:
# Deserialize the Kafka message value from JSON and build the streaming DataFrame
transaction_streaming_df = df_read.selectExpr("CAST(value AS STRING)") \
    .select(from_json("value", parse_schema).alias("value")) \
    .select("value.tid",
            "value.datetime",
            "value.cc_num",
            "value.category",
            "value.amount",
            "value.latitude",
            "value.longitude",
            "value.city",
            "value.country",
            "value.fraud_label") \
    .selectExpr("CAST(tid as string)",
                "CAST(datetime as timestamp)",
                "CAST(cc_num as long)",
                "CAST(category as string)",
                "CAST(amount as double)",
                "CAST(latitude as double)",
                "CAST(longitude as double)",
                "CAST(city as string)",
                "CAST(country as string)",
                "CAST(fraud_label as string)"
                )
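
To make the deserialization step more concrete, here is a minimal plain-Python sketch of what happens to a single Kafka message value: decode the bytes, parse the JSON, and coerce each field to the type declared in `parse_schema`. The sample payload and its values are invented for illustration, and the assumption that the producer writes ISO-style datetime strings is not confirmed by this notebook. `parse_transaction` is a hypothetical helper. It returns `None` for an unparseable message, which roughly mirrors the documented behaviour of `from_json` returning null for an unparseable string.

```python
import json
from datetime import datetime

# Field names mirror parse_schema; the converters are plain-Python stand-ins
# for the Spark types (StringType, TimestampType, LongType, DoubleType).
FIELD_CONVERTERS = {
    "tid": str,
    "datetime": datetime.fromisoformat,
    "cc_num": int,
    "category": str,
    "amount": float,
    "latitude": float,
    "longitude": float,
    "city": str,
    "country": str,
    "fraud_label": str,
}


def parse_transaction(raw_value: bytes):
    """Decode one message value; return None if it is not a JSON object."""
    try:
        payload = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    record = {}
    for name, convert in FIELD_CONVERTERS.items():
        value = payload.get(name)
        try:
            record[name] = None if value is None else convert(value)
        except (TypeError, ValueError):
            # A field that cannot be converted becomes None
            record[name] = None
    return record


# Hypothetical message: the values are made up for illustration.
sample = json.dumps({
    "tid": "abc123",
    "datetime": "2024-03-01 12:30:00",
    "cc_num": 4444111122223333,
    "category": "Grocery",
    "amount": 12.5,
    "latitude": 59.33,
    "longitude": 18.06,
    "city": "Stockholm",
    "country": "SE",
    "fraud_label": "0",
}).encode("utf-8")

parsed = parse_transaction(sample)
print(parsed["datetime"], parsed["cc_num"], parsed["amount"])
```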

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

In [None]:
# read profile data to cogroup with streaming dataframe
profile_fg = fs.get_or_create_feature_group(
    name="profile",
    version=1)

profile_df = profile_fg.read()

In [None]:
# Requires `transactions` (feature functions) and `schema1`..`schema4` (output schemas),
# which are not defined in this notebook.
# The "month" column is assumed to be a year-month string derived from `datetime`.
windowed_transaction_df = transaction_streaming_df \
    .selectExpr("tid",
                "datetime",
                "cc_num",
                "category",
                "CAST(amount as double)",
                "radians(latitude) as latitude",
                "radians(longitude) as longitude",
                "city",
                "country") \
    .withWatermark("datetime", "24 hours") \
    .groupBy(window("datetime", "168 hours")) \
    .applyInPandas(transactions.loc_delta_t_minus_1, schema=schema1) \
    .withWatermark("datetime", "24 hours") \
    .groupBy(window("datetime", "168 hours")) \
    .applyInPandas(transactions.time_delta_t_minus_1, schema=schema2) \
    .withColumn("month", date_format(col("datetime"), "yyyy-MM")) \
    .withWatermark("datetime", "24 hours") \
    .groupBy(window("datetime", "168 hours")) \
    .cogroup(profile_df.groupby("cc_provider")) \
    .applyInPandas(transactions.card_owner_age, schema=schema3) \
    .withWatermark("datetime", "24 hours") \
    .groupBy(window("datetime", "168 hours")) \
    .cogroup(profile_df.groupby("cc_provider")) \
    .applyInPandas(transactions.expiry_days, schema=schema4)
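
The cell above relies on helpers that are not defined in this notebook, so its exact behaviour cannot be confirmed here. As a rough plain-Python sketch of two ideas it uses: first, how a 168-hour tumbling window buckets an event time (Spark aligns window starts to 1970-01-01 00:00:00 UTC when no start offset is given); second, a haversine distance between consecutive transactions of one card. The second part is only an assumption about what `loc_delta_t_minus_1` computes, suggested by the conversion of latitude and longitude to radians. The names `window_bounds`, `haversine_km`, and `loc_delta_previous` are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WINDOW = timedelta(hours=168)  # same length as window("datetime", "168 hours")


def window_bounds(event_time: datetime, length: timedelta = WINDOW):
    """Return the [start, end) tumbling window that contains event_time (UTC)."""
    elapsed = event_time - EPOCH
    start = EPOCH + (elapsed // length) * length
    return start, start + length


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points whose coordinates are in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def loc_delta_previous(transactions):
    """For (datetime, lat, lon) tuples of one card, return the distance from
    each transaction to the previous one in time order; the first gets 0.0."""
    deltas = []
    previous = None
    for _, lat, lon in sorted(transactions, key=lambda t: t[0]):
        if previous is None:
            deltas.append(0.0)
        else:
            deltas.append(haversine_km(previous[0], previous[1], lat, lon))
        previous = (lat, lon)
    return deltas


start, end = window_bounds(datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc))
print(start, end)
```

Because a 168-hour window is exactly seven days and the epoch fell on a Thursday, every window in this sketch starts at midnight UTC on a Thursday.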

In [48]:
trans_fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    description="Transaction data",
    primary_key=['cc_num'],
    #partition_key=['month'],
    stream=True,
    online_enabled=True
)
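
Because the feature group is `online_enabled` with `primary_key=['cc_num']`, the online store is expected to hold one row per card number, with later writes replacing earlier ones. That is a reading of how online feature groups generally behave rather than something this notebook demonstrates. The sketch below shows that upsert behaviour in plain Python, using a hypothetical helper `upsert_latest` and made-up rows.

```python
def upsert_latest(online_rows, batch, primary_key=("cc_num",)):
    """Apply a batch of row dicts in order, keeping the last row seen per key."""
    for row in batch:
        key = tuple(row[name] for name in primary_key)
        online_rows[key] = row
    return online_rows


# Made-up rows: two transactions for card 1, one for card 2.
batch = [
    {"cc_num": 1, "tid": "a", "amount": 5.0},
    {"cc_num": 2, "tid": "b", "amount": 7.0},
    {"cc_num": 1, "tid": "c", "amount": 9.0},
]
store = upsert_latest({}, batch)
print(len(store), store[(1,)]["tid"])
```

If every transaction should be retrievable online, the primary key would need to include a per-transaction column such as `tid`; whether that is desirable depends on how the features are served.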

In [49]:
q = trans_fg.insert_stream(transaction_streaming_df)

Feature Group created successfully, explore it at 
https://staging.cloud.hopsworks.ai/p/124/fs/72/fg/154

In [58]:
q.stop()

In [56]:
trans_fg.materialization_job.run()

Launching job: transactions_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://staging.cloud.hopsworks.ai/p/124/jobs/named/transactions_1_offline_fg_materialization/executions