# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">

<span style="font-weight:bold; font-size: 1.4rem;">This notebook creates a data stream using Hopsworks' internal Kafka.</span>

## 🗒️ This notebook is divided into the following sections:

1. Creating Simulated Data
2. Connecting to Hopsworks Feature Store
3. Creating Feature Groups
4. Creating Kafka Topic and Schema in Hopsworks Feature Store
5. Sending Data to Kafka

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install faker --quiet

In [None]:
import pandas as pd
from synthetic_data import synthetic_data
from confluent_kafka import Producer

## <span style="color:#ff5f27;"> ✏️ Creating Simulated Data </span>

A simulated dataset of credit card transactions is created so that the data can be sent through a Kafka stream. The data is split into two dataframes:

* profiles_df: credit card user information such as birthdate and city of residence, along with credit card information such as the expiration date and provider.
* trans_df: events containing information about when a credit card was used, such as a timestamp, location, and the amount spent. A boolean fraud_label variable (True/False) tells us whether a transaction was fraudulent or not.

In a production system, this data would originate from separate data sources or tables, and probably from separate data pipelines. Both dataframes share a credit card number column, cc_num, which you will use later to join features from the two datasets.

Now you can go ahead and create the data.

In [None]:
data_simulator = synthetic_data.synthetic_data()

profiles_df, trans_df = data_simulator.create_simulated_transactions()

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

After creating the simulated data, connect to the Hopsworks Feature Store.

Hopsworks provides an internal Kafka cluster that can be accessed through the KafkaAPI. See the [documentation](https://docs.hopsworks.ai/3.7/user_guides/projects/kafka/create_schema/#introduction) for more details.

In [None]:
import hopsworks

project = hopsworks.login()

kafka_api = project.get_kafka_api()

fs = project.get_feature_store()

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

Profile data can be inserted directly into a feature group, since it is not updated frequently.

To create a feature group, you need to give it a name and specify a primary key. It is also good practice to provide a description of its contents and a version number. If no version is given, it defaults to `1`.

A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

In [None]:
profile_fg = fs.get_or_create_feature_group(
    name="profile",
    description="Credit card holder profiles",
    primary_key=["cc_num"],
    partition_key=["cc_provider"],
    version=1,
)

# Write the profile rows into the feature group.
profile_fg.insert(profiles_df)

## <span style="color:#ff5f27;"> ⚙️ Kafka Topic and Schema Creation </span>

To create a Kafka stream for transactions, a topic and a schema must be created. The schema must follow the Apache Avro specification. More details can be found in the [documentation](https://avro.apache.org/docs/1.11.1/specification/).


In [None]:
# Names for the Kafka topic and its Avro schema
KAFKA_TOPIC_NAME = "transactions_topic"
SCHEMA_NAME = "transactions_schema"

In [None]:
schema = {
    "type": "record",
    "name": SCHEMA_NAME,
    "namespace": "io.hops.examples.pyspark.example",
    "fields": [
        {
            "name": "tid",
            "type": [
                "null",
                "string"
            ]
        },
        {
            "name": "datetime",
            "type": [
                "null",
                {
                    "type": "long",
                    "logicalType": "timestamp-micros"
                }
            ]
        },
        {
            "name": "cc_num",
            "type": [
                "null",
                "long"
            ]
        },
        {
            "name": "category",
            "type": [
                "null",
                "string"
            ]
        },
        {
            "name": "amount",
            "type": [
                "null",
                "double"
            ]
        },
        {
            "name": "latitude",
            "type": [
                "null",
                "double"
            ]
        },
        {
            "name": "longitude",
            "type": [
                "null",
                "double"
            ]
        },
        {
            "name": "city",
            "type": [
                "null",
                "string"
            ]
        },
        {
            "name": "country",
            "type": [
                "null",
                "string"
            ]
        },
        {
            "name": "fraud_label",
            "type": [
                "null",
                "string"
            ]
        },
    ]
}
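
Every field in the schema above is declared as a union of `"null"` and one concrete type, which makes each field optional. As a rough illustration of what that implies for a record, the sketch below checks a plain Python dict against such a schema using only the standard library. `AVRO_TO_PYTHON`, `conforms`, and `example_schema` are hypothetical names introduced here for illustration; they are not part of Hopsworks or of any Avro library, and they cover only the primitive types used in this notebook.

```python
# Hypothetical mapping (assumption): Avro primitive types used in this
# notebook and the Python types a decoded JSON value would have.
AVRO_TO_PYTHON = {
    "string": (str,),
    "long": (int,),
    "double": (float, int),
}


def conforms(record, schema):
    """Return True if every schema field is present in the record and its
    value is None (the "null" branch) or matches a non-null branch."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False
        value = record[field["name"]]
        branches = field["type"]
        if value is None:
            if "null" not in branches:
                return False
            continue
        # A non-null branch is either a type name or a dict with a logicalType.
        allowed = ()
        for branch in branches:
            if branch == "null":
                continue
            type_name = branch["type"] if isinstance(branch, dict) else branch
            allowed += AVRO_TO_PYTHON[type_name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            return False
    return True


# Reduced schema with the same shape as the transactions schema.
example_schema = {
    "type": "record",
    "name": "example",
    "fields": [
        {"name": "tid", "type": ["null", "string"]},
        {"name": "amount", "type": ["null", "double"]},
        {"name": "datetime", "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}]},
    ],
}

ok = conforms({"tid": "a1", "amount": 12.5, "datetime": None}, example_schema)
bad = conforms({"tid": "a1", "amount": "12.5", "datetime": None}, example_schema)
print(ok, bad)
```

The second record is rejected because `amount` is a string rather than a number, which is the kind of mismatch the type casts later in this notebook are meant to prevent.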

After the schema is defined, the schema and the topic that uses it must be registered in Hopsworks before the topic can be used.

In [None]:
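# Register the schema and topic only if the topic does not exist yet.
if KAFKA_TOPIC_NAME not in [topic.name for topic in kafka_api.get_topics()]:
    kafka_api.create_schema(SCHEMA_NAME, schema)
    # The third argument is the schema version.
    kafka_api.create_topic(KAFKA_TOPIC_NAME, SCHEMA_NAME, 1, replicas=1, partitions=1)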

## <span style="color:#ff5f27;"> 📡 Sending Data using created Kafka Topic </span>

When sending data through Kafka, make sure that the data types match those specified in the schema.

Let's make sure that the dataframe has all the components in the correct format.

In [None]:
trans_df["tid"] = trans_df["tid"].astype("string")
trans_df["datetime"] = pd.to_datetime(trans_df["datetime"])
trans_df["cc_num"] = trans_df["cc_num"].astype("int64")
trans_df["category"] = trans_df["category"].astype("string")
trans_df["amount"] = trans_df["amount"].astype("double")
trans_df["latitude"] = trans_df["latitude"].astype("double")
trans_df["longitude"] = trans_df["longitude"].astype("double")
trans_df["city"] = trans_df["city"].astype("string")
trans_df["country"] = trans_df["country"].astype("string")
trans_df["fraud_label"] = trans_df["fraud_label"].astype("string")

Let's get the configuration that the producer needs to use Hopsworks' internal Kafka, using the KafkaAPI.

In [None]:
kafka_config = kafka_api.get_default_config()

Finally, let's create a producer using the Kafka configuration and send the data through it.

Note that the data passed to the producer must be encoded as JSON or Avro.
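
As a rough sketch of what JSON encoding means here, the snippet below turns one record into a UTF-8 JSON payload and decodes it again using only the standard library. The record values are made up for illustration and cover only a subset of the schema's fields. In the notebook itself, `transaction.to_json()` produces the JSON text for each dataframe row.

```python
import json

# Made-up transaction with a subset of the schema's field names
# (values are illustrative only).
record = {
    "tid": "tx-0001",
    "cc_num": 4444037300542691,
    "category": "Grocery",
    "amount": 12.5,
    "fraud_label": "0",
}

# Kafka transports message values as bytes, so the JSON text is UTF-8 encoded.
payload = json.dumps(record).encode("utf-8")

# A consumer reverses the two steps to recover the original record.
decoded = json.loads(payload.decode("utf-8"))
print(decoded == record)
```

Note that pandas `to_json` writes datetime values as epoch milliseconds by default. Its `date_unit` parameter can change this if the consumer expects a different unit.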

In [None]:
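producer = Producer(kafka_config)

for _, transaction in trans_df.iterrows():
    # Serialize each row to a JSON string and queue it for delivery.
    producer.produce(KAFKA_TOPIC_NAME, transaction.to_json())
    # Serve delivery callbacks and free space in the local queue.
    producer.poll(0)

# Block until all queued messages have been delivered.
producer.flush()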
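
To confirm that messages actually reach the topic, `confluent_kafka` lets you pass a delivery callback to `produce`. The callback receives an error (or `None`) and the message. Below is a minimal sketch of such a callback. `delivery_report` and `delivery_log` are hypothetical names chosen for this example, and the commented-out usage assumes the `producer`, `KAFKA_TOPIC_NAME`, and `transaction` objects from this notebook.

```python
# Hypothetical in-memory log (assumption); a real pipeline might use logging.
delivery_log = []


def delivery_report(err, msg):
    """Record the outcome of one message. The producer invokes this from
    poll() or flush() once delivery has succeeded or failed."""
    if err is not None:
        delivery_log.append(f"failed: {err}")
    else:
        delivery_log.append(
            f"delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}"
        )


# Usage with the notebook's producer (requires a reachable Kafka cluster):
# producer.produce(KAFKA_TOPIC_NAME, transaction.to_json(), callback=delivery_report)
# producer.flush()
```

Collecting outcomes in a list keeps the sketch easy to inspect in a notebook cell, for example by printing `delivery_log[-5:]` after the final flush.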

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 01: Feature Pipeline</span>

In the following notebook, you will use the created Kafka stream to insert data into a feature group.