### Create feature groups
In this notebook we read raw datasets, perform feature engineering, and write the results to the feature store as feature groups.

![Feature Stores](./images/online_offline_fs.png)

In [None]:
spark

***

# Feature engineering
First we import the libraries needed for feature engineering and for writing to the Feature Store.

**The process will then be:**
1. Define the feature engineering utility functions
2. Load transactions datasets
3. Load alert transactions datasets
4. Load party datasets

In [4]:
# Import necessary libraries for feature engineering
# common libraries for hashing and date time conversions

import hashlib
from datetime import datetime

In [5]:
# pyspark functions for feature engineering 

from pyspark.sql import functions as F
from pyspark.sql.types import FloatType, StringType

In [6]:
# Hops hdfs utility library for reading and writing files on HopsFS

from hops import hdfs

### 1. Define feature engineering utility functions
Create the user-defined functions (UDFs) used in the feature engineering steps below.

In [3]:
def action_2_code(input_str):
    x = input_str.split("-")[0]
    if (x == "CASH_IN"):
        node_type = 0
    elif (x == "CASH_OUT"):
        node_type = 1
    elif (x == "DEBIT"):
        node_type = 2
    elif (x == "PAYMENT"):
        node_type = 3
    elif (x == "TRANSFER"):
        node_type = 4
    elif (x == "DEPOSIT"):
        node_type = 4        
    else:
        node_type = 99
    return node_type

def party_2_code(x):
    if (x == "Organization"):
        party_type = 0
    elif (x == "Individual"):
        party_type = 1
    else:    
        party_type = 99
    return party_type

def timestamp_2_time(x):
    dt_obj = datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S')
    return dt_obj.strftime("%b-%d") 

action_2_code_udf = F.udf(action_2_code)
party_2_code_udf = F.udf(party_2_code)
timestamp_2_time_udf = F.udf(timestamp_2_time)
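
The helpers above are plain Python functions, so their behaviour can be checked without Spark. The sketch below is an illustration only: it restates `action_2_code` as a dictionary lookup and repeats the date formatting of `timestamp_2_time`. The names `ACTION_CODES`, `action_to_code` and `timestamp_to_label`, as well as the sample inputs, are assumptions made for this example and are not part of the notebook.

```python
from datetime import datetime

# Illustrative lookup table mirroring the if/elif chain in action_2_code.
# TRANSFER and DEPOSIT share code 4, exactly as in the function above.
ACTION_CODES = {
    "CASH_IN": 0,
    "CASH_OUT": 1,
    "DEBIT": 2,
    "PAYMENT": 3,
    "TRANSFER": 4,
    "DEPOSIT": 4,
}

def action_to_code(input_str):
    # Keep only the part before the first "-"; unknown actions map to 99.
    return ACTION_CODES.get(input_str.split("-")[0], 99)

def timestamp_to_label(x):
    # Same parsing and formatting as timestamp_2_time:
    # abbreviated month name, a dash, then the zero-padded day of month.
    dt_obj = datetime.strptime(str(x), "%Y-%m-%d %H:%M:%S")
    return dt_obj.strftime("%b-%d")

# Hypothetical sample inputs
print(action_to_code("TRANSFER-1"))
print(action_to_code("SOMETHING_ELSE"))
print(timestamp_to_label("2017-01-08 00:00:00"))
```

The month abbreviation produced by `%b` depends on the active locale, so the exact label can differ between environments.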

### 2. Load transactions datasets as a Spark dataframe and perform feature engineering

In [11]:
transactions_df = spark.read\
             .option("inferSchema", "true")\
             .option("header", "true")\
             .format("csv")\
             .load("hdfs:///Projects/{}/Resources/transactions.csv".format(hdfs.project_name()))

In [None]:
transactions_df.show()

In [13]:
transactions_df = transactions_df.withColumn('tx_type', action_2_code_udf(F.col('tx_type')))\
                                 .withColumn('tran_timestamp', timestamp_2_time_udf(F.col('tran_timestamp')).cast(StringType()))\
                                 .withColumnRenamed("src","source")\
                                 .withColumnRenamed("dst","target")\
                                 .select("source","target","tran_timestamp","tran_id","tx_type","base_amt")
transactions_df.show()

+--------+--------+--------------+-------+-------+--------+
|  source|  target|tran_timestamp|tran_id|tx_type|base_amt|
+--------+--------+--------------+-------+-------+--------+
|3aa9646b|1e46e726|        Jan-01|    496|      4|  858.77|
|49203bc3|a74d1101|        Jan-01|   1342|      4|  386.86|
|616d4505|99af2455|        Jan-02|   1580|      4|  616.43|
|39be1ea2|e7ec7bdb|        Jan-02|   2866|      4|  146.44|
|e2e0d938|afc399a9|        Jan-03|   3997|      4|  439.09|
|75c9a805|d7a317f6|        Jan-04|   5518|      4|   361.0|
|c14f4989|733a496b|        Jan-06|   7340|      4|  768.98|
|576eb672|aa49b0eb|        Jan-07|   9376|      4|   943.4|
|847a9cf6|b070a6bb|        Jan-08|  10362|      4|   668.3|
|12a388ff|586377aa|        Jan-08|  10817|      4|  139.84|
|b36f9c84|1b467848|        Jan-08|  11317|      4|  499.47|
|362e42e0|385afb8b|        Jan-09|  11748|      4|  357.96|
|572014da|acd60eca|        Jan-10|  13285|      4|   630.9|
|5ff2d9a7|31976e38|        Jan-11|  1483

### 3. Load alert transactions datasets as a Spark dataframe and perform feature engineering

In [15]:
alert_transactions = spark.read\
             .option("inferSchema", "true")\
             .option("header", "true")\
             .format("csv")\
             .load("hdfs:///Projects/{}/Resources/alert_transactions.csv".format(hdfs.project_name()))
alert_transactions.show()

+--------+--------------+------+-------+
|alert_id|    alert_type|is_sar|tran_id|
+--------+--------------+------+-------+
|      47|gather_scatter|  true|  11873|
|      47|gather_scatter|  true|  11874|
|      47|gather_scatter|  true|  11875|
|      47|gather_scatter|  true|  13151|
|      47|gather_scatter|  true|  23148|
|      17|scatter_gather|  true|  23779|
|      17|scatter_gather|  true|  23780|
|      17|scatter_gather|  true|  26441|
|      17|scatter_gather|  true|  26442|
|      47|gather_scatter|  true|  28329|
|      47|gather_scatter|  true|  31581|
|      47|gather_scatter|  true|  34310|
|      17|scatter_gather|  true|  34433|
|      58|gather_scatter|  true|  36131|
|      17|scatter_gather|  true|  36563|
|      17|scatter_gather|  true|  41430|
|      17|scatter_gather|  true|  42363|
|      58|gather_scatter|  true|  42511|
|      58|gather_scatter|  true|  44370|
|      58|gather_scatter|  true|  46176|
+--------+--------------+------+-------+
only showing top

In [16]:
alert_transactions = alert_transactions.select("alert_id","alert_type","is_sar","tran_id").withColumn("is_sar",F.when(F.col("is_sar") == "true", 1).otherwise(0))
alert_transactions.orderBy("tran_id").show()

+--------+--------------+------+-------+
|alert_id|    alert_type|is_sar|tran_id|
+--------+--------------+------+-------+
|      47|gather_scatter|     1|  11873|
|      47|gather_scatter|     1|  11874|
|      47|gather_scatter|     1|  11875|
|      47|gather_scatter|     1|  13151|
|      47|gather_scatter|     1|  23148|
|      17|scatter_gather|     1|  23779|
|      17|scatter_gather|     1|  23780|
|      17|scatter_gather|     1|  26441|
|      17|scatter_gather|     1|  26442|
|      47|gather_scatter|     1|  28329|
|      47|gather_scatter|     1|  31581|
|      47|gather_scatter|     1|  34310|
|      17|scatter_gather|     1|  34433|
|      58|gather_scatter|     1|  36131|
|      17|scatter_gather|     1|  36563|
|      17|scatter_gather|     1|  41430|
|      17|scatter_gather|     1|  42363|
|      58|gather_scatter|     1|  42511|
|      58|gather_scatter|     1|  44370|
|      58|gather_scatter|     1|  46176|
+--------+--------------+------+-------+
only showing top

### 4. Load party datasets as a Spark dataframe and prepare them for ingestion into hsfs

In [17]:
party = spark.read\
             .option("inferSchema", "true")\
             .option("header", "true")\
             .format("csv")\
             .load("hdfs:///Projects/{}/Resources/party.csv".format(hdfs.project_name()))
party.show()

+--------+------------+
| partyId|   partyType|
+--------+------------+
|5628bd6c|Organization|
|a1fcba39|Organization|
|f56c9501|  Individual|
|9969afdd|Organization|
|b356eeae|  Individual|
|3406706a|Organization|
|26c56102|Organization|
|e386ebf7|  Individual|
|8c094b0d|  Individual|
|939235aa|  Individual|
|de6bf2a5|Organization|
|33a8ff5b|Organization|
|a32807a1|  Individual|
|2906ef08|Organization|
|c2a01b8d|  Individual|
|5a99160f|  Individual|
|8b9017b8|Organization|
|fcf3bbf3|  Individual|
|5132aa4d|Organization|
|68b90958|  Individual|
+--------+------------+
only showing top 20 rows

In [18]:
party = party.withColumn('partyType', party_2_code_udf(F.col('partyType'))).toDF("id","type")
party.show()

+--------+----+
|      id|type|
+--------+----+
|5628bd6c|   0|
|a1fcba39|   0|
|f56c9501|   1|
|9969afdd|   0|
|b356eeae|   1|
|3406706a|   0|
|26c56102|   0|
|e386ebf7|   1|
|8c094b0d|   1|
|939235aa|   1|
|de6bf2a5|   0|
|33a8ff5b|   0|
|a32807a1|   1|
|2906ef08|   0|
|c2a01b8d|   1|
|5a99160f|   1|
|8b9017b8|   0|
|fcf3bbf3|   1|
|5132aa4d|   0|
|68b90958|   1|
+--------+----+
only showing top 20 rows

***

# Register feature groups

### 1. Instantiate a connection and get the `project` feature store handler 

In [23]:
import hsfs
from hsfs.rule import Rule
# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

#### Data Validation
Before we define [feature groups](https://docs.hopsworks.ai/latest/generated/feature_group/), let's define [validation rules](https://docs.hopsworks.ai/latest/generated/feature_validation/) for features. We expect some features to comply with certain *rules* or *expectations*. For example, a transacted amount must be a positive value. If a transacted amount arrives as a negative value, we can either reject the `write` to the feature group and raise an error, or allow the write and issue a warning. In the next section we will create feature store `expectations`, attach them to feature groups, and apply them to dataframes being appended to those feature groups.
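
To make the difference between rejecting a write and warning about it concrete, here is a minimal plain-Python sketch. It is not the hsfs implementation: the function `check_is_positive`, its return values and the sample amounts are hypothetical, and it assumes that only negative values violate the rule.

```python
def check_is_positive(values, level="ERROR"):
    """Hypothetical stand-in for a positivity rule; not part of hsfs."""
    # Assumption: only negative values count as violations.
    violations = [v for v in values if v < 0]
    if not violations:
        return "SUCCESS"
    if level == "ERROR":
        # Strict behaviour: the data is rejected and nothing is written.
        raise ValueError("rejected, negative amounts found: {}".format(violations))
    # Lenient behaviour: the data would still be written, with a warning.
    print("warning, negative amounts found: {}".format(violations))
    return "WARNING"

print(check_is_positive([858.77, 386.86]))                   # all values pass
print(check_is_positive([616.43, -5.0], level="WARNING"))    # written with a warning

try:
    check_is_positive([616.43, -5.0], level="ERROR")         # rejected
except ValueError as err:
    print(err)
```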

#### Data validation rules supported in Hopsworks
Hopsworks ships with a set of data validation rules. These rules are **immutable**, uniquely identified by **name**, and available across all feature stores. They are used to create feature store expectations, which can then be attached to individual feature groups.

In [None]:
# Get all rule definitions available in Hopsworks
rules = connection.get_rules()
for rule in rules:
    print(rule.to_dict())

In [21]:
# Get a rule definition by name
is_positive = connection.get_rule("IS_POSITIVE")
print(is_positive.to_dict())

{'name': 'IS_POSITIVE', 'predicate': 'VALUE', 'acceptedType': 'Boolean', 'featureType': None, 'description': 'Assert on a feature containing non negative values.'}

### 2. Create Expectations based on Hopsworks rules
Expectations are created at the feature store level. Multiple expectations can be created per feature store.
An expectation consists of one or more rules and can refer to one or more features. An expectation is used by attaching it to a feature group, as shown below.

In [24]:
expectation_amount = fs.create_expectation("amount",
                                           features=["base_amt"], 
                                           description="validate amount correctness",
                                           rules=[Rule(name="IS_POSITIVE", level="ERROR", value=True)])
expectation_amount.save()

expectation.rules[0].to_dict(){'name': 'IS_POSITIVE', 'level': 'ERROR', 'min': None, 'max': None, 'value': True, 'pattern': None, 'acceptedType': None, 'legalValues': None}
ExpectationsApi.expectation.to_dict(){'name': 'amount', 'description': 'validate amount correctness', 'features': ['base_amt'], 'rules': [<hsfs.rule.Rule object at 0x7ffa1d46eb50>]}
ExpectationsApi.expectation.rules[0].to_dict(){'name': 'IS_POSITIVE', 'level': 'ERROR', 'min': None, 'max': None, 'value': True, 'pattern': None, 'acceptedType': None, 'legalValues': None}
ExpectationsApi.expectation.payload{"name": "amount", "description": "validate amount correctness", "features": ["base_amt"], "rules": [{"name": "IS_POSITIVE", "level": "ERROR", "min": null, "max": null, "value": true, "pattern": null, "acceptedType": null, "legalValues": null}]}

### 3. Create the feature groups

> #### a. Create transactions feature group metadata and save it into hsfs
We are going to create time travel enabled feature groups. For this, Hopsworks uses Apache Hudi. By default, Hudi tends to over-partition input, so we pass explicit shuffle parallelism options when saving.

In [25]:
# Recommended shuffle parallelism for hoodie.[insert|upsert|bulkinsert].shuffle.parallelism is at least input_data_size/500MB
extra_hudi_options = {
    "hoodie.bulkinsert.shuffle.parallelism":"1", 
    "hoodie.insert.shuffle.parallelism":"1", 
    "hoodie.upsert.shuffle.parallelism":"1",
    "hoodie.parquet.compression.ratio":"0.5"
}
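
The rule of thumb in the comment above (at least input_data_size/500MB) can be turned into a small helper. This is a rough sketch under stated assumptions: the function name `recommended_shuffle_parallelism` is hypothetical, 500MB is interpreted as 500 * 1024 * 1024 bytes, and the result is floored at 1, which matches the value used for the small datasets in this notebook.

```python
import math

def recommended_shuffle_parallelism(input_size_bytes, target_bytes=500 * 1024 ** 2):
    # At least one task, and one more for every started 500MB of input.
    return max(1, math.ceil(input_size_bytes / target_bytes))

# Hypothetical input sizes
small_input = 100 * 1024 ** 2   # about 100MB
large_input = 2 * 1024 ** 3     # about 2GB

print(recommended_shuffle_parallelism(small_input))
print(recommended_shuffle_parallelism(large_input))
```

For larger inputs the computed value could be converted to a string and used in place of `"1"` in the options dictionary.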

In [30]:
transactions_fg = fs.create_feature_group(name="transactions_fg",
                                       version=1,
                                       primary_key=["tran_id"],
                                       partition_key=["tran_timestamp"],   
                                       description="transactions features",
                                       time_travel_format="HUDI",  
                                       online_enabled=True,  
                                       validation_type="STRICT",
                                       expectations= [expectation_amount],                                          
                                       statistics_config={"enabled": True, "histograms": True, "correlations": True})

transactions_fg.save(transactions_df, extra_hudi_options)

> #### b. Create alert transactions feature group and save it into hsfs

In [33]:
alert_transactions_fg = fs.create_feature_group(name="alert_transactions_fg",
                                       version=1,
                                       primary_key=["tran_id"],
                                       partition_key=["alert_type"],         
                                       description="alert transactions",
                                       time_travel_format="HUDI",     
                                       online_enabled=True,                                                
                                       statistics_config={"enabled": True, "histograms": True, "correlations": False})
alert_transactions_fg.save(alert_transactions, extra_hudi_options)

> #### c. Create party feature group and save it into hsfs

In [32]:
party_fg = fs.create_feature_group(name="party_fg",
                                       version=1,
                                       primary_key=["id"],
                                       description="party fg",
                                       time_travel_format="HUDI",
                                       online_enabled=True,
                                       statistics_config={"enabled": True, "histograms": True, "correlations": True})
party_fg.save(party, extra_hudi_options)

### Feature groups exploration from the user interface

##### Hopsworks provides a user interface for exploring and discovering the available feature groups and their features. The screenshot below shows how to preview the list of features in `transactions_fg` and get basic information, such as feature types and which features are used as primary and partition keys.

![Incremental Feature Engineering](./images/feature_list.png)

##### We can also preview the data itself. This is similar to the `.head()` method.

![Incremental Feature Engineering](./images/data_preview.png)

##### An important step in feature group exploration is discovering the distribution of its features and whether they are correlated. Since we enabled statistics computation when creating the feature group, we can easily preview descriptive statistics.

![Incremental Feature Engineering](./images/statistics.png)

##### We can also discover additional properties. Here we can see whether any expectations are attached to this feature group. Note that we attached the expectation using the Python API, but it is also possible to attach and manage feature group expectations from the UI itself.

![Incremental Feature Engineering](./images/expecation.png)

##### The Hopsworks UI also provides access to the feature group's activity timeline metadata.
![Incremental Feature Engineering](./images/activity.png)

---
**NOTE**:

All of the above UI functionality is also available via the hsfs API. For more details, please refer to the [Hopsworks Feature Store documentation](https://docs.hopsworks.ai/latest/generated/feature_store/).

---
