# Time travel operations in Hopsworks Feature Store

In this notebook we introduce time travel operations in the Hopsworks Feature Store (HSFS). Currently, HSFS supports Apache Hudi (http://hudi.apache.org/), a storage abstraction/library for doing **incremental** data ingestion into a Hopsworks Feature Store.

## Background

### Motivation

Traditional ETL typically involves taking a snapshot of a production database and doing a full load into a data lake (typically stored on a distributed file system). Using the snapshot approach for ETL is simple, since the snapshot is immutable and can be loaded as an atomic unit into the data lake. However, the downside of this approach to data ingestion is that it is *slow*. Even if just a single record has been updated since the last data ingestion, the entire table has to be rewritten. If you are working with Big Data (TB or PB size datasets), this introduces significant *data latency* and *wasted resources* (the majority of the writes when ingesting the snapshot are redundant, as most of the records have not been updated since the last ETL step).

This motivates **incremental** data ingestion. Incremental data ingestion means that only the deltas/changelogs since the last ingestion are written. With incremental processing, you process data in *mini-batches* and run the Spark job frequently. The incremental model makes better use of resources and makes it easier to do complex processing and joins.
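To make the difference in write volume concrete, here is a minimal pure-Python sketch that derives a changelog from two snapshots. The `changelog` helper and the dictionary layout are hypothetical and only illustrate the idea; the sketch ignores deletes and is not how Hudi or HSFS compute deltas internally.

```python
def changelog(previous: dict, current: dict) -> dict:
    """Rows that are new in `current` or differ from `previous`, keyed by id."""
    return {key: row for key, row in current.items() if previous.get(key) != row}


# id -> (salary, year): a subset of columns from the sample data used in this notebook
previous_snapshot = {
    1: (110499.73, 2020),
    2: (140893.77, 2020),
    3: (119159.65, 2020),
    4: (20000.0, 2020),
}
current_snapshot = {
    1: (120499.73, 2022),  # updated
    2: (160893.77, 2020),  # updated
    3: (119159.65, 2020),  # unchanged
    4: (20000.0, 2020),    # unchanged
    5: (93956.32, 2020),   # new
}

delta = changelog(previous_snapshot, current_snapshot)
full_reload_writes = len(current_snapshot)  # a snapshot load rewrites every row
incremental_writes = len(delta)             # an incremental load writes only the changes
print(f"full reload: {full_reload_writes} rows, incremental: {incremental_writes} rows")
```

On a table this small the saving is trivial, but the ratio of changed rows to total rows is what drives the cost at TB or PB scale.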

In addition, data is rarely immutable in practice. A bank transaction might be reverted, a customer might change their home address, and a customer review might be updated, to give a few examples. This is where Hudi comes into the picture. Hudi stands for `Hadoop Upserts anD Incrementals` and brings two new primitives for data engineering on distributed file systems (in addition to append/read):

- `Upsert`: the ability to do insertions (appends) and updates efficiently. 
- `Incremental reads`: the ability to read datasets incrementally using the notion of "commits".
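
The two primitives can be sketched with a small in-memory commit log. This is a toy model written for this notebook, not Hudi's implementation: the `ToyCommitLog` class, its method names, and the integer commit ids are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class ToyCommitLog:
    """In-memory stand-in for a table with commit-based upserts (not Hudi itself)."""

    def __init__(self) -> None:
        # Each entry is (commit_id, {primary_key: record}) holding only changed rows.
        self.commits: List[Tuple[int, Dict[int, dict]]] = []

    def upsert(self, records: List[dict], key: str = "id") -> int:
        """Record one commit containing the rows that were inserted or updated."""
        commit_id = len(self.commits) + 1
        self.commits.append((commit_id, {r[key]: r for r in records}))
        return commit_id

    def snapshot(self, as_of: Optional[int] = None) -> Dict[int, dict]:
        """Replay commits up to `as_of`; later commits overwrite earlier rows."""
        state: Dict[int, dict] = {}
        for commit_id, changes in self.commits:
            if as_of is not None and commit_id > as_of:
                break
            state.update(changes)
        return state

    def incremental_read(self, after: int) -> Dict[int, dict]:
        """Return only the rows changed by commits newer than `after`."""
        changed: Dict[int, dict] = {}
        for commit_id, changes in self.commits:
            if commit_id > after:
                changed.update(changes)
        return changed


log = ToyCommitLog()
first = log.upsert([{"id": 1, "salary": 110499.73}, {"id": 2, "salary": 140893.77}])
second = log.upsert([{"id": 2, "salary": 160893.77}, {"id": 5, "salary": 93956.32}])

latest = log.snapshot()                 # ids 1, 2, 5 with the updated salary for id 2
as_of_first = log.snapshot(as_of=first) # the table as it looked after the first commit
changes = log.incremental_read(after=first)  # only ids 2 and 5
print(sorted(latest), sorted(as_of_first), sorted(changes))
```

Reading the table "as of" an earlier commit is the time travel operation this notebook is named after; the incremental read is what lets a downstream job process only new changes.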

### How Hopsworks Feature Store time travel operations can be used for ML and Feature Pipelines

Hudi is integrated in the Hopsworks Feature Store for doing incremental feature computation and for point-in-time correctness and backfilling of feature data.

![Incremental Feature Engineering](./../images/featurestore_incremental_pull.png "Incremental Feature Engineering")

## Examples

### Create a Hudi time travel enabled feature group and bulk insert a sample dataset

For this demo we will use a small sample of data from the Agrawal generator, a widely used synthetic data generator. It contains hypothetical data about people applying for a loan. Source: Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, 5(6), December 1993.

##### To keep the demo simple, we will split the Agrawal dataset into 3 feature groups and create the datasets manually:
* `economy_fg` with customer id, salary, loan, value of house, age of house, commission and type of car features;
* `demographic_fg` with customer id, age, education level, zip code;
* `class_fg` with labels indicating whether the loan was approved (`class B`) or rejected (`class A`).

### Importing necessary libraries 

In [1]:
import hsfs
from hsfs.rule import Rule
from hsfs.RuleName import RuleName
import datetime
from pyspark.sql import DataFrame, Row
from pyspark.sql.types import *
from pyspark.sql.functions import unix_timestamp, from_unixtime

connection = hsfs.connection()
# Get a reference to the feature store; shared feature stores can also be accessed by providing the feature store name
fs = connection.get_feature_store()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
54,application_1610566077204_0016,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

In [2]:
economy_fg_schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("salary", FloatType(), True),
  StructField("commission", FloatType(), True),
  StructField("car", StringType(), True), 
  StructField("hvalue", FloatType(), True),      
  StructField("hyears", IntegerType(), True),     
  StructField("loan", FloatType(), True),
  StructField("year", IntegerType(), True)    
])

### Create a Spark dataframe for the feature group

In [3]:
economy_bulk_insert_data = [
    Row(1, 110499.73, 0.0,  "car15",  235000.0, 30, 354724.18, 2020),
    Row(2, 140893.77, 0.0,  "car20",  135000.0, 2, 395015.33, 2020),
    Row(3, 119159.65, 0.0,  "car1", 145000.0, 22, 122025.08, 2020),
    Row(4, 20000.0, 52593.63, "car9", 185000.0, 30, 99629.62, 2020)    
]

economy_bulk_insert_df = spark.createDataFrame(economy_bulk_insert_data, economy_fg_schema)

In [4]:
economy_bulk_insert_df.show()

+---+---------+----------+-----+--------+------+---------+----+
| id|   salary|commission|  car|  hvalue|hyears|     loan|year|
+---+---------+----------+-----+--------+------+---------+----+
|  1|110499.73|       0.0|car15|235000.0|    30| 354724.2|2020|
|  2|140893.77|       0.0|car20|135000.0|     2|395015.34|2020|
|  3|119159.65|       0.0| car1|145000.0|    22|122025.08|2020|
|  4|  20000.0|  52593.63| car9|185000.0|    30| 99629.62|2020|
+---+---------+----------+-----+--------+------+---------+----+

# Data Validation

The next sections show you how to create feature store expectations, attach them to feature groups, and apply them to dataframes being appended to the feature group.

### Discover data validation rules supported in Hopsworks
Hopsworks comes shipped with a set of data validation rules. These rules are **immutable**, uniquely identified by **name** and are available across all feature stores. These rules are used to create feature store expectations which can then be attached to feature groups.

In [5]:
# Get all rule definitions available in Hopsworks
rules = connection.get_rules()
[print(rule.to_dict()) for rule in rules]

{'name': 'HAS_MIN', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': "Validate the feature's min"}
{'name': 'HAS_MEAN', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': "Validate the feature's mean"}
{'name': 'HAS_SUM', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': "Validate the feature's sum"}
{'name': 'HAS_MAX', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': "Validate the feature's max"}
[None, None, None, None]

In [6]:
# Get a rule definition by name
rule_max = connection.get_rule(RuleName.HAS_MAX.name)
print(rule_max[0].to_dict())

{'name': 'HAS_MAX', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': "Validate the feature's max"}

### Create Expectations based on Hopsworks rules

Expectations are created at the feature store level. Multiple expectations can be created per feature store.

An expectation consists of one or more rules and can refer to one or more features. An expectation is used by attaching it to a feature group, as shown in the next sections.
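
As a rough mental model of how rules combine into an expectation, the following pure-Python sketch evaluates `HAS_MIN` and `HAS_MAX` rules against column values. It does not call HSFS. `ToyRule`, `evaluate`, and `check_expectation` are hypothetical names, and two things are assumptions: that `HAS_MIN` with `min=x` means the feature's minimum must be at least `x` (and symmetrically for `HAS_MAX`), and that the overall outcome is the most severe result among the rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed ordering of outcomes, from least to most severe.
SEVERITY = {"SUCCESS": 0, "WARNING": 1, "ERROR": 2}


@dataclass
class ToyRule:
    name: str                    # "HAS_MIN" or "HAS_MAX"
    level: str                   # "WARNING" or "ERROR": outcome reported when the rule is not met
    min: Optional[float] = None  # lower bound used by HAS_MIN
    max: Optional[float] = None  # upper bound used by HAS_MAX


def evaluate(rule: ToyRule, values: List[float]) -> str:
    """Return "SUCCESS" if the rule holds for the values, otherwise the rule's level."""
    if rule.name == "HAS_MIN":
        ok = min(values) >= rule.min
    elif rule.name == "HAS_MAX":
        ok = max(values) <= rule.max
    else:
        raise ValueError(f"unsupported rule in this sketch: {rule.name}")
    return "SUCCESS" if ok else rule.level


def check_expectation(rules: List[ToyRule], columns: Dict[str, List[float]]) -> str:
    """Apply every rule to every feature and return the most severe outcome."""
    results = [evaluate(rule, values) for values in columns.values() for rule in rules]
    return max(results, key=SEVERITY.__getitem__)


year_rules = [
    ToyRule(name="HAS_MIN", level="ERROR", min=2018),
    ToyRule(name="HAS_MAX", level="WARNING", max=2021),
]

valid_years = check_expectation(year_rules, {"year": [2020, 2020, 2020, 2020]})
late_years = check_expectation(year_rules, {"year": [2022, 2020, 2020, 2020, 2022]})
print(valid_years, late_years)
```

The bounds mirror the `year` expectation created in this notebook, so a batch containing the year 2022 trips the `HAS_MAX` rule at its `WARNING` level under these assumptions.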

In [7]:
expectation_sales = fs.create_expectation("sales",
                                          description="min and max sales limits",
                                          features=["salary", "commission"], 
                                          rules=[Rule(name="HAS_MIN", level="WARNING", min=0), Rule(name="HAS_MAX", level="ERROR", max=1000000)])
expectation_sales.save()

expectation_year = fs.create_expectation("year",
                                         features=["year"], 
                                         description="validate year correctness",
                                         rules=[Rule(name="HAS_MIN", level="ERROR", min=2018), Rule(name="HAS_MAX", level="WARNING", max=2021)])

expectation_year.save()

### Discover Feature Store Expectations

Using the Python API you can easily find out which expectations are available in this feature store.

In [8]:
# Get all Feature Store expectations
fs_expectations = fs.get_expectations()
[print(expectation.to_dict()) for expectation in fs_expectations]

[None, None, None]

In [9]:
# Get an expectation by its unique name
fs_expectation = fs.get_expectation("year")
print(fs_expectation.to_dict())



### Create feature group with expectations and validation type

Feature store expectations can be attached to and detached from feature groups. This enables ingestion pipelines to validate incoming data against expectations. Expectations can be set when creating a feature group.
Later in the notebook we describe the possible validation type values and what they mean for feature group ingestion. For the moment, we initialize the validation type to STRICT.

In [10]:
economy_fg = fs.create_feature_group(
    name = "economy_fg_p37",
    description = "Hudi Household Economy Feature Group",
    version=1,
    primary_key = ["id"], 
    partition_key = ["year"], 
    time_travel_format = "HUDI",
    validation_type="STRICT",
    expectations= [expectation_sales, expectation_year]
)

### Bulk insert data into the feature group
Since we have not yet saved any data into the newly created feature group, we will use Apache Hudi terminology and `Bulk Insert` the data. In HSFS this is done by calling the `save` method.

Data will be validated prior to being persisted into the Feature Store.

In [11]:
economy_fg.save(economy_bulk_insert_df)

feature_group_commit_instance.validation_id
1099
feature_group_commit_instance.json
{"commitID": null, "commitDateString": "20210121161525", "rowsInserted": 4, "rowsUpdated": 0, "rowsDeleted": 0, "validationId": 1099}
feature_group_commit_instance.to_dict
{'commitID': None, 'commitDateString': '20210121161525', 'rowsInserted': 4, 'rowsUpdated': 0, 'rowsDeleted': 0, 'validationId': 1099}
<hsfs.feature_group.FeatureGroup object at 0x7fa0d129cad0>

### Attach expectations to Feature Groups

Expectations can be attached and detached from feature groups even after the latter are created. If an expectation is attached to a feature group, it will be used when inserted data is validated. An expectation can be attached to multiple feature groups, as long as the expectation's features exist in that feature group.
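
The condition that an expectation's features must exist in the feature group can be pictured as a subset check. The helper below is hypothetical and runs without HSFS; it only mirrors the rule stated above and is not the check Hopsworks performs internally.

```python
from typing import Iterable, List


def missing_features(expectation_features: Iterable[str],
                     feature_group_features: Iterable[str]) -> List[str]:
    """Return the expectation features that the feature group does not contain."""
    available = set(feature_group_features)
    return [name for name in expectation_features if name not in available]


# Column names taken from economy_fg_schema in this notebook.
economy_features = ["id", "salary", "commission", "car", "hvalue", "hyears", "loan", "year"]

sales_missing = missing_features(["salary", "commission"], economy_features)
age_missing = missing_features(["age"], economy_features)

# An empty list means the expectation could be attached under this rule.
print(sales_missing, age_missing)
```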

In [12]:
# Detach expectation by using its name or the metadata object, example shows the latter
economy_fg.detach_expectation(expectation_year)

In [13]:
# Attach expectation by using its name or the metadata object, example shows the latter
economy_fg.attach_expectation(expectation_year)

### Validations

#### You can also validate the dataframe without having to insert the data into a feature group

In [14]:
economy_fg.validate(economy_bulk_insert_df)

<hsfs.feature_group_validation.FeatureGroupValidation object at 0x7fa0d1373a10>

#### You can retrieve all the validations of a feature group

In [22]:
economy_fg_validations = economy_fg.get_validations()
[print(validation.to_dict()) for validation in economy_fg_validations]

[None, None]

#### ... or retrieve a validation by validation time or commit time

Validation time is the timestamp when the validation started.

Commit time is the time the data was persisted in the time travel enabled feature group.

In [23]:
commit_time = economy_fg.get_validations()[0].commit_time

In [24]:
validation_time = economy_fg.get_validations()[0].validation_time

In [25]:
# Get validation by validation time
validation = economy_fg.get_validations(validation_time=validation_time)[0]
print(validation.to_dict())



In [26]:
# Get validation by commit time
validation = economy_fg.get_validations(commit_time=commit_time)[0]
print(validation.to_dict())



#### Get the status of a validation

In [27]:
print("Validation status: {}".format(validation.status))

Validation status: SUCCESS

### Upsert new invalid data into a Feature Group

Now we will try to upsert some invalid data (the `year` feature exceeds the maximum set by the expectation). An error is returned to the client along with the failed expectation.

In [28]:
economy_upsert_data = [
    Row(1, 120499.73, 0.0, "car17", 205000.0, 30, 564724.18, 2022),    #update
    Row(2, 160893.77, 0.0, "car10", 179000.0, 2, 455015.33, 2020),     #update
    Row(5, 93956.32, 0.0, "car15",  135000.0, 1, 458679.82, 2020),     #insert
    Row(6, 41365.43, 52809.15, "car7", 135000.0, 19, 216839.71, 2020), #insert
    Row(7, 94805.61, 0.0, "car17", 135000.0, 23, 233216.07, 2022)      #insert    
]

economy_upsert_df = spark.createDataFrame(economy_upsert_data, economy_fg_schema)

economy_upsert_df.show(5)

+---+---------+----------+-----+--------+------+---------+----+
| id|   salary|commission|  car|  hvalue|hyears|     loan|year|
+---+---------+----------+-----+--------+------+---------+----+
|  1|120499.73|       0.0|car17|205000.0|    30| 564724.2|2022|
|  2|160893.77|       0.0|car10|179000.0|     2|455015.34|2020|
|  5| 93956.32|       0.0|car15|135000.0|     1| 458679.8|2020|
|  6| 41365.43|  52809.15| car7|135000.0|    19| 216839.7|2020|
|  7| 94805.61|       0.0|car17|135000.0|    23|233216.06|2022|
+---+---------+----------+-----+--------+------+---------+----+

In [29]:
# Insert call will fail as invalid data (year feature) is about to be ingested. Error shows the expectation that was not met
economy_fg.insert(economy_upsert_df)

An error was encountered:
Metadata operation error: (url: https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/120/featurestores/68/featuregroups/1068/validations). Server response: 
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/feature_group.py", line 656, in insert
    write_options,
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/feature_group_engine.py", line 93, in insert
    validation = feature_group.validate(feature_dataframe)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/feature_group.py", line 817, in validate
    return self._data_validation_engine.validate(self, dataframe)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/data_validation_engine.py", line 115, in validate
    return self._feature_group_validation_api.put(feature_group, validation_python)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-pa

### Validation type
The validation type determines the validation behavior. Available types are:
- STRICT: Data validation is performed and data is ingested into the feature group only if the validation status is "SUCCESS".
- WARNING: Data validation is performed and data is ingested into the feature group only if the validation status is "WARNING" or "SUCCESS".
- ALL: Data validation is performed and data is ingested into the feature group regardless of the validation status.
- NONE: Data validation is not performed on the feature group.
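
The list above can be read as a small decision function. The sketch below is a hypothetical helper, not part of HSFS. The status label `"FAILURE"` is an assumption for the status that is neither `"SUCCESS"` nor `"WARNING"`, and treating `NONE` as "always ingest" is likewise an assumption, based on no validation being run.

```python
from typing import Optional


def should_ingest(validation_type: str, status: Optional[str] = None) -> bool:
    """Decide whether data is ingested, given the validation type and validation status."""
    if validation_type == "NONE":
        return True  # assumption: no validation runs, so nothing blocks ingestion
    if validation_type == "ALL":
        return True  # validation runs, but its status never blocks ingestion
    if validation_type == "WARNING":
        return status in ("SUCCESS", "WARNING")
    if validation_type == "STRICT":
        return status == "SUCCESS"
    raise ValueError(f"unknown validation type: {validation_type}")


# "FAILURE" is a placeholder label for a failed validation, not a confirmed HSFS value.
for validation_type in ("STRICT", "WARNING", "ALL"):
    decisions = {s: should_ingest(validation_type, s) for s in ("SUCCESS", "WARNING", "FAILURE")}
    print(validation_type, decisions)
```

Under this reading, a batch whose validation is not fully successful is rejected with `STRICT` but accepted once the type is changed to `ALL`, which matches the behavior shown in this notebook.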

The validation type can easily be changed for a feature group.

In [30]:
# The previous economy_upsert_df contains invalid data but we still want to persist the data, so we set the validation type from STRICT to ALL
economy_fg.update_validation_type("ALL")

<hsfs.feature_group.FeatureGroup object at 0x7fa0d129cad0>

In [31]:
# We try to insert the invalid df again
economy_fg.insert(economy_upsert_df)

feature_group_commit_instance.validation_id
1101
feature_group_commit_instance.json
{"commitID": null, "commitDateString": "20210121161900", "rowsInserted": 4, "rowsUpdated": 1, "rowsDeleted": 0, "validationId": 1101}
feature_group_commit_instance.to_dict
{'commitID': None, 'commitDateString': '20210121161900', 'rowsInserted': 4, 'rowsUpdated': 1, 'rowsDeleted': 0, 'validationId': 1101}
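
The `commitDateString` values printed by the two inserts in this notebook appear to follow the `yyyyMMddHHmmss` layout. Assuming that layout holds, they can be parsed with the standard library as sketched below. The helper name is hypothetical, and since the output does not state a time zone, the result is a naive `datetime`.

```python
from datetime import datetime


def parse_commit_date(commit_date_string: str) -> datetime:
    """Parse a commit date string, assuming the yyyyMMddHHmmss layout."""
    return datetime.strptime(commit_date_string, "%Y%m%d%H%M%S")


# Values copied from the bulk insert and upsert outputs in this notebook.
bulk_insert_commit = parse_commit_date("20210121161525")
upsert_commit = parse_commit_date("20210121161900")

elapsed = upsert_commit - bulk_insert_commit
print(bulk_insert_commit.isoformat(), upsert_commit.isoformat(), elapsed.total_seconds())
```

Parsed timestamps like these are convenient for ordering commits or for choosing a point in time to read from.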