# HudiOnHops

In this notebook we introduce time travel operations in HSFS. HSFS currently supports Apache Hudi (http://hudi.apache.org/), a storage abstraction/library for doing **incremental** data ingestion into data lakes stored on Hops (e.g. a Hopsworks Feature Store).

TL;DR: Hudi is a storage abstraction/library built on top of Spark. A Hudi dataset stores data in Parquet files and maintains additional metadata to make upserts efficient. A Hudi ingest job is intended to be run as a streaming ingest job, on an interval such as every 15 minutes, reading deltas from a message bus like Kafka and ingesting the deltas **incrementally** into a data lake.

![Incremental ETL](./../images/incr_load.png "Incremental ETL")

## Background

### Motivation

Hudi is an open-source library for doing incremental ingestion of data for large analytical datasets stored on distributed file systems. The library was originally developed at Uber to improve their data latency, but it is now an Apache project.

The main motivation for Hudi is that it reduces the **data latency** for ingesting large datasets into data lakes. Traditional ETL typically involves taking a snapshot of a production database and doing a full load into a data lake (typically stored on a distributed file system). Using the snapshot approach for ETL is simple, since the snapshot is immutable and can be loaded as an atomic unit into the data lake. However, the drawback of this approach is that it is *slow*. Even if just a single record has been updated since the last ingestion, the entire table has to be rewritten. If you are working with Big Data (TB or PB size datasets), this introduces significant *data latency* (up to 24 hours in Uber's case) and *wasted resources* (the majority of the writes when ingesting the snapshot are redundant, as most of the records have not been updated since the last ETL step).

This motivates the use-case for **incremental** data ingestion. Incremental data ingestion means that only deltas/changelogs since the last ingestion are inserted. 

Incremental ingestion lies in between traditional batch ingestion and the streaming use-case. It can provide data latency as low as *minutes* for petabyte-scale datasets. The incremental mode of processing introduces new trade-offs compared to streaming and batch: it has lower data latency than traditional batch processing, but slightly higher latency than stream processing. Why not go full streaming instead of incremental processing? It boils down to your requirements and trade-offs. If you need data latency on the order of seconds (e.g. for fraud detection), then you have to use stream processing. However, if your business can live with data latency on the order of, say, 5 minutes (for example feature engineering pipelines, dashboards, or near-real-time analytics), then incremental processing really shines.

With incremental processing, you process data in *mini-batches* and run the Spark job frequently, every 15 minutes or so. By using mini-batches rather than record-by-record streaming, the incremental model makes better use of resources and makes it easier to do complex processing and joins, which are better suited to batch-style processing than to stream processing.

![Near Real Time](./../images/near_real_time.jpg "Near Real Time")

If the data is immutable by design, incremental processing can be done without any additional ingestion library: just use the *append* primitive supported in HDFS through an HDFS client such as Spark, e.g.:

```scala
// newRecordsDf holds only the records that arrived since the last ingestion
val newRecordsDf = (...)
newRecordsDf.write.format("hive").mode("append").insertInto(tableName)
```

Unfortunately, data is rarely immutable in practice. A bank transaction might be reverted, a customer might change their home address, and a customer review might be updated, to give a few examples. This is where Hudi comes into the picture. Hudi stands for `Hadoop Upserts anD Incrementals` and brings two new primitives for data engineering on distributed file systems (in addition to append/read):

- `Upsert`: the ability to do insertions (appends) and updates efficiently. 
- `Incremental reads`: the ability to read datasets incrementally using the notion of "commits".
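
As a rough illustration of the upsert primitive, the following plain-Scala sketch models a table as a map keyed by `id`. It only shows the semantics (update if the key exists, insert otherwise); it says nothing about how Hudi implements upserts on Parquet files. The names `KeyedRecord` and `upsertRecords` are made up for this example and are not part of Hudi or HSFS.

```scala
// Toy model of an upsert: records are keyed by `id`, and each delta record
// either overwrites an existing key (update) or adds a new one (insert).
// `KeyedRecord` and `upsertRecords` are hypothetical names for this sketch.
final case class KeyedRecord(id: Int, value: Double, country: String)

def upsertRecords(table: Map[Int, KeyedRecord], delta: Seq[KeyedRecord]): Map[Int, KeyedRecord] =
  delta.foldLeft(table)((acc, r) => acc.updated(r.id, r))

val existingTable = Map(
  1 -> KeyedRecord(1, 0.4151, "Sweden"),
  2 -> KeyedRecord(2, 1.2151, "Ireland")
)
val deltaRecords = Seq(
  KeyedRecord(2, 0.5, "Ireland"), // update: key 2 already exists
  KeyedRecord(3, 0.9, "Belgium")  // insert: key 3 is new
)

val mergedTable = upsertRecords(existingTable, deltaRecords)
mergedTable.values.toSeq.sortBy(_.id).foreach(println)
```

Only the two delta records are touched; the record with `id` 1 is carried over unchanged, which is the property that makes upserts cheaper than a full rewrite.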

![Upserts](./../images/upsert_illustration.png "Upserts")

Let's consider the process of updating a single record in a data lake of Parquet files stored on a distributed file system. Without Hudi, this would entail scanning the entire dataset to find the record, applying the update, and then rewriting the entire dataset:

```scala
// Rewrites the whole table, even if only a single record changed
val updatedRecordsDf = (...)
updatedRecordsDf.write.format("hive").mode("overwrite").insertInto(tableName)
```

This does not scale, and HDFS/Parquet is not designed for this use-case. With Hudi, the upsert operation is a first-class primitive in the ingestion framework, and it is optimized to be fast using index lookups and atomic updates. We will see how to use Hudi for this purpose later in the notebook, but essentially it is as simple as:

```scala
val upsertDf = (...)
upsertDf.write.format("org.apache.hudi")
              .option("hoodie.datasource.write.operation", "upsert")
              ...
```

### What is Hudi

Hudi is a Spark library that is intended to be run as a streaming ingest job, and ingests data as mini-batches (typically on the order of one to two minutes). A Hudi job generally reads delta updates from a message bus like Kafka and upserts them into a data lake stored on a distributed file system. By maintaining bloom indexes and commit logs, Hudi provides ACID transactions, time travel and scalable upserts.

![Hudi Dataset](./../images/hudi_dataset.png "Hudi Dataset")

### How Hudi can be used for ML and Feature Pipelines

Hudi is integrated in the Hopsworks Feature Store for doing incremental feature computation and for point-in-time correctness and backfilling of feature data.

![Incremental Feature Engineering](./../images/featurestore_incremental_pull.png "Incremental Feature Engineering")

## Examples

import com.logicalclocks.hsfs._
import scala.collection.JavaConversions._
import collection.JavaConverters._

import org.apache.spark.sql.{ DataFrame, Row }
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._

import java.sql.Date
import java.sql.Timestamp

val connection = HopsworksConnection.builder().build();
val fs = connection.getFeatureStore();

### Bulk Insert of Sample Dataset into a Hudi Dataset

Let's first create a new feature group with time travel format `HUDI` and ingest some sample data.

#### Generate the sample data

In [2]:
val bulkInsertData = Seq(
    Row(1, Date.valueOf("2019-03-02"), 0.4151f, "Sweden"),
    Row(2, Date.valueOf("2019-05-01"), 1.2151f, "Ireland"),
    Row(3, Date.valueOf("2019-08-06"), 0.2151f, "Belgium"),
    Row(4, Date.valueOf("2019-08-06"), 0.8151f, "Russia")
)
val schema = 
 scala.collection.immutable.List(
  StructField("id", IntegerType, true),
  StructField("date", DateType, true),
  StructField("value", FloatType, true),
  StructField("country", StringType, true) 
)
val bulkInsertDf = spark.createDataFrame(
  spark.sparkContext.parallelize(bulkInsertData),
  StructType(schema)
)
bulkInsertDf.show(5)

bulkInsertData: Seq[org.apache.spark.sql.Row] = List([1,2019-03-02,0.4151,Sweden], [2,2019-05-01,1.2151,Ireland], [3,2019-08-06,0.2151,Belgium], [4,2019-08-06,0.8151,Russia])
schema: List[org.apache.spark.sql.types.StructField] = List(StructField(id,IntegerType,true), StructField(date,DateType,true), StructField(value,FloatType,true), StructField(country,StringType,true))
bulkInsertDf: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-03-02|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2019-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+



#### Bulk load the sample data into a new feature group

We will create a dataset/table called `hello_hudi`:

```
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-03-02|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2019-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+
```
The dataset has the primary key `id` and is partitioned on the `date` column.

val hudi_fg = fs.createFeatureGroup().
name("hello_hudi").
description("Sample feature group").
version(1).
primaryKeys(Seq("id")).
partitionKeys(Seq("date")).
timeTravelFormat(TimeTravelFormat.HUDI).
onlineEnabled(false).
build();
hudi_fg.save(bulkInsertDf);

#### Hudi Commits

Hudi introduces the notion of `commits`, which lets it support certain properties of traditional databases such as single-table transactions, snapshot isolation, atomic upserts and savepoints for data recovery. If an ingestion fails for some reason, no partial results are written; instead, the ingestion is rolled back. The commit is implemented using the atomic `mv` operation in HDFS.

Currently, the Hudi dataset contains only a single commit, as we have just done a single bulk insert:
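
The write-then-rename idea behind such a commit can be sketched on a local file system with `java.nio.file`. This is only an analogy under stated assumptions: the helper `atomicPublish` and the file name are invented for the example, and the sketch does not reproduce Hudi's actual commit protocol on HDFS.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

// Write-then-rename: the content is first written to a temporary file and
// only becomes visible under its final name once the write has completed.
// `atomicPublish` is a hypothetical helper, not a Hudi or HSFS API.
def atomicPublish(target: Path, content: String): Unit = {
  val tmp = Files.createTempFile(target.getParent, "inflight-", ".tmp")
  try {
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8))
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
  } finally {
    // No-op after a successful move; removes the partial file on failure.
    Files.deleteIfExists(tmp)
  }
}

val demoDir = Files.createTempDirectory("commit-demo")
val commitFile = demoDir.resolve("20190904114951.commit") // illustrative name
atomicPublish(commitFile, "4 records written")
println(new String(Files.readAllBytes(commitFile), StandardCharsets.UTF_8))
```

A reader that lists the directory sees either no commit file or a complete one, never a half-written file, which is the property that makes a failed ingestion invisible.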

In [6]:
hudi_fg.commitDetails()

res5: String = 20190904114951
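
The commit identifier shown above appears to be a timestamp in `yyyyMMddHHmmss` form. Assuming that format, it can be turned into a readable date-time with `java.time`; the variable names below are illustrative only.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Assumption: the commit identifier is a second-granularity timestamp
// formatted as yyyyMMddHHmmss.
val commitTimeFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")
val firstCommitTime = LocalDateTime.parse("20190904114951", commitTimeFormat)

println(firstCommitTime) // prints 2019-09-04T11:49:51
```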


### Upsert new data into a Feature Group

So far we have not done anything specific to time travel; we simply did a regular bulk insert of some data into a Hudi-enabled feature group. We could have done the same thing with a regular, non-Hudi feature group. Now, however, we will look at how to do upserts, and how HSFS with Hudi enables us to do this efficiently.

#### Generate Sample Upserts Data

In [13]:
val upsertData = Seq(
    Row(5, Date.valueOf("2019-03-02"), 0.7921f, "Northern Ireland"), //Insert
    Row(1, Date.valueOf("2019-05-01"), 1.151f, "Norway"), //Update
    Row(3, Date.valueOf("2019-08-06"), 0.999f, "Belgium"), //Update
    Row(6, Date.valueOf("2019-08-06"), 0.0151f, "France") //Insert
)
val upsertDf = spark.createDataFrame(
  spark.sparkContext.parallelize(upsertData),
  StructType(schema)
)
upsertDf.show(5)

upsertData: Seq[org.apache.spark.sql.Row] = List([5,2019-03-02,0.7921,Northern Ireland], [1,2019-05-01,1.151,Norway], [3,2019-08-06,0.999,Belgium], [6,2019-08-06,0.0151,France])
upsertDf: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+----------------+
| id|      date| value|         country|
+---+----------+------+----------------+
|  5|2019-03-02|0.7921|Northern Ireland|
|  1|2019-05-01| 1.151|          Norway|
|  3|2019-08-06| 0.999|         Belgium|
|  6|2019-08-06|0.0151|          France|
+---+----------+------+----------------+



#### Make the Upsert using HSFS API

In [14]:
hudi_fg.insert(upsertDf)



#### Inspect the results

Notice that although Hudi stores the old values of the records from the previous commit, when you query the Hive table using the `org.apache.hudi` file format, it will only return the values of the latest commit.

+---+------+-------------+----------------+
| id| value|         date|         country|
+---+------+-------------+----------------+
|  3| 0.999|1565049600000|         Belgium|
|  1|0.4151|1551484800000|          Sweden|
|  5|0.7921|1551484800000|Northern Ireland|
|  2|1.2151|1556668800000|         Ireland|
|  1| 1.151|1556668800000|          Norway|
|  4|0.8151|1565049600000|          Russia|
|  6|0.0151|1565049600000|          France|
+---+------+-------------+----------------+
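
In the result above, `id` 1 appears twice (Sweden and Norway), while `id` 3 appears only once with its updated value. One plausible explanation, assuming record keys are only unique within a partition, is that the upsert for `id` 1 carried a different `date` and therefore landed in a different partition as a new record. The plain-Scala sketch below models that assumption; `PartitionedRow` and `upsertPerPartition` are hypothetical names, and the sketch does not claim to describe how HSFS configures Hudi.

```scala
// Toy model: a record's identity is (partition value, key), not the key alone.
// `PartitionedRow` and `upsertPerPartition` are hypothetical names.
final case class PartitionedRow(id: Int, date: String, value: Double, country: String)

def upsertPerPartition(
    table: Map[(String, Int), PartitionedRow],
    delta: Seq[PartitionedRow]
): Map[(String, Int), PartitionedRow] =
  delta.foldLeft(table)((acc, r) => acc.updated((r.date, r.id), r))

val firstCommitRows = Seq(
  PartitionedRow(1, "2019-03-02", 0.4151, "Sweden"),
  PartitionedRow(2, "2019-05-01", 1.2151, "Ireland"),
  PartitionedRow(3, "2019-08-06", 0.2151, "Belgium"),
  PartitionedRow(4, "2019-08-06", 0.8151, "Russia")
)
val secondCommitRows = Seq(
  PartitionedRow(5, "2019-03-02", 0.7921, "Northern Ireland"),
  PartitionedRow(1, "2019-05-01", 1.151, "Norway"),  // same id, different partition
  PartitionedRow(3, "2019-08-06", 0.999, "Belgium"), // same id, same partition
  PartitionedRow(6, "2019-08-06", 0.0151, "France")
)

val afterUpsert =
  upsertPerPartition(upsertPerPartition(Map.empty, firstCommitRows), secondCommitRows)
afterUpsert.values.toSeq.sortBy(r => (r.id, r.date)).foreach(println)
```

Under this model the table ends up with seven rows, matching the shape of the output above: the row for `id` 3 is replaced in place, whereas `id` 1 exists once per partition.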



#### Inspect the updated commit timeline

In [16]:
hudi_fg.commitDetails()

res16: String = 20190904115157


In [17]:
HoodieDataSourceHelpers.allCompletedCommitsCompactions(FileSystem.get(sc.hadoopConfiguration), 
                                     s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_hudi_1").toString

res17: String = org.apache.hudi.common.table.timeline.HoodieDefaultTimeline: [20190904114951__commit__COMPLETED],[20190904115157__commit__COMPLETED]


### Time Travel

Using the timeline metadata we can inspect the value of a table at a specific point in time. We can pull changes incrementally from Hudi. 
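
Conceptually, both kinds of read can be modelled as queries over an ordered list of commits. The sketch below is a toy model in plain Scala, not the HSFS or Hudi implementation: `ToyCommit`, `snapshotAsOf` and `changesBetween` are invented names, the row contents are made up, and treating the start of the interval as exclusive and the end as inclusive is an assumption of this sketch. Only the two commit timestamps are taken from this notebook.

```scala
// Toy commit timeline: each commit carries the rows it wrote (id -> value).
// `ToyCommit`, `snapshotAsOf` and `changesBetween` are hypothetical names.
final case class ToyCommit(time: String, rows: Map[Int, Double])

val toyTimeline = Seq(
  ToyCommit("20190904114951", Map(1 -> 0.4151, 2 -> 1.2151)),
  ToyCommit("20190904115157", Map(2 -> 0.5, 3 -> 0.999))
)

// Time travel: replay every commit up to and including `time`.
def snapshotAsOf(time: String): Map[Int, Double] =
  toyTimeline
    .filter(c => c.time <= time)
    .foldLeft(Map.empty[Int, Double])((acc, c) => acc ++ c.rows)

// Incremental read: rows written by commits in the interval (start, end].
def changesBetween(start: String, end: String): Map[Int, Double] =
  toyTimeline
    .filter(c => c.time > start && c.time <= end)
    .foldLeft(Map.empty[Int, Double])((acc, c) => acc ++ c.rows)

println(snapshotAsOf("20190904114951"))
println(changesBetween("20190904114951", "20190904115157"))
```

Because the timestamps are fixed-width digit strings, plain string comparison orders them chronologically, which keeps the sketch free of date parsing.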

In [20]:
hudi_fg.read("2020-10-20 07:31:36").show()

+---+------+-------------+----------------+
| id| value|         date|         country|
+---+------+-------------+----------------+
|  3| 0.999|1565049600000|         Belgium|
|  5|0.7921|1551484800000|Northern Ireland|
|  1| 1.151|1556668800000|          Norway|
|  6|0.0151|1565049600000|          France|
+---+------+-------------+----------------+



HSFS also has a feature for incremental reads:


In [21]:
// Pull changes that happened *after* the first commit

hudi_fg.readChanges("2020-10-20 07:31:36", "2020-10-20 07:34:11").show()

incrementalDf: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+------+-------------+----------------+
| id| value|         date|         country|
+---+------+-------------+----------------+
|  3| 0.999|1565049600000|         Belgium|
|  5|0.7921|1551484800000|Northern Ireland|
|  1| 1.151|1556668800000|          Norway|
|  6|0.0151|1565049600000|          France|
+---+------+-------------+----------------+



In [22]:
// Pull changes that include both commits:
hudi_fg.readChanges("2020-10-20 07:31:36", "2020-10-20 07:34:11").show()

incrementalDf: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+------+-------------+----------------+
| id| value|         date|         country|
+---+------+-------------+----------------+
|  3| 0.999|1565049600000|         Belgium|
|  1|0.4151|1551484800000|          Sweden|
|  5|0.7921|1551484800000|Northern Ireland|
|  2|1.2151|1556668800000|         Ireland|
|  1| 1.151|1556668800000|          Norway|
|  4|0.8151|1565049600000|          Russia|
|  6|0.0151|1565049600000|          France|
+---+------+-------------+----------------+



In [23]:
//Pull only the first commit
hudi_fg.read("2020-10-20 07:31:36").show()

incrementalDf: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+------+-------------+-------+
| id| value|         date|country|
+---+------+-------------+-------+
|  2|1.2151|1556668800000|Ireland|
|  4|0.8151|1565049600000| Russia|
|  3|0.2151|1565049600000|Belgium|
|  1|0.4151|1551484800000| Sweden|
+---+------+-------------+-------+



In [24]:
//Pull only the second commit
hudi_fg.read("2020-10-20 07:31:36").show()

incrementalDf: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+------+-------------+----------------+
| id| value|         date|         country|
+---+------+-------------+----------------+
|  3| 0.999|1565049600000|         Belgium|
|  5|0.7921|1551484800000|Northern Ireland|
|  1| 1.151|1556668800000|          Norway|
|  6|0.0151|1565049600000|          France|
+---+------+-------------+----------------+



### Create training datasets based on time travel queries

#### Join feature groups as of a specific point in time

In [None]:
val joined_features = (fg1.select(Seq("value","id","label"))
                   .join(fg2.select(Seq("value2")), Seq("id"), JoinType.INNER)
                   .join(fg3.select(Seq("value3")), Seq("id"), JoinType.INNER)
                   .asOf("2020-10-21"))  