### Create training dataset
In this notebook we create training datasets and register them in the Hopsworks Feature Store. These training datasets are later used to train a graph embeddings model.
![Training Dataset](./images/create_training_dataset.png)

In [None]:
spark

In [2]:
import hashlib
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
import hsfs

### Instantiate a connection and get the project feature store handle

In [3]:
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

### Get transactions feature group handle

In [4]:
transactions_fg = fs.get_feature_group("transactions_fg", 1)
transactions_fg.show(5)

+--------------+--------+--------+-------+-------+--------+
|tran_timestamp|  source|  target|tran_id|tx_type|base_amt|
+--------------+--------+--------+-------+-------+--------+
|        Aug-18|35d9e14d|ea20f3ea| 328889|      4|  341.23|
|        Aug-18|ecaba057|42e760b1| 328890|      4|  846.74|
|        Aug-18|43e1ce2b|d85de509| 328891|      4|  269.45|
|        Aug-18|2d76caaa|9dff007b| 328892|      4|  180.32|
|        Aug-18|b2efe3c6|92cf8d33| 328893|      4|  507.09|
+--------------+--------+--------+-------+-------+--------+
only showing top 5 rows

In [5]:
transactions_fg.read().count()

438386

### Load alert transactions feature group handle

In [6]:
alert_transactions_fg = fs.get_feature_group("alert_transactions_fg", 1)
alert_transactions_fg.show(5)

+--------------+--------+------+-------+
|    alert_type|alert_id|is_sar|tran_id|
+--------------+--------+------+-------+
|scatter_gather|      32|     1|1000145|
|scatter_gather|      18|     1|1002861|
|scatter_gather|      34|     1| 100431|
|scatter_gather|      18|     1|1006056|
|scatter_gather|      32|     1|1006362|
+--------------+--------+------+-------+
only showing top 5 rows

In [7]:
alert_transactions_fg.read().count()

915

### Load party feature group handle

In [8]:
party_fg = fs.get_feature_group("party_fg", 1)
party_fg.read().show()

+--------+----+
|      id|type|
+--------+----+
|0016359b|   0|
|0019b8d0|   0|
|001dcc27|   1|
|00298665|   1|
|003cd8f3|   0|
|003e2533|   0|
|00403fbd|   1|
|00498ec2|   1|
|0049ee5b|   0|
|0054a022|   0|
|00575ac9|   0|
|005c0c19|   1|
|006ac170|   1|
|006cc052|   0|
|0075d230|   1|
|007749eb|   0|
|00794932|   1|
|007f2674|   0|
|007f76dc|   1|
|0081b086|   0|
+--------+----+
only showing top 20 rows

## Create training datasets
To create training datasets we use the hsfs `Query` object. The metadata of a training dataset created from a `Query` object records:
* which feature groups it was created from;
* the commit ids of these feature groups;
* the order of the features.

This makes it possible to
* trace back which features were used to create the training dataset;
* time-travel and see what the features looked like when the training dataset was created;
* reconstruct the feature order during model inference.
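
The lineage idea above can be illustrated with a small plain-Python sketch. It does not use the hsfs API: `TrainingDatasetMeta`, its fields, and the commit id values are hypothetical and chosen only for illustration. The sketch shows how recording the source feature groups, their commit ids, and the feature order lets an unordered feature dictionary be arranged into the training-time order at inference.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingDatasetMeta:
    """Hypothetical stand-in for the metadata kept with a training dataset."""
    name: str
    version: int
    # feature group name -> commit id at creation time (made-up values below)
    feature_group_commits: Dict[str, int]
    # column order used when the training dataset was written
    feature_order: List[str]

    def to_vector(self, features: Dict[str, float]) -> List[float]:
        """Arrange an unordered feature dict in the recorded training order."""
        missing = [f for f in self.feature_order if f not in features]
        if missing:
            raise KeyError(f"missing features: {missing}")
        return [features[f] for f in self.feature_order]


meta = TrainingDatasetMeta(
    name="edges_td",
    version=1,
    feature_group_commits={"transactions_fg": 1, "alert_transactions_fg": 1},
    feature_order=["tx_type", "base_amt"],
)

# At inference time the features may arrive in any order.
row = {"base_amt": 341.23, "tx_type": 4}
vector = meta.to_vector(row)
print(vector)
```

Keeping the order in the metadata, rather than in model code, means the serving side does not need to hard-code column positions.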

### Create graph edge training dataset

In [9]:
# Join transaction features with the alert label on tran_id
edges = transactions_fg.select(["source", "target", "tran_id", "tx_type", "base_amt"]).join(alert_transactions_fg.select(["is_sar"]), ["tran_id"], "left")
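
The query above asks for a `"left"` join on `tran_id`. As a hedged illustration of what that choice means for the label column, the plain-Python sketch below contrasts a left join with an inner join. It does not call hsfs or Spark; `join_is_sar` and the sample rows are made up, only the column names come from the notebook, and it assumes `tran_id` is unique on the alert side.

```python
# Made-up sample rows; only the column names come from the notebook.
transactions = [
    {"tran_id": 1, "source": "a", "target": "b", "base_amt": 10.0},
    {"tran_id": 2, "source": "b", "target": "c", "base_amt": 20.0},
    {"tran_id": 3, "source": "c", "target": "a", "base_amt": 30.0},
]
alerts = [{"tran_id": 2, "is_sar": 1}]


def join_is_sar(left, right, how="left"):
    """Join on tran_id; 'left' keeps unmatched left rows with is_sar=None."""
    if how not in ("left", "inner"):
        raise ValueError(f"unsupported join type: {how}")
    is_sar_by_id = {r["tran_id"]: r["is_sar"] for r in right}
    joined = []
    for row in left:
        matched = row["tran_id"] in is_sar_by_id
        if matched or how == "left":
            joined.append({**row, "is_sar": is_sar_by_id.get(row["tran_id"])})
    return joined


left_rows = join_is_sar(transactions, alerts, how="left")
inner_rows = join_is_sar(transactions, alerts, how="inner")
print(len(left_rows), len(inner_rows))
```

With a left join every transaction survives and unflagged ones carry a missing `is_sar`; with an inner join only flagged transactions remain. Counting the rows of `edges.read()` and comparing against the transaction count is a quick way to check which behaviour was applied.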

In [10]:
edges.read().show()

+--------+--------+-------+-------+--------+------+
|  source|  target|tran_id|tx_type|base_amt|is_sar|
+--------+--------+-------+-------+--------+------+
|72555c71|6c344249|  36131|      4| 2775.22|     1|
|b0fe7e18|e6c76032| 811151|      4|  112.19|     1|
|5a89d195|2a348960| 864390|      4|  107.84|     1|
|78fd2e68|4f8b7770| 252225|      4| 2502.73|     1|
|a1b4f889|4f8b7770| 256932|      4| 2502.73|     1|
|c5d0e6ca|8fab72e6| 333603|      4| 2657.36|     1|
|ab638a8a|d429553b| 507116|      4| 2816.58|     1|
|6d4543d6|ea80e43e| 514090|      4| 2534.92|     1|
|a1b3bc5e|396e2618| 769337|      4| 2721.35|     1|
|a670af3d|313c12f6| 778521|      4| 2491.18|     1|
|67d2ab78|1a211334| 867611|      4| 2830.43|     1|
|4c52d76b|0c81ba35| 137352|      4|   94.95|     1|
|f48fcd16|36377e59| 601420|      4| 2516.74|     1|
|75178e11|fa1ca6b5| 832202|      4| 2867.47|     1|
|69d84dfc|053485ef| 852249|      4| 2580.73|     1|
|edfe718c|19857dba| 856332|      4|   103.5|     1|
|b98c31fe|3d

In [11]:
# Define the training dataset metadata; is_sar is the label column
edges_td_meta = fs.create_training_dataset(name="edges_td",
                                           version=1,
                                           data_format="csv",
                                           label=["is_sar"],
                                           description="edges training dataset",
                                           coalesce=True,
                                           statistics_config={"enabled": True, "histograms": True, "correlations": False})
# Materialize the query result as the training dataset
edges_td_meta.save(edges)

### Create graph node training dataset

In [18]:
nodes = party_fg.select_all()
nodes.read().show()

node_td_meta = fs.create_training_dataset(name="node_td",
                                          version=1,
                                          data_format="csv",   
                                          description="node training dataset",
                                          statistics_config={"enabled": True, "histograms": True, "correlations": False},
                                          coalesce=True)
node_td_meta.save(nodes)

+--------+----+
|      id|type|
+--------+----+
|0016359b|   0|
|0019b8d0|   0|
|001dcc27|   1|
|00298665|   1|
|003cd8f3|   0|
|003e2533|   0|
|00403fbd|   1|
|00498ec2|   1|
|0049ee5b|   0|
|0054a022|   0|
|00575ac9|   0|
|005c0c19|   1|
|006ac170|   1|
|006cc052|   0|
|0075d230|   1|
|007749eb|   0|
|00794932|   1|
|007f2674|   0|
|007f76dc|   1|
|0081b086|   0|
+--------+----+
only showing top 20 rows

## Create derived feature group `alert_nodes_fg`: whether nodes were part of a previously known money laundering scheme

In [13]:
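# Keep only the edges flagged as suspicious (is_sar == 1)
alert_edges = edges.read().where(F.col("is_sar") == 1)
# Collect both endpoints of each flagged edge under a common column name
alert_sources = alert_edges.select(["source"]).toDF("id")
alert_targets = alert_edges.select(["target"]).toDF("id")
# Deduplicate node ids and mark them as flagged
alert_nodes = alert_sources.union(alert_targets).dropDuplicates(subset=["id"])
alert_nodes = alert_nodes.withColumn("is_sar", F.lit(1))
alert_nodes.cache()
alert_nodes.show()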

+--------+------+
|      id|is_sar|
+--------+------+
|33a8ff5b|     1|
|43e028ef|     1|
|fcf3bbf3|     1|
|8b9017b8|     1|
|9c187eed|     1|
|65636b63|     1|
|68c0230d|     1|
|550a25ff|     1|
|d73e5230|     1|
|c0be245b|     1|
|cdbd2ed5|     1|
|963b978f|     1|
|84563a83|     1|
|da77c74b|     1|
|840701de|     1|
|dc37f73b|     1|
|b0f4351c|     1|
|dd2ebcf1|     1|
|c29d75dc|     1|
|d7c99aa5|     1|
+--------+------+
only showing top 20 rows

In [19]:
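# Left join keeps every node; nodes without a match get is_sar = 0
alert_nodes_df = nodes.read().join(alert_nodes, ["id"], "left")
alert_nodes_df = alert_nodes_df.withColumn("is_sar", F.when(F.col("is_sar") == 1, F.col("is_sar")).otherwise(0))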
alert_nodes_df.cache()
alert_nodes_df.show()

+--------+----+------+
|      id|type|is_sar|
+--------+----+------+
|0016359b|   0|     0|
|0019b8d0|   0|     0|
|001dcc27|   1|     0|
|00298665|   1|     0|
|003cd8f3|   0|     0|
|003e2533|   0|     0|
|00403fbd|   1|     0|
|00498ec2|   1|     0|
|0049ee5b|   0|     0|
|0054a022|   0|     0|
|00575ac9|   0|     0|
|005c0c19|   1|     0|
|006ac170|   1|     0|
|006cc052|   0|     0|
|0075d230|   1|     0|
|007749eb|   0|     1|
|00794932|   1|     0|
|007f2674|   0|     1|
|007f76dc|   1|     0|
|0081b086|   0|     0|
+--------+----+------+
only showing top 20 rows
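
The node labelling performed in the two cells above can be sketched in plain Python, which may make the logic easier to follow than the Spark calls. This is only an illustration on made-up rows, not the notebook's implementation: a node is labelled 1 if it appears as source or target of any flagged edge, and 0 otherwise.

```python
# Made-up sample data; only the column names follow the notebook.
edge_rows = [
    {"source": "n1", "target": "n2", "is_sar": 1},
    {"source": "n2", "target": "n3", "is_sar": None},
    {"source": "n4", "target": "n1", "is_sar": 1},
]
node_rows = [{"id": n, "type": 0} for n in ["n1", "n2", "n3", "n4", "n5"]]

# Union of sources and targets over flagged edges; a set removes duplicates,
# which plays the role of dropDuplicates(subset=["id"]).
flagged_ids = set()
for edge in edge_rows:
    if edge["is_sar"] == 1:
        flagged_ids.update((edge["source"], edge["target"]))

# Equivalent of the left join plus when/otherwise: every node is kept,
# and nodes that never touch a flagged edge get is_sar = 0.
labelled = [
    {**node, "is_sar": 1 if node["id"] in flagged_ids else 0}
    for node in node_rows
]

for node in labelled:
    print(node["id"], node["is_sar"])
```

Note that `n2` is labelled 1 because of its flagged edge with `n1`, even though its other edge is not flagged.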

In [20]:
alert_nodes_df.where(F.col("is_sar") == 1).count()

816

In [21]:
alert_nodes_df.where(F.col("is_sar") == 0).count()

6531
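
As a quick sanity check on the two counts above, the share of flagged nodes can be computed directly. The snippet only does arithmetic on the numbers printed by the notebook; the variable names are illustrative.

```python
# Counts reported by the two cells above.
n_sar = 816
n_not_sar = 6531

n_nodes = n_sar + n_not_sar
sar_share = n_sar / n_nodes

print(n_nodes)             # 7347
print(f"{sar_share:.1%}")  # 11.1%
```

Roughly one node in nine is flagged, so the label is imbalanced.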

In [22]:
# Use Hudi options that are suited to the size of our dataset
extra_hudi_options = {
    "hoodie.bulkinsert.shuffle.parallelism":"1", 
    "hoodie.insert.shuffle.parallelism":"1", 
    "hoodie.upsert.shuffle.parallelism":"1",
    "hoodie.parquet.compression.ratio":"0.5"
}

alert_nodes_fg = fs.create_feature_group(name="alert_nodes_fg",
                                         version=1,
                                         primary_key=["id"],
                                         description="node embeddings from transactions, derived fg",
                                         time_travel_format="HUDI",
                                         online_enabled=True,
                                         statistics_config=False)

alert_nodes_fg.save(alert_nodes_df, extra_hudi_options)

### Training datasets exploration from the user interface

##### Hopsworks provides a user interface for discovering and exploring available training datasets and their features. The screenshot below shows how to preview the list of features in `edges_td` and get basic information such as the feature types and which feature is the label.

![Incremental Feature Engineering](./images/td_features.png)

##### An important step in training dataset exploration is to examine the distribution of its features. Since we enabled statistics computation during training dataset creation, we can easily preview descriptive statistics. If the training dataset has splits, we can also preview statistics for each split separately and compare them to make sure that the train and test splits have similar distributions.


![Incremental Feature Engineering](./images/td_stats.png)
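
The same kind of split comparison can also be done by hand. The sketch below is a minimal illustration using only the Python standard library: the `train` and `test` lists are made-up splits (the training datasets in this notebook were created without splits), and `describe` and `relative_mean_gap` are hypothetical helpers, not Hopsworks functions.

```python
import statistics


def describe(values):
    """Basic descriptive statistics for one numeric feature."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def relative_mean_gap(a, b):
    """Absolute difference of the two means, relative to the pooled mean."""
    pooled = statistics.fmean(a + b)
    return abs(statistics.fmean(a) - statistics.fmean(b)) / pooled


# Made-up base_amt values standing in for a train and a test split.
train = [341.23, 846.74, 269.45, 180.32, 507.09, 2775.22]
test = [112.19, 107.84, 2502.73]

print(describe(train))
print(describe(test))
print(relative_mean_gap(train, test))
```

A large gap between the splits would suggest re-drawing them, for example with stratification; what counts as "large" depends on the feature and is left to the reader.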

##### The Hopsworks UI also gives access to the training dataset's activity timeline metadata.
![Incremental Feature Engineering](./images/td_activity.png)