# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on,
2. **Define how the features should be preprocessed,**
3. **Create a dataset split** for training and validation data.

![tutorial-flow](images/02_training-dataset.png) 

### Create a connection to hsfs

In [1]:
import hsfs
from hops import hdfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
6,application_1651410992845_0007,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

### We start by selecting all the features we want to include for model training/inference.

In the next notebook we are going to train a [GAN for anomaly detection](https://arxiv.org/pdf/1905.11034.pdf). During the training step we will provide only the features of accounts that have never been reported for money laundering behaviour. Previously reported accounts will be disclosed to the model only in the evaluation step.
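As a minimal sketch of this setup, the data can be thought of as two partitions keyed on a binary label. The records below are made up for illustration, and `split_by_label` is a hypothetical helper written in plain Python, not part of hsfs.

```python
# Hypothetical toy records; `is_sar` marks accounts reported for suspicious activity.
records = [
    {"id": 1, "is_sar": 0},
    {"id": 2, "is_sar": 1},
    {"id": 3, "is_sar": 0},
    {"id": 4, "is_sar": 1},
]


def split_by_label(rows, label="is_sar"):
    """Return (normal, reported) partitions based on a binary label column."""
    normal = [r for r in rows if r[label] == 0]
    reported = [r for r in rows if r[label] == 1]
    return normal, reported


normal_rows, reported_rows = split_by_label(records)
# The GAN would be trained on `normal_rows` only;
# `reported_rows` are added back when building the evaluation set.
```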


In [None]:
# Retrieve the transactions, graph embeddings and party feature groups from hsfs
transactions_monthly_fg = fs.get_feature_group("transactions_monthly_fg", 1)
graph_embeddings_fg = fs.get_feature_group("graph_embeddings_fg", 1) 
party_fg = fs.get_feature_group("party_fg", 1)

In [None]:
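# Build the training query: party labels joined with monthly transaction
# aggregates and graph embeddings, restricted to accounts never reported (is_sar == 0).
# NOTE: `alert_nodes_fg` is not retrieved in this notebook; it is assumed to be a
# feature group handle exposing `is_sar`, obtained via fs.get_feature_group().
non_sar_emb_query = (
    party_fg.select(["tran_id", "target"])
    .join(transactions_monthly_fg.select(["monthly_in_std_amount",
                                          "monthly_in_mean_amount",
                                          "monthly_out_mean_amount",
                                          "monthly_in_count",
                                          "monthly_in_total_amount",
                                          "monthly_out_count",
                                          "monthly_out_total_amount",
                                          "monthly_out_std_amount"]))
    .join(graph_embeddings_fg.select(["graph_embeddings"]))
    .filter(alert_nodes_fg.is_sar == 0)
)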

In [None]:
# Preview the first rows returned by the query
non_sar_emb_query.show(5)

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View lets us define a schema in the form of a query with filters, specify the model's target feature (label), and attach additional transformation functions.
To create a Feature View we can use `fs.create_feature_view()`.
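The call could look roughly like the sketch below. It assumes an hsfs version (3.0 or later) in which `create_feature_view` accepts `name`, `version`, `query` and `labels`. The feature view name `gan_non_sar_fv`, the default label `target` and the `make_feature_view` wrapper are illustrative assumptions, not names taken from this project.

```python
def make_feature_view(fs, query, label="target"):
    """Create a feature view from a query (sketch; names are illustrative).

    `fs` is a feature store handle such as the one returned by
    `connection.get_feature_store()`, and `query` is an hsfs query object.
    """
    return fs.create_feature_view(
        name="gan_non_sar_fv",  # hypothetical feature view name
        version=1,
        query=query,            # e.g. the query built in the feature selection step
        labels=[label],         # column(s) treated as the model target
    )
```

With a live connection this would be invoked as `feature_view = make_feature_view(fs, non_sar_emb_query)`.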

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
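To make the idea of splits concrete, here is a small plain-Python sketch of a random split by fractions. It is not how Hopsworks implements splitting; `random_split`, the fixed seed and the 70/10/20 fractions are assumptions chosen for illustration.

```python
import random


def random_split(rows, fractions, seed=42):
    """Shuffle rows and cut them into named splits according to `fractions`."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    names = list(fractions)
    splits, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so rounding never drops a row.
            end = len(shuffled)
        else:
            end = start + round(fractions[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


splits = random_split(range(100), {"train": 0.7, "validation": 0.1, "test": 0.2})
```

Every row lands in exactly one split, which is the property that keeps the test set a true holdout.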

A training dataset is created using the `fs.create_training_dataset()` method.

**With the feature view APIs we can also create training datasets based on event time filters by specifying `start_time` and `end_time`.** 
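The effect of an event time filter can be sketched in plain Python as follows. The `event_time` column name, the sample rows and the half-open interval `[start_time, end_time)` are assumptions made for this illustration; how Hopsworks treats the boundaries is not stated here.

```python
from datetime import datetime


def filter_by_event_time(rows, start_time, end_time, event_time_key="event_time"):
    """Keep rows whose event time lies in [start_time, end_time)."""
    return [r for r in rows if start_time <= r[event_time_key] < end_time]


# Hypothetical rows, each carrying an event timestamp.
rows = [
    {"id": 1, "event_time": datetime(2022, 1, 15)},
    {"id": 2, "event_time": datetime(2022, 2, 10)},
    {"id": 3, "event_time": datetime(2022, 3, 5)},
]

# Select only the events that happened in February 2022.
selected = filter_by_event_time(rows, datetime(2022, 2, 1), datetime(2022, 3, 1))
```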



## <span style="color:#ff5f27;"> 🪝 Training Dataset Retrieval </span>

To retrieve training data, either from storage (already materialised) or directly from feature groups, we can use the `get_training_dataset_splits` or `get_training_dataset` methods. If no version is provided, or the provided version does not exist yet, a new version of the training data is created according to the given arguments and returned as a dataframe. If the provided version already exists, the training data is read from storage or from the feature groups and returned as a dataframe. If a split is provided, only that split is read.
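The version logic described above can be modelled with a toy in-memory registry. This is only a sketch of the described behaviour, not the hsfs implementation: `TrainingDataRegistry` and `build_splits` are hypothetical, and assigning the next free number when no version is given is an assumption.

```python
class TrainingDataRegistry:
    """Toy in-memory stand-in for versioned training data."""

    def __init__(self):
        self._versions = {}  # version number -> {split name: rows}

    def get_training_dataset(self, build_fn, version=None, split=None):
        if version is None:
            # No version requested: allocate the next free version number.
            version = max(self._versions, default=0) + 1
        if version not in self._versions:
            # Unknown version: materialise the data once and remember it.
            self._versions[version] = build_fn()
        data = self._versions[version]
        # Return everything, or only the requested split.
        return version, (data if split is None else data[split])


def build_splits():
    """Hypothetical builder standing in for running the feature view query."""
    return {"train": [1, 2, 3, 4], "test": [5]}


registry = TrainingDataRegistry()
created_version, all_splits = registry.get_training_dataset(build_splits)
_, test_rows = registry.get_training_dataset(
    build_splits, version=created_version, split="test"
)
```

The second call reuses the stored version instead of rebuilding it, which mirrors reading an already materialised training dataset.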

In [6]:
non_sar_td = fs.create_training_dataset(name="gan_non_sar_training_df",
                                       version=1,
                                       data_format="tfrecord",
                                       label=["is_sar"], 
                                       statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False}, 
                                       splits={'train': 0.8, 'test': 0.2},
                                       coalesce=True,
                                       description="non sar dataset for gan training")
non_sar_td.save(non_sar_emb_query)



## For testing and evaluation we will include known SAR nodes to measure anomaly score  

In [7]:
non_sar_td = fs.get_training_dataset("gan_non_sar_training_df", 1)
non_sar_test_df = non_sar_td.read(split="test")

In [8]:
sar_emb_query = node_embeddings_fg.select(["embedding"])\
                                  .join(alert_nodes_fg.select(["is_sar"])\
                                  .filter(alert_nodes_fg.is_sar == 1))

In [9]:
sar_df = sar_emb_query.read()
sar_df = sar_df.select(*non_sar_test_df.columns)
eval_df = non_sar_test_df.union(sar_df)
eval_df.cache()
eval_df.show()

+--------------------+------+
|           embedding|is_sar|
+--------------------+------+
|[-0.9998507499694...|     0|
|[-0.9986557960510...|     0|
|[-0.9984421730041...|     0|
|[-0.9970214366912...|     0|
|[-0.9947502613067...|     0|
|[-0.9934816360473...|     0|
|[-0.9908211231231...|     0|
|[-0.9882340431213...|     0|
|[-0.9830579757690...|     0|
|[-0.9823658466339...|     0|
|[-0.9815602302551...|     0|
|[-0.9814951419830...|     0|
|[-0.9812114238739...|     0|
|[-0.9809970855712...|     0|
|[-0.9808425903320...|     0|
|[-0.9780859947204...|     0|
|[-0.9746136665344...|     0|
|[-0.9715352058410...|     0|
|[-0.9711296558380...|     0|
|[-0.9682199954986...|     0|
+--------------------+------+
only showing top 20 rows

In [10]:
non_sar_test_df.count()

1267

In [11]:
sar_df.count()

816

In [12]:
eval_df.count()

2083

In [None]:
gan_eval_ds = fs.create_training_dataset(name="gan_eval_df",
                                       version=1,
                                       data_format="tfrecord",
                                       label=["is_sar"], 
                                       statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False}, 
                                       coalesce=True,
                                       description="evaluation dataset for gan training")
gan_eval_ds.save(eval_df)

## Training dataset provenance
![Training dataset provenance](./images/provenance_td.png)