# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from feature groups and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on,
2. **Define how the features should be preprocessed,**
3. **Create a dataset** for training an anomaly detection model.

![tutorial-flow](images/02_training-dataset.png) 

### Create a connection to hsfs

In [1]:
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

### We start by selecting all the features we want to include for model training/inference.

In [2]:
# Retrieve the feature groups from hsfs
transactions_monthly_fg = fs.get_feature_group("transactions_monthly_fg", 1)
graph_embeddings_fg = fs.get_feature_group("graph_embeddings_fg", 1) 
party_fg = fs.get_feature_group("party_fg", 1)

In [3]:
# AML model query 
aml_model_query = party_fg.select(["type", "is_sar"])\
                            .join(transactions_monthly_fg.select(["monthly_in_count", 
                                                                  "monthly_in_total_amount", 
                                                                  "monthly_in_mean_amount", 
                                                                  "monthly_in_std_amount", 
                                                                  "monthly_out_count", 
                                                                  "monthly_out_total_amount", 
                                                                  "monthly_out_mean_amount", 
                                                                  "monthly_out_std_amount"]))\
                            .join(graph_embeddings_fg.select(["graph_embeddings"]))


In [None]:
# Uncomment this line if you would like to see the query results
#aml_model_query.show(5)

### Transformation Functions
Transformation functions are mathematical mappings of input data that may be stateful, requiring statistics from the parent feature view (such as the number of instances of a category, or the mean value of a numerical feature).

We will preprocess our data using *min-max scaling* on numerical features; categorical features would use *label encoding*, although the cell below only maps the numerical features. To do this, we simply define a mapping between our features and transformation functions. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which prevents data leakage.

In [5]:
# Load built in transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")

# Map features to transformations.
transformation_functions = {
    "monthly_in_count": min_max_scaler,
    "monthly_in_total_amount": min_max_scaler,
    "monthly_in_mean_amount": min_max_scaler,
    "monthly_in_std_amount": min_max_scaler,
    "monthly_out_count": min_max_scaler,
    "monthly_out_total_amount": min_max_scaler,
    "monthly_out_mean_amount": min_max_scaler,
    "monthly_out_std_amount": min_max_scaler
}
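
To make the data leakage point concrete, here is a minimal sketch of min-max scaling in plain Python. It is not the Hopsworks implementation; the helper names `fit_min_max` and `apply_min_max` and the sample values are hypothetical. The statistics are computed on the training split only and then reused for the test split.

```python
# Illustrative sketch only: plain-Python min-max scaling,
# not the built-in Hopsworks transformation function.

def fit_min_max(values):
    """Compute the statistics (min and max) from the training split only."""
    return min(values), max(values)


def apply_min_max(values, stats):
    """Scale values with previously fitted statistics."""
    lo, hi = stats
    span = hi - lo
    if span == 0:
        # Constant feature: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical values for a feature such as monthly_in_count.
train_values = [2.0, 4.0, 10.0]
test_values = [6.0, 12.0]

stats = fit_min_max(train_values)                  # fitted on training data only
train_scaled = apply_min_max(train_values, stats)
test_scaled = apply_min_max(test_values, stats)    # reuses the training statistics
```

Because the test split reuses the training statistics, a test value above the training maximum scales to more than 1.0. That is expected and is exactly what keeps information from the test data out of the fitted scaler.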

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

In Hopsworks, you write features to feature groups (where the features are stored) and you read features from feature views. A feature view is a logical view over features stored in feature groups, and it typically contains the features used by a specific model. This way, feature views enable features stored in different feature groups to be reused across many different models. A feature view defines its schema as a query (optionally with filters), the model's target feature (label), and any additional transformation functions.
To create a feature view, we can use `fs.create_feature_view()`.

In [7]:
feature_view = fs.create_feature_view(
    name='aml_feature_view',
    query=aml_model_query,
    labels=["is_sar"],
    transformation_functions=transformation_functions
)

Feature view created successfully, explore it at 
https://hopsworks0.logicalclocks.com/p/119/fs/67/fv/aml_feature_view/version/1


## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (the set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
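
As a rough illustration of the three splits listed above, the following sketch shuffles a list of rows and cuts it into training, validation and test subsets. It is plain Python rather than the Hopsworks API; the function `split_rows` and its parameters `validation_size`, `test_size` and `seed` are hypothetical names chosen for this example.

```python
import random


def split_rows(rows, validation_size=0.2, test_size=0.2, seed=42):
    """Shuffle rows and cut them into train, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_size)
    n_val = int(len(shuffled) * validation_size)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


rows = list(range(10))  # hypothetical row ids
train, validation, test = split_rows(rows)
```

Every row lands in exactly one subset, so the test set stays a true holdout.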

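A training dataset is created using the `feature_view.create_training_data()` or `feature_view.get_training_data()` methods. The cell below calls `feature_view.training_data()`, which returns the features and labels as `train_x` and `train_y`.

**With the feature view APIs we can also create training datasets based on event-time filters by specifying `start_time` and `end_time`.** 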
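
Conceptually, an event-time filter keeps only the rows whose timestamp falls inside a window. The sketch below shows the idea in plain Python; it is not the Hopsworks API. The function `filter_by_event_time` is hypothetical, the sample rows are made up, and treating both bounds as inclusive is an assumption of this example.

```python
from datetime import datetime


def filter_by_event_time(rows, start_time, end_time):
    """Keep rows whose event time falls within [start_time, end_time]."""
    return [r for r in rows if start_time <= r["tran_timestamp"] <= end_time]


# Hypothetical rows with an event-time column named tran_timestamp.
rows = [
    {"id": 1, "tran_timestamp": datetime(2022, 1, 15)},
    {"id": 2, "tran_timestamp": datetime(2022, 2, 10)},
    {"id": 3, "tran_timestamp": datetime(2022, 3, 5)},
]

selected = filter_by_event_time(
    rows,
    start_time=datetime(2022, 2, 1),
    end_time=datetime(2022, 3, 1),
)
```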



In [9]:
train_x, train_y = feature_view.training_data(description='aml training dataset')

2022-06-19 13:30:16,871 INFO: USE `aml_demo_featurestore`
2022-06-19 13:30:17,660 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg2`.`type` `type`, `fg2`.`is_sar` `is_sar`, `fg2`.`id` `join_pk_id`, `fg2`.`tran_timestamp` `join_evt_tran_timestamp`, `fg0`.`monthly_in_count` `monthly_in_count`, `fg0`.`monthly_in_total_amount` `monthly_in_total_amount`, `fg0`.`monthly_in_mean_amount` `monthly_in_mean_amount`, `fg0`.`monthly_in_std_amount` `monthly_in_std_amount`, `fg0`.`monthly_out_count` `monthly_out_count`, `fg0`.`monthly_out_total_amount` `monthly_out_total_amount`, `fg0`.`monthly_out_mean_amount` `monthly_out_mean_amount`, `fg0`.`monthly_out_std_amount` `monthly_out_std_amount`, RANK() OVER (PARTITION BY `fg2`.`id`, `fg2`.`tran_timestamp` ORDER BY `fg0`.`tran_timestamp` DESC) pit_rank_hopsworks
FROM `aml_demo_featurestore`.`party_fg_1` `fg2`
INNER JOIN `aml_demo_featurestore`.`transactions_monthly_fg_1` `fg0` ON `fg2`.`id` = `fg0`.`id` AND `fg2`.`tran_timestamp` >= `fg0`.`tran_timest
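
The logged query (truncated above) joins the feature groups on `id`, requires the left `tran_timestamp` to be at least the right one, and ranks the matches by the right-hand timestamp in descending order. This looks like a point-in-time join, where each row presumably keeps only the top-ranked match. The sketch below illustrates that idea in plain Python under this assumption; the function `point_in_time_join` and the sample rows are hypothetical, and integer timestamps are used for brevity.

```python
def point_in_time_join(left_rows, right_rows):
    """For each left row, attach the latest right row with the same id
    whose timestamp is not later than the left row's timestamp."""
    joined = []
    for left in left_rows:
        candidates = [
            r for r in right_rows
            if r["id"] == left["id"]
            and r["tran_timestamp"] <= left["tran_timestamp"]
        ]
        if not candidates:
            continue  # inner join: drop left rows without a match
        # Equivalent to keeping the row ranked first by descending timestamp.
        latest = max(candidates, key=lambda r: r["tran_timestamp"])
        joined.append({**left, "monthly_in_count": latest["monthly_in_count"]})
    return joined


# Hypothetical rows; timestamps are plain integers for brevity.
party = [
    {"id": 1, "tran_timestamp": 3, "is_sar": 0},
    {"id": 2, "tran_timestamp": 1, "is_sar": 1},
]
monthly = [
    {"id": 1, "tran_timestamp": 1, "monthly_in_count": 5},
    {"id": 1, "tran_timestamp": 2, "monthly_in_count": 7},
    {"id": 1, "tran_timestamp": 4, "monthly_in_count": 9},
    {"id": 2, "tran_timestamp": 2, "monthly_in_count": 3},
]

result = point_in_time_join(party, monthly)
```

Joining this way means a label row never sees feature values recorded after its own event time, which avoids leaking future information into the training data.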



## <span style="color:#ff5f27;"> 👓 Exploration </span>

### Similar to Feature Groups, Feature Views and Training Datasets are now accessible and searchable in the UI
![fv-overview](images/feature_views_explore.gif)

## 📊 Statistics
We can explore feature statistics in the feature views. 

![fv-stats](images/feature_view_stats.gif)
