# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on.
2. **How to preprocess the features.**
3. **Feature Views.**
4. **Training Datasets.**

![tutorial-flow](images/02_training-dataset.png) 

---

## 🧑🏻‍🏫 HSFS `Feature Views` and `Training Datasets`

`Feature Views` are the third building block of the Hopsworks Feature Store. A Feature View stores the metadata that describes our dataset.

`Training Datasets` are the fourth building block of the Hopsworks Feature Store.

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecord, CSV, NumPy) and then be fed to a machine learning model for training.

Training datasets can also be stored on external storage systems like Amazon S3 or GCS to be read by external model training platforms.

As in the previous notebook, the first step is to establish a connection with the Hopsworks Feature Store and get the feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">🪝 Retrieving Feature Groups </span>

We can use the **Feature Groups** created in the previous notebook to create a `Feature View`.

> To retrieve a Feature Group we can use the `FeatureStore.get_feature_group()` method.

#### ⛳️ Application Train Feature Group

In [2]:
application_train_fg = fs.get_feature_group(
    name = 'application_train_fg',
    version = 1
)

application_train_fg.select_all().show(5)



2022-06-06 14:46:06,821 INFO: USE `credit_scores_featurestore`
2022-06-06 14:46:07,516 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`target` `target`, `fg0`.`name_contract_type` `name_contract_type`, `fg0`.`code_gender` `code_gender`, `fg0`.`flag_own_car` `flag_own_car`, `fg0`.`flag_own_realty` `flag_own_realty`, `fg0`.`cnt_children` `cnt_children`, `fg0`.`amt_income_total` `amt_income_total`, `fg0`.`amt_credit` `amt_credit`, `fg0`.`amt_annuity` `amt_annuity`, `fg0`.`amt_goods_price` `amt_goods_price`, `fg0`.`name_type_suite` `name_type_suite`, `fg0`.`name_income_type` `name_income_type`, `fg0`.`name_education_type` `name_education_type`, `fg0`.`name_family_status` `name_family_status`, `fg0`.`name_housing_type` `name_housing_type`, `fg0`.`region_population_relative` `region_population_relative`, `fg0`.`days_birth` `days_birth`, `fg0`.`days_employed` `days_employed`, `fg0`.`days_registration` `days_registration`, `fg0`.`days_id_publish` `days_id_publish`, `fg0`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
0,179948,0,Cash loans,F,N,Y,0,90000.0,265851.0,14418.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,278123,0,Cash loans,M,Y,N,1,180000.0,1462720.5,58140.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,448796,0,Cash loans,F,N,Y,0,157500.0,454500.0,23337.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
3,377844,1,Cash loans,M,N,Y,0,157500.0,393543.0,21478.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
4,181715,1,Cash loans,F,N,N,0,135000.0,1024740.0,45270.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


#### ⛳️ Previous Loan Counts Feature Group

In [3]:
previous_loan_counts_fg = fs.get_feature_group(
    name = 'previous_loan_counts_fg',
    version = 1
)

previous_loan_counts_fg.select_all().show(5)



2022-06-06 14:46:16,052 INFO: USE `credit_scores_featurestore`
2022-06-06 14:46:16,741 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`previous_loan_counts` `previous_loan_counts`
FROM `credit_scores_featurestore`.`previous_loan_counts_fg_1` `fg0`


Unnamed: 0,sk_id_curr,previous_loan_counts
0,179948,1
1,111658,8
2,206938,1
3,155506,7
4,168624,8


---

## <span style="color:#ff5f27;">🕵🏻‍♂️ Feature Groups Investigation</span>

We can use the `FeatureGroup.show()` method to display the first n rows.

We also use the `read()` method to execute a query and return the result as a DataFrame. The following methods select which features and rows go into a query:

- `FeatureGroup.get_feature()` to get a specific feature from our Feature Group.

- `FeatureGroup.select()` to get a few features from our Feature Group.

- `FeatureGroup.select_all()` to get all features from our Feature Group.

- `FeatureGroup.select_except()` to get all features except a few from our Feature Group.

- `FeatureGroup.filter()` to apply a specific filter to the Feature Group.
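
As a rough mental model, the selection and filter methods listed above behave like column projection and row filtering on a table. The sketch below uses plain Python rather than the hsfs API; the helper names are hypothetical, and the sample rows reuse a few values from the output shown earlier.

```python
# Conceptual sketch only: plain-Python stand-ins for the query-building
# methods listed above. None of these helpers are part of hsfs.

rows = [
    {"sk_id_curr": 179948, "target": 0, "amt_credit": 265851.0},
    {"sk_id_curr": 278123, "target": 0, "amt_credit": 1462720.5},
    {"sk_id_curr": 377844, "target": 1, "amt_credit": 393543.0},
]


def select(rows, features):
    """Keep only the named features (the idea behind select)."""
    return [{name: row[name] for name in features} for row in rows]


def select_except(rows, features):
    """Keep every feature except the named ones (the idea behind select_except)."""
    excluded = set(features)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def filter_rows(rows, predicate):
    """Keep only the rows that satisfy a condition (the idea behind filter)."""
    return [row for row in rows if predicate(row)]


def show(rows, n):
    """Return the first n rows (the idea behind show)."""
    return rows[:n]


defaults = filter_rows(rows, lambda row: row["target"] == 1)
ids_only = select(rows, ["sk_id_curr"])
without_target = select_except(rows, ["target"])
```

One difference worth keeping in mind: in the notebook, `select_all()` returns a `Query` object rather than data, and rows are only fetched when `read()` or `show()` is called.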

In [4]:
application_train_fg



<hsfs.feature_group.FeatureGroup at 0x7f1236f01d30>

In [5]:
application_train_fg.select_all()

<hsfs.constructor.query.Query at 0x7f1165e429d0>

In [6]:
application_train_fg.read().head()

2022-06-06 14:46:18,574 INFO: USE `credit_scores_featurestore`
2022-06-06 14:46:19,262 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`target` `target`, `fg0`.`name_contract_type` `name_contract_type`, `fg0`.`code_gender` `code_gender`, `fg0`.`flag_own_car` `flag_own_car`, `fg0`.`flag_own_realty` `flag_own_realty`, `fg0`.`cnt_children` `cnt_children`, `fg0`.`amt_income_total` `amt_income_total`, `fg0`.`amt_credit` `amt_credit`, `fg0`.`amt_annuity` `amt_annuity`, `fg0`.`amt_goods_price` `amt_goods_price`, `fg0`.`name_type_suite` `name_type_suite`, `fg0`.`name_income_type` `name_income_type`, `fg0`.`name_education_type` `name_education_type`, `fg0`.`name_family_status` `name_family_status`, `fg0`.`name_housing_type` `name_housing_type`, `fg0`.`region_population_relative` `region_population_relative`, `fg0`.`days_birth` `days_birth`, `fg0`.`days_employed` `days_employed`, `fg0`.`days_registration` `days_registration`, `fg0`.`days_id_publish` `days_id_publish`, `fg0`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
0,179948,0,Cash loans,F,N,Y,0,90000.0,265851.0,14418.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,278123,0,Cash loans,M,Y,N,1,180000.0,1462720.5,58140.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,448796,0,Cash loans,F,N,Y,0,157500.0,454500.0,23337.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
3,377844,1,Cash loans,M,N,Y,0,157500.0,393543.0,21478.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
4,181715,1,Cash loans,F,N,N,0,135000.0,1024740.0,45270.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


---

## <span style="color:#ff5f27;">🤖 Transformation functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to training datasets.

Hopsworks Feature Store also comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.
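
To make the scalers concrete, here is a minimal sketch of the textbook formulas behind min-max and standard scaling. It is written in plain Python and is not the hsfs implementation; the function names are hypothetical, and the sample values are the `amt_income_total` figures from the rows shown earlier.

```python
# Illustrative only: the textbook formulas behind min-max and standard
# scaling. This is not the hsfs implementation of these transformations.
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [90000.0, 180000.0, 157500.0, 157500.0, 135000.0]
scaled = min_max_scale(incomes)
standardized = standard_scale(incomes)
```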

In [7]:
[t_func.name for t_func in fs.get_transformation_functions()]



['standard_scaler', 'min_max_scaler', 'label_encoder', 'robust_scaler']

We can retrieve the transformation function we need by name.

To attach transformation functions to a training dataset, provide them as a dict, where the key is the feature name and the value is the transformation function to apply to it.

The training dataset must also be created from a `Query` object. Once attached, the transformation functions are applied whenever the `save`, `insert` and `get_serving_vector` methods are called on the training dataset object.
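
The reason for attaching a transformation instead of applying it by hand is that the same mapping is then used for training data and for serving vectors. The sketch below shows that idea with a generic label encoder; it is not the hsfs `label_encoder`, the function names are hypothetical, and the category values are ones that occur in this dataset.

```python
# Illustrative only: a generic label encoder, not the hsfs `label_encoder`.
# The point is that the mapping learned from the training data is reused
# unchanged when a single feature value is encoded at serving time.


def fit_label_encoder(values):
    """Map each distinct category to an integer code (sorted for stability)."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(value, mapping):
    """Encode one value with a previously fitted mapping."""
    return mapping[value]


# Hypothetical training column.
name_contract_type = ["Cash loans", "Revolving loans", "Cash loans"]
mapping = fit_label_encoder(name_contract_type)

# Training time: encode the whole column.
encoded_column = [encode(v, mapping) for v in name_contract_type]

# Serving time: encode a single incoming value with the same mapping.
encoded_request = encode("Revolving loans", mapping)
```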

In [8]:
application_train_df = application_train_fg.read()
application_train_df.head()

2022-06-06 14:46:28,303 INFO: USE `credit_scores_featurestore`
2022-06-06 14:46:28,992 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`target` `target`, `fg0`.`name_contract_type` `name_contract_type`, `fg0`.`code_gender` `code_gender`, `fg0`.`flag_own_car` `flag_own_car`, `fg0`.`flag_own_realty` `flag_own_realty`, `fg0`.`cnt_children` `cnt_children`, `fg0`.`amt_income_total` `amt_income_total`, `fg0`.`amt_credit` `amt_credit`, `fg0`.`amt_annuity` `amt_annuity`, `fg0`.`amt_goods_price` `amt_goods_price`, `fg0`.`name_type_suite` `name_type_suite`, `fg0`.`name_income_type` `name_income_type`, `fg0`.`name_education_type` `name_education_type`, `fg0`.`name_family_status` `name_family_status`, `fg0`.`name_housing_type` `name_housing_type`, `fg0`.`region_population_relative` `region_population_relative`, `fg0`.`days_birth` `days_birth`, `fg0`.`days_employed` `days_employed`, `fg0`.`days_registration` `days_registration`, `fg0`.`days_id_publish` `days_id_publish`, `fg0`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
0,179948,0,Cash loans,F,N,Y,0,90000.0,265851.0,14418.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,278123,0,Cash loans,M,Y,N,1,180000.0,1462720.5,58140.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,448796,0,Cash loans,F,N,Y,0,157500.0,454500.0,23337.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
3,377844,1,Cash loans,M,N,Y,0,157500.0,393543.0,21478.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
4,181715,1,Cash loans,F,N,N,0,135000.0,1024740.0,45270.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


In [9]:
label_encoder = fs.get_transformation_function(name = 'label_encoder')

mapping_transformer = {}
cat_cols = application_train_df.dtypes[application_train_df.dtypes == 'object'].index

for col in cat_cols:
    mapping_transformer[col] = label_encoder

mapping_transformer



{'name_contract_type': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'code_gender': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'flag_own_car': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'flag_own_realty': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'name_type_suite': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'name_income_type': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'name_education_type': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'name_family_status': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'name_housing_type': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'weekday_appr_process_start': <hsfs.transformation_function.TransformationFunction at 0x7f1236f1c370>,
 'organization_type': <hsfs.transformation_function.T

---

## <span style="color:#ff5f27;"> 💼 Query Preparation</span>

In [10]:
fg_query = application_train_fg.select_all()\
                        .join(
                            previous_loan_counts_fg.select_all()
                        )
fg_query.show(5)

2022-06-06 14:46:37,121 INFO: USE `credit_scores_featurestore`
2022-06-06 14:46:37,801 INFO: SELECT `fg1`.`sk_id_curr` `sk_id_curr`, `fg1`.`target` `target`, `fg1`.`name_contract_type` `name_contract_type`, `fg1`.`code_gender` `code_gender`, `fg1`.`flag_own_car` `flag_own_car`, `fg1`.`flag_own_realty` `flag_own_realty`, `fg1`.`cnt_children` `cnt_children`, `fg1`.`amt_income_total` `amt_income_total`, `fg1`.`amt_credit` `amt_credit`, `fg1`.`amt_annuity` `amt_annuity`, `fg1`.`amt_goods_price` `amt_goods_price`, `fg1`.`name_type_suite` `name_type_suite`, `fg1`.`name_income_type` `name_income_type`, `fg1`.`name_education_type` `name_education_type`, `fg1`.`name_family_status` `name_family_status`, `fg1`.`name_housing_type` `name_housing_type`, `fg1`.`region_population_relative` `region_population_relative`, `fg1`.`days_birth` `days_birth`, `fg1`.`days_employed` `days_employed`, `fg1`.`days_registration` `days_registration`, `fg1`.`days_id_publish` `days_id_publish`, `fg1`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year,previous_loan_counts
0,140282,0,Cash loans,F,N,N,0,112500.0,1046142.0,30717.0,...,0,0,0,0.0,0.0,0.0,0.0,2.0,1.0,2
1,196458,0,Revolving loans,F,N,Y,0,76500.0,225000.0,11250.0,...,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0,3
2,194978,0,Cash loans,M,Y,Y,0,112500.0,1585224.0,42637.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,8
3,118340,0,Cash loans,F,N,Y,1,135000.0,1147500.0,31684.5,...,0,0,0,0.0,0.0,0.0,0.0,2.0,5.0,3
4,205876,0,Cash loans,F,N,N,0,225000.0,396000.0,26505.0,...,0,0,0,0.0,0.0,0.0,1.0,2.0,2.0,3
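
The query above combines the two feature groups into one table that gains a `previous_loan_counts` column. As a sketch of what such a join does, assuming rows are matched on the shared `sk_id_curr` key and unmatched rows are dropped (an inner join), here is a plain-Python version. The helper and the sample rows are made up for illustration.

```python
# Conceptual sketch only: an inner join of two small tables on a shared key.
# Assumption: the feature group join matches rows on `sk_id_curr`; the rows
# below are made-up samples, not data from the feature store.

applications = [
    {"sk_id_curr": 1, "target": 0, "amt_credit": 1000.0},
    {"sk_id_curr": 2, "target": 1, "amt_credit": 2500.0},
    {"sk_id_curr": 3, "target": 0, "amt_credit": 400.0},
]
loan_counts = [
    {"sk_id_curr": 1, "previous_loan_counts": 2},
    {"sk_id_curr": 3, "previous_loan_counts": 8},
]


def inner_join(left, right, key):
    """Return left rows extended with the matching right row, if any."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = inner_join(applications, loan_counts, "sk_id_curr")
```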


---

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

`Feature Views` stand between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create **Feature Views**, which store the metadata of our data. From **Feature Views** we can then create **Training Datasets**.

A Feature View defines a schema in the form of a query with filters, the model's target feature/label, and additional transformation functions.
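
One way to picture "metadata only" is a small record that holds the definition of the data but none of its rows. The dataclass below is a hypothetical sketch of that idea; the class, its fields, and the placeholder transformation are assumptions for illustration and are not the hsfs `FeatureView` class.

```python
# Conceptual sketch only: a feature view holds a *definition* of the data
# (query, label, transformations), not the rows themselves. The class and
# field names here are hypothetical and are not the hsfs FeatureView class.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureViewSketch:
    name: str
    version: int
    query: str  # a plain description here; hsfs keeps a real Query object
    labels: List[str] = field(default_factory=list)
    transformation_functions: Dict[str, Callable] = field(default_factory=dict)


view = FeatureViewSketch(
    name="train_data",
    version=1,
    query="application_train_fg joined with previous_loan_counts_fg",
    labels=["target"],
    # Placeholder transformation, standing in for e.g. a label encoder.
    transformation_functions={"code_gender": str.lower},
)
```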

To create a Feature View we can use the `FeatureStore.create_feature_view()` method.

In [11]:
feature_view = fs.create_feature_view(
    name = 'train_data',
    version = 1,
    transformation_functions = mapping_transformer,
    query = fg_query
)



Feature view created successfully, explore it at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/119/fs/67/fv/train_data/version/1


In [12]:
feature_view

<hsfs.feature_view.FeatureView at 0x7f1166a1f9a0>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [13]:
feature_view = fs.get_feature_view(
    name = 'train_data',
    version = 1
)

In [14]:
feature_view.version

1

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset we use the `FeatureView.create_training_dataset()` method.

Here are some important things to note:

- The training dataset inherits the name of its Feature View.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format using the **data_format** parameter.

- We can specify the split ratio using the **splits** parameter.

- The **train_split** parameter specifies which split is used for training.
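
To illustrate what a split ratio such as `{'train': 80, 'validation': 20}` means, here is a minimal sketch of a seeded random split by relative weights. How Hopsworks assigns rows to splits internally is not shown in this notebook, so treat the helper below as a hypothetical stand-in.

```python
# Illustrative only: a seeded random split by relative weights. How Hopsworks
# assigns rows to splits internally is not shown in this notebook.
import random


def split_rows(rows, splits, seed=0):
    """Shuffle rows and cut them into named splits by relative weight."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    total = sum(splits.values())
    result, start, cumulative = {}, 0, 0
    for name, weight in splits.items():
        cumulative += weight
        # Cut at the cumulative share so every row lands in exactly one split.
        end = round(len(shuffled) * cumulative / total)
        result[name] = shuffled[start:end]
        start = end
    return result


parts = split_rows(range(100), {"train": 80, "validation": 20})
```

Fixing the seed keeps the split reproducible, which matters when comparing models trained on the same data.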

In [15]:
feature_view.create_training_dataset(
    version = 1,
    description = 'training_dataset',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",
    # statistics_config = {"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False},
    write_options = {'wait_for_job': False}
)

Training dataset job started successfully, you can follow the progress at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/119/jobs/named/train_data_1_1_create_fv_td_06062022144711/executions


(1, <hsfs.core.job.Job at 0x7f116369e550>)

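Our dataset is now split into two parts: **train (80% of the original dataset)** and **validation (20% of the original dataset)**. Because `wait_for_job` is set to `False`, the call returns as soon as the job has started, so the splits only become available once that job finishes.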

---

## <span style="color:#ff5f27;"> 🪝 Training Dataset Retrieval </span>

To retrieve training data, either from storage (if already materialised) or directly from the feature groups, we can use the `get_training_dataset_splits` or `get_training_dataset` methods. If no version is provided, or the provided version does not exist yet, the method creates a new version of the training data according to the given arguments and returns a dataframe. If the provided version already exists, it reads the training data from storage or from the feature groups and returns a dataframe. If a split is provided, it reads only that split.
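
The "read the version if it exists, otherwise create it" behaviour can be sketched with an in-memory dict. Everything below is hypothetical and does not mirror hsfs internals; in particular, how a newly created version is numbered is an assumption.

```python
# Conceptual sketch only: the "read if the version exists, otherwise create
# it" behaviour described above, modelled with an in-memory dict. The names
# are hypothetical and do not mirror the hsfs internals.

_storage = {}  # version -> materialised rows


def get_training_dataset(compute_rows, version=None):
    """Return (version, rows), creating the version when it is missing."""
    if version is not None and version in _storage:
        return version, _storage[version]  # existing version: just read it
    # Assumption: a new version gets the next free number.
    new_version = max(_storage, default=0) + 1
    _storage[new_version] = compute_rows()
    return new_version, _storage[new_version]


calls = []


def compute_rows():
    """Stand-in for running the feature view query."""
    calls.append("computed")
    return [{"sk_id_curr": 1, "target": 0}]


first_version, first_rows = get_training_dataset(compute_rows)
second_version, second_rows = get_training_dataset(compute_rows, version=first_version)
```

The second call reuses the stored rows instead of recomputing them, which is the benefit of asking for an existing version.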

In [16]:
td_version, df = feature_view.get_training_dataset_splits(
    splits = {'train': 80, 'validation': 20},
    version = 1
)

df



ValueError: No objects to concatenate

In [None]:
# td_version, df = feature_view.get_training_dataset()

# df.head()

---

## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook, we will train a model on the dataset we created in this notebook.

---