# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-weight:bold; font-size: 1.4rem;">This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on.
2. **Preprocess the features** with transformation functions.
3. **Create a Feature View.**
4. **Create Training Datasets.**

![tutorial-flow](images/02_training-dataset.png) 

---

## <span style="color:#ff5f27;">🧑🏻‍🏫 HSFS Feature Views and Training Datasets</span>

`Feature Views` are the third building block of the Hopsworks Feature Store. A Feature View stores the metadata of our dataset.

`Training datasets` are the fourth building block of the Hopsworks Feature Store.

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecord, CSV, NumPy) and then fed to a machine learning model for training.

Training datasets can also be stored on external storage systems such as Amazon S3 or GCS, to be read by external model training platforms.

As with the previous notebooks, the first step is to establish a connection with the Hopsworks feature store and get the feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">🪝 Retrieving Feature Groups </span>

We can use the **Feature Groups** created in the previous notebook to create a `Feature View`.

> To retrieve a Feature Group, we can use the `FeatureStore.get_or_create_feature_group()` method.

#### <span style="color:#ff5f27;">⛳️ Application Train Feature Group</span>

In [2]:
application_train_fg = fs.get_or_create_feature_group(
    name = 'application_train_fg',
    version = 1
)

application_train_fg.select_all().show(5)

2022-06-21 12:52:29,006 INFO: USE `credit_scores_featurestore`
2022-06-21 12:52:29,907 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`target` `target`, `fg0`.`name_contract_type` `name_contract_type`, `fg0`.`code_gender` `code_gender`, `fg0`.`flag_own_car` `flag_own_car`, `fg0`.`flag_own_realty` `flag_own_realty`, `fg0`.`cnt_children` `cnt_children`, `fg0`.`amt_income_total` `amt_income_total`, `fg0`.`amt_credit` `amt_credit`, `fg0`.`amt_annuity` `amt_annuity`, `fg0`.`amt_goods_price` `amt_goods_price`, `fg0`.`name_type_suite` `name_type_suite`, `fg0`.`name_income_type` `name_income_type`, `fg0`.`name_education_type` `name_education_type`, `fg0`.`name_family_status` `name_family_status`, `fg0`.`name_housing_type` `name_housing_type`, `fg0`.`region_population_relative` `region_population_relative`, `fg0`.`days_birth` `days_birth`, `fg0`.`days_employed` `days_employed`, `fg0`.`days_registration` `days_registration`, `fg0`.`days_id_publish` `days_id_publish`, `fg0`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
0,104385,0,Cash loans,F,Y,Y,0,162000.0,211500.0,10795.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
1,107431,0,Cash loans,F,N,Y,2,270000.0,997335.0,29290.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,177264,0,Cash loans,F,Y,Y,3,67500.0,254700.0,17149.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
3,257454,0,Cash loans,F,N,N,0,135000.0,463500.0,15079.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,386717,0,Cash loans,F,N,Y,0,49500.0,276813.0,18166.5,...,0,0,0,0,0.0,0.0,0.0,0.0,2.0,6.0


#### <span style="color:#ff5f27;">⛳️ Previous Loan Counts Feature Group</span>

In [3]:
previous_loan_counts_fg = fs.get_or_create_feature_group(
    name = 'previous_loan_counts_fg',
    version = 1
)

previous_loan_counts_fg.select_all().show(5)

2022-06-21 12:52:47,389 INFO: USE `credit_scores_featurestore`
2022-06-21 12:52:48,249 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`previous_loan_counts` `previous_loan_counts`
FROM `credit_scores_featurestore`.`previous_loan_counts_fg_1` `fg0`


Unnamed: 0,sk_id_curr,previous_loan_counts
0,134517,6
1,107431,5
2,160919,2
3,320291,5
4,343264,3


---

## <span style="color:#ff5f27;">🕵🏻‍♂️ Feature Groups Investigation</span>

We can use the `FeatureGroup.show()` method to display the top n rows.

We can also use the `read()` method **to execute a query** and return its result as a DataFrame. The following methods select the features and rows a query works with:

- `FeatureGroup.get_feature()` to get a specific feature from our Feature Group.

- `FeatureGroup.select()` to get a few features from our Feature Group.

- `FeatureGroup.select_all()` to get all features from our Feature Group.

- `FeatureGroup.select_except()` to get all features except a few from our Feature Group.

- `FeatureGroup.filter()` to apply a specific filter to the Feature Group.
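To make the difference between these methods concrete, here is a minimal plain-Python sketch of what each selection does to a table of rows. It is only an analogue, not the HSFS API: the rows and the helper names `select`, `select_except` and `filter_rows` are hypothetical and exist only for this illustration.

```python
# Plain-Python analogue of the query-building methods; NOT the HSFS API.
# The rows and helper names below are hypothetical, for illustration only.
rows = [
    {"sk_id_curr": 1, "target": 0, "cnt_children": 0, "amt_credit": 211500.0},
    {"sk_id_curr": 2, "target": 0, "cnt_children": 2, "amt_credit": 997335.0},
    {"sk_id_curr": 3, "target": 1, "cnt_children": 3, "amt_credit": 254700.0},
]


def select(rows, features):
    """Keep only the named features (analogue of select)."""
    return [{name: row[name] for name in features} for row in rows]


def select_except(rows, features):
    """Keep everything but the named features (analogue of select_except)."""
    return [{k: v for k, v in row.items() if k not in features} for row in rows]


def filter_rows(rows, predicate):
    """Keep the rows for which the predicate holds (analogue of filter)."""
    return [row for row in rows if predicate(row)]


few = select(rows, ["sk_id_curr", "amt_credit"])
without_target = select_except(rows, ["target"])
with_children = filter_rows(rows, lambda row: row["cnt_children"] > 0)

print(few[0])
print(sorted(without_target[0]))
print([row["sk_id_curr"] for row in with_children])
```

One difference worth keeping in mind: the helpers above run immediately, whereas `select_all()` in the cells that follow returns a `Query` object, and data is only fetched once `show()` or `read()` is called.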

In [4]:
application_train_fg

<hsfs.feature_group.FeatureGroup at 0x7f26782ce0a0>

In [5]:
application_train_fg.select_all()

<hsfs.constructor.query.Query at 0x7f2678319820>

In [6]:
application_train_fg.read().head()

2022-06-19 21:28:28,483 INFO: USE `credit_scores_featurestore`
2022-06-19 21:28:29,328 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`target` `target`, `fg0`.`name_contract_type` `name_contract_type`, `fg0`.`code_gender` `code_gender`, `fg0`.`flag_own_car` `flag_own_car`, `fg0`.`flag_own_realty` `flag_own_realty`, `fg0`.`cnt_children` `cnt_children`, `fg0`.`amt_income_total` `amt_income_total`, `fg0`.`amt_credit` `amt_credit`, `fg0`.`amt_annuity` `amt_annuity`, `fg0`.`amt_goods_price` `amt_goods_price`, `fg0`.`name_type_suite` `name_type_suite`, `fg0`.`name_income_type` `name_income_type`, `fg0`.`name_education_type` `name_education_type`, `fg0`.`name_family_status` `name_family_status`, `fg0`.`name_housing_type` `name_housing_type`, `fg0`.`region_population_relative` `region_population_relative`, `fg0`.`days_birth` `days_birth`, `fg0`.`days_employed` `days_employed`, `fg0`.`days_registration` `days_registration`, `fg0`.`days_id_publish` `days_id_publish`, `fg0`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
0,104385,0,Cash loans,F,Y,Y,0,162000.0,211500.0,10795.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
1,107431,0,Cash loans,F,N,Y,2,270000.0,997335.0,29290.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,177264,0,Cash loans,F,Y,Y,3,67500.0,254700.0,17149.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
3,257454,0,Cash loans,F,N,N,0,135000.0,463500.0,15079.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,386717,0,Cash loans,F,N,Y,0,49500.0,276813.0,18166.5,...,0,0,0,0,0.0,0.0,0.0,0.0,2.0,6.0


---

## <span style="color:#ff5f27;">🤖 Transformation functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to training datasets.

Hopsworks Feature Store also comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.

In [7]:
[t_func.name for t_func in fs.get_transformation_functions()]

['min_max_scaler', 'standard_scaler', 'robust_scaler', 'label_encoder']
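As a rough guide to what three of these built-in functions compute, the sketch below implements textbook versions of min-max scaling, standard scaling and label encoding in plain Python. It is an illustration under assumptions, not the HSFS implementation: the helper names are hypothetical, the standard scaler uses the population standard deviation, and the label encoder assigns codes in sorted order, which may differ from the codes HSFS assigns. `robust_scaler` is left out.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range.

    Assumes at least two distinct values, otherwise the range is zero.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standard_scale(values):
    """Centre values on zero mean and scale by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def label_encode(values):
    """Map each distinct category to an integer code (sorted order here)."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


credit = [100.0, 200.0, 400.0]  # hypothetical numeric feature
print(min_max_scale(credit))
print(standard_scale(credit))
print(label_encode(["F", "M", "F"]))  # [0, 1, 0]
```

Label encoding is the one used in the rest of this notebook, because the model needs numeric inputs and several features are stored as strings.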

We can retrieve the transformation function we need.

To attach transformation functions to a training dataset, provide them as a dict, where the key is the feature name and the value is the transformation function.

The training dataset must also be created from a Query object. Once attached, the transformation function is applied whenever the `save`, `insert` and `get_serving_vector` methods are called on the training dataset object.

In [8]:
application_train_df = application_train_fg.read()
application_train_df.head()

2022-06-19 21:28:42,874 INFO: USE `credit_scores_featurestore`
2022-06-19 21:28:43,720 INFO: SELECT `fg0`.`sk_id_curr` `sk_id_curr`, `fg0`.`target` `target`, `fg0`.`name_contract_type` `name_contract_type`, `fg0`.`code_gender` `code_gender`, `fg0`.`flag_own_car` `flag_own_car`, `fg0`.`flag_own_realty` `flag_own_realty`, `fg0`.`cnt_children` `cnt_children`, `fg0`.`amt_income_total` `amt_income_total`, `fg0`.`amt_credit` `amt_credit`, `fg0`.`amt_annuity` `amt_annuity`, `fg0`.`amt_goods_price` `amt_goods_price`, `fg0`.`name_type_suite` `name_type_suite`, `fg0`.`name_income_type` `name_income_type`, `fg0`.`name_education_type` `name_education_type`, `fg0`.`name_family_status` `name_family_status`, `fg0`.`name_housing_type` `name_housing_type`, `fg0`.`region_population_relative` `region_population_relative`, `fg0`.`days_birth` `days_birth`, `fg0`.`days_employed` `days_employed`, `fg0`.`days_registration` `days_registration`, `fg0`.`days_id_publish` `days_id_publish`, `fg0`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
0,104385,0,Cash loans,F,Y,Y,0,162000.0,211500.0,10795.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
1,107431,0,Cash loans,F,N,Y,2,270000.0,997335.0,29290.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,177264,0,Cash loans,F,Y,Y,3,67500.0,254700.0,17149.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
3,257454,0,Cash loans,F,N,N,0,135000.0,463500.0,15079.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,386717,0,Cash loans,F,N,Y,0,49500.0,276813.0,18166.5,...,0,0,0,0,0.0,0.0,0.0,0.0,2.0,6.0


In [9]:
label_encoder = fs.get_transformation_function(name = 'label_encoder')

mapping_transformer = {}
cat_cols = application_train_df.dtypes[application_train_df.dtypes == 'object'].index

for col in cat_cols:
    mapping_transformer[col] = label_encoder

mapping_transformer

{'name_contract_type': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'code_gender': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'flag_own_car': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'flag_own_realty': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'name_type_suite': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'name_income_type': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'name_education_type': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'name_family_status': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'name_housing_type': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'weekday_appr_process_start': <hsfs.transformation_function.TransformationFunction at 0x7f26782f9f70>,
 'organization_type': <hsfs.transformation_function.T

---

## <span style="color:#ff5f27;"> 💼 Query Preparation</span>

In [11]:
fg_query = application_train_fg.select_all()\
                        .join(
                            previous_loan_counts_fg.select_all()
                        )
fg_query.show(5)

2022-06-19 21:40:12,002 INFO: USE `credit_scores_featurestore`
2022-06-19 21:40:12,840 INFO: SELECT `fg1`.`sk_id_curr` `sk_id_curr`, `fg1`.`target` `target`, `fg1`.`name_contract_type` `name_contract_type`, `fg1`.`code_gender` `code_gender`, `fg1`.`flag_own_car` `flag_own_car`, `fg1`.`flag_own_realty` `flag_own_realty`, `fg1`.`cnt_children` `cnt_children`, `fg1`.`amt_income_total` `amt_income_total`, `fg1`.`amt_credit` `amt_credit`, `fg1`.`amt_annuity` `amt_annuity`, `fg1`.`amt_goods_price` `amt_goods_price`, `fg1`.`name_type_suite` `name_type_suite`, `fg1`.`name_income_type` `name_income_type`, `fg1`.`name_education_type` `name_education_type`, `fg1`.`name_family_status` `name_family_status`, `fg1`.`name_housing_type` `name_housing_type`, `fg1`.`region_population_relative` `region_population_relative`, `fg1`.`days_birth` `days_birth`, `fg1`.`days_employed` `days_employed`, `fg1`.`days_registration` `days_registration`, `fg1`.`days_id_publish` `days_id_publish`, `fg1`.`flag_mobil` `fla

Unnamed: 0,sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,...,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year,previous_loan_counts
0,313542,0,Cash loans,F,N,Y,0,180000.0,450000.0,21780.0,...,0,0,0,0.0,0.0,0.0,2.0,0.0,3.0,6
1,436668,0,Cash loans,F,Y,N,1,157500.0,288873.0,9067.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,10
2,273326,1,Cash loans,M,N,N,2,157500.0,482593.5,38259.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,3
3,216905,0,Cash loans,M,N,N,0,81000.0,700830.0,20619.0,...,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0,4
4,298680,0,Cash loans,F,N,N,0,90000.0,234000.0,8820.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0,7
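The query above combines the two feature groups into one table that ends with the `previous_loan_counts` column. As a rough mental model, the sketch below shows an inner join on the shared key `sk_id_curr` in plain Python. The inner join type is an assumption (the logged SQL above is truncated before the join clause), and the rows and the `inner_join` helper are hypothetical.

```python
# Hypothetical rows standing in for the two feature groups.
applications = [
    {"sk_id_curr": 1, "target": 0, "amt_credit": 450000.0},
    {"sk_id_curr": 2, "target": 1, "amt_credit": 482593.5},
    {"sk_id_curr": 3, "target": 0, "amt_credit": 234000.0},
]
loan_counts = [
    {"sk_id_curr": 1, "previous_loan_counts": 6},
    {"sk_id_curr": 3, "previous_loan_counts": 7},
]


def inner_join(left, right, key):
    """Join two row lists on `key`, keeping only keys present in both.

    Assumes `key` is unique within `right`.
    """
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


joined = inner_join(applications, loan_counts, "sk_id_curr")
for row in joined:
    print(row)
```

Under this assumption, an application without a matching loan count (key `2` here) does not appear in the joined result.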


---

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

`Feature Views` stand between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create **Feature Views**, which store the metadata of our data. From a **Feature View** we can then create **Training Datasets**.

A Feature View defines a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.

To create a Feature View we can use the `FeatureStore.create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object with data.

In [12]:
feature_view = fs.create_feature_view(
    name = 'train_data',
    version = 1,
    labels = ['target'],
    transformation_functions = mapping_transformer,
    query = fg_query
)

Feature view created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/120/fs/68/fv/train_data/version/1


In [13]:
feature_view

<hsfs.feature_view.FeatureView at 0x7f25f7584bb0>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [14]:
feature_view = fs.get_feature_view(
    name = 'train_data',
    version = 1
)

In [15]:
feature_view.version

1

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset we use the `FeatureView.create_training_data()` method.

Here are some important things to note:

- The training dataset inherits the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format with the **data_format** parameter.

- We can pass **start_time** and **end_time** to filter the dataset to a specific time range.

In [16]:
feature_view.create_training_data(
    description = 'training_dataset',
    data_format = 'csv'
)

Training dataset job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/120/jobs/named/train_data_1_1_create_fv_td_19062022214718/executions




(1, <hsfs.core.job.Job at 0x7f25eec94cd0>)

### <span style="color:#ff5f27;">🧑🏻‍🔬 Dataset with splits</span>

We can also create a dataset with **train and test** splits, or even with **train, validation and test** splits!

To do so, use `feature_view.create_train_test_split()` or `feature_view.create_train_validation_test_splits()` and specify `test_size` and `val_size`.

In [17]:
feature_view.create_train_test_split(
    test_size = 0.2
)

Training dataset job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/120/jobs/named/train_data_1_2_create_fv_td_19062022214933/executions




(2, <hsfs.core.job.Job at 0x7f25f2059c10>)
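To make `test_size` and `val_size` concrete, here is a minimal plain-Python sketch of a random split by fraction. It is not how Hopsworks performs the split (that happens in the job started above); the helper names `split_train_test` and `split_train_validation_test` and the fixed seed are assumptions made for this illustration.

```python
import random


def split_train_test(rows, test_size, seed=42):
    """Shuffle rows and hold out a `test_size` fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def split_train_validation_test(rows, val_size, test_size, seed=42):
    """Shuffle rows and cut them into train, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


rows = list(range(10))  # hypothetical row ids
train, test = split_train_test(rows, test_size=0.2)
print(len(train), len(test))  # 8 2
```

With `test_size = 0.2`, one row in five is held out for testing and the remaining rows are used for training; every row lands in exactly one split.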

---

## <span style="color:#ff5f27;"> 🪝 Training Dataset Retrieval </span>

In [18]:
X_train, y_train = feature_view.get_training_data(
    training_dataset_version = 1
)



In [19]:
X_train.head()

Unnamed: 0,sk_id_curr,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,amt_goods_price,...,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year,previous_loan_counts
0,104385,1,0,0,0,0,162000.0,211500.0,10795.5,211500.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,3
1,107431,1,0,1,0,2,270000.0,997335.0,29290.5,832500.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,5
2,177264,1,0,0,0,3,67500.0,254700.0,17149.5,225000.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,3
3,257454,1,0,1,1,0,135000.0,463500.0,15079.5,463500.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,7
4,386717,1,0,1,0,0,49500.0,276813.0,18166.5,256500.0,...,0,0,0,0.0,0.0,0.0,0.0,2.0,6.0,4


In [20]:
y_train.head()

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0


In [21]:
X_train.shape

(60772, 72)

In [22]:
y_train.shape

(60772, 1)
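The shapes above reflect the `labels = ['target']` setting of the feature view: the label column is returned separately in `y_train` and does not appear in `X_train`. The sketch below shows the same separation on a couple of hypothetical rows in plain Python; it illustrates the idea and is not the HSFS implementation.

```python
# Hypothetical rows; `labels` plays the role of the feature view's `labels`.
rows = [
    {"sk_id_curr": 1, "amt_credit": 211500.0, "target": 0},
    {"sk_id_curr": 2, "amt_credit": 997335.0, "target": 1},
]
labels = ["target"]

# Features: every column that is not a label.
X = [{k: v for k, v in row.items() if k not in labels} for row in rows]
# Labels: only the label columns, one entry per row.
y = [{k: row[k] for k in labels} for row in rows]

print(len(X), len(X[0]))  # 2 2
print(len(y), len(y[0]))  # 2 1
```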

In [23]:
X_train, y_train, X_test, y_test = feature_view.get_train_test_split(
    training_dataset_version = 2
)

In [24]:
X_train.head()

Unnamed: 0,sk_id_curr,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,amt_goods_price,...,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year,previous_loan_counts
0,100016,1,0,1,0,0,67500.0,80865.0,5881.5,67500.0,...,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,7
1,100030,1,0,1,0,0,90000.0,225000.0,11074.5,225000.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,6
2,100032,1,1,1,0,1,112500.0,327024.0,23827.5,270000.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,100033,1,1,0,0,0,270000.0,790830.0,57676.5,675000.0,...,0,0,0,0.0,0.0,0.0,1.0,0.0,1.0,1
4,100043,1,0,1,0,2,198000.0,641173.5,23157.0,553500.0,...,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0,2


---

## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook, we will train a model on the dataset we created in this notebook.

---