# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the second part of the quick-start tutorial series on the Hopsworks Feature Store. This notebook explains how to read from feature groups and create training datasets within the feature store.</span>

## 🗒️ In this notebook we will see how to create a training dataset from the feature groups: 

1. Retrieving Feature Groups
2. Feature Group investigation
3. Transformation functions
4. Feature Views
5. Training Datasets
6. Training Datasets with Event Time filter



![tutorial-flow](images/02_training-dataset.png) 

---

## <span style="color:#ff5f27;"> 🔮 🪝 Connecting to Feature Store and Retrieving Feature Groups </span>

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


> To retrieve the Feature Groups we need, we can use the `FeatureStore.get_or_create_feature_group()` method.

In [2]:
fg_weather = fs.get_or_create_feature_group(
    name = 'weather_fg',
    version = 1
)

In [3]:
fg_calendar = fs.get_or_create_feature_group(
    name = 'calendar_fg',
    version = 1
)

In [4]:
fg_electricity = fs.get_or_create_feature_group(
    name = 'electricity_fg',
    version = 1
)

---

## <span style="color:#ff5f27;">🕵🏻‍♂️ Feature Groups Investigation</span>

We can use the `FeatureGroup.show()` method to preview the first n rows.

We also use the `read()` method **to execute queries**, which are built with the following methods:

- `FeatureGroup.get_feature()` to get a specific feature from our Feature Group.

- `FeatureGroup.select()` to get a subset of features from our Feature Group.

- `FeatureGroup.select_all()` to get all features from our Feature Group.

- `FeatureGroup.select_except()` to get all features except a few from our Feature Group.

- `FeatureGroup.filter()` to apply a specific filter to the Feature Group.
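
To make the "build first, execute on `read()`" idea concrete, here is a minimal pure-Python sketch. `ToyQuery` is a hypothetical stand-in written for this illustration only; it is not an hsfs class and says nothing about how hsfs is implemented. The sample rows reuse a few values from the electricity feature group.

```python
# Toy illustration only: `ToyQuery` is a hypothetical stand-in, not an hsfs class.
rows = [
    {"index": 1886, "demand": 101413.145, "rrp": 43.745921},
    {"index": 2039, "demand": 111590.65, "rrp": 47.915895},
    {"index": 1347, "demand": 107838.95, "rrp": 67.615591},
]


class ToyQuery:
    """Records a column selection and a row filter; nothing runs until read()."""

    def __init__(self, rows, columns=None, predicate=None):
        self._rows = rows
        self._columns = columns
        self._predicate = predicate

    def select(self, columns):
        # Keep only the named columns.
        return ToyQuery(self._rows, list(columns), self._predicate)

    def select_except(self, columns):
        # Keep every column except the named ones.
        kept = [c for c in self._rows[0] if c not in columns]
        return ToyQuery(self._rows, kept, self._predicate)

    def filter(self, predicate):
        # Remember the row filter; it is applied later, in read().
        return ToyQuery(self._rows, self._columns, predicate)

    def read(self):
        # Execution happens here: apply the filter, then project the columns.
        matching = [
            r for r in self._rows
            if self._predicate is None or self._predicate(r)
        ]
        columns = self._columns or list(self._rows[0])
        return [{c: r[c] for c in columns} for r in matching]


query = ToyQuery(rows).select_except(["index"]).filter(lambda r: r["demand"] > 105000)
result = query.read()
print(result)
```

Until `read()` is called, `query` only describes what to fetch, which is why displaying a query object shows its type rather than any data.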

In [5]:
fg_weather.select_all()

<hsfs.constructor.query.Query at 0x7f7c2909cd90>

In [6]:
fg_weather.select_all().read().head()

2022-06-21 11:53:50,901 INFO: USE `electricity_featurestore`
2022-06-21 11:53:51,783 INFO: SELECT `fg0`.`index` `index`, `fg0`.`date` `date`, `fg0`.`min_temperature` `min_temperature`, `fg0`.`max_temperature` `max_temperature`, `fg0`.`solar_exposure` `solar_exposure`, `fg0`.`rainfall` `rainfall`, `fg0`.`day_of_week` `day_of_week`, `fg0`.`day_of_month` `day_of_month`, `fg0`.`day_of_year` `day_of_year`, `fg0`.`week_of_year` `week_of_year`, `fg0`.`month` `month`, `fg0`.`quarter` `quarter`, `fg0`.`year` `year`
FROM `electricity_featurestore`.`weather_fg_1` `fg0`


Unnamed: 0,index,date,min_temperature,max_temperature,solar_exposure,rainfall,day_of_week,day_of_month,day_of_year,week_of_year,month,quarter,year
0,1886,1583020800000,13.2,32.1,20.8,0.0,6,1,61,9,3,1,2020
1,2039,1596240000000,5.7,16.4,11.1,0.0,5,1,214,31,8,3,2020
2,1347,1536451200000,8.6,18.1,7.3,0.0,6,9,252,36,9,3,2018
3,634,1474848000000,8.1,16.9,13.6,0.6,0,26,270,39,9,3,2016
4,8,1420761600000,16.5,18.0,3.1,1.2,4,9,9,2,1,1,2015


In [7]:
fg_calendar.select_except(['index']).show(5)

2022-06-21 11:53:53,825 INFO: USE `electricity_featurestore`
2022-06-21 11:53:54,690 INFO: SELECT `fg0`.`date` `date`, `fg0`.`school_day` `school_day`, `fg0`.`holiday` `holiday`
FROM `electricity_featurestore`.`calendar_fg_1` `fg0`


Unnamed: 0,date,school_day,holiday
0,1583020800000,1,0
1,1596240000000,1,0
2,1536451200000,1,0
3,1474848000000,0,0
4,1420761600000,0,0


In [8]:
fg_electricity.select('demand').show(5)

2022-06-21 11:53:56,516 INFO: USE `electricity_featurestore`
2022-06-21 11:53:57,345 INFO: SELECT `fg0`.`demand` `demand`
FROM `electricity_featurestore`.`electricity_fg_1` `fg0`


Unnamed: 0,demand
0,101413.145
1,111590.65
2,107838.95
3,124343.19
4,135452.26


In [9]:
fg_electricity.filter(fg_electricity.demand > 10000).show(5)

2022-06-21 11:53:59,173 INFO: USE `electricity_featurestore`
2022-06-21 11:54:00,038 INFO: SELECT `fg0`.`index` `index`, `fg0`.`date` `date`, `fg0`.`rrp` `rrp`, `fg0`.`frac_at_neg_rrp` `frac_at_neg_rrp`, `fg0`.`demand` `demand`, `fg0`.`rrp_positive` `rrp_positive`, `fg0`.`demand_neg_rrp` `demand_neg_rrp`, `fg0`.`rrp_negative` `rrp_negative`, `fg0`.`demand_pos_rrp` `demand_pos_rrp`, `fg0`.`demand_7_mean` `demand_7_mean`, `fg0`.`demand_7_std` `demand_7_std`, `fg0`.`demand_14_mean` `demand_14_mean`, `fg0`.`demand_14_std` `demand_14_std`, `fg0`.`demand_30_mean` `demand_30_mean`, `fg0`.`demand_30_std` `demand_30_std`
FROM `electricity_featurestore`.`electricity_fg_1` `fg0`
WHERE `fg0`.`demand` > 10000


Unnamed: 0,index,date,rrp,frac_at_neg_rrp,demand,rrp_positive,demand_neg_rrp,rrp_negative,demand_pos_rrp,demand_7_mean,demand_7_std,demand_14_mean,demand_14_std,demand_30_mean,demand_30_std
0,1886,1583020800000,43.745921,0.0,101413.145,43.745921,0.0,0.0,101413.145,110717.865714,9220.740759,110795.223214,8592.27448,113918.470167,11661.928533
1,2039,1596240000000,47.915895,0.0,111590.65,47.915895,0.0,0.0,111590.65,127876.293571,7974.452809,130397.305714,8870.056804,131883.9405,8381.50367
2,1347,1536451200000,67.615591,0.104167,107838.95,77.332217,9927.775,-28.213155,97911.175,119398.098571,8876.580027,124830.694643,10492.613383,125011.080833,9375.466267
3,634,1474848000000,47.67147,0.0,124343.19,47.67147,0.0,0.0,124343.19,120814.523571,10247.936071,122961.179286,10768.576673,123261.361,9562.472546
4,1435,1544054400000,125.308108,0.0,135452.26,125.308108,0.0,0.0,135452.26,109817.412143,14171.194821,108530.186071,10633.545758,107938.95,9656.161129


---

## <span style="color:#ff5f27;">🧑🏻‍🔬 Transformation functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to training datasets.

Hopsworks Feature Store also comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.

In [10]:
[t_func.name for t_func in fs.get_transformation_functions()]

['robust_scaler', 'min_max_scaler', 'standard_scaler', 'label_encoder']

We can retrieve the transformation functions we need by name.

To attach transformation functions to a training dataset, provide them as a dict in which each key is a feature name and each value is the transformation function to apply to that feature.

The training dataset must also be created from a Query object. Once attached, the transformation functions are applied whenever the `save`, `insert` and `get_serving_vector` methods are called on the training dataset object.
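
As a rough illustration of what standard scaling and label encoding compute, here is a small pure-Python sketch. It does not use the Hopsworks built-ins: the helper names are our own, and the choice of the population standard deviation is an assumption made for this example. The input values come from the `rrp_positive` and `school_day` columns previewed above.

```python
from statistics import fmean, pstdev

# Sample values taken from the `rrp_positive` and `school_day` columns.
rrp_positive = [43.745921, 47.915895, 77.332217, 47.67147, 125.308108]
school_day = [1, 1, 1, 0, 0]


def fit_standard_scaler(values):
    """Return (mean, std) computed from the training values.

    Assumption: population standard deviation; the built-in may differ.
    """
    return fmean(values), pstdev(values)


def standard_scale(value, mean, std):
    """Shift by the mean and divide by the standard deviation."""
    return (value - mean) / std


def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


mean, std = fit_standard_scaler(rrp_positive)
scaled = [standard_scale(v, mean, std) for v in rrp_positive]

codes = fit_label_encoder(school_day)
encoded = [codes[v] for v in school_day]

print(scaled)
print(encoded)
```

The statistics are fitted once and then reused, so that the same shift and scale can later be applied to new values at serving time.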

In [11]:
# Load transformation functions.
standard_scaler = fs.get_transformation_function(name = 'standard_scaler')
label_encoder = fs.get_transformation_function(name = 'label_encoder')

# Map features to transformations.
mapping_transformers = {
    "rrp_positive": standard_scaler,
    "rrp_negative": standard_scaler,
    "school_day": label_encoder,
    "holiday": label_encoder
}

---

## <span style="color:#ff5f27;">💼 Query Preparation</span>

In [12]:
fg_query = fg_weather.select_all()\
                        .join(
                            fg_calendar.select_all(),
                            on = ['index','date']
                        )\
                        .join(
                            fg_electricity.select_all(),
                            on = ['index','date']
                        )
fg_query.show(5)

2022-06-21 11:54:04,320 INFO: USE `electricity_featurestore`
2022-06-21 11:54:05,216 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg2`.`index` `index`, `fg2`.`date` `date`, `fg2`.`min_temperature` `min_temperature`, `fg2`.`max_temperature` `max_temperature`, `fg2`.`solar_exposure` `solar_exposure`, `fg2`.`rainfall` `rainfall`, `fg2`.`day_of_week` `day_of_week`, `fg2`.`day_of_month` `day_of_month`, `fg2`.`day_of_year` `day_of_year`, `fg2`.`week_of_year` `week_of_year`, `fg2`.`month` `month`, `fg2`.`quarter` `quarter`, `fg2`.`year` `year`, `fg2`.`index` `join_pk_index`, `fg2`.`date` `join_evt_date`, `fg0`.`school_day` `school_day`, `fg0`.`holiday` `holiday`, RANK() OVER (PARTITION BY `fg2`.`index`, `fg2`.`date`, `fg2`.`date` ORDER BY `fg0`.`date` DESC) pit_rank_hopsworks
FROM `electricity_featurestore`.`weather_fg_1` `fg2`
INNER JOIN `electricity_featurestore`.`calendar_fg_1` `fg0` ON `fg2`.`index` = `fg0`.`index` AND `fg2`.`date` = `fg0`.`date` AND `fg2`.`date` >= `fg0`.`date`) NA
WH

Unnamed: 0,index,date,min_temperature,max_temperature,solar_exposure,rainfall,day_of_week,day_of_month,day_of_year,week_of_year,...,rrp_positive,demand_neg_rrp,rrp_negative,demand_pos_rrp,demand_7_mean,demand_7_std,demand_14_mean,demand_14_std,demand_30_mean,demand_30_std
0,33,1422921600000,16.1,20.0,22.7,0.0,1,3,34,6,...,27.003832,0.0,0.0,122880.13,114667.386429,8920.937447,119938.167857,16864.969397,120125.450167,15880.221937
1,34,1423008000000,14.8,19.0,21.2,0.0,2,4,35,6,...,27.809134,0.0,0.0,117398.03,114758.001429,8948.957583,117702.133929,14693.328802,120100.977833,15883.964829
2,43,1423785600000,16.1,32.4,14.9,0.0,4,13,44,7,...,26.322857,2992.08,-318.66,133078.54,133396.924286,12386.475233,125776.359286,14840.586153,122416.02,15771.398868
3,47,1424131200000,17.6,20.6,8.3,0.0,1,17,48,8,...,30.250149,0.0,0.0,127666.01,131606.946429,9410.194936,129983.664643,10731.392554,124825.231,14362.078276
4,50,1424390400000,17.4,25.3,24.7,0.0,4,20,51,8,...,33.538672,0.0,0.0,139465.675,129902.357857,7431.589408,131649.641071,9979.447032,125168.042667,14088.383256


---

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

**Feature Views** sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create **Training Datasets**.

A Feature View lets us define the schema as a query with filters, the model's target feature (label), and additional transformation functions.

To create a Feature View we can use the `FeatureStore.create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object with data.

In [13]:
feature_view = fs.create_feature_view(
    name = 'electricity_data',
    version = 1,
    labels = ['demand'],
    query = fg_query
)

Feature view created successfully, explore it at 
https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1145/fs/1093/fv/electricity_data/version/1


In [14]:
feature_view

<hsfs.feature_view.FeatureView at 0x7f7ca829c2b0>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [15]:
feature_view = fs.get_feature_view(
    name = 'electricity_data',
    version = 1
)

In [16]:
feature_view.version

1

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent FeatureView, with an optional on-disk snapshot of the data returned by the query.

**A Training Dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset, we use the `FeatureView.create_training_data()` method.

Here are some important things to note:

- It inherits the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format using the **data_format** parameter.

- **start_time** and **end_time** filter the dataset to a specific time range.

#### <span style="color:#ff5f27;"> ⛳️ Simple Training Dataset</span>

In [17]:
feature_view.create_training_data(
    description = 'training_dataset',
    data_format = 'csv'
)

Training dataset job started successfully, you can follow the progress at 
https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1145/jobs/named/electricity_data_1_1_create_fv_td_21062022115433/executions




(1, <hsfs.core.job.Job at 0x7f7c28a36760>)

- We can create **train, test** splits using `create_train_test_split()`.

- We can create **train, validation, test** splits using `create_train_validation_test_splits()`.

- In both cases we only need to specify the desired split ratios.
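
To show what a split ratio means in practice, here is a minimal pure-Python sketch that shuffles row indices and cuts them into three parts. It is only an illustration: `split_indices` is a hypothetical helper, not how Hopsworks performs the split, and a random split on a real dataset may only approximate the requested ratios.

```python
import random


def split_indices(n_rows, test_size, val_size=0.0, seed=42):
    """Shuffle row indices and cut them into train / validation / test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(1000, test_size=0.1, val_size=0.2)
print(len(train), len(val), len(test))  # 700 200 100
```

Every row lands in exactly one part, so the three sizes always add up to the total number of rows.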

#### <span style="color:#ff5f27;"> ⛳️ Dataset with train and test splits</span>

In [18]:
feature_view.create_train_test_split(
    test_size = 0.2
)

Training dataset job started successfully, you can follow the progress at 
https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1145/jobs/named/electricity_data_1_2_create_fv_td_21062022115811/executions




(2, <hsfs.core.job.Job at 0x7f7c28736910>)

#### <span style="color:#ff5f27;"> ⛳️ Dataset with train, validation and test splits</span>

In [19]:
feature_view.create_train_validation_test_splits(
    val_size = 0.2,
    test_size = 0.1
)

Training dataset job started successfully, you can follow the progress at 
https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1145/jobs/named/electricity_data_1_3_create_fv_td_21062022120403/executions




(3, <hsfs.core.job.Job at 0x7f7c28a35e20>)

---

## <span style="color:#ff5f27;"> 🪝 Retrieving Datasets </span>

#### <span style="color:#ff5f27;"> ⛳️ Simple Training Dataset</span>

In [20]:
X_train, y_train = feature_view.get_training_data(
    training_dataset_version = 1
)



In [21]:
X_train.head()

Unnamed: 0,index,date,min_temperature,max_temperature,solar_exposure,rainfall,day_of_week,day_of_month,day_of_year,week_of_year,...,rrp_positive,demand_neg_rrp,rrp_negative,demand_pos_rrp,demand_7_mean,demand_7_std,demand_14_mean,demand_14_std,demand_30_mean,demand_30_std
0,154,1433376000000,4.4,11.3,5.5,0.0,3,4,155,23,...,33.050285,0.0,0.0,147155.265,134826.100714,15277.541585,132421.544286,12121.312927,129020.392667,11456.502219
1,573,1469577600000,7.5,12.9,6.9,0.4,2,27,209,30,...,36.397463,0.0,0.0,143209.755,133247.215,10620.20346,132380.252857,10598.615006,132709.017833,10018.226341
2,635,1474934400000,7.4,14.8,13.4,12.6,1,27,271,39,...,49.794504,0.0,0.0,126888.77,121418.018571,10496.400154,122401.794643,10304.147294,123663.239167,9448.684467
3,654,1476576000000,17.4,23.1,8.0,0.0,6,16,290,41,...,22.47496,1740.595,-3.6,100052.44,117109.188571,12509.209925,115471.599286,11851.022854,116607.4595,10873.536035
4,872,1495411200000,10.4,20.3,10.1,0.0,0,22,142,21,...,72.931488,0.0,0.0,113551.04,118247.162143,8599.465978,119668.648214,8471.561767,116976.748167,9751.151336


In [22]:
y_train.head()

Unnamed: 0,demand
0,147155.265
1,143209.755
2,126888.77
3,101793.035
4,113551.04


In [23]:
X_train.shape

(2073, 27)

#### <span style="color:#ff5f27;"> ⛳️ Dataset with train and test splits</span>

In [24]:
X_train, y_train, X_test, y_test = feature_view.get_train_test_split(
    training_dataset_version = 2
)

In [25]:
X_train.head()

Unnamed: 0,index,date,min_temperature,max_temperature,solar_exposure,rainfall,day_of_week,day_of_month,day_of_year,week_of_year,...,rrp_positive,demand_neg_rrp,rrp_negative,demand_pos_rrp,demand_7_mean,demand_7_std,demand_14_mean,demand_14_std,demand_30_mean,demand_30_std
0,573,1469577600000,7.5,12.9,6.9,0.4,2,27,209,30,...,36.397463,0.0,0.0,143209.755,133247.215,10620.20346,132380.252857,10598.615006,132709.017833,10018.226341
1,635,1474934400000,7.4,14.8,13.4,12.6,1,27,271,39,...,49.794504,0.0,0.0,126888.77,121418.018571,10496.400154,122401.794643,10304.147294,123663.239167,9448.684467
2,872,1495411200000,10.4,20.3,10.1,0.0,0,22,142,21,...,72.931488,0.0,0.0,113551.04,118247.162143,8599.465978,119668.648214,8471.561767,116976.748167,9751.151336
3,902,1498003200000,7.8,13.5,5.0,0.6,2,21,172,25,...,106.966725,0.0,0.0,139814.3,130657.331429,8642.06707,129165.076786,9253.865837,127785.879,9774.074658
4,971,1503964800000,5.4,12.9,9.1,0.0,1,29,241,35,...,143.84804,0.0,0.0,139814.04,130599.129286,10117.240137,129806.963571,8575.738851,129232.795833,10549.86387


In [26]:
X_train.shape

(1662, 27)

In [27]:
X_test.shape

(411, 27)

#### <span style="color:#ff5f27;"> ⛳️ Dataset with train, validation and test splits</span>

In [28]:
X_train, y_train, X_val, y_val, X_test, y_test = feature_view.get_train_validation_test_splits(
    training_dataset_version = 3
)

In [29]:
X_train.head()

Unnamed: 0,index,date,min_temperature,max_temperature,solar_exposure,rainfall,day_of_week,day_of_month,day_of_year,week_of_year,...,rrp_positive,demand_neg_rrp,rrp_negative,demand_pos_rrp,demand_7_mean,demand_7_std,demand_14_mean,demand_14_std,demand_30_mean,demand_30_std
0,154,1433376000000,4.4,11.3,5.5,0.0,3,4,155,23,...,33.050285,0.0,0.0,147155.265,134826.100714,15277.541585,132421.544286,12121.312927,129020.392667,11456.502219
1,573,1469577600000,7.5,12.9,6.9,0.4,2,27,209,30,...,36.397463,0.0,0.0,143209.755,133247.215,10620.20346,132380.252857,10598.615006,132709.017833,10018.226341
2,635,1474934400000,7.4,14.8,13.4,12.6,1,27,271,39,...,49.794504,0.0,0.0,126888.77,121418.018571,10496.400154,122401.794643,10304.147294,123663.239167,9448.684467
3,902,1498003200000,7.8,13.5,5.0,0.6,2,21,172,25,...,106.966725,0.0,0.0,139814.3,130657.331429,8642.06707,129165.076786,9253.865837,127785.879,9774.074658
4,971,1503964800000,5.4,12.9,9.1,0.0,1,29,241,35,...,143.84804,0.0,0.0,139814.04,130599.129286,10117.240137,129806.963571,8575.738851,129232.795833,10549.86387


In [30]:
X_train.shape[0] + X_test.shape[0]

1663

In [31]:
X_train.shape[0] + X_val.shape[0] + X_test.shape[0]

2073

In [32]:
X_train.shape

(1460, 27)

In [33]:
X_val.shape

(410, 27)

In [34]:
X_test.shape

(203, 27)

---

## <span style="color:#ff5f27;"> 🔮 Creating Training Datasets with Event Time filter</span>

First, let's import **datetime** from the datetime library and set up a time format.

Then we can define the start_time and end_time points.

Finally, we can create a training dataset restricted to a specific time range.


In [35]:
from datetime import datetime

def from_unix_to_datetime(unix):
    return datetime.utcfromtimestamp(unix).strftime('%Y-%m-%d %H:%M:%S')

In [36]:
date_format = '%Y-%m-%d %H:%M:%S'

start_time_train = int(float(datetime.strptime('2017-01-01 00:00:01',date_format).timestamp()) * 1000)
end_time_train = int(float(datetime.strptime('2018-02-01 23:59:59',date_format).timestamp()) * 1000)

start_time_test = int(float(datetime.strptime('2018-02-02 23:59:59',date_format).timestamp()) * 1000)
end_time_test = int(float(datetime.strptime('2019-02-01 23:59:59',date_format).timestamp()) * 1000)
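
One caveat about the cell above: `datetime.strptime` returns a naive datetime, and `.timestamp()` interprets naive values in the machine's local timezone, whereas `from_unix_to_datetime` decodes in UTC. The sketch below is one possible way to keep both directions in UTC. The helper names `to_unix_millis_utc` and `from_unix_millis_utc` are our own and are not part of hsfs.

```python
from datetime import datetime, timezone

DATE_FORMAT = '%Y-%m-%d %H:%M:%S'


def to_unix_millis_utc(text, date_format=DATE_FORMAT):
    """Parse `text` as a UTC wall-clock time and return Unix milliseconds."""
    parsed = datetime.strptime(text, date_format).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def from_unix_millis_utc(millis, date_format=DATE_FORMAT):
    """Format Unix milliseconds as a UTC wall-clock string."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime(date_format)


start_ms = to_unix_millis_utc('2017-01-01 00:00:01')
print(start_ms)                             # 1483228801000
print(from_unix_millis_utc(1583020800000))  # 2020-03-01 00:00:00
```

With explicit UTC on both sides, the resulting boundaries do not depend on where the notebook is run.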

#### <span style="color:#ff5f27;"> ⛳️ Simple Training Dataset with event time</span>

In [38]:
feature_view.create_training_data(
    description = 'data_2017_2018',
    data_format = 'csv',
    start_time = start_time_train,
    end_time = end_time_train
)

Training dataset job started successfully, you can follow the progress at 
https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1145/jobs/named/electricity_data_1_4_create_fv_td_21062022121533/executions




(4, <hsfs.core.job.Job at 0x7f7c28a53ca0>)

In [39]:
X_train_lim, y_train_lim = feature_view.get_training_data(
    training_dataset_version = 4
)

In [40]:
(X_train_lim.date.agg(['min', 'max']) / 1000).apply(from_unix_to_datetime)

min    2017-01-02 00:00:00
max    2018-02-01 00:00:00
Name: date, dtype: object

#### <span style="color:#ff5f27;"> ⛳️ Training Dataset with train and test splits with event time</span>

In [None]:
# feature_view.create_train_test_split(
#     test_size = 0.2,
#     train_start = start_time_train,
#     train_end = end_time_train,
#     test_start = start_time_test,
#     test_end = end_time_test
# )

---

### <span style="color:#ff5f27;"> Next Steps</span>

In the next notebook, we will train a model on the dataset we created in this notebook.