# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Data & Feature Views</span>

<span style="font-weight:bold; font-size: 1.4rem;">This is the third part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on.
2. **Decide how the features should be preprocessed.**
3. **Create a Feature View.**
4. **Create a dataset split** for training and validation data.

![tutorial-flow](images/02_training-dataset.png) 

---

## 🧑🏻‍🏫 HSFS `Feature Views` and `Training Datasets`

`Feature Views` are the third building block of the Hopsworks Feature Store. A Feature View stores the metadata of our dataset, such as the query that selects its features.

`Training datasets` are the fourth building block of the Hopsworks Feature Store.

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecord, CSV, NumPy) and then fed to a machine learning model for training.

Training datasets can also be stored on external storage systems like Amazon S3 or GCS to be read by external model training platforms.

As with the previous notebooks, the first step is to establish a connection with the Hopsworks feature store and get the feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">⚙️ Feature View Creation</span>

In the previous notebook ([feature_exploration](./feature_exploration.ipynb)) we walked through how to explore and query the Hopsworks feature store using HSFS. We can use the queries produced in that notebook to create a `Feature View`.

In [2]:
sales_fg = fs.get_feature_group(
    name = 'sales_fg',
    version = 1
)

exogenous_fg = fs.get_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = sales_fg.select_all()\
        .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))
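The SQL that Hopsworks logs later in this notebook shows this query running as an `INNER JOIN` between the two feature groups, keyed on `store` plus a `date` condition. Here is a minimal plain-Python sketch of what such a join does, assuming `store` and `date` are the join keys. The rows and the `inner_join` helper are illustrative only and are not part of HSFS.

```python
# Toy rows standing in for the two feature groups (illustrative values only).
sales_rows = [
    {"store": 1, "dept": 1, "date": 1296777600000, "weekly_sales": 21665.76},
    {"store": 2, "dept": 1, "date": 1296777600000, "weekly_sales": 100.0},
]
exogenous_rows = [
    {"store": 1, "date": 1296777600000, "fuel_price": 3.523,
     "unemployment": 7.962, "cpi": 215.7332258, "temperature": 40.0},
]

JOIN_KEYS = ("store", "date")                      # assumed join keys
SELECTED = ("fuel_price", "unemployment", "cpi")   # features picked with select()


def inner_join(left, right, keys, right_features):
    """Inner-join two lists of dicts on `keys`, keeping only
    `right_features` from the right-hand rows."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **{f: match[f] for f in right_features}})
    return joined


joined_rows = inner_join(sales_rows, exogenous_rows, JOIN_KEYS, SELECTED)
print(joined_rows)
```

Only the store 1 row survives, because store 2 has no matching exogenous row, and `temperature` is dropped because it was not selected.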

To create a Feature View, we can use the `FeatureStore.create_feature_view()` method.

In [3]:
feature_view = fs.create_feature_view(
    name = 'exodenous_sale',
    version = 1,
    query = query
)

Feature view created successfully, explore it at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/122/fs/70/fv/exodenous_sale/version/1


In [4]:
feature_view

<hsfs.feature_view.FeatureView at 0x7f3214dbaa00>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [5]:
feature_view = fs.get_feature_view(
    name = 'exodenous_sale',
    version = 1
)

In [6]:
feature_view.version

1

> `FeatureView.preview_feature_vector()` returns a sample assembled serving vector from the online feature store.

In [7]:
feature_view.preview_feature_vector()

[1,
 1,
 1296777600000,
 21665.76,
 0,
 675987.35,
 675987.35,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 3.523,
 7.962,
 215.7332258]
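The vector printed above is a bare list, so it is easier to read once each value is paired with a feature name. The sketch below assumes the vector follows the same column order as the `SELECT` statement that `get_batch_data()` logs further down in this notebook. That ordering is an assumption, not something the preview output confirms.

```python
from datetime import datetime, timezone

# Column order copied from the logged SELECT statement (assumed to match
# the order of the serving vector).
feature_names = [
    "store", "dept", "date", "weekly_sales", "is_holiday",
    "sales_last_30_days_store_dep", "sales_last_30_days_store",
    "sales_last_90_days_store_dep", "sales_last_90_days_store",
    "sales_last_180_days_store_dep", "sales_last_180_days_store",
    "sales_last_365_days_store_dep", "sales_last_365_days_store",
    "fuel_price", "unemployment", "cpi",
]

# The vector returned by preview_feature_vector() above.
vector = [1, 1, 1296777600000, 21665.76, 0, 675987.35, 675987.35,
          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.523, 7.962, 215.7332258]

# Pair every value with its (assumed) feature name.
labeled = dict(zip(feature_names, vector))

# `date` is stored as epoch milliseconds; convert it for readability.
event_date = datetime.fromtimestamp(labeled["date"] / 1000, tz=timezone.utc).date()
print(labeled["weekly_sales"], event_date)
```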

> To get the data as a batch, use `FeatureView.get_batch_data()`.

In [8]:
df_batch = feature_view.get_batch_data()

2022-06-02 16:49:58,482 INFO: USE `basics_featurestore`
2022-06-02 16:49:59,171 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`date` `date`, `fg1`.`weekly_sales` `weekly_sales`, `fg1`.`is_holiday` `is_holiday`, `fg1`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg1`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg1`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg1`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg1`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg1`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg1`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg1`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`unemployment` `unemployment`, `fg0`.`cpi` `cpi`
FROM `basics_featurestore`.`sales_fg_1` `fg1`
INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `f

In [9]:
type(df_batch)

pandas.core.frame.DataFrame

In [10]:
df_batch.head()

Unnamed: 0,store,dept,date,weekly_sales,is_holiday,sales_last_30_days_store_dep,sales_last_30_days_store,sales_last_90_days_store_dep,sales_last_90_days_store,sales_last_180_days_store_dep,sales_last_180_days_store,sales_last_365_days_store_dep,sales_last_365_days_store,fuel_price,unemployment,cpi
0,6,49,1320364800000,8857.16,0,384336.97,384336.97,1149035.71,1149035.71,1366310.6,1366310.6,4503096.85,4503096.85,3.332,6.551,219.400081
1,26,20,1306454400000,3070.85,0,91724.78,91724.78,264616.07,264616.07,381574.26,381574.26,1100799.04,1100799.04,4.034,7.818,134.767774
2,44,42,1284681600000,80.14,0,1788.9,1788.9,917793.1,917793.1,2369840.59,2369840.59,8715466.49,8715466.49,2.875,7.804,126.145467
3,9,19,1305244800000,1685.34,0,50128.75,50128.75,145747.07,145747.07,568325.39,568325.39,1843007.6,1843007.6,3.899,6.38,219.604183
4,39,14,1291939200000,32003.82,0,592425.01,592425.01,3523397.38,3523397.38,8334755.73,8334755.73,10159948.6,10159948.6,2.843,8.476,210.237249


---

## <span style="color:#ff5f27;">🏋️‍♀️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (the set of features) is determined by the parent Feature View, with an optional on-disk snapshot of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset, we use the `FeatureView.create_training_dataset()` method.

⚠️ **Some important things**:
- The training dataset inherits the name of its Feature View.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format with the **data_format** parameter.

- We can specify the split ratios with the **splits** parameter.

- **train_split** specifies which split will be used for training.

In [11]:
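# Note: despite its name, `train_df` receives a (version, job) tuple,
# as the output of the next cell shows, not a DataFrame.
train_df = feature_view.create_training_dataset(
    version = 1,
    description = 'trial_dataset',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},   # relative size of each split
    train_split = "train",                      # split used for training
    write_options = {'wait_for_job': False}     # return without waiting for the job
)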

Training dataset job started successfully, you can follow the progress at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/122/jobs/named/exodenous_sale_1_1_create_fv_td_02062022165038/executions


In [12]:
train_df

(1, <hsfs.core.job.Job at 0x7f32010f2940>)
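The `splits` argument in the call above asks for an 80/20 partition into `train` and `validation`. As a rough illustration of what a ratio-based random split means, here is a plain-Python sketch. The `split_rows` helper and its `seed` parameter are hypothetical and say nothing about how Hopsworks implements splitting internally.

```python
import random


def split_rows(rows, splits, seed=42):
    """Randomly partition `rows` according to the relative weights in `splits`."""
    total = sum(splits.values())
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility

    result, start = {}, 0
    names = list(splits)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)              # last split takes the remainder
        else:
            end = start + round(len(shuffled) * splits[name] / total)
        result[name] = shuffled[start:end]
        start = end
    return result


rows = list(range(100))                      # stand-in for 100 training rows
parts = split_rows(rows, {"train": 80, "validation": 20})
print({name: len(part) for name, part in parts.items()})
```

Each row ends up in exactly one split, so a row used for validation is never also used for training.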

To load a dataset from Hopsworks, we can use the `FeatureView.get_training_dataset_splits()` method.

The **splits** parameter lets us choose which splits of the training dataset to retrieve.

In [13]:
td_version, df = feature_view.get_training_dataset_splits(
    splits = {},
    start_time = None,
    end_time = None,
    version = 2
)

df.head()

2022-06-02 16:50:40,935 INFO: USE `basics_featurestore`
2022-06-02 16:50:41,605 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`date` `date`, `fg1`.`weekly_sales` `weekly_sales`, `fg1`.`is_holiday` `is_holiday`, `fg1`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg1`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg1`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg1`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg1`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg1`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg1`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg1`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`unemployment` `unemployment`, `fg0`.`cpi` `cpi`
FROM `basics_featurestore`.`sales_fg_1` `fg1`
INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `f

Unnamed: 0,store,dept,date,weekly_sales,is_holiday,sales_last_30_days_store_dep,sales_last_30_days_store,sales_last_90_days_store_dep,sales_last_90_days_store,sales_last_180_days_store_dep,sales_last_180_days_store,sales_last_365_days_store_dep,sales_last_365_days_store,fuel_price,unemployment,cpi
0,6,49,1320364800000,8857.16,0,384336.97,384336.97,1149035.71,1149035.71,1366310.6,1366310.6,4503096.85,4503096.85,3.332,6.551,219.400081
1,26,20,1306454400000,3070.85,0,91724.78,91724.78,264616.07,264616.07,381574.26,381574.26,1100799.04,1100799.04,4.034,7.818,134.767774
2,44,42,1284681600000,80.14,0,1788.9,1788.9,917793.1,917793.1,2369840.59,2369840.59,8715466.49,8715466.49,2.875,7.804,126.145467
3,9,19,1305244800000,1685.34,0,50128.75,50128.75,145747.07,145747.07,568325.39,568325.39,1843007.6,1843007.6,3.899,6.38,219.604183
4,39,14,1291939200000,32003.82,0,592425.01,592425.01,3523397.38,3523397.38,8334755.73,8334755.73,10159948.6,10159948.6,2.843,8.476,210.237249


---

## <span style="color:#ff5f27;">🔮 Creating Training Datasets with Event Time filter </span>

First, let's import **datetime** from the `datetime` library and define a helper that converts a date string into a timestamp in milliseconds.

Then we can define the `start_time` and `end_time` points.

Finally, we can create a training dataset that contains only data within those time boundaries.

In [14]:
from datetime import datetime

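def timestamp_2_time(x):
    # Convert a 'YYYY-MM-DD' string to epoch milliseconds.
    # Note: the naive datetime is interpreted in the local timezone.
    dt_obj = datetime.strptime(x, '%Y-%m-%d')
    dt_obj = dt_obj.timestamp() * 1000
    return int(dt_obj)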

In [15]:
start_time = timestamp_2_time('2008-01-01')
end_time = timestamp_2_time('2012-01-01')
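Because `datetime.timestamp()` interprets a naive datetime in the machine's local timezone, the helper above can return different values on machines in different timezones. The bounds in the SQL logged later in this notebook (`1199145600000` and `1325376000000`) correspond to midnight UTC. A timezone-explicit variant, sketched here under the hypothetical name `date_to_epoch_ms`, returns those values regardless of where it runs.

```python
from datetime import datetime, timezone


def date_to_epoch_ms(date_str):
    """Convert a 'YYYY-MM-DD' string to epoch milliseconds,
    interpreting the date as midnight UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


start_time_utc = date_to_epoch_ms("2008-01-01")
end_time_utc = date_to_epoch_ms("2012-01-01")
print(start_time_utc, end_time_utc)
```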

In [16]:
exogenous_fg = fs.get_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = exogenous_fg.select_all()

In [17]:
exogenous_fv = fs.create_feature_view(
    name = 'exogenous_fg_2008_2012',
    version = 1,
    query = query
)

Feature view created successfully, explore it at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/122/fs/70/fv/exogenous_fg_2008_2012/version/1


In [18]:
td_version, td_job = exogenous_fv.create_training_dataset(
    description = 'exogenous_fg_filtered',
    version = 1,
    data_format = 'csv',
    write_options = {'wait_for_job': True},
    start_time = start_time,
    end_time = end_time,
    statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False}
)

Training dataset job started successfully, you can follow the progress at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/122/jobs/named/exogenous_fg_2008_2012_1_1_create_fv_td_02062022165122/executions


---
## <span style="color:#ff5f27;"> 🪝 Training Dataset retrieval </span>

To retrieve a training dataset from the Feature Store, we can use the `get_training_dataset_splits()` or `get_training_dataset()` methods.

- If no version is provided, a new one is created.
- If a version is provided and it exists, the training dataset is retrieved and returned as a DataFrame.

In [19]:
td_version, df = exogenous_fv.get_training_dataset(
    start_time = start_time,
    end_time = end_time
)

df.head()

2022-06-02 16:52:06,790 INFO: USE `basics_featurestore`
2022-06-02 16:52:07,475 INFO: SELECT `fg0`.`store` `store`, `fg0`.`date` `date`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`markdown1` `markdown1`, `fg0`.`markdown2` `markdown2`, `fg0`.`markdown3` `markdown3`, `fg0`.`markdown4` `markdown4`, `fg0`.`markdown5` `markdown5`, `fg0`.`cpi` `cpi`, `fg0`.`unemployment` `unemployment`, `fg0`.`is_holiday` `is_holiday`, CASE WHEN `fg0`.`appended_feature` IS NULL THEN 10.0 ELSE `fg0`.`appended_feature` END `appended_feature`
FROM `basics_featurestore`.`exogenous_fg_1` `fg0`
WHERE `fg0`.`date` >= 1199145600000 AND `fg0`.`date` <= 1325376000000




Unnamed: 0,store,date,temperature,fuel_price,markdown1,markdown2,markdown3,markdown4,markdown5,cpi,unemployment,is_holiday,appended_feature
0,21,1296777600000,36.33,2.989,,,,,,212.224065,8.028,0,10.0
1,4,1312502400000,86.09,3.662,,,,,,129.184645,5.644,0,10.0
2,30,1310688000000,91.05,3.575,,,,,,215.013443,7.852,0,10.0
3,1,1306454400000,77.72,3.786,,,,,,215.503788,7.682,0,10.0
4,37,1287100800000,71.57,2.72,,,,,,210.580594,8.476,0,10.0
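The `WHERE` clause in the logged query shows that the event-time filter is an inclusive range on `date` (`>=` the start and `<=` the end). The same predicate can be sketched in plain Python. The first three rows reuse `store` and `date` values from the output above, while the last row and the `in_window` helper are hypothetical additions for illustration.

```python
start_ms = 1199145600000   # lower bound from the logged WHERE clause
end_ms = 1325376000000     # upper bound from the logged WHERE clause

rows = [
    {"store": 21, "date": 1296777600000},
    {"store": 4, "date": 1312502400000},
    {"store": 30, "date": 1310688000000},
    {"store": 99, "date": 1356998400000},   # hypothetical row, 2013-01-01 UTC
]


def in_window(row, start, end):
    """Return True when the row's event time lies in [start, end], bounds included."""
    return start <= row["date"] <= end


filtered = [row for row in rows if in_window(row, start_ms, end_ms)]
print([row["store"] for row in filtered])
```

Rows stamped exactly at `start_ms` or `end_ms` are kept, which matters when the event times fall on day boundaries as they do in this dataset.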


---