### HSFS training datasets

Training datasets are the third building block of the Hopsworks Feature Store. Data scientists can query the feature store (see the [feature_exploration](./feature_exploration.ipynb) notebook) and materialize the result of a query as a training dataset.

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecord, CSV, NumPy) and then fed to a machine learning model for training.

Training datasets can also be stored on external storage systems like Amazon S3 or GCS to be read by external model training platforms.

As with the previous notebooks, the first step is to establish a connection with the Hopsworks feature store and get the feature store handle.

In [1]:
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
20,application_1602855941528_0003,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

### Create a training dataset from a query

In the previous notebook ([feature_exploration](./feature_exploration.ipynb)) we walked through how to explore and query the Hopsworks feature store using HSFS. We can use the queries produced in the previous notebook to create a training dataset.

In [2]:
sales_fg = fs.get_feature_group('sales_fg')
exogenous_fg = fs.get_feature_group('exogenous_fg')

query = sales_fg.select_all()\
        .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))



As with feature groups, we first need to create a metadata object representing the training dataset. After that, we can call the `save()` method to persist it in the Hopsworks feature store.
Different file formats are available: `csv`, `tfrecord`, `npy`, `hdf5`, `avro`, `parquet`, `orc`.

In [3]:
td = fs.create_training_dataset(name="sales_model",
                               description="Dataset to train the sales model",
                               data_format="csv",
                               version=1)

td.save(query)

<hsfs.training_dataset.TrainingDataset object at 0x7f9fe2d0e990>

#### Pass write options

When you save a training dataset, you can pass additional options to the Spark writer. For instance, the example below adds a header row to the CSV files.

In [4]:
td = fs.create_training_dataset(name="sales_model",
                               description="Dataset to train the sales model",
                               data_format="csv",
                               version=2)

td.save(query, {'header': 'true'})

<hsfs.training_dataset.TrainingDataset object at 0x7f9fe2897110>
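
Write options are passed as a plain dictionary of strings, so a misspelled key is easy to miss. To our knowledge, Spark's writer ignores options it does not recognize instead of raising an error, although that is an assumption worth checking against your Spark version. The sketch below shows one way to catch such typos before calling `save()`. Both `check_write_options` and `KNOWN_CSV_WRITE_OPTIONS` are hypothetical names that are not part of HSFS, and the option list is only a partial selection of Spark's CSV writer options.

```python
import difflib

# Partial, hand-maintained list of Spark CSV writer options (an assumption:
# extend it with the options you actually rely on).
KNOWN_CSV_WRITE_OPTIONS = {
    "header", "sep", "quote", "escape", "nullValue",
    "compression", "dateFormat", "timestampFormat", "encoding",
}


def check_write_options(options, known=KNOWN_CSV_WRITE_OPTIONS):
    """Hypothetical helper: map each unknown key to its closest known key.

    Returns an empty dict when every key is known. The value is None when
    no known option is similar enough to suggest.
    """
    suspicious = {}
    for key in options:
        if key not in known:
            matches = difflib.get_close_matches(key, sorted(known), n=1)
            suspicious[key] = matches[0] if matches else None
    return suspicious


# A transposed letter is reported together with a suggested correction.
print(check_write_options({"haeder": "true"}))
```

If the returned dictionary is not empty, you can fix the keys before passing the options to `td.save(query, options)`.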

#### Split the training dataset

If you are training a model, you might want to split the training dataset into different slices (training, test and validation). HSFS allows you to specify the split sizes. You can also provide a seed for the random splitter if you want the split to be reproducible.

In [5]:
td = fs.create_training_dataset(name="sales_model",
                               description="Dataset to train the sales model",
                               data_format="csv",
                               splits={'train': 0.7, 'test': 0.2, 'validate': 0.1},
                               version=3)

td.save(query, {'header': 'true'})

<hsfs.training_dataset.TrainingDataset object at 0x7f9fe2897310>
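
To illustrate why a seed makes a split reproducible, here is a minimal sketch of a weighted random split in plain Python. It is not the HSFS implementation, which runs on Spark, and `random_split` is a hypothetical helper. Each row is assigned to a split independently, so the split sizes are only approximately proportional to the weights.

```python
import random


def random_split(rows, splits, seed=None):
    """Hypothetical helper: assign each row to one split at random.

    `splits` maps a split name to its relative weight, for example
    {'train': 0.7, 'test': 0.2, 'validate': 0.1}.
    """
    rng = random.Random(seed)  # a private generator, so the seed fully determines the result
    names = list(splits)
    weights = [splits[name] for name in names]
    result = {name: [] for name in names}
    for row in rows:
        name = rng.choices(names, weights=weights, k=1)[0]
        result[name].append(row)
    return result


weights = {'train': 0.7, 'test': 0.2, 'validate': 0.1}
first = random_split(range(1000), weights, seed=42)
second = random_split(range(1000), weights, seed=42)

# The same seed yields exactly the same assignment of rows to splits.
print(first == second)
# Every row lands in exactly one split.
print(sum(len(part) for part in first.values()))
```

Without a seed, two runs over the same rows would generally produce different splits, which makes a model evaluation harder to reproduce.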

#### Save the dataset on an external storage system

If you are training your model on an external machine learning platform (e.g. SageMaker), you might want to save the training dataset on an external storage system (e.g. S3). You can take advantage of the Hopsworks storage connectors (see [documentation](https://hopsworks.readthedocs.io/en/latest/featurestore/guides/featurestore.html#configuring-storage-connectors-for-the-feature-store)).

Assuming you have created an S3 storage connector named `td_bucket_connector`, you can create an external training dataset as follows:

In [None]:
td_bucket_connector = fs.get_storage_connector("td_bucket_connector", "S3")

td = fs.create_training_dataset(name="sales_model",
                               description="Dataset to train the sales model",
                               data_format="csv",
                               storage_connector=td_bucket_connector,
                               version=4)

# This code is expected to fail if your connector is not configured properly
td.save(query)

#### Replay the query that generated the training dataset

If you created a training dataset from a query object, you can ask the feature store to return the set of features (in order) and the set of joins that generated it.
This is useful if you are serving a model in production and want to augment the inference vector with features taken from the online feature store.

In [8]:
td = fs.get_training_dataset(name="sales_model")
print(td.query)

SELECT `fg0`.`sales_last_year_store_dep`, `fg0`.`sales_last_month_store`, `fg0`.`sales_last_six_month_store`, `fg0`.`sales_last_six_month_store_dep`, `fg0`.`sales_last_month_store_dep`, `fg0`.`sales_last_year_store`, `fg0`.`is_holiday`, `fg0`.`dept`, `fg0`.`sales_last_quarter_store`, `fg0`.`date`, `fg0`.`sales_last_quarter_store_dep`, `fg0`.`weekly_sales`, `fg0`.`store`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`
FROM `demo_fs_meb10000`.`sales_fg_1` `fg0`
INNER JOIN `demo_fs_meb10000`.`exogenous_fg_1` `fg1` ON `fg0`.`store` = `fg1`.`store` AND `fg0`.`date` = `fg1`.`date`
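
When you build an inference vector, the order of the features matters. The sketch below recovers the ordered list of feature names from a query string like the one printed above. It assumes the query text is available as a Python string and that every selected column has the form `` `alias`.`name` ``. `feature_names_from_query` is a hypothetical helper, not part of HSFS, and the sample query is a shortened, made-up example.

```python
import re


def feature_names_from_query(sql):
    """Hypothetical helper: return the selected column names, in order."""
    match = re.search(r"SELECT\s+(.*?)\s+FROM\s", sql,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no SELECT ... FROM clause found")
    # Each selected column looks like `alias`.`name`; keep only the name.
    return re.findall(r"`[^`]+`\.`([^`]+)`", match.group(1))


# Shortened sample in the same shape as the generated query.
sample_query = (
    "SELECT `fg0`.`store`, `fg0`.`date`, `fg1`.`cpi`\n"
    "FROM `db`.`sales_fg_1` `fg0`\n"
    "INNER JOIN `db`.`exogenous_fg_1` `fg1` ON `fg0`.`store` = `fg1`.`store`"
)
print(feature_names_from_query(sample_query))
```

Parsing SQL with regular expressions is fragile, so treat this as a quick inspection aid for queries of this simple shape only.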

### Create a training dataset from a DataFrame

If you need to apply additional transformations before creating a training dataset, you can create one from a Spark DataFrame instead of using a query.
The `create_training_dataset` call stays the same. The difference is that we pass a DataFrame to the `save()` method.

Because additional transformations were applied between the query object and the training dataset, the query cannot be replayed for this specific training dataset.

In [9]:
df = query.read()
# Apply additional transformations
df = df.drop("is_holiday")

td = fs.create_training_dataset(name="sales_model",
                               description="Dataset to train the sales model",
                               data_format="csv",
                               version=5)

td.save(df)

<hsfs.training_dataset.TrainingDataset object at 0x7f9fe28b2590>

### Add a tag to a training dataset

As with feature groups, you can add tags to a training dataset. Tags are indexed and you can search for them in the Hopsworks feature store UI. Tags are a useful tool to catalog the feature store. The `value` field can be omitted.

In [13]:
td = fs.get_training_dataset("sales_model", 5)
td.add_tag("model", value="sales")

From the HSFS API you can also list all the tags associated with a specific training dataset.

In [2]:
td = fs.get_training_dataset("sales_model", 5)
td.get_tag()

[{'model': 'sales'}]

### Read a training dataset

As with feature groups, you can call the `show()` method to get a preview of the training dataset and `read()` to get it as a Spark DataFrame.

In [3]:
td = fs.get_training_dataset("sales_model", 1)
td.show(5)

+-------------------------+----------------------+--------------------------+------------------------------+--------------------------+---------------------+----------+----+------------------------+-------------------+----------------------------+------------+-----+----------+------------+-----------+
|sales_last_year_store_dep|sales_last_month_store|sales_last_six_month_store|sales_last_six_month_store_dep|sales_last_month_store_dep|sales_last_year_store|is_holiday|dept|sales_last_quarter_store|               date|sales_last_quarter_store_dep|weekly_sales|store|fuel_price|unemployment|        cpi|
+-------------------------+----------------------+--------------------------+------------------------------+--------------------------+---------------------+----------+----+------------------------+-------------------+----------------------------+------------+-----+----------+------------+-----------+
|                      0.0|                   0.0|                       0.0|              

If you have split your training dataset, you can also read a single split.

In [4]:
td = fs.get_training_dataset("sales_model", 3)
td.read("train").count()

294905

### Feed the training dataset to a model training

If you are training a model, HSFS allows you to get a `tf.data.TFRecordDataset` handle to read the training dataset and feed it to a model training loop efficiently.

In [None]:
td = fs.get_training_dataset("sales_model", 3)

train_input_feeder = td.feed(target_name='weekly_sales', split='train', is_training=True)
train_input = train_input_feeder.tf_record_dataset()