---
title: "2. Create Training Data from Features"
date: 2021-02-24
type: technical_note
draft: false
---

### HSFS `Feature Views` and `Training Datasets`

A `Feature View` is the third building block of the Hopsworks Feature Store. It stores the metadata that defines a dataset, such as the query used to select its features.

`Training datasets` are the fourth building block of the Hopsworks Feature Store.

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecord, CSV, NumPy) and then fed to a machine learning model for training.

Training datasets can also be stored on external storage systems like Amazon S3 or GCS to be read by external model training platforms.

As in the previous notebooks, the first step is to establish a connection with the Hopsworks feature store and get the feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Starting Spark application


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

### Create a `Feature View` from a query

In the previous notebook ([feature_exploration](./feature_exploration.ipynb)) we walked through how to explore and query the Hopsworks feature store using HSFS. We can use the queries produced in that notebook to create a `Feature View`.

In [2]:
sales_fg = fs.get_feature_group(
    name = 'sales_fg',
    version = 1
)

exogenous_fg = fs.get_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = sales_fg.select_all()\
        .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))
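
To make the shape of the query result concrete, here is a minimal plain-Python sketch of what "select all features from one group and join three selected features from another" produces. It does not call HSFS. The join key (`store`, `date`), the inner-join behaviour, the helper `join_selected`, and all sample values are assumptions made for illustration; the notebook does not state which keys HSFS joins on.

```python
# Conceptual sketch only: lists of dicts stand in for the two feature groups.
# The join key ("store", "date") and every value below are hypothetical.
sales_rows = [
    {"store": 1, "date": "2010-02-05", "weekly_sales": 100.0},
    {"store": 1, "date": "2010-02-12", "weekly_sales": 120.0},
    {"store": 2, "date": "2010-02-05", "weekly_sales": 90.0},
]
exogenous_rows = [
    {"store": 1, "date": "2010-02-05", "fuel_price": 2.5,
     "unemployment": 8.1, "cpi": 211.1, "temperature": 42.3},
    {"store": 1, "date": "2010-02-12", "fuel_price": 2.6,
     "unemployment": 8.1, "cpi": 211.2, "temperature": 38.5},
]


def join_selected(left, right, keys, right_features):
    """Inner-join two lists of row dicts on `keys`.

    All columns of `left` are kept; only `right_features` are taken from `right`.
    """
    right_by_key = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(tuple(row[k] for k in keys))
        if match is None:
            continue  # inner join: drop left rows without a partner
        selected = {name: match[name] for name in right_features}
        joined.append({**row, **selected})
    return joined


joined = join_selected(
    sales_rows,
    exogenous_rows,
    keys=("store", "date"),
    right_features=["fuel_price", "unemployment", "cpi"],
)
for row in joined:
    print(row)
```

Each joined row carries every sales column plus only the three selected exogenous columns; `temperature` is left out because it was not selected.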

To create a `Feature View`, we can use the `FeatureStore.create_feature_view()` method.

In [3]:
feature_view = fs.create_feature_view(
    name = 'exodenous_sale',
    version = 1,
    query = query
)

In [4]:
feature_view

<hsfs.feature_view.FeatureView object at 0x7f194fd99b20>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [7]:
feature_view = fs.get_feature_view(
    name = 'exodenous_sale',
    version = 1
)

In [8]:
feature_view.version

1

> `FeatureView.preview_feature_vector()` returns a sample of an assembled serving vector from the online feature store.

In [9]:
feature_view.preview_feature_vector()

[31, 34, datetime.date(2010, 2, 5), 19443.48, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.835, '7.808', '210.3399684']

> To read the feature view's data in batch as a DataFrame, use `FeatureView.get_batch_data()`.

In [10]:
df_batch = feature_view.get_batch_data()

In [11]:
type(df_batch)

<class 'pyspark.sql.dataframe.DataFrame'>

In [12]:
df_batch.select(['fuel_price', 'unemployment', 'cpi']).show(5)

+----------+------------+-----------+
|fuel_price|unemployment|        cpi|
+----------+------------+-----------+
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
+----------+------------+-----------+
only showing top 5 rows
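
Batch data is often narrowed to a time window before it is used. The sketch below shows that idea in plain Python rather than through HSFS. The helper `rows_in_window`, the sample rows, and the choice of inclusive bounds are all hypothetical; how HSFS itself treats window boundaries is not covered in this notebook.

```python
from datetime import date

# Hypothetical rows standing in for a batch of feature data.
rows = [
    {"date": date(2010, 2, 5), "fuel_price": 2.5},
    {"date": date(2010, 2, 12), "fuel_price": 2.6},
    {"date": date(2010, 2, 19), "fuel_price": 2.7},
    {"date": date(2010, 2, 26), "fuel_price": 2.8},
]


def rows_in_window(rows, start=None, end=None):
    """Keep rows whose date lies in [start, end]; None means unbounded.

    Inclusive bounds are an assumption of this sketch.
    """
    return [
        row for row in rows
        if (start is None or row["date"] >= start)
        and (end is None or row["date"] <= end)
    ]


window = rows_in_window(rows, start=date(2010, 2, 10), end=date(2010, 2, 20))
everything = rows_in_window(rows)  # no bounds: nothing is filtered out
print(len(window), len(everything))
```

Leaving both bounds as `None` keeps every row, which mirrors calling a batch read without any restriction.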

---

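To create a training dataset, we use the `FeatureView.create_training_dataset()` method.

The training dataset inherits the name of the `Feature View`.

The feature store currently supports the following data formats for
training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

We choose the format with the **data_format** parameter.

We can also specify the split ratios with the **splits** parameter, a dictionary that maps each split name to its share of the data (here 80 and 20).

The **train_split** parameter names the split that will be used for training.

Note that the call returns a tuple whose first element is the training dataset version, not a DataFrame (see the output of `train_df` below).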

In [13]:
train_df = feature_view.create_training_dataset(
    version = 1,
    description = 'trial_dataset',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",
    write_options = {'wait_for_job': False}
)

In [14]:
train_df

(1, None)
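
To illustrate what a split specification such as `{'train': 80, 'validation': 20}` means, the following plain-Python sketch normalizes the weights into fractions and assigns rows to splits at random. It is a conceptual illustration under stated assumptions, not the HSFS implementation: the helper `assign_splits`, the seed, and the row count are hypothetical, and whether HSFS normalizes weights in exactly this way is an assumption.

```python
import random


def assign_splits(n_rows, splits, seed=0):
    """Randomly assign each of `n_rows` rows to a split name.

    `splits` maps split names to relative weights; the weights are
    normalized so that they sum to 1 before sampling.
    """
    names = list(splits)
    total = sum(splits.values())
    fractions = [splits[name] / total for name in names]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.choices(names, weights=fractions)[0] for _ in range(n_rows)]


splits = {"train": 80, "validation": 20}
assignment = assign_splits(10_000, splits)

train_share = assignment.count("train") / len(assignment)
validation_share = assignment.count("validation") / len(assignment)
print(f"train: {train_share:.3f}, validation: {validation_share:.3f}")
```

Because the assignment is random, the observed shares are close to 0.8 and 0.2 but not exactly equal to them.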

To load a training dataset from Hopsworks, we can use the `FeatureView.get_training_dataset_splits()` method.

With the **splits** parameter we can choose which splits of the training dataset to retrieve.

In [16]:
td_version, df = feature_view.get_training_dataset_splits(
    splits = {},
    start_time = None,
    end_time = None,
    version = 2
)

df.select(['store', 'dept', 'date', 'weekly_sales', 'is_holiday', 'sales_last_month_store_dep']).show(5)

+-----+----+----------+------------+----------+--------------------------+
|store|dept|      date|weekly_sales|is_holiday|sales_last_month_store_dep|
+-----+----+----------+------------+----------+--------------------------+
|   15|   1|2010-02-05|    12239.38|     false|                       0.0|
|   15|  10|2010-02-05|     9444.31|     false|                       0.0|
|   15|  11|2010-02-05|    13259.31|     false|                       0.0|
|   15|  12|2010-02-05|     2360.06|     false|                       0.0|
|   15|  13|2010-02-05|    19437.35|     false|                       0.0|
+-----+----+----------+------------+----------+--------------------------+
only showing top 5 rows


---