---
title: "4. Create training dataset from Online Feature Store enabled feature groups"
date: 2021-04-25
type: technical_note
draft: false
---

# Create a training dataset

![overview-4.png](./images/overview-4.png)

## Establish a connection with your Hopsworks feature store

In [1]:
import hsfs
connection = hsfs.connection()
# Get a reference to the feature store. To access a shared feature store instead, pass its name.
fs = connection.get_feature_store()

Starting Spark application

| ID | YARN Application ID | Kind | State | Spark UI | Driver log |
|----|---------------------|------|-------|----------|------------|
| 62 | application_1623853832952_0044 | pyspark | idle | Link | Link |

SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

## Create transformation function (optional)

Transformation functions are Python functions that receive a feature value as input and return the result of applying a specific transformation to it. You can define your own Python functions to transform feature values. These functions are created at the feature store level and are used when generating training datasets by attaching them to specific features in the dataset.

To attach a transformation function to a training dataset, the function must either be part of a library installed in Hopsworks or be attached when starting the Jupyter notebook. For more information about transformation functions, see <https://docs.hopsworks.ai/feature-store-api/latest/generated/transformation_functions/>.

The example code snippet below shows how to create a transformation function in the Hopsworks Feature Store.

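```python
# The module providing the function must be available to the notebook (see above).
from hsfs_transformers import transformers

# Create the transformation function metadata at the feature store level.
normalize_meta = fs.create_transformation_function(
    transformation_function=transformers.normalize,
    output_type=int,  # declared type of the value returned by the function
    version=1,
)
# Persist the transformation function in the feature store.
normalize_meta.save()
```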
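
The `hsfs_transformers` module is not shown in this note, so the body of `transformers.normalize` is unknown. As a rough illustration of the contract described above (one feature value in, one transformed value out), a min-max scaling function might look like the sketch below. The name `min_max_normalize` and its `min_value` and `max_value` parameters are assumptions made for this example and are not part of the Hopsworks API.

```python
from typing import Optional


def min_max_normalize(value: Optional[float],
                      min_value: float = 0.0,
                      max_value: float = 100.0) -> Optional[float]:
    """Scale one feature value linearly so that [min_value, max_value] maps to [0, 1]."""
    if value is None:
        return None  # propagate missing values unchanged
    if max_value == min_value:
        return 0.0  # constant feature: avoid division by zero
    return (float(value) - min_value) / (max_value - min_value)


print(min_max_normalize(25.0))  # 0.25
```

A function like this returns fractional values, so it would presumably be registered with `output_type=float` rather than `int`.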

## Get feature groups

In [5]:
card_transactions_10m_agg = fs.get_feature_group("card_transactions_10m_agg", version=1)
card_transactions_1h_agg = fs.get_feature_group("card_transactions_1h_agg", version=1)
card_transactions_12h_agg = fs.get_feature_group("card_transactions_12h_agg", version=1)

## Create training dataset

In [6]:
query = card_transactions_10m_agg.select(["stdev_amt_per_10m", "avg_amt_per_10m", "num_trans_per_10m"])\
                                 .join(card_transactions_1h_agg.select(["stdev_amt_per_1h", "avg_amt_per_1h", "num_trans_per_1h"]))\
                                 .join(card_transactions_12h_agg.select(["stdev_amt_per_12h", "avg_amt_per_12h", "num_trans_per_12h"]))
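
The query above names no join keys. The sketch below assumes that hsfs then joins the feature groups on their shared primary key columns with an inner join. Both points are assumptions to check against the hsfs documentation. The code mimics that behaviour in plain Python on toy rows; the key name `cc_num` and all values are hypothetical, and the key is kept in the result only for readability.

```python
def inner_join(left, right, key):
    """Inner-join two lists of row dicts on `key`.

    Assumes `key` is unique within `right`; rows without a match are dropped.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Toy rows; `cc_num` and all values are illustrative only.
agg_10m = [{"cc_num": 1, "avg_amt_per_10m": 10.0}, {"cc_num": 2, "avg_amt_per_10m": 20.0}]
agg_1h = [{"cc_num": 1, "avg_amt_per_1h": 11.0}, {"cc_num": 2, "avg_amt_per_1h": 21.0}]
agg_12h = [{"cc_num": 1, "avg_amt_per_12h": 12.0}]

rows = inner_join(inner_join(agg_10m, agg_1h, "cc_num"), agg_12h, "cc_num")
print(rows)
```

Only the row for key `1` survives, because key `2` has no entry in the 12-hour aggregate. Under the inner-join assumption, the same would apply to the training dataset: an entity missing from any one of the three feature groups would not appear in it.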

In [7]:
td_meta = fs.create_training_dataset(name="card_fraud_model",
                               description="Training dataset to train card fraud model",
                               data_format="tfrecord",
                               statistics_config={"enabled": True, "histograms": True, "correlations": False},
                               version=1,
#                              NOTE: To attach transformation functions to the training dataset, pass them as a dict
#                                    where each key is a feature name and each value is the transformation function to apply.
#                              transformation_functions={"stdev_amt_per_10m":normalize_meta,
#                                                        "avg_amt_per_10m":normalize_meta,
#                                                        ...
#                                                       }
                               )
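
The commented-out `transformation_functions` argument above maps each feature name to a transformation function. When the same function applies to many features, the dict can be built with a comprehension instead of being typed out. In this sketch `normalize_meta` is a placeholder string so that the snippet runs outside Hopsworks; in the notebook it would be the object returned by `fs.create_transformation_function(...)`. The choice to transform only the amount features is an assumption made for illustration.

```python
# Placeholder standing in for the transformation function object created earlier.
normalize_meta = "normalize_meta_placeholder"

# Amount features taken from the query used in this note.
amount_features = [
    "stdev_amt_per_10m", "avg_amt_per_10m",
    "stdev_amt_per_1h", "avg_amt_per_1h",
    "stdev_amt_per_12h", "avg_amt_per_12h",
]

# Map every selected feature name to the same transformation function.
transformation_functions = {name: normalize_meta for name in amount_features}
print(sorted(transformation_functions))
```

The resulting dict could then be passed as `transformation_functions=transformation_functions` in the call above.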

In [8]:
td_meta.save(query)

<hsfs.training_dataset.TrainingDataset object at 0x7f98230f4410>

In [9]:
td_meta.read().show()

+---------------+---------------+----------------+--------------+----------------+-----------------+-----------------+-----------------+-----------------+
|avg_amt_per_10m|avg_amt_per_12h|stdev_amt_per_1h|avg_amt_per_1h|num_trans_per_1h|num_trans_per_12h|stdev_amt_per_12h|num_trans_per_10m|stdev_amt_per_10m|
+---------------+---------------+----------------+--------------+----------------+-----------------+-----------------+-----------------+-----------------+
|       1.005665|        1.03796|       1.0266877|     1.0247675|          1.0015|           1.0065|        1.0697409|            1.001|           1.0005|
|        1.03907|       1.139612|          1.0005|       1.02391|           1.001|            1.003|        1.1538386|            1.001|           1.0005|
|        1.04755|       1.318325|          1.0005|       5.18205|           1.001|            1.009|        1.9687029|            1.001|           1.0005|
|       1.043745|      1.0632564|          1.0005|       1.02109|     

## Check descriptive statistics

In [10]:
td_meta = fs.get_training_dataset("card_fraud_model", 1)
statistics = td_meta.get_statistics()

# Each value in statistics.content is a list of per-feature statistics dicts.
for feature_stats in statistics.content.values():
    for stats in feature_stats:
        print("Feature: " + str(stats['column']))
        print(stats, end="\n\n")

Feature: num_trans_per_1h
{'dataType': 'Fractional', 'column': 'num_trans_per_1h', 'sum': 100.1280027627945, 'completeness': 1, 'histogram': [{'count': 10, 'value': '1.002', 'ratio': 0.1}, {'count': 1, 'value': '1.0025', 'ratio': 0.01}, {'count': 33, 'value': '1.0015', 'ratio': 0.33}, {'count': 56, 'value': '1.001', 'ratio': 0.56}], 'distinctness': 0.04, 'entropy': 0.9668672345930647, 'approximateNumDistinctValues': 4, 'isDataTypeInferred': 'false', 'uniqueness': 0.01, 'mean': 1.001280027627945, 'maximum': 1.002500057220459, 'stdDev': 0.0003557872363590212, 'minimum': 1.0010000467300415, 'approxPercentiles': []}
Feature: stdev_amt_per_10m
{'dataType': 'Fractional', 'column': 'stdev_amt_per_10m', 'sum': 103.46918630599976, 'completeness': 1, 'histogram': [{'count': 1, 'value': '1.0158194', 'ratio': 0.01}, {'count': 1, 'value': '1.0237107', 'ratio': 0.01}, {'count': 1, 'value': '4.2807717', 'ratio': 0.01}, {'count': 96, 'value': '1.0005', 'ratio': 0.96}, {'count': 1, 'value': '1.1008879'
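
Several of the reported metrics can be recomputed from the histogram alone, which helps when interpreting them. The sketch below uses the `num_trans_per_1h` histogram printed above and assumes the following definitions: distinctness is the number of distinct values divided by the number of rows, uniqueness is the number of values that occur exactly once divided by the number of rows, and entropy is computed with the natural logarithm. These definitions are assumptions, although they reproduce the reported numbers for this feature.

```python
import math

# Histogram of num_trans_per_1h, copied from the statistics output above.
histogram = [
    {"count": 10, "value": "1.002"},
    {"count": 1, "value": "1.0025"},
    {"count": 33, "value": "1.0015"},
    {"count": 56, "value": "1.001"},
]

total = sum(bucket["count"] for bucket in histogram)

# Distinct values relative to the number of rows.
distinctness = len(histogram) / total

# Values that occur exactly once relative to the number of rows.
uniqueness = sum(1 for bucket in histogram if bucket["count"] == 1) / total

# Shannon entropy of the value distribution, in nats.
entropy = -sum(
    (bucket["count"] / total) * math.log(bucket["count"] / total)
    for bucket in histogram
)

print(total, distinctness, uniqueness, round(entropy, 6))
```

The results (100 rows, distinctness 0.04, uniqueness 0.01, entropy of about 0.9669) agree with the `distinctness`, `uniqueness`, and `entropy` fields in the output for `num_trans_per_1h`.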