---
title: "Create training dataset from online feature store enabled feature groups"
date: 2021-04-25
type: technical_note
draft: false
---

## Establish a connection with your Hopsworks feature store

In [1]:
import hsfs
connection = hsfs.connection()
# Get a reference to the project's feature store; to access a shared feature store instead, pass its name
fs = connection.get_feature_store()

Starting Spark application
SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

## Register transformation function

In [2]:
from hsfs_transformers import scalers
min_max_normalizer = fs.create_transformation_function(transformation_function=scalers.min_max, 
                                                        output_type=float, 
                                                        version=1)
min_max_normalizer.save()

In [3]:
min_max = fs.get_transformation_function(name="min_max")
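
For readers unfamiliar with the technique, the following is a minimal sketch of min-max scaling in plain Python. It does not reproduce the implementation of `scalers.min_max`. The function name `min_max_scale`, its `min_value`/`max_value` parameters, and the sample values are assumptions made for illustration only.

```python
def min_max_scale(value, min_value, max_value):
    """Linearly rescale value so that [min_value, max_value] maps to [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range: avoid division by zero and return the lower bound.
        return 0.0
    return (value - min_value) / (max_value - min_value)


# Hypothetical sample values, not taken from the feature groups in this note.
amounts = [10.0, 25.0, 40.0]
lo, hi = min(amounts), max(amounts)
scaled = [min_max_scale(a, lo, hi) for a in amounts]
print(scaled)
```

A transformation function attached to a training dataset is applied per feature value in the same spirit, which is why a plain Python function that takes one value and returns a `float` fits the `output_type=float` registration above.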

## Get feature groups

In [4]:
card_transactions_10m_agg = fs.get_feature_group("card_transactions_10m_agg", version=1)
card_transactions_1h_agg = fs.get_feature_group("card_transactions_1h_agg", version=1)
card_transactions_12h_agg = fs.get_feature_group("card_transactions_12h_agg", version=1)

## Create training dataset

In [5]:
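# Select the aggregate features from each feature group and join them into a single query.
# No join key is given, so HSFS joins on the feature groups' matching primary keys.
query = card_transactions_10m_agg.select(["stdev_amt_per_10m", "avg_amt_per_10m", "num_trans_per_10m"])\
                                 .join(card_transactions_1h_agg.select(["stdev_amt_per_1h", "avg_amt_per_1h", "num_trans_per_1h"]))\
                                 .join(card_transactions_12h_agg.select(["stdev_amt_per_12h", "avg_amt_per_12h", "num_trans_per_12h"]))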

In [6]:
td_meta = fs.create_training_dataset(name="card_fraud_model",
                               description="Training dataset to train card fraud model",
                               data_format="tfrecord",      
                               transformation_functions={"stdev_amt_per_10m":min_max,
                                                         "avg_amt_per_10m":min_max,
                                                         "num_trans_per_10m":min_max,
                                                         "stdev_amt_per_1h":min_max, 
                                                         "avg_amt_per_1h":min_max, 
                                                         "num_trans_per_1h":min_max,
                                                         "stdev_amt_per_12h":min_max, 
                                                         "avg_amt_per_12h":min_max, 
                                                         "num_trans_per_12h":min_max},
                               statistics_config={"enabled": True, "histograms": True, "correlations": False},
                               version=1)

In [7]:
td_meta.save(query)

<hsfs.training_dataset.TrainingDataset object at 0x7fdd28a66090>

In [8]:
td_meta.read().show()

+---------------+---------------+----------------+--------------+----------------+-----------------+-----------------+-----------------+-----------------+
|avg_amt_per_10m|avg_amt_per_12h|stdev_amt_per_1h|avg_amt_per_1h|num_trans_per_1h|num_trans_per_12h|stdev_amt_per_12h|num_trans_per_10m|stdev_amt_per_10m|
+---------------+---------------+----------------+--------------+----------------+-----------------+-----------------+-----------------+-----------------+
|       1.005665|        1.03796|       1.0266877|     1.0247675|          1.0015|           1.0065|        1.0697409|            1.001|           1.0005|
|        1.03907|       1.139612|          1.0005|       1.02391|           1.001|            1.003|        1.1538386|            1.001|           1.0005|
|        1.04755|       1.318325|          1.0005|       5.18205|           1.001|            1.009|        1.9687029|            1.001|           1.0005|
|       1.043745|      1.0632564|          1.0005|       1.02109|     