---
title: "Create training dataset from online feature store enabled feature groups"
date: 2021-04-25
type: technical_note
draft: false
---

## Establish a connection with your Hopsworks feature store

In [1]:
import hsfs
connection = hsfs.connection()
# Get a reference to the project's feature store. To access a shared feature store, pass its name.
fs = connection.get_feature_store()

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log |
|----|---------------------|------|-------|----------|------------|
| 24 | application_1624276539905_0004 | pyspark | idle | Link | Link |


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

## Get feature groups

In [2]:
card_transactions_10m_agg = fs.get_feature_group("card_transactions_10m_agg", version=1)
card_transactions_1h_agg = fs.get_feature_group("card_transactions_1h_agg", version=1)
card_transactions_12h_agg = fs.get_feature_group("card_transactions_12h_agg", version=1)

## Create training dataset

In [14]:
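# Join the 10-minute, 1-hour and 12-hour aggregates into a single query.
# No join key is passed, so the join relies on hsfs's default key matching.
query = card_transactions_10m_agg.select(["cc_num", "stdev_amt_per_10m", "avg_amt_per_10m", "num_trans_per_10m"])\
                                 .join(card_transactions_1h_agg.select(["stdev_amt_per_1h", "avg_amt_per_1h", "num_trans_per_1h"]))\
                                 .join(card_transactions_12h_agg.select(["stdev_amt_per_12h", "avg_amt_per_12h", "num_trans_per_12h"]))

# Define the training dataset metadata; nothing is written until save() is called.
td_meta = fs.create_training_dataset(name="card_fraud_model3",
                                     description="Training dataset to train card fraud model",
                                     data_format="tfrecord",
                                     version=1)

# Materialize the joined features in TFRecord format.
td_meta.save(query)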

<hsfs.training_dataset.TrainingDataset object at 0x7fa2f91c7b10>
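
The query joins three feature groups without naming a join key. As far as the hsfs documentation describes it, `join` then matches on the feature groups' shared primary keys and performs an inner join by default. The plain-Python sketch below illustrates that behaviour only, assuming `cc_num` is the primary key of all three groups. The card numbers and values are made up, and `inner_join` is a hypothetical helper, not part of hsfs.

```python
# Hypothetical per-window aggregates keyed by card number (illustrative values only).
agg_10m = {"1111": {"avg_amt_per_10m": 35.0}, "2222": {"avg_amt_per_10m": 90.0}}
agg_1h = {"1111": {"avg_amt_per_1h": 40.0}, "3333": {"avg_amt_per_1h": 12.0}}
agg_12h = {"1111": {"avg_amt_per_12h": 60.0}, "2222": {"avg_amt_per_12h": 100.0}}


def inner_join(left, right):
    """Keep only keys present on both sides and merge their feature dicts."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


# Chain the joins in the same order as the query: 10m, then 1h, then 12h.
joined = inner_join(inner_join(agg_10m, agg_1h), agg_12h)
```

Only card `"1111"` appears in all three inputs, so it is the only key left in `joined`. If the real joins behave the same way, cards missing from any one window are absent from the training dataset.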

In [15]:
query.show(10)

+----------------+------------------+---------------+-----------------+------------------+------------------+----------------+------------------+------------------+-----------------+
|          cc_num| stdev_amt_per_10m|avg_amt_per_10m|num_trans_per_10m|  stdev_amt_per_1h|    avg_amt_per_1h|num_trans_per_1h| stdev_amt_per_12h|   avg_amt_per_12h|num_trans_per_12h|
+----------------+------------------+---------------+-----------------+------------------+------------------+----------------+------------------+------------------+-----------------+
|4925013053127624|               NaN|          35.63|                1|3235.3105087765534|2817.2200000000003|               2|52.463931769041736| 63.98166666666666|                6|
|4253589902657103|               NaN|          92.74|                1|               NaN|              82.2|               1|125.92984495345019|105.03999999999999|                5|
|4223253728365626|               NaN|        1500.78|                1|              

In [11]:
card_transactions_10m_agg.select(["stdev_amt_per_10m", "avg_amt_per_10m", "num_trans_per_10m"]).show(10)

+-----------------+---------------+-----------------+
|stdev_amt_per_10m|avg_amt_per_10m|num_trans_per_10m|
+-----------------+---------------+-----------------+
|              NaN|          82.28|                1|
|              NaN|          12.77|                1|
|              NaN|          88.43|                1|
|              NaN|          32.22|                1|
|              NaN|          42.09|                1|
|              NaN|         459.94|                1|
|              NaN|          95.11|                1|
|              NaN|        1308.17|                1|
|              NaN|         776.22|                1|
|              NaN|          92.62|                1|
+-----------------+---------------+-----------------+
only showing top 10 rows
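
Every row shown has `num_trans_per_10m` equal to 1 and a `NaN` value in `stdev_amt_per_10m`. That pattern is consistent with a sample standard deviation, which divides by n - 1 and is therefore undefined for a single observation. The sketch below assumes the feature was computed that way; `sample_stdev` is a hypothetical helper written for illustration, not the code that produced these feature groups.

```python
import math


def sample_stdev(values):
    """Sample standard deviation (n - 1 denominator); NaN for fewer than two values."""
    n = len(values)
    if n < 2:
        # With one observation the formula reduces to 0 / 0, which is undefined.
        return float("nan")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance)


single = sample_stdev([82.28])      # one transaction in the window
pair = sample_stdev([10.0, 14.0])   # two transactions in the window
```

`single` is `NaN`, while `pair` is a finite number. Since `NaN` values propagate through most numeric operations, they usually need to be imputed or filtered before the training dataset is fed to a model.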