---
title: "Create training dataset from online feature store enabled feature groups"
date: 2021-04-25
type: technical_note
draft: false
---

# Create a training dataset

![overview-4.png](./images/overview-4.png)

## Establish a connection with your Hopsworks feature store

In [1]:
import hsfs
connection = hsfs.connection()
# Get a reference to the project's feature store. To access a shared feature store, pass its name.
fs = connection.get_feature_store()

Starting Spark application

| ID | YARN Application ID | Kind | State | Spark UI | Driver log |
|----|---------------------|------|-------|----------|------------|
| 12 | application_1624217841399_0004 | pyspark | idle | Link | Link |

SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

## Get feature groups

In [2]:
card_transactions_10m_agg = fs.get_feature_group("card_transactions_10m_agg", version=1)
card_transactions_1h_agg = fs.get_feature_group("card_transactions_1h_agg", version=1)
card_transactions_12h_agg = fs.get_feature_group("card_transactions_12h_agg", version=1)

## Create training dataset

In [3]:
query = (
    card_transactions_10m_agg.select(["stdev_amt_per_10m", "avg_amt_per_10m", "num_trans_per_10m"])
    .join(card_transactions_1h_agg.select(["stdev_amt_per_1h", "avg_amt_per_1h", "num_trans_per_1h"]))
    .join(card_transactions_12h_agg.select(["stdev_amt_per_12h", "avg_amt_per_12h", "num_trans_per_12h"]))
)
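
No join keys are passed to `join` in the query. In that case hsfs is documented to join on the primary-key columns that the feature groups have in common, and the default join type has differed between hsfs versions. The pure-Python sketch below only illustrates the idea for an inner join. The key name `cc_num`, the helper `inner_join_on_key`, and all row values are hypothetical and are not taken from the feature groups.

```python
def inner_join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on `key`; rows without a match are dropped."""
    # Index the right side by key so each left row is matched with a dict lookup.
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        for right in right_index.get(left[key], []):
            merged = dict(left)
            # Copy the right-hand features, keeping a single copy of the key column.
            merged.update({k: v for k, v in right.items() if k != key})
            joined.append(merged)
    return joined


# Hypothetical rows: `cc_num` is an assumed primary key, values are placeholders.
agg_10m = [{"cc_num": 1, "num_trans_per_10m": 1}, {"cc_num": 2, "num_trans_per_10m": 4}]
agg_1h = [{"cc_num": 1, "num_trans_per_1h": 2}, {"cc_num": 3, "num_trans_per_1h": 7}]

rows = inner_join_on_key(agg_10m, agg_1h, "cc_num")
print(rows)
```

Only the key present on both sides survives an inner join, so a training dataset built this way can have fewer rows than any single feature group.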

In [4]:
query.show(20)

+-----------------+---------------+-----------------+------------------+------------------+----------------+------------------+------------------+-----------------+
|stdev_amt_per_10m|avg_amt_per_10m|num_trans_per_10m|  stdev_amt_per_1h|    avg_amt_per_1h|num_trans_per_1h| stdev_amt_per_12h|   avg_amt_per_12h|num_trans_per_12h|
+-----------------+---------------+-----------------+------------------+------------------+----------------+------------------+------------------+-----------------+
|              NaN|          12.86|                1|  414.357502707505|325.96500000000003|               2| 2206.836719540488|          1170.733|               10|
|              NaN|        2889.73|                1|119.21702101629616|             71.92|               3|1135.5953586118019|501.29560000000004|               25|
|              NaN|          23.88|                1|14.191633098414009|            57.465|               2|2472.6079992350305|1454.2791666666665|               12|
|         
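
The `NaN` values in `stdev_amt_per_10m` all appear in rows where `num_trans_per_10m` is 1. That is what one would expect if the aggregate is a sample standard deviation, which is undefined for a single observation. Whether the upstream pipeline computes it this way is an assumption here. A minimal sketch, using a hypothetical helper `sample_stdev`:

```python
import math


def sample_stdev(values):
    """Sample standard deviation (n - 1 denominator); NaN when fewer than two values."""
    n = len(values)
    if n < 2:
        # With one observation the n - 1 denominator is zero, so the result is undefined.
        return float("nan")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance)


# A window holding a single transaction has no defined sample standard deviation.
single = sample_stdev([12.86])
# Two or more transactions give a finite value.
pair = sample_stdev([1.0, 3.0])
print(single, pair)
```

How to handle these undefined values before training (dropping, imputing, or leaving them to the model) is a modelling decision that this notebook does not make.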

In [5]:
td_meta = fs.create_training_dataset(name="card_fraud_model",
                                     description="Training dataset to train card fraud model",
                                     data_format="tfrecord",
                                     statistics_config={"enabled": True, "histograms": True, "correlations": False},
                                     version=1)

In [6]:
td_meta.save(query)

<hsfs.training_dataset.TrainingDataset object at 0x7fd8b2329dd0>

In [7]:
td_meta.read().show()

++
||
++
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
++
only showing top 20 rows

## Save to CSV

In [14]:
td_meta_csv = fs.create_training_dataset(name="card_fraud_model_csv",
                                         description="Training dataset to train card fraud model CSV",
                                         data_format="csv",
                                         statistics_config={"enabled": False, "histograms": False, "correlations": False},
                                         version=1)

In [15]:
td_meta_csv.save(query)

<hsfs.training_dataset.TrainingDataset object at 0x7fd8b22bcf90>

## Check descriptive statistics

In [8]:
td_meta = fs.get_training_dataset("card_fraud_model", 1)
statistics = td_meta.get_statistics()

# Each value in statistics.content is iterated as a list of per-feature statistics dictionaries
for feat_list in statistics.content.values():
    for stats in feat_list:
        print("Feature: " + str(stats['column']))
        print(stats)

An error was encountered:
__init__() missing 1 required positional argument: 'feature_group_commit_id'
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/training_dataset.py", line 568, in get_statistics
    return self.statistics
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/training_dataset.py", line 554, in statistics
    return self._statistics_engine.get_last(self)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/statistics_engine.py", line 78, in get_last
    return self._statistics_api.get_last(metadata_instance)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/statistics_api.py", line 90, in get_last
    _client._send_request("GET", path_params, query_params, headers=headers)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/statistics.py", line 46, in from_response_json
    return cls(**json_decamelized["items"][0])
Typ