## Create Retrieval Dataset

In this notebook, we'll create a dataset for our retrieval model.

In [6]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

First, we'll load the feature groups we created in the previous tutorial.

In [7]:
trans_fg = fs.get_feature_group("transactions", version=2)
customers_fg = fs.get_feature_group("customers", version=1)
articles_fg = fs.get_feature_group("articles", version=1)

We'll need to join these three data sources to make the data compatible with our retrieval model. Recall that each row in the `transactions` feature group records which customer bought which item. We'll join this feature group with the `customers` and `articles` feature groups, on `customer_id` and `article_id` respectively, to add customer and item features to each row.

In [8]:
query = trans_fg.select(["customer_id", "article_id", "month_sin", "month_cos"])\
    .join(customers_fg.select(["age"]), on="customer_id")\
    .join(articles_fg.select(["garment_group_name", "index_group_name"]), on="article_id")
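To make the effect of the two joins concrete, here is a minimal sketch in plain Python. It does not use the Hopsworks API; the toy rows, the `inner_join` helper, and all values are hypothetical. It assumes inner-join semantics, so a transaction without a matching customer or article is dropped.

```python
# Toy stand-ins for the three feature groups (hypothetical values).
transactions = [
    {"customer_id": "c1", "article_id": 101, "month_sin": 0.0, "month_cos": 1.0},
    {"customer_id": "c2", "article_id": 102, "month_sin": 0.5, "month_cos": 0.866},
    # This customer has no entry in `customers`, so the row is dropped.
    {"customer_id": "c3", "article_id": 101, "month_sin": 0.0, "month_cos": 1.0},
]
customers = {"c1": {"age": 25.0}, "c2": {"age": 41.0}}
articles = {
    101: {"garment_group_name": "Swimwear", "index_group_name": "Ladieswear"},
    102: {"garment_group_name": "Special Offers", "index_group_name": "Ladieswear"},
}


def inner_join(rows, lookup, key):
    """Append the lookup's columns to each row; drop rows whose key has no match."""
    return [{**row, **lookup[row[key]]} for row in rows if row[key] in lookup]


# Join customers first, then articles, mirroring the order of the query above.
joined = inner_join(inner_join(transactions, customers, "customer_id"), articles, "article_id")
for row in joined:
    print(row)
```

Each surviving row carries the transaction columns followed by `age`, `garment_group_name`, and `index_group_name`, which is the shape the retrieval model consumes.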

### Feature View Creation

In Hopsworks, you write features to feature groups, where they are stored, and you read features from feature views. A feature view is a logical view over features held in one or more feature groups, and it typically contains the features used by a specific model. This lets features stored in different feature groups be reused across many models.

In [4]:
feature_view = fs.create_feature_view(
    name='retrieval',
    query=query
)

Feature view created successfully, explore it at https://35.205.254.211/p/120/fs/68/fv/retrieval/version/1


To explore the data in the feature view, we can retrieve it as a batch with the `get_batch_data()` method.

In [5]:
feature_view.get_batch_data().head(5)

2022-06-07 12:13:44,432 INFO: USE `rec_featurestore`
2022-06-07 12:13:45,165 INFO: SELECT `fg2`.`customer_id` `customer_id`, `fg2`.`article_id` `article_id`, `fg2`.`month_sin` `month_sin`, `fg2`.`month_cos` `month_cos`, `fg0`.`age` `age`, `fg1`.`garment_group_name` `garment_group_name`, `fg1`.`index_group_name` `index_group_name`
FROM `rec_featurestore`.`transactions_1` `fg2`
INNER JOIN `rec_featurestore`.`customers_1` `fg0` ON `fg2`.`customer_id` = `fg0`.`customer_id`
INNER JOIN `rec_featurestore`.`articles_1` `fg1` ON `fg2`.`article_id` = `fg1`.`article_id`


| | customer_id | article_id | month_sin | month_cos | age | garment_group_name | index_group_name |
|---|---|---|---|---|---|---|---|
| 0 | 000226b9ea81019249060b376b516f821a80e9b24f89a7... | 821674001 | 0.0 | 1.0 | 25.0 | Special Offers | Ladieswear |
| 1 | 000226b9ea81019249060b376b516f821a80e9b24f89a7... | 744787003 | 1.224647e-16 | -1.0 | 25.0 | Swimwear | Ladieswear |
| 2 | 000226b9ea81019249060b376b516f821a80e9b24f89a7... | 774546002 | 1.224647e-16 | -1.0 | 25.0 | Swimwear | Ladieswear |
| 3 | 000226b9ea81019249060b376b516f821a80e9b24f89a7... | 590928001 | 1.224647e-16 | -1.0 | 25.0 | Swimwear | Ladieswear |
| 4 | 000226b9ea81019249060b376b516f821a80e9b24f89a7... | 699240005 | 1.224647e-16 | -1.0 | 25.0 | Swimwear | Ladieswear |


### Training Dataset Creation

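Finally, we can create our training dataset from the feature view. The call below writes it as CSV and splits the rows 80/20 into `train` and `validation` sets.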

In [5]:
# TODO: switch to a chronological split.

td = feature_view.create_training_dataset(
    description="retrieval_dataset_split",
    data_format="csv",
    splits={"train": 80, "validation": 20},
    train_split="train",
)

Training dataset job started successfully, you can follow the progress at https://35.205.254.211/p/120/jobs/named/retrieval_1_1_create_fv_td_07062022123518/executions
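The TODO in the cell above mentions a chronological split. As a rough illustration of that idea, the sketch below sorts rows by date and cuts once, so every validation row is at least as recent as every training row. It is plain Python rather than the Hopsworks API; the `chronological_split` helper, the `t_dat` column name, and the sample rows are all assumptions made for the example.

```python
from datetime import date


def chronological_split(rows, time_key, train_fraction=0.8):
    """Sort rows by time and cut once: older rows train, newer rows validate."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical transactions in arbitrary order; `t_dat` is an assumed column name.
rows = [
    {"article_id": 3, "t_dat": date(2020, 9, 1)},
    {"article_id": 1, "t_dat": date(2020, 3, 15)},
    {"article_id": 5, "t_dat": date(2020, 12, 24)},
    {"article_id": 2, "t_dat": date(2020, 6, 30)},
    {"article_id": 4, "t_dat": date(2020, 10, 5)},
]

train, validation = chronological_split(rows, "t_dat")
print([r["article_id"] for r in train])
print([r["article_id"] for r in validation])
```

Splitting by time rather than by proportion alone keeps later purchases out of the training set, which more closely resembles how a recommender is evaluated on interactions it has not yet seen.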




### Next Steps

In the next notebook, we'll train a model on the dataset we created in this notebook.