---

## 🧑🏻‍🏫 HSFS feature exploration

In this notebook we are going to walk through how to use the HSFS library to explore feature groups and features in the Hopsworks Feature Store. 

A key purpose of the Hopsworks feature store is to enable sharing and reuse of features across models and use cases. To that end, the HSFS library lets users join features from different feature groups and use them to create training datasets.
Features can also come from different feature stores (projects), as long as the user running the notebook has read access to them.

![Join](../images/join.svg "Join")

As in the [feature_engineering](./feature_engineering.ipynb) notebook, the first step is to establish a connection with the feature store and retrieve the project's feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;"> 🕵🏻‍♂️ Explore feature groups </span>

You can interact with feature groups as if they were dataframes. A feature group object has a `show()` method, which shows the first `n` rows, and a `read()` method, which reads the content of the feature group into a dataframe. The dataframe type depends on the engine: in the run recorded in this notebook, `read()` returns a Pandas DataFrame (see the `type()` check further down), while on a Spark engine it returns a Spark DataFrame.

The first step for any operation on a feature group is to get its handle from the feature store. The `get_feature_group` method accepts the name of the feature group and an optional parameter with the version you want to select. If you do not provide a version, the API defaults to version 1.

In [2]:
sales_fg = fs.get_feature_group(
    name = "sales_fg",
    version = 1
)

In [3]:
sales_fg.select(['date','weekly_sales','is_holiday', 'sales_last_30_days_store_dep', 'sales_last_365_days_store_dep']).show(5)

2022-06-02 16:44:15,348 INFO: USE `basics_featurestore`
2022-06-02 16:44:16,021 INFO: SELECT `fg0`.`date` `date`, `fg0`.`weekly_sales` `weekly_sales`, `fg0`.`is_holiday` `is_holiday`, `fg0`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg0`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`
FROM `basics_featurestore`.`sales_fg_1` `fg0`


Unnamed: 0,date,weekly_sales,is_holiday,sales_last_30_days_store_dep,sales_last_365_days_store_dep
0,1319760000000,1580.72,0,38203.6,1509182.55
1,1327622400000,1085.29,0,40684.41,2364653.36
2,1348790400000,31636.34,0,1007916.14,10854506.76
3,1328832000000,57965.26,1,1605272.88,17214949.35
4,1345161600000,4658.94,0,137756.87,1402688.64
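The `date` values in the output above look like Unix epoch timestamps in milliseconds. That reading is an assumption based on their magnitude, not something the notebook states. Under that assumption, a minimal sketch for turning one of them into a readable date (the helper name `epoch_ms_to_date` is made up for illustration) could be:

```python
from datetime import datetime, timezone


def epoch_ms_to_date(ms: int) -> str:
    """Convert Unix epoch milliseconds to an ISO date string (UTC)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date().isoformat()


# First `date` value from the output above, assumed to be epoch milliseconds.
print(epoch_ms_to_date(1319760000000))  # 2011-10-28
```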


In [4]:
sales_df = sales_fg.read()
sales_df[sales_df.store == 20].head()

2022-06-02 16:44:21,450 INFO: USE `basics_featurestore`
2022-06-02 16:44:22,126 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`date` `date`, `fg0`.`weekly_sales` `weekly_sales`, `fg0`.`is_holiday` `is_holiday`, `fg0`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg0`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg0`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg0`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg0`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg0`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg0`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg0`.`sales_last_365_days_store` `sales_last_365_days_store`
FROM `basics_featurestore`.`sales_fg_1` `fg0`


Unnamed: 0,store,dept,date,weekly_sales,is_holiday,sales_last_30_days_store_dep,sales_last_30_days_store,sales_last_90_days_store_dep,sales_last_90_days_store,sales_last_180_days_store_dep,sales_last_180_days_store,sales_last_365_days_store_dep,sales_last_365_days_store
20,20,10,1272585600000,38689.67,0,1265416.71,1265416.71,3698669.44,3698669.44,7947091.44,7947091.44,20285046.5,20285046.5
69,20,26,1318550400000,12179.93,0,341381.54,341381.54,1131613.08,1131613.08,2692422.49,2692422.49,4718249.57,4718249.57
92,20,26,1288310400000,12381.82,0,358279.66,358279.66,1369780.14,1369780.14,2966871.83,2966871.83,5673978.67,5673978.67
217,20,94,1337299200000,76452.37,0,1802970.94,1802970.94,5503827.57,5503827.57,10818023.38,10818023.38,32292157.14,32292157.14
235,20,98,1279238400000,18676.09,0,608671.58,608671.58,2188817.17,2188817.17,4151765.19,4151765.19,28246544.54,28246544.54


In [5]:
print(f'⛳️ Type: {type(sales_df)}')

⛳️ Type: <class 'pandas.core.frame.DataFrame'>


You can also inspect the metadata of the feature group. For instance, you can list the features that make up the feature group and check whether each one is a primary key or a partition key:

In [6]:
print("Name: {}".format(sales_fg.name))
print("Description: {}".format(sales_fg.description))
print("Features:")
features = sales_fg.features
for feature in features:
    print("{:<60} \t Primary: {} \t Partition: {}".format(feature.name, feature.primary, feature.partition))

Name: sales_fg
Description: Sales related features
Features:
store                                                        	 Primary: True 	 Partition: False
dept                                                         	 Primary: True 	 Partition: False
date                                                         	 Primary: True 	 Partition: False
weekly_sales                                                 	 Primary: False 	 Partition: False
is_holiday                                                   	 Primary: False 	 Partition: False
sales_last_30_days_store_dep                                 	 Primary: False 	 Partition: False
sales_last_30_days_store                                     	 Primary: False 	 Partition: False
sales_last_90_days_store_dep                                 	 Primary: False 	 Partition: False
sales_last_90_days_store                                     	 Primary: False 	 Partition: False
sales_last_180_days_store_dep                                	 Primar

If you are interested in only a subset of features, you can use the `select()` method on the feature group object to select a list of features. The result of `select()` behaves like a feature group, so you can call the `.show()` or `.read()` methods on it.

In [7]:
sales_fg.select(['store', 'dept', 'weekly_sales']).show(5)

2022-06-02 16:44:31,468 INFO: USE `basics_featurestore`
2022-06-02 16:44:32,139 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`weekly_sales` `weekly_sales`
FROM `basics_featurestore`.`sales_fg_1` `fg0`


Unnamed: 0,store,dept,weekly_sales
0,9,19,1580.72
1,45,19,1085.29
2,41,4,31636.34
3,19,40,57965.26
4,15,20,4658.94


If your feature group is available both online and offline, you can use the `online` option of the `show()` and `read()` methods to specify if you want to read your feature group from online storage.

In [8]:
sales_fg_3 = fs.get_feature_group(
    name = 'sales_fg',
    version = 3
)

sales_fg_3.select(['store', 'dept', 'weekly_sales']).show(5, online=True)

Unnamed: 0,store,dept,weekly_sales
0,1,1,16328.72
1,1,1,18820.29
2,1,1,15295.55
3,1,1,14537.37
4,1,2,44623.23


---

## <span style="color:#ff5f27;"> 👮🏼‍♀️ Filter Feature Groups </span>

If you do not want to read your feature group into a dataframe, you can also filter directly on a `FeatureGroup` object. Applying a filter to a feature group returns a `Query` object, which can then be joined with other feature groups or queries, or materialized as a training dataset.

In [9]:
sales_fg.select(['store', 'dept', 'weekly_sales']).filter(sales_fg.weekly_sales >= 50000).show(5)

2022-06-02 16:44:40,787 INFO: USE `basics_featurestore`
2022-06-02 16:44:41,462 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`weekly_sales` `weekly_sales`
FROM `basics_featurestore`.`sales_fg_1` `fg0`
WHERE `fg0`.`weekly_sales` >= 50000


Unnamed: 0,store,dept,weekly_sales
0,19,40,57965.26
1,40,90,60056.63
2,6,90,50338.28
3,4,40,82799.69
4,2,92,156687.73


Filters can be combined using the Python bitwise operators `&` (AND) and `|` (OR), which take the place of the regular boolean operators `and` and `or` when working with feature groups and filters.

In [10]:
sales_fg.select(['store', 'dept', 'weekly_sales']).filter((sales_fg.weekly_sales >= 50000) & (sales_fg.dept == 2)).show(5)

2022-06-02 16:44:43,999 INFO: USE `basics_featurestore`
2022-06-02 16:44:44,672 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`weekly_sales` `weekly_sales`
FROM `basics_featurestore`.`sales_fg_1` `fg0`
WHERE `fg0`.`weekly_sales` >= 50000 AND `fg0`.`dept` = 2


Unnamed: 0,store,dept,weekly_sales
0,14,2,82983.0
1,11,2,67217.87
2,12,2,72639.45
3,6,2,53809.48
4,1,2,59889.32
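Python does not allow `and` and `or` to be overloaded, which is why libraries that build query expressions typically rely on `&` and `|`. The sketch below illustrates that mechanism with two hypothetical stand-in classes, `Column` and `Condition`. They are not the HSFS classes and say nothing about how HSFS is implemented internally.

```python
class Condition:
    """Hypothetical stand-in for a filter expression (not the HSFS class)."""

    def __init__(self, sql: str):
        self.sql = sql

    def __and__(self, other: "Condition") -> "Condition":
        # Called for `left & right`
        return Condition(f"({self.sql} AND {other.sql})")

    def __or__(self, other: "Condition") -> "Condition":
        # Called for `left | right`
        return Condition(f"({self.sql} OR {other.sql})")


class Column:
    """Hypothetical stand-in for a feature; comparisons build Conditions."""

    def __init__(self, name: str):
        self.name = name

    def __ge__(self, value) -> Condition:
        return Condition(f"{self.name} >= {value}")

    def __eq__(self, value) -> Condition:
        return Condition(f"{self.name} = {value}")


weekly_sales = Column("weekly_sales")
dept = Column("dept")

combined = (weekly_sales >= 50000) & (dept == 2)
print(combined.sql)  # (weekly_sales >= 50000 AND dept = 2)
```

The parentheses around each comparison matter: `&` and `|` bind more tightly than `>=` and `==` in Python, so leaving them out changes how the expression is parsed.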


---

## <span style="color:#ff5f27;"> 👷🏼‍♂️ Join Features and Feature Groups </span>

HSFS provides a Pandas-like API for joining feature groups and selecting features from different feature groups.
The simplest query selects all the features from one feature group and joins them with all the features of another.

You can use the `select_all()` method of a feature group to select all its features. HSFS relies on the Hopsworks feature store to identify which features of the two feature groups to use as the join condition.
If you don't specify anything, Hopsworks uses the largest matching subset of primary keys with the same name.

In the example below, `sales_fg` has `store`, `dept` and `date` as its composite primary key, while `exogenous_fg` has only `store` and `date`. Hopsworks therefore joins on `store` and `date`.

In [11]:
sales_fg = fs.get_feature_group(
    name = 'sales_fg',
    version = 1
)

exogenous_fg = fs.get_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = sales_fg.select_all().join(exogenous_fg.select_all())
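The default join-key rule described above (primary keys with the same name on both sides) can be pictured as an ordered intersection of the two primary-key lists. The function below is only an illustration of that rule; `infer_join_keys` is a hypothetical helper and not part of HSFS or Hopsworks.

```python
def infer_join_keys(left_pk, right_pk):
    """Return the primary-key names shared by both sides, in left-hand order."""
    right = set(right_pk)
    return [key for key in left_pk if key in right]


# Primary keys as described in the text above.
sales_pk = ["store", "dept", "date"]
exogenous_pk = ["store", "date"]

print(infer_join_keys(sales_pk, exogenous_pk))  # ['store', 'date']
```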

You can use the query object to create training datasets (see training dataset notebook). You can inspect the query generated by calling the `to_string()` method on it.

In [12]:
print(query.to_string())

SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`date` `date`, `fg1`.`weekly_sales` `weekly_sales`, `fg1`.`is_holiday` `is_holiday`, `fg1`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg1`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg1`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg1`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg1`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg1`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg1`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg1`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`markdown1` `markdown1`, `fg0`.`markdown2` `markdown2`, `fg0`.`markdown3` `markdown3`, `fg0`.`markdown4` `markdown4`, `fg0`.`markdown5` `markdown5`, `fg0`.`cpi` `cpi`, `fg0`.`unemployment` `unemployment`, `fg0`.`is_holiday` `is_holiday`, CASE WHEN `fg0`.`a

As with feature groups, you can call the `show()` method to inspect the data before generating a training dataset from it. Alternatively, you can call the `read()` method to get a dataframe (Pandas or Spark, depending on the engine) with the result of the query and apply additional transformations to it.

In [13]:
query.show(5)

2022-06-02 16:44:47,454 INFO: USE `basics_featurestore`
2022-06-02 16:44:48,125 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`date` `date`, `fg1`.`weekly_sales` `weekly_sales`, `fg1`.`is_holiday` `is_holiday`, `fg1`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg1`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg1`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg1`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg1`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg1`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg1`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg1`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`markdown1` `markdown1`, `fg0`.`markdown2` `markdown2`, `fg0`.`markdown3` `markdown3`, `fg0`.`markdown4` `markdown4`, `fg0`.`markdown5` `markdown5`, `fg0`.`cpi` `cpi`, `f

Unnamed: 0,store,dept,date,weekly_sales,is_holiday,sales_last_30_days_store_dep,sales_last_30_days_store,sales_last_90_days_store_dep,sales_last_90_days_store,sales_last_180_days_store_dep,...,fuel_price,markdown1,markdown2,markdown3,markdown4,markdown5,cpi,unemployment,is_holiday.1,appended_feature
0,6,49,1320364800000,8857.16,0,384336.97,384336.97,1149035.71,1149035.71,1366310.6,...,3.332,,,,,,219.400081,6.551,0,10.0
1,26,20,1306454400000,3070.85,0,91724.78,91724.78,264616.07,264616.07,381574.26,...,4.034,,,,,,134.767774,7.818,0,10.0
2,44,42,1284681600000,80.14,0,1788.9,1788.9,917793.1,917793.1,2369840.59,...,2.875,,,,,,126.145467,7.804,0,10.0
3,9,19,1305244800000,1685.34,0,50128.75,50128.75,145747.07,145747.07,568325.39,...,3.899,,,,,,219.604183,6.38,0,10.0
4,39,14,1291939200000,32003.82,0,592425.01,592425.01,3523397.38,3523397.38,8334755.73,...,2.843,,,,,,210.237249,8.476,0,10.0


As with the `show()` and `read()` methods of a feature group, you can specify which storage (online or offline) a query runs against.

---

## <span style="color:#ff5f27;"> 💼 Select only a subset of features </span>

You can replace the `select_all()` method with the `select([])` method to pick only a subset of features from a feature group you want to join:

In [14]:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']))
query.show(5)

2022-06-02 16:45:30,894 INFO: USE `basics_featurestore`
2022-06-02 16:45:31,563 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`weekly_sales` `weekly_sales`, `fg0`.`fuel_price` `fuel_price`
FROM `basics_featurestore`.`sales_fg_1` `fg1`
INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`


Unnamed: 0,store,dept,weekly_sales,fuel_price
0,6,49,8857.16,3.332
1,26,20,3070.85,4.034
2,44,42,80.14,2.875
3,9,19,1685.34,3.899
4,39,14,32003.82,2.843


---

## <span style="color:#ff5f27;"> 💈 Override the join key </span>

If your feature groups don't have a primary key, if their keys have different names, or if you simply want to override the default join key, you can pass the key as a parameter of the join.

As in Pandas, if the feature has the same name in both feature groups, you can use the `on=[]` parameter. If the names differ, you can use the `left_on=[]` and `right_on=[]` parameters:

In [15]:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']), on=['date'])
query.show(5)

2022-06-02 16:45:58,738 INFO: USE `basics_featurestore`
2022-06-02 16:45:59,402 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`weekly_sales` `weekly_sales`, `fg0`.`fuel_price` `fuel_price`
FROM `basics_featurestore`.`sales_fg_1` `fg1`
INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`date` = `fg0`.`date`


Unnamed: 0,store,dept,weekly_sales,fuel_price
0,8,60,237.6,3.095
1,8,60,237.6,3.422
2,8,60,237.6,3.439
3,8,60,237.6,3.439
4,8,60,237.6,3.585
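In the output above, the same sales row (`store` 8, `dept` 60) appears several times with different `fuel_price` values. Joining on `date` alone matches each sales row with every `exogenous_fg` row for that date, regardless of store. The sketch below reproduces the effect with the standard-library `sqlite3` module on made-up toy rows (the dates and the extra stores are invented for illustration). It is a plain SQL analogue, not HSFS code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store INTEGER, dept INTEGER, date TEXT, weekly_sales REAL);
    CREATE TABLE exogenous (store INTEGER, date TEXT, fuel_price REAL);
    -- Toy rows, invented for illustration only
    INSERT INTO sales VALUES (8, 60, '2011-01-07', 237.6);
    INSERT INTO exogenous VALUES
        (8, '2011-01-07', 3.095),
        (9, '2011-01-07', 3.422),
        (10, '2011-01-07', 3.439);
""")

# Join on date only: the single sales row matches every store's row for that date.
on_date = conn.execute(
    "SELECT s.store, s.dept, s.weekly_sales, e.fuel_price "
    "FROM sales s INNER JOIN exogenous e ON s.date = e.date"
).fetchall()

# Join on store and date: the sales row matches only its own store's row.
on_store_and_date = conn.execute(
    "SELECT s.store, s.dept, s.weekly_sales, e.fuel_price "
    "FROM sales s INNER JOIN exogenous e "
    "ON s.store = e.store AND s.date = e.date"
).fetchall()

print(len(on_date))            # 3
print(len(on_store_and_date))  # 1
```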


### Overriding the join type

By default, two feature groups are joined with an `INNER JOIN`. You can override this behavior by passing the `join_type` parameter to the `join()` method. Valid types are: `INNER`, `LEFT`, `RIGHT`, `FULL`, `CROSS`, `LEFT_SEMI_JOIN`, `COMMA`.

In [16]:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']), join_type="left")

print(query.to_string())

SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`weekly_sales` `weekly_sales`, `fg0`.`fuel_price` `fuel_price`
FROM `basics_featurestore`.`sales_fg_1` `fg1`
LEFT JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`
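To see what the join type changes in practice, the sketch below runs the same join as `INNER` and as `LEFT` on toy data with the standard-library `sqlite3` module. The rows are invented for illustration and the SQL is a hand-written analogue of the generated query, not HSFS output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store INTEGER, date TEXT, weekly_sales REAL);
    CREATE TABLE exogenous (store INTEGER, date TEXT, fuel_price REAL);
    -- Toy rows: store 2 has no matching exogenous row
    INSERT INTO sales VALUES (1, '2012-01-06', 100.0), (2, '2012-01-06', 200.0);
    INSERT INTO exogenous VALUES (1, '2012-01-06', 3.1);
""")


def run_join(join_type: str):
    """Run the same join with the given join type and return all rows."""
    sql = (
        "SELECT s.store, s.weekly_sales, e.fuel_price FROM sales s "
        f"{join_type} JOIN exogenous e "
        "ON s.store = e.store AND s.date = e.date ORDER BY s.store"
    )
    return conn.execute(sql).fetchall()


inner_rows = run_join("INNER")
left_rows = run_join("LEFT")

print(inner_rows)  # [(1, 100.0, 3.1)]
print(left_rows)   # [(1, 100.0, 3.1), (2, 200.0, None)]
```

With the inner join, the unmatched sales row disappears. With the left join it is kept, and the missing `fuel_price` comes back as a null value.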


---

## <span style="color:#ff5f27;"> 🧲 Join multiple feature groups </span>

You can chain as many feature groups as you wish. In the example below the order of execution will be:

    (`sales_fg` <> `store_fg`) <> `exogenous_fg`

The join parameters you pass in each `join()` method call apply to that specific join. This means that you can chain left and right joins.
Please be aware that HSFS currently **does not support** nested joins such as:

    `sales_fg` <> (`store_fg` <> `exogenous_fg`)

In [17]:
store_fg = fs.get_feature_group(
    name = "store_fg",
    version = 1
)

query = sales_fg.select_all()\
                .join(store_fg.select_all())\
                .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))

print(query.to_string())

SELECT `fg2`.`store` `store`, `fg2`.`dept` `dept`, `fg2`.`date` `date`, `fg2`.`weekly_sales` `weekly_sales`, `fg2`.`is_holiday` `is_holiday`, `fg2`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg2`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg2`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg2`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg2`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg2`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg2`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg2`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`type` `type`, `fg0`.`size` `size`, `fg0`.`dept` `dept`, `fg1`.`fuel_price` `fuel_price`, `fg1`.`unemployment` `unemployment`, `fg1`.`cpi` `cpi`
FROM `basics_featurestore`.`sales_fg_1` `fg2`
INNER JOIN `basics_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`
INNER JOIN `basics_featurestore`.`exogenous_fg_1
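The left-to-right chaining, with each `join()` call carrying its own parameters, can be pictured with a tiny query builder. `QuerySketch` below is a hypothetical class written only for this illustration. It is not the HSFS `Query` class, and expressing every `ON` clause relative to the base table is an assumption of the sketch.

```python
class QuerySketch:
    """Hypothetical mini query builder (not the HSFS Query class)."""

    def __init__(self, table: str):
        self.table = table
        self.joins = []  # joins are kept in the order they were added

    def join(self, other: str, on, join_type: str = "inner") -> "QuerySketch":
        # Each call records its own keys and join type, then returns self
        # so that calls can be chained left to right.
        self.joins.append((join_type.upper(), other, list(on)))
        return self

    def to_string(self) -> str:
        lines = [f"SELECT * FROM {self.table}"]
        for join_type, other, on in self.joins:
            condition = " AND ".join(
                f"{self.table}.{key} = {other}.{key}" for key in on
            )
            lines.append(f"{join_type} JOIN {other} ON {condition}")
        return "\n".join(lines)


# (sales_fg <> store_fg) <> exogenous_fg, with a different join type per step
query = (
    QuerySketch("sales_fg")
    .join("store_fg", on=["store"])
    .join("exogenous_fg", on=["store", "date"], join_type="left")
)
print(query.to_string())
```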

### Use Joins together with Filters

It is possible to use filters in any of the subqueries of a joined query.

In [18]:
query = sales_fg.select_all()\
                .join(store_fg.select_all())\
                .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']).filter(exogenous_fg.fuel_price <= 2.7)) \
                .filter(sales_fg.weekly_sales >= 50000)

print(query.to_string())

SELECT `fg2`.`store` `store`, `fg2`.`dept` `dept`, `fg2`.`date` `date`, `fg2`.`weekly_sales` `weekly_sales`, `fg2`.`is_holiday` `is_holiday`, `fg2`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg2`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg2`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg2`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg2`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg2`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg2`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg2`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`type` `type`, `fg0`.`size` `size`, `fg0`.`dept` `dept`, `fg1`.`fuel_price` `fuel_price`, `fg1`.`unemployment` `unemployment`, `fg1`.`cpi` `cpi`
FROM `basics_featurestore`.`sales_fg_1` `fg2`
INNER JOIN `basics_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`
INNER JOIN `basics_featurestore`.`exogenous_fg_1

---

## <span style="color:#ff5f27;"> 🔮 Free hand query </span>

With HSFS you are free to skip the Hopsworks query constructor entirely and write your own query. This can be useful if you need to express more complex queries for your use case. `fs.sql` returns a dataframe: in the run recorded here the result behaves like a Pandas DataFrame, while on a Spark engine it is a Spark DataFrame.

In [19]:
fs.sql("SELECT * FROM `store_fg_1`").head()

2022-06-02 16:49:16,155 INFO: USE `basics_featurestore`
2022-06-02 16:49:16,836 INFO: SELECT * FROM `store_fg_1`


Unnamed: 0,store_fg_1.store,store_fg_1.type,store_fg_1.size,store_fg_1.dept,store_fg_1._hoodie_record_key,store_fg_1._hoodie_partition_path,store_fg_1._hoodie_commit_time,store_fg_1._hoodie_file_name,store_fg_1._hoodie_commit_seqno
0,26,A,152513,76,26,,20220602163354257,c3f2e6cd-d3c4-4e4b-b096-2cf6b04d83d9-0_0-29-66...,20220602163354257_0_1
1,1,A,151315,77,1,,20220602163354257,c3f2e6cd-d3c4-4e4b-b096-2cf6b04d83d9-0_0-29-66...,20220602163354257_0_2
2,21,B,140167,77,21,,20220602163354257,c3f2e6cd-d3c4-4e4b-b096-2cf6b04d83d9-0_0-29-66...,20220602163354257_0_3
3,17,B,93188,76,17,,20220602163354257,c3f2e6cd-d3c4-4e4b-b096-2cf6b04d83d9-0_0-29-66...,20220602163354257_0_4
4,30,C,42988,64,30,,20220602163354257,c3f2e6cd-d3c4-4e4b-b096-2cf6b04d83d9-0_0-29-66...,20220602163354257_0_5
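A raw `SELECT *` also returns the `_hoodie_*` columns, which are Apache Hudi metadata fields rather than features, and every column name carries the table name as a prefix. A minimal sketch for tidying such a column list follows. The helper `clean_columns` is hypothetical, and the list is copied from the output above.

```python
def clean_columns(columns, table_prefix):
    """Strip the `<table>.` prefix and drop Hudi metadata columns."""
    cleaned = []
    for col in columns:
        name = col[len(table_prefix):] if col.startswith(table_prefix) else col
        if not name.startswith("_hoodie_"):
            cleaned.append(name)
    return cleaned


# Column names as they appear in the output above.
columns = [
    "store_fg_1.store",
    "store_fg_1.type",
    "store_fg_1.size",
    "store_fg_1.dept",
    "store_fg_1._hoodie_record_key",
    "store_fg_1._hoodie_partition_path",
    "store_fg_1._hoodie_commit_time",
    "store_fg_1._hoodie_file_name",
    "store_fg_1._hoodie_commit_seqno",
]

print(clean_columns(columns, "store_fg_1."))  # ['store', 'type', 'size', 'dept']
```

Listing the wanted columns explicitly in the SQL statement instead of using `*` avoids the metadata columns in the first place.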


---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook we will use our feature groups to create a dataset we can train a model on.