## <span style="color:#ff5f27">👨🏻‍🏫 Build Index </span>

In this notebook you will create a feature group for your candidate embeddings.

## <span style="color:#ff5f27">📝 Imports </span>

In [1]:
import tensorflow as tf
import pprint
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/11383


## <span style="color:#ff5f27">🎯 Compute Candidate Embeddings </span>

You start by computing candidate embeddings for all items in the training data.

First, load your candidate model. You uploaded it to the Hopsworks Model Registry in the previous notebook, so if you don't have it locally, you can download it from the Model Registry with the following code:

In [3]:
model = mr.get_model(
    name="candidate_model",
    version=1,
)
model_path = model.download()

Downloading model artifact (2 dirs, 7 files)... DONE

In [4]:
candidate_model = tf.saved_model.load(model_path)

Next you compute the embeddings of all candidate videos that were used to train the retrieval model.

In [5]:
feature_view = fs.get_feature_view(
    name="retrieval",
    version=1,
)

In [6]:
train_df, val_df, test_df, _, _, _ = feature_view.train_validation_test_split(
    validation_size=0.1, 
    test_size=0.1,
    description='Retrieval dataset splits',
)
train_df.head(3)

Finished: Reading data from Hopsworks, using ArrowFlight (39.57s) 




|   | interaction_id | user_id | gender | age | country | video_id | category | views | likes | video_length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2877-12-0444 | RG876M | Female | 76 | South Africa | 6ZY07Y | News | 192408 | 91932 | 113 |
| 1 | 4380-37-7971 | YU710V | Female | 59 | Gambia | 5ZY75O | Comedy | 7275 | 5936 | 171 |
| 2 | 3714-23-7897 | IQ911P | Male | 35 | Sudan | 6NE41R | Dance | 66149 | 45789 | 238 |


In [7]:
# Get the list of input features for the candidate model from the model schema
model_schema = model.model_schema['input_schema']['columnar_schema']
candidate_features = [feat['name'] for feat in model_schema]

# Select the candidate features from the training DataFrame and
# drop duplicate rows based on the 'video_id' column to get unique candidate items.
# Chaining returns a new DataFrame instead of modifying a slice of train_df in place.
item_df = train_df[candidate_features].drop_duplicates(subset="video_id")

item_df.head(3)

|   | video_id | category | views | likes | video_length |
|---|---|---|---|---|---|
| 0 | 6ZY07Y | News | 192408 | 91932 | 113 |
| 1 | 5ZY75O | Comedy | 7275 | 5936 | 171 |
| 2 | 6NE41R | Dance | 66149 | 45789 | 238 |
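The de-duplication step can be illustrated on a toy DataFrame. This is a minimal sketch: the rows and the `toy_` names below are made up for illustration and do not come from the feature view. It shows that `drop_duplicates(subset="video_id")` keeps the first row seen for each video, so the result has one row per candidate item.

```python
import pandas as pd

# Toy interactions: the same video can appear in several rows (illustrative values only)
toy_interactions = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "video_id": ["v1", "v2", "v1", "v3"],
    "category": ["News", "Comedy", "News", "Dance"],
})

# Keep only item-level columns, then keep the first row seen for each video_id
toy_items = (
    toy_interactions[["video_id", "category"]]
    .drop_duplicates(subset="video_id")
    .reset_index(drop=True)
)

print(toy_items)
```

Chaining the call on the selected columns returns a new DataFrame, which avoids modifying a slice of the original DataFrame in place.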


In [8]:
# Create a TensorFlow dataset from the item DataFrame
item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items using the candidate_model
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["video_id"], candidate_model(x))
)
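To make the batching pattern concrete, here is a rough NumPy-only sketch of what mapping a model over a batched dataset produces. Everything in it is an assumption for illustration: `toy_candidate_model` is a hypothetical stand-in that simply stacks two numeric columns, and `batched` mimics batching a dictionary of columns. It is not the trained candidate model or the TensorFlow implementation.

```python
import numpy as np

def toy_candidate_model(batch):
    """Hypothetical stand-in for the candidate model: maps each item to a 2-d vector."""
    return np.stack([batch["views"], batch["likes"]], axis=1).astype(np.float32)

def batched(features, batch_size):
    """Yield dictionaries of column slices, mimicking batching a dict of columns."""
    n_rows = len(next(iter(features.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: col[start:start + batch_size] for name, col in features.items()}

# Made-up item features (illustrative values only)
features = {
    "video_id": np.array(["v1", "v2", "v3", "v4", "v5"]),
    "views": np.array([10, 20, 30, 40, 50]),
    "likes": np.array([1, 2, 3, 4, 5]),
}

# One (ids, embeddings) pair per batch
pairs = [(b["video_id"], toy_candidate_model(b)) for b in batched(features, batch_size=2)]

# Concatenating the per-batch results recovers one row per item
all_ids = np.concatenate([ids for ids, _ in pairs])
all_vectors = np.concatenate([vecs for _, vecs in pairs])
print(len(pairs), all_ids.shape, all_vectors.shape)
```

Each element of the mapped dataset is a pair of IDs and embeddings for one batch, which is why the per-batch results have to be concatenated before they can be stored as one row per video.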

## <span style="color:#ff5f27">⚙️ Data Preparation </span>


In [9]:
# Concatenate all video IDs and embeddings from the candidate_embeddings dataset
all_video_ids = tf.concat([batch[0] for batch in candidate_embeddings], axis=0)
all_embeddings = tf.concat([batch[1] for batch in candidate_embeddings], axis=0)

# Convert tensors to numpy arrays
all_video_ids_np = all_video_ids.numpy()
all_embeddings_np = all_embeddings.numpy()

# Convert numpy arrays to lists
items_ids_list = all_video_ids_np.tolist()
embeddings_list = all_embeddings_np.tolist()

In [10]:
# Create a DataFrame
data_emb = pd.DataFrame({
    'video_id': items_ids_list, 
    'embeddings': embeddings_list,
})
data_emb['video_id'] = data_emb['video_id'].str.decode('utf-8')

data_emb.head()

|   | video_id | embeddings |
|---|---|---|
| 0 | 6ZY07Y | [1.2281079292297363, 1.768419623374939, 0.7509... |
| 1 | 5ZY75O | [1.2281079292297363, 1.768419623374939, 0.7509... |
| 2 | 6NE41R | [1.2281079292297363, 1.768419623374939, 0.7509... |
| 3 | 2OL92N | [11.753347396850586, -11.14534854888916, 4.464... |
| 4 | 0MH71C | [1.2281079292297363, 1.768419623374939, 0.7509... |
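The `.str.decode('utf-8')` call is needed because string tensors come back from TensorFlow as Python `bytes` once converted with `.numpy()`. The sketch below reproduces that situation without TensorFlow: the byte strings and the `toy_` names are hand-written stand-ins, and the vectors are illustrative values rather than real embeddings.

```python
import pandas as pd

# Byte strings standing in for the values a TensorFlow string tensor yields via .numpy()
raw_ids = [b"6ZY07Y", b"5ZY75O"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]  # illustrative values, not real embeddings

toy_emb = pd.DataFrame({"video_id": raw_ids, "embeddings": toy_vectors})

# Decode bytes to regular Python strings so the key matches the string IDs used elsewhere
toy_emb["video_id"] = toy_emb["video_id"].str.decode("utf-8")

print(toy_emb)
```

Without the decode step, the primary key column would hold values such as `b'6ZY07Y'`, which would not match the plain string `video_id` values in the other feature groups.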


## <span style="color:#ff5f27">🪄 Feature Group Creation </span>

Now you are ready to create a feature group for your candidate embeddings.

To begin with, create an Embedding Index that specifies the name of the embeddings feature and the embedding length (the number of dimensions in each vector).
Then attach this index to the feature group.

In [11]:
from hsfs import embedding

# Create the Embedding Index
emb = embedding.EmbeddingIndex()

emb.add_embedding(
    "embeddings",                           # Embeddings feature name
    len(data_emb["embeddings"].iloc[0]),    # Embeddings length
)
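The purpose of an embedding index is to let you retrieve the stored vectors closest to a query vector. As a conceptual sketch only, the brute-force NumPy search below ranks a few made-up item vectors by inner product with a query. The function name, the data, and the choice of inner product as the similarity measure are all assumptions for illustration. A production index would typically use an approximate search structure rather than scoring every item.

```python
import numpy as np

def top_k_by_inner_product(query, item_vectors, item_ids, k):
    """Return the k item IDs whose vectors have the largest inner product with the query."""
    scores = item_vectors @ query
    best = np.argsort(-scores)[:k]  # indices of the k highest scores, best first
    return [item_ids[i] for i in best], scores[best]

# Made-up 2-d vectors for three hypothetical videos
item_ids = ["v1", "v2", "v3"]
item_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.2])

ids, scores = top_k_by_inner_product(query, item_vectors, item_ids, k=2)
print(ids, scores)
```

The embedding length passed to `add_embedding` plays the role of the vector dimension here: every stored vector and every query must have the same length for the comparison to be defined.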

In [12]:
# Get or create the 'candidate_embeddings_fg' feature group
candidate_embeddings_fg = fs.get_or_create_feature_group(
    name="candidate_embeddings_fg",
    embedding_index=emb,                    # Specify the Embedding Index
    primary_key=['video_id'],
    version=1,
    description='Embeddings for each video',
    online_enabled=True,
)

candidate_embeddings_fg.insert(data_emb)

Feature Group created successfully, explore it at 
https://snurran.hops.works/p/11383/fs/11331/fg/12305


Uploading Dataframe: 0.00% |          | Rows 0/24990 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: candidate_embeddings_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11383/jobs/named/candidate_embeddings_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fe787503460>, None)

## <span style="color:#ff5f27">🪄 Feature View Creation </span>


In [13]:
# Get or create the 'candidate_embeddings' feature view
feature_view = fs.get_or_create_feature_view(
    name="candidate_embeddings",
    version=1,
    description='Embeddings of each video',
    query=candidate_embeddings_fg.select(["video_id"]),
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/11383/fs/11331/fv/candidate_embeddings/version/1


---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

At this point, you have a recommender system that can generate a set of candidate videos for a user. However, many of these candidates could be poor, because the candidate model was trained on only a small subset of the features. In the next notebook, you'll create a ranking dataset to train a *ranking model* that makes more fine-grained predictions.