## <span style="color:#ff5f27">👨🏻‍🏫 Build Index </span>

In this notebook you will create a feature group for your candidate embeddings.

## <span style="color:#ff5f27">📝 Imports </span>

In [7]:
import tensorflow as tf
import pprint
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [8]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()
mr = project.get_model_registry()

Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/17565


## <span style="color:#ff5f27">🎯 Compute Candidate Embeddings </span>

You start by computing candidate embeddings for all items in the training data.

First, you load your candidate model. Recall that you uploaded it to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally, you can download it from the Model Registry with the following code:

In [9]:
model = mr.get_model(
    name="candidate_model",
    version=1,
)
model_path = model.download()

Downloading model artifact (2 dirs, 6 files)... DONE

In [10]:
candidate_model = tf.saved_model.load(model_path)

2024-10-01 21:28:19.765433: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:03:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-01 21:28:19.766347: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


Next, you compute the embeddings of all candidate videos that were used to train the retrieval model.

In [11]:
feature_view = fs.get_feature_view(
    name="retrieval",
    version=1,
)

In [12]:
train_df, val_df, test_df, _, _, _ = feature_view.train_validation_test_split(
    validation_size=0.1, 
    test_size=0.1,
    description='Retrieval dataset splits',
)
train_df.head(3)

/arrow/cpp/src/arrow/status.cc:137: DoAction result was not fully consumed: Cancelled: Flight cancelled call, with message: CANCELLED. Detail: Cancelled


Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (73.10s) 


Unnamed: 0,interaction_id,user_id,gender,age,country,video_id,category,views,likes,video_length
0,8493-35-9872,MG666N,Female,47,Turks & Caicos Islands,0IE39E,News,63950,45996,38
1,1247-11-6324,IW619V,Other,44,Côte d’Ivoire,2PV16N,Technology,200516,144355,106
2,0336-88-6412,EO571D,Other,34,Bahamas,9WK73L,News,181915,59220,74
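
With `validation_size=0.1` and `test_size=0.1`, roughly 10% of the rows go to validation, 10% to test, and the remaining 80% to training. The sketch below illustrates that proportion logic with NumPy, assuming a plain random split; `random_split` is a hypothetical helper written for this illustration and is not part of the Hopsworks API, which performs the split for you.

```python
import numpy as np

def random_split(n_rows, validation_size, test_size, seed=0):
    """Hypothetical helper: shuffle row positions and cut them into three parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_val = int(n_rows * validation_size)
    n_test = int(n_rows * test_size)
    val_idx = order[:n_val]
    test_idx = order[n_val:n_val + n_test]
    train_idx = order[n_val + n_test:]  # everything that is left
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = random_split(1000, validation_size=0.1, test_size=0.1)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```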


In [13]:
train_df

Unnamed: 0,interaction_id,user_id,gender,age,country,video_id,category,views,likes,video_length
0,8493-35-9872,MG666N,Female,47,Turks & Caicos Islands,0IE39E,News,63950,45996,38
1,1247-11-6324,IW619V,Other,44,Côte d’Ivoire,2PV16N,Technology,200516,144355,106
2,0336-88-6412,EO571D,Other,34,Bahamas,9WK73L,News,181915,59220,74
3,1055-76-9596,CL930U,Other,85,Mongolia,9KT05N,Music,46054,8021,55
4,1821-43-0748,NJ361K,Other,23,Suriname,5VW23G,Comedy,308519,53853,226
...,...,...,...,...,...,...,...,...,...,...
1009995,0724-29-0432,LS345B,Male,78,Cameroon,1NR73Y,Technology,157359,89735,115
1009996,2349-95-8268,AW776Z,Other,78,Laos,6VT86G,Comedy,53227,9151,189
1009997,9527-95-8831,YT721H,Other,57,Nauru,0RT08F,Entertainment,355769,145940,199
1009998,8835-20-4526,JT609X,Female,78,Morocco,9DB59C,Education,107165,26295,230


In [14]:
# Get the list of input features for the candidate model from the model schema
model_schema = model.model_schema['input_schema']['columnar_schema']
candidate_features = [feat['name'] for feat in model_schema]

# Select the candidate features from the training DataFrame and
# drop duplicate rows based on the 'video_id' column to get unique candidate items
item_df = train_df[candidate_features].drop_duplicates(subset="video_id")

item_df.head(3)

Unnamed: 0,video_id,category,views,likes,video_length
0,0IE39E,News,63950,45996,38
1,2PV16N,Technology,200516,144355,106
2,9WK73L,News,181915,59220,74
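
To see what the deduplication step does, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and do not come from the dataset. The training data has one row per interaction, so the same video can appear many times; `drop_duplicates(subset="video_id")` keeps the first row seen for each video.

```python
import pandas as pd

# Toy interaction log (illustrative values only): the same video can
# appear in several rows because several users watched it.
toy_df = pd.DataFrame({
    "video_id": ["A1", "B2", "A1", "C3", "B2"],
    "category": ["News", "Music", "News", "Comedy", "Music"],
    "views": [10, 20, 10, 30, 20],
})

# Keep the first row for each video_id; the result is a new DataFrame.
toy_items = toy_df.drop_duplicates(subset="video_id")

print(toy_items["video_id"].tolist())  # ['A1', 'B2', 'C3']
```

Assigning the result to a new variable, rather than using `inplace=True` on a column selection, avoids pandas warnings about modifying a copy.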


In [15]:
# Create a TensorFlow dataset from the item DataFrame
item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items using the candidate_model
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["video_id"], candidate_model(x))
)
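
The batch-then-map pattern above can be sketched without TensorFlow. In the NumPy version below, `toy_model` is a hypothetical stand-in for `candidate_model` and `batched` imitates what `Dataset.batch` does; both names and all values are assumptions made for this illustration only.

```python
import numpy as np

def toy_model(batch):
    """Hypothetical stand-in for candidate_model: returns 4-dim embeddings."""
    features = np.stack([batch["views"], batch["likes"]], axis=1).astype(np.float32)
    weights = np.ones((2, 4), dtype=np.float32)
    return features @ weights

def batched(columns, batch_size):
    """Yield dicts of column slices, imitating Dataset.batch(batch_size)."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}

columns = {
    "video_id": np.array(["A1", "B2", "C3", "D4", "E5"]),
    "views": np.array([10, 20, 30, 40, 50]),
    "likes": np.array([1, 2, 3, 4, 5]),
}

# One (ids, embeddings) pair per batch, like the mapped dataset above.
pairs = [(batch["video_id"], toy_model(batch)) for batch in batched(columns, batch_size=2)]

print([emb.shape for _, emb in pairs])  # [(2, 4), (2, 4), (1, 4)]
```

The last batch is smaller than the others because the number of items is not a multiple of the batch size, which is also what happens with `batch(2048)` unless remainders are dropped.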

## <span style="color:#ff5f27">⚙️ Data Preparation </span>


In [16]:
# Materialize the batches once so the model makes a single pass over the items
batches = list(candidate_embeddings)

# Concatenate all video IDs and embeddings from the batches
all_video_ids = tf.concat([batch[0] for batch in batches], axis=0)
all_embeddings = tf.concat([batch[1] for batch in batches], axis=0)

# Convert tensors to numpy arrays, then to lists
items_ids_list = all_video_ids.numpy().tolist()
embeddings_list = all_embeddings.numpy().tolist()

In [17]:
# Create a DataFrame
data_emb = pd.DataFrame({
    'video_id': items_ids_list, 
    'embeddings': embeddings_list,
})
data_emb['video_id'] = data_emb['video_id'].str.decode('utf-8')

data_emb.head()

Unnamed: 0,video_id,embeddings
0,0IE39E,"[-84.35520935058594, -3.6666274070739746, 69.4..."
1,2PV16N,"[-256.7899475097656, -2.141116142272949, 220.9..."
2,9WK73L,"[-309.5802917480469, -6.449563980102539, 261.3..."
3,9KT05N,"[-88.38681030273438, -8.603543281555176, 68.34..."
4,5VW23G,"[-562.212646484375, -11.332976341247559, 477.8..."
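
The `.str.decode('utf-8')` call is needed because string tensors converted with `.numpy().tolist()` arrive as Python `bytes`, not `str`. A minimal sketch with hard-coded values (the two-dimensional vectors are made up for illustration):

```python
import pandas as pd

# These byte strings imitate what a TensorFlow string tensor yields
# after .numpy().tolist(); the embeddings are made-up 2-dim vectors.
ids_as_bytes = [b"0IE39E", b"2PV16N"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]

toy_emb_df = pd.DataFrame({"video_id": ids_as_bytes, "embeddings": toy_embeddings})

# Decode bytes into regular strings so the IDs match the original video_id values.
toy_emb_df["video_id"] = toy_emb_df["video_id"].str.decode("utf-8")

print(toy_emb_df["video_id"].tolist())  # ['0IE39E', '2PV16N']
```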


## <span style="color:#ff5f27">🪄 Feature Group Creation </span>

Now you are ready to create a feature group for your candidate embeddings.

To begin, you create an Embedding Index in which you specify the name of the embeddings feature and the embedding length (the dimension of each vector).
Then you attach this index to the feature group.

In [18]:
from hsfs import embedding

# Create the Embedding Index
emb = embedding.EmbeddingIndex()

emb.add_embedding(
    "embeddings",                           # Embeddings feature name
    len(data_emb["embeddings"].iloc[0]),    # Embeddings length
)
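
Conceptually, an embedding index answers the question "which stored vectors are closest to this query vector?". The sketch below shows a brute-force version of that lookup in NumPy. It is only an illustration: `top_k` is a hypothetical helper, the vectors are made up, and the dot-product score is an assumption; the real index may use a different similarity metric and an approximate search.

```python
import numpy as np

# Made-up 3-dim candidate embeddings; real ones come from the candidate model.
toy_ids = np.array(["A1", "B2", "C3", "D4"])
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

def top_k(query, vectors, ids, k):
    """Hypothetical helper: ids of the k vectors with the largest dot product."""
    if query.shape[0] != vectors.shape[1]:
        raise ValueError("query dimension must match the embedding length")
    scores = vectors @ query
    best = np.argsort(-scores)[:k]  # highest scores first
    return ids[best].tolist()

print(top_k(np.array([1.0, 0.0, 0.0]), toy_vectors, toy_ids, k=2))  # ['A1', 'C3']
```

The dimension check mirrors why the embedding length is declared up front: every stored vector and every query must have the same number of components.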

In [19]:
# Get or create the 'candidate_embeddings_fg' feature group
candidate_embeddings_fg = fs.get_or_create_feature_group(
    name="candidate_embeddings_fg",
    embedding_index=emb,                    # Specify the Embedding Index
    primary_key=['video_id'],
    version=1,
    description='Embeddings for each video',
    online_enabled=True,
)

candidate_embeddings_fg.insert(data_emb)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/17565/fs/17485/fg/1238382


Uploading Dataframe: 0.00% |          | Rows 0/25980 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: candidate_embeddings_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/17565/jobs/named/candidate_embeddings_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f9164f76230>, None)

## <span style="color:#ff5f27">🪄 Feature View Creation </span>


In [20]:
# Get or create the 'candidate_embeddings' feature view
feature_view = fs.get_or_create_feature_view(
    name="candidate_embeddings",
    version=1,
    description='Embeddings of each video',
    query=candidate_embeddings_fg.select(["video_id"]),
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/17565/fs/17485/fv/candidate_embeddings/version/1


---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

At this point, you have a recommender system that can generate a set of candidate videos for a user. However, many of these candidates could be poor, as the candidate model was trained on only a small subset of the features. In the next notebook, you'll create a ranking dataset to train a *ranking model* that makes more fine-grained predictions.