## <span style="color:#ff5f27">👨🏻‍🏫 Build Index </span>

In this notebook we will build an index for our candidate embeddings. Here we will use OpenSearch, which is natively supported by Hopsworks.

## <span style="color:#ff5f27">📝 Imports </span>

In [None]:
import tensorflow as tf
import pprint
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()
mr = project.get_model_registry()

## <span style="color:#ff5f27">🎯 Compute Candidate Embeddings </span>

We start by computing candidate embeddings for all items in the training data.

First, we load our candidate model. Recall that we uploaded it to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally you can download it from the Model Registry using the following code:

In [None]:
model = mr.get_model(
    name="candidate_model",
    version=1,
)
model_path = model.download()

If you already have the model saved locally, you can skip the download and simply set `model_path` to the path of your model.

In [None]:
candidate_model = tf.saved_model.load(model_path)

Next we compute the embeddings of all candidate items that were used to train the retrieval model.

In [None]:
feature_view = fs.get_feature_view(
    name="retrieval", 
    version=1,
)

In [None]:
train_df, val_df, test_df, y_train, y_val, y_test = feature_view.train_validation_test_split(
    validation_size=0.1, 
    test_size=0.1,
    description='Retrieval dataset splits',
)

In [None]:
train_df["article_id"] = train_df["article_id"].astype(str)
val_df["article_id"] = val_df["article_id"].astype(str)

In [None]:
# Get list of input features for the candidate model.
model_schema = model.model_schema['input_schema']['columnar_schema']
candidate_features = [feat['name'] for feat in model_schema]

# Get list of unique candidate items.
# drop_duplicates returns a new DataFrame, so we avoid modifying a slice of train_df in place.
item_df = train_df[candidate_features].drop_duplicates(subset="article_id")

item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items.
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["article_id"], candidate_model(x)))

(Strictly speaking, we haven't actually computed the candidate embeddings yet, as the dataset functions are lazily evaluated.)
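
As a rough analogy (plain Python, not TensorFlow), Python's built-in `map` is lazy in a similar way: the mapped function only runs once the result is iterated. The sketch below uses a hypothetical `fake_embed` function as a stand-in for the candidate model, plus a call counter to make the delayed execution visible.

```python
# Analogy only: Python's built-in map() is lazy, much like Dataset.map().
calls = []  # records every item the function has actually processed


def fake_embed(item_id):
    """Hypothetical stand-in for the candidate model."""
    calls.append(item_id)
    return [float(item_id), float(item_id) * 2.0]


item_ids = [1, 2, 3]

# Building the pipeline does not call fake_embed yet.
lazy_embeddings = map(lambda i: (i, fake_embed(i)), item_ids)
calls_before_iteration = len(calls)

# Iterating over the pipeline triggers the actual computation.
results = list(lazy_embeddings)
calls_after_iteration = len(calls)

print(calls_before_iteration, calls_after_iteration)
```

In the same way, the candidate model is only applied to the items once we loop over `candidate_embeddings` during indexing.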

## <span style="color:#ff5f27">🔮 Index Embeddings </span>

Next we index these embeddings. We start by creating a client for our project's OpenSearch cluster using the *hopsworks* library.

In [None]:
from opensearchpy import OpenSearch

opensearch_api = project.get_opensearch_api()
client = OpenSearch(**opensearch_api.get_default_py_config())

We'll create an index called `candidate_index`. The `get_project_index` call returns the full, project-scoped name for it.

In [None]:
index_name = opensearch_api.get_project_index("candidate_index")

# Dimensionality of the candidate embeddings; must match the candidate model's output size.
emb_dim = 16  # candidate_model.layers[-1].output.shape[-1]

Here we use the HNSW (Hierarchical Navigable Small World) data structure, which can be thought of as a skip list for graphs.

See the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/knn-index) for more detailed information about parameters.
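
To make the goal of the index concrete: the index defined in this section uses the `innerproduct` space type, so a query should return the items whose embeddings have the largest inner product with the query vector. HNSW approximates that result without scanning every item. The sketch below is the exact, brute-force version of the same search in plain Python. The toy embeddings and the names `inner_product` and `exact_top_k` are illustrative assumptions, not part of OpenSearch.

```python
import heapq


def inner_product(u, v):
    """Inner (dot) product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def exact_top_k(query, items, k):
    """Return the k (item_id, score) pairs with the largest inner product."""
    scored = ((item_id, inner_product(query, vec)) for item_id, vec in items.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings (3 dimensions instead of 16).
toy_items = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.5, 0.5, 0.0],
    "c": [0.0, 0.0, 1.0],
    "d": [0.9, 0.1, 0.3],
}
toy_query = [1.0, 0.2, 0.0]

top2 = exact_top_k(toy_query, toy_items, k=2)
print(top2)
```

Brute force scores every item on each query, which is fine for four items but gets expensive for a full catalogue. That cost is the reason for using an approximate index.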

In [None]:
# To delete the index (for example, before recreating it), uncomment the lines below.
# response = client.indices.delete(
#     index = index_name
# )
# print(response)

In [None]:
# Index settings and the mapping for the k-NN vector field.

index_body = {
    "settings": {
        "knn": True,
        "knn.algo_param.ef_search": 100,
    },
    "mappings": {
        "properties": {
            "my_vector1": {
                "type": "knn_vector",
                "dimension": emb_dim,
                "method": {
                    "name": "hnsw",
                    "space_type": "innerproduct",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}

response = client.indices.create(index_name, body=index_body)
print(response)
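
If you want to experiment with different HNSW parameters, it can help to build the index body from a small helper. The function below, `build_knn_index_body`, is a hypothetical convenience wrapper (not part of Hopsworks or opensearch-py) that reproduces the dictionary above. Its default values simply mirror the ones used in this notebook.

```python
def build_knn_index_body(dim, field="my_vector1", ef_search=100,
                         ef_construction=256, m=48):
    """Hypothetical helper: build the k-NN index body used in this notebook."""
    return {
        "settings": {
            "knn": True,
            "knn.algo_param.ef_search": ef_search,
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dim,
                    "method": {
                        "name": "hnsw",
                        "space_type": "innerproduct",
                        "engine": "faiss",
                        "parameters": {
                            "ef_construction": ef_construction,
                            "m": m,
                        },
                    },
                }
            }
        },
    }


example_body = build_knn_index_body(dim=16)
print(example_body["mappings"]["properties"]["my_vector1"]["dimension"])
```

Roughly speaking, larger `m` and `ef_construction` values build a denser graph (better recall at the cost of memory and indexing time), while `ef_search` trades query latency for recall.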

Now we can finally insert our candidate embeddings.

In [None]:
from opensearchpy.helpers import bulk

actions = []
for batch in candidate_embeddings:
    item_id_list, embedding_list = batch
    item_id_list = item_id_list.numpy().astype(int)
    embedding_list = embedding_list.numpy()

    for item_id, embedding in zip(item_id_list, embedding_list):
        actions.append({
            "_index": index_name,
            "_id": item_id,
            "_source": {
                "my_vector1": embedding,
            }
        })

# Bulk insertion.
bulk(client, actions)
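
The loop above builds every action in a list before inserting. For larger catalogues, one option is to pass a generator instead, since `bulk` accepts any iterable of actions. The sketch below shows only the generator part, with hypothetical toy batches of plain Python lists standing in for the TensorFlow tensors. `generate_actions` and the index name `toy_index` are illustrative names, not part of any library.

```python
def generate_actions(batches, index_name, field="my_vector1"):
    """Yield one bulk action per (item_id, embedding) pair, batch by batch."""
    for item_ids, embeddings in batches:
        for item_id, embedding in zip(item_ids, embeddings):
            yield {
                "_index": index_name,
                "_id": int(item_id),
                # Plain Python floats keep the action JSON-serializable.
                "_source": {field: [float(x) for x in embedding]},
            }


# Hypothetical toy data standing in for `candidate_embeddings`.
toy_batches = [
    (["101", "102"], [[0.1, 0.2], [0.3, 0.4]]),
    (["103"], [[0.5, 0.6]]),
]

toy_actions = list(generate_actions(toy_batches, "toy_index"))
print(len(toy_actions), toy_actions[0]["_id"])
```

In this notebook, the generator would be passed straight to `bulk(client, ...)`, with each batch of tensors converted through `.numpy()` first, so that only one batch of embeddings is held in memory at a time.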

To test that it works we can retrieve the neighbors of a random vector.

In [None]:
embedding = np.random.rand(emb_dim)

query = {
  "size": 10,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": embedding,
        "k": 10
      }
    }
  }
}

response = client.search(
    body = query,
    index = index_name
)

pprint.pprint(response)
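
The raw response is fairly verbose. Search responses nest the matches under `hits.hits`, each with an `_id` and a `_score`, so a small helper can reduce the output to the essentials. The sample response below is a hand-written, heavily trimmed stand-in (not real output from this index), and `extract_neighbors` is a hypothetical helper name.

```python
def extract_neighbors(response):
    """Return (item_id, score) pairs from a k-NN search response."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Hand-written stand-in for a search response; the values are made up.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "101", "_score": 0.93},
            {"_id": "205", "_score": 0.87},
        ]
    }
}

neighbors = extract_neighbors(sample_response)
print(neighbors)
```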

---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

At this point we have a recommender system that can generate a set of candidate items for a customer. However, many of these candidates could be poor, as the candidate model was trained with only a small subset of the features. In the next notebook, we'll create a ranking dataset to train a *ranking model* that makes more fine-grained predictions.