## Build Index

In this notebook we will build an index for our candidate embeddings. Here we will use OpenSearch, which is natively supported by Hopsworks.

### Compute Item Embeddings

First we need to compute the item embeddings.

In [2]:
import tensorflow as tf
import pandas as pd

# Load candidate model.
item_model = tf.keras.models.load_model("item_model")

candidate_features = ["article_id", "garment_group_name", "index_group_name"]

item_df = pd.read_csv("train_df.csv", dtype={"article_id": object})[
    candidate_features]
item_df.drop_duplicates(subset="article_id", inplace=True)
item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["article_id"], item_model(x)))

2022-05-25 14:24:43.810118: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.








(Strictly speaking, we haven't actually computed the candidate embeddings yet, as the dataset functions are lazily evaluated.)
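
To make the laziness concrete, here is a plain-Python analogy (it is not TensorFlow code): the built-in `map` also defers its work until something iterates over the result. The `fake_model` function and the `calls` list are hypothetical stand-ins that only record when the "model" actually runs.

```python
# Plain-Python analogy for a lazily evaluated pipeline.
# `fake_model` is a hypothetical stand-in for the real item model.
calls = []


def fake_model(item_id):
    # Record that the "model" ran, then return a dummy embedding.
    calls.append(item_id)
    return [float(item_id), 0.0]


item_ids = [1, 2, 3]

# Building the pipeline does not run the model yet.
lazy_embeddings = map(lambda i: (i, fake_model(i)), item_ids)
calls_before_iteration = list(calls)

# Iterating over the pipeline is what triggers the computation.
materialized = list(lazy_embeddings)
calls_after_iteration = list(calls)
```

In this notebook, the equivalent trigger is the `for batch in candidate_embeddings` loop in the insertion step.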

### Index Embeddings

Next we index these embeddings. We start by connecting to our project's OpenSearch client using the *hopsworks* library.

In [None]:
import hopsworks
from opensearchpy import OpenSearch

connection = hopsworks.connection()
project = connection.get_project()
opensearch_api = project.get_opensearch_api()

client = OpenSearch(**opensearch_api.get_default_py_config())

We'll create an index called `item_index`. The `get_project_index` call returns the name to use for it within our project.

In [None]:
index_name = opensearch_api.get_project_index("item_index")

Here we use the HNSW (Hierarchical Navigable Small World) data structure, which can be thought of as a skip list for graphs.

See the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/knn-index) for more detailed information about parameters.
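
For intuition, the sketch below shows the exact search that an HNSW index approximates: score every candidate by its inner product with the query and keep the top `k`. It is a brute-force baseline in plain Python, not how OpenSearch or Faiss implement HNSW, and the function name and toy vectors are made up for illustration.

```python
import heapq


def top_k_inner_product(query, candidates, k):
    """Return the k (item_id, score) pairs with the largest inner product."""
    scored = (
        (item_id, sum(q * c for q, c in zip(query, vector)))
        for item_id, vector in candidates.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data, purely illustrative.
toy_candidates = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [2.0, 2.0],
}
toy_result = top_k_inner_product([1.0, 0.5], toy_candidates, k=2)
```

The brute-force version compares the query against every candidate, which becomes expensive for large catalogs. An approximate index such as HNSW gives up some exactness in exchange for much faster queries.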

In [None]:
# Dimensionality of candidate embeddings.
emb_dim = item_model.layers[-1].output.shape[-1]

index_body = {
    "settings": {
        "knn": True,
        "knn.algo_param.ef_search": 100,
    },
    "mappings": {
        "properties": {
            "my_vector1": {
                "type": "knn_vector",
                "dimension": emb_dim,
                "method": {
                    "name": "hnsw",
                    "space_type": "innerproduct",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}

response = client.indices.create(index_name, body=index_body)
print(response)
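
Re-running the cell above against an index that already exists will typically raise an error. One possible guard, sketched here under the assumption that the client exposes `indices.exists` and `indices.create` as opensearch-py does, is to check before creating. The helper name `create_index_if_missing` and the fake client classes are hypothetical; the fakes exist only so the sketch runs without a cluster.

```python
def create_index_if_missing(client, index_name, index_body):
    """Create the index only if it does not exist yet.

    Returns the create response, or None if the index was already there.
    """
    if client.indices.exists(index=index_name):
        return None
    return client.indices.create(index=index_name, body=index_body)


# Minimal in-memory stand-ins for a real client (illustration only).
class _FakeIndices:
    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, body=None):
        self.created[index] = body
        return {"acknowledged": True, "index": index}


class _FakeClient:
    def __init__(self):
        self.indices = _FakeIndices()


fake_client = _FakeClient()
first = create_index_if_missing(fake_client, "demo_index", {"settings": {}})
second = create_index_if_missing(fake_client, "demo_index", {"settings": {}})
```

With the real `client` and `index_body` from this notebook, the call would be `create_index_if_missing(client, index_name, index_body)`.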

Now we can finally insert our candidate embeddings.

In [None]:
from opensearchpy.helpers import bulk

actions = []
for batch in candidate_embeddings:
    item_id_list, embedding_list = batch
    item_id_list = item_id_list.numpy().astype(int)
    embedding_list = embedding_list.numpy()

    for item_id, embedding in zip(item_id_list, embedding_list):
        actions.append({
            "_index": index_name,
            "_id": item_id,
            "_source": {
                "my_vector1": embedding,
            }
        })

# Bulk insertion.
bulk(client, actions)
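
The loop above collects every action in a list before inserting, which keeps all embeddings in memory at once. As an alternative sketch, the actions can be produced by a generator, since `bulk` accepts any iterable of actions. The function name `iter_actions` and the toy batches are hypothetical, and the batches are plain Python lists so the example runs on its own.

```python
def iter_actions(batches, index_name, field="my_vector1"):
    """Yield one bulk action per (item_id, embedding) pair."""
    for item_ids, embeddings in batches:
        for item_id, embedding in zip(item_ids, embeddings):
            yield {
                "_index": index_name,
                "_id": int(item_id),
                "_source": {field: list(embedding)},
            }


# Toy batches of (ids, embeddings), purely illustrative.
toy_batches = [
    ([101, 102], [[0.1, 0.2], [0.3, 0.4]]),
    ([103], [[0.5, 0.6]]),
]
toy_actions = list(iter_actions(toy_batches, "demo_index"))
```

With the real data, each batch would first be converted with `.numpy()` as in the loop above, and the generator would be passed straight to `bulk(client, ...)` instead of being turned into a list.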

To test that it works, we can retrieve the nearest neighbors of a random vector.

In [None]:
# TODO would be more illustrative to select a vector from 'actions'
# and join the results with item_df.

import pprint
import numpy as np

embedding = np.random.rand(emb_dim)

query = {
    "size": 10,
    "query": {
        "knn": {
            "my_vector1": {
                "vector": embedding,
                "k": 10
            }
        }
    }
}

response = client.search(
    body=query,
    index=index_name
)

pprint.pprint(response)
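
The raw response is fairly verbose. A small helper can pull out just the neighbor ids and scores, assuming the usual search response layout in which matches sit under `response["hits"]["hits"]` with `_id` and `_score` keys. The helper name `extract_hits` and the mock response below are made up for illustration and are not output from a real query.

```python
def extract_hits(response):
    """Return (item_id, score) pairs from a search response dict."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Mock response with the assumed layout (not real query output).
mock_response = {
    "hits": {
        "hits": [
            {"_index": "demo_index", "_id": "103", "_score": 0.9},
            {"_index": "demo_index", "_id": "101", "_score": 0.4},
        ]
    }
}
neighbors = extract_hits(mock_response)
```

Note that `_id` comes back as a string, so it may need converting before it is compared with `article_id` in `item_df`.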

### Next Steps

At this point we have a recommender system that is able to generate a set of candidate items for a customer. However, many of these could be poor, as the candidate model was trained with only a small subset of the features. In the next notebook, we'll train a *ranking model* to make more fine-grained predictions.