## Build Index

In this notebook we build an index over our candidate embeddings. We use OpenSearch, which Hopsworks supports natively.

### Compute Candidate Embeddings

We start by computing candidate embeddings for all items in the training data.

First, we load our candidate model, which we uploaded to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally, you can download it from the Model Registry with the following code:

In [1]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()

model = mr.get_model("candidate_model")
model_path = model.download()

Connected. Call `.close()` to terminate connection gracefully.




Downloading file ... 

If you already have the model saved locally you can simply replace `model_path` with the path to your model.

In [2]:
import tensorflow as tf

candidate_model = tf.saved_model.load(model_path)



Next we compute the embeddings of all candidate items that were used to train the retrieval model.

In [3]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


In [4]:
feature_view = fs.get_feature_view("retrieval", version=1)
train_df, y_train, val_df, y_val, test_df, y_test = feature_view.get_train_validation_test_split(training_dataset_version=4)

2022-09-16 11:36:27,617 INFO: USE `rec_featurestore`
2022-09-16 11:36:28,381 INFO: SELECT `fg2`.`customer_id` `customer_id`, `fg2`.`article_id` `article_id`, `fg2`.`month_sin` `month_sin`, `fg2`.`month_cos` `month_cos`, `fg0`.`age` `age`, `fg1`.`garment_group_name` `garment_group_name`, `fg1`.`index_group_name` `index_group_name`
FROM `rec_featurestore`.`transactions_1` `fg2`
INNER JOIN `rec_featurestore`.`customers_1` `fg0` ON `fg2`.`customer_id` = `fg0`.`customer_id`
INNER JOIN `rec_featurestore`.`articles_1` `fg1` ON `fg2`.`article_id` = `fg1`.`article_id`


In [5]:
# Cast article IDs to strings.
train_df["article_id"] = train_df["article_id"].astype(str)

# Get list of input features for the candidate model.
model_schema = model.model_schema['input_schema']['columnar_schema']
candidate_features = [feat['name'] for feat in model_schema]

# Get list of unique candidate items. drop_duplicates returns a new
# DataFrame, which avoids pandas' SettingWithCopyWarning.
item_df = train_df[candidate_features].drop_duplicates(subset="article_id")

item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items.
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["article_id"], candidate_model(x)))


(Strictly speaking, we haven't actually computed the candidate embeddings yet, as the dataset functions are lazily evaluated.)
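
To make the lazy behaviour concrete, here is a minimal sketch that uses a plain Python generator as an analogy. It is not `tf.data` itself: `fake_model` and `batched` are hypothetical stand-ins invented for this illustration. The point is only that defining the pipeline runs nothing, and that iterating over it triggers the work batch by batch.

```python
from itertools import islice

calls = []  # records every item that the "model" actually processes


def fake_model(item_id):
    # Hypothetical stand-in for candidate_model: logs the call and
    # returns a toy two-dimensional embedding.
    calls.append(item_id)
    return [float(item_id), float(item_id) * 2.0]


def batched(items, batch_size):
    # Yield successive lists of at most batch_size items.
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


item_ids = [1, 2, 3, 4, 5]

# Defining the pipeline does not call the model, much like
# Dataset.batch(...).map(...) only describes the computation.
pipeline = (
    (batch, [fake_model(i) for i in batch])
    for batch in batched(item_ids, batch_size=2)
)
calls_after_definition = list(calls)

# Iterating is what triggers the computation, one batch at a time.
first_ids, first_embeddings = next(pipeline)
calls_after_first_batch = list(calls)
```

In the notebook, the equivalent trigger is the `for batch in candidate_embeddings` loop in the insertion cell.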

### Index Embeddings

Next we index these embeddings. We start by connecting to our project's OpenSearch client using the *hopsworks* library.

In [6]:
import hopsworks
from opensearchpy import OpenSearch

connection = hopsworks.connection()
project = connection.get_project()
opensearch_api = project.get_opensearch_api()

client = OpenSearch(**opensearch_api.get_default_py_config())

Connected. Call `.close()` to terminate connection gracefully.


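We'll create an index called `candidate_index`. Note that `get_project_index` returns a project-scoped name, which is why the index appears as `rec_candidate_index` in the outputs below.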

In [7]:
index_name = opensearch_api.get_project_index("candidate_index")

Here we use HNSW (Hierarchical Navigable Small World), a graph-based data structure for approximate nearest-neighbor search that can be thought of as a skip list for graphs.

See the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/knn-index) for more detailed information about parameters.
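
As a point of reference, the sketch below shows the exact search that an HNSW index with the `innerproduct` space type approximates: score every stored vector against the query and keep the top `k`. The function names and toy vectors are made up for illustration. Note also that the `_score` OpenSearch reports is derived from the inner product and need not equal the raw value computed here.

```python
def inner_product(u, v):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def exact_top_k(query, vectors, k):
    """Return the k (item_id, score) pairs with the largest inner product."""
    scored = [
        (item_id, inner_product(query, vec))
        for item_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, invented for this example.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [2.0, 2.0],
}
top2 = exact_top_k([1.0, 0.5], toy_vectors, k=2)
```

This brute-force scan touches every vector on every query. HNSW avoids that by walking a layered graph, trading a small amount of recall for much lower query latency.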

In [9]:
# Dimensionality of candidate embeddings.
emb_dim = 16 # candidate_model.layers[-1].output.shape[-1]

index_body = {
    "settings": {
        "knn": True,
        "knn.algo_param.ef_search": 100,
    },
    "mappings": {
        "properties": {
            "my_vector1": {
                "type": "knn_vector",
                "dimension": emb_dim,
                "method": {
                    "name": "hnsw",
                    "space_type": "innerproduct",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}

response = client.indices.create(index_name, body=index_body)
print(response)

2022-09-16 11:41:42,890 INFO: PUT https://172.16.4.231:9200/rec_candidate_index [status:200 request:0.098s]
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'rec_candidate_index'}


Now we can finally insert our candidate embeddings.

In [10]:
from opensearchpy.helpers import bulk

actions = []
for batch in candidate_embeddings:
    item_id_list, embedding_list = batch
    item_id_list = item_id_list.numpy().astype(int)
    embedding_list = embedding_list.numpy()

    for item_id, embedding in zip(item_id_list, embedding_list):
        actions.append({
            "_index": index_name,
            "_id": item_id,
            "_source": {
                "my_vector1": embedding,
            }
        })

# Bulk insertion.
bulk(client, actions)

2022-09-16 11:42:02,582 INFO: POST https://172.16.4.231:9200/_bulk [status:200 request:0.124s]
2022-09-16 11:42:02,672 INFO: POST https://172.16.4.231:9200/_bulk [status:200 request:0.074s]
2022-09-16 11:42:02,718 INFO: POST https://172.16.4.231:9200/_bulk [status:200 request:0.038s]


(1215, [])
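
For larger catalogues, building the full `actions` list in memory may become costly. Since `bulk` accepts any iterable of actions, one option is to yield them from a generator instead. The following is a minimal sketch under that assumption: `generate_actions` is a hypothetical helper, and the toy batches use plain lists where the notebook would first call `.numpy()` on each tensor.

```python
def generate_actions(batches, index_name, field="my_vector1"):
    """Yield one bulk action per (item_id, embedding) pair."""
    for item_ids, embeddings in batches:
        for item_id, embedding in zip(item_ids, embeddings):
            yield {
                "_index": index_name,
                "_id": int(item_id),
                # Convert to plain floats so the document is JSON-friendly.
                "_source": {field: [float(x) for x in embedding]},
            }


# Toy batches, invented for this example.
toy_batches = [
    ([101, 102], [[0.1, 0.2], [0.3, 0.4]]),
    ([103], [[0.5, 0.6]]),
]
toy_actions = list(generate_actions(toy_batches, "toy_index"))
```

In the notebook this would be used as `bulk(client, generate_actions(...))`, passing the generator directly rather than materializing it with `list`.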

To test that it works, we can retrieve the nearest neighbors of a random vector.

In [11]:
# TODO would be more illustrative to select a vector from 'actions'
# and join the results with item_df.

import pprint
import numpy as np

embedding = np.random.rand(emb_dim)

query = {
  "size": 10,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": embedding,
        "k": 10
      }
    }
  }
}

response = client.search(
    body = query,
    index = index_name
)

pprint.pprint(response)

2022-09-16 11:42:25,510 INFO: POST https://172.16.4.231:9200/rec_candidate_index/_search [status:200 request:0.091s]
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '860632003',
                    '_index': 'rec_candidate_index',
                    '_score': 2.7318163,
                    '_source': {'my_vector1': [0.5746775269508362,
                                               -0.05191389471292496,
                                               0.5020051002502441,
                                               -0.003409195691347122,
                                               0.15812934935092926,
                                               0.16854670643806458,
                                               0.10850943624973297,
                                               0.17470090091228485,
                                               0.17700625956058502,
                                               0.22156503796577454
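
The raw response is verbose, so it can help to reduce it to item IDs and scores before joining with `item_df`. Below is a minimal sketch, assuming the response has the `hits` → `hits` → `_id` / `_score` layout seen in the output above. `extract_hits` is a hypothetical helper and the toy response contains invented values.

```python
def extract_hits(response):
    """Return (item_id, score) pairs in the order the hits were returned."""
    return [
        (hit["_id"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]


# Toy response with invented values, shaped like the search output.
toy_response = {
    "hits": {
        "hits": [
            {"_id": "1001", "_score": 2.5, "_source": {"my_vector1": [0.5, 0.1]}},
            {"_id": "1002", "_score": 1.5, "_source": {"my_vector1": [0.2, 0.3]}},
        ]
    }
}
neighbors = extract_hits(toy_response)
```

The `_id` values come back as strings, which matches the string-typed `article_id` column in `item_df`.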

### Next Steps

At this point we have a recommender system that can generate a set of candidate items for a customer. However, many of these candidates could be poor, as the candidate model was trained on only a small subset of the features. In the next notebook, we'll train a *ranking model* to make more fine-grained predictions.