In [1]:
import hopsworks

project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://hopsworks0.logicalclocks.com/p/119


## Requirements

Install libraries:

* **tensorflow** (version 2.11) [already installed]
* **opensearch-py** (the OpenSearch Python client)

In [2]:
!pip install --quiet opensearch-py

## Build Index

In this notebook we will build an index for our candidate embeddings. Here we will use OpenSearch, which is natively supported by Hopsworks.

### Compute Candidate Embeddings

We start by computing candidate embeddings for all items in the training data.

First, we load our candidate model. Recall that we uploaded it to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally, you can download it from the Model Registry using the following code:

In [3]:
mr = project.get_model_registry()

model = mr.get_model("candidate_model")
model_path = model.download()

Downloading file ... 

If you already have the model saved locally, you can skip the download and set `model_path` to the path of your saved model instead.
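The choice between a local copy and a download can be sketched as a small helper. This is only an illustration: `resolve_model_path` is a hypothetical name, and the check assumes the model was exported as a TensorFlow SavedModel with `saved_model.pb` at the top level of the directory, which may not match how your model was saved.

```python
import os


def resolve_model_path(local_dir, download_fn):
    """Return local_dir if it looks like a SavedModel directory, else download.

    Hypothetical helper: local_dir is a path you choose, download_fn is any
    zero-argument callable that returns a model directory.
    """
    # Assumption: a SavedModel directory contains a saved_model.pb file.
    if local_dir and os.path.isfile(os.path.join(local_dir, "saved_model.pb")):
        return local_dir
    return download_fn()
```

In this notebook it could be called as `model_path = resolve_model_path("candidate_model", model.download)`, where `"candidate_model"` is a placeholder for wherever you keep the local copy.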

In [4]:
import tensorflow as tf

candidate_model = tf.saved_model.load(model_path)



Next we compute the embeddings of all candidate items that were used to train the retrieval model.

In [5]:
fs = project.get_feature_store()


In [6]:
feature_view = fs.get_feature_view("retrieval", version=1)

train_df, val_df, test_df, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)

In [7]:
train_df["article_id"] = train_df["article_id"].astype(str)
val_df["article_id"] = val_df["article_id"].astype(str)

In [8]:
# Get list of input features for the candidate model.
model_schema = model.model_schema['input_schema']['columnar_schema']
candidate_features = [feat['name'] for feat in model_schema]

# Get list of unique candidate items.
# Chaining drop_duplicates returns a new DataFrame instead of modifying a
# slice of train_df in place, which avoids pandas' SettingWithCopyWarning.
item_df = train_df[candidate_features].drop_duplicates(subset="article_id")

item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items.
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["article_id"], candidate_model(x)))

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089

(Strictly speaking, we haven't actually computed the candidate embeddings yet, as the dataset functions are lazily evaluated.)
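To make the lazy-evaluation point concrete, here is a plain-Python analogy using generators. It does not use `tf.data`; `batched` and `fake_model` are stand-ins invented for this sketch. The behaviour it illustrates is the same: defining the pipeline does no work, and the model is only called once the pipeline is iterated.

```python
def batched(items, batch_size):
    """Yield successive batches; nothing runs until the generator is iterated."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


calls = []


def fake_model(batch):
    # Stand-in for candidate_model: records the size of each batch it sees.
    calls.append(len(batch))
    return [[float(x)] for x in batch]


# Defining the pipeline does not call fake_model yet.
pipeline = ((batch, fake_model(batch)) for batch in batched([1, 2, 3, 4, 5], 2))
calls_before = list(calls)

# Iterating is what triggers the computation, one batch at a time.
results = list(pipeline)
calls_after = list(calls)
```

In the notebook, the corresponding trigger is the `for batch in candidate_embeddings:` loop in the bulk-insertion cell.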

### Index Embeddings

Next we index these embeddings. We start by creating an OpenSearch client for our project, using the connection settings provided by the *hopsworks* library.

In [10]:
from opensearchpy import OpenSearch

opensearch_api = project.get_opensearch_api()
client = OpenSearch(**opensearch_api.get_default_py_config())

We'll create an index called `candidate_index`.

In [11]:
index_name = opensearch_api.get_project_index("candidate_index")

# Dimensionality of candidate embeddings; must match the candidate model's output size.
emb_dim = 16  # candidate_model.layers[-1].output.shape[-1]

Here we use HNSW (Hierarchical Navigable Small World), a graph-based index for approximate nearest-neighbor search that can be thought of as a skip list for graphs.

See the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/knn-index) for more detailed information about parameters.
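For intuition, it may help to see what the index approximates. The sketch below is an exact, brute-force search that scores every candidate by inner product and keeps the top `k`. It is not how OpenSearch implements search; `exact_top_k` and the toy vectors are made up for illustration. HNSW avoids this full scan by walking a layered graph, trading a small loss in recall for much faster queries. Roughly, `m` controls how many links each graph node keeps, while `ef_construction` and `ef_search` control how many candidates are explored at build time and query time.

```python
import heapq


def inner_product(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exact_top_k(query, candidates, k):
    """Score every candidate and return the k best as (item_id, score) pairs.

    candidates maps item ids to vectors. Cost grows linearly with the
    number of candidates, which is what an ANN index is meant to avoid.
    """
    scored = ((item_id, inner_product(query, vec))
              for item_id, vec in candidates.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data, purely illustrative.
toy_candidates = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [-1.0, 2.0]}
top2 = exact_top_k([2.0, 1.0], toy_candidates, k=2)
# top2 == [("a", 2.0), ("b", 1.5)]
```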

In [None]:
# To delete the index (for example, before re-creating it), uncomment:
# response = client.indices.delete(
#     index = index_name
# )
# print(response)

In [12]:
# Index settings and k-NN mapping for the embedding field `my_vector1`.
index_body = {
    "settings": {
        "knn": True,
        "knn.algo_param.ef_search": 100,
    },
    "mappings": {
        "properties": {
            "my_vector1": {
                "type": "knn_vector",
                "dimension": emb_dim,
                "method": {
                    "name": "hnsw",
                    "space_type": "innerproduct",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}

response = client.indices.create(index=index_name, body=index_body)
print(response)

2023-07-10 19:38:48,268 INFO: PUT https://10.0.2.15:9200/rec_candidate_index [status:200 request:0.187s]
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'rec_candidate_index'}
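Re-running the cell above against an existing index will normally fail, because the index name is already taken. One way to make the step repeatable is sketched below. `recreate_index` is a hypothetical helper, not part of opensearch-py or Hopsworks; it relies only on the client's `indices.exists`, `indices.delete` and `indices.create` calls. Note that it discards everything previously stored in the index.

```python
def recreate_index(client, index_name, index_body):
    """Delete index_name if it exists, then create it from index_body.

    Hypothetical helper. WARNING: all documents in an existing index
    with this name are lost.
    """
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
    return client.indices.create(index=index_name, body=index_body)
```

With the names from this notebook, the call would be `response = recreate_index(client, index_name, index_body)`.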




Now we can finally insert our candidate embeddings.

In [13]:
from opensearchpy.helpers import bulk

actions = []
for batch in candidate_embeddings:
    item_id_list, embedding_list = batch
    item_id_list = item_id_list.numpy().astype(int)
    embedding_list = embedding_list.numpy()

    for item_id, embedding in zip(item_id_list, embedding_list):
        actions.append({
            "_index": index_name,
            "_id": item_id,
            "_source": {
                "my_vector1": embedding,
            }
        })

# Bulk insertion.
bulk(client, actions)

2023-07-10 19:38:54,177 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.119s]
2023-07-10 19:38:54,300 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.086s]
2023-07-10 19:38:54,405 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.064s]
2023-07-10 19:38:54,475 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.040s]
2023-07-10 19:38:54,572 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.057s]
2023-07-10 19:38:54,652 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.045s]
2023-07-10 19:38:54,724 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.037s]
2023-07-10 19:38:54,786 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.033s]
2023-07-10 19:38:54,858 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.044s]
2023-07-10 19:38:54,934 INFO: POST https://10.0.2.15:9200/_bulk [status:200 request:0.047s]
2023-07-10 19:38:55,010 INFO: POST https://10.0.2.15:9200/_bulk [status:200 requ

(58345, [])
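The tuple returned by `bulk` holds the number of successfully indexed documents and a list of errors, so `(58345, [])` means every item was inserted. The cell above first collects all actions in a list. For larger catalogs, a generator can stream them instead, since `bulk` accepts any iterable of actions. The sketch below is one possible shape for that; `iter_actions` is a hypothetical helper, and it converts values to plain Python types so they serialize to JSON without relying on NumPy support in the client.

```python
def iter_actions(batches, index_name, field="my_vector1"):
    """Yield one bulk action per (item_id, embedding) pair.

    batches is an iterable of (item_ids, embeddings) pairs, where the two
    sequences in each pair have the same length.
    """
    for item_ids, embeddings in batches:
        for item_id, embedding in zip(item_ids, embeddings):
            yield {
                "_index": index_name,
                "_id": int(item_id),
                "_source": {field: [float(x) for x in embedding]},
            }
```

In this notebook the batches could be produced with `((ids.numpy().astype(int), embs.numpy()) for ids, embs in candidate_embeddings)` and the result passed straight to `bulk(client, iter_actions(..., index_name))`.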

To test that it works we can retrieve the neighbors of a random vector.

In [14]:
import pprint
import numpy as np

embedding = np.random.rand(emb_dim)

query = {
  "size": 10,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": embedding,
        "k": 10
      }
    }
  }
}

response = client.search(
    body = query,
    index = index_name
)

pprint.pprint(response)

2023-07-10 19:39:06,085 INFO: POST https://10.0.2.15:9200/rec_candidate_index/_search [status:200 request:0.139s]
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '550969001',
                    '_index': 'rec_candidate_index',
                    '_score': 5.292914,
                    '_source': {'my_vector1': [1.750615119934082,
                                               1.824286699295044,
                                               1.9051073789596558,
                                               -2.153055191040039,
                                               1.270268201828003,
                                               -0.7928036451339722,
                                               1.5667674541473389,
                                               2.506160259246826,
                                               -1.557816743850708,
                                               -0.4792322814464569,
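
The raw response is verbose because every hit carries its full vector in `_source`. For retrieval we usually only need the item ids and scores. A minimal sketch, assuming the response layout shown above (`hits.hits[*]._id` and `_score`); `extract_hits` is a hypothetical helper name:

```python
def extract_hits(response):
    """Return (item_id, score) pairs from a search response, best match first."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Trimmed-down example with the same layout as the response printed above.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "550969001", "_score": 5.292914, "_source": {}},
        ]
    }
}
pairs = extract_hits(sample_response)
# pairs == [("550969001", 5.292914)]
```

If the vectors are never needed on the client side, adding `"_source": False` to the query body should stop OpenSearch from returning them at all.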
               

## Next Steps

At this point we have a recommender system that can generate a set of candidate items for a customer. However, many of these candidates could be poor matches, as the candidate model was trained on only a small subset of the features. In the next notebook, we'll train a *ranking model* to make more fine-grained predictions.