In [2]:
import hopsworks

project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:443/p/119


## Requirements

Install libraries:

* **tensorflow** (version 2.9.1)
* **opensearch-py** (the OpenSearch Python client)

In [3]:
!pip install --quiet tensorflow==2.9.1
!pip install --quiet opensearch-py

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
nbconvert 7.0.0 requires jinja2>=3.0, but you have jinja2 2.11.3 which is incompatible.
nbconvert 7.0.0 requires mistune<3,>=2.0.3, but you have mistune 0.8.4 which is incompatible.
hsfs 3.1.0.dev1 requires markupsafe<2.1.0, but you have markupsafe 2.1.1 which is incompatible.

[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python3.9 -m pip install --upgrade pip


## Build Index

In this notebook we will build an index for our candidate embeddings. Here we will use OpenSearch, which is natively supported by Hopsworks.

### Compute Candidate Embeddings

We start by computing candidate embeddings for all items in the training data.

First, we load our candidate model. Recall that we uploaded it to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally you can download it from the Model Registry using the following code:

In [4]:
mr = project.get_model_registry()

model = mr.get_model("candidate_model")
model_path = model.download()

Connected. Call `.close()` to terminate connection gracefully.




Downloading file ... 

If you already have the model saved locally you can simply replace `model_path` with the path to your model.
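
The two options can be combined in a small helper that reuses a local copy when one exists and downloads otherwise. This is only a sketch and not part of the Hopsworks API: `resolve_model_path` and its `local_dir` argument are hypothetical names, and `download_fn` stands in for `model.download`.

```python
import os
from typing import Callable


def resolve_model_path(local_dir: str, download_fn: Callable[[], str]) -> str:
    """Return local_dir if it holds a saved model, else call download_fn.

    Hypothetical helper: download_fn stands in for `model.download`.
    """
    # A TensorFlow SavedModel directory contains a `saved_model.pb` file.
    if os.path.isfile(os.path.join(local_dir, "saved_model.pb")):
        return local_dir
    return download_fn()
```

In this notebook it could be called as `model_path = resolve_model_path("candidate_model", model.download)`, where the directory name is an assumption.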

In [5]:
import tensorflow as tf

candidate_model = tf.saved_model.load(model_path)

2022-11-15 10:54:19.278761: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-11-15 10:54:19.278779: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-11-15 10:54:20.393916: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-15 10:54:20.393937: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (xps-15): /proc/driver/nvidia/version does not exist
2022-11-15 10:54:20.394141: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To 

Next we compute the embeddings of all candidate items that were used to train the retrieval model.

In [6]:
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


In [7]:
feature_view = fs.get_feature_view("retrieval", version=1)

train_df, val_df, test_df, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)

In [8]:
train_df["article_id"] = train_df["article_id"].astype(str)
val_df["article_id"] = val_df["article_id"].astype(str)

In [9]:
# Get list of input features for the candidate model.
model_schema = model.model_schema['input_schema']['columnar_schema']
candidate_features = [feat['name'] for feat in model_schema]

# Get list of unique candidate items.
# Chaining drop_duplicates returns a new DataFrame, which avoids pandas'
# SettingWithCopyWarning from modifying a slice in place.
item_df = train_df[candidate_features].drop_duplicates(subset="article_id")

item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items.
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["article_id"], candidate_model(x)))


(Strictly speaking, we haven't computed the candidate embeddings yet: `tf.data` pipelines are evaluated lazily, so the model only runs on the data once `candidate_embeddings` is iterated.)
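
Python's built-in `map` is lazy in the same sense, which makes the behavior easy to observe. The sketch below is only an analogy for the `tf.data` pipeline: `fake_model` is a hypothetical stand-in for `candidate_model`, and no TensorFlow is involved.

```python
calls = 0


def fake_model(batch):
    """Hypothetical stand-in for candidate_model; counts its invocations."""
    global calls
    calls += 1
    return [x * 2 for x in batch]


batches = [[1, 2], [3, 4], [5]]

# Building the pipeline does not run the model on any batch.
pipeline = map(fake_model, batches)
calls_before = calls

# Iterating the pipeline is what triggers the computation.
results = list(pipeline)
calls_after = calls

print(calls_before, calls_after)
```

In the notebook, the equivalent iteration happens in the loop that builds the bulk-insert actions.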

### Index Embeddings

Next we index these embeddings. We start by connecting to our project's OpenSearch client using the *hopsworks* library.

In [10]:
# TODO: Remove env var
import os
os.environ["ELASTIC_ENDPOINT"] = "https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200"

In [11]:
from opensearchpy import OpenSearch

opensearch_api = project.get_opensearch_api()
client = OpenSearch(**opensearch_api.get_default_py_config())

We'll create an index called `candidate_index`.

In [12]:
index_name = opensearch_api.get_project_index("candidate_index")

# Dimensionality of the candidate embeddings; must match the candidate model's output size.
emb_dim = 16  # candidate_model.layers[-1].output.shape[-1]

Here we use HNSW (Hierarchical Navigable Small World), a graph-based data structure for approximate nearest-neighbor search that can be thought of as a skip list for graphs.

See the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/knn-index) for more detailed information about parameters.
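
For intuition, the exact search that HNSW approximates can be written as a brute-force scan: score every stored vector against the query by inner product (the `space_type` used for the index in this notebook) and keep the top `k`. This is a plain-Python sketch with made-up toy vectors; `exact_top_k` is a hypothetical helper and not part of OpenSearch.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: Sequence[float],
                vectors: Dict[str, Sequence[float]],
                k: int) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs with the largest inner product."""
    scored = ((item_id, inner_product(query, vec))
              for item_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data, invented for illustration.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [2.0, 2.0],
}
top = exact_top_k([1.0, 0.5], toy_vectors, k=2)
print(top)
```

A scan like this takes time linear in the number of items for every query. HNSW gives up a small amount of recall in exchange for much faster lookups.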

In [None]:
# Optional: uncomment to delete an existing index before re-creating it.
# response = client.indices.delete(
#     index = index_name
# )
# print(response)

In [13]:
# Index settings and mapping: a k-NN index that stores each embedding in an HNSW graph.

index_body = {
    "settings": {
        "knn": True,
        "knn.algo_param.ef_search": 100,
    },
    "mappings": {
        "properties": {
            "my_vector1": {
                "type": "knn_vector",
                "dimension": emb_dim,
                "method": {
                    "name": "hnsw",
                    "space_type": "innerproduct",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}

response = client.indices.create(index=index_name, body=index_body)
print(response)

2022-11-15 10:55:22,039 INFO: PUT https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/recsys_candidate_index [status:200 request:0.472s]
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'recsys_candidate_index'}
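
Creating an index that already exists raises an error, so re-running the cell above fails. One way around this is to check for the index first. A minimal sketch, assuming an opensearch-py client such as the `client` created earlier; `create_index_if_missing` is a hypothetical helper name.

```python
def create_index_if_missing(client, index_name: str, body: dict) -> bool:
    """Create the index unless it already exists.

    Returns True if the index was created, False if it was already there.
    """
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=body)
    return True
```

In this notebook the call would be `create_index_if_missing(client, index_name, index_body)`.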


Now we can finally insert our candidate embeddings.

In [14]:
from opensearchpy.helpers import bulk

actions = []
for batch in candidate_embeddings:
    item_id_list, embedding_list = batch
    item_id_list = item_id_list.numpy().astype(int)
    embedding_list = embedding_list.numpy()

    for item_id, embedding in zip(item_id_list, embedding_list):
        actions.append({
            "_index": index_name,
            "_id": item_id,
            "_source": {
                "my_vector1": embedding,
            }
        })

# Bulk insertion.
bulk(client, actions)

2022-11-15 10:55:25,172 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.944s]
2022-11-15 10:55:25,472 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.277s]
2022-11-15 10:55:25,693 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.166s]
2022-11-15 10:55:25,897 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.166s]
2022-11-15 10:55:26,088 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.156s]
2022-11-15 10:55:26,275 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.158s]
2022-11-15 10:55:26,488 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/_bulk [status:200 request:0.161s]
2022-11-15 10:55:26,700 INFO: POST

(58305, [])
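
The loop above builds every action in a list before anything is sent. Since `bulk` accepts any iterable of actions, a generator can produce them on demand instead, which keeps memory use flat for larger catalogs. The sketch below is a variant under stated assumptions, not the notebook's code: `generate_actions` is a hypothetical helper, and it takes batches that are already converted to plain ids and vectors so that it runs without TensorFlow or a cluster.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple

Batch = Tuple[Sequence[int], Sequence[Sequence[float]]]


def generate_actions(index_name: str, batches: Iterable[Batch],
                     field: str = "my_vector1") -> Iterator[Dict]:
    """Yield one bulk action per (item_id, embedding) pair."""
    for item_ids, embeddings in batches:
        for item_id, embedding in zip(item_ids, embeddings):
            yield {
                "_index": index_name,
                "_id": int(item_id),
                # Convert to built-in floats so the document is plain JSON data.
                "_source": {field: [float(x) for x in embedding]},
            }


# Toy data standing in for the batches of candidate embeddings.
toy_batches = [
    ([101, 102], [[0.1, 0.2], [0.3, 0.4]]),
    ([103], [[0.5, 0.6]]),
]
actions = list(generate_actions("toy_index", toy_batches))
print(len(actions), actions[0])
```

With a live client, the generator would be passed straight to the helper, for example `bulk(client, generate_actions(index_name, batches))`.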

To test that it works we can retrieve the neighbors of a random vector.

In [15]:
import pprint
import numpy as np

embedding = np.random.rand(emb_dim)

query = {
  "size": 10,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": embedding,
        "k": 10
      }
    }
  }
}

response = client.search(
    body = query,
    index = index_name
)

pprint.pprint(response)

2022-11-15 10:55:49,782 INFO: POST https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:9200/recsys_candidate_index/_search [status:200 request:0.152s]
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '727641001',
                    '_index': 'recsys_candidate_index',
                    '_score': 7.507862,
                    '_source': {'my_vector1': [-2.214114189147949,
                                               -0.4502824544906616,
                                               2.4246411323547363,
                                               0.758843719959259,
                                               1.4847484827041626,
                                               -0.3893885612487793,
                                               -2.0370888710021973,
                                               0.33446311950683594,
                                               -0.1956670880317688,
                        

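The printed response is verbose because every hit carries its full `_source` vector. For recommendations, usually only the document ids (the article ids used as `_id` at indexing time) and the scores are needed. A minimal sketch, assuming the response has the `hits.hits` shape shown above; `extract_hits` is a hypothetical helper and the sample response is hand-written.

```python
from typing import Dict, List, Tuple


def extract_hits(response: Dict) -> List[Tuple[str, float]]:
    """Return (document id, score) pairs in the order the search returned them."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Hand-written response with the same shape as the output above.
# Only the first id and score come from the notebook; the second hit is made up.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "727641001", "_index": "recsys_candidate_index", "_score": 7.507862},
            {"_id": "000000000", "_index": "recsys_candidate_index", "_score": 1.0},
        ]
    }
}
print(extract_hits(sample_response))
```
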
#### Next Steps

At this point we have a recommender system that can generate a set of candidate items for a customer. However, many of these candidates could be poor, as the candidate model was trained on only a small subset of the features. In the next notebook, we'll train a *ranking model* to make more fine-grained predictions.