# E2E recsys with matching engine and TFRS


Simple example, goal being:

    1) Train a Two-Tower model using movielens data
    
    2) Deploy the query model endpoint
    
    3) Save movie embeddings to json, for use in matching engine
    
    
#### Note on VPC Pairing - insturctions for in-notebook pairing [here](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/matching_engine/sdk_matching_engine_for_indexing.ipynb)
    
First we will create a user-managed notebook behind the already created peered VPC network used for Matching Engine. Select tensorflow enterprise 2.6 with a T4 GPU


![](./create-workbench.png)


##### Be sure to create the notebook in the peered network


![](./network-create.png)

    
The next notebook will connect matching engine with the query endpoint for a simple recommender system

Run the below pip install one time to install tensorflow-recommenders

In [1]:
!echo Y | pip uninstall tensorflow
!pip install tensorflow-recommenders --user

Found existing installation: tensorflow 2.9.1
Uninstalling tensorflow-2.9.1:
  Would remove:
    /home/jupyter/.local/lib/python3.7/site-packages/tensorflow-2.9.1.dist-info/*
    /home/jupyter/.local/lib/python3.7/site-packages/tensorflow/*
Proceed (Y/n)?   Successfully uninstalled tensorflow-2.9.1
Collecting tensorflow-recommenders
  Using cached tensorflow_recommenders-0.7.0-py3-none-any.whl (88 kB)
Collecting tensorflow>=2.9.0
  Using cached tensorflow-2.9.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (511.7 MB)
Installing collected packages: tensorflow, tensorflow-recommenders
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx-bsl 1.9.0 requires google-api-python-client<2,>=1.7.11, but you have google-api-python-client 2.52.0 which is incompatible.
tfx-bsl 1.9.0 requires pyarrow<6,>=1, but you have pyarrow 8.0.0 which is incompatible

### Important - restart the kernel after installing

# Train a 2 tower model

In [2]:
from typing import Dict, Text

import json

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

# disable INFO and DEBUG logging everywhere
import logging

from google.cloud import aiplatform_v1beta1 #needed for matching engine calls
from google.protobuf import struct_pb2

import pandas as pd


logging.disable(logging.WARNING)

DIMENSIONS = 64 # this is how large the embedding dimensions get


# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")

# Select the basic features.
ratings = ratings.map(lambda x: {
    "movie_id": tf.strings.to_number(x["movie_id"]),
    "user_id": tf.strings.to_number(x["user_id"])
})
movies = movies.map(lambda x: tf.strings.to_number(x["movie_id"]))

# Build a model.
class Model(tfrs.Model):

    def __init__(self):
        super().__init__()

        # Set up user representation.
        self.user_model = tf.keras.Sequential([
            tf.keras.layers.Embedding(
            input_dim=2000, output_dim=DIMENSIONS),
            ])
        # Set up movie representation.
        self.item_model = tf.keras.Sequential([
            tf.keras.layers.Embedding(
            input_dim=2000, output_dim=DIMENSIONS),
        ])
        # Set up a retrieval task and evaluation metrics over the
        # entire dataset of candidates.
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies.batch(128).map(self.item_model)
            )
        )

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:

        user_embeddings = self.user_model(features["user_id"])
        movie_embeddings = self.item_model(features["movie_id"])

        return self.task(user_embeddings, movie_embeddings)


model = Model()
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

# Randomly shuffle data and split between train and test.
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

# Train.
model.fit(train.batch(1024), epochs=5)

# Evaluate.
model.evaluate(test.batch(1024), return_dict=True)

[1mDownloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 32.41 MiB, total: 37.10 MiB) to /home/jupyter/tensorflow_datasets/movielens/100k-ratings/0.1.0...[0m


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

Generating splits...:   0%|          | 0/1 [00:00<?, ? splits/s]

Generating train examples...:   0%|          | 0/100000 [00:00<?, ? examples/s]

Shuffling movielens-train.tfrecord...:   0%|          | 0/100000 [00:00<?, ? examples/s]

[1mDataset movielens downloaded and prepared to /home/jupyter/tensorflow_datasets/movielens/100k-ratings/0.1.0. Subsequent calls will reuse this data.[0m


2022-08-08 18:50:04.004307: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-08 18:50:04.078415: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-08-08 18:50:04.089592: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-08-08 18:50:04.091144: I tensorflow/core/platform/cpu_f

[1mDownloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 150.35 KiB, total: 4.84 MiB) to /home/jupyter/tensorflow_datasets/movielens/100k-movies/0.1.0...[0m


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

Generating splits...:   0%|          | 0/1 [00:00<?, ? splits/s]

Generating train examples...:   0%|          | 0/1682 [00:00<?, ? examples/s]

Shuffling movielens-train.tfrecord...:   0%|          | 0/1682 [00:00<?, ? examples/s]

[1mDataset movielens downloaded and prepared to /home/jupyter/tensorflow_datasets/movielens/100k-movies/0.1.0. Subsequent calls will reuse this data.[0m
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


{'factorized_top_k/top_1_categorical_accuracy': 0.0,
 'factorized_top_k/top_5_categorical_accuracy': 9.999999747378752e-05,
 'factorized_top_k/top_10_categorical_accuracy': 0.0016499999910593033,
 'factorized_top_k/top_50_categorical_accuracy': 0.053449999541044235,
 'factorized_top_k/top_100_categorical_accuracy': 0.1501999944448471,
 'loss': 3467.974609375,
 'regularization_loss': 0,
 'total_loss': 3467.974609375}

### Set your variables

In [3]:
import os

PROJECT = 'wortz-project-352116' #set to your own
NETWORK_NAME = 'me-network' #same as VPC peered network

### Create a bucket to store our embeddings and models
BUCKET = 'gs://end-to-end-two-tower' # TODO - change for each user
EMBEDDINGS = os.path.join(BUCKET, 'embeddings')
QUERY_MODEL = os.path.join(BUCKET, 'query_model')
REGION = 'us-central1'

## Gets an auth token with the Parent variable
PROJECT_ID = PROJECT
AUTH_TOKEN = !gcloud auth print-access-token
PROJECT_NUMBER = !gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = PROJECT_NUMBER[0]


PARENT = "projects/{}/locations/{}".format(PROJECT_ID, REGION)
PARENT

'projects/wortz-project-352116/locations/us-central1'

In [4]:
# run one time to create your bucket
# !gsutil mb -l $REGION $BUCKET

In [5]:
# Save the query/user model

model.user_model.save(QUERY_MODEL)

In [6]:
# Make sure it saved
!gsutil ls $QUERY_MODEL

gs://end-to-end-two-tower/query_model/
gs://end-to-end-two-tower/query_model/keras_metadata.pb
gs://end-to-end-two-tower/query_model/saved_model.pb
gs://end-to-end-two-tower/query_model/assets/
gs://end-to-end-two-tower/query_model/variables/


In [6]:
from google.cloud import aiplatform

model_gcp = aiplatform.Model.upload(
        display_name="Movielens User Query Model",
        artifact_uri=QUERY_MODEL,
        serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-6:latest',
        description="Top of the query tower, meant to return an embedding for each user instance",
    )

In [7]:
#validate the model type output
model_gcp

<google.cloud.aiplatform.models.Model object at 0x7f1557005f10> 
resource name: projects/679926387543/locations/us-central1/models/3782685037409861632

In [None]:
import time

In [8]:
endpoint = aiplatform.Endpoint.create(
    display_name="Movielens Model Endpoint",
    project=PROJECT,
    location=REGION,
)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.34 µs


In [9]:
deployment = model_gcp.deploy(
    endpoint=endpoint,
    deployed_model_display_name="Movielens User Query Model",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=2,
    accelerator_type=None,
    accelerator_count=0,
    sync=False,
)


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs


In [10]:
deployment

<google.cloud.aiplatform.models.Endpoint object at 0x7f15c7691a90> 
resource name: projects/679926387543/locations/us-central1/endpoints/2924023630721449984

  value=value,


## Save the embeddings for the movie dataset

### Write embeddings to local storage
Following this format for Matching Engine
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/matching_engine/sdk_matching_engine_for_indexing.ipynb


In [11]:
movie_embs = movies.batch(1000).map(lambda x: [x, model.item_model(x)]).unbatch() #process 1000 at a time then flatten it back

In [13]:
# Write to local disk
with open("movie_embeddings.json", 'w') as f:
    for movie_id, movie_emb in movie_embs:
        # print(movie_id.numpy(), movie_emb.numpy())
        f.write('{"id":"' + str(movie_id.numpy()) + '","embedding":[' + ",".join(str(x) for x in list(movie_emb.numpy())) + ']}')
        f.write("\n")

You should now see .json data as required by matching engine
![](jsonl.png)

### Upload the data to GCS
Only remove if you have issues uploading the json file

In [16]:
!gsutil cp movie_embeddings.json $EMBEDDINGS/movie_embeddings.json

Copying file://movie_embeddings.json [Content-Type=application/json]...
/ [1 files][  1.2 MiB/  1.2 MiB]                                                
Operation completed over 1 objects/1.2 MiB.                                      


# Next we will deploy our movie inidicies. With Matching Engine
* Create an index (from the `json` files)
* Create and endpoint
* Deploy the index to the endpoint so you can perform vector search

In [17]:
api_endpoint_me = "{}-aiplatform.googleapis.com".format(REGION)

index_client = aiplatform_v1beta1.IndexServiceClient(
    client_options=dict(api_endpoint=api_endpoint_me)
)


DISPLAY_NAME = f"Movielens Movie: {DIMENSIONS} DIMENSIONS"

Set the Nearest Neighbor Options

See here for tips on [tuning the index](https://cloud.google.com/vertex-ai/docs/matching-engine/using-matching-engine#tuning_the_index)

Other best practices from our PM team:
```
Start from leafNodesToSearchPercent=5 and approximateNeighborsCount=10 * k

use default values for others.

measure performance and recall and change those 2 parameters accordingly.
```

In [18]:
treeAhConfig = struct_pb2.Struct(
    fields={
        "leafNodeEmbeddingCount": struct_pb2.Value(number_value=20),
        "leafNodesToSearchPercent": struct_pb2.Value(number_value=7),
    }
)

algorithmConfig = struct_pb2.Struct(
    fields={"treeAhConfig": struct_pb2.Value(struct_value=treeAhConfig)}
)

config = struct_pb2.Struct(
    fields={
        "dimensions": struct_pb2.Value(number_value=DIMENSIONS),
        "approximateNeighborsCount": struct_pb2.Value(number_value=10),
        "distanceMeasureType": struct_pb2.Value(string_value="DOT_PRODUCT_DISTANCE"),
        "algorithmConfig": struct_pb2.Value(struct_value=algorithmConfig),
    }
)

metadata = struct_pb2.Struct(
    fields={
        "config": struct_pb2.Value(struct_value=config),
        "contentsDeltaUri": struct_pb2.Value(string_value=EMBEDDINGS),
    }
)

ann_index = {
    "display_name": DISPLAY_NAME,
    "description": f"Movielens {DIMENSIONS}",
    "metadata": struct_pb2.Value(struct_value=metadata),
}

In [19]:
ann_index = index_client.create_index(parent=PARENT, index=ann_index)

In [20]:
# Poll the operation until it's done successfullly.
# This will take ~40 min.
import time 

while True:
    if ann_index.done():
        break
    print("Poll the operation to create index...")
    time.sleep(60)

Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the operation to create index...
Poll the ope

In [21]:
ann_index

<google.api_core.operation.Operation at 0x7f1557024050>

In [22]:
ann_index.result()

name: "projects/679926387543/locations/us-central1/indexes/6165375113312075776"

### Save the name of the endpoint

In [23]:
INDEX_RESOURCE_NAME = ann_index.result().name
INDEX_RESOURCE_NAME

'projects/679926387543/locations/us-central1/indexes/6165375113312075776'

Debugging tool in case you run into issues. Example usage below.
`!gcloud beta ai operations describe 4122851463774863360 --index=7253099976438317056 --project=$PROJECT`

## Create Index Endpoint and Deploy Index

In [24]:
VPC_NETWORK_NAME = "projects/{}/global/networks/{}".format(PROJECT_NUMBER, NETWORK_NAME)
VPC_NETWORK_NAME

'projects/679926387543/global/networks/me-network'

In [25]:
index_endpoint = {
    "display_name": "index_endpoint_for_demo",
    "network": VPC_NETWORK_NAME,
}

In [26]:
index_endpoint_client = aiplatform_v1beta1.IndexEndpointServiceClient(
    client_options=dict(api_endpoint=api_endpoint_me)
)

ann_index_en = index_endpoint_client.create_index_endpoint(
    parent=PARENT, index_endpoint=index_endpoint
)

In [27]:
ann_index_en.result()

name: "projects/679926387543/locations/us-central1/indexEndpoints/5519108566784409600"

In [28]:
INDEX_ENDPOINT_NAME = ann_index_en.result().name
INDEX_ENDPOINT_NAME

'projects/679926387543/locations/us-central1/indexEndpoints/5519108566784409600'

In [47]:
DEPLOYED_INDEX_ID = 'movielens_deployed2'

deploy_ann_index = {
    "id": DEPLOYED_INDEX_ID,
    "display_name": DEPLOYED_INDEX_ID,
    "index": INDEX_RESOURCE_NAME,
}
r = index_endpoint_client.deploy_index(
    index_endpoint=INDEX_ENDPOINT_NAME, deployed_index=deploy_ann_index
)

In [48]:
r.result()

deployed_index {
  id: "movielens_deployed2"
}

# Connect Matching Engine and The User Model Into a Recommendation System

This will bring it all together by incorporating the prediction endpoint 

In [56]:
# establish index_endpoint -IMPORTANT for constructing already created endpoints/indicies/etc...
ME_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(INDEX_ENDPOINT_NAME)

In [60]:
USER = 627.0 #pick anyone 0-100k to see watch history and recommendations
NUM_NEIGH=3

emb_627 = endpoint.predict([[USER]]) #prediction from the saved model
emb_627 = emb_627.predictions[0]
emb_627 # we should get our user xxx embedding @ dim len

[[-0.22754097,
  -0.0064380914,
  0.0135015259,
  0.112238489,
  -0.277818114,
  -0.159358323,
  -0.1638522,
  -0.00327851763,
  -0.159260809,
  -0.491864324,
  -0.193261221,
  0.30505079,
  -0.307912469,
  -0.301744461,
  0.23069109,
  -0.361792147,
  0.463443398,
  0.0457165837,
  -0.162908927,
  -0.427746952,
  0.636374176,
  0.342389196,
  -0.0582389496,
  0.221517876,
  -0.652715504,
  0.216488883,
  0.427579284,
  0.488048613,
  0.0215745568,
  0.205890983,
  0.0553605147,
  -0.470034838,
  0.0314469598,
  -0.691070139,
  0.0609962605,
  -0.201734498,
  -0.0754234344,
  0.089362748,
  -0.254325509,
  -0.0444015339,
  -0.0583328977,
  0.243367925,
  0.221339,
  -0.136401564,
  0.338161618,
  -0.11122942,
  0.248045325,
  -0.294821292,
  0.429468542,
  0.0255604722,
  1.00177443,
  0.00296269078,
  -0.268753201,
  -0.240518704,
  -0.247991562,
  0.674329817,
  -0.313638657,
  -0.370729953,
  -0.149122491,
  0.338815,
  -0.332382,
  0.0227264874,
  0.852791429,
  0.235119537]]

In [62]:
ME_index_endpoint.match(queries=emb_627, deployed_index_id=DEPLOYED_INDEX_ID, num_neighbors=10)

[[MatchNeighbor(id='1478.0', distance=4.327176094055176),
  MatchNeighbor(id='1135.0', distance=4.176234722137451),
  MatchNeighbor(id='1136.0', distance=4.097315311431885),
  MatchNeighbor(id='1004.0', distance=3.9972083568573),
  MatchNeighbor(id='939.0', distance=3.618990182876587),
  MatchNeighbor(id='188.0', distance=3.5515005588531494),
  MatchNeighbor(id='518.0', distance=3.525977611541748),
  MatchNeighbor(id='461.0', distance=3.502216339111328),
  MatchNeighbor(id='76.0', distance=3.4985222816467285),
  MatchNeighbor(id='942.0', distance=3.492293357849121)]]

#### Create movie lookup tables
Get what given user has rated highly, and what is being recommended

In [63]:
! wget https://files.grouplens.org/datasets/movielens/ml-100k/u.item

--2022-07-25 19:30:56--  https://files.grouplens.org/datasets/movielens/ml-100k/u.item
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 236344 (231K)
Saving to: ‘u.item’


2022-07-25 19:30:57 (2.92 MB/s) - ‘u.item’ saved [236344/236344]



In [64]:
# Quick sidetour - create movie lookup dictionary
movie_names = pd.read_csv('u.item', delimiter='|' , 
                          encoding='latin-1', 
                          usecols=(0,1),
                          names = ['movie_id', 'title'])
movielookup = movie_names.to_dict()['title']

In [74]:
for i, watched_movie in enumerate(ratings.filter(lambda x: x['user_id']==USER)):
    if i >= 10: #limit to top n
        break
    else:
        key = watched_movie['movie_id'].numpy()
        print(f"""Movies watched: \n 
              {i}: {movielookup[key]}"""
             )

Movies watched: 
 
              0: Piano, The (1993)
Movies watched: 
 
              1: Star Trek: The Wrath of Khan (1982)
Movies watched: 
 
              2: Return of the Jedi (1983)
Movies watched: 
 
              3: Star Trek VI: The Undiscovered Country (1991)
Movies watched: 
 
              4: Star Trek III: The Search for Spock (1984)
Movies watched: 
 
              5: Four Rooms (1995)
Movies watched: 
 
              6: Addams Family Values (1993)
Movies watched: 
 
              7: Arsenic and Old Lace (1944)
Movies watched: 
 
              8: Pinocchio (1940)
Movies watched: 
 
              9: Dead Poets Society (1989)


In [67]:
query_vector = emb_627


ann_response = ME_index_endpoint.match(
    deployed_index_id='movielens_deployed', 
    queries=query_vector, 
    num_neighbors=NUM_NEIGH
)

print("Recommended movie IDs:", ann_response)

Recommended movie IDs: [[MatchNeighbor(id='1478.0', distance=4.327176094055176), MatchNeighbor(id='1135.0', distance=4.176234722137451), MatchNeighbor(id='1136.0', distance=4.097315311431885)]]


In [68]:
# look at the recommended movies vs the viewed for that user
for i, match in enumerate(ann_response[0]):
    key = int(float(match.id))
    print(f"""Movies recommended: \n 
          {i}: {movielookup[key]} (distance: {match.distance})"""
         )


Movies recommended: 
 
          0: Reckless (1995) (distance: 4.327176094055176)
Movies recommended: 
 
          1: Ghosts of Mississippi (1996) (distance: 4.176234722137451)
Movies recommended: 
 
          2: Beautiful Thing (1996) (distance: 4.097315311431885)


### Cleaning up
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial. You can also manually delete resources that you created by running the following code.

In [None]:
INDEX_RESOURCE_NAME
# 7352179168240467968

In [None]:
index_endpoint_client

In [None]:
index_endpoint_client.undeploy_index(index_endpoint=INDEX_ENDPOINT_NAME, deployed_index_id=DEPLOYED_INDEX_ID)

index_client.delete_index(name=INDEX_RESOURCE_NAME)

index_endpoint_client.delete_index_endpoint(name=INDEX_ENDPOINT_NAME)

In [None]:
endpoint_resource_name = endpoint.resource_name
endpoint_resource_name

In [None]:
deployment_resource_name = deployment.resource_name
deployment_resource_name
aiplatform.Endpoint.delete(endpoint, gcp_model)
#delete our model endpoints, etc..