## Setup

Import the required libraries, configure the environment variables, and authenticate your GCP account.

In [58]:
!pip3 install google-cloud-aiplatform --upgrade --user


Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.4.3-py2.py3-none-any.whl (1.4 MB)
     |████████████████████████████████| 1.4 MB 5.2 MB/s eta 0:00:01
Installing collected packages: google-cloud-aiplatform
  Attempting uninstall: google-cloud-aiplatform
    Found existing installation: google-cloud-aiplatform 0.7.1
    Uninstalling google-cloud-aiplatform-0.7.1:
      Successfully uninstalled google-cloud-aiplatform-0.7.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx 1.2.0 requires google-cloud-aiplatform<0.8,>=0.5.0, but you have google-cloud-aiplatform 1.4.3 which is incompatible.
tfx 1.2.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<2.6,>=1.15.2, but you have tensorflow 2.6.0 which is incompatible.
Successfully installed google-cloud-aiplatform-1.4.3


### Import libraries

In [1]:
import tensorflow as tf
import numpy as np
from datetime import datetime

### Configure GCP environment settings

Update the following variables to reflect the values for your GCP environment:

+ `PROJECT_ID`: The ID of the Google Cloud project you are using to implement this solution.
+ `BUCKET`: The name of the Cloud Storage bucket you created to use with this solution. The `BUCKET` value should be just the bucket name, so `myBucket` rather than `gs://myBucket`.
+ `REGION`: The region to use for the AI Platform Training job.

In [2]:
PROJECT_ID = 'rec-ai-demo-326116' # Change to your project.
BUCKET = 'rec_bq_jsw' # Change to the bucket you created.
REGION = 'us-central1' # Change to your AI Platform Training region.
EMBEDDING_FILES_PREFIX = f'gs://{BUCKET}/bqml/item_embeddings/embeddings-*'
OUTPUT_INDEX_DIR = f'gs://{BUCKET}/bqml/scann_index'
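Because the paths above are built as `gs://{BUCKET}/...`, a `BUCKET` value that already includes the `gs://` prefix would produce a malformed path. The following is a minimal sketch of a guard against that mistake; `normalize_bucket_name` is a hypothetical helper and is not part of this solution's code.

```python
def normalize_bucket_name(bucket: str) -> str:
    """Return a bare bucket name, dropping a leading 'gs://' and trailing slashes.

    Hypothetical helper for illustration; not part of the solution's modules.
    """
    name = bucket.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    # A bare bucket name must be non-empty and must not contain a path.
    if not name or "/" in name:
        raise ValueError(f"Expected a bare bucket name, got: {bucket!r}")
    return name


bucket = normalize_bucket_name("gs://myBucket")
print(f"gs://{bucket}/bqml/scann_index")
```

Calling the helper on `BUCKET` before building `EMBEDDING_FILES_PREFIX` and `OUTPUT_INDEX_DIR` would catch the error early rather than at index-build time.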

### Authenticate your GCP account
This is required if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [3]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab; assume the environment is already authenticated.
  pass

## Build the ANN index

Use the `build` method implemented in the [indexer.py](index_builder/builder/indexer.py) module to load the embeddings from the CSV files, create the ANN index model and train it on the embedding data, and save the SavedModel file to Cloud Storage. You pass the following three parameters to this method:

+ `embedding_files_path`, which specifies the Cloud Storage location from which to load the embedding vectors.
+ `num_leaves`, which provides the value for a hyperparameter that tunes the model based on the trade-off between retrieval latency and recall. A higher `num_leaves` value will use more data and provide better recall, but will also increase latency. If `num_leaves` is set to `None` or `0`, the `num_leaves` value is the square root of the number of items.
+ `output_dir`, which specifies the Cloud Storage location to write the ANN index SavedModel file to.

Other configuration options for the model are set based on the [rules-of-thumb](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md#rules-of-thumb) provided by ScaNN.
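The default for `num_leaves` described above can be sketched as follows. This is an illustration only: `resolve_num_leaves` is a hypothetical name, and rounding the square root down to an integer is an assumption, since the rounding used in `indexer.py` is not shown here.

```python
import math


def resolve_num_leaves(num_items, num_leaves=None):
    """Pick the number of partitions for the index.

    Hypothetical helper: falls back to sqrt(num_items) when num_leaves is
    None or 0. Rounding down is an assumption made for this sketch.
    """
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not num_leaves:
        return max(1, int(math.sqrt(num_items)))
    return num_leaves


# 2933 is the number of embeddings loaded in the local build shown in this notebook.
print(resolve_num_leaves(2933))
print(resolve_num_leaves(2933, num_leaves=500))
```

Under these assumptions, a catalog of 2,933 items gets 54 leaves by default, while an explicit value such as 500 is passed through unchanged.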

### Build the index locally

In [4]:
import scann

from index_builder.builder import indexer
indexer.build(EMBEDDING_FILES_PREFIX, OUTPUT_INDEX_DIR)

Indexer started...
1 embedding files are found.
Loading embeddings in file 1 of 1...
2933 embeddings are loaded.
Start building the ScaNN index...
ScaNN index is built.
Saving index as a SavedModel...


2021-09-20 18:35:20.541242: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-09-20 18:35:20.541294: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-20 18:35:20.541317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (tf-26): /proc/driver/nvidia/version does not exist
2021-09-20 18:35:20.541676: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-20 18:35:20.5

INFO:tensorflow:Assets written to: gs://rec_bq_jsw/bqml/scann_index/assets
Index is saved to gs://rec_bq_jsw/bqml/scann_index
Saving tokens file...
Item file is saved to gs://rec_bq_jsw/bqml/scann_index/tokens.
Indexer finished.


### Build the index using AI Platform Training

Submit an AI Platform Training job to build the ScaNN index at scale. The [index_builder](index_builder) directory contains the expected [training application packaging structure](https://cloud.google.com/ai-platform/training/docs/packaging-trainer) for submitting the AI Platform Training job.

In [23]:
%%writefile index_builder/Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6

WORKDIR /

# Copies the trainer code to the docker image.
COPY builder /builder

RUN pip install -r builder/requirements.txt

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "builder.task"]

Overwriting index_builder/Dockerfile


In [24]:
IMAGE_URI = f"gcr.io/{PROJECT_ID}/multiworker:scann-demo"
!docker build -t $IMAGE_URI index_builder
!docker push $IMAGE_URI

Sending build context to Docker daemon  29.18kB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6
 ---> b91f525ceaed
Step 2/5 : WORKDIR /
 ---> Using cache
 ---> 7027c1ee8522
Step 3/5 : COPY builder /builder
 ---> 62c581f525a8
Step 4/5 : RUN pip install -r builder/requirements.txt
 ---> Running in 78f66c7d2eef
Collecting scann
  Downloading scann-1.2.3-cp37-cp37m-manylinux2014_x86_64.whl (10.6 MB)
Collecting six~=1.15.0
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting typing-extensions~=3.7.4
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions, six, scann
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.0
    Uninstalling typing-extensions-3.10.0.0:
      Successfully uninstalled typing-extensions-3.10.0.0
  Attempting uninstall: six
    Found existing installation: six 1.16.0
    Uninstalling six-1.16.0:
      Successfully uninstalled six-1.16

In [25]:
# This utility function creates the worker pool specs for the job. It could be
# adapted to include hyperparameters and other Vertex AI job configuration.

def prepare_worker_pool_specs(
    image_uri,
    args,
    cmd,
    replica_count=1,
    machine_type="n1-standard-4",
    accelerator_count=1,
    accelerator_type="ACCELERATOR_TYPE_UNSPECIFIED",
    reduction_server_count=0,
    reduction_server_machine_type="n1-highcpu-16",
    reduction_server_image_uri="us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest",
):

    if accelerator_count > 0:
        machine_spec = {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        }
    else:
        machine_spec = {"machine_type": machine_type}

    container_spec = {
        "image_uri": image_uri,
        "args": args,
        "command": cmd,
    }

    chief_spec = {
        "replica_count": 1,
        "machine_spec": machine_spec,
        "container_spec": container_spec,
    }

    worker_pool_specs = [chief_spec]
    if replica_count > 1:
        workers_spec = {
            "replica_count": replica_count - 1,
            "machine_spec": machine_spec,
            "container_spec": container_spec,
        }
        worker_pool_specs.append(workers_spec)
    if reduction_server_count > 0:
        workers_spec = {
            "replica_count": reduction_server_count,
            "machine_spec": {
                "machine_type": reduction_server_machine_type,
            },
            "container_spec": {"image_uri": reduction_server_image_uri},
        }
        worker_pool_specs.append(workers_spec)

    return worker_pool_specs
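To make the shape of the returned list concrete, the sketch below summarizes a hand-written list with the same structure: one chief pool with a single replica, followed by a worker pool holding the remaining replicas. `summarize_worker_pools` is a hypothetical helper, and the example values are illustrative rather than output captured from this notebook.

```python
def summarize_worker_pools(worker_pool_specs):
    """Return (machine_type, replica_count) for each pool, in order.

    Hypothetical helper for inspecting a worker pool spec list.
    """
    return [
        (pool["machine_spec"]["machine_type"], pool["replica_count"])
        for pool in worker_pool_specs
    ]


# Illustrative specs shaped like the output for replica_count=3:
# one chief replica plus two additional workers.
example_specs = [
    {"replica_count": 1, "machine_spec": {"machine_type": "n1-standard-16"}},
    {"replica_count": 2, "machine_spec": {"machine_type": "n1-standard-16"}},
]

summary = summarize_worker_pools(example_specs)
total_replicas = sum(count for _, count in summary)
print(summary, total_replicas)
```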

In [26]:
from google.cloud import aiplatform as vertex_ai

# initialize vertex sdk
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f'{OUTPUT_INDEX_DIR}/jobs/'
)

In [31]:
worker_args

['--embedding-files-path=gs://rec_bq_jsw/bqml/item_embeddings/embeddings-*',
 '--output-dir=gs://rec_bq_jsw/bqml/scann_index',
 '--num-leaves=500']
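The flags listed above are passed to the container as command-line arguments. The sketch below shows one way a trainer entry point could parse them with `argparse`. It is an assumption for illustration; the actual `builder/task.py` is not shown in this notebook and may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the index builder flags (hypothetical sketch of an entry point)."""
    parser = argparse.ArgumentParser(description="Build a ScaNN index.")
    parser.add_argument("--embedding-files-path", required=True,
                        help="Cloud Storage pattern of the embedding CSV files.")
    parser.add_argument("--output-dir", required=True,
                        help="Cloud Storage location for the SavedModel.")
    parser.add_argument("--num-leaves", type=int, default=None,
                        help="Number of partitions; unset means use the default.")
    return parser.parse_args(argv)


args = parse_args([
    "--embedding-files-path=gs://rec_bq_jsw/bqml/item_embeddings/embeddings-*",
    "--output-dir=gs://rec_bq_jsw/bqml/scann_index",
    "--num-leaves=500",
])
print(args.embedding_files_path, args.output_dir, args.num_leaves)
```

Note that `argparse` converts the dashes in flag names to underscores, so `--num-leaves` is read as `args.num_leaves`.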

In [27]:
if tf.io.gfile.exists(OUTPUT_INDEX_DIR):
  print("Removing {} contents...".format(OUTPUT_INDEX_DIR))
  tf.io.gfile.rmtree(OUTPUT_INDEX_DIR)

print("Creating output: {}".format(OUTPUT_INDEX_DIR))
tf.io.gfile.makedirs(OUTPUT_INDEX_DIR)

timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks_bqml_build_scann_index_{timestamp}'


worker_args = [
    f'--embedding-files-path={EMBEDDING_FILES_PREFIX}',
    f'--output-dir={OUTPUT_INDEX_DIR}',
    '--num-leaves=500',
]
WORKER_CMD = ["python", "builder/task.py"]
WORKER_ARGS = worker_args
REPLICA_COUNT = 1
WORKER_MACHINE_TYPE = "n1-standard-16"
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
PER_MACHINE_ACCELERATOR_COUNT = 1

worker_pool_specs = prepare_worker_pool_specs(
    image_uri=IMAGE_URI,
    args=WORKER_ARGS,
    cmd=WORKER_CMD,
    replica_count=REPLICA_COUNT,
    machine_type=WORKER_MACHINE_TYPE,
    accelerator_count=PER_MACHINE_ACCELERATOR_COUNT,
    accelerator_type=ACCELERATOR_TYPE,
)

job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=OUTPUT_INDEX_DIR,
)


job.run(sync=False)

Removing gs://rec_bq_jsw/bqml/scann_index contents...
Creating output: gs://rec_bq_jsw/bqml/scann_index
INFO:google.cloud.aiplatform.jobs:Creating CustomJob
INFO:google.cloud.aiplatform.jobs:CustomJob created. Resource name: projects/733956866731/locations/us-central1/customJobs/8635100330645782528
INFO:google.cloud.aiplatform.jobs:To use this CustomJob in another session:
INFO:google.cloud.aiplatform.jobs:custom_job = aiplatform.CustomJob.get('projects/733956866731/locations/us-central1/customJobs/8635100330645782528')
INFO:google.cloud.aiplatform.jobs:View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/8635100330645782528?project=733956866731
INFO:google.cloud.aiplatform.jobs:CustomJob projects/733956866731/locations/us-central1/customJobs/8635100330645782528 current state:
JobState.JOB_STATE_PENDING
INFO:google.cloud.aiplatform.jobs:CustomJob projects/733956866731/locations/us-central1/customJobs/8635100330645782528 current state:
JobState.JO

After the AI Platform Training job finishes, check that the `scann_index` folder has been created in your Cloud Storage bucket:

In [28]:
!gsutil ls $OUTPUT_INDEX_DIR

gs://rec_bq_jsw/bqml/scann_index/
gs://rec_bq_jsw/bqml/scann_index/saved_model.pb
gs://rec_bq_jsw/bqml/scann_index/tokens
gs://rec_bq_jsw/bqml/scann_index/assets/
gs://rec_bq_jsw/bqml/scann_index/variables/
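The listing can also be checked programmatically. The sketch below takes the paths printed by `gsutil ls` and reports which artifacts are missing. `missing_index_artifacts` is a hypothetical helper, and the expected set (`saved_model.pb`, `tokens`, `variables/`) is inferred from the listing above rather than from a documented contract.

```python
def missing_index_artifacts(listing, index_dir):
    """Return the expected index artifacts that are absent from a listing.

    Hypothetical helper; the expected names are an assumption based on the
    gsutil output shown in this notebook.
    """
    prefix = index_dir.rstrip("/") + "/"
    present = {path[len(prefix):] for path in listing if path.startswith(prefix)}
    expected = ("saved_model.pb", "tokens", "variables/")
    return [name for name in expected if name not in present]


listing = [
    "gs://rec_bq_jsw/bqml/scann_index/",
    "gs://rec_bq_jsw/bqml/scann_index/saved_model.pb",
    "gs://rec_bq_jsw/bqml/scann_index/tokens",
    "gs://rec_bq_jsw/bqml/scann_index/assets/",
    "gs://rec_bq_jsw/bqml/scann_index/variables/",
]
print(missing_index_artifacts(listing, "gs://rec_bq_jsw/bqml/scann_index"))
```

An empty list means every expected artifact was found.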


## Test the ANN index

Test the ANN index by using the `ScaNNMatcher` class implemented in the [index_server/matching.py](index_server/matching.py) module.

Run the following code snippets to create an item embedding from randomly generated values and pass it to `scann_matcher`, which returns the item IDs of the five items that are the approximate nearest neighbors of the embedding you submitted.

In [29]:
from index_server.matching import ScaNNMatcher
scann_matcher = ScaNNMatcher(OUTPUT_INDEX_DIR)

Loading ScaNN index...


2021-09-20 19:30:22.895189: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


ScaNN index is loadded.


In [30]:
vector = np.random.rand(50)
scann_matcher.match(vector, 5)

['10719', '26718', '7534', '5701', '26608']
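Because the index returns approximate neighbors, one way to judge its quality is to compare its results with an exact brute-force search over the same embeddings. The sketch below uses only the standard library and assumes dot-product similarity, which may not match the measure configured in `indexer.py`. `exact_top_k` and `recall_at_k` are hypothetical helpers, and the tiny embedding table is made up for illustration.

```python
import heapq


def exact_top_k(query, item_embeddings, k):
    """Return the k item IDs with the largest dot product against the query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return heapq.nlargest(
        k, item_embeddings, key=lambda item_id: dot(query, item_embeddings[item_id])
    )


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result recovered."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Made-up 2-dimensional embeddings keyed by item ID.
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
query = [1.0, 0.2]

exact = exact_top_k(query, items, 2)
print(exact)
print(recall_at_k(["a", "b"], exact))
```

In practice, `approx_ids` would be the list returned by `scann_matcher.match`, and the exact search would run over the embeddings loaded from the CSV files. Brute force is only practical for spot checks on a sample of queries.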

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**