# Feed the Sleuth: find the 2nd/3rd/... nearest neighbors at scale

__with Approximate Nearest Neighbor in Anisotropic Vector Quantization, Shallow Tree + Asymmetric Hashing, and Sentence Transformer__

## 0. Imperative: binge watch Tom and Jerry, but with different plots 

We are binge-watching Tom and Jerry and want to continue with episodes that are similar but not exactly the same.

For example, in one episode Tom dreams about catching Jerry in his sleep, whereas in another episode Tom really is chasing Jerry down (and not sleepwalking).

And back to real-life situations ... 

In the case of __renter's insurance__, how can we tell that __contract 1 lightly touches on plumbing, whereas contract 2 includes substantial clauses__?

__More generally, can we feed in months or years of software, building, or legal contracts to discover (subtle) changes, e.g. where COVID-19 could break a contract?__

In other words, can we __feed in a colossal amount of documents, without domain knowledge or supervised learning, to surface such changes and drive the adoption of Document/Label Sleuth and other AI/ML/NLP tools__?


### 0.1. Solution: find close-but-not-exactly-the-same clues, at scale

One potential way is to be able to:
1. Measure the distance between vectors/embeddings, e.g. cosine similarity in a high-dimensional representation
2. Find the __2nd/3rd/4th... closest items, rather than the single closest or exactly matching one__
3. And __do so at scale__
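
The first two steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the 3-dimensional vectors, the item names, and the helper `neighbors_after_closest` are all made up for the example, and the exhaustive sort stands in for what the ANN index later does at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def neighbors_after_closest(query, corpus, skip=1, keep=2):
    """Rank corpus items by similarity to the query, drop the `skip`
    closest, and return the next `keep` (e.g. the 2nd and 3rd neighbors)."""
    ranked = sorted(
        corpus,
        key=lambda name: cosine_similarity(query, corpus[name]),
        reverse=True,
    )
    return ranked[skip:skip + keep]

# Hypothetical 3-dimensional "embeddings", for illustration only.
toy_corpus = {
    "near_duplicate": [1.0, 0.0, 0.0],
    "same_topic_a": [0.9, 0.3, 0.0],
    "same_topic_b": [0.7, 0.7, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.05, 0.0]

second_and_third = neighbors_after_closest(toy_query, toy_corpus)
print(second_and_third)
```

The near-duplicate is ranked first and skipped, so the two items returned are the "similar but not the same" ones.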


### 0.2. Approximate Nearest Neighbor (ANN) with Anisotropic Vector Quantization, Shallow Tree + Asymmetric Hashing

Google's Approximate Nearest Neighbor (ANN) index is a high-scale, low-latency solution for finding similar vectors (or, more specifically, "embeddings") in a large corpus.

The highlight is "Anisotropic Vector Quantization", described in Google's [ScaNN announcement](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html).
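
To give a feel for what "anisotropic" means here, the sketch below computes a toy version of the loss: the quantization error is split into a part parallel to the original vector and a part orthogonal to it, and the parallel part is penalized more. The function name and the weight `eta` are assumptions made for illustration; the actual weighting is derived in the ScaNN work and is not reproduced here.

```python
def anisotropic_loss(x, x_quantized, eta=4.0):
    """Toy anisotropic loss: the residual component parallel to x is
    weighted by `eta` (> 1), the orthogonal component by 1."""
    residual = [a - b for a, b in zip(x, x_quantized)]
    x_norm_sq = sum(a * a for a in x)
    # Project the residual onto the direction of x.
    scale = sum(r * a for r, a in zip(residual, x)) / x_norm_sq
    parallel = [scale * a for a in x]
    orthogonal = [r - p for r, p in zip(residual, parallel)]
    parallel_sq = sum(p * p for p in parallel)
    orthogonal_sq = sum(o * o for o in orthogonal)
    return eta * parallel_sq + orthogonal_sq

x = [1.0, 0.0]
quantized_with_parallel_error = [0.9, 0.0]    # off by 0.1 along x
quantized_with_orthogonal_error = [1.0, 0.1]  # off by 0.1 perpendicular to x

loss_parallel = anisotropic_loss(x, quantized_with_parallel_error)
loss_orthogonal = anisotropic_loss(x, quantized_with_orthogonal_error)
print(loss_parallel > loss_orthogonal)
```

Both candidates are equally far from `x` in Euclidean terms, but the one whose error lies along `x` gets the larger loss, because that kind of error distorts inner products with queries that point in a similar direction to `x`.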

### 0.3. Vector/embedding with Sentence Transformers/sentence-t5-base

To build the vectors/embeddings, we use the sentence-t5-base model from Sentence Transformers, available on Hugging Face:
https://huggingface.co/blog/sentence-transformers-in-the-hub


### 0.4. Distance measure of 'The cat likes to sleep in the sun' to 9 other sentences

Let's say that we have summarized a Tom and Jerry episode into this one-liner:

'The cat likes to sleep in the sun'.

Meanwhile, we have summarized 9 other random TV shows into the following 9 one-liners:
1. I spent the day at the medical facility.
2. That a cultured medical genius found her inspiring was beyond flattering.
3. She drew nearer, eyes sweeping over the medical equipment in the room.
4. I did not ask the American Medical Association their opinion of this arrangement.
5. I think the cat wants dessert!
6. Im in no mood to watch a cat fight tonight.
7. The cat would like to eat the mouse.
8. A large grey cat was asleep on a rocking chair.
9. The pilot was able to land the airplane

### 0.5. Result 

As the query in section 6 shows, the sentences closest to 'The cat likes to sleep in the sun' are, in order:
1. A large grey cat was asleep on a rocking chair.
2. I think the cat wants dessert!
3. The cat would like to eat the mouse.
4. Im in no mood to watch a cat fight tonight.
5. The pilot was able to land the airplane

Pretty good, right?

### 0.6. Continue watching episodes 2 and 3 

Based on the above, we toss away show 1 because it is roughly the same as our episode, and shows 4 and 5 because they are not what we are interested in, leaving episodes 2 and 3:

2. I think the cat wants dessert!
3. The cat would like to eat the mouse.

Now we can continue binge watching!


### 0.7 Specific steps
1. Prep environment/var
2. Create vector/embedding with sentence-transformers
3. Create ANN Index and Brute Force Index
4. Create an IndexEndpoint with VPC Network
5. Deploy ANN Index
6. Perform online query

### 0.8 To-do

Test cases, algorithms (sorting), document/contracts  

## 1. Prepare environment and variables

### 1.1 Get GCP's project ID

In [1]:
import os

In [2]:
shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]
print("Project ID: ", PROJECT_ID)

Project ID:  me-ann1-370514


In [3]:
shell_output = !gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = shell_output[0]
print("PROJECT_NUMBER: {}".format(PROJECT_NUMBER))

PROJECT_NUMBER: 25661841074


### 1.2 Create a Cloud Storage bucket

Generate a random ID (optional) to avoid bucket name collisions.

In [4]:
import random
import string

RANDOM_ID = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))

In [5]:
REGION = "us-central1"
BUCKET_URI = "gs://" + PROJECT_ID + "-aip-" + RANDOM_ID

print(BUCKET_URI, REGION)

gs://me-ann1-370514-aip-og0kjvbz us-central1


Create the Cloud Storage bucket

In [6]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Creating gs://me-ann1-370514-aip-og0kjvbz/...


Check the bucket 

In [7]:
! gsutil ls -al $BUCKET_URI

## 2. Create vector/embedding with sentence-transformers

### 2.1 Sentence Transformers and sentence-t5-base 

In [8]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting torch>=1.6.0
  Downloading torch-1.13.0-cp37-cp37m-manylinux1_x86_64.whl (890.2 MB)
Collecting torchvision
  Downloading torchvision-0.14.0-cp37-cp37m-manylinux1_x86_64.whl (24.3 MB)
Collecting nltk

In [9]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-t5-base')


### 2.2 Create sentences, embeddings, then save into JSON for upload to GCS

In [10]:
sentences = ['I spent the day at the medical facility.', #0
            'That a cultured medical genius found her inspiring was beyond flattering.', #1
            'She drew nearer, eyes sweeping over the medical equipment in the room.', #2
            'I did not ask the American Medical Association their opinion of this arrangement.', #3
            'I think the cat wants dessert!', #4
            'Im in no mood to watch a cat fight tonight.', #5
            'The cat would like to eat the mouse.', #6
            'A large grey cat was asleep on a rocking chair.', # 7
            'The pilot was able to land the airplane'] #8

In [11]:
embedding = model.encode(sentences) 

In [12]:
type(embedding)

numpy.ndarray

In [13]:
embedding.shape

(9, 768)

In [15]:
output_file = "init_data.json"

In [16]:
with open(output_file, "w") as f:
    for i in range(len(sentences)):
        f.write('{"id":"' + str(i) + '",')
        f.write('"embedding":[' + ",".join(str(x) for x in embedding[i]) + "]}")
        f.write("\n")
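
Because the file is assembled by string concatenation, it can be worth parsing it back before uploading. The sketch below is one possible check, not part of the original notebook: `validate_embeddings_file` is a hypothetical helper, and it is exercised here on a tiny stand-in file with 3-dimensional vectors written the same way as `init_data.json`.

```python
import json
import os
import tempfile

def validate_embeddings_file(path, expected_dim):
    """Parse a JSON Lines file and check that every record has a string id
    and an embedding of the expected length. Returns the record count."""
    count = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record["id"], str):
                raise ValueError(f"line {line_number}: id is not a string")
            if len(record["embedding"]) != expected_dim:
                raise ValueError(f"line {line_number}: wrong embedding length")
            count += 1
    return count

# Tiny stand-in data (toy 3-d vectors), written in the same shape as above.
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_data.json")
    with open(toy_path, "w") as f:
        for i, vector in enumerate(toy_vectors):
            f.write('{"id":"' + str(i) + '",')
            f.write('"embedding":[' + ",".join(str(x) for x in vector) + "]}")
            f.write("\n")
    record_count = validate_embeddings_file(toy_path, expected_dim=3)

print(record_count)
```

For the real file, the equivalent call would be `validate_embeddings_file("init_data.json", expected_dim=768)`, which should report 9 records given the shape `(9, 768)` shown earlier.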

### 2.3 Upload JSON data to GCS

In [17]:
EMBEDDINGS_INITIAL_URI = f"{BUCKET_URI}/matching_engine/initial/"
EMBEDDINGS_INITIAL_URI

'gs://me-ann1-370514-aip-og0kjvbz/matching_engine/initial/'

In [18]:
! gsutil cp init_data.json {EMBEDDINGS_INITIAL_URI}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Copying file://init_data.json [Content-Type=application/json]...
/ [1 files][ 83.6 KiB/ 83.6 KiB]                                                
Operation completed over 1 objects/83.6 KiB.                                     


## 3. Create Indexes

### 3.1  Define constants for Vertex AI 

In [19]:
import os
import sys

from google.cloud import aiplatform

In [20]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

In [21]:
DIMENSIONS = 768
DISPLAY_NAME = "tree_ah_st5"
ANN_COUNT=50

### 3.2 Create ANN index with configurations (Shallow tree + Asymmetric Hashing)

In [22]:
tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=EMBEDDINGS_INITIAL_URI,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=ANN_COUNT,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="Sentence-t5-base ANN index",
    labels={"label_name": "label_value"},
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/25661841074/locations/us-central1/indexes/1029503523412246528/operations/1538633825762934784
MatchingEngineIndex created. Resource name: projects/25661841074/locations/us-central1/indexes/1029503523412246528
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/25661841074/locations/us-central1/indexes/1029503523412246528')
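
The index above uses `DOT_PRODUCT_DISTANCE`, while section 0.1 talks about cosine similarity. The two give the same value when every vector has unit length, which the small sketch below demonstrates on made-up vectors. Whether `sentence-t5-base` returns unit-length vectors is an assumption here; it can be checked by computing the norms of the rows of `embedding`.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors with different lengths (norms 5 and 3).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cosine = cosine_similarity(a, b)
dot_of_unit_vectors = dot(normalize(a), normalize(b))
print(abs(cosine - dot_of_unit_vectors) < 1e-12)
```

If the embeddings turn out not to be normalized, normalizing them before writing the JSON file is one way to keep dot-product ranking aligned with cosine ranking.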


### 3.3 Get the resource name, to retrieve the existing MatchingEngineIndex later

In [23]:
INDEX_RESOURCE_NAME = tree_ah_index.resource_name
INDEX_RESOURCE_NAME

'projects/25661841074/locations/us-central1/indexes/1029503523412246528'

In [24]:
# tree_ah_index = aiplatform.MatchingEngineIndex(index_name=INDEX_RESOURCE_NAME)

## 4. Create an IndexEndpoint (with VPC Network)

In [25]:
VPC_NETWORK = "vpc1"
VPC_NETWORK_FULL = "projects/{}/global/networks/{}".format(PROJECT_NUMBER, VPC_NETWORK)
VPC_NETWORK_FULL

'projects/25661841074/global/networks/vpc1'

In [26]:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="ANN_index_endpoint",
    description="Sentence-t5-base ANN IndexEndpoint",
    network=VPC_NETWORK_FULL,
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000/operations/3099131096646811648
MatchingEngineIndexEndpoint created. Resource name: projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000')


Get the resource name, so that the MatchingEngineIndexEndpoint can be retrieved later.

In [27]:
INDEX_ENDPOINT_NAME = index_endpoint.resource_name
INDEX_ENDPOINT_NAME

'projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000'

## 5. Deploy ANN Index

In [28]:
DEPLOYED_INDEX_ID = "ANN_ST5_deployed"

In [29]:
index_endpoint = index_endpoint.deploy_index(
    index=tree_ah_index, deployed_index_id=DEPLOYED_INDEX_ID,
    min_replica_count=1, max_replica_count=1
)

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000/operations/1275173247561760768
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/25661841074/locations/us-central1/indexEndpoints/6162762673684480000


In [30]:
index_endpoint.deployed_indexes

[id: "ANN_ST5_deployed"
index: "projects/25661841074/locations/us-central1/indexes/1029503523412246528"
create_time {
  seconds: 1670085001
  nanos: 825895000
}
private_endpoints {
  match_grpc_address: "10.63.96.5"
}
index_sync_time {
  seconds: 1670085884
  nanos: 985000
}
automatic_resources {
  min_replica_count: 1
  max_replica_count: 1
}
deployment_group: "default"
]

## 6. Create Online Queries

Query the deployed index through the online gRPC API (Match service). The query has to be sent from a virtual machine instance in the same region.

In [31]:
# The number of nearest neighbors to be retrieved from database for each query.
NUM_NEIGHBOURS = 15

In [32]:
sentence = ['The cat likes to sleep in the sun']
QUERY = model.encode(sentence)
type(QUERY)

numpy.ndarray

In [33]:
QUERY.shape

(1, 768)

In [34]:
response = index_endpoint.match(
    deployed_index_id=DEPLOYED_INDEX_ID, queries=QUERY, num_neighbors=NUM_NEIGHBOURS
)

In [35]:
response

[[MatchNeighbor(id='7', distance=0.8463100790977478),
  MatchNeighbor(id='4', distance=0.8144865036010742),
  MatchNeighbor(id='6', distance=0.7567464709281921),
  MatchNeighbor(id='5', distance=0.6951133608818054),
  MatchNeighbor(id='8', distance=0.6735643148422241),
  MatchNeighbor(id='0', distance=0.6379110217094421),
  MatchNeighbor(id='1', distance=0.607258677482605),
  MatchNeighbor(id='2', distance=0.5947748422622681),
  MatchNeighbor(id='3', distance=0.5791853666305542)]]
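
The response only carries ids and scores, so a last step is needed to get back to the sentences and to drop the nearest hit, as described in section 0.6. The sketch below is one way to do that. To run without GCP it uses a stand-in `Neighbor` namedtuple instead of the SDK's `MatchNeighbor`, and `pick_episodes` is a hypothetical helper; the ids and scores are the top five copied from the response above.

```python
from collections import namedtuple

# Stand-in for the SDK's MatchNeighbor objects (same two fields).
Neighbor = namedtuple("Neighbor", ["id", "distance"])

sentences = [
    "I spent the day at the medical facility.",
    "That a cultured medical genius found her inspiring was beyond flattering.",
    "She drew nearer, eyes sweeping over the medical equipment in the room.",
    "I did not ask the American Medical Association their opinion of this arrangement.",
    "I think the cat wants dessert!",
    "Im in no mood to watch a cat fight tonight.",
    "The cat would like to eat the mouse.",
    "A large grey cat was asleep on a rocking chair.",
    "The pilot was able to land the airplane",
]

# Top five neighbors, copied from the response above.
neighbors = [
    Neighbor("7", 0.8463100790977478),
    Neighbor("4", 0.8144865036010742),
    Neighbor("6", 0.7567464709281921),
    Neighbor("5", 0.6951133608818054),
    Neighbor("8", 0.6735643148422241),
]

def pick_episodes(neighbors, sentences, skip=1, keep=2):
    """Sort by score (higher means closer for dot product), drop the `skip`
    closest hits, and return the sentences for the next `keep`."""
    ranked = sorted(neighbors, key=lambda n: n.distance, reverse=True)
    return [sentences[int(n.id)] for n in ranked[skip:skip + keep]]

next_episodes = pick_episodes(neighbors, sentences)
print(next_episodes)
```

With the live response, the same helper could be called as `pick_episodes(response[0], sentences)`, since each `MatchNeighbor` exposes `id` and `distance` as shown in the output above.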