In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with Text Embeddings + Vertex AI Vector Search



## Introduction

In this tutorial, you learn how to use Google Cloud AI tools to quickly bring the power of Large Language Models to enterprise systems.  

This tutorial covers the following:

*   What are embeddings, and what business challenges do they help solve?
*   Understanding Text with Vertex AI Text Embeddings
*   Find Embeddings fast with Vertex AI Vector Search
*   Grounding LLM outputs with Vector Search

This tutorial is based on [the blog post](https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings), combined with sample code.


### Prerequisites

This tutorial is designed for developers who have basic knowledge of and experience with Python programming and machine learning.

If you are not reading this tutorial in Qwiklab, you need a Google Cloud project that is linked to a billing account to run it. Please go through [this document](https://cloud.google.com/vertex-ai/docs/start/cloud-environment) to create a project and set up a billing account for it.




# Settings required outside this notebook


1. Enable APIs
  - Vertex AI API
  - BigQuery API
2. Enable Private Service Access (PSA) in the project
3. Grant the required IAM roles to the default Compute Engine service account, which has the format "{project-number}-compute@developer.gserviceaccount.com"

  Roles required:
    - roles/aiplatform.user
    - roles/bigquery.user
    - roles/storage.admin


# Text Embeddings in Action

Let's try Text Embeddings in action with actual sample code.

## Setup

Before getting started with the Vertex AI services, we need to set up the following.

* Install Python SDK
* Environment variables
* Authentication (Colab only)
* Enable APIs
* Set IAM permissions

### Install Python SDK

The Vertex AI, Cloud Storage and BigQuery APIs can be accessed in multiple ways, including the REST API and the Python SDK. In this tutorial we will use the SDK.

In [64]:
!pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage google-cloud-bigquery

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [65]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>



### Authenticating your notebook environment
If you are using Vertex AI Colab Enterprise, you will not require additional authentication.

For more information, you can check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

To authenticate in JupyterLab running on a local Mac, run:
```
!gcloud auth application-default login
```

### Environment variables

Set the environment variables. Replace the `PROJECT_ID` value below with your own project ID, then run the cell.

In [2]:
# define project information

PROJECT_ID = "derma-acs-fe-sand-1pf8"  # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}



## Getting Started with Vertex AI Embeddings for Text

Now we're ready to get started with embeddings!

### Data Preparation

We will be using [the Stack Overflow public dataset](https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow) hosted in the BigQuery table `bigquery-public-data.stackoverflow.posts_questions`. This is a very big dataset with 23 million rows that doesn't fit into memory, so we limit it to 5000 rows for this tutorial.

In [3]:
# load the BQ Table into a Pandas Dataframe
import pandas as pd
from google.cloud import bigquery

QUESTIONS_SIZE = 5000

bq_client = bigquery.Client(project=PROJECT_ID)
QUERY_TEMPLATE = """
        SELECT distinct q.id, q.title
        FROM (SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
        where Score > 0 ORDER BY View_Count desc) AS q
        LIMIT {limit} ;
        """
query = QUERY_TEMPLATE.format(limit=QUESTIONS_SIZE)
query_job = bq_client.query(query)
rows = query_job.result()
df = rows.to_dataframe()

# examine the data
df.head()

Unnamed: 0,id,title
0,73210586,Get list of all compartments in OCI Tenancy
1,73229845,Is there a way to use a skyve validator (e.g. ...
2,73501178,How can I prevent prettier from changing singl...
3,73301407,how to add policy file for the following in jaas
4,73214807,"In skyve, how do I do a OR filter in a list grid?"


### Call the API to generate embeddings

With the Stack Overflow dataset, we will use the `title` column (the question title) and generate embeddings for it with the Embeddings for Text API. The API is available under the [vertexai](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai) package of the SDK.

You may see some warning messages from the TensorFlow library but you can ignore them.

In [4]:
# init the vertexai package
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

From the package, import [TextEmbeddingModel](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingModel) and get a model.

In [5]:
# Load the text embeddings model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

In this tutorial we will use the `textembedding-gecko@003` model for getting text embeddings. Please take a look at [Supported models](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings#supported_models) in the docs to see the list of supported models.

Once you get the model, you can call its [get_embeddings](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingModel#vertexai_language_models_TextEmbeddingModel_get_embeddings) function to get embeddings. You can pass up to 5 texts at once in a call. But there is a caveat. By default, the text embeddings API has a "requests per minute" quota set to 60 for new Cloud projects and 600 for projects with usage history (see [Quotas and limits](https://cloud.google.com/vertex-ai/docs/quotas#request_quotas) to check the latest quota value for `base_model:textembedding-gecko`). So, rather than using the function directly, you may want to define a wrapper like the one below, which passes 5 texts per call and retries with exponential backoff when a call fails, for example because the quota was exceeded.

In [6]:
import random
import time

import tqdm  # to show a progress bar

# get embeddings for a list of texts
BATCH_SIZE = 5


def get_embeddings_wrapper(texts, max_retries=5):
    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        retry_delay = 1  # initial delay in seconds, reset for each batch
        for attempt in range(max_retries):
            try:
                result = model.get_embeddings(texts[i : i + BATCH_SIZE])
                embs.extend(e.values for e in result)
                break  # this batch succeeded, move on to the next one
            except Exception:
                time.sleep(retry_delay)
                retry_delay *= 2  # double the delay for the next attempt
                retry_delay += random.uniform(0, 1)  # add jitter
        else:
            # every attempt failed; stop so the embeddings stay aligned with the texts
            raise Exception("Maximum retry attempts reached")
    return embs
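
The wrapper above reacts to quota errors after they happen. If you would rather stay under the quota proactively, one option is to pace the calls on the client side. The sketch below is a minimal illustration and not part of the SDK: `batched` and `Throttle` are hypothetical helper names, and 600 requests per minute is simply the default quota mentioned above, so check the actual value for your project.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


class Throttle:
    """Hypothetical helper that spaces calls to stay under a per-minute limit."""

    def __init__(self, max_calls_per_minute):
        self.min_interval = 60.0 / max_calls_per_minute
        self.last_call = None

    def wait(self):
        # Sleep just long enough to keep at least `min_interval` between calls
        if self.last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()


throttle = Throttle(max_calls_per_minute=600)  # assumed quota value
for batch in batched(["q1", "q2", "q3", "q4", "q5", "q6", "q7"], 5):
    throttle.wait()
    print(batch)  # in the notebook, model.get_embeddings(batch) would go here
```

Pacing and retrying are complementary: the throttle keeps the normal case under the limit, while the retry loop handles occasional transient failures.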

The following code will get embeddings for the question titles and add them as a new column `embedding` to the DataFrame. This will take a few minutes.

In [7]:
# get embeddings for the question titles and add them as "embedding" column
df = df.assign(embedding=get_embeddings_wrapper(list(df.title)))
df.head()

100%|██████████| 2000/2000 [02:32<00:00, 13.11it/s]


Unnamed: 0,id,title,embedding
0,73210586,Get list of all compartments in OCI Tenancy,"[0.013951302506029606, -0.014743354171514511, ..."
1,73229845,Is there a way to use a skyve validator (e.g. ...,"[-0.021185141056776047, -0.05046217516064644, ..."
2,73501178,How can I prevent prettier from changing singl...,"[0.005010632798075676, -0.06683017313480377, -..."
3,73301407,how to add policy file for the following in jaas,"[0.040733642876148224, -0.02197178639471531, -..."
4,73214807,"In skyve, how do I do a OR filter in a list grid?","[0.029326317831873894, -0.01743663102388382, -..."


## Look at the embedding similarities

Let's see how these embeddings are organized in the embedding space with their meanings by quickly calculating the similarities between them and sorting them.

As embeddings are vectors, you can calculate the similarity between two embeddings by using one of the popular metrics like the following:

![](https://storage.googleapis.com/github-repo/img/embeddings/textemb-vs-notebook/8.png)

Which metric should we use? Usually it depends on how each model is trained. In case of the model `textembedding-gecko`, we need to use inner product (dot product).
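
To make the three metrics concrete, here is a small sketch that computes dot product, cosine similarity and Euclidean distance with numpy. The two vectors are made up for illustration and are not real embeddings. The last lines also show a general property: once both vectors are scaled to length 1, the dot product and the cosine similarity are the same number.

```python
import numpy as np

# Two made-up vectors that point in the same direction but differ in length
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])

dot_product = float(np.dot(a, b))
cosine_similarity = dot_product / float(np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = float(np.linalg.norm(a - b))

print(f"dot product:        {dot_product:.4f}")
print(f"cosine similarity:  {cosine_similarity:.4f}")
print(f"euclidean distance: {euclidean_distance:.4f}")

# After scaling both vectors to length 1, dot product equals cosine similarity
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_dot_product = float(np.dot(a_unit, b_unit))
print(f"dot product of unit vectors: {unit_dot_product:.4f}")
```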

The following code picks the first question in the DataFrame as the key and uses the numpy `np.dot` function to calculate the similarities between it and the other questions.

In [8]:
import numpy as np

# pick the first key/question on the dataframe
key = 0
print(f"Key question: {df.title[key]}\n")

# calc dot product between the key and other questions
embs = np.array(df.embedding.to_list())
similarities = np.dot(embs[key], embs.T)

# print similarities for the first 5 questions
similarities[:5]

Key question: Get list of all compartments in OCI Tenancy



array([0.99999804, 0.50980233, 0.47981528, 0.55852779, 0.612905  ])

Finally, sort the questions by similarity and print the list.

In [9]:
# print the question
print(f"Key question: {df.title[key]}\n")

# sort and print the questions by similarities
sorted_questions = sorted(
    zip(df.title, similarities), key=lambda x: x[1], reverse=True
)[:20]
for i, (question, similarity) in enumerate(sorted_questions):
    print(f"{similarity:.4f} {question}")

Key question: Get list of all compartments in OCI Tenancy

1.0000 Get list of all compartments in OCI Tenancy
0.7046 How can I check the users permissions for a specific OU?
0.7028 Get all container registries under a subscription - .NET Azure sdk
0.6926 Finding Menu ids of existing Modules in Odoo?
0.6816 Get active and total bookings (from 1 table) for every user in users table
0.6763 Get Entities through common relation
0.6754 Get items in a multidimensional array-like object
0.6733 Want to create possible multi hierarchy sub category with its parent sub category string array?
0.6724 c# get records which exist in each group ASP.NET MVC Entity Framework
0.6719 Getting a section of an array in OpenCL
0.6712 Aggregating values in a list by multiple group by and calculating percentage of distribution in java 8
0.6709 Select all rows for the first N distinct child table rows
0.6705 generate all possible combinations with multiple colums in pan das
0.6702 Odoo 13: Contacts: Filter/Search 

# Find embeddings fast with Vertex AI Vector Search

As we have explained above, you can find similar embeddings by calculating the distance or similarity between the embeddings.

But this isn't easy when you have millions or billions of embeddings. For example, if you have 1 million embeddings with 768 dimensions, a single query needs 1 million distance calculations, each over 768 dimensions. This would take some seconds, which is too slow.
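
As a rough illustration of this brute-force approach, the sketch below scores one query against every stored vector and keeps the top results. The data is random and the sizes are scaled down so the cell runs quickly; `brute_force_top_k` is a hypothetical helper name, not an SDK function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings, scaled to unit length
num_items, dims = 10_000, 768
embs = rng.normal(size=(num_items, dims)).astype(np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)


def brute_force_top_k(query, embeddings, k=5):
    """Score the query against every embedding and return the k best indices."""
    scores = embeddings @ query  # one dot product per stored embedding
    top = np.argpartition(-scores, k)[:k]  # k best, in no particular order
    return top[np.argsort(-scores[top])]  # order them from best to worst


top_ids = brute_force_top_k(embs[42], embs)
print(top_ids)
```

Every query touches all `num_items * dims` numbers, so the work grows linearly with the size of the dataset.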

Researchers have therefore been studying a technique called [Approximate Nearest Neighbor (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) for faster search. ANN uses "vector quantization" to separate the space into multiple subspaces organized in a tree structure. This is similar to an index in a relational database that improves query performance, and it enables very fast and scalable search over billions of embeddings.

With the rise of LLMs, ANN is rapidly becoming popular and is now widely known as vector search technology.

![](https://storage.googleapis.com/gweb-cloudblog-publish/images/7._ANN.1143068821171228.max-2200x2200.png)
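
The idea of partitioning the space can be sketched in a few lines. The toy example below is only an illustration of the general principle, assuming random data and randomly chosen centroids; it is not how ScaNN or Vector Search is implemented, and `ann_search` is a hypothetical name. Each vector is assigned to its closest centroid, and a query is compared only against the vectors in the few partitions whose centroids score highest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings, scaled to unit length
embs = rng.normal(size=(5_000, 64)).astype(np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

# Partition the space: use randomly picked items as centroids and
# assign every vector to the centroid with the highest dot product
num_partitions = 50
centroids = embs[rng.choice(len(embs), size=num_partitions, replace=False)]
assignments = np.argmax(embs @ centroids.T, axis=1)


def ann_search(query, k=5, partitions_to_search=5):
    """Search only the partitions whose centroids are closest to the query."""
    nearest_parts = np.argsort(-(centroids @ query))[:partitions_to_search]
    candidates = np.flatnonzero(np.isin(assignments, nearest_parts))
    scores = embs[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]


result = ann_search(embs[7])
print(result)
```

Only a fraction of the vectors is scored per query, which is where the speedup comes from. The price is that a true neighbor sitting in a partition that was not searched is missed, which is why the result is called approximate.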

In 2020, Google Research published a new ANN algorithm called [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html). It is considered one of the best ANN algorithms in the industry, and it is also the most important foundation for search and recommendation in major Google services such as Google Search, YouTube and many others.


## What is Vertex AI Vector Search?

Google Cloud developers can take full advantage of Google's vector search technology with [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview) (previously called Matching Engine). With this fully managed service, developers can just add the embeddings to its index and issue a search query with a key embedding for blazingly fast vector search. In the case of the Stack Overflow demo, Vector Search can find relevant questions from 8 million embeddings in tens of milliseconds.

![](https://storage.googleapis.com/github-repo/img/embeddings/textemb-vs-notebook/9.png)

With Vector Search, you don't need to spend much time and money building your own vector search service from scratch or using open source tools if your goal is high scalability, availability and maintainability for production systems.

## Get Started with Vector Search

Once you have the embeddings, getting started with Vector Search is pretty easy. In this section, we will follow the steps below.

### Setting up Vector Search
- Save the embeddings in JSON files on Cloud Storage
- Build an Index
- Create an Index Endpoint
- Deploy the Index to the endpoint

### Use Vector Search

- Query with the endpoint

### **Tip for Colab users**

If you use Colab for this tutorial, you may lose your runtime while you are waiting for the Index building and deployment in the later sections as it takes tens of minutes. In that case, run the following sections again with the new instance to recover the runtime: [Install Python SDK, Environment variables and Authentication](https://colab.research.google.com/drive/1xJhLFEyPqW0qvKiERD6aYgeTHa6_U50N?resourcekey=0-2qUkxckCjt6W03AsqvZHhw#scrollTo=AtXnXhF8U-8R&line=9&uniqifier=1).

Then, use the [Utilities](https://colab.research.google.com/drive/1xJhLFEyPqW0qvKiERD6aYgeTHa6_U50N?resourcekey=0-2qUkxckCjt6W03AsqvZHhw#scrollTo=BE1tELsH-u8N&line=1&uniqifier=1) to recover the Index and Index Endpoint and continue with the rest.

### Save the embeddings in a JSON file
To load the embeddings to Vector Search, we need to save them in files in JSONL format (one JSON object per line). See more information in the docs at [Input data format and structure](https://cloud.google.com/vertex-ai/docs/matching-engine/match-eng-setup/format-structure#data-file-formats).

First, export the `id` and `embedding` columns from the DataFrame in JSONL format, and save it.

In [10]:
# save id and embedding as a json file
jsonl_string = df[["id", "embedding"]].to_json(orient="records", lines=True)
with open("questions.json", "w") as f:
    f.write(jsonl_string)

# show the first few lines of the json file
! head -n 3 questions.json

{"id":73210586,"embedding":[0.0139513025,-0.0147433542,-0.0261959955,0.0285781734,0.0516352989,0.0048935171,0.024650516,-0.033777304,-0.0160438418,0.0286903698,-0.0046704351,0.0086344955,-0.0072673107,-0.0226843525,-0.006174942,-0.0352224074,0.0263284482,-0.0011634964,-0.059246175,0.0137064122,-0.0134991873,0.0025238204,-0.0377190746,-0.0218907464,-0.003821658,-0.0063919346,-0.0157446396,-0.087614581,-0.0289135128,0.0285816994,-0.0468162522,0.0410229005,-0.0902959555,0.0077241398,-0.0240082685,-0.0206805393,-0.0034184416,-0.0418329798,-0.0261355247,0.0611698329,0.0227663163,-0.0229563583,-0.0361920446,-0.0041173152,0.0101775154,-0.055483494,-0.0137018105,0.023108907,-0.0059295809,-0.0269603673,-0.003033113,-0.0019886068,0.0335769951,-0.0213403758,0.0083882678,-0.0227813851,0.0024620781,-0.0047742445,-0.0053480854,0.0087519363,0.0119840298,0.017712988,0.0115757035,0.0762576535,0.0171728283,-0.0869579464,-0.0360775627,-0.0181585867,0.040822193,0.0339637809,0.0189821385,-0.0370352305,0.09
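
Before uploading, it can be worth checking that every line of the file parses as a JSON object with an `id` and an `embedding` of the expected length. The sketch below does this for a small made-up file; `check_jsonl` and the toy file name are hypothetical, and for the real file you would pass `questions.json` with `expected_dims=768`.

```python
import json
import os
import tempfile

# Made-up records written in the same one-object-per-line layout
records = [
    {"id": 1, "embedding": [0.1, 0.2, 0.3]},
    {"id": 2, "embedding": [0.4, 0.5, 0.6]},
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_questions.json")
with open(toy_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")


def check_jsonl(path, expected_dims):
    """Return the number of records, raising if a line is malformed."""
    count = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            if "id" not in record:
                raise ValueError(f"line {line_number}: missing 'id'")
            if len(record.get("embedding", [])) != expected_dims:
                raise ValueError(f"line {line_number}: unexpected embedding size")
            count += 1
    return count


num_records = check_jsonl(toy_path, expected_dims=3)
print(f"{num_records} valid records")
```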

Then, create a new Cloud Storage bucket and copy the file to it.

In [12]:
BUCKET_URI = f"gs://{PROJECT_ID}-rag-session"
! gsutil mb -l $LOCATION -p {PROJECT_ID} {BUCKET_URI}
! gsutil cp questions.json {BUCKET_URI}

Creating gs://derma-acs-fe-sand-1pf8-rag-session/...
Copying file://questions.json [Content-Type=application/json]...
\ [1 files][ 98.4 MiB/ 98.4 MiB]                                                
Operation completed over 1 objects/98.4 MiB.                                     


### Create an Index

Now we're ready to load the embeddings to Vector Search. Its APIs are available under the [aiplatform](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform) package of the SDK.

In [13]:
# init the aiplatform package
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

Create a [MatchingEngineIndex](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex) with its `create_tree_ah_index` function (Matching Engine is the previous name of Vector Search).

In [18]:
# create index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag_demo1_stackoverflow_index",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=20,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Creating MatchingEngineIndex
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Create MatchingEngineIndex backing LRO: projects/842907197256/locations/us-central1/indexes/7851027593762439168/operations/2121765280752336896
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:MatchingEngineIndex created. Resource name: projects/842907197256/locations/us-central1/indexes/7851027593762439168
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:To use this MatchingEngineIndex in another session:
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:index = aiplatform.MatchingEngineIndex('projects/842907197256/locations/us-central1/indexes/7851027593762439168')


Calling the `create_tree_ah_index` function starts building an Index. This takes a few minutes if the dataset is small; otherwise it takes about 50 minutes or more, depending on the size of the dataset. You can check the status of the index creation on [the Vector Search Console > INDEXES tab](https://console.cloud.google.com/vertex-ai/matching-engine/indexes).

![](https://storage.googleapis.com/github-repo/img/embeddings/vs-quickstart/creating-index.png)

#### The parameters for creating index

- `contents_delta_uri`: The URI of Cloud Storage directory where you stored the embedding JSON files
- `dimensions`: Dimension size of each embedding. In this case, it is 768 as we are using the embeddings from the Text Embeddings API.
- `approximate_neighbors_count`: how many similar items we want to retrieve in typical cases
- `distance_measure_type`: which metric to use to measure the distance/similarity between embeddings. In this case it's `DOT_PRODUCT_DISTANCE`

See [the document](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) for more details on creating Index and the parameters.

#### Batch Update or Streaming Update?
There are two types of index: Index for *Batch Update* (used in this tutorial) and Index for *Streaming Updates*. The Batch Update index can be updated with a batch process, whereas the Streaming Update index can be updated in real time. The latter is better suited to use cases where you add or update individual embeddings in the index frequently and it is crucial to serve the latest embeddings, such as e-commerce product search.



### Create Index Endpoint and deploy the Index

To use the Index, you need to create an [Index Endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public). It works as a server instance accepting query requests for your Index.

In [21]:
DEPLOYED_INDEX_ID = "rag_demo1_stackoverflow_index_deployed"

In [20]:
# This step takes about 25 minutes

# create IndexEndpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="gcp_rag_session_endpoint",
    public_endpoint_enabled=True,
)

# deploy the Index to the Index Endpoint
my_index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Creating MatchingEngineIndexEndpoint
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Create MatchingEngineIndexEndpoint backing LRO: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592/operations/7168048663220977664
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:MatchingEngineIndexEndpoint created. Resource name: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:To use this MatchingEngineIndexEndpoint in another session:
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592')


With the Index Endpoint, deploy the Index by specifying a unique deployed index ID.

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592/operations/8707153835874844672
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592


<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x7f4a2dcf5900> 
resource name: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592

The first time you deploy an Index to an Index Endpoint, it takes around 25 minutes to automatically build and initialize the backend for it. After the first deployment, it finishes in seconds. To see the status of the index deployment, open [the Vector Search Console > INDEX ENDPOINTS tab](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints) and click the Index Endpoint.

<img src="https://storage.googleapis.com/github-repo/img/embeddings/vs-quickstart/deploying-index.png" width="70%">

### Run Query

Finally, we're ready to use Vector Search. The following code creates an embedding for a test question and finds similar questions with Vector Search.

In [23]:
def run_query(query: str):
    # Get the embedding for the query text
    test_embeddings = get_embeddings_wrapper([query])
    print()

    # Get the closest vectors from the deployed index
    response = my_index_endpoint.find_neighbors(
        deployed_index_id=DEPLOYED_INDEX_ID,
        queries=test_embeddings,
        num_neighbors=20,
    )

    # Show the distance and title of each neighbor
    for neighbor in response[0]:
        neighbor_id = np.int64(neighbor.id)
        similar = df.query("id == @neighbor_id", engine="python")
        print(f"{neighbor.distance:.4f} {similar.title.values[0]}")

In [24]:
run_query("How to refresh cookies in JS?")

100%|██████████| 1/1 [00:00<00:00,  7.46it/s]



0.8056 Refresh data received from an API every minute React, Javascript
0.7975 How can I re-render to display changes without refreshing the page?
0.7971 When I make an AJAX call the page refreshes - why?
0.7937 How to send and receive cookies by react js to a SAP server?
0.7919 Thymeleaf - How to interact and reload Javascript?
0.7842 How do I pass cookies from a PHP proxy server to a webpage?
0.7841 How to refresh MainActivity?
0.7749 Javascript test : Selenium cookies data url
0.7725 Prevent Token Refresh Request From Being Fired Multiple Times in Angular
0.7705 How to execute javascript code after htmx makes an ajax request?
0.7666 How to store JWT token in cookie React fetch
0.7657 How can I reset browser's navigation history in angular2?
0.7650 Where to read Forms authentication cookie?
0.7648 How to update a CSS Grid without re-rendering a specific component in ReactJS?
0.7638 JWT Token Refresh Responsibility - Best practice
0.7607 How can I keep the check-boxes checked after r

In [61]:
# Try a query that is unrelated to programming
run_query("what is the meaning of life?")

100%|██████████| 1/1 [00:00<00:00, 10.11it/s]



0.6307 Summarize in a line
0.6120 How to get the se semantic meaning of a word/phrase
0.5965 Please missing data
0.5885 Whats the reason for segmentation fault
0.5871 Translation and fixed number of letters words
0.5827 How to make a list of words?
0.5774 How to organize based on specific data values
0.5771 What does this C value mean?
0.5768 I'm having trouble understanding the syntax used in a piece of code
0.5764 Scheduling & Routing Optimization
0.5751 Finding maximum
0.5737 It is possible to use html <video> tag to show a Youtube video
0.5728 How to perform sequence classification with a neural network?
0.5713 Why is the result an empty array?
0.5696 Questions example phoneNumber
0.5695 Output produces random(?) numbers
0.5689 Understanding Docker image
0.5687 What's the point of a default constructor in OOP?
0.5681 Text recommendation based on keywords
0.5669 How to design meaningful objects and their relationships in simple physic simulations


The `find_neighbors` function only takes milliseconds to fetch the similar items even when you have billions of items on the Index, thanks to the ScaNN algorithm. Vector Search also supports [autoscaling](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public#autoscaling) which can automatically resize the number of nodes based on the demands of your workloads.

# IMPORTANT: Cleaning Up

If you are using your own Cloud project, not a temporary project on Qwiklab, please make sure to delete all the Indexes, Index Endpoints and Cloud Storage buckets after finishing this tutorial. Otherwise the remaining resources will **incur unexpected costs**.

If you used Workbench, you may also need to delete the Notebooks from [the console](https://console.cloud.google.com/vertex-ai/workbench).

In [None]:
# wait for a confirmation
input("Press Enter to delete Index Endpoint, Index and Cloud Storage bucket:")

# delete Index Endpoint
my_index_endpoint.undeploy_all()
my_index_endpoint.delete(force=True)

# delete Index
my_index.delete()

# delete Cloud Storage bucket
! gsutil rm -r {BUCKET_URI}

# Summary

## Grounding LLM outputs with Vertex AI Vector Search

As we have seen, by combining the Embeddings API and Vector Search, you can use the embeddings to "ground" LLM outputs to real business data with low latency.

For example, if a user asks a question, the Embeddings API can convert it to an embedding and issue a query on Vector Search to find similar embeddings in its index. Those embeddings represent the actual business data in the databases. As we are just retrieving the business data and not generating any artificial text, there is no risk of hallucinations in the result.

![](https://storage.googleapis.com/gweb-cloudblog-publish/original_images/10._grounding.png)
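
The flow can be sketched end to end with stand-ins for the two services. Everything here is hypothetical and only illustrates the shape of the flow: `toy_embed` is a simple word-count vector standing in for the Embeddings API, the in-memory `index` dictionary stands in for Vector Search, and the three documents are made up. The point to notice is that the function returns stored text as-is and generates nothing.

```python
import math
import re

# Made-up "business data" keyed by document ID
documents = {
    "doc-1": "How do I reset my password?",
    "doc-2": "How do I update my billing address?",
    "doc-3": "Where can I download my invoice?",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


vocabulary = sorted({tok for text in documents.values() for tok in tokenize(text)})


def toy_embed(text):
    """Stand-in for an embedding model: unit-length word-count vector."""
    tokens = tokenize(text)
    counts = [float(tokens.count(word)) for word in vocabulary]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


# Stand-in for the vector index: document ID -> embedding
index = {doc_id: toy_embed(text) for doc_id, text in documents.items()}


def ground(question, num_neighbors=1):
    """Embed the question, find the closest documents, return their stored text."""
    query = toy_embed(question)
    ranked = sorted(
        index.items(),
        key=lambda item: sum(q * d for q, d in zip(query, item[1])),
        reverse=True,
    )
    return [documents[doc_id] for doc_id, _ in ranked[:num_neighbors]]


answer = ground("How can I reset my password?")
print(answer)
```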

### The difference between the questions and answers

In this tutorial, we have used the Stack Overflow dataset. We had a reason for choosing it: because the dataset has many pairs of **questions and answers**, you can find answers to your question simply by finding questions similar to it.

In many business use cases, the semantics (meaning) of questions and answers are different. Also, there could be cases where you would want to add a variety of recommended or personalized items to the results, like product search on e-commerce sites.

In these cases, simple semantic search doesn't work well. It's more like a recommendation system problem where you may want to train a model (e.g. a Two-Tower model) to learn the relationship between the question embedding space and the answer embedding space. Also, many production systems add a reranking phase after the semantic search to achieve higher search quality. Please see [Scaling deep retrieval with TensorFlow Recommenders and Vertex AI Matching Engine](https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture) to learn more.

### Hybrid of semantic + keyword search

Another typical challenge you will face in production systems is supporting keyword search combined with semantic search. For example, for e-commerce product search, you may want to let users find a product by entering its product name or model number. As LLMs don't memorize those product names or model numbers, semantic search can't handle those "usual" search functionalities.

[Vertex AI Search](https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-search-and-conversation-is-now-generally-available) is another product you may consider for those requirements. While Vector Search provides only a simple semantic search capability, Vertex AI Search provides an integrated search solution that combines semantic search, keyword search, reranking and filtering, available as an out-of-the-box tool.

### What about Retrieval Augmented Generation (RAG)?

In this tutorial, we have looked at the simple combination of LLM embeddings and vector search. From this starting point, you may also extend the design to [Retrieval Augmented Generation (RAG)](https://www.google.com/search?q=Retrieval+Augmented+Generation+(RAG)&oq=Retrieval+Augmented+Generation+(RAG)).

RAG is a popular architecture pattern for implementing grounding with an LLM behind a text chat UI. The idea is to use the LLM text chat UI as a frontend for document retrieval with vector search and for summarization of the results.

![](https://storage.googleapis.com/gweb-cloudblog-publish/images/Figure-7-Ask_Your_Documents_Flow.max-529x434.png)

Each of the two solutions has its pros and cons.

| | Emb + vector search | RAG |
|---|---|---|
| Design | simple | complex |
| UI | Text search UI | Text chat UI |
| Summarization of result | No | Yes |
| Multi-turn (Context aware) | No | Yes |
| Latency | millisecs | seconds |
| Cost | lower | higher |
| Hallucinations | No risk | Some risk |

The Embedding + vector search pattern we have looked at with this tutorial provides simple, fast and low cost semantic search functionality with the LLM intelligence. RAG adds context-aware text chat experience and result summarization to it. While RAG provides the more "Gen AI-ish" experience, it also adds a risk of hallucination and higher cost and time for the text generation.
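
To show what RAG adds on top of retrieval, here is a minimal sketch of the prompt-building step: the passages returned by vector search are placed into a prompt that is then sent to an LLM. `build_rag_prompt`, the prompt wording and the sample passages are all hypothetical, and the LLM call itself is deliberately left out.

```python
def build_rag_prompt(question, retrieved_passages):
    """Combine the user's question with retrieved passages into one prompt."""
    context = "\n".join(f"- {passage}" for passage in retrieved_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "How do I refresh cookies in JS?",
    [
        "How to store JWT token in cookie React fetch",
        "How to send and receive cookies by react js to a SAP server?",
    ],
)
print(prompt)
```

The text generation that follows this step is what brings the extra latency, cost and hallucination risk listed in the table above.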

To learn more about how to build a RAG solution, you may look at [Building Generative AI applications made easy with Vertex AI PaLM API and LangChain](https://cloud.google.com/blog/products/ai-machine-learning/generative-ai-applications-with-vertex-ai-palm-2-models-and-langchain).

## Resources

To learn more, please check out the following resources:

### Documentation

[Vertex AI Embeddings for Text API documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings)

[Vector Search documentation](https://cloud.google.com/vertex-ai/docs/matching-engine/overview)

### Vector Search blog posts

[Vertex Matching Engine: Blazing fast and massively scalable nearest neighbor search](https://cloud.google.com/blog/products/ai-machine-learning/vertex-matching-engine-blazing-fast-and-massively-scalable-nearest-neighbor-search)

[Find anything blazingly fast with Google's vector search technology](https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology)

[Enabling real-time AI with Streaming Ingestion in Vertex AI](https://cloud.google.com/blog/products/ai-machine-learning/real-time-ai-with-google-cloud-vertex-ai)

[Mercari leverages Google's vector search technology to create a new marketplace](https://cloud.google.com/blog/topics/developers-practitioners/mercari-leverages-googles-vector-search-technology-create-new-marketplace)

[Recommending news articles using Vertex AI Matching Engine](https://cloud.google.com/blog/products/ai-machine-learning/recommending-articles-using-vertex-ai-matching-engine)

[What is Multimodal Search: "LLMs with vision" change businesses](https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search)

# Utilities

Creating or deploying Indexes sometimes takes tens of minutes, and you may lose the connection with the Colab runtime in the meantime. In that case, instead of creating or deploying a new Index again, you can check [the Vector Search Console](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints) and get the existing ones to continue.

## Get an existing Index

To get an Index object that already exists, replace the following `[your-index-id]` with the index ID and run the cell. You can check the ID on [the Vector Search Console > INDEXES tab](https://console.cloud.google.com/vertex-ai/matching-engine/indexes).

In [None]:
my_index_id = "[your-index-id]"  # @param {type:"string"}
my_index = aiplatform.MatchingEngineIndex(my_index_id)

## Get an existing Index Endpoint

To get an Index Endpoint object that already exists, replace the following `[your-index-endpoint-id]` with the Index Endpoint ID and run the cell. You can check the ID on [the Vector Search Console > INDEX ENDPOINTS tab](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints).

In [None]:
my_index_endpoint_id = "[your-index-endpoint-id]"  # @param {type:"string"}
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(my_index_endpoint_id)