In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with Text Embeddings + Vertex AI Vector Search



## Introduction

In this tutorial, you learn how to use Google Cloud AI tools to quickly bring the power of Large Language Models to enterprise systems.  

This tutorial covers the following:

*   What are embeddings, and what business challenges do they help solve?
*   Understanding Text with Vertex AI Text Embeddings
*   Find Embeddings fast with Vertex AI Vector Search
*   Grounding LLM outputs with Vector Search

This tutorial is based on [the blog post](https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings), combined with sample code.


### Prerequisites

This tutorial is designed for developers who have basic knowledge of and experience with Python programming and machine learning.

If you are not reading this tutorial in Qwiklab, you need a Google Cloud project that is linked to a billing account to run it. Please go through [this document](https://cloud.google.com/vertex-ai/docs/start/cloud-environment) to create a project and set up a billing account for it.




# Settings required outside this notebook


1. Enable APIs
  - Vertex AI API
  - BigQuery API
2. Enable Private Service Access (PSA) in the project
3. Grant the required IAM roles to the default Compute Engine service account, which has the format "{project-number}-compute@developer.gserviceaccount.com"

  Roles required:
    - roles/aiplatform.user
    - roles/bigquery.user
    - roles/storage.admin


# Text Embeddings in Action

Let's try Text Embeddings in action with actual sample code.

## Setup

Before getting started with the Vertex AI services, we need to set up the following.

* Install Python SDK
* Environment variables
* Authentication (Colab only)
* Enable APIs
* Set IAM permissions

### Install Python SDK

The Vertex AI, Cloud Storage and BigQuery APIs can be accessed in multiple ways, including the REST API and the Python SDK. In this tutorial we will use the SDK.

In [None]:
!pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage google-cloud-bigquery

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>



### Authenticating your notebook environment
If you are using Vertex AI Colab Enterprise, you will not require additional authentication.

For more information, you can check out the setup instructions [here](https://cloudtech.apple.com/documentation/gcp/getting-started/authentication-to-gcp).

To authenticate in JupyterLab running on a local Mac, run:
```
!gcloud auth application-default login
```

### Environment variables

Set the environment variables. Replace the `project-id` placeholder below with your project ID and run the cell.

In [57]:
# Enter project information

PROJECT_ID = "project-id"  # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}

## Getting Started with Vertex AI Embeddings for Text

Now you're ready to get started with embeddings!

### Data Preparation

We will be using [the Stack Overflow public dataset](https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow) hosted in the BigQuery table `bigquery-public-data.stackoverflow.posts_questions`. This is a very large dataset with 23 million rows that doesn't fit into memory, so we limit it to 3000 rows for this tutorial.

In [39]:
# load the BQ Table into a Pandas Dataframe
import pandas as pd
from google.cloud import bigquery

QUESTIONS_SIZE = 3000

bq_client = bigquery.Client(project=PROJECT_ID)
QUERY_TEMPLATE = """
        SELECT distinct q.id, q.title
        FROM (SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
        where Score > 0 ORDER BY View_Count desc) AS q
        LIMIT {limit} ;
        """
query = QUERY_TEMPLATE.format(limit=QUESTIONS_SIZE)
query_job = bq_client.query(query)
rows = query_job.result()
df = rows.to_dataframe()

# examine the data
df.head()

| | id | title |
|---|---|---|
| 0 | 73370728 | Firebase doesn't work on Android Studio Emulator |
| 1 | 73401682 | Appletv website is unresponsive through my ele... |
| 2 | 73415813 | In tensorflow 1, when the loss function is def... |
| 3 | 73186559 | view and control a windows 10 desktop with ras... |
| 4 | 73198124 | Moongose: Insert Many Docs then get Id and Upd... |


### Call the API to generate embeddings

With the Stack Overflow dataset, we will use the `title` column (the question title) and generate an embedding for each title with the Embeddings for Text API. The API is available under the [vertexai](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai) package of the SDK.

You may see some warning messages from the TensorFlow library but you can ignore them.

In [40]:
# init the vertexai package
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

From the package, import [TextEmbeddingModel](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingModel) and get a model.

In [41]:
# Load the text embeddings model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

In this tutorial we will use the `textembedding-gecko@003` model to get text embeddings. Please take a look at [Supported models](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings#supported_models) in the documentation for the list of supported models.

Once you get the model, you can call its [get_embeddings](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingModel#vertexai_language_models_TextEmbeddingModel_get_embeddings) function to get embeddings. You can pass up to 250 texts at once in every call with this model version.

It is always recommended to implement a [Retry with Exponential Backoff strategy](https://www.pullrequest.com/blog/retrying-and-exponential-backoff-smart-strategies-for-robust-software/) when calling the model, or any API in general.

In [42]:
import time
import random
import tqdm  # to show a progress bar

# get embeddings for a list of texts
BATCH_SIZE = 250

def get_embeddings_wrapper(texts, max_retries=5):
    """
    Retrieves embeddings for a list of texts, with retry logic in case of errors.

    Args:
        texts (list): A list of text strings.
        max_retries (int, optional): The maximum number of retries to attempt in case of errors. Defaults to 5.

    Returns:
        list: A list of embedding vectors, corresponding to the input texts.

    Raises:
        Exception: If the maximum number of retries is reached without success.
    """

    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        retry_delay = 1  # Initial delay in seconds
        for attempt in range(max_retries):
            try:
                result = model.get_embeddings(texts[i : i + BATCH_SIZE])
                embs = embs + [e.values for e in result]
                break
            except Exception as e:
                # Give up immediately on the last attempt instead of sleeping first
                if attempt == max_retries - 1:
                    print(e)
                    raise Exception("Maximum retry attempts reached") from e
                time.sleep(retry_delay)
                retry_delay *= 2  # Double the delay for the next attempt
                retry_delay += random.uniform(0, 1)  # Add jitter
    return embs


The following code gets embeddings for the question titles and adds them as a new `embedding` column to the DataFrame. This takes about 30 seconds, depending on the quota available in the project.

In [43]:
# get embeddings for the question titles and add them as "embedding" column
df = df.assign(embedding=get_embeddings_wrapper(list(df.title)))
df.head()

100%|██████████| 12/12 [00:23<00:00,  1.93s/it]


| | id | title | embedding |
|---|---|---|---|
| 0 | 73370728 | Firebase doesn't work on Android Studio Emulator | [0.037912581115961075, -0.0116707943379879, -0... |
| 1 | 73401682 | Appletv website is unresponsive through my ele... | [0.014407188631594181, -0.00597828533500433, -... |
| 2 | 73415813 | In tensorflow 1, when the loss function is def... | [0.013817641884088516, -0.03823564946651459, -... |
| 3 | 73186559 | view and control a windows 10 desktop with ras... | [0.03249296918511391, -0.007909863255918026, -... |
| 4 | 73198124 | Moongose: Insert Many Docs then get Id and Upd... | [-0.0031713491771370173, 0.004105436149984598,... |


## Look at the embedding similarities

Let's see how these embeddings are organized in the embedding space with their meanings by quickly calculating the similarities between them and sorting them.

As embeddings are vectors, you can calculate the similarity between two embeddings by using one of the popular metrics, such as the following:

![](https://storage.googleapis.com/github-repo/img/embeddings/textemb-vs-notebook/8.png)

Which metric should we use? It usually depends on how each model was trained. In the case of the `textembedding-gecko` model, we need to use the inner product (dot product).
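To make these metrics concrete, here is a minimal sketch in plain Python. The two 3-dimensional vectors are made up for illustration (real `textembedding-gecko` embeddings have 768 dimensions), and the helper names are hypothetical, not part of any SDK.

```python
import math


def dot(a, b):
    """Inner (dot) product: the sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, chosen only for illustration.
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]

print("dot product:       ", dot(u, v))
print("cosine similarity: ", cosine_similarity(u, v))
print("euclidean distance:", euclidean_distance(u, v))

# After scaling both vectors to unit length, the dot product and the
# cosine similarity give the same value.
u_unit = [x / norm(u) for x in u]
v_unit = [x / norm(v) for x in v]
print("dot of unit vectors:", dot(u_unit, v_unit))
```

The self-similarity of roughly 1.0 that this notebook reports for the key question suggests that these embeddings are close to unit length, in which case ranking by dot product and ranking by cosine similarity would agree closely.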

The following code picks the first question in the DataFrame as the key and uses the numpy `np.dot` function to calculate the similarities between it and all the questions (including itself).

In [44]:
import numpy as np

# pick the first key/question on the dataframe
key = 0
print(f"Key question: {df.title[key]}\n")

# calc dot product between the key and other questions
embs = np.array(df.embedding.to_list())
similarities = np.dot(embs[key], embs.T)

# print similarities for the first 5 questions
similarities[:5]

Key question: Firebase doesn't work on Android Studio Emulator



array([0.99999773, 0.64171932, 0.56128022, 0.54312613, 0.56273021])

Finally, sort the questions with the similarities and print the list.

In [45]:
# print the question
print(f"Key question: {df.title[key]}\n")

# sort and print the questions by similarities
sorted_questions = sorted(
    zip(df.title, similarities), key=lambda x: x[1], reverse=True
)[:20]
for i, (question, similarity) in enumerate(sorted_questions):
    print(f"{similarity:.4f} {question}")

Key question: Firebase doesn't work on Android Studio Emulator

1.0000 Firebase doesn't work on Android Studio Emulator
0.8044 Android Studio Emulator Internet Connection Problem For Only First Time
0.7954 Android studio: error occurred during initialization of VM
0.7886 After installing react-native-firebase/app it's Build will failed in react-native ios
0.7884 Flutter-Firebase: Unhandled Exception: [firebase_functions/internal] Response is not valid JSON object
0.7639 Nodejs Firebase cloud messaging delayed delivery on android
0.7601 FirebaseAuth.getInstance().getCurrentUser().getUid() always pointing to same ID
0.7542 Firebase user properties not summing up to number of active users
0.7463 Firebase App Check with Vue 3 Invalid app resource name
0.7442 Android Emulator is restored to last state when started from "Cold Boot Now"
0.7421 Firestore dependencies in Huawei developer console
0.7361 how to auth user using async and await in firebase
0.7349 Issue with Flutter retrieving data 

# Find embeddings fast with Vertex AI Vector Search

As we have explained above, you can find similar embeddings by calculating the distance or similarity between the embeddings.

But this isn't easy when you have millions or billions of embeddings. For example, if you have 1 million embeddings with 768 dimensions, a single brute-force query needs 1 million dot products of 768 multiply-add operations each, or 1 million x 768 operations in total. This would take some seconds, which is too slow.
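The cost of the brute-force approach can be sketched as follows. This is an illustrative example on a small synthetic dataset, not the method used by Vector Search; the function name `brute_force_top_k` and the data sizes are assumptions made for this sketch.

```python
import random


def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and return the k best (index, score) pairs."""
    scores = [
        (i, sum(q * x for q, x in zip(query, vec)))  # one dot product per stored vector
        for i, vec in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Small synthetic dataset; the sizes are illustrative only.
random.seed(0)
NUM_VECTORS, DIMENSIONS = 1_000, 16
vectors = [[random.gauss(0, 1) for _ in range(DIMENSIONS)] for _ in range(NUM_VECTORS)]
query = vectors[42]

top = brute_force_top_k(query, vectors, k=5)
print(top)

# The work per query grows linearly with both the number of vectors and the dimensions.
multiply_adds = NUM_VECTORS * DIMENSIONS
print(f"{multiply_adds:,} multiply-adds for one query on this toy dataset")
print(f"{1_000_000 * 768:,} multiply-adds for one query at 1 million x 768")
```

Because every stored vector is scored for every query, doubling the dataset doubles the query time, which is the scaling problem that approximate methods try to avoid.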

Researchers have therefore been studying a technique called [Approximate Nearest Neighbor (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) for faster search. ANN uses "vector quantization" to separate the space into multiple subspaces organized in a tree structure. This is similar to an index in a relational database that improves query performance, and it enables very fast and scalable search over billions of embeddings.
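The partitioning idea can be illustrated with a toy sketch. This is not ScaNN and not what Vector Search runs internally: it uses a single level of partitions instead of a tree, picks centroids at random instead of learning them, and all names (`build_partitions`, `ann_search`, `partitions_to_search`) are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def build_partitions(vectors, num_partitions, rng):
    """Pick random vectors as centroids and assign each vector to its best-matching centroid."""
    centroids = rng.sample(vectors, num_partitions)
    partitions = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        best = max(range(num_partitions), key=lambda c: dot(vec, centroids[c]))
        partitions[best].append(idx)
    return centroids, partitions


def ann_search(query, vectors, centroids, partitions, k, partitions_to_search):
    """Score only the vectors in the partitions whose centroids best match the query."""
    ranked = sorted(
        range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True
    )
    candidates = [i for c in ranked[:partitions_to_search] for i in partitions[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k], len(candidates)


rng = random.Random(0)
vectors = [unit([rng.gauss(0, 1) for _ in range(8)]) for _ in range(2_000)]
centroids, partitions = build_partitions(vectors, num_partitions=20, rng=rng)

query = vectors[7]
neighbors, scanned = ann_search(
    query, vectors, centroids, partitions, k=5, partitions_to_search=3
)
print(f"scanned {scanned} of {len(vectors)} vectors; top neighbors: {neighbors}")
```

The search scores only a fraction of the stored vectors, at the price of possibly missing a true neighbor that landed in a partition that was not searched. Raising `partitions_to_search` trades speed for recall.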

With the rise of LLMs, ANN is rapidly becoming popular under the name Vector Search.

![](https://storage.googleapis.com/gweb-cloudblog-publish/images/7._ANN.1143068821171228.max-2200x2200.png)

In 2020, Google Research published a new ANN algorithm called [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html). It is considered one of the best ANN algorithms in the industry, and it is also the most important foundation for search and recommendation in major Google services such as Google Search, YouTube and many others.


## What is Vertex AI Vector Search?

Google Cloud developers can take full advantage of Google's vector search technology with [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview) (previously called Matching Engine). With this fully managed service, developers can just add embeddings to its index and issue a search query with a key embedding to get blazingly fast vector search. In the case of the Stack Overflow demo, Vector Search can find relevant questions from 8 million embeddings in tens of milliseconds.

![](https://storage.googleapis.com/github-repo/img/embeddings/textemb-vs-notebook/9.png)

With Vector Search, you don't need to spend much time and money building your own vector search service from scratch or using open source tools if your goal is high scalability, availability and maintainability for production systems.

## Get Started with Vector Search

Once you have the embeddings, getting started with Vector Search is pretty easy. In this section, we will follow the steps below.

### Setting up Vector Search
- Save the embeddings in JSON files on Cloud Storage
- Build an Index
- Create an Index Endpoint
- Deploy the Index to the endpoint

### Use Vector Search

- Query with the endpoint

### Save the embeddings in a JSON file
To load the embeddings to Vector Search, we need to save them in JSON files with JSONL format. See more information in the docs at [Input data format and structure](https://cloud.google.com/vertex-ai/docs/matching-engine/match-eng-setup/format-structure#data-file-formats).

First, export the `id` and `embedding` columns from the DataFrame in JSONL format, and save it.

In [59]:
# save id and embedding as a json file
jsonl_string = df[["id", "embedding"]].to_json(orient="records", lines=True)
with open("questions.json", "w") as f:
    f.write(jsonl_string)

# show the first line of the json file
! head -n 1 questions.json

{"id":73370728,"embedding":[0.0379125811,-0.0116707943,-0.0058115269,-0.0351585224,0.0384169519,0.007862092,-0.0090285847,-0.0178354606,0.0793624967,0.0081094205,0.010868635,-0.0419697985,-0.01443417,-0.0484210327,0.0448307768,-0.0129040452,0.0452941656,-0.0545261949,0.0164001323,-0.041892074,0.0129835596,-0.0243700687,-0.0186334196,-0.0096583692,-0.0098479986,-0.0018138821,0.0065292129,-0.0438369773,-0.0348258391,0.0737737864,-0.0741051361,0.0987343937,-0.0685421154,-0.0068338611,0.0011251752,-0.0396082997,0.0024447916,0.0173771475,0.0134554924,-0.0073548499,0.0106734307,-0.0320124216,0.00924969,-0.0335697383,-0.0171290059,0.0212180242,-0.0062113069,0.0118331946,0.0282672048,-0.026230799,0.0511952974,0.0444744751,0.0379469991,-0.0119360555,-0.0171105172,-0.0136444354,0.0201309752,0.027998928,0.0537913702,-0.0348400846,-0.0074846041,0.0390514955,-0.0066811703,0.0736512318,-0.0505450554,-0.0689983368,-0.0101076625,-0.0164450295,0.0307318047,0.0143832294,0.0039858087,-0.0132910339,0.0673
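Before uploading, it can be useful to sanity-check the exported file. The sketch below is one possible check, assuming one JSON record per line with `id` and `embedding` keys as in the output above. The helper `check_jsonl` is hypothetical, and the demo runs on a tiny stand-in file with 4-dimensional vectors rather than on `questions.json`.

```python
import json
import os
import tempfile


def check_jsonl(path, expected_dimensions):
    """Return the number of records, raising ValueError on the first malformed line."""
    count = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, such as a trailing newline
            record = json.loads(line)
            if "id" not in record or "embedding" not in record:
                raise ValueError(f"line {line_number}: missing 'id' or 'embedding'")
            if len(record["embedding"]) != expected_dimensions:
                raise ValueError(
                    f"line {line_number}: expected {expected_dimensions} values, "
                    f"got {len(record['embedding'])}"
                )
            count += 1
    return count


# Demo on a tiny stand-in file (the tutorial's file uses 768 dimensions).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w") as f:
    f.write(json.dumps({"id": 1, "embedding": [0.1, 0.2, 0.3, 0.4]}) + "\n")
    f.write(json.dumps({"id": 2, "embedding": [0.5, 0.6, 0.7, 0.8]}) + "\n")

record_count = check_jsonl(sample_path, expected_dimensions=4)
print(f"{record_count} valid records")
```

For the file created in this notebook, the equivalent call would be `check_jsonl("questions.json", expected_dimensions=768)`.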

Then, create a new Cloud Storage bucket and copy the file to it.

In [47]:
BUCKET_URI = f"gs://{PROJECT_ID}-rag-session"
! gsutil mb -l $LOCATION -p {PROJECT_ID} {BUCKET_URI}
! gsutil cp questions.json {BUCKET_URI}

Copying file://questions.json [Content-Type=application/json]...
-
Operation completed over 1 objects/29.5 MiB.                                     


### Create an Index

Now you're ready to load the embeddings into Vector Search. Its APIs are available under the [aiplatform](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform) package of the SDK.

In [48]:
# init the aiplatform package
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

Create a [MatchingEngineIndex](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex) with its `create_tree_ah_index` function (Matching Engine is the previous name of Vector Search).

In [49]:
# create index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag_demo1_stackoverflow_index",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=20,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Creating MatchingEngineIndex
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Create MatchingEngineIndex backing LRO: projects/842907197256/locations/us-central1/indexes/4508793720300109824/operations/4970934159854796800
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:MatchingEngineIndex created. Resource name: projects/842907197256/locations/us-central1/indexes/4508793720300109824
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:To use this MatchingEngineIndex in another session:
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:index = aiplatform.MatchingEngineIndex('projects/842907197256/locations/us-central1/indexes/4508793720300109824')


Calling the `create_tree_ah_index` function starts building an Index. This takes a few minutes or less if the dataset is small, and about 50 minutes or more for larger datasets, depending on their size. You can check the status of the index creation on [the Vector Search Console > INDEXES tab](https://console.cloud.google.com/vertex-ai/matching-engine/indexes).

![](https://storage.googleapis.com/github-repo/img/embeddings/vs-quickstart/creating-index.png)

#### The parameters for creating index

- `contents_delta_uri`: The URI of Cloud Storage directory where you stored the embedding JSON files
- `dimensions`: Dimension size of each embedding. In this case, it is 768 as we are using the embeddings from the Text Embeddings API.
- `approximate_neighbors_count`: The number of similar items to retrieve in typical cases
- `distance_measure_type`: The metric used to measure distance/similarity between embeddings. In this case it's `DOT_PRODUCT_DISTANCE`

See [the document](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) for more details on creating Index and the parameters.

#### Batch Update or Streaming Update?
There are two types of index: an Index for *Batch Update* (used in this tutorial) and an Index for *Streaming Updates*. The Batch Update index is updated with a batch process, whereas the Streaming Update index can be updated in real time. The latter is better suited to use cases where you add or update embeddings in the index frequently and it is crucial to serve the latest embeddings, such as e-commerce product search.



### Create Index Endpoint and deploy the Index

To use the Index, you need to create an [Index Endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public). It works as a server instance accepting query requests for your Index.

In [20]:
# These steps take about 25 minutes

# # create IndexEndpoint
# my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
#     display_name="gcp_rag_session_endpoint",
#     public_endpoint_enabled=True)

# # deploy the Index to the Index Endpoint
# DEPLOYED_INDEX_ID = "rag_session_demo1_stackoverflow_index_pre_deployed" # @param {type:"string"}
# my_index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592/operations/2722512045890076672
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592


<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x7e607df57e20> 
resource name: projects/842907197256/locations/us-central1/indexEndpoints/1049255150293614592

In [50]:
# Get existing endpoint and index deployed
DEPLOYED_INDEX_ID = "rag_session_demo1_stackoverflow_index_pre_deployed" # @param {type:"string"}
my_index_endpoint_id = "1049255150293614592" # @param {type:"string"}
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(my_index_endpoint_id)



The first time you deploy an Index to an Index Endpoint, it takes around 25 minutes to automatically build and initialize the backend for it. After the first deployment, it finishes in seconds. To see the status of the index deployment, open [the Vector Search Console > INDEX ENDPOINTS tab](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints) and click the Index Endpoint.

<img src="https://storage.googleapis.com/github-repo/img/embeddings/vs-quickstart/deploying-index.png" width="70%">

### Run Query

Finally, you're ready to use Vector Search. The following code creates an embedding for a test question and finds similar questions with Vector Search.

In [51]:
def run_query(query: str):
    """Embed the query text and print the 20 most similar question titles."""
    # Get the embedding for the query text
    test_embeddings = get_embeddings_wrapper([query])
    print()

    # Get the closest vectors from the deployed index
    response = my_index_endpoint.find_neighbors(
        deployed_index_id=DEPLOYED_INDEX_ID,
        queries=test_embeddings,
        num_neighbors=20,
    )

    # Look up each neighbor's title in the DataFrame and print it with its score
    for neighbor in response[0]:
        question_id = np.int64(neighbor.id)
        similar = df.query("id == @question_id", engine="python")
        print(f"{neighbor.distance:.4f} {similar.title.values[0]}")

In [52]:
run_query("How to refresh cookies in JS?")

100%|██████████| 1/1 [00:00<00:00,  7.88it/s]



0.7963 How to save PHP session after browse closing? (without separate cookies)
0.7780 Why doesn't e.preventDefault() stop my page from Refreshing?
0.7553 How to restart css transitions with "touchstart" and "touchend"?
0.7531 How does HttpOnly cookie protect against XSS/Injection attack if they are passed automatically with every request?
0.7450 how to load a cookie file in a requests session
0.7448 Is there a relation between Cookies and reCaptcha?
0.7415 How cookie based authentication works in multiple instance web application?
0.7391 How can I reverse an infinite images carousel using JS?
0.7348 How to close an open window after the user has searched something in the open window(using JavaScript)?
0.7315 Accessing cookie from another sub-domain
0.7301 How to iterate through SVG nodes and add event to it (JS)?
0.7298 How to extend a list of HTML elements in JavaScript?
0.7244 What is Laravel's $request->session()->regenerateToken() method used for?
0.7232 How to avoid function rep

In [54]:
# Run a query that is unrelated to programming
run_query("why are chocolates delicious?")

100%|██████████| 1/1 [00:00<00:00,  8.09it/s]



0.5590 Why can converting numbers to characters change the numbers?
0.5557 Why the error information "unrecognized arguments" return?
0.5548 Why did celebrate request validation failed?
0.5499 Choosing a random element from an array with weights
0.5495 Values on x-Axes
0.5474 What is the purpose of subtracting from Math.random?
0.5471 Can those codes be simplified?
0.5469 Is there a relation between Cookies and reCaptcha?
0.5426 How to form exclusive pairs with conditions?
0.5409 Why would you declare a std::string using it's constructor?
0.5406 Break Colors Into Even Splits
0.5405 Why can't we have a safe ISA?
0.5405 This page has an error. You might just need to refresh it. Action failed:
0.5403 Why is optimizing inline functions easier than normal functions?
0.5389 How to resolve Java Certificate Issue?
0.5384 Why cons give a different value based on position of S-expression
0.5383 how should batch size be customised?
0.5364 Query in the topic "List"
0.5361 What is the point of sou

The `find_neighbors` function only takes milliseconds to fetch the similar items even when you have billions of items on the Index, thanks to the ScaNN algorithm. Vector Search also supports [autoscaling](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public#autoscaling) which can automatically resize the number of nodes based on the demands of your workloads.

# IMPORTANT: Cleaning Up

In case you are using your own Cloud project, not a temporary project on Qwiklab, please make sure to delete all the Indexes, Index Endpoints and Cloud Storage buckets after finishing this tutorial. Otherwise the remaining objects would **incur unexpected costs**.

If you used Workbench, you may also need to delete the Notebooks from [the console](https://console.cloud.google.com/vertex-ai/workbench).

In [56]:
# wait for a confirmation
input("Press Enter to delete Index Endpoint, Index and Cloud Storage bucket:")

# delete Index Endpoint
my_index_endpoint.undeploy_all()
my_index_endpoint.delete(force=True)

# delete Index
my_index.delete()

# delete Cloud Storage bucket
! gsutil rm -r {BUCKET_URI}

# Summary

## Grounding LLM outputs with Vertex AI Vector Search

As we have seen, by combining the Embeddings API and Vector Search, you can use the embeddings to "ground" LLM outputs to real business data with low latency.

For example, when a user asks a question, the Embeddings API can convert it to an embedding and issue a query to Vector Search to find similar embeddings in its index. Those embeddings represent the actual business data in the databases. As we are just retrieving the business data and not generating any artificial text, there is no risk of hallucinations in the result.

![](https://storage.googleapis.com/gweb-cloudblog-publish/original_images/10._grounding.png)

### The difference between the questions and answers

In this tutorial, we have used the Stack Overflow dataset, and there is a reason for that choice: the dataset has many pairs of **questions and answers**, so you can find answers to your question simply by finding questions similar to it.

In many business use cases, the semantics (meaning) of questions and answers are different. Also, there could be cases where you would want to add a variety of recommended or personalized items to the results, like product search on e-commerce sites.

In these cases, simple semantic search doesn't work well. It's more like a recommendation system problem where you may want to train a model (e.g. Two-Tower model) to learn the relationship between the question embedding space and the answer embedding space. Also, many production systems add a reranking phase after the semantic search to achieve higher search quality. Please see [Scaling deep retrieval with TensorFlow Recommenders and Vertex AI Matching Engine](https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture) to learn more.

### Hybrid of semantic + keyword search

Another typical challenge you will face in production systems is supporting keyword search combined with semantic search. For example, for e-commerce product search, you may want to let users find a product by entering its product name or model number. As the LLM doesn't memorize those product names or model numbers, semantic search can't handle those "usual" search functionalities.

[Vertex AI Search](https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-search-and-conversation-is-now-generally-available) is another product you may consider for those requirements. While Vector Search provides only a simple semantic search capability, Vertex AI Search provides an integrated search solution that combines semantic search, keyword search, reranking and filtering, available as an out-of-the-box tool.

### What about Retrieval Augmented Generation (RAG)?

In this tutorial, we have looked at the simple combination of LLM embeddings and vector search. From this starting point, you may also extend the design to [Retrieval Augmented Generation (RAG)](https://www.google.com/search?q=Retrieval+Augmented+Generation+(RAG)&oq=Retrieval+Augmented+Generation+(RAG)).

RAG is a popular architecture pattern for implementing grounding with an LLM behind a text chat UI. The idea is to use the LLM text chat UI as a frontend for document retrieval with vector search and for summarization of the results.

![](https://storage.googleapis.com/gweb-cloudblog-publish/images/Figure-7-Ask_Your_Documents_Flow.max-529x434.png)
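The retrieval-to-generation handoff in this pattern can be sketched as a prompt-assembly step. The function `build_grounded_prompt`, its `max_passages` parameter and the prompt wording are all assumptions made for this sketch. The retrieval call and the LLM call themselves are left out, and the passages are two titles taken from the query output earlier in this notebook.

```python
def build_grounded_prompt(question, retrieved_passages, max_passages=3):
    """Assemble a prompt that asks the LLM to answer only from the retrieved passages."""
    context = "\n".join(
        f"[{i}] {passage}"
        for i, passage in enumerate(retrieved_passages[:max_passages], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# In a real system these passages would come from the vector search step.
passages = [
    "Accessing cookie from another sub-domain",
    "how to load a cookie file in a requests session",
]
prompt = build_grounded_prompt("How to refresh cookies in JS?", passages)
print(prompt)
```

In a full RAG system the passages would typically be document bodies or answers rather than titles, and the assembled prompt would be sent to a text generation model.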

Each of the two solutions has its pros and cons.

| | Emb + vector search | RAG |
|---|---|---|
| Design | simple | complex |
| UI | Text search UI | Text chat UI |
| Summarization of result | No | Yes |
| Multi-turn (Context aware) | No | Yes |
| Latency | milliseconds | seconds |
| Cost | lower | higher |
| Hallucinations | No risk | Some risk |

The Embedding + vector search pattern we have looked at in this tutorial provides simple, fast and low-cost semantic search functionality with LLM intelligence. RAG adds a context-aware text chat experience and result summarization to it. While RAG provides the more "Gen AI-ish" experience, it also adds a risk of hallucination, plus higher cost and latency for the text generation.

To learn more about how to build a RAG solution, you may look at [Building Generative AI applications made easy with Vertex AI PaLM API and LangChain](https://cloud.google.com/blog/products/ai-machine-learning/generative-ai-applications-with-vertex-ai-palm-2-models-and-langchain).

## Resources

To learn more, please check out the following resources:

### Documentation

[Vertex AI Embeddings for Text API documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings)

[Vector Search documentation](https://cloud.google.com/vertex-ai/docs/matching-engine/overview)

### Vector Search blog posts

[Vertex Matching Engine: Blazing fast and massively scalable nearest neighbor search](https://cloud.google.com/blog/products/ai-machine-learning/vertex-matching-engine-blazing-fast-and-massively-scalable-nearest-neighbor-search)

[Find anything blazingly fast with Google's vector search technology](https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology)

[Enabling real-time AI with Streaming Ingestion in Vertex AI](https://cloud.google.com/blog/products/ai-machine-learning/real-time-ai-with-google-cloud-vertex-ai)

[Mercari leverages Google's vector search technology to create a new marketplace](https://cloud.google.com/blog/topics/developers-practitioners/mercari-leverages-googles-vector-search-technology-create-new-marketplace)

[Recommending news articles using Vertex AI Matching Engine](https://cloud.google.com/blog/products/ai-machine-learning/recommending-articles-using-vertex-ai-matching-engine)

[What is Multimodal Search: "LLMs with vision" change businesses](https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search)

# Utilities

Sometimes it takes tens of minutes to create or deploy Indexes, and you may lose the connection to the Colab runtime. In that case, instead of creating or deploying a new Index again, you can check [the Vector Search Console](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints) and get the existing ones to continue.

## Get an existing Index

To get an Index object that already exists, replace the following `[your-index-id]` with the index ID and run the cell. You can check the ID on [the Vector Search Console > INDEXES tab](https://console.cloud.google.com/vertex-ai/matching-engine/indexes).

In [None]:
my_index_id = "[your-index-id]"  # @param {type:"string"}
my_index = aiplatform.MatchingEngineIndex(my_index_id)

## Get an existing Index Endpoint

To get an Index Endpoint object that already exists, replace the following `[your-index-endpoint-id]` with the Index Endpoint ID and run the cell. You can check the ID on [the Vector Search Console > INDEX ENDPOINTS tab](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints).

In [None]:
my_index_endpoint_id = "[your-index-endpoint-id]"  # @param {type:"string"}
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(my_index_endpoint_id)