# Using Vertex AI Vector Search and Vertex AI Embeddings for Text for StackOverflow Questions

## Overview
This lab demonstrates how to provide semantic search over a large dataset of questions from StackOverflow. Because of the large size of the dataset, you will query the questions using BigQuery. Then, you will create text embeddings from the questions using the Vertex AI Text-Embeddings API service and store them in a Cloud Storage bucket. Once all embeddings are created, you will create a Vertex AI Vector Search index for those embeddings, so that you can search them afterwards.

The Vertex AI Text-Embeddings API enhances the process of generating text embeddings. These text embeddings, which are numerical representations of text, play a pivotal role in many tasks involving the identification of similar items, like Google searches, online shopping recommendations, and personalized music suggestions.

The Vertex AI Vector Search service is a high-scale, low-latency solution for finding similar vectors in a large corpus. Vector Search is a fully managed offering, which further reduces operational overhead. It is built upon Approximate Nearest Neighbor (ANN) technology developed by Google Research.



In [1]:
! pip3 install --upgrade google-cloud-aiplatform \
                        google-cloud-storage \
                        'google-cloud-bigquery[pandas]'

Collecting google-cloud-storage
  Downloading google_cloud_storage-2.18.2-py2.py3-none-any.whl.metadata (9.1 kB)
Downloading google_cloud_storage-2.18.2-py2.py3-none-any.whl (130 kB)
Installing collected packages: google-cloud-storage
  Attempting uninstall: google-cloud-storage
    Found existing installation: google-cloud-storage 2.14.0
    Uninstalling google-cloud-storage-2.14.0:
      Successfully uninstalled google-cloud-storage-2.14.0
Successfully installed google-cloud-storage-2.18.2


In [1]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
PROJECT = !gcloud config get-value project
PROJECT_ID = PROJECT[0]
REGION = "us-west1"

In [2]:
import vertexai
vertexai.init(project = PROJECT_ID,
              location = REGION)

## Prepare the data in BigQuery
The dataset used for this lab is the StackOverflow dataset. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.

Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

The BigQuery table is too large to fit into memory, so you need to write a generator called query_bigquery_chunks to yield chunks of the dataframe for processing. Additionally, an extra column title_with_body is added, which is a concatenation of the question title and body that will be used for creating embeddings.

Import the libraries and initialize the BigQuery client.

In [3]:
import math
from typing import Any, Generator

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)

In [4]:
# Define the BigQuery query for the remote dataset.
QUERY_TEMPLATE = """
        SELECT distinct q.id, q.title, q.body
        FROM (SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions` where Score>0 ORDER BY View_Count desc) AS q
        LIMIT {limit} OFFSET {offset};
        """

In [5]:
# Create a function to access the BigQuery data in chunks.

def query_bigquery_chunks(
    max_rows: int, rows_per_chunk: int, start_chunk: int = 0
) -> Generator[pd.DataFrame, Any, None]:
    # start_chunk is a chunk index, so convert it to a row offset.
    for offset in range(start_chunk * rows_per_chunk, max_rows, rows_per_chunk):
        query = QUERY_TEMPLATE.format(limit=rows_per_chunk, offset=offset)
        query_job = client.query(query)
        rows = query_job.result()
        df = rows.to_dataframe()
        df["title_with_body"] = df.title + "\n" + df.body
        yield df
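
To see which `LIMIT`/`OFFSET` pairs the generator above sends to BigQuery without running any queries, the following minimal sketch reproduces only its paging arithmetic. The helper name `chunk_offsets` is hypothetical and not part of the lab, and it assumes paging starts at row 0.

```python
from typing import List, Tuple


def chunk_offsets(max_rows: int, rows_per_chunk: int) -> List[Tuple[int, int]]:
    """Return the (limit, offset) pair used for each paged query."""
    return [
        (rows_per_chunk, offset) for offset in range(0, max_rows, rows_per_chunk)
    ]


# Five queries of 1000 rows each cover the 5000 rows used later in the lab.
print(chunk_offsets(max_rows=5000, rows_per_chunk=1000))
```

Note that the limit is always `rows_per_chunk`, so when `max_rows` is not a multiple of the chunk size, the last query can return rows beyond `max_rows`.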

In [6]:
# Get a dataframe of 1000 rows for demonstration purposes.
df = next(query_bigquery_chunks(max_rows=1000, rows_per_chunk=1000))

# Examine the data
df.head()

| | id | title | body | title_with_body |
|---|---|---|---|---|
| 0 | 16238973 | Why does the order of both a return and a thro... | `<pre><code>private static ext.clsPassageiro Co...` | Why does the order of both a return and a thro... |
| 1 | 16239147 | iOS: Do not back up attribute? | `<p>Our app has recently been rejected for viol...` | `iOS: Do not back up attribute?\n<p>Our app has...` |
| 2 | 16196338 | JSON.stringify doesn't work with normal Javasc... | `<p>I must be missing something here, but the f...` | JSON.stringify doesn't work with normal Javasc... |
| 3 | 8356058 | OpenGL window isn't opening | `<p>I have code from the OpenGLBook (openglbook...` | `OpenGL window isn't opening\n<p>I have code fr...` |
| 4 | 7711137 | how to remove old applet from browser's cache ... | `<p>Is there a way to delete older version of a...` | how to remove old applet from browser's cache ... |


## Create text embeddings from BigQuery data
Load the Vertex AI Embeddings for Text model.

In [7]:
from typing import List, Optional
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-004")

In [8]:
# Define an embedding method that uses the model.
def encode_texts_to_embeddings(sentences: List[str]) -> List[Optional[List[float]]]:
    try:
        embeddings = model.get_embeddings(sentences)
        return [embedding.values for embedding in embeddings]
    except Exception:
        return [None for _ in range(len(sentences))]

According to the documentation, each request can handle up to 5 text instances, so you need to split the BigQuery question results into batches of 5 before sending them to the embeddings API.

Create a generate_batches function that splits the results into batches of 5.

In [9]:
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Generator, List, Tuple

import numpy as np
from tqdm.auto import tqdm


# Generator function to yield batches of sentences
def generate_batches(
    sentences: List[str], batch_size: int
) -> Generator[List[str], None, None]:
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]

Encapsulate the process of generating batches and calling the embeddings API in a method called encode_text_to_embedding_batched. This method also handles rate-limiting using time.sleep. For production use cases, you would want a more sophisticated rate-limiting mechanism that takes retries into account.

In [10]:
def encode_text_to_embedding_batched(
    sentences: List[str], api_calls_per_second: int = 10, batch_size: int = 5
) -> Tuple[List[bool], np.ndarray]:

    embeddings_list: List[List[float]] = []

    # Prepare the batches using a generator
    batches = generate_batches(sentences, batch_size)

    seconds_per_job = 1 / api_calls_per_second

    with ThreadPoolExecutor() as executor:
        futures = []
        for batch in tqdm(
            batches, total=math.ceil(len(sentences) / batch_size), position=0
        ):
            futures.append(
                executor.submit(functools.partial(encode_texts_to_embeddings), batch)
            )
            time.sleep(seconds_per_job)

        for future in futures:
            embeddings_list.extend(future.result())

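    is_successful = [embedding is not None for embedding in embeddings_list]
    successful_embeddings = [
        embedding for embedding in embeddings_list if embedding is not None
    ]
    # np.stack raises on an empty list, so return an empty array if every request failed.
    # No np.squeeze here: it would flatten a single successful embedding to 1-D.
    embeddings_list_successful = (
        np.stack(successful_embeddings) if successful_embeddings else np.empty((0, 0))
    )
    return is_successful, embeddings_list_successful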
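
The fixed `time.sleep` pacing above does not retry a failed request. As one possible direction for the more sophisticated mechanism mentioned earlier, here is a minimal sketch of retrying with exponential backoff and jitter. The helper `call_with_backoff` and its parameters are hypothetical, not part of the Vertex AI SDK, and the delays are illustrative rather than tuned to any quota.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn, retrying with exponential backoff; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                break
            # Double the wait after each failure, capped at max_delay, and add
            # a little random jitter so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2**attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    return None
```

In this lab, such a helper could wrap a call like `lambda: model.get_embeddings(batch)`. In production you would likely catch only the transient or quota-related error types rather than every `Exception`.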

Test the encoding function by encoding a subset of data and see if the embeddings and distance metrics make sense.

In [11]:
# Encode a subset of questions for validation
questions = df.title.tolist()[:500]
is_successful, question_embeddings = encode_text_to_embedding_batched(
    sentences=questions
)

# Filter for successfully embedded sentences
questions = np.array(questions)[is_successful]

  0%|          | 0/100 [00:00<?, ?it/s]

In [12]:
# Save the dimension size for later usage when creating the Vertex AI Vector Search index.
DIMENSIONS = len(question_embeddings[0])

print(DIMENSIONS)

768


Sort questions in order of similarity. According to the embedding documentation, the similarity of embeddings is calculated using the dot-product, with np.dot. Once you have the similarity score, sort the results and print them for inspection. 1 means very similar, 0 means very different.

In [13]:
import random

# Pick a random question among those that were embedded successfully.
question_index = random.randint(0, len(questions) - 1)

print(f"Query question = {questions[question_index]}")

# Get similarity scores for each embedding by using dot-product.
scores = np.dot(question_embeddings[question_index], question_embeddings.T)

# Print top 20 matches
for index, (question, score) in enumerate(
    sorted(zip(questions, scores), key=lambda x: x[1], reverse=True)[:20]
):
    print(f"\t{index}: {question}: {score}")

Query question = How to define DTD without strict element order?
	0: How to define DTD without strict element order?: 0.9999987167880122
	1: How to merge multiple items from an Xml field in SQL Server: 0.45985864810829613
	2: "The 'http://www.w3.org/XML/1998/namespace:lang' attribute is not declared.": 0.43271876314979474
	3: Lxml cssselect wildcard: 0.4297562877172967
	4: XSL changing spaces in a link to %20: 0.425946026449363
	5: Parameterized method with Ordering?: 0.424112698947372
	6: Regex for matching a previous group in the pattern?: 0.41796161033082024
	7: <span> vs <figure> vs <area>: 0.4143188733723252
	8: How to "partially" fulfill a sales order?: 0.41260790804391795
	9: Serialize with java 7, deserialize with java 6?: 0.4074802316452574
	10: HTML5 vs HTML4 - h1 tag rendered with extra space - how to remove?: 0.3834107020111859
	11: Any algorithm/approach to 'simulate' a cyclic graph over time?: 0.3829617080651104
	12: Is there a way to use ctrl-d as forward delete in Scala
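
The self-match score of roughly 1.0 in the output above suggests that the embeddings are unit length, in which case the dot product and cosine similarity give the same value. The following toy sketch illustrates this with made-up 2-D vectors rather than real embeddings. The helper names are hypothetical.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, which divides out both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit-length vectors the two measures agree (up to floating-point error).
print(dot(a, b), cosine(a, b))
# A vector compared with itself scores approximately 1.
print(dot(a, a))
```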

Save the embeddings in JSONL format, in which each embedding dictionary is written as an individual JSON object on its own line.

In [14]:
import tempfile
from pathlib import Path

# Create a temporary directory to write the embedding files to
embeddings_file_path = Path(tempfile.mkdtemp())

print(f"Embeddings directory: {embeddings_file_path}")

Embeddings directory: /var/tmp/tmpr745daze


Write embeddings in batches to prevent out-of-memory errors. Notice that only 5000 questions are used so that embedding creation and indexing are faster. The dataset contains more than 50,000 questions. This step will take around 5 minutes.

In [17]:
import gc
import json

BQ_NUM_ROWS = 5000
BQ_CHUNK_SIZE = 1000
BQ_NUM_CHUNKS = math.ceil(BQ_NUM_ROWS / BQ_CHUNK_SIZE)

START_CHUNK = 0

# Create a rate limit of 300 requests per minute. Adjust this depending on your quota.
API_CALLS_PER_SECOND = 300 / 60
# According to the docs, each request can process 5 instances per request
ITEMS_PER_REQUEST = 5

# Loop through each generated dataframe, convert it to embeddings, and write them to a JSONL file
for i, df in tqdm(
    enumerate(
        query_bigquery_chunks(
            max_rows=BQ_NUM_ROWS, rows_per_chunk=BQ_CHUNK_SIZE, start_chunk=START_CHUNK
        )
    ),
    total=BQ_NUM_CHUNKS - START_CHUNK,
    position=-1,
    desc="Chunk of rows from BigQuery",
):
    # Create a unique output file for each chunk
    chunk_path = embeddings_file_path.joinpath(
        f"{embeddings_file_path.stem}_{i+START_CHUNK}.json"
    )
    with open(chunk_path, "a") as f:
        id_chunk = df.id

        # Convert batch to embeddings
        is_successful, question_chunk_embeddings = encode_text_to_embedding_batched(
            sentences=df.title_with_body.to_list(),
            api_calls_per_second=API_CALLS_PER_SECOND,
            batch_size=ITEMS_PER_REQUEST,
        )
        
        # Debugging: Check if any embeddings are returned
        print(f"Embeddings generated: {len(question_chunk_embeddings)}")
        if len(question_chunk_embeddings) == 0:
            print("No embeddings were generated, skipping this chunk.")
            continue

        # Check if successful embeddings exist
        if not any(is_successful):
            print(f"No successful embeddings in chunk {i}")
            continue

        # Append to file
        embeddings_formatted = [
            json.dumps(
                {
                    "id": str(id),
                    "embedding": [str(value) for value in embedding],
                }
            )
            + "\n"
            for id, embedding in zip(id_chunk[is_successful], question_chunk_embeddings)
        ]
        f.writelines(embeddings_formatted)

        # Delete the DataFrame and any other large data structures
        del df
        gc.collect()

Chunk of rows from BigQuery:   0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

Embeddings generated: 175


  0%|          | 0/200 [00:00<?, ?it/s]

Embeddings generated: 65


  0%|          | 0/200 [00:00<?, ?it/s]

Embeddings generated: 5


  0%|          | 0/200 [00:00<?, ?it/s]

Embeddings generated: 65


  0%|          | 0/200 [00:00<?, ?it/s]

Embeddings generated: 55
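
The counts printed above show that many requests did not return embeddings, so it can be worth checking what was actually written before uploading. The following is a minimal sketch that counts the records in each file and verifies their shape. The function name `summarize_embedding_files` is hypothetical, and the check assumes the `id`/`embedding` record layout written by the loop above.

```python
import json
from pathlib import Path
from typing import Dict


def summarize_embedding_files(directory: Path, expected_dim: int) -> Dict[str, int]:
    """Count the JSONL records in each *.json file and validate their shape."""
    counts: Dict[str, int] = {}
    for path in sorted(directory.glob("*.json")):
        n = 0
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue  # ignore blank lines
                record = json.loads(line)
                if "id" not in record or "embedding" not in record:
                    raise ValueError(f"Missing keys in {path.name}")
                if len(record["embedding"]) != expected_dim:
                    raise ValueError(f"Unexpected dimension in {path.name}")
                n += 1
        counts[path.name] = n
    return counts
```

In this lab, it could be called as `summarize_embedding_files(embeddings_file_path, DIMENSIONS)`. The sum of the returned counts is the number of vectors you should expect to see in the index later.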


## Upload embeddings to Cloud Storage
Upload the text-embeddings to Cloud Storage, so that Vertex AI Vector Search can access them later.

Define a bucket where you will store your embeddings.

In [18]:
BUCKET_URI = f"gs://{PROJECT_ID}-unique"

In [19]:
# Create your Cloud Storage bucket.
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://qwiklabs-gcp-01-b5af813e00ba-unique/...


In [20]:
# Upload the embedding files to the Cloud Storage bucket.
remote_folder = f"{BUCKET_URI}/{embeddings_file_path.stem}/"
! gsutil -m cp -r {embeddings_file_path}/* {remote_folder}

Copying file:///var/tmp/tmpr745daze/tmpr745daze_0.json [Content-Type=application/json]...
Copying file:///var/tmp/tmpr745daze/tmpr745daze_1.json [Content-Type=application/json]...
Copying file:///var/tmp/tmpr745daze/tmpr745daze_2.json [Content-Type=application/json]...
Copying file:///var/tmp/tmpr745daze/tmpr745daze_3.json [Content-Type=application/json]...
Copying file:///var/tmp/tmpr745daze/tmpr745daze_4.json [Content-Type=application/json]...
- [5/5 files][ 17.5 MiB/ 17.5 MiB] 100% Done                                    
Operation completed over 5 objects/17.5 MiB.                                     


## Create an Index in Vertex AI Vector Search for your embeddings
Set up your index name and description.

In [21]:
DISPLAY_NAME = "stack_overflow"
DESCRIPTION = "question titles and bodies from stackoverflow"

Create the index. Notice that the index reads the embeddings from the Cloud Storage bucket. The indexing process can take 45 to 60 minutes. Wait for completion, and then proceed. You can open a different Google Cloud Console page, navigate to Vertex AI Vector Search, and see how the index is being created.

In [22]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

DIMENSIONS = 768

tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=remote_folder,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=80,
    description=DESCRIPTION,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/866155630714/locations/us-west1/indexes/7591600024211947520/operations/7989810700199395328
MatchingEngineIndex created. Resource name: projects/866155630714/locations/us-west1/indexes/7591600024211947520
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/866155630714/locations/us-west1/indexes/7591600024211947520')


In [23]:
# Reference the index name to make sure it got created successfully.
INDEX_RESOURCE_NAME = tree_ah_index.resource_name
INDEX_RESOURCE_NAME

'projects/866155630714/locations/us-west1/indexes/7591600024211947520'

In [24]:
# Using the resource name, you can retrieve an existing MatchingEngineIndex.
tree_ah_index = aiplatform.MatchingEngineIndex(index_name=INDEX_RESOURCE_NAME)

In [25]:
# Create an IndexEndpoint so that it can be accessed via an API.
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DISPLAY_NAME,
    description=DESCRIPTION,
    public_endpoint_enabled=True,
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664/operations/5859608076453150720
MatchingEngineIndexEndpoint created. Resource name: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664')


In [26]:
# Deploy your index to the created endpoint. This can take up to 15 minutes.
DEPLOYED_INDEX_ID = "deployed_index_id_unique"


my_index_endpoint = my_index_endpoint.deploy_index(
    index=tree_ah_index, deployed_index_id=DEPLOYED_INDEX_ID
)

my_index_endpoint.deployed_indexes

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664/operations/536353316901224448
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664


[id: "deployed_index_id_unique"
index: "projects/866155630714/locations/us-west1/indexes/7591600024211947520"
create_time {
  seconds: 1728491016
  nanos: 713520000
}
index_sync_time {
  seconds: 1728492596
  nanos: 727444000
}
automatic_resources {
  min_replica_count: 2
  max_replica_count: 2
}
deployment_group: "default"
]

Verify that the number of indexed vectors matches the number of embeddings you requested. Each IndexEndpoint can have multiple indexes deployed to it. For each deployed index, you can retrieve the number of vectors from the index's _gca_resource.index_stats.vectors_count. The numbers may not match exactly because some embedding requests can fail, for example due to rate limiting.

In [27]:
number_of_vectors = sum(
    aiplatform.MatchingEngineIndex(
        deployed_index.index
    )._gca_resource.index_stats.vectors_count
    for deployed_index in my_index_endpoint.deployed_indexes
)

print(f"Expected: {BQ_NUM_ROWS}, Actual: {number_of_vectors}")

Expected: 5000, Actual: 785


## Create online queries
After you build your indexes, you may query against the deployed index to find nearest neighbors.

Note: For the DOT_PRODUCT_DISTANCE distance type, the "distance" property returned with each MatchNeighbor is actually the similarity score, so higher values mean closer matches.

Create an embedding for a test question.


In [28]:
test_embeddings = encode_texts_to_embeddings(sentences=["Install GPU for Tensorflow"])

In [29]:
# Test the query to retrieve the similar embeddings.
NUM_NEIGHBOURS = 10

response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=test_embeddings,
    num_neighbors=NUM_NEIGHBOURS,
)

response

[[MatchNeighbor(id='8356058', distance=0.46752703189849854, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='69280365', distance=0.46094128489494324, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='28882193', distance=0.45510563254356384, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='69329715', distance=0.45241573452949524, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='9058594', distance=0.4408796727657318, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], n

In [30]:
# Verify that the retrieved results are relevant by checking the StackOverflow links.

for neighbor in response[0]:
    print(f"https://stackoverflow.com/questions/{neighbor.id}")

https://stackoverflow.com/questions/8356058
https://stackoverflow.com/questions/69280365
https://stackoverflow.com/questions/28882193
https://stackoverflow.com/questions/69329715
https://stackoverflow.com/questions/9058594
https://stackoverflow.com/questions/43272758
https://stackoverflow.com/questions/46994842
https://stackoverflow.com/questions/56555066
https://stackoverflow.com/questions/26474950
https://stackoverflow.com/questions/63850378
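
Bare links require opening each page to judge relevance. As a minimal sketch, the neighbors could instead be paired with question titles you already hold locally. The function `describe_neighbors`, the `Neighbor` stand-in type, and the sample titles are hypothetical; the sketch assumes only that each neighbor exposes `id` and `distance`, as the `MatchNeighbor` objects in the response above do.

```python
from collections import namedtuple
from typing import Dict, Iterable, List, Tuple

# Stand-in for the SDK's MatchNeighbor, used only to make this sketch runnable.
Neighbor = namedtuple("Neighbor", ["id", "distance"])


def describe_neighbors(
    neighbors: Iterable, id_to_title: Dict[str, str]
) -> List[Tuple[str, float, str]]:
    """Pair each neighbor's ID and score with a title, best match first."""
    rows = [
        (n.id, n.distance, id_to_title.get(n.id, "<title not loaded>"))
        for n in neighbors
    ]
    # With DOT_PRODUCT_DISTANCE, a larger "distance" means more similar.
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = [Neighbor("69280365", 0.4609), Neighbor("8356058", 0.4675)]
titles = {"8356058": "OpenGL window isn't opening"}
for question_id, score, title in describe_neighbors(sample, titles):
    print(f"{score:.4f}  {question_id}  {title}")
```

In the lab, the lookup could be built from whichever DataFrame chunks are still in memory, with the IDs converted to strings to match the index.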


## Clean up the Google Cloud environment
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the lab. You can also manually delete resources that you created by running the following code.

In [31]:
import os

delete_bucket = False

# Force undeployment of indexes and delete endpoint
my_index_endpoint.delete(force=True)

# Delete indexes
tree_ah_index.delete()

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -rf {BUCKET_URI}

Undeploying MatchingEngineIndexEndpoint index_endpoint: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
Undeploy MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664/operations/394489928639053824
MatchingEngineIndexEndpoint index_endpoint undeployed. Resource name: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
Deleting MatchingEngineIndexEndpoint : projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
MatchingEngineIndexEndpoint deleted. . Resource name: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
Deleting MatchingEngineIndexEndpoint resource: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664
Delete MatchingEngineIndexEndpoint backing LRO: projects/866155630714/locations/us-west1/indexEndpoints/6677404484227825664/operations/7861458110819336192
MatchingEngineIndexEndpoint 

## Congratulations
You have now completed the lab! In this lab, you created a semantic search system using Vertex AI Text-Embeddings API and Vertex AI Vector Search. You did that with a big dataset and learned how to work around API limits and quotas, the same way you would have to in a production environment.

### Next steps
Check out the [Generative AI on Vertex AI documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).

Learn more about Generative AI on the [Google Cloud Tech YouTube channel](https://www.youtube.com/@googlecloudtech/).
