# RAG piece-by-piece on Vertex AI
In this notebook, we will build a RAG implementation piece by piece on Vertex AI.


## Prerequisites
**Note:** This notebook and repository are supporting artifacts for the "Google Machine Learning and Generative AI for Solutions Architects" book. The book describes the concepts associated with this notebook, and for some of the activities, the book contains instructions that should be performed before running the steps in the notebooks. Each top-level folder in this repo is associated with a chapter in the book. Please ensure that you have read the relevant chapter sections before performing the activities in this notebook.

**There are also important generic prerequisite steps outlined [here](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Prerequisite-steps/Prerequisites.ipynb).**


**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)
* [Document AI Pricing](https://cloud.google.com/document-ai/pricing)
* [Google Cloud Storage Pricing](https://cloud.google.com/storage/pricing)


## Background / overview

There are many different ways to build RAG implementations on Vertex AI. In Chapter 17, we implemented a RAG solution using Vertex AI Search. Remember what we've mentioned numerous times throughout this book: when there are multiple potential ways of implementing a solution, a solution architecture best practice is to use the most managed solution that meets your technical and business needs. Those needs may include cost, but cost should always be considered in terms of "total cost of ownership" (TCO), which includes the toil of building and managing solutions ourselves.

For that reason, when implementing a RAG solution in Vertex AI, I recommend starting with Vertex AI Search, as we did in Chapter 17. However, if you want to build your own RAG solution from scratch, in which you perform document chunking and embedding explicitly yourself, you can use services such as Google Cloud Document AI and Vertex AI Vector Search (among others).

This notebook shows an example of using Google Cloud Document AI and Vertex AI Vector Search, as well as Google models such as Gemini and textembedding-gecko.

## Solution Architecture

The solution architecture is shown in the following diagram:

<img src="images/RAG-Vertex.png">

The steps in the diagram are described as follows:

1. Our documents, which are stored in Google Cloud Storage (GCS), are sent to Google Cloud Document AI for chunking. As the name suggests, the chunking process breaks the documents into “chunks,” which are smaller sections of the document. This is required in order to create standard-sized chunks that serve as inputs to the embedding process. The size of the chunks is configurable in Document AI, and further details on this process are described in the Jupyter Notebook. 
2. The chunks are then sent to the Google Cloud “textembedding-gecko” embedding model to create embeddings for the chunks. The resulting embeddings are stored in GCS, alongside their respective chunks (this step is omitted from the diagram).
3. We create a Vertex AI Vector Search index, and the embeddings are ingested from GCS to the Vertex AI Vector Search index (the GCS intermediary step is omitted from the diagram).
4. Next, the application/user asks a question that relates to the contents of our documents. The question is sent as a query to textembedding-gecko to be embedded/vectorized.
5. The vectorized query is then used as an input in a request to Vertex AI Vector Search, which searches our index to find similar embeddings. Remember that the embeddings represent an element of semantic meaning, so similar embeddings have similar meanings. This is how we can perform a semantic search to find embeddings that are similar to our query.
6. Next, we take the embeddings returned from our Vertex AI Vector Search query and find the chunks in GCS that relate to those embeddings (remember that Step 2 in our solution created a stored association of chunks and embeddings).
7. Now, it’s finally time to send a prompt to Gemini. The retrieved document chunks from Step 6 serve as context for the prompt. This helps Gemini to respond to our prompt based on the relevant contents from our documents and not just from its pre-trained knowledge.
8. Gemini responds to the prompt.

Note that between some of the steps depicted in the diagram, Google Cloud Storage is used to store the inputs and outputs of each step, but those intermediary processes are omitted to make the diagram more readable. Also, when we implement this solution here in our Jupyter Notebook, the notebook is the “application/user” that coordinates each of the steps in the overall process.

Also, in this case, we are using documents stored in GCS as our source of truth, but we could also use other data, such as data stored in BigQuery.

## Document/data citation
The citation for the document used in this exercise is as follows:

*Hila Zelicha, Jieping Yang, Susanne M Henning, Jianjun Huang, Ru-Po Lee, Gail Thames, Edward H Livingston, David Heber, and Zhaoping Li, 2024. Effect of cinnamon spice on continuously monitored glycemic response in adults with prediabetes: a 4-week randomized controlled crossover trial. DOI:https://doi.org/10.1016/j.ajcnut.2024.01.008*

# Implementation steps

## Install packages

In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform google-cloud-storage google-cloud-documentai vertexai

*The pip installation commands sometimes report various errors. Those errors usually do not affect the activities in this notebook, and you can ignore them.*


## Restart the kernel

The code in the next cell will restart the kernel, which is sometimes required after installing/upgrading packages.

**When prompted, click OK to restart the kernel.**

The sleep command simply prevents further cells from executing before the kernel restarts.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)


In [None]:
import time
time.sleep(10)

# (Wait for kernel to restart before proceeding...)

## Set Google Cloud resource variables

The following code will set variables specific to your Google Cloud resources that will be used in this notebook, such as the Project ID, Region, and GCS Bucket.

**Note: This notebook is intended to execute in a Vertex AI Workbench Notebook, in which case the API calls issued in this notebook are authenticated according to the permissions (e.g., service account) assigned to the Vertex AI Workbench Notebook.**

We will use the `gcloud` command to get the Project ID details from the local Google Cloud project, and assign the results to the PROJECT_ID variable. If, for any reason, PROJECT_ID is not set, you can set it manually or change it, if preferred.

We also use a default bucket name for most of the examples and activities in this book, which has the format: `{PROJECT_ID}-aiml-sa-bucket`. You can change the bucket name if preferred.

Also, we're defaulting to the **us-central1** region, but you can optionally replace this with your [preferred region](https://cloud.google.com/about/locations).

In [None]:
PROJECT_ID_DETAILS = !gcloud config get-value project
PROJECT_ID = PROJECT_ID_DETAILS[0]  # The project ID is item 0 in the list returned by the gcloud command
PROJECT_NUMBER_DETAILS = !gcloud projects describe $PROJECT_ID --format="value(projectNumber)" 
PROJECT_NUMBER = PROJECT_NUMBER_DETAILS[0]  # The project number is item 0 in the list returned by the gcloud command 
BUCKET=f"{PROJECT_ID}-aiml-sa-bucket" # Optional: replace with your preferred bucket name, which must be a unique name.
REGION="us-central1" # Optional: replace with your preferred region (See: https://cloud.google.com/about/locations) 
print(f"Project ID: {PROJECT_ID}")
print(f"Project Number: {PROJECT_NUMBER}")
print(f"Bucket Name: {BUCKET}")

## Create bucket

The following code will create the bucket if it doesn't already exist.

If you get an error saying that it already exists, that's fine. You can ignore it and continue with the rest of the steps, unless you want to use a different bucket.

In [None]:
!gsutil mb -l {REGION} gs://{BUCKET}

## Begin implementation

Now that we have performed the prerequisite steps for this activity, it's time to implement the activity.

### Create UID for this session

This will be used in various variable values throughout this notebook to make the values unique to this session.

In [None]:
from datetime import datetime

# generate a unique id for this session
UID = datetime.now().strftime("%m%d%H%M")

## Create Document AI Processor

We will create a Document AI Processor to break our input documents into chunks.

See documentation [here](https://cloud.google.com/document-ai/docs/overview#dai-processors) and [here](https://cloud.google.com/document-ai/docs/layout-parse-chunk) for additional details.

In [None]:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Document AI Variables
location = 'us'  # Different from REGION; the format is 'us' or 'eu'
processor_display_name = f"RAG-Chunking-Processor-{UID}"
processor_type = 'LAYOUT_PARSER_PROCESSOR'

# Create Document AI client
opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=opts)

# Location path
parent = client.common_location_path(PROJECT_NUMBER, location)

# Create the processor 
processor = client.create_processor(
    parent=parent,
    processor=documentai.Processor(
        display_name=processor_display_name, type_=processor_type
    ),
)

# Print the processor information
print(f"Processor Name: {processor.name}")
print(f"Processor Display Name: {processor.display_name}")
print(f"Processor Type: {processor.type_}")
print(f"Processor State: {processor.state}")

processor_name = processor.name  # Get the processor name for later use

## Process the document to create the chunks

This is the point at which we perform document chunking.

The following code will use our Document AI processor to break our document into chunks.

In [None]:
from typing import Optional
from google.cloud import documentai_v1beta3 as documentai

file_path = "data/Cinnamon.pdf"
mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types

def process_document_sample(
    project_id: str,
    location: str,
    processor_name: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    name = processor_name
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load binary data
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # For more information: https://cloud.google.com/document-ai/docs/reference/rest/v1/ProcessOptions

    # Define the processing options, including the chunking configuration
    process_options = documentai.ProcessOptions(
        layout_config=documentai.ProcessOptions.LayoutConfig(
            chunking_config=documentai.ProcessOptions.LayoutConfig.ChunkingConfig(
                chunk_size=200,  
                include_ancestor_headings=True
            )
        )
    )

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document,
        process_options=process_options,
    )
    
    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    document = result.document
    return document

document_object = process_document_sample(PROJECT_NUMBER, location, processor_name, file_path, mime_type)

doc_layout = document_object.document_layout
chunked_doc = document_object.chunked_document

### Notes on tokens and chunks

#### Tokens

LLMs generally work with `tokens` rather than words. When dealing with text, a token often represents subsections of words, and tokenization can be done in different ways, such as breaking text up by characters, or using subword-based tokenization (e.g., the word "unbelievable" could be split into subwords such as "un", "believe", "able").

The exact size and definition of a token can vary based on different tokenization methods, models, languages, and other factors, but a general rule of thumb for English text using subword tokenizers is around 4 characters per token on average.
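
The rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not a real tokenizer: the `estimate_tokens` helper and its `chars_per_token` default are illustrative assumptions, and actual token counts depend on the model's tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic only, not a tokenizer)."""
    if not text:
        return 0
    # Never report fewer than one token for non-empty text
    return max(1, round(len(text) / chars_per_token))


sample = "Cinnamon spice may affect glycemic response in adults with prediabetes."
print(f"{len(sample)} characters -> about {estimate_tokens(sample)} tokens")
```

An estimate like this is only useful for back-of-the-envelope sizing, such as judging whether a document section is likely to fit within a chosen chunk size.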

#### Chunks

When creating embeddings, we usually break a document into chunks and then create embeddings of those chunks. Again, this can be done in different ways, using different tools. In this notebook, we're using Google Cloud Document AI to break our documents into chunks.

For this purpose, one of the parameters we need to specify for our Document AI Processor is the `chunkSize` to use, which is measured (in this case) by number of tokens per chunk. You may need to experiment with the value for this parameter to find the chunk size that works best for your use case (e.g., based on the length and structure of the document sections). You generally want your chunks to capture some level of semantic granularity, but there are trade-offs in terms of this granularity. For example, smaller chunks can capture more granular semantic context and can provide more precise search results, but they can be less efficient (computationally) to process. You also need to ensure that chunk sizes are within the input length limits of the embedding model you're using in order to avoid possible truncation.

A good practice is to start with a moderate chunk size and adjust it based on how well it fits your needs.

Fortunately, Document AI can automatically handle some chunking based on layout, even without a set chunkSize, so that can be helpful if you don't know what chunk size to use.
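
To get a feel for how chunk size affects the number and length of chunks, here is a minimal sketch of a greedy, word-based chunker. This is not how Document AI chunks documents (it uses the document layout); the `naive_chunk` helper and the 4-characters-per-token budget are assumptions made purely for illustration.

```python
from typing import List


def naive_chunk(text: str, max_tokens: int = 200, chars_per_token: int = 4) -> List[str]:
    """Pack whole words into chunks until an approximate token budget is reached."""
    budget = max_tokens * chars_per_token  # approximate character budget per chunk
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        added = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + added > budget:
            # Adding this word would exceed the budget, so close the current chunk
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


toy_text = "word " * 1000
for size in (50, 200):
    pieces = naive_chunk(toy_text, max_tokens=size)
    print(f"max_tokens={size}: {len(pieces)} chunks")
```

Running the same text through different `max_tokens` values shows the trade-off described above: a smaller budget produces more, shorter chunks to embed and index.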

### Review document layout and chunks

#### Print sample of document layout

The following code will print the first couple of blocks from our document layout.

In [None]:
print(doc_layout.blocks[:2])

### Review the text chunks

The following code will consolidate our document chunks into a list that we can use from here onwards.

We will also print the first 5 chunks to see their contents.

In [None]:
text_chunks = [chunk.content for chunk in chunked_doc.chunks] 

In [None]:
print(text_chunks[:5])

## Create embeddings

Now that we have broken our document down into chunks, the next step is to get embeddings for our chunks.

We will use the `textembedding-gecko` model to create our embeddings (see [documentation here](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings) for more details).

In [None]:
from typing import List

from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

def embed_text(
    texts: List[str] = text_chunks,  # Use your extracted chunks
    task: str = "RETRIEVAL_DOCUMENT",
    model_name: str = "textembedding-gecko@003",
) -> List[List[float]]:
    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text, task) for text in texts]
    embeddings = model.get_embeddings(inputs)
    return [embedding.values for embedding in embeddings]

embeddings = embed_text(text_chunks)
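
Embedding APIs typically limit how many inputs a single request may contain, and the limit can vary by model and region, so a long document may need to be embedded in batches. The following is a minimal sketch of that pattern. The `batched` and `embed_in_batches` helpers, the `batch_size` value, and the `fake_embed` stand-in are all hypothetical; check the text embeddings documentation for the actual limits of your model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 5,  # assumed value; not an official limit
) -> List[List[float]]:
    """Call embed_fn once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in embedder for demonstration: one number per text (its length)."""
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo_vectors)
```

In this notebook, a function such as `embed_text` could be passed as `embed_fn` in place of the stand-in, since it accepts a list of texts and returns a list of vectors.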

### Save embeddings

Next, we will save our embeddings in a local file.

The code in the following cell will also save the text chunks with their associated embeddings, which we can use for retrieval later.

This file will be used as input when creating a Vertex AI Vector Search index in subsequent sections of this notebook.

See the documentation [here](https://cloud.google.com/vertex-ai/docs/vector-search/setup/format-structure) for requirements regarding the input format structure for Vertex AI Vector Search indexes.

In [None]:
import json
filename = f"embeddings-{UID}.json"

def create_embeddings_jsonl(text_chunks, embeddings, filename):
    with open(filename, 'w') as outfile:
        for idx, (text, embedding) in enumerate(zip(text_chunks, embeddings)):
            data = {
                "id": str(idx),  # The Vector Search input format describes the ID as a string
                "text": text,
                "embedding": embedding
            }
            json.dump(data, outfile, separators=(',', ':'))
            outfile.write('\n')

create_embeddings_jsonl(text_chunks, embeddings, filename)
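
The file written above contains one standalone JSON object per line. As a minimal sketch of that layout, the following writes two toy records to a temporary file and reads them back. The toy data and file name are made up for illustration, and writing the IDs as strings is an assumption based on the input format documentation.

```python
import json
import os
import tempfile

toy_chunks = ["first chunk", "second chunk"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-embeddings.json")

    # Write one JSON object per line
    with open(path, "w") as outfile:
        for idx, (text, vector) in enumerate(zip(toy_chunks, toy_embeddings)):
            record = {"id": str(idx), "text": text, "embedding": vector}
            outfile.write(json.dumps(record) + "\n")

    # Read the file back, parsing each line independently
    with open(path) as infile:
        records = [json.loads(line) for line in infile]

print(records[0])
```

Because every line parses on its own, a quick read-back like this is an easy sanity check before uploading the real file.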

### Upload file to GCS

Next, we will upload our file to Google Cloud Storage (GCS). This is required for us to ingest our embeddings into Vertex AI Vector Search.

In [None]:
embeddings_path = f"gs://{BUCKET}/chapter-18-embeddings-data-{UID}/batch_root/"

In [None]:
!gsutil cp {filename} {embeddings_path}

## Create Vertex AI Vector Search Index

We will store our embeddings in Vertex AI Vector Search. To do that we need to create an index and specify the GCS location of our embeddings file to be ingested.

See the documentation [here](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) and [here](https://cloud.google.com/vertex-ai/docs/vector-search/configuring-indexes) for more details on creating an index and its parameters.

### First, get the embedding dimensionality

When creating an index in Vertex AI Vector Search, we need to know the dimensionality of our embeddings. We can get that with the following code.

In [None]:
first_embedding = embeddings[0] # Assumes all our embeddings have same dimensionality, so we can just use the first one.
embeddings_dimensionality = len(first_embedding) 
print(f"Embeddings dimensionality: {embeddings_dimensionality}")
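
The cell above assumes every embedding has the same length. A minimal sketch of an explicit check is shown below; the `check_dimensionality` helper is hypothetical and not part of any SDK.

```python
from typing import Sequence


def check_dimensionality(vectors: Sequence[Sequence[float]]) -> int:
    """Return the shared vector length, or raise if the vectors disagree."""
    if not vectors:
        raise ValueError("No embeddings provided.")
    sizes = {len(vector) for vector in vectors}
    if len(sizes) != 1:
        raise ValueError(f"Inconsistent embedding sizes: {sorted(sizes)}")
    return sizes.pop()


toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(f"Shared dimensionality: {check_dimensionality(toy_vectors)}")
```

Calling a check like this on `embeddings` before creating the index surfaces a mismatch early instead of during ingestion.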

### Create the index

This process can take a long time, depending on the number of embeddings.

In [None]:
from google.cloud import aiplatform

def vector_search_create_index(
    project: str, location: str, display_name: str, gcs_uri: str
) -> aiplatform.MatchingEngineIndex:
    # Initialize the Vertex AI SDK (the staging bucket is given as a gs:// URI)
    aiplatform.init(project=project, location=location, staging_bucket=f"gs://{BUCKET}")

    # Create the index from the embeddings file(s) stored under gcs_uri
    index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name=display_name,
        contents_delta_uri=gcs_uri,
        description="RAG Index",
        dimensions=embeddings_dimensionality,
        approximate_neighbors_count=50,
        leaf_node_embedding_count=500,
        leaf_nodes_to_search_percent=7,
        index_update_method="batch_update",
        distance_measure_type="DOT_PRODUCT_DISTANCE",
    )
    return index

display_name = f"RAG-index-{UID}"
vvs_index = vector_search_create_index(PROJECT_ID, REGION, display_name, embeddings_path)

## Create Endpoint and Deploy Index

To use our index, we need to deploy it to an endpoint.

### First, create the endpoint

In [None]:
## create `IndexEndpoint`
vvs_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name = f"index-endpoint-{UID}",
    public_endpoint_enabled = True
)

### Deploy the index to the endpoint

**This step can take a long time.**

To check the status in the Google Cloud console, navigate to Vertex AI -> Vector Search, and click on the name of your index.

In [None]:
# Create an ID for the index
DEPLOYED_INDEX_ID = f"deployed_index_{UID}"

vvs_index_endpoint.deploy_index(
    index = vvs_index, deployed_index_id = DEPLOYED_INDEX_ID
)

## Perform a vector-based similarity search

Now that we have created our index, we can use it to perform a vector-based similarity search to find the nearest neighbors to our query.

For this work, we need to first create an embedding of our query using the same model that was used to create embeddings for our text chunks earlier (in our case, we used `textembedding-gecko@003`). This is done using the `embed_query` function.

Let's use one of our questions from Chapter 16 (feel free to replace the `query_text` string with other questions related to the document contents).

In [None]:
# Create embedding for input query
# Queries use the RETRIEVAL_QUERY task type; the chunks were embedded with RETRIEVAL_DOCUMENT
def embed_query(text: str, task: str = "RETRIEVAL_QUERY", model_name: str = "textembedding-gecko@003"):
    model = TextEmbeddingModel.from_pretrained(model_name)
    query_input = TextEmbeddingInput(text, task)  # Create a single input object
    embeddings = model.get_embeddings([query_input])  # Pass input as a list
    return embeddings[0].values  # Return the embedding as a list of floats

query_text = "How might cinnamon supplementation interact with other dietary or lifestyle interventions for prediabetes management?"
query_embedding = embed_query(query_text)

# Find the nearest neighbors
response = vvs_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding],
    num_neighbors=3 # You can change this value to return more or fewer neighbors
)

### Print details of all of the nearest neighbors returned

In the response, we can see the details of the nearest neighbors returned, including their IDs and distances relative to our query.

Usually, smaller distance means that the returned neighbors are more similar to our query (e.g., Euclidean distance). However, we are using the `DOT_PRODUCT_DISTANCE`, which means that, counter-intuitively, the higher the value of the `distance` metric, the closer and more similar the returned neighbor is to our query.

In [None]:
print(response)
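
To make the point about distance measures concrete, here is a minimal sketch with toy two-dimensional vectors. The vectors and the `dot` and `euclidean` helpers are made up for illustration. With the dot product, candidates are ranked from highest to lowest score, while with Euclidean distance they are ranked from lowest to highest.

```python
import math
from typing import Dict, List

query = [1.0, 0.0]
candidates: Dict[str, List[float]] = {
    "a": [0.9, 0.1],
    "b": [0.1, 0.9],
    "c": [0.5, 0.5],
}


def dot(u: List[float], v: List[float]) -> float:
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(u, v))


def euclidean(u: List[float], v: List[float]) -> float:
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Note the opposite sort directions for the two measures
by_dot = sorted(candidates, key=lambda k: dot(query, candidates[k]), reverse=True)
by_euclidean = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))

print("Ranked by dot product (descending):", by_dot)
print("Ranked by Euclidean distance (ascending):", by_euclidean)
```

For these toy vectors both measures produce the same ranking, even though one is sorted in descending order and the other in ascending order.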

### Print details of the nearest neighbor returned

In [None]:
print(response[0][0])

### Inspect the nearest neighbor

The following code will display the nearest neighbor embedding.

In [None]:
neighbor_id = response[0][0].id
neighbor_embedding = vvs_index_endpoint.read_index_datapoints(
    deployed_index_id=DEPLOYED_INDEX_ID,
    ids=[neighbor_id],
)
print(neighbor_embedding)

## Retrieve text chunks represented by the nearest neighbor embeddings

The embeddings are used for performing similarity/neighbor search in the vector space.

If we want to implement a RAG use case, however, the embeddings by themselves are not meaningful to include in the context for interacting with a generative model. 

For that reason, we need to retrieve the text chunks that are associated with the nearest neighbor embeddings.

The following code will do that.

In [None]:
# Extract IDs from the response
neighbor_ids = [neighbor.id for neighbor in response[0]]  # Access the first (and likely only) list in the response

# Fetch the text chunks corresponding to given IDs from the JSONL file
def fetch_text_chunks(ids, filename=filename):
    texts = {}
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if str(data['id']) in ids:  # Ensure the id from JSON is treated as a string for matching
                texts[str(data['id'])] = data['text']
    return [texts[id] for id in ids if id in texts]

# Fetch the corresponding texts using the IDs extracted
neighbor_texts = fetch_text_chunks([str(id) for id in neighbor_ids], filename=filename)  

# Now we have the embeddings and their corresponding text chunks
print(neighbor_texts)
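
Scanning the file on every lookup is fine for one small document. For repeated queries, a common alternative is to build an in-memory mapping from ID to text once and reuse it. The following is a minimal sketch of that idea using toy JSONL lines; the `build_chunk_lookup` helper and the sample records are hypothetical.

```python
import json
from typing import Dict, Iterable


def build_chunk_lookup(jsonl_lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's ID (as a string) to its text chunk."""
    lookup: Dict[str, str] = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        lookup[str(record["id"])] = record["text"]
    return lookup


toy_lines = [
    '{"id": 0, "text": "chunk zero"}',
    '{"id": 1, "text": "chunk one"}',
    '{"id": 2, "text": "chunk two"}',
]
lookup = build_chunk_lookup(toy_lines)

# Keep the order in which the neighbors were returned
ranked_ids = ["2", "0"]
ranked_texts = [lookup[chunk_id] for chunk_id in ranked_ids if chunk_id in lookup]
print(ranked_texts)
```

An open file handle can be passed as `jsonl_lines`, because iterating over a text file yields one line at a time.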

## Use retrieved content with generative AI model

To implement a RAG use case, we can include the retrieved content as context when sending a prompt to a generative AI model.

The following code will send a prompt to Gemini, and will include our retrieved text chunks as context.

### First, send a simple prompt to Gemini, without context.

In [None]:
import vertexai

from vertexai.generative_models import GenerativeModel, ChatSession

vertexai.init(project=PROJECT_ID, location=REGION)
model = GenerativeModel(model_name="gemini-1.0-pro-002")
chat = model.start_chat()

def get_chat_response(chat: ChatSession, prompt: str) -> str:
    text_response = []
    responses = chat.send_message(prompt, stream=True)
    for chunk in responses:
        text_response.append(chunk.text)
    return "".join(text_response)

# Using our query text from earlier
prompt = query_text
print(get_chat_response(chat, prompt))

### Provide context

In the next prompt, we provide our retrieved text chunks as context.

In [None]:
prompt = f"{query_text} in the context of:{neighbor_texts}"
print(get_chat_response(chat, prompt))

### Try with precise answers only from the source document

In this case, we instruct the model to answer only from the retrieved context.

In [None]:
prompt = f"{query_text} Answer only from the following context:{neighbor_texts}"
print(get_chat_response(chat, prompt))
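
The prompts above interpolate the Python list of chunks directly into the string, so the model sees list brackets and quotes. One possible refinement is to format the chunks as numbered context blocks, as in the following minimal sketch. The `build_grounded_prompt` helper and its instruction wording are illustrative assumptions, not a prescribed prompt format.

```python
from typing import List


def build_grounded_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine a question with numbered context chunks into a single prompt."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


demo_prompt = build_grounded_prompt(
    "What was studied?",
    ["A 4-week randomized controlled crossover trial. ", "Adults with prediabetes."],
)
print(demo_prompt)
```

In this notebook, `query_text` and `neighbor_texts` could be passed as the two arguments, and the result sent with `get_chat_response`.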

# That's it! Well Done!

# Clean up

When you no longer need the resources created by this notebook, you can delete them as follows.

**Note: if you do not delete the resources, you will continue to pay for them.**

In [None]:
clean_up = False

## Delete the Document AI Processor

In [None]:
from google.api_core import exceptions as gcp_exceptions
if clean_up:
    try:
        client.delete_processor(name=processor_name)
        print(f"Deleted Processor: {processor_name}")
    except gcp_exceptions.NotFound:
        print(f"Processor not found: {processor_name}")
else:
    print("clean_up parameter is set to False")

## Delete the Vertex Vector Search Index

In [None]:
if clean_up:
    try:
        vvs_index.delete()  # Expected to fail while the index is still deployed; undeploy it from the endpoint first (see the next section)
        print(f"Deleted Matching Engine Index: {vvs_index.resource_name}")
    except gcp_exceptions.NotFound:
        print(f"Index not found: {vvs_index.resource_name}")
    except Exception as e:
        print(f"Error deleting index: {e}")
else:
    print("clean_up parameter is set to False")

## Delete the Vertex Vector Search Endpoint

In [None]:
if clean_up:
    try:
        # Undeploy the deployed index
        try:
            vvs_index_endpoint.undeploy_index(deployed_index_id=DEPLOYED_INDEX_ID)
            print(f"Undeployed index '{DEPLOYED_INDEX_ID}' from endpoint '{vvs_index_endpoint.name}'.")
        except gcp_exceptions.NotFound:
            print(f"Deployed index '{DEPLOYED_INDEX_ID}' not found in endpoint '{vvs_index_endpoint.name}'.")

        # Delete the index endpoint
        vvs_index_endpoint.delete()
        print(f"Deleted index endpoint: {vvs_index_endpoint.name}")

    except gcp_exceptions.NotFound:
        print(f"Index endpoint not found: {vvs_index_endpoint.name}")
else:
    print("clean_up parameter is set to False")

## Delete GCS Bucket
The bucket can be reused throughout multiple activities in the book. Sometimes, activities in certain chapters make use of artifacts from previous chapters that are stored in the GCS bucket.

I highly recommend **not deleting the bucket** unless you will be performing no further activities in the book. For this reason, there's a separate `delete_bucket` variable to specify if you want to delete the bucket.

If you want to delete the bucket, set the `delete_bucket` parameter to `True`.

In [None]:
delete_bucket = False

In [None]:
if delete_bucket:
    # Delete the bucket
    ! gcloud storage rm --recursive gs://$BUCKET
else:
    print("delete_bucket parameter is set to False")