# Multimodal Retrieval Augmented Generation (RAG) using Vertex AI Gemini API

<table align="left">
  <td style="text-align: center">
    <a href="https://github.com/manuyweissel/Master_Thesis_RAG_Guideline/blob/main/intro_rag.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>  
</table>

| | | 
|-|-|
|Author(s) | [Manu Weissel](https://www.linkedin.com/in/man%C3%BA-weissel-618127211/) |

## Overview
Retrieval-Augmented Generation (RAG) is a technique in natural language processing (NLP) that combines the broad knowledge of large language models (LLMs) with the specificity and relevance of information retrieval systems. By retrieving passages from the documents you supply, a RAG system can generate responses that are grounded in relevant external data sources rather than relying only on what the model learned during training.
RAG has become a popular paradigm for giving LLMs access to external data and for grounding their answers to mitigate hallucinations.

The RAG approach involves two main steps:

1. **Retrieval of Information:** The model searches a large dataset or corpus to find pieces of information that are relevant to the input query. This dataset can be anything from the entirety of Wikipedia to a specialized database tailored to a specific domain.
2. **Generation of Responses:** The model then combines this retrieved information with the original query to generate a response. This process leverages the power of generative models, like GPT (Generative Pre-trained Transformer), BART (Bidirectional and Auto-Regressive Transformers) or Gemini, which are trained to produce coherent and contextually relevant text based on the inputs they receive.
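
The two steps above can be sketched with a toy example. This is a minimal illustration only: it ranks passages by word overlap instead of embeddings, and the corpus, the stopword list, and the function names `retrieve` and `build_prompt` are made up for this sketch. The resulting prompt is what would be handed to a generative model in the second step.

```python
import re
from collections import Counter

# Made-up toy corpus; in practice these would be chunks of your documents.
CORPUS = [
    "BMW raised its guidance for the automotive EBIT margin.",
    "The weather in Munich was sunny in August.",
    "Electric vehicle deliveries grew strongly in the second quarter.",
]

# Small illustrative stopword list so that filler words do not count as matches.
STOPWORDS = {"a", "of", "the", "to", "in", "for", "its", "was", "what"}


def tokenize(text):
    """Lowercase the text and return its non-stopword tokens."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def retrieve(query, corpus, top_n=2):
    """Step 1: rank passages by word overlap with the query (stand-in for embedding search)."""
    query_tokens = Counter(tokenize(query))
    scored = []
    for passage in corpus:
        overlap = sum((query_tokens & Counter(tokenize(passage))).values())
        scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:top_n] if score > 0]


def build_prompt(query, passages):
    """Step 2 (input side): combine the retrieved passages with the original query."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "What happened to the EBIT margin guidance?"
passages = retrieve(query, CORPUS)
prompt = build_prompt(query, passages)
print(prompt)
```

Only the passage that shares content words with the query ends up in the prompt; the unrelated passages score zero and are dropped.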

### Gemini
Gemini is a family of generative AI models developed by Google DeepMind that is designed for multimodal use cases. The Gemini API gives you access to the Gemini 1.0 Pro Vision and Gemini 1.0 Pro models.

### Comparing text-based and multimodal RAG
Multimodal RAG offers several advantages over text-based RAG:

1. **Enhanced knowledge access:** Multimodal RAG can access and process both textual and visual information, providing a richer and more comprehensive knowledge base for the LLM.
2. **Improved reasoning capabilities:** By incorporating visual cues, multimodal RAG can make better informed inferences across different types of data modalities.

This notebook shows you how to use RAG with Vertex AI Gemini API, [text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings), and [multimodal embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/multimodal-embeddings), to build a document search engine.

Through hands-on examples, you will discover how to construct a multimedia-rich metadata repository of your document sources, enabling search, comparison, and reasoning across diverse information streams. In summary, you will learn how to perform multimodal RAG by running Q&A over a financial document that contains both text and images.

# Set up Vertex AI


This tutorial uses components of Google Cloud:

- Vertex AI

To access Vertex AI and create a free-trial account, follow these steps:
1. Enter your Google credentials [here](https://console.cloud.google.com/freetrial?hl=de&facet_utm_source=%28direct%29&facet_utm_campaign=%28direct%29&facet_utm_medium=%28none%29&facet_url=https%3A%2F%2Fcloud.google.com%2Fvertex-ai%2Fpricing&facet_id_list=%5B39300012%2C+39300022%2C+39300118%2C+39300195%2C+39300251%2C+39300317%2C+39300320%2C+39300326%2C+39300345%2C+39300354%2C+39300364%2C+39300373%2C+39300412%2C+39300421%2C+39300436%2C+39300471%2C+39300488%2C+39300496%2C+39300498%5D&_ga=2.268870644.400606400.1709743568-1134076064.1709743561). If you don't have a Google account, you need to create one.
2. You will have to enter your credit card information, but there will be NO costs.
3. Follow the required steps to use the Google Cloud Platform services. 
4. You will be able to use Vertex AI for the next 90 days.
5. Click the following icon to open this notebook in your Vertex AI Workbench.


<table align="left">
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/manuyweissel/Master_Thesis_RAG_Guideline/main/intro_rag.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>    
</table>

6. On the right side of the page you will find a tutorial for creating a user-managed notebook instance. Follow the steps it describes.
- Start the tutorial and stay on the first step. Since your project 'My First Project' has already been created, you can proceed to the second part of the first step. Don't upgrade your free trial if you want to avoid unnecessary costs.
- Press 'Enable the Notebooks API'. This can take up to 5 minutes. DON'T rush to the next step.
- After enabling the Notebooks API, the layout should change; scroll down and press 'Create'. Do not proceed with any other steps.
7. After you complete these steps, a notebook with JupyterLab should open automatically. Once it has loaded, click 'OPEN', confirm the deployment, and you are ready to go!



## Please proceed if and ONLY if you have set up Vertex AI in JupyterLab on Google Cloud Workspace!

### Objectives

This notebook provides a guide to building a document search engine using multimodal retrieval augmented generation (RAG), step by step:

1. Extract and store metadata of documents containing both text and images, and generate embeddings of the documents
2. Search the metadata with text queries to find similar text or images
3. Search the metadata with image queries to find similar images
4. Using a text query as input, search for contextual answers using both text and images

## Getting Started


### Define Google Cloud project information


In [1]:
# Define project information

import sys

PROJECT_ID = ""  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# if not running on colab, try to get the PROJECT_ID automatically
if "google.colab" not in sys.modules:
    import subprocess

    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

print(f"Your project ID is: {PROJECT_ID}")

Your project ID is: gemini-pro-langchain


In [2]:
# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries


In [3]:
from IPython.display import Markdown, display
from vertexai.generative_models import (
    Content,
    GenerationConfig,
    GenerationResponse,
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
    Image,
    Part,
)

### Load the Gemini 1.0 Pro and Gemini 1.0 Pro Vision models


In [4]:
text_model = GenerativeModel("gemini-1.0-pro")
multimodal_model = GenerativeModel("gemini-1.0-pro-vision")

### Download custom Python modules and utilities

The cell below downloads the helper functions used in this notebook; they live in a separate module to keep the notebook readable. You can also view the code (`intro_multimodal_rag_utils.py`) directly on [GitHub](https://raw.githubusercontent.com/manuyweissel/Master_Thesis_RAG_Guideline/main/utils/intro_multimodal_rag_utils.py).

In [5]:
import os
import urllib.request

# Create the local folder for the helper module if it does not exist yet
os.makedirs("utils", exist_ok=True)

# Download the helper scripts from the utils folder of the repository
# (no trailing slash here, since one is added when the URL is built below)
url_prefix = "https://raw.githubusercontent.com/manuyweissel/Master_Thesis_RAG_Guideline/main/utils"
files = ["intro_multimodal_rag_utils.py"]

for fname in files:
    urllib.request.urlretrieve(f"{url_prefix}/{fname}", filename=f"utils/{fname}")

#### Get documents and images from GCS

In [7]:
from utils.intro_multimodal_rag_utils import get_document_metadata

In [8]:
# Specify the folder that contains one or more PDFs

# pdf_folder_path = "/content/data/" # if running in Google Colab/Colab Enterprise
pdf_folder_path = "data/"  # if running in Vertex AI Workbench.

# Specify the image description prompt. Adapt it to your documents if needed.
image_description_prompt = """Explain what is going on in the image.
If it's a table, extract all elements of the table.
If it's a graph, explain the findings in the graph.
Do not include any numbers that are not mentioned in the image.
"""

# Extract text and image metadata from the PDF document
text_metadata_df, image_metadata_df = get_document_metadata(
    multimodal_model,  # we are passing gemini 1.0 pro vision model
    pdf_folder_path,
    image_save_dir="images",
    image_description_prompt=image_description_prompt,
    embedding_size=1408,
    # add_sleep_after_page = True, # Uncomment this if you are running into API quota issues
    # sleep_time_after_page = 5,
    # generation_config = # see next cell
    # safety_settings =  # see next cell
)

print("\n\n --- Completed processing. ---")



 Processing the file: --------------------------------- data/BMW_2023_Q2.pdf 


Processing page: 1
Extracting image from page: 1, saved as: images/BMW_2023_Q2.pdf_image_0_0_28.jpeg
Processing page: 2
Processing page: 3
Processing page: 4
Processing page: 5
Processing page: 6
Processing page: 7
Processing page: 8
Processing page: 9
Processing page: 10
Processing page: 11
Processing page: 12
Processing page: 13
Processing page: 14
Processing page: 15
Processing page: 16
Processing page: 17
Processing page: 18
Processing page: 19


 --- Completed processing. ---
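
The commented-out `add_sleep_after_page` option in the cell above throttles requests by pausing after every page. A different way to cope with quota errors is to retry a failed call with exponential backoff. The sketch below is not part of the helper module: `QuotaError` and `call_with_backoff` are hypothetical names, and in real use you would catch whatever quota exception your API client actually raises.

```python
import time


class QuotaError(Exception):
    """Hypothetical stand-in for the quota error raised by an API client."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on QuotaError with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return func()
        except QuotaError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Usage with a fake call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise QuotaError("quota exceeded")
    return "ok"


# Record the delays instead of actually sleeping, so the demo runs instantly.
recorded_delays = []
result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Compared with a fixed sleep after every page, backoff only slows down when the quota is actually hit, at the cost of slightly more code.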


### Import the helper functions to implement RAG

You will be importing the following functions which will be used in the remainder of this notebook to implement RAG:

* **get_similar_text_from_query():** Given a text query, finds relevant text in the document using cosine similarity. It uses the text embeddings stored in the metadata, and the results can be filtered by top score, page/chunk number, or embedding size.
* **print_text_to_text_citation():** Prints the source (citation) and details of the retrieved text from the `get_similar_text_from_query()` function.
* **get_similar_image_from_query():** Given an image path or an image, finds images from the document which are relevant. It uses image embeddings from the metadata.
* **print_text_to_image_citation():** Prints the source (citation) and the details of retrieved images from the `get_similar_image_from_query()` function.
* **get_gemini_response():** Interacts with a Gemini model to answer questions based on a combination of text and image inputs.
* **display_images():**  Displays a series of images provided as paths or PIL Image objects.
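
To make the cosine similarity search mentioned above more concrete, here is a minimal sketch of ranking embeddings against a query embedding. It is an illustration under assumptions, not the helper module's code: the three-dimensional vectors are made up (the embeddings used in this notebook are much larger) and `top_n_matches` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_n_matches(query_embedding, chunk_embeddings, top_n=3):
    """Return (index, score) pairs of the top_n most similar embeddings, best first."""
    scores = [
        (idx, cosine_similarity(query_embedding, emb))
        for idx, emb in enumerate(chunk_embeddings)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up embeddings for three chunks and one query.
chunk_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.0, 0.0]

matches = top_n_matches(query_embedding, chunk_embeddings, top_n=2)
for idx, score in matches:
    print(f"chunk {idx}: score {score:.2f}")
```

A score close to 1 means the chunk points in nearly the same direction as the query in embedding space; a score near 0 means the two are unrelated.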

In [11]:
from utils.intro_multimodal_rag_utils import (
    get_similar_text_from_query,
    print_text_to_text_citation,
    get_similar_image_from_query,
    print_text_to_image_citation,
    get_gemini_response,
    display_images,
)

## Multimodal retrieval augmented generation (RAG)

Let's bring everything together to implement multimodal RAG. These are the steps:

* **Step 1:** The user gives a query in text format; the expected information is available in the document, embedded in images and text.
* **Step 2:** Find the text chunks in the documents that are most similar to the query, using `get_similar_text_from_query()`.
* **Step 3:** Find the images whose `image_description` is most similar to the user query, using `get_similar_image_from_query()`.
* **Step 4:** Combine the similar text and images found in steps 2 and 3 as `context_text` and `context_images`.
* **Step 5:** Pass the user query to Gemini together with the text and image context from step 4. You can also add a specific instruction the model should follow while answering the user query.
* **Step 6:** Gemini produces the answer, and you can print the citations to check all relevant text and images used to address the query.
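
The control flow of these steps can be sketched in a few lines. This is a simplified outline with injected stand-in functions, not the code used in this notebook: `answer_with_multimodal_rag`, the `fake_*` stubs, and the `source` field are hypothetical, and the stub data is invented. Images are represented only by their descriptions to keep the sketch self-contained.

```python
def answer_with_multimodal_rag(query, find_text, find_images, generate):
    """Run steps 2 to 6 with injected retrieval and generation callables."""
    # Step 2: retrieve text chunks relevant to the query
    text_hits = find_text(query)
    # Step 3: retrieve images whose descriptions match the query
    image_hits = find_images(query)
    # Step 4: merge both result sets into the context
    context_text = "\n".join(hit["chunk_text"] for hit in text_hits)
    context_images = "\n".join(hit["image_description"] for hit in image_hits)
    # Step 5: pass the query and the context to the model
    prompt = (
        "Answer using only the context below.\n"
        f"Text context:\n{context_text}\n"
        f"Image context:\n{context_images}\n"
        f"Question: {query}\nAnswer:"
    )
    answer = generate(prompt)
    # Step 6: return the answer together with its sources
    citations = [hit["source"] for hit in text_hits + image_hits]
    return answer, citations


# Stand-in functions with invented data, so the sketch runs without any API access.
def fake_find_text(query):
    return [{"chunk_text": "Example chunk about revenue.", "source": "report.pdf, page 3"}]


def fake_find_images(query):
    return [{"image_description": "Example bar chart.", "source": "report.pdf, image 1"}]


def fake_generate(prompt):
    return "stub answer"


answer, citations = answer_with_multimodal_rag(
    "How did revenue develop?", fake_find_text, fake_find_images, fake_generate
)
print(answer)
print(citations)
```

Passing the retrieval and generation functions in as arguments keeps the orchestration independent of any particular model or embedding service.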

### Step 1: User query

In [17]:
# We are not passing any images here, just a simple text query.

query = """Questions:
 - What are the key takeaways for the second quarter for BMW based on the earnings call transcript 'BMW_2023_Q2.pdf'? Tell me the four most important key takeaways.
 """

### Step 2: Get all relevant text chunks

In [18]:
# Retrieve relevant chunks of text based on the query
matching_results_chunks_data = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=10,
    chunk_text=True,
)
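
The retrieval above works on text chunks rather than whole pages. As a rough illustration of how a page's text might be split, here is a sketch of character-based chunking with overlap. The function name and the default sizes are assumptions for this example; the helper module's actual chunking strategy may differ.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of up to chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghij"
sample_chunks = chunk_text(sample, chunk_size=4, overlap=1)
print(sample_chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk when the overlap is long enough, which helps the retrieval step find it.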

### Step 3: Get all relevant images

In [19]:
# Get all relevant images based on user query
matching_results_image_fromdescription_data = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,
    column_name="text_embedding_from_image_description",
    image_emb=False,
    top_n=10,
    embedding_size=1408,
)

### Step 4: Create context_text and context_images

In [20]:
# combine all the selected relevant text chunks
context_text = []
for key, value in matching_results_chunks_data.items():
    context_text.append(value["chunk_text"])
final_context_text = "\n".join(context_text)

# combine all the relevant images and their description generated by Gemini
context_images = []
for key, value in matching_results_image_fromdescription_data.items():
    context_images.extend(
        ["Image: ", value["image_object"], "Caption: ", value["image_description"]]
    )

### Step 5: Pass context to Gemini

In [21]:
prompt = f""" Instructions: Compare the images and the text provided as Context: to answer multiple Question:
Make sure to think thoroughly before answering the question and put the necessary steps to arrive at the answer in bullet points for easy explainability.
If unsure, respond, "Not enough context to answer".

Context:
 - Text Context:
 {final_context_text}
 - Image Context:
 {context_images}

{query}

Answer:
"""

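# Note: `context_images` was interpolated into the f-string above, so the model
# receives its string representation (including the captions), not the image data.
# Generate Gemini response with streaming output
response = get_gemini_response(
    multimodal_model,
    model_input=[prompt],
    stream=True,
    generation_config=GenerationConfig(temperature=0.4, max_output_tokens=2048),
)
Markdown(response)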

1. **Strong financial performance:** BMW delivered a solid performance in Q2 2023, with an EBIT margin of 11.3% for the quarter and 12.6% for the first half-year.
2. **Automotive segment growth:** The Automotive segment achieved an EBIT margin of 9.2% in Q2 and 10.6% for the half-year, driven by strong demand, favorable pricing, and cost discipline.
3. **Increased guidance:** BMW raised its guidance for the EBIT margin and return on capital employed in the Automotive segment, as well as the outlook for return on equity in the Financial Services segment.
4. **Continued investment in future technologies:** BMW continues to invest in electromobility, digitalization, and automated driving, with a CapEx ratio of 5.1% for Q2 and 4.4% for the first half-year.

### Step 6: Print citations and references

In [22]:
# Text citations

print_text_to_text_citation(
    matching_results_chunks_data,
    print_top=False,
    chunk_text=True,
)

Citation 1: Matched text:

score: 0.8
file_name: BMW_2023_Q2.pdf
page_number: 1
chunk_number: 1
chunk_text: 1 Bayerische Motoren Werke 
Aktiengesellschaft (BMWYY) Q2 2023 
Earnings Call Transcript 
Aug. 03, 2023 3:04 PM ETBayerische Motoren Werke Aktiengesellschaft ADR EACH 
REPR 0.33333 SHS SPONSORED (BMWYY) Stock, BAMXF Stock, BYMOF Stock 
 
 
 
SA Transcripts 
145.06K Followers 
Following 
2 
Q2: 2023-08-03 Earnings Summary 
Play CallSlides 
EPS of $1.60 misses by $0.23 | Revenue of $40.73B (15.17% Y/Y) misses by $512.38M 
Bayerische Motoren Werke Aktiengesellschaft (OTCPK:BMWYY) Q2 2023 Earnings 
Conference Call August 3, 2023 8:00 AM ET 
Company Participants 
Max Schberl  Investor Relations 
Walter Mertl  Chief Financial Officer 
Oliver Zipse  Chairman of the Board of Management, BMW AG 
Conference Call Participants 
Daniel Roeska  Bernstein Research 
Dorothee Cresswell  Exane BNP Paribas 
Henning Cosman  Barclays 
Michae