# Multimodal retrieval-augmented generation (mRAG)

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10.13

## Overview

{TODO: Include a paragraph or two explaining what this example demonstrates, who should be interested in it, and what you need to know before you get started.}

Learn more about [web-doc-title](linkback-to-webdoc-page). {TODO: if more than one primary feature, add tag/linkback for each one}

### Objective

In this tutorial, you learn how to {TODO: Complete the sentence explaining briefly what you will learn from the notebook, such as
training, hyperparameter tuning, or serving}:

This notebook performs best when:
- the documents contain both text and images
- tables in the documents are available as images
- the images don't require much additional context. If the documents rely on domain-specific knowledge, pass that information in the prompt.

This tutorial uses the following Google Cloud ML services and resources:

- *{TODO: Add high level bullets for the services/resources demonstrated; e.g., Vertex AI Training}*


The steps performed include:

- *{TODO: Add high level bullets for the steps of performed in the notebook}*

### Dataset

{TODO: Include a paragraph with Dataset information and where to obtain it.} 

## Installation

Install the packages required to run this notebook.

The ```--upgrade``` flag upgrades each package to its latest GA version.

In [None]:
! pip3 install --upgrade --user google-cloud-aiplatform pymupdf rich

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython
import time

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Set up your Google Cloud account and authenticate
Follow [Google's tutorials](https://console.cloud.google.com/cloud-resource-manager) to set up your account and authenticate.
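As a minimal sketch of the authentication step, the cell below runs the Colab sign-in flow only when the notebook is running in Colab. The helper name `authenticate_if_colab` is hypothetical, and the sketch assumes that outside Colab (for example in Vertex AI Workbench) credentials are already provided by the environment.

```python
import sys


def authenticate_if_colab() -> bool:
    """Run the Colab sign-in flow when available; return True if it ran."""
    if "google.colab" not in sys.modules:
        # Assumption: outside Colab, credentials already come from the
        # environment, so there is nothing to do here.
        return False

    from google.colab import auth  # only importable inside Colab

    auth.authenticate_user()
    return True


ran_colab_auth = authenticate_if_colab()
print(f"Colab authentication ran: {ran_colab_auth}")
```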

In [None]:
# Define project information

import sys

PROJECT_ID = ""  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# if not running on Colab, try to get the PROJECT_ID automatically
if "google.colab" not in sys.modules:
    import subprocess

    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

print(f"Your project ID is: {PROJECT_ID}")

### Import libraries

In [None]:
import sys

# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

from IPython.display import Markdown, display
from rich.markdown import Markdown as rich_Markdown
from vertexai.generative_models import (
    Content,
    GenerationConfig,
    GenerationResponse,
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
    Image,
    Part,
)

text_model = GenerativeModel("gemini-1.5-pro")
multimodal_model = GenerativeModel("gemini-1.5-pro-001")
multimodal_model_flash = GenerativeModel("gemini-1.5-flash-001")

### Initialize

Initialize the Vertex AI SDK for Python for your project.

In [None]:
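# Imports for the embedding models loaded below
from vertexai.language_models import TextEmbeddingModel
from vertexai.vision_models import MultiModalEmbeddingModel

# Initialize the Vertex AI SDK with the project and location defined earlier
vertexai.init(project=PROJECT_ID, location=LOCATION)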

# Instantiate text model with appropriate name and version
text_model = GenerativeModel("gemini-1.0-pro")  # works with text, code

# Multimodal models: Choose based on your performance/cost needs
multimodal_model_15 = GenerativeModel(
    "gemini-1.5-pro-001"
)  # works with text, code, images, video(with or without audio) and audio(mp3) with 1M input context - complex reasoning

# Load text embedding model from pre-trained source
text_embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

# Load multimodal embedding model from pre-trained source
multimodal_embedding_model = MultiModalEmbeddingModel.from_pretrained(
    "multimodalembedding"
)  # works with image, image with caption(~32 words), video, video with caption(~32 words)

from multimodal_qa_with_rag_utils import (
    get_document_metadata,
    set_global_variable,
)

set_global_variable("text_embedding_model", text_embedding_model)
set_global_variable("multimodal_embedding_model", multimodal_embedding_model)

## STEP 1 - Extracting document metadata from both text and images
To ground the search in the images and documents that are of interest to us, we need to extract metadata from our documents. We then create embeddings to find the text or images most similar to our prompt. The metadata is also used for references and citations. It includes essential elements such as page numbers, file names, and image counters.

### STEP 1.1 - Generate Embeddings from the metadata
Embeddings are used to perform similarity search when querying the data. We extract metadata from both images and text because some information may appear only in an image, and vice versa.
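To make the idea of similarity search concrete, here is a minimal sketch that ranks a few toy embeddings against a query by cosine similarity. It is illustrative only: the function names, the three-dimensional vectors, and the labels are made up, and the notebook's helper module may compute similarity differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidates, top_n=2):
    """Return the top_n (label, score) pairs, highest score first."""
    scored = [
        (label, cosine_similarity(query_embedding, embedding))
        for label, embedding in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data: real embeddings have hundreds of dimensions or more.
toy_embeddings = {
    "page 1, chunk 1": [1.0, 0.0, 0.0],
    "page 1, chunk 2": [0.0, 1.0, 0.0],
    "page 2, chunk 1": [0.7, 0.7, 0.0],
}
toy_query = [0.9, 0.1, 0.0]

top_matches = rank_by_similarity(toy_query, toy_embeddings)
for label, score in top_matches:
    print(f"{label}: {score:.3f}")
```

The query points mostly along the first axis, so the chunk whose embedding lies on that axis ranks first.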

```get_document_metadata()```: extracts text and image metadata from the documents. Returns two data frames, ```text_metadata``` and ```image_metadata```.


### STEP 1.2 - Analyse the embeddings
Using the cells below, check that the metadata and embeddings look correct for your data. The data frames contain the following columns:
- ```text_metadata```:
    - ```text```: the original text from the page
    - ```text_embedding_page```: the embedding of the original text from the page
    - ```chunk_text```: the original text divided into smaller chunks
    - ```chunk_number```: the index of each text chunk
    - ```text_embedding_chunk```: the embedding of each text chunk
- ```image_metadata```:
    - ```img_desc```: Gemini-generated textual description of the image.
    - ```mm_embedding_from_text_desc_and_img```: Combined embedding of the image and its description, capturing both visual and textual information.
    - ```mm_embedding_from_img_only```: Embedding of the image alone, without its description.
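The chunk columns listed above come from splitting each page's text into smaller pieces. The sketch below shows one common way to do this, fixed-size character windows with overlap. It is an assumption for illustration: `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names, and the helper module may chunk text differently.

```python
def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split text into (chunk_number, chunk_text) pairs of overlapping windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((len(chunks) + 1, text[start : start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Tiny example so the overlap is easy to see.
example_chunks = split_into_chunks("abcdefghij", chunk_size=4, overlap=1)
for number, chunk in example_chunks:
    print(number, chunk)
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is less likely to be cut off from its context.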

In [None]:
%%time
# Specify the folder that contains the PDFs (processing takes ~7 min)

print("Removing pre-existing images folder, if any")
! rm -rf images/

pdf_folder_path = "data/"  # if running in Vertex AI Workbench.

# Specify the image description prompt. Change it to suit your documents.
# image_description_prompt = """Explain what is going on in the image.
# If it's a table, extract all elements of the table.
# If it's a graph, explain the findings in the graph.
# Do not include any numbers that are not mentioned in the image.
# """

image_description_prompt = """The interpretation of pharmaceutical information is inherently complex, involving multiple data types such as text, diagrams, tables, and schematics. While large language models (LLMs) demonstrate proficiency in answering straightforward questions, complex inquiries demand an integrated approach to multiple information modalities. This hackathon aims to leverage Google AI's advanced tools and technologies to develop solutions that can effectively interpret and reason with multi-modal data from pharmaceutical annual reports.
Participants will work on tasks around complex reasoning such as question answering, information extraction, and data synthesis, showcasing the potential of multi-modal reasoning in the pharmaceutical domain.

Important Guidelines:
* Prioritize accuracy:  If you are uncertain about any detail, state "Unknown" or "Not visible" instead of guessing.
* Avoid hallucinations: Do not add information that is not directly supported by the image.
* Be specific: Use precise language to describe shapes, colors, textures, and any interactions depicted.
* Consider context: If the image is a screenshot or contains text, incorporate that information into your description.
"""


# Extract text and image metadata from the PDF document
text_metadata_df, image_metadata_df = get_document_metadata(
    multimodal_model_15,  # we are passing Gemini 1.5 Pro
    pdf_folder_path,
    image_save_dir="images",
    image_description_prompt=image_description_prompt,
    embedding_size=1408,
    # add_sleep_after_page = True, # Uncomment this if you are running into API quota issues
    # sleep_time_after_page = 5,
    add_sleep_after_document=True,  # Pause after each document to stay within API quota
    sleep_time_after_document=5,  # Pause length in seconds; increase it if you still hit quota issues (this slows down processing)
    # generation_config = # see next cell
    # safety_settings =  # see next cell
)

print("\n\n --- Completed processing. ---")
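The sleep settings in the cell above pause for a fixed time to stay within API quota. Another common pattern for quota errors is to retry with exponential backoff. The sketch below is not part of the notebook's helper module: `call_with_backoff` and its parameters are hypothetical, and the demo uses a simulated error. With the real API you would likely pass the quota exception type, such as `google.api_core.exceptions.ResourceExhausted`, as `retry_on`.

```python
import random
import time


def call_with_backoff(
    func, max_attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep
):
    """Call func(), retrying on the given exceptions with growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the delay on each retry and add jitter to spread out calls.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)


# Demo with a simulated quota error that clears on the third call.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated quota error")
    return "ok"


recorded_delays = []  # capture delays instead of actually sleeping
result = call_with_backoff(
    flaky_call, retry_on=(RuntimeError,), sleep=recorded_delays.append
)
print(result, len(recorded_delays))
```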

## Inspect metadata

In [None]:
text_metadata_df.head()

In [None]:
image_metadata_df.head()
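Beyond eyeballing the first rows, a quick programmatic check can catch missing or wrongly sized embeddings. The sketch below works on plain row dictionaries, and `find_bad_embeddings` is a hypothetical helper. With the data frames above you could pass `image_metadata_df.to_dict("records")` as the rows and use 1408 as the expected dimension, assuming the column really holds vectors of the `embedding_size` requested earlier.

```python
def find_bad_embeddings(rows, column, expected_dim):
    """Return the indices of rows whose embedding is missing or has the wrong length."""
    bad_indices = []
    for index, row in enumerate(rows):
        embedding = row.get(column)
        # Missing values (None, NaN floats) have no length, so flag them too.
        if not hasattr(embedding, "__len__") or len(embedding) != expected_dim:
            bad_indices.append(index)
    return bad_indices


# Toy rows with 4-dimensional embeddings; real rows would come from the data frame.
toy_rows = [
    {"img_desc": "bar chart", "mm_embedding_from_img_only": [0.1, 0.2, 0.3, 0.4]},
    {"img_desc": "table", "mm_embedding_from_img_only": [0.1, 0.2]},
    {"img_desc": "logo", "mm_embedding_from_img_only": None},
]

bad_rows = find_bad_embeddings(toy_rows, "mm_embedding_from_img_only", expected_dim=4)
print(f"Rows with missing or malformed embeddings: {bad_rows}")
```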

## Import Helper Functions

```/utils/intro_multimodal_rag_utils.py``` includes the helper functions. 

- ```get_similar_text_from_query()```: Given a text query, finds relevant text in the document using cosine similarity. It compares the query against the text embeddings in the metadata, and the results can be filtered by top score, page/chunk number, or embedding size.
- ```print_text_to_text_citation()```: Prints the source (citation) and details of the text retrieved by the ```get_similar_text_from_query()``` function.
- ```get_similar_image_from_query()```: Given an image path or an image, finds relevant images in the document. It uses the image embeddings from the metadata.
- ```print_text_to_image_citation()```: Prints the source (citation) and details of the images retrieved by the ```get_similar_image_from_query()``` function.
- ```get_gemini_response()```: Interacts with a Gemini model to answer questions based on a combination of text and image inputs.
- ```display_images()```: Displays a series of images provided as paths or PIL Image objects.
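The helpers above retrieve context and then pass it to Gemini. As a rough sketch of the step in between, the function below turns retrieved chunks into a prompt with numbered citations. Everything here is hypothetical: `build_grounded_prompt`, the dictionary keys `file_name`, `page_num`, and `chunk_text`, and the sample data are illustrative assumptions, not the helper module's actual interface.

```python
def build_grounded_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and a question into one citation-friendly prompt."""
    context_lines = []
    for number, chunk in enumerate(retrieved_chunks, start=1):
        source = f"{chunk['file_name']}, page {chunk['page_num']}"
        context_lines.append(f"[{number}] ({source}) {chunk['chunk_text']}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        'If the context is insufficient, answer "Unknown".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up retrieval results for illustration.
sample_chunks = [
    {"file_name": "report.pdf", "page_num": 3, "chunk_text": "Revenue grew in 2023."},
    {"file_name": "report.pdf", "page_num": 7, "chunk_text": "R&D spending increased."},
]

sample_prompt = build_grounded_prompt("How did revenue change?", sample_chunks)
print(sample_prompt)
```

Numbering the sources in the prompt makes it easier to map the model's citations back to file names and page numbers in the metadata.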

In [None]:
from intro_multimodal_rag_utils import (
    get_similar_text_from_query,
    print_text_to_text_citation,
    get_similar_image_from_query,
    print_text_to_image_citation,
    get_gemini_response,
    display_images,
)