### Step 1:- Install Vertex AI SDK for Python and other dependencies

In [None]:
%pip install -U -q google-cloud-aiplatform langchain langchain-community langchain-core langchain-google-vertexai langchain-text-splitters langsmith langchainhub langchain-experimental "unstructured[all-docs]" pypdf pydantic lxml pillow matplotlib opencv-python tiktoken

### Step 1.1:- Restart current runtime

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Step 1.2:- Authenticate your notebook environment (Colab only)

In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Step 1.3:- Define Google Cloud project information

In [None]:
PROJECT_ID = ""  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# For Vector Search Staging
GCS_BUCKET = ""  # @param {type:"string"}
GCS_BUCKET_URI = f"gs://{GCS_BUCKET}"
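
Both `PROJECT_ID` and `GCS_BUCKET` are left blank above, and an empty value tends to surface only later as an opaque API error. The following is a minimal sketch of an early check. `validate_gcp_config` is a hypothetical helper, not part of the Vertex AI SDK, and it only catches obviously unusable values.

```python
def validate_gcp_config(project_id: str, bucket: str) -> str:
    """Return the gs:// URI for the bucket, or raise if a value is missing.

    Hypothetical helper for illustration: it does not verify that the
    project or bucket actually exists.
    """
    if not project_id.strip():
        raise ValueError("PROJECT_ID is empty; set it to your Google Cloud project ID")
    if not bucket.strip():
        raise ValueError("GCS_BUCKET is empty; set it to a bucket name")
    if bucket.startswith("gs://"):
        raise ValueError("GCS_BUCKET should be the bare bucket name, without gs://")
    return f"gs://{bucket}"
```

In the notebook this could be used as `GCS_BUCKET_URI = validate_gcp_config(PROJECT_ID, GCS_BUCKET)`.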

### Step 1.4:- Initialize the Vertex AI SDK

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

### Step 2:- Import libraries

In [None]:
import base64
import os
import uuid
import re

from typing import List, Tuple

from IPython.display import display, Image, Markdown

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore

from langchain_core.documents import Document
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

from langchain_text_splitters import CharacterTextSplitter

from langchain_google_vertexai import (
    VertexAI,
    ChatVertexAI,
    VertexAIEmbeddings,
    VectorSearchVectorStore,
)

from unstructured.partition.pdf import partition_pdf

In [None]:
import os
from uuid import uuid4

unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Tracing Walkthrough - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = ""  # Update to your API key
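
With `LANGCHAIN_API_KEY` left empty, the cell above turns tracing on without valid credentials. As a sketch of one way to avoid that, the hypothetical helper `configure_langsmith` below enables tracing only when a key is supplied. The environment variable names are the ones used above; the `env` parameter is an assumption added so the function can be exercised against a plain dictionary.

```python
import os
from typing import MutableMapping


def configure_langsmith(
    api_key: str, env: MutableMapping[str, str] = os.environ
) -> bool:
    """Enable LangSmith tracing only when an API key is available.

    Hypothetical helper. Returns True if tracing was enabled.
    """
    if not api_key:
        # No credentials: switch tracing off rather than fail on every call.
        env["LANGCHAIN_TRACING_V2"] = "false"
        return False
    env["LANGCHAIN_TRACING_V2"] = "true"
    env["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    env["LANGCHAIN_API_KEY"] = api_key
    return True
```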

## Step 3:- Partition PDF tables, text, and images

In [None]:
pdf_folder_path = "/content/data/"
pdf_file_name = "/content/google-14k-merged.pdf"

PDF partitioning uses [Unstructured-io](https://unstructured-io.github.io/unstructured/introduction.html), which relies on the Poppler and Tesseract system packages installed below.

In [None]:
# Refresh the package index first, then install the PDF and OCR dependencies
!sudo apt-get update
!sudo apt-get install -y poppler-utils tesseract-ocr libtesseract-dev

In [None]:
# Extract images, tables, and chunk text from a PDF file.
raw_pdf_elements = partition_pdf(
    filename=pdf_file_name,
    strategy="hi_res",  # "hi_res" is required for table and image extraction
    extract_images_in_pdf=True,
    infer_table_structure=True,
    extract_image_block_types=["Image", "Table"],
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=pdf_folder_path,
)


In [None]:
raw_pdf_elements

[<unstructured.documents.elements.CompositeElement at 0x7af2e8daa1d0>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8daa350>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8da8280>,
 <unstructured.documents.elements.Table at 0x7af2e8dab310>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8daaf50>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8dab190>,
 <unstructured.documents.elements.Table at 0x7af2e8daa260>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8daa200>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8dabf40>,
 <unstructured.documents.elements.CompositeElement at 0x7af2e8da8130>]

In [None]:
# Categorize extracted elements from a PDF into tables and texts.
from unstructured.documents.elements import CompositeElement, Table

tables = []
texts = []
for element in raw_pdf_elements:
    if isinstance(element, Table):
        tables.append(str(element))
    elif isinstance(element, CompositeElement):
        texts.append(str(element))
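
Before splitting the elements into lists, it can help to see how many of each type the partitioner produced. The sketch below counts elements by class name; `count_element_types` is a hypothetical helper and works on any list of objects, so it makes no assumptions about the Unstructured API.

```python
from collections import Counter
from typing import Iterable


def count_element_types(elements: Iterable[object]) -> Counter:
    """Count elements by their class name (hypothetical helper)."""
    return Counter(type(element).__name__ for element in elements)
```

Calling `count_element_types(raw_pdf_elements)` on the ten elements listed earlier would report eight `CompositeElement` entries and two `Table` entries.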

In [None]:
tables

['Unvested Restricted Stock Units Weighted- Number of Grant Date Shares Fair Value Unvested as of December 31, 2020 19,288,793 $ 1,262.13 Granted 10,582,700 $ 1,949.16 Vested (11,209,486) $ 1,345.98 Forfeited/canceled (1,767,294) $ 1,425.48 Unvested as of December 31, 2021 16,894,713 $ 1,626.13',
 'Total Number of Approxi Shares lar Value of Purchased Shares that Total Number of Total Numberof A\\ Price Ai Price Partof Publicly Yet Be Purchased A Shares Inder Purchased Purchased ClassA Share Class C Share Programs Period (in thousands) "(in thousands) "’ bl _*___{inthousands)" __(in millions) _ October 1 - 31 126 1,445 $ 2,812.76 $ 2,794.72 1.571 § 26,450 November 1 - 30 289 1,393 $ 2,943.97 $ 2,956.73 1,682 $ 21,479 December 1 - 31 250 1,169 $ 2,880.79 $ 2,898.56 1,419 $ 17,371 Total 665 4,007 4,672']

In [None]:
# Optional: enforce a maximum token size per text chunk.
# Despite the variable name below, chunk_size here is 10,000 tokens, not 4k.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=10000, chunk_overlap=0
)
joined_texts = " ".join(texts)
texts_4k_token = text_splitter.split_text(joined_texts)

### Step 4:- Generate summaries

In [None]:
MODEL_NAME = "gemini-1.5-pro-preview-0514"

In [None]:
# Generate summaries of text elements
def generate_text_summaries(
    texts: List[str], tables: List[str], summarize_texts: bool = False
) -> Tuple[List, List]:
    """
    Summarize text and table elements for retrieval.

    texts: list of text chunks
    tables: list of table strings
    summarize_texts: if True, summarize the texts; otherwise return them unchanged
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """

    prompt = PromptTemplate.from_template(prompt_text)

    empty_response = RunnableLambda(
        lambda x: AIMessage(content="Error processing document")
    )

    # Text summary chain
    model = VertexAI(
        temperature=0, model_name=MODEL_NAME, max_output_tokens=1024
    ).with_fallbacks([empty_response])
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts:
        if summarize_texts:
            text_summaries = summarize_chain.batch(texts, {"max_concurrency": 1})
        else:
            text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 1})

    return text_summaries, table_summaries
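
The `with_fallbacks` call above substitutes a fixed message when the model call fails, so one bad element does not abort the batch and the output list stays aligned with the input list. The sketch below shows the same idea in plain Python. It is an analogy rather than LangChain's implementation, and `summarize_all` and its `summarize` argument are hypothetical names.

```python
from typing import Callable, List

FALLBACK_TEXT = "Error processing document"


def summarize_all(
    items: List[str],
    summarize: Callable[[str], str],
    fallback: str = FALLBACK_TEXT,
) -> List[str]:
    """Apply `summarize` to each item, using `fallback` when a call raises.

    The result always has the same length and order as `items`.
    """
    results = []
    for item in items:
        try:
            results.append(summarize(item))
        except Exception:
            # Keep position i of the output tied to position i of the input.
            results.append(fallback)
    return results
```

One consequence worth keeping in mind: fallback strings are embedded and indexed like real summaries, so it may be worth filtering them out before loading the vector store.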

In [None]:
# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts_4k_token, tables, summarize_texts=True
)

In [None]:
text_summaries

["This document from Alphabet Inc.'s 2021 10-K report details the company's financial performance, stock information, and the impact of COVID-19. It includes tables outlining revenue by geography, cost of revenues, net income per share calculations, stock-based awards, and deferred income taxes. The document also covers Alphabet's stock performance, dividend policy, share repurchases, and executive summaries of financial results. Key highlights include a 41% revenue increase driven by Google Services and Google Cloud, a 31% increase in cost of revenues, and a 20% increase in operating expenses. The report also acknowledges the significant impact of COVID-19 on advertising revenue and overall financial results. \n"]

### Step 4.1:- Generate Image Summary

In [None]:
def encode_image(image_path):
    """Return the file's contents as a base64-encoded string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

In [None]:
def image_summarize(img_base64, prompt):
    """Make image summary"""
    model = ChatVertexAI(model_name=MODEL_NAME, max_output_tokens=1024)

    msg = model.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content
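
The data URL above hard-codes `image/jpeg`, which matches the `.jpg` files handled in this notebook. If other formats were ever passed in, the MIME type could be derived from the file name instead. The sketch below uses the standard library's `mimetypes` module; `to_data_url` is a hypothetical helper and falls back to JPEG when the extension is unknown or not an image.

```python
import mimetypes


def to_data_url(
    img_base64: str, image_path: str, default_mime: str = "image/jpeg"
) -> str:
    """Build a data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        # Unknown or non-image extension: fall back to the notebook's default.
        mime = default_mime
    return f"data:{mime};base64,{img_base64}"
```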

In [None]:
def generate_img_summaries(path):
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval.
    If it's a table, extract all elements of the table.
    If it's a graph, explain the findings in the graph.
    Do not include any numbers that are not mentioned in the image.
    """

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries

In [None]:
# Image summaries
img_base64_list, image_summaries = generate_img_summaries("/content/figures")


In [None]:
img_base64_list

['/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBD... (base64 output truncated)

## Step 5:- Create & Deploy Vertex AI Vector Search Index & Endpoint

In [None]:
DIMENSIONS = 768  # Dimensions output from textembedding-gecko

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="mm_rag_langchain_index",
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="Multimodal RAG LangChain Index",
)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Creating MatchingEngineIndex
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Create MatchingEngineIndex backing LRO: projects/333878807818/locations/us-central1/indexes/7269570660622401536/operations/607177018374619136
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:MatchingEngineIndex created. Resource name: projects/333878807818/locations/us-central1/indexes/7269570660622401536
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:To use this MatchingEngineIndex in another session:
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:index = aiplatform.MatchingEngineIndex('projects/333878807818/locations/us-central1/indexes/7269570660622401536')
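
The index only accepts vectors whose length equals `DIMENSIONS`, so a mismatch between the index and the embedding model would show up later as failed upserts. The following sketch checks vector lengths up front. `check_embedding_dimensions` is a hypothetical helper that operates on plain lists of floats.

```python
from typing import Sequence


def check_embedding_dimensions(
    vectors: Sequence[Sequence[float]], expected: int
) -> None:
    """Raise ValueError if any vector's length differs from `expected`."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, expected {expected}"
            )
```

Assuming an embeddings object that follows LangChain's `Embeddings` interface, one possible use is `check_embedding_dimensions(embeddings.embed_documents(["probe"]), DIMENSIONS)` before creating the index.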


In [None]:
DEPLOYED_INDEX_ID = "mm_rag_langchain_index_endpoint"

index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DEPLOYED_INDEX_ID,
    description="Multimodal RAG LangChain Index Endpoint",
    public_endpoint_enabled=True,
)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Creating MatchingEngineIndexEndpoint
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Create MatchingEngineIndexEndpoint backing LRO: projects/333878807818/locations/us-central1/indexEndpoints/6197713949308223488/operations/7347939760641409024
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:MatchingEngineIndexEndpoint created. Resource name: projects/333878807818/locations/us-central1/indexEndpoints/6197713949308223488
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:To use this MatchingEngineIndexEndpoint in another session:
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/333878807818/locations/us-central1/indexEndpoints/6197713949308223488')


In [None]:
index_endpoint = index_endpoint.deploy_index(
    index=index, deployed_index_id="mm_rag_langchain_deployed_index"
)
index_endpoint.deployed_indexes

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint:Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/333878807818/locations/us-central1/indexEndpoints/6197713949308223488


AlreadyExists: 409 There already exists a DeployedIndex with same ID "mm_rag_langchain_deployed_index" deployed or being deployed at the following IndexEndpoint: projects/333878807818/locations/us-central1/indexEndpoints/3749444601878937600. Please use a different ID.
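
The `AlreadyExists` error above is raised because the deployed index ID was already used by an earlier run. One way to avoid the collision is to add a short random suffix to the ID. This is a sketch only: `make_deployed_index_id` is a hypothetical helper, and the naming rule it checks (starts with a letter, then letters, digits, or underscores) is an assumption that should be confirmed against the Vertex AI documentation.

```python
import re
import uuid

# Assumed naming rule; verify against the Vertex AI Vector Search docs.
_ID_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_deployed_index_id(prefix: str = "mm_rag_langchain_deployed_index") -> str:
    """Return `prefix` plus an 8-character random hex suffix."""
    candidate = f"{prefix}_{uuid.uuid4().hex[:8]}"
    if not _ID_PATTERN.match(candidate):
        raise ValueError(f"invalid deployed index ID: {candidate!r}")
    return candidate
```

The result could then be passed as `deployed_index_id=make_deployed_index_id()` in the `deploy_index` call.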

## Step 6:- Create retriever & load documents

In [None]:
vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index.name,
    endpoint_id=index_endpoint.name,
    embedding=VertexAIEmbeddings(model_name="textembedding-gecko@003"),
)

ValueError: No index with id projects/333878807818/locations/us-central1/indexes/7269570660622401536 deployed on endpoint mm_rag_langchain_index_endpoint.

In [None]:
docstore = InMemoryStore()

id_key = "doc_id"
# Create the multi-vector retriever
retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

In [None]:
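# Raw document contents, in the same order and number as the summaries below.
# The text summaries were generated from texts_4k_token, so that list is used here.
doc_contents = texts_4k_token + tables + img_base64_list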

doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries + table_summaries + image_summaries)
]

retriever_multi_vector_img.docstore.mset(list(zip(doc_ids, doc_contents)))

# If using Vertex AI Vector Search, this will take a while to complete.
# You can cancel this cell and continue later.
retriever_multi_vector_img.vectorstore.add_documents(summary_docs)

INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Updating MatchingEngineIndex index: projects/333878807818/locations/us-central1/indexes/622257610623549440
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:Update MatchingEngineIndex index backing LRO: projects/333878807818/locations/us-central1/indexes/622257610623549440/operations/1111474623523848192
INFO:google.cloud.aiplatform.matching_engine.matching_engine_index:MatchingEngineIndex index Updated. Resource name: projects/333878807818/locations/us-central1/indexes/622257610623549440


['7736af10-adb1-4b0e-b45a-5adb6a847e6b',
 '0ca872f1-7fc0-43d6-bb78-60ccb9c33d4a',
 'a4025448-f69a-4f29-b609-36978974aeb4',
 'd16ee2da-303b-43a5-b13d-aa1485cbef00',
 '588dde30-ef35-45dc-98d3-d97445252fbb',
 'e21fca03-97bb-44fe-9924-d7ecd1bd3629',
 '3a4ae875-9fbc-4ed3-a061-bcb90783eef7',
 '16f37515-6fa3-439e-8a2a-fc04fd8d9bf5',
 '64eea260-d762-4e8a-963c-37da01a7005a',
 '2a632bc6-b385-4aa6-b83f-daae4f22074a',
 '73cb1fef-3d60-445b-b420-dc38b12d52bb',
 '0a61fa12-d738-43e7-864a-0cd9f53f7efb',
 '4e572140-2494-4d07-9d17-474a6ea4c527',
 '3bc5e2b6-e2d0-4a67-b5cd-56ec557b7a86',
 '82499d42-4d0f-40d7-b88a-a49bee3caa29',
 '03aa23d6-5f8e-4d55-8fee-61214a62d48b']
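
The multi-vector pattern depends on each summary carrying the ID of the raw content it was generated from: the summaries are embedded and searched, while the document store returns the raw text, table, or image for the matching ID. The sketch below reproduces that pairing with plain dictionaries and fails early if the two lists are not the same length. `build_multi_vector_records` is a hypothetical helper, not a LangChain function.

```python
import uuid
from typing import Dict, List, Tuple


def build_multi_vector_records(
    summaries: List[str], raw_contents: List[str], id_key: str = "doc_id"
) -> Tuple[List[dict], Dict[str, str]]:
    """Pair each summary with its raw content through a shared ID.

    Returns (summary_records, docstore). Raises ValueError when the lists
    differ in length, because summary i would point at the wrong content.
    """
    if len(summaries) != len(raw_contents):
        raise ValueError(
            f"{len(summaries)} summaries but {len(raw_contents)} raw documents"
        )
    doc_ids = [str(uuid.uuid4()) for _ in raw_contents]
    summary_records = [
        {"page_content": summary, "metadata": {id_key: doc_id}}
        for summary, doc_id in zip(summaries, doc_ids)
    ]
    docstore = dict(zip(doc_ids, raw_contents))
    return summary_records, docstore
```

Looking up `docstore[record["metadata"]["doc_id"]]` for any summary record then yields the raw content that the summary describes.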

## Step 7:- Create Chain with Retriever and Gemini LLM

In [None]:
def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match(r"^[A-Za-z0-9+/]+={0,2}$", sb) is not None


def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xFF\xD8\xFF": "jpg",
        b"\x89\x50\x4E\x47\x0D\x0A\x1A\x0A": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        return any(header.startswith(sig) for sig in image_signatures)
    except Exception:
        return False


def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}
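
The signature check above decodes the whole string just to read its first bytes, and it discards the format name it finds. As a sketch of a lighter variant, the hypothetical function `detect_image_format` below decodes only the first 16 base64 characters (12 bytes) and returns the format label from the same signature table, or `None` if nothing matches. Because it looks only at the prefix, it does not validate the rest of the string.

```python
import base64
import binascii
from typing import Optional

# Same magic-byte signatures as in the notebook's is_image_data.
IMAGE_SIGNATURES = {
    b"\xFF\xD8\xFF": "jpg",
    b"\x89\x50\x4E\x47\x0D\x0A\x1A\x0A": "png",
    b"\x47\x49\x46\x38": "gif",
    b"\x52\x49\x46\x46": "webp",
}


def detect_image_format(b64data: str) -> Optional[str]:
    """Return the image format implied by the leading bytes, or None."""
    try:
        # 16 base64 characters decode to 12 bytes, enough for every signature.
        header = base64.b64decode(b64data[:16], validate=True)
    except (binascii.Error, ValueError):
        return None
    for signature, name in IMAGE_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None
```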

In [None]:
def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = [
        {
            "type": "text",
            "text": (
                "You are a financial analyst tasked with providing investment advice.\n"
                "You will be given a mix of text, tables, and image(s) usually of charts or graphs.\n"
                "Use this information to provide investment advice related to the user's question. \n"
                f"User-provided question: {data_dict['question']}\n\n"
                "Text and / or tables:\n"
                f"{formatted_texts}"
            ),
        }
    ]

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            messages.append(
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            )
    return [HumanMessage(content=messages)]

In [None]:
# Create RAG chain
chain_multimodal_rag = (
    {
        "context": retriever_multi_vector_img | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(img_prompt_func)
    | ChatVertexAI(
        temperature=0, model_name=MODEL_NAME, max_output_tokens=1024
    )  # Multi-modal LLM
    | StrOutputParser()
)

In [None]:
query = """
 - What are the critical differences between the various graphs for Class A shares?
 - Which index that does not already include Google most closely matches Class A share performance? Explain the reasoning.
 - Identify key chart patterns for Google Class A shares.
 - What are the cost of revenues, operating expenses, and net income for 2020? Mention the percentage change.
 - What was the effect of COVID-19 in the 2020 financial year?
 - What are the total revenues for APAC and USA for 2021?
 - What are deferred income taxes?
 - How do you compute net income per share?
 - What drove the percentage change in consolidated revenue and cost of revenue for 2021, and was there any effect of COVID-19?
 - What caused the 41% increase in revenue from 2020 to 2021, and what is the dollar change?
"""

In [None]:
# List of source documents
docs = retriever_multi_vector_img.invoke(query)

source_docs = split_image_text_types(docs)

print(source_docs["texts"])

for i in source_docs["images"]:
    display(Image(base64.b64decode(i)))


[]
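
The empty list above means the retriever returned no context, in which case the chain sends the question to the model with nothing to ground it. A small guard can make that situation explicit before the model is called. The sketch below is one possible approach; `answer_if_context` is a hypothetical helper that expects the dictionary returned by `split_image_text_types` and a zero-argument callable that runs the chain.

```python
from typing import Callable, Dict, List

NO_CONTEXT_MESSAGE = (
    "No context was retrieved; check that the index is deployed and populated."
)


def answer_if_context(
    sources: Dict[str, List[str]], run_chain: Callable[[], str]
) -> str:
    """Run the chain only when at least one text or image was retrieved."""
    if not sources.get("texts") and not sources.get("images"):
        return NO_CONTEXT_MESSAGE
    return run_chain()
```

In this notebook it could be called as `answer_if_context(source_docs, lambda: chain_multimodal_rag.invoke(query))`.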


In [None]:
result = chain_multimodal_rag.invoke(query)

Markdown(result)

Please provide me with the text, tables, and images (charts or graphs) related to Google Class A shares. I need this information to answer your questions accurately and provide you with the best possible investment advice. 

For example, to compare the performance of Class A shares to an index, I need to see a chart of the share price over time. Similarly, to identify key chart patterns, I need to see a candlestick chart or a line chart of the share price. 

Once you provide me with the necessary data, I can:

* Analyze the different graphs for Class A shares and explain their key differences.
* Identify an index that closely matches the performance of Class A shares (excluding Google).
* Point out key chart patterns for Google Class A shares.
* Find the cost of revenues, operating expenses, and net income for 2020 and calculate the percentage change.
* Assess the impact of COVID-19 on the 2020 financial year.
* Calculate the total revenues for APAC and the USA in 2021.
* Explain deferred income taxes.
* Explain how net income per share is calculated.
* Analyze the factors driving the percentage change in consolidated revenue and cost of revenue for 2021, including any COVID-related effects.
* Determine the cause of the 41% revenue increase from 2020 to 2021 and calculate the dollar change. 

Please provide the necessary data so I can help you. 
