# Multimodal Parsing with Gemini 2.0 Flash

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/gemini2_flash.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows you how to use LlamaParse to parse any document with the multimodal capabilities of Gemini 2.0 Flash.

LlamaParse lets you plug in external multimodal model vendors for parsing, while we handle the error correction, validation, and scalability/reliability for you.


## Setup

Download the data - we'll use a technical datasheet for a programmable logic device (Xilinx's XC9500 In-System Programmable CPLD).

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [2]:
!mkdir -p data
!wget "https://media.digikey.com/pdf/Data%20Sheets/AMD/XC9500_CPLD_Family.pdf" -O data/XC9500_CPLD_Family.pdf

--2025-02-16 23:23:18--  https://media.digikey.com/pdf/Data%20Sheets/AMD/XC9500_CPLD_Family.pdf
Resolving media.digikey.com (media.digikey.com)... 23.197.33.140
Connecting to media.digikey.com (media.digikey.com)|23.197.33.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 201899 (197K) [application/pdf]
Saving to: ‘data/XC9500_CPLD_Family.pdf’


2025-02-16 23:23:18 (3.53 MB/s) - ‘data/XC9500_CPLD_Family.pdf’ saved [201899/201899]



## Initialize LlamaParse

Initialize LlamaParse in multimodal mode, and specify the vendor as `gemini-2.0-flash-001`.

**NOTE**: Current pricing is 2 credits per page ($0.006 USD per page). This covers the core model, infrastructure, and algorithm costs to fully process the page.

In [3]:
from llama_index.core.schema import TextNode
from typing import List
import json


def get_text_nodes(json_list: List[dict]) -> List[TextNode]:
    """Wrap each parsed page's markdown in a TextNode tagged with its page number."""
    return [
        TextNode(text=page["md"], metadata={"page": page["page"]})
        for page in json_list
    ]


def save_jsonl(data_list, filename):
    """Save a list of dictionaries as JSON Lines."""
    with open(filename, "w") as file:
        for item in data_list:
            json.dump(item, file)
            file.write("\n")


def load_jsonl(filename):
    """Load a list of dictionaries from JSON Lines."""
    data_list = []
    with open(filename, "r") as file:
        for line in file:
            data_list.append(json.loads(line))
    return data_list
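
As a quick sanity check, the round trip through these helpers can be sketched as follows. This is a minimal, self-contained sketch: it restates the two helpers compactly, and the sample records and temporary file name are made up for illustration.

```python
import json
import os
import tempfile


def save_jsonl(data_list, filename):
    """Write one JSON object per line."""
    with open(filename, "w") as file:
        for item in data_list:
            file.write(json.dumps(item) + "\n")


def load_jsonl(filename):
    """Read one JSON object per line back into a list."""
    with open(filename, "r") as file:
        return [json.loads(line) for line in file]


# Hypothetical page records shaped like the parser output ("page" and "md" keys).
pages = [
    {"page": 1, "md": "# Title\n\nFirst page."},
    {"page": 2, "md": "| a | b |\n|---|---|\n| 1 | 2 |"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pages.jsonl")
    save_jsonl(pages, path)
    restored = load_jsonl(path)

# Newlines inside "md" are escaped by json.dumps, so each record stays on one line.
print(restored == pages)
```

Because each record occupies exactly one line, the file can also be appended to or streamed page by page.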

In [None]:
# from llama_parse import LlamaParse

# parsing_instruction = """
# You are given a technical datasheet of an electronic component.
# For any graphs, try to create a 2D table of relevant values, along with a description of the graph.
# For any schematic diagrams, MAKE SURE to describe a list of all components and their connections to each other.
# Make sure that you always parse out the text with the correct reading order.
# """

# parser = LlamaParse(
#     result_type="markdown",
#     use_vendor_multimodal_model=True,
#     vendor_multimodal_model_name="gemini-2.0-flash-001",
#     invalidate_cache=True,
#     parsing_instruction=parsing_instruction,
#     api_key=llama_index_key
# )
# json_objs = parser.get_json_result("./data/XC9500_CPLD_Family.pdf")
# print(json_objs)
# json_list = json_objs[0]["pages"]
# docs = get_text_nodes(json_list)

In [5]:
import pymupdf
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


def extract_high_res_image(pdf_path, page_number, dpi=300, output_path="output.png"):
    """Render one PDF page (0-based index) to a PNG at the requested DPI."""
    with pymupdf.open(pdf_path) as doc:  # Close the PDF when done
        page = doc[page_number]  # Select the page
        zoom = dpi / 72  # PDF units are 72 per inch, e.g. 300 DPI => 300/72 = 4.17
        mat = pymupdf.Matrix(zoom, zoom)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        pix.save(output_path)
    print(f"Saved high-res image at {output_path}")

class SugoojiParse:
    def __init__(self, prompt):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
        self.prompt = prompt

    def get_json_result(self, filename, pages):
        d = {}
        d["pages"] = []
        for page_number in pages:  # iterate through the pages
            subd = {}
            subd["page"] = page_number
            extract_high_res_image(pdf_path=f"{filename}/{filename}.pdf",
                                page_number=page_number,
                                dpi=108,
                                output_path=f"{filename}/{filename}-{page_number}.png")
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": f"{filename}/{filename}-{page_number}.png",
                        },
                        {"type": "text", "text": self.prompt},
                    ],
                }
            ]

            text = self.processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = self.processor(
                text=[text],
                images=image_inputs,
                padding=True,
                return_tensors="pt",
            )
            # Follow the device chosen by device_map="auto" instead of hard-coding "cuda"
            inputs = inputs.to(self.model.device)

            # Inference: Generation of the output
            generated_ids = self.model.generate(**inputs, max_new_tokens=1024)
            generated_ids_trimmed = [
                out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
            ]
            output_text = self.processor.batch_decode(
                generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
            )
            # print(output_text)
            subd["md"] = output_text[0]
            d["pages"].append(subd)
        return d
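
To make the DPI-to-zoom arithmetic concrete, here is a small sketch of how the rendered image size follows from the page size. PDF user space has 72 points per inch, so a zoom of `dpi / 72` scales points to pixels. The helper name `rendered_size` and the US Letter page size are assumptions for illustration, and PyMuPDF's own rounding of the final pixmap may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    zoom = dpi / POINTS_PER_INCH
    return round(width_pt * zoom), round(height_pt * zoom)


# US Letter is 612 x 792 points (8.5 x 11 inches); assumed here for illustration.
low = rendered_size(612, 792, dpi=108)   # zoom 1.5, the value passed in get_json_result
high = rendered_size(612, 792, dpi=300)  # zoom ~4.17, the function's default
print(low, high)
```

Rendering at 108 DPI instead of 300 DPI shrinks the image by a factor of about 7.7 in pixel count, which reduces the amount of image data the model has to process at the cost of fine detail.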



In [6]:
parsing_instruction = "Please convert the MAIN table of the image to markdown table format. MAKE SURE all cells are accounted for. Only output the table"
parser = SugoojiParse(prompt=parsing_instruction)
json_objs = parser.get_json_result("2025Q1", [4])
print(json_objs)
json_list = json_objs["pages"]
docs = get_text_nodes(json_list)

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.20it/s]


Saved high-res image at 2025Q1/2025Q1-4.png
{'pages': [{'page': 4, 'md': "| Unaudited | Share capital RM'000 | Translation reserve RM'000 | Employee share-based reserve RM'000 | Put option reserve RM'000 | Retained earnings RM'000 | Total RM'000 | Non-controlling interests RM'000 | Total equity RM'000 |\n|---|---|---|---|---|---|---|---|---|\n| At 1 April 2024 | 399,555 | 186 | 5,615 | (36,955) | 298,012 | 666,413 | 26,558 | 692,971 |\n| Foreign currency translation differences for foreign operations/ Total other comprehensive income for the period | -- | 1,241 | -- | -- | -- | 1,241 | 69 | 1,310 |\n| Profit for the period | -- | -- | -- | -- | 25,995 | 25,995 | 516 | 26,511 |\n| Total comprehensive income for the period | -- | 1,241 | -- | -- | 25,995 | 27,236 | 585 | 27,821 |\n| Contributions by and distributions to owners of the Company | | | | | | | | |\n| Issue of shares pursuant to ESOS | 1,340 | -- | (230) | -- | -- | 1,110 | -- | 1,110 |\n| Share-based payment | -- | -- | 261 |

In [7]:
# Optional: Save
save_jsonl([d.dict() for d in docs], "docs_gemini_2.0_flash.jsonl")

In [8]:
# Optional: Load
from llama_index.core import Document

docs_dicts = load_jsonl("docs_gemini_2.0_flash.jsonl")
docs = [Document.model_validate(d) for d in docs_dicts]


### Setup GPT-4o baseline

For comparison, we will also parse the document using GPT-4o ($0.03 per page).
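
For a sense of scale, the per-page prices quoted in this notebook ($0.006 for Gemini 2.0 Flash, $0.03 for GPT-4o) can be turned into a rough per-document estimate. This is a back-of-the-envelope sketch: the 16-page document length is a made-up example, and actual billing may differ.

```python
from decimal import Decimal

# Per-page prices in USD as quoted in this notebook.
PRICE_PER_PAGE = {
    "gemini-2.0-flash-001": Decimal("0.006"),
    "openai-gpt4o": Decimal("0.03"),
}


def estimate_cost(model_name, num_pages):
    """Return the estimated parsing cost in USD for a document."""
    return PRICE_PER_PAGE[model_name] * num_pages


num_pages = 16  # hypothetical document length
gemini_cost = estimate_cost("gemini-2.0-flash-001", num_pages)
gpt4o_cost = estimate_cost("openai-gpt4o", num_pages)
print(f"Gemini 2.0 Flash: ${gemini_cost}, GPT-4o: ${gpt4o_cost}")
print(f"GPT-4o costs {gpt4o_cost / gemini_cost}x as much")
```

`Decimal` is used instead of floats so that the small per-page prices multiply without binary rounding noise.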

In [None]:
from llama_parse import LlamaParse

parser_gpt4o = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    invalidate_cache=True,
    parsing_instruction=parsing_instruction,
)
json_objs_gpt4o = parser_gpt4o.get_json_result("./data/XC9500_CPLD_Family.pdf")
json_list_gpt4o = json_objs_gpt4o[0]["pages"]
docs_gpt4o = get_text_nodes(json_list_gpt4o)

Started parsing the file under job_id 23c6627c-2e3d-46c9-88a0-7945d7e65d96


In [None]:
# Optional: Save
save_jsonl([d.dict() for d in docs_gpt4o], "docs_gpt4o.jsonl")

In [None]:
# Optional: Load
from llama_index.core import Document

docs_gpt4o_dicts = load_jsonl("docs_gpt4o.jsonl")
docs_gpt4o = [Document.model_validate(d) for d in docs_gpt4o_dicts]

## View Results

Let's compare the results from GPT-4o and Gemini 2.0 Flash alongside the original document page.

Check out an example page (page 3) below.

![xc9500_img](XC9500_CPLD_Family_p3.png)

We see that the parsed text is fairly similar between Gemini 2.0 Flash and GPT-4o. 
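
One rough way to put a number on "fairly similar" is a character-level similarity ratio from Python's standard `difflib` module. The sketch below uses two short made-up strings; in the notebook you could pass the page contents from the two parsers instead. A ratio near 1.0 only means the texts overlap heavily, not that either parse is correct.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Return a similarity ratio in [0, 1] between two parsed texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()


# Made-up snippets standing in for two parsers' output of the same page.
parse_a = "Each output has independent slew rate control."
parse_b = "Each output has an independent slew-rate control."

print(round(similarity(parse_a, parse_b), 3))
```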

In [11]:
# using Gemini 2.0 Flash
print(docs[0].get_content(metadata_mode="all"))

page: 4

| Unaudited | Share capital RM'000 | Translation reserve RM'000 | Employee share-based reserve RM'000 | Put option reserve RM'000 | Retained earnings RM'000 | Total RM'000 | Non-controlling interests RM'000 | Total equity RM'000 |
|---|---|---|---|---|---|---|---|---|
| At 1 April 2024 | 399,555 | 186 | 5,615 | (36,955) | 298,012 | 666,413 | 26,558 | 692,971 |
| Foreign currency translation differences for foreign operations/ Total other comprehensive income for the period | -- | 1,241 | -- | -- | -- | 1,241 | 69 | 1,310 |
| Profit for the period | -- | -- | -- | -- | 25,995 | 25,995 | 516 | 26,511 |
| Total comprehensive income for the period | -- | 1,241 | -- | -- | 25,995 | 27,236 | 585 | 27,821 |
| Contributions by and distributions to owners of the Company | | | | | | | | |
| Issue of shares pursuant to ESOS | 1,340 | -- | (230) | -- | -- | 1,110 | -- | 1,110 |
| Share-based payment | -- | -- | 261 | -- | -- | 261 | -- | 261 |
| Changes in put option liability | -- | -- |
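
The last row above is cut off. Whatever the cause (for example, hitting a generation limit), a quick structural check can flag such rows: every row of a well-formed markdown table should have as many cells as the header. The helper below is a minimal sketch; the name `find_malformed_rows` is hypothetical, and it assumes cells contain no escaped `|` characters.

```python
def split_row(line):
    """Split one markdown table row into its cell strings."""
    inner = line.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in inner.split("|")]


def find_malformed_rows(markdown_table):
    """Return (line_index, cell_count) for rows whose width differs from the header."""
    lines = [ln for ln in markdown_table.splitlines() if ln.strip()]
    expected = len(split_row(lines[0]))
    return [
        (i, len(split_row(ln)))
        for i, ln in enumerate(lines)
        if len(split_row(ln)) != expected
    ]


# Small made-up table whose last row is truncated.
table = """| Item | Q1 | Q2 |
|---|---|---|
| Revenue | 10 | 12 |
| Cost | 4 |"""

print(find_malformed_rows(table))
```

An empty result means every row matches the header width; any entries point at rows worth re-parsing or regenerating.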

In [None]:
# using GPT-4o
print(docs_gpt4o[2].get_content(metadata_mode="all"))

page: 3

The diagram illustrates the architecture of the XC9500 In-System Programmable CPLD Family. Here's a breakdown of the components and their connections:

1. **JTAG Port**: 
   - Connects to the JTAG Controller.

2. **JTAG Controller**: 
   - Interfaces with the In-System Programming Controller.

3. **In-System Programming Controller**: 
   - Manages programming of the device.

4. **I/O Blocks**: 
   - Connect to external I/O pins.
   - Interface with the Fast CONNECT Switch Matrix.

5. **Fast CONNECT Switch Matrix**: 
   - Connects I/O Blocks to Function Blocks.
   - Provides 36 inputs and 18 outputs to each Function Block.

6. **Function Blocks (FB)**: 
   - Each block contains 18 macrocells.
   - Capable of implementing combinatorial or registered functions.
   - Receives global clock, output enable, and set/reset signals.
   - Outputs drive the Fast CONNECT Switch Matrix.
   - Supports local feedback paths for fast counters and state machines.

7. **I/O/GCK, I/O/GSR, I/O/GTS*

## Setup RAG Pipeline

Let's set up a RAG pipeline over this data.

(We also use `gpt-4o-mini` for the actual text synthesis step.)

In [None]:
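from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Use gpt-4o-mini for synthesis and text-embedding-3-large for retrieval
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")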

In [21]:
from llama_index.core import VectorStoreIndex

# Build one index and query engine per parser so the answers can be compared
index = VectorStoreIndex(docs)
query_engine = index.as_query_engine(similarity_top_k=5)

index_gpt4o = VectorStoreIndex(docs_gpt4o)
query_engine_gpt4o = index_gpt4o.as_query_engine(similarity_top_k=5)

Retrying llama_index.embeddings.openai.base.OpenAIEmbedding._get_text_embeddings.<locals>._retryable_get_embeddings in 1.0 seconds as it raised APIConnectionError: Connection error..


APIConnectionError: Connection error.
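
Conceptually, `similarity_top_k=5` asks the retriever to return the five chunks whose embeddings are most similar to the query embedding. The sketch below illustrates that idea with plain cosine similarity over tiny made-up vectors. It is not LlamaIndex's implementation, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k most similar document vectors, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings"; real embedding vectors have far more dimensions.
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [0.9, 0.1]

print(top_k(query_vec, doc_vecs, k=2))
```

A larger `k` gives the synthesis step more context to draw on, at the cost of longer prompts and a higher chance of including loosely related chunks.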

In [None]:
query = "Give me the full output slew-Rate curve for (a) Rising and (b) Falling Outputs"

response = query_engine.query(query)
response_gpt4o = query_engine_gpt4o.query(query)

In [None]:
print(response)

The full output slew-rate curve for (a) Rising and (b) Falling Outputs is represented in a graph where the output voltage starts at 1.5V and reaches the desired output level over a time period defined as T<sub>SLEW</sub>. The curve illustrates the gradual increase in voltage for rising outputs and the gradual decrease for falling outputs, effectively showing how the output edge rates can be controlled to reduce system noise.


In [None]:
print(response.source_nodes[0].get_content())

# XC9500 In-System Programmable CPLD Family

Each output has independent slew rate control. Output edge rates may be slowed down to reduce system noise (with an additional time delay of T<sub>SLEW</sub>) through programming. See Figure 11.

Each IOB provides user programmable ground pin capability. This allows device I/O pins to be configured as additional ground pins. By tying strategically located programmable ground pins to the external ground connection, system noise generated from large numbers of simultaneous switching outputs may be reduced.

A control pull-up resistor (typically 10K ohms) is attached to each device I/O pin to prevent them from floating when the device is not in normal user operation. This resistor is active during device programming mode and system power-up. It is also activated for an erased device. The resistor is deactivated during normal operation.

The output driver is capable of supplying 24 mA output drive. All output drivers in the device may be configu

In [None]:
print(response_gpt4o)

The output slew-rate curve for (a) Rising and (b) Falling Outputs is represented in a timing diagram where the output voltage transitions from a low state to a high state and vice versa. 

For the rising output, the curve starts at 1.5V and transitions to the desired output voltage level over a time period defined as T<sub>SLEW</sub>. 

For the falling output, the curve similarly begins at the high output voltage and decreases to a low state, also taking the time defined as T<sub>SLEW</sub> to complete the transition.

The specific values and graphical representation would typically be illustrated in a figure, but the key takeaway is that the output slew rate can be controlled to manage system noise by programming the desired T<sub>SLEW</sub> time.


In [None]:
print(response_gpt4o.source_nodes[0].get_content())

# XC9500 In-System Programmable CPLD Family

Each output has independent slew rate control. Output edge rates may be slowed down to reduce system noise (with an additional time delay of T<sub>SLEW</sub>) through programming. See Figure 11.

Each IOB provides user programmable ground pin capability. This allows device I/O pins to be configured as additional ground pins. By tying strategically located programmable ground pins to the external ground connection, system noise generated from large numbers of simultaneous switching outputs may be reduced.

A control pull-up resistor (typically 10K ohms) is attached to each device I/O pin to prevent them from floating when the device is not in normal user operation. This resistor is active during device programming mode and system power-up. It is also activated for an erased device. The resistor is deactivated during normal operation.

The output driver is capable of supplying 24 mA output drive. All output drivers in the device may be configu