# LlamaParse JSON Mode + Multimodal RAG

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/demo_json.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows you how to use LlamaParse JSON mode with LlamaIndex to build a simple multimodal RAG pipeline.

JSON mode returns a list of JSON dictionaries that contain both text and images. You can then download the images, use a multimodal model to extract information from them, and index the results.
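To make the shape of that output concrete, here is a minimal sketch that walks a hand-written sample shaped like a JSON-mode result. The key names (`pages`, `page`, `text`, `images`, `name`) are assumptions about the payload layout, `sample_result` is made-up data, and `split_text_and_images` is a hypothetical helper, so check all of them against what your parser version actually returns.

```python
# Hand-written stand-in for a JSON-mode result: one dict per parsed file.
# The key names below are assumptions, not a guaranteed schema.
sample_result = [
    {
        "pages": [
            {"page": 1, "text": "Cover page", "images": []},
            {
                "page": 2,
                "text": "Quarterly highlights",
                "images": [{"name": "img_p1_1.png"}],
            },
        ]
    }
]


def split_text_and_images(json_result):
    """Return (page_texts, image_refs) collected from every page."""
    page_texts = []
    image_refs = []
    for document in json_result:
        for page in document.get("pages", []):
            page_texts.append((page["page"], page.get("text", "")))
            for image in page.get("images", []):
                image_refs.append((page["page"], image["name"]))
    return page_texts, image_refs


page_texts, image_refs = split_text_and_images(sample_result)
print(page_texts)
print(image_refs)
```

Keeping the page number next to each image reference makes it easy to trace an image-derived answer back to the page it came from.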

Status:
| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-19-2025   | 0.6.61  | Maintained |

## Setup

Install the dependencies, set the API keys as environment variables, and configure the global LLM and embedding model.

In [1]:
%pip install llama-index
%pip install "llama-index-core>=0.13.2,<0.14.0"
%pip install "llama-index-llms-anthropic>=0.8.4,<0.9.0"
%pip install "llama-index-embeddings-huggingface>=0.6.0,<0.7.0"
%pip install llama-cloud-services
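A failed install only surfaces later as an import error, so it can help to confirm that the packages are present before moving on. The following is a minimal sketch using the standard library's `importlib.metadata`; `installed_version` and `report_versions` are hypothetical helper names, and the distribution list simply mirrors the install cell.

```python
from importlib import metadata

# Distribution names taken from the install cell.
REQUIRED = [
    "llama-index-core",
    "llama-index-llms-anthropic",
    "llama-index-embeddings-huggingface",
    "llama-cloud-services",
]


def installed_version(distribution):
    """Return the installed version string, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report_versions(distributions):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in distributions}


for name, version in report_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any line that prints `NOT INSTALLED` points at an install command that needs to be rerun.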


In [2]:
import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

# Using Anthropic API for LLMs
os.environ["ANTHROPIC_API_KEY"] = "sk-..."
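The values in this cell are placeholders and must be replaced with real keys. As a sketch of how to fail early when that step is skipped, the hypothetical helper `require_env` below raises a clear error for missing or placeholder values. Treating any value that ends in `...` as a placeholder is an assumption made for this notebook, not a rule of either API.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    # Treat the "llx-..." / "sk-..." style placeholders as unset.
    if not value or value.endswith("..."):
        raise RuntimeError(f"{name} is not set to a real key")
    return value


# Example with an explicit mapping so the check runs without real keys.
fake_env = {"LLAMA_CLOUD_API_KEY": "llx-example", "ANTHROPIC_API_KEY": "sk-..."}
print(require_env("LLAMA_CLOUD_API_KEY", fake_env))
try:
    require_env("ANTHROPIC_API_KEY", fake_env)
except RuntimeError as err:
    print(err)
```

In the notebook itself you would call `require_env("LLAMA_CLOUD_API_KEY")` with no second argument so that it reads from `os.environ`.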

In [3]:
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-sonnet-4-20250514")

In [None]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = "local:Qwen/Qwen3-Embedding-0.6B"

## Load Data

Let's download the Uber 10-Q report for March 2022.

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10q/uber_10q_march_2022.pdf' -O './uber_10q_march_2022.pdf'
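If `wget` is not available, or you want to avoid downloading the file again on every run, a pure-Python alternative is possible. This is a minimal sketch built on `urllib.request.urlretrieve`; `download_if_missing` is a hypothetical helper, and skipping the download whenever a non-empty file already exists is a design assumption rather than part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    # Reuse a previous download instead of fetching the file again.
    if dest.exists() and dest.stat().st_size > 0:
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest
```

Calling `download_if_missing(url, "./uber_10q_march_2022.pdf")` with the URL from the `wget` command would fetch the same file.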

## Using LlamaParse in JSON Mode for PDF Reading

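Here we parse the PDF with LlamaParse. The parser below uses the agent-based page parse mode with high-resolution OCR and returns tables as HTML; the result object then exposes the parsed pages as text nodes and image nodes.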

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
)

result = await parser.aparse("./uber_10q_march_2022.pdf")

In [None]:
text_nodes = await result.aget_text_nodes(split_by_page=True)
image_nodes = await result.aget_image_nodes(
    include_screenshot_images=True,
    include_object_images=False,
    image_download_dir="./uber_10q_images",
)

## Extract/Index images from image dicts

Here we use a multimodal model to caption images and create text nodes for indexing.

In [None]:
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock
from llama_index.core.schema import ImageNode, TextNode
from llama_index.llms.anthropic import Anthropic


async def get_image_text_nodes(image_nodes: list[ImageNode]):
    """Caption each image with a multimodal model and wrap each caption in a TextNode."""
    llm = Anthropic(model="claude-3-5-haiku-20241022", max_tokens=300)
    img_text_nodes = []
    for image_node in image_nodes:
        image_path = image_node.image_path
        message = ChatMessage(
            role="user",
            blocks=[
                TextBlock(text="Describe the images as alt text"),
                ImageBlock(path=image_path),
            ],
        )
        response = await llm.achat([message])
        text_node = TextNode(
            text=str(response.message.content), metadata={"path": image_path}
        )
        img_text_nodes.append(text_node)

    return img_text_nodes
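The loop above awaits one caption at a time, which is simple but slow for documents with many pages. One possible variation is to run several requests concurrently while capping how many are in flight. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`; `gather_with_limit` is a hypothetical helper, `fake_caption` is a stand-in that only simulates latency, and a sensible limit depends on your API rate limits.

```python
import asyncio


async def gather_with_limit(items, worker, limit=4):
    """Run worker(item) for every item with at most `limit` in flight.

    Results come back in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


# Stand-in for the captioning call; it only simulates latency.
async def fake_caption(image_path):
    await asyncio.sleep(0.01)
    return f"caption for {image_path}"


image_paths = ["page_1.jpg", "page_2.jpg", "page_3.jpg"]
captions = asyncio.run(gather_with_limit(image_paths, fake_caption, limit=2))
print(captions)
```

Inside a notebook, where an event loop is already running, you would write `await gather_with_limit(...)` instead of wrapping the call in `asyncio.run`.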

In [None]:
image_text_nodes = await get_image_text_nodes(image_nodes)

In [None]:
image_text_nodes[0].get_content()

## Build Index across image and text nodes

Here we build a single vector index over both the page text nodes and the caption nodes generated from the images.

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(text_nodes + image_text_nodes)
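Conceptually, a vector index embeds every node and answers a query by returning the nodes whose embeddings are most similar to the query embedding. The toy sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors. It is not LlamaIndex's implementation, and the labels, vectors, and helper names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, indexed, k=2):
    """Return the k (label, score) pairs most similar to the query."""
    scored = [(label, cosine_similarity(query_vector, vec)) for label, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; real ones have many more dimensions.
indexed = [
    ("page text: risk factors", [0.9, 0.1, 0.0]),
    ("image caption: bar graph of monthly users", [0.1, 0.9, 0.2]),
    ("page text: balance sheet", [0.0, 0.2, 0.9]),
]
print(top_k([0.2, 0.8, 0.1], indexed, k=1))
```

Because captions and page text live in the same embedding space, a question about a chart can retrieve a caption node just as a question about prose retrieves a text node.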

In [None]:
query_engine = index.as_query_engine()

In [None]:
# Ask a question whose answer comes from an image.
response = query_engine.query(
    "What does the bar graph titled 'Monthly Active Platform Consumers' show?"
)
print(str(response))

In [None]:
# Ask a question whose answer comes from the page text.
response = query_engine.query("What are the main risk factors for Uber?")
print(str(response))
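To check whether an answer was grounded in an image caption or in page text, you can inspect the retrieved nodes on the response via `response.source_nodes`. The sketch below is a hypothetical helper that classifies each source by looking for the `path` metadata key that `get_image_text_nodes` sets. It assumes the page text nodes do not carry a key with that name, and it uses stand-in objects (with a made-up `page_number` key) so that it runs without calling the query engine.

```python
from types import SimpleNamespace


def describe_sources(response):
    """List (kind, score) for each retrieved node behind a response.

    A node counts as an image caption when its metadata has a "path" key,
    which is the key get_image_text_nodes stores for each caption.
    """
    described = []
    for source in response.source_nodes:
        kind = "image caption" if "path" in source.node.metadata else "page text"
        described.append((kind, source.score))
    return described


# Stand-in objects with the same attributes, so the sketch runs offline.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            node=SimpleNamespace(metadata={"path": "page_5.jpg"}), score=0.71
        ),
        SimpleNamespace(
            node=SimpleNamespace(metadata={"page_number": 12}), score=0.64
        ),
    ]
)
print(describe_sources(fake_response))
```

With a real query you would pass the `response` returned by `query_engine.query(...)` instead of `fake_response`.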