# 📓 The GenAI Revolution Cookbook

**Title:** Multi-Document Agent with LlamaIndex: The Ultimate Guide [2025]

**Description:** Build a production-ready multi-document agent with LlamaIndex, turning PDFs into retrieval and summarization tools using semantic selection for accurate answers.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## What You're Building

You'll build a multi\-document research assistant that answers questions across multiple PDFs with precise citations. The agent uses semantic vector search for targeted queries, hierarchical summarization for high\-level synthesis, and function calling to route queries to the right tool. By the end, you'll have a runnable notebook that handles cross\-document Q\&A, enforces consistent citations in `[file_name p.page_label]` format, and includes a minimal validation suite.

**Prerequisites:**

* Python 3\.10\+

* OpenAI API key

* 2–3 sample PDFs (research papers, reports, or technical documents)

* Expected cost: \~$0\.10–$0\.50 per summary\-heavy query, depending on document size

## Why This Approach Works

**Per\-Document Tool Isolation**
Each PDF gets its own vector and summary tool. This prevents cross\-contamination, enables precise citations, and lets the agent reason about which document to query for a given question.

**Semantic Tool Retrieval**
An object index embeds tool descriptions and retrieves the top\-k relevant tools per query. This scales to dozens of documents without overwhelming the agent's context window.
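
To make the top\-k idea concrete, here is a minimal sketch of similarity-based tool selection. It is not LlamaIndex's implementation: the three-dimensional vectors are hand-made stand-ins for real description embeddings, and `cosine` and `top_k_tools` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_tools(query_vec, tool_vecs, k=2):
    """Return the k tool names whose vectors are most similar to the query."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made vectors standing in for embedded tool descriptions (hypothetical values)
tool_vecs = {
    "vector_paper1": [0.9, 0.1, 0.0],
    "summary_paper1": [0.1, 0.9, 0.0],
    "vector_paper2": [0.8, 0.0, 0.6],
}
query_vec = [1.0, 0.0, 0.1]  # pretend embedding of a narrow, fact-style question

selected = top_k_tools(query_vec, tool_vecs, k=2)
print(selected)  # ['vector_paper1', 'vector_paper2']
```

Only the selected tools are shown to the agent, which is why the number of documents can grow without growing the prompt.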

**Dual Retrieval Strategy**
Vector tools handle narrow, fact\-based queries ("What dataset did the authors use?"). Summary tools handle broad synthesis ("Compare the main contributions across papers"). The agent picks the right mode based on query semantics.

**Citation Enforcement**
Every tool attaches file name and page metadata to results. The system prompt instructs the agent to cite sources after each claim, and you can post\-process responses to format citations programmatically.
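
As a rough illustration of programmatic citation formatting, the sketch below builds `[file_name p.page_label]` strings from retrieved nodes. It assumes each source node exposes a `.node.metadata` dictionary with `file_name` and `page_label` keys; the `SimpleNamespace` objects are stand-ins for real LlamaIndex nodes, and `format_citations` is a hypothetical helper name.

```python
from types import SimpleNamespace

def format_citations(source_nodes):
    """Build de-duplicated "[file_name p.page_label]" strings in first-seen order."""
    seen, citations = set(), []
    for sn in source_nodes:
        meta = sn.node.metadata or {}
        cite = f"[{meta.get('file_name', 'unknown')} p.{meta.get('page_label', 'N/A')}]"
        if cite not in seen:
            seen.add(cite)
            citations.append(cite)
    return citations

def fake_node(file_name, page_label):
    """Stand-in that mimics the assumed `.node.metadata` shape (not a real node)."""
    meta = {"file_name": file_name, "page_label": page_label}
    return SimpleNamespace(node=SimpleNamespace(metadata=meta))

fake_nodes = [
    fake_node("paper1.pdf", "3"),
    fake_node("paper1.pdf", "3"),   # duplicate page, should be cited once
    fake_node("paper2.pdf", "12"),
]

citations = format_citations(fake_nodes)
print(" ".join(citations))  # [paper1.pdf p.3] [paper2.pdf p.12]
```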

## How It Works (High\-Level Overview)

1. **Load and chunk PDFs** – Extract text, split into sentence\-aware chunks, normalize metadata for citations.

2. **Build per\-document tools** – Create vector and summary tools for each PDF; wrap them with clear descriptions.

3. **Index tools semantically** – Embed tool descriptions in an object index for dynamic retrieval.

4. **Assemble the agent** – Use function calling with a strict system prompt to route queries and enforce citations.

5. **Validate and iterate** – Run test queries, inspect tool selection, tune retrieval thresholds and temperature.

## Setup \& Installation

Run this cell first to install the required packages. Only `llama-index` is pinned to an exact version; the other packages specify minimum versions:

In [None]:
%pip install --upgrade -q "llama-index==0.10.40" "llama-index-llms-openai>=0.1.0" "llama-index-embeddings-openai>=0.1.0" "pypdf>=4.0.0" nest_asyncio python-dotenv

Next, configure your OpenAI API key. If running in Colab, add your key to Secrets (Settings → Secrets → OPENAI\_API\_KEY). Otherwise, create a `.env` file with `OPENAI_API_KEY=your_key`.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

# Fail early if key is missing
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env or Colab Secrets"
print("API key loaded.")

Set up logging, suppress warnings, and enable async support for a clean notebook environment:

In [None]:
import logging
import warnings
import nest_asyncio

warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)
nest_asyncio.apply()

Configure the LLM and embedding model globally for all LlamaIndex operations:

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Use GPT-4o for reliable function calling; fall back to gpt-4o-mini if needed
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

Create a data directory and download sample PDFs programmatically so the notebook runs end\-to\-end:

In [None]:
import urllib.request

DATA_DIR = "data"
os.makedirs(DATA_DIR, exist_ok=True)

# Example: download public arXiv papers (replace with your own PDFs)
sample_urls = [
    ("https://arxiv.org/pdf/2005.11401.pdf", "paper1.pdf"),  # Retrieval-Augmented Generation (RAG) paper
    ("https://arxiv.org/pdf/2303.08774.pdf", "paper2.pdf"),  # GPT-4 technical report
]

for url, fname in sample_urls:
    fpath = os.path.join(DATA_DIR, fname)
    if not os.path.exists(fpath):
        print(f"Downloading {fname}...")
        urllib.request.urlretrieve(url, fpath)

pdf_files = [f for f in os.listdir(DATA_DIR) if f.lower().endswith(".pdf")]
print(f"Found {len(pdf_files)} PDFs:", pdf_files)
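
A download can silently save an HTML error page under a `.pdf` name. As an optional sanity check, the sketch below tests for the `%PDF-` header that PDF files normally start with. `looks_like_pdf` is a hypothetical helper, and the demo uses throwaway files rather than the notebook's `data` directory.

```python
import os
import tempfile

def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Demonstration on temporary files; in the notebook you would pass paths under DATA_DIR
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad, "wb") as f:
        f.write(b"<html>error page</html>")
    results = (looks_like_pdf(good), looks_like_pdf(bad))

print(results)  # (True, False)
```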

## Step\-by\-Step Implementation

### Step 1: Load and Chunk PDFs

Load documents from the data directory. The PDF reader returns one `Document` per page and attaches `file_name` and `page_label` metadata automatically:

In [None]:
from llama_index.core import SimpleDirectoryReader, Document

docs = SimpleDirectoryReader(DATA_DIR, recursive=False).load_data()
print(f"Loaded {len(docs)} page-level documents")

Split documents into sentence\-aware chunks for semantic retrieval. Sentence\-aware splitting avoids fragmenting thoughts mid\-sentence, giving the vector index better semantic units. This directly improves retrieval quality, especially for dense technical writing like research papers or legal clauses. For more strategies to boost retrieval accuracy in RAG systems, see our guide on [retrieval tricks to boost answer accuracy](/article/rag-application-7-retrieval-tricks-to-boost-answer-accuracy-2).

In [None]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs, show_progress=True)
print(f"Total chunks: {len(nodes)}")
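
To show what sentence-aware chunking with overlap means, here is a toy sketch in plain Python. It is not how `SentenceSplitter` works internally (that class measures size in tokens and handles many edge cases); `sentence_chunks` is a hypothetical helper that measures size in characters and repeats whole sentences as overlap.

```python
import re

def sentence_chunks(text: str, max_chars: int = 80, overlap_sentences: int = 1):
    """Greedily pack whole sentences into chunks, repeating trailing sentences as overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sent in sentences:
        candidate = " ".join(current + [sent])
        if current and len(candidate) > max_chars:
            # Close the current chunk and seed the next one with the overlap
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "Retrieval helps grounding. Chunks should end on sentence boundaries. "
    "Overlap keeps context across chunks. Citations need page labels."
)
chunks = sentence_chunks(text, max_chars=80, overlap_sentences=1)
for c in chunks:
    print(repr(c))
```

Every chunk ends on a sentence boundary, and each chunk after the first starts with the last sentence of the previous one. A chunk can exceed `max_chars` in this toy version when the overlap plus the next sentence is too long.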

Normalize metadata for accurate citations. Ensure every node has `file_name` and `page_label`:

In [None]:
for n in nodes:
    meta = n.metadata or {}
    if "file_name" not in meta:
        file_path = meta.get("file_path", meta.get("source", "unknown"))
        meta["file_name"] = os.path.basename(file_path) if isinstance(file_path, str) else "unknown"
    if "page_label" not in meta:
        meta["page_label"] = str(meta.get("page_number", "N/A"))
    n.metadata = meta

print("Sample chunk metadata:", nodes[0].metadata)
print("Sample chunk text:", nodes[0].text[:300], "...")

Group nodes by document for per\-document tool creation:

In [None]:
from collections import defaultdict

nodes_by_file = defaultdict(list)
for n in nodes:
    nodes_by_file[n.metadata["file_name"]].append(n)

print({k: len(v) for k, v in nodes_by_file.items()})

### Step 2: Build Per\-Document Vector Tools

Create a vector index for each document to enable precise passage retrieval:

In [None]:
import re

from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool

vector_tools = {}

for fname, doc_nodes in nodes_by_file.items():
    v_index = VectorStoreIndex(doc_nodes, show_progress=True)
    v_engine = v_index.as_query_engine(similarity_top_k=5)
    v_tool = QueryEngineTool.from_defaults(
        query_engine=v_engine,
        # OpenAI function names allow only letters, digits, "_" and "-" (max 64 chars),
        # so replace the "." in the file name.
        name=re.sub(r"[^a-zA-Z0-9_-]", "_", f"vector_{fname}")[:64],
        # from_defaults() takes a name and description rather than a metadata dict;
        # the tool type and file name stay recoverable from these two fields.
        description=(
            f"Semantic vector search for {fname}. "
            "Use for targeted, specific questions that require exact passages and citations."
        ),
    )
    vector_tools[fname] = v_tool

print(f"Vector tools created: {len(vector_tools)}")

Test a vector tool to verify retrieval quality:

In [None]:
sample_file = next(iter(vector_tools.keys()))
resp = vector_tools[sample_file].query_engine.query("What problem does this paper address?")
print(resp)

### Step 3: Build Per\-Document Summary Tools

Create a summary index for each document to enable hierarchical summarization:

In [None]:
import re

from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool

summary_tools = {}

for fname, doc_nodes in nodes_by_file.items():
    s_index = SummaryIndex(doc_nodes)
    s_engine = s_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True
    )
    s_tool = QueryEngineTool.from_defaults(
        query_engine=s_engine,
        # OpenAI function names allow only letters, digits, "_" and "-" (max 64 chars),
        # so replace the "." in the file name.
        name=re.sub(r"[^a-zA-Z0-9_-]", "_", f"summary_{fname}")[:64],
        # from_defaults() takes a name and description rather than a metadata dict
        description=(
            f"Hierarchical summarization for {fname}. "
            "Use for overviews, key contributions, limitations, and document-wide synthesis."
        ),
    )
    summary_tools[fname] = s_tool

print(f"Summary tools created: {len(summary_tools)}")

Test a summary tool to verify synthesis quality:

In [None]:
sample_file = next(iter(summary_tools.keys()))
resp = summary_tools[sample_file].query_engine.query("Provide a 5-bullet executive summary.")
print(resp)
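
Conceptually, tree summarization merges chunk-level results bottom-up until a single answer remains. The sketch below illustrates that shape only; it is not LlamaIndex's `tree_summarize` code. The `combine` callable is a stand-in for an LLM call, and `fan_in` is a hypothetical parameter for how many texts are merged per call.

```python
def tree_summarize(chunks, combine, fan_in=2):
    """Repeatedly merge groups of `fan_in` texts until one summary remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    if not level:
        return ""
    while len(level) > 1:
        level = [combine(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]

calls = []

def bracket_merge(texts):
    """Stand-in for an LLM summarization call: record the call and show the grouping."""
    calls.append(list(texts))
    return "(" + " + ".join(texts) + ")"

summary = tree_summarize(["A", "B", "C", "D"], bracket_merge)
print(summary)      # ((A + B) + (C + D))
print(len(calls))   # 3 merge calls for 4 chunks
```

The number of merge calls grows with the number of chunks, which is why summary-heavy queries cost more than vector lookups.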

### Step 4: Index Tools Semantically

Build an object index over all tools for semantic tool selection. This embeds tool descriptions and retrieves the top\-k relevant tools per query:

In [None]:
from llama_index.core.objects import ObjectIndex

all_tools = list(vector_tools.values()) + list(summary_tools.values())

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
    show_progress=True
)

tool_retriever = obj_index.as_retriever(similarity_top_k=3)

Inspect which tools are retrieved for different queries to debug tool selection:

In [None]:
def inspect_tools(query: str):
    # The object retriever returns the matching tool objects directly, best match first
    tools = tool_retriever.retrieve(query)
    print(f"Query: {query}")
    for i, tool in enumerate(tools, 1):
        print(f"#{i} -> {tool.metadata.name} | {tool.metadata.description}")

inspect_tools("Summarize key contributions across the papers.")
inspect_tools("What dataset did the authors use for evaluation?")

### Step 5: Assemble the Agent

Create the agent using function calling and a strict system prompt that enforces citation format. While frameworks like LangChain and CrewAI are solid, LlamaIndex specializes in document workflows with first\-class support for indexing, retrieval, summarization, and agentic tool use that map cleanly to this problem. If you're interested in foundational agent patterns, check out our step\-by\-step tutorial on [building an LLM agent from scratch with GPT\-4 ReAct](/article/how-to-build-an-llm-agent-from-scratch-with-gpt-4-react-5).

In [None]:
from llama_index.core.agent import FunctionCallingAgentWorker, AgentRunner

SYSTEM_PROMPT = """You are a multi-document research assistant.
- Use only the provided tools.
- Prefer vector tools for specific, narrow questions.
- Prefer summary tools for high-level synthesis.
- Always cite sources as [file_name p.page_label] after each relevant sentence.
- If you cannot find relevant evidence, say so explicitly."""

agent_worker = FunctionCallingAgentWorker.from_tools(
    # Pass only the retriever: the worker rejects `tools` and `tool_retriever` together
    tool_retriever=tool_retriever,
    llm=Settings.llm,
    system_prompt=SYSTEM_PROMPT,
)

agent = AgentRunner(agent_worker)

## Run and Validate

Run a cross\-document query and verify the agent synthesizes answers with citations:

In [None]:
response = agent.chat(
    "Compare the main challenges and proposed approaches across the papers."
)
print(str(response))

Run a suite of test queries to validate agent routing, retrieval, and summarization:

In [None]:
tests = [
    "List the datasets used by each paper and compare evaluation metrics.",
    "Provide a high-level summary of the main contributions across documents.",
    "According to the authors, what are the primary limitations?"
]

for q in tests:
    print("\nQ:", q)
    resp = agent.chat(q)
    print("A:", str(resp))
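
A small check can confirm that answers actually contain citations in the `[file_name p.page_label]` format. The sketch below is one possible validator; `CITATION_RE` and `extract_citations` are hypothetical names, and the sample answers are made up for the demo.

```python
import re

# Matches citations such as "[paper1.pdf p.12]": file name, space, "p.", page label
CITATION_RE = re.compile(r"\[[^\[\]]+ p\.[^\[\]\s]+\]")

def extract_citations(answer: str) -> list:
    """Return every citation found in the answer, in order of appearance."""
    return CITATION_RE.findall(answer)

cited = "First claim [paper1.pdf p.2]. Second claim [paper2.pdf p.N/A]."
uncited = "A claim with no source attached."

print(extract_citations(cited))    # ['[paper1.pdf p.2]', '[paper2.pdf p.N/A]']
print(extract_citations(uncited))  # []
```

In the test loop above, you could assert that `extract_citations(str(resp))` is non-empty for questions that the documents should be able to answer.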

Inspect tool selection for observability and tuning:

In [None]:
def print_selected_tools(query: str):
    # ObjectRetriever.retrieve() returns tool objects without scores, so fetch the
    # scored nodes from the underlying retriever (same ranking) and pair them up.
    tools = tool_retriever.retrieve(query)
    scored = tool_retriever.retriever.retrieve(query)
    print(f"Query: {query}")
    for i, (tool, c) in enumerate(zip(tools, scored), 1):
        print(f"  {i}. {tool.metadata.name} | score={c.score:.3f}")

print_selected_tools("Provide an executive summary across all documents.")
print_selected_tools("Which sections discuss model architecture details?")

Add a simple guardrail that declines to answer when no tool description is sufficiently similar to the query:

In [None]:
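MIN_SCORE = 0.25

def safe_query(query: str) -> str:
    # Similarity scores live on the underlying node retriever's results,
    # not on the tool objects returned by tool_retriever.retrieve()
    cands = tool_retriever.retriever.retrieve(query)
    scores = [c.score for c in cands if c.score is not None]
    if not scores or max(scores) < MIN_SCORE:
        return "No sufficiently relevant sources found. Please rephrase or specify a document/section."
    return str(agent.chat(query))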

print(safe_query("What is the capital of Mars?"))

## Conclusion

You've built a multi\-document research assistant that routes queries to the right tool, retrieves precise passages, and enforces consistent citations. Key decisions include per\-document tool isolation for clean attribution, semantic tool retrieval for scalability, and dual retrieval modes (vector for specifics, summary for synthesis).

**Next steps to harden for production:**

1. **Persist indices** – Save vector and summary indices to disk or a vector database (e.g., pgvector, Pinecone) to avoid re\-embedding on every run.

2. **Add retries and rate limits** – Wrap LLM calls with exponential backoff and timeout handling for robustness.

3. **Implement structured logging** – Use LlamaIndex callbacks or a logging framework to trace tool calls, latency, and token usage.

4. **Cache answers** – Use an in\-memory LRU cache or a persistent store like Redis for repeated queries. For a deep dive into implementing semantic caching with Redis Vector to optimize LLM costs, see [how to implement semantic cache with Redis Vector](/article/semantic-cache-llm-how-to-implement-with-redis-vector-to-cut-costs-6).

5. **Post\-process citations** – Extract `source_nodes` from responses and format citations programmatically to ensure consistency beyond prompt\-based enforcement.
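
As one way to approach step 2 in the list above, here is a minimal retry wrapper with exponential backoff and a little jitter. `with_backoff` and `flaky` are hypothetical names; the wrapper catches every exception for brevity, whereas production code would normally retry only on rate-limit and timeout errors.

```python
import random
import time

def with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (capped, plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))

# Flaky callable standing in for something like `lambda: agent.chat(question)`
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = with_backoff(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, calls["n"])  # ok 3
```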