<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/agent/multi_document_agents-v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Document Agents (V1)

In this guide, you learn how to set up a multi-document agent over a set of documents.

This extends the V0 multi-document agents with the following additional features:
- Reranking during document (tool) retrieval
- Query planning tool that the agent can use to plan


We do this with the following architecture:

- Set up a "document agent" over each document: each document agent can do QA and summarization within its document.
- Set up a top-level agent over this set of document agents. It performs tool retrieval and then chain-of-thought (CoT) reasoning over the retrieved tools to answer a question.
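
The two-level design above can be sketched in plain Python. This is a minimal illustration, not the LlamaIndex implementation: `DocAgent`, `word_overlap`, and `top_level_answer` are hypothetical names, the descriptions are invented for the example, and word overlap stands in for the embedding-based tool retrieval used later in the notebook.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DocAgent:
    """Hypothetical stand-in for a per-document agent."""

    name: str
    description: str
    answer: Callable[[str], str]


def word_overlap(a: str, b: str) -> int:
    """Count shared lowercase words (a toy substitute for embedding similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def top_level_answer(query: str, agents: Dict[str, DocAgent]) -> str:
    """Pick the agent whose description best matches the query, then delegate."""
    best = max(agents.values(), key=lambda a: word_overlap(query, a.description))
    return f"[{best.name}] {best.answer(query)}"


# Illustrative agents; the descriptions are assumptions, not real summaries.
agents = {
    "lca": DocAgent(
        "lca",
        "methodology for calculating ghg intensity of maritime fuels",
        lambda q: "answer from the LCA document",
    ),
    "imo": DocAgent(
        "imo",
        "implications of the 2023 imo ghg strategy",
        lambda q: "answer from the IMO document",
    ),
}

print(top_level_answer("what does the imo strategy imply", agents))
```

The top-level function only ever sees agent descriptions, which is why the quality of each document summary matters for routing.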

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install llama-index-core
%pip install llama-index-agent-openai
%pip install llama-index-readers-file
%pip install llama-index-postprocessor-cohere-rerank
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install unstructured[html]
%pip install nltk
%pip install llama-index-question-gen-openai
%pip install PyMuPDF


In [2]:
%load_ext autoreload
%autoreload 2

## Setup and Download Data

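In this section, we load the source documents: a set of PDF publications from the Mærsk Mc-Kinney Møller Center for Zero Carbon Shipping (MMMCZCS), stored on Google Drive.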

In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
import nltk
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [5]:
from llama_index.readers.file import UnstructuredReader

reader = UnstructuredReader()

In [8]:
from pathlib import Path

# Path to your PDF files
pdf_files_path = Path("/content/drive/MyDrive/MMMCZCS")

# Gathering all PDF files
all_pdf_files = list(pdf_files_path.rglob("*.pdf"))

print(f"Total PDF files found: {len(all_pdf_files)}")

Total PDF files found: 3


In [28]:
from llama_index.core import Document
import fitz  # PyMuPDF



# TODO: set to higher value if you want more docs
doc_limit = 30



docs = []
for idx, pdf_path in enumerate(all_pdf_files):
    if idx >= doc_limit:  # Adjusted to >= to process exactly doc_limit files
        break
    print(f"Processing file {idx+1}/{min(len(all_pdf_files), doc_limit)}: {pdf_path}")

    # Opening the PDF file
    with fitz.open(pdf_path) as doc:
        text = ""
        for page in doc:
            text += page.get_text() + "\n\n"  # Extracting text from each page

    # Creating Document
    loaded_doc = Document(
        text=text,
        metadata={"path": str(pdf_path)},
    )
    docs.append(loaded_doc)

print(f"Processed {len(docs)} documents.")





Processing file 1/3: /content/drive/MyDrive/MMMCZCS/LCA_Calculating_GHG_Intensity.pdf
Processing file 2/3: /content/drive/MyDrive/MMMCZCS/Implications_2023_IMO_GHG_Strategy.pdf
Processing file 3/3: /content/drive/MyDrive/MMMCZCS/Assessing_impacts_EU_and_US_policies.pdf
Processed 3 documents.


### Define Global LLM + Embeddings

In [11]:
import os

os.environ["OPENAI_API_KEY"] = "key"

import nest_asyncio

nest_asyncio.apply()

In [12]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

llm = OpenAI(model="gpt-3.5-turbo")
Settings.llm = llm
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=256
)

## Building Multi-Document Agents

In this section we show you how to construct the multi-document agent. We first build a document agent for each document, and then define the top-level parent agent with an object index.

### Build Document Agent for each Document

In this section we define "document agents" for each document.

We define both a vector index (for semantic search) and summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function calling agent.

This document agent can dynamically choose to perform semantic search or summarization within a given document.

We create a separate document agent for each document.
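
As a rough illustration of the choice between semantic search and summarization, the sketch below routes a query to one of two tools. The keyword heuristic in `route` is only a stand-in for the decision that the function-calling LLM makes in the real agent, and `vector_tool`, `summary_tool`, and `SUMMARY_HINTS` are hypothetical names.

```python
# Hypothetical tools: each returns a string describing what it would do.
def vector_tool(query: str) -> str:
    return f"semantic search for: {query}"


def summary_tool(query: str) -> str:
    return f"summary for: {query}"


# Assumed trigger words; the real agent lets the LLM decide from tool descriptions.
SUMMARY_HINTS = ("summarize", "summary", "overview")


def route(query: str) -> str:
    """Send summary-style queries to the summary tool, everything else to search."""
    wants_summary = any(hint in query.lower() for hint in SUMMARY_HINTS)
    tool = summary_tool if wants_summary else vector_tool
    return tool(query)


print(route("Summarize the document"))
print(route("What is the 2030 uptake target?"))
```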

In [84]:
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import SentenceSplitter
import os
from tqdm.notebook import tqdm
import pickle


async def build_agent_per_doc(nodes, file_base):
    print(file_base)

    vi_out_path = f"./data/llamaindex_docs/{file_base}"
    summary_out_path = f"./data/llamaindex_docs/{file_base}_summary.pkl"
    if not os.path.exists(vi_out_path):
        Path("./data/llamaindex_docs/").mkdir(parents=True, exist_ok=True)
        # build vector index
        vector_index = VectorStoreIndex(nodes)
        vector_index.storage_context.persist(persist_dir=vi_out_path)
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=vi_out_path),
        )

    # build summary index
    summary_index = SummaryIndex(nodes)

    # define query engines
    vector_query_engine = vector_index.as_query_engine(llm=llm)
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize", llm=llm
    )

    # extract a summary
    if not os.path.exists(summary_out_path):
        Path(summary_out_path).parent.mkdir(parents=True, exist_ok=True)
        summary = str(
            await summary_query_engine.aquery(
                "Extract a concise 3-4 line summary of this document"
            )
        )
        # use a context manager so the file handle is closed after writing
        with open(summary_out_path, "wb") as f:
            pickle.dump(summary, f)
    else:
        # use a context manager so the file handle is closed after reading
        with open(summary_out_path, "rb") as f:
            summary = pickle.load(f)

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name=f"vector_tool_{file_base}",
                description=f"Useful for questions related to specific facts",
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name=f"summary_tool_{file_base}",
                description=f"Useful for summarization questions",
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-4-0125-preview")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about the `{file_base}.pdf` part of the Mærsk Mc-Kinney Møller Center for Zero Carbon Shipping.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    return agent, summary


async def build_agents(docs):
    node_parser = SentenceSplitter()

    # Build agents dictionary
    agents_dict = {}
    extra_info_dict = {}

    # # this is for the baseline
    # all_nodes = []

    for idx, doc in enumerate(tqdm(docs)):
        nodes = node_parser.get_nodes_from_documents([doc])
        # all_nodes.extend(nodes)

        # ID is the parent directory name plus the file stem
        file_path = Path(doc.metadata["path"])
        file_base = str(file_path.parent.stem) + "_" + str(file_path.stem)
        agent, summary = await build_agent_per_doc(nodes, file_base)

        agents_dict[file_base] = agent
        extra_info_dict[file_base] = {"summary": summary, "nodes": nodes}

    return agents_dict, extra_info_dict
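
The summary handling in `build_agent_per_doc` follows a compute-or-load pattern: the expensive LLM call runs only when no cached file exists. The sketch below isolates that pattern with the standard library. `load_or_compute` and `expensive_summary` are hypothetical names, and the temporary directory replaces the notebook's `./data/llamaindex_docs/` path.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(cache_path: Path, compute: Callable[[], str]) -> str:
    """Return the cached value if present; otherwise compute and persist it."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    value = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def expensive_summary() -> str:
    calls.append(1)  # record each time the stand-in "LLM call" actually runs
    return "a short summary"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs" / "example_summary.pkl"
    first = load_or_compute(path, expensive_summary)
    second = load_or_compute(path, expensive_summary)

print(first == second, len(calls))
```

One consequence of this design is that a stale cache is never refreshed: if a document changes, its cached summary and persisted index must be deleted by hand.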

In [14]:
agents_dict, extra_info_dict = await build_agents(docs)

  0%|          | 0/3 [00:00<?, ?it/s]

MMMCZCS_LCA_Calculating_GHG_Intensity
MMMCZCS_Implications_2023_IMO_GHG_Strategy
MMMCZCS_Assessing_impacts_EU_and_US_policies
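
The names printed above are built from the parent directory name and the file stem. The sketch below reproduces that derivation for one of the paths in this notebook; `make_file_base` is a hypothetical helper name, and `PurePosixPath` is used so the example behaves the same on any operating system.

```python
from pathlib import PurePosixPath


def make_file_base(path: str) -> str:
    """Join the parent directory's stem and the file's stem with an underscore."""
    p = PurePosixPath(path)
    return f"{p.parent.stem}_{p.stem}"


print(make_file_base("/content/drive/MyDrive/MMMCZCS/LCA_Calculating_GHG_Intensity.pdf"))

# Caveat: .stem also strips a "suffix" from directory names that contain a dot.
print(make_file_base("/data/v1.2/report.pdf"))
```

Because only one directory level is included, two files with the same name in identically named folders would map to the same ID and overwrite each other's cached index.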


### Build Retriever-Enabled OpenAI Agent

We build a top-level agent that can orchestrate across the different document agents to answer any user query.

This agent (an `OpenAIAgent` configured with a tool retriever) performs tool retrieval before tool use, unlike a default agent that tries to put all tools in the prompt.

**Improvements from V0**: We make the following improvements compared to the "base" version in V0.

- Adding in reranking: we use Cohere reranker to better filter the candidate set of documents.
- Adding in a query planning tool: we add an explicit query planning tool that's dynamically created based on the set of retrieved tools.
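
The two improvements combine into a retrieve, rerank, then append pipeline. The sketch below shows only that control flow under simplifying assumptions: `retrieve_tools` and `word_overlap` are hypothetical names, tools are represented by their description strings, and the same toy scorer plays both the embedding retriever and the Cohere reranker.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve_tools(
    query: str,
    tool_descriptions: List[str],
    first_stage: Scorer = word_overlap,
    reranker: Scorer = word_overlap,
    top_k: int = 10,
    top_n: int = 5,
) -> List[str]:
    """Keep up to top_k candidates, rerank down to top_n, append the planner."""
    # Stage 1: broad retrieval keeps at most top_k candidates.
    candidates = sorted(
        tool_descriptions, key=lambda d: first_stage(query, d), reverse=True
    )[:top_k]
    # Stage 2: the reranker rescores the candidates and keeps at most top_n.
    reranked = sorted(
        candidates, key=lambda d: reranker(query, d), reverse=True
    )[:top_n]
    # The planning tool is always appended, whatever was retrieved.
    return reranked + ["compare_tool"]


three_tools = ["ghg intensity of fuels", "imo ghg strategy", "eu and us policies"]
result = retrieve_tools("imo strategy", three_tools)
print(len(result), result[0], result[-1])
```

With only three tools available, the result holds 3 + 1 = 4 entries, fewer than the 5 + 1 that the `top_n` setting would otherwise allow.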


In [15]:
# define tool for each document agent
all_tools = []
for file_base, agent in agents_dict.items():
    summary = extra_info_dict[file_base]["summary"]
    doc_tool = QueryEngineTool(
        query_engine=agent,
        metadata=ToolMetadata(
            name=f"tool_{file_base}",
            description=summary,
        ),
    )
    all_tools.append(doc_tool)

In [16]:
print(all_tools[0].metadata)

ToolMetadata(description='The document outlines a standardized methodology for calculating the greenhouse gas intensity of maritime fuels, emphasizing global warming potential assessment and providing guidance on emissions accounting principles and credits for various stages of the fuel life cycle.', name='tool_MMMCZCS_LCA_Calculating_GHG_Intensity', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)


In [87]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import (
    ObjectIndex,
    ObjectRetriever,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.schema import QueryBundle
from llama_index.llms.openai import OpenAI


llm = OpenAI(model="gpt-4-0613")

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)
vector_node_retriever = obj_index.as_node_retriever(
    similarity_top_k=10,
)


# define a custom object retriever that adds in a query planning tool
class CustomObjectRetriever(ObjectRetriever):
    def __init__(
        self,
        retriever,
        object_node_mapping,
        node_postprocessors=None,
        llm=None,
    ):
        self._retriever = retriever
        self._object_node_mapping = object_node_mapping
        self._llm = llm or OpenAI("gpt-4-0613")
        self._node_postprocessors = node_postprocessors or []

    def retrieve(self, query_bundle):
        if isinstance(query_bundle, str):
            query_bundle = QueryBundle(query_str=query_bundle)

        nodes = self._retriever.retrieve(query_bundle)
        for processor in self._node_postprocessors:
            nodes = processor.postprocess_nodes(
                nodes, query_bundle=query_bundle
            )
        tools = [self._object_node_mapping.from_node(n.node) for n in nodes]

        sub_question_engine = SubQuestionQueryEngine.from_defaults(
            query_engine_tools=tools, llm=self._llm
        )
        sub_question_description = f"""\
Useful for any queries that involve comparing multiple documents. ALWAYS use this tool for comparison queries - make sure to call this \
tool with the original query. Do NOT use the other tools for any queries involving multiple documents. Give an answer that is 8-9 lines.
"""
        sub_question_tool = QueryEngineTool(
            query_engine=sub_question_engine,
            metadata=ToolMetadata(
                name="compare_tool", description=sub_question_description
            ),
        )

        return tools + [sub_question_tool]
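
The `compare_tool` delegates to a sub-question engine, which splits a comparison query into per-tool sub-questions and combines the answers. The sketch below shows only the fan-out step and makes strong simplifications: `answer_by_fan_out` is a hypothetical name, the sub-questions are passed in by hand instead of being generated by an LLM, and the final synthesis is a plain join.

```python
from typing import Callable, Dict, List, Tuple


def answer_by_fan_out(
    sub_questions: List[Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
) -> str:
    """Run each (tool_name, sub_question) pair and join the partial answers."""
    partial_answers = []
    for tool_name, sub_question in sub_questions:
        answer = tools[tool_name](sub_question)
        partial_answers.append(f"{tool_name}: {answer}")
    return "\n".join(partial_answers)


# Illustrative tools; the names and answers are assumptions for this example.
toy_tools = {
    "tool_eu_us": lambda q: f"EU/US view on '{q}'",
    "tool_imo": lambda q: f"IMO view on '{q}'",
}
toy_sub_questions = [
    ("tool_eu_us", "What does the EU argue?"),
    ("tool_imo", "What does the IMO argue?"),
]

print(answer_by_fan_out(toy_sub_questions, toy_tools))
```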

In [18]:

os.environ["COHERE_API_KEY"] = "key"


In [19]:
# wrap it with ObjectRetriever to return objects
custom_obj_retriever = CustomObjectRetriever(
    vector_node_retriever,
    obj_index.object_node_mapping,
    node_postprocessors=[CohereRerank(top_n=5)],
    llm=llm,
)

In [20]:
tmps = custom_obj_retriever.retrieve("hello")

# at most 5 + 1: up to 5 tools from the reranker plus 1 sub-question tool
# (only 3 documents are indexed here, so we expect 3 + 1 = 4)
print(len(tmps))

4


In [88]:
from llama_index.agent.openai import OpenAIAgent
from llama_index.core.agent import ReActAgent

top_agent = OpenAIAgent.from_tools(
    tool_retriever=custom_obj_retriever,
    system_prompt=""" \
You are an agent designed to answer queries about zero carbon shipping and the Mærsk Mc-Kinney Møller Center for Zero Carbon Shipping.
Please always use the tools provided to answer a question. Answer in 8-9 lines. Do not rely on prior knowledge.\

""",
    llm=llm,
    verbose=True,
)

# top_agent = ReActAgent.from_tools(
#     tool_retriever=custom_obj_retriever,
#     system_prompt=""" \
# You are an agent designed to answer queries about the documentation.
# Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

# """,
#     llm=llm,
#     verbose=True,
# )

### Define Baseline Vector Store Index

As a point of comparison, we define a "naive" RAG pipeline which dumps all docs into a single vector index collection.

We set `similarity_top_k=4`.

In [22]:
all_nodes = [
    n for extra_info in extra_info_dict.values() for n in extra_info["nodes"]
]

In [23]:
base_index = VectorStoreIndex(all_nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=4)
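
To make the baseline concrete, the sketch below flattens per-document nodes into one collection and returns the four most similar nodes for a query. It is a toy under stated assumptions: the 2-D vectors are invented, `cosine` and `top_k_nodes` are hypothetical helpers, and a real vector index would use learned embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_nodes(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Score every node against the query and keep the k best."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented per-document node embeddings.
per_doc = {
    "doc_a": {"a1": (1.0, 0.0), "a2": (0.9, 0.1)},
    "doc_b": {"b1": (0.0, 1.0), "b2": (0.5, 0.5), "b3": (-1.0, 0.0)},
}
# Flatten all documents into a single collection, as the baseline does.
all_node_vecs = {nid: vec for nodes in per_doc.values() for nid, vec in nodes.items()}

hits = top_k_nodes((1.0, 0.0), all_node_vecs, k=4)
print([node_id for node_id, _ in hits])
```

Nothing in this pipeline knows which document a node came from, so a comparison query can end up with all four slots filled from a single document.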

## Running Example Queries

Let's run some example queries, ranging from QA / summaries over a single document to QA / summarization over multiple documents.

In [29]:
response = top_agent.query(
    "What are LCA calculations?",
)

Added user message to memory: What are LCA calculations?
=== Calling Function ===
Calling function: tool_MMMCZCS_LCA_Calculating_GHG_Intensity with args: {"input":"LCA calculations"}
Added user message to memory: LCA calculations
=== Calling Function ===
Calling function: summary_tool_MMMCZCS_LCA_Calculating_GHG_Intensity with args: {
  "input": "LCA calculations"
}
Got output: LCA calculations involve assessing the environmental impacts of a product, process, or activity throughout its entire life cycle. This includes defining the system boundary to determine what processes and activities will be considered, collecting data on inputs and outputs, and evaluating the environmental impacts associated with each stage. The calculations aim to provide a comprehensive understanding of the greenhouse gas emissions intensity of maritime fuels by considering factors such as feedstock extraction, fuel production, transportation, distribution, storage, and bunkering. The methodology developed by 

In [25]:
print(response)

Life Cycle Assessment (LCA) calculations are a methodology used to evaluate the environmental impacts of a product, process, or service throughout its entire life cycle. This assessment considers all stages, from raw material extraction, production, distribution, use, and disposal. LCA calculations help in quantifying the environmental burdens associated with a product or service, including energy consumption, resource depletion, and emissions to air, water, and land. The results of LCA calculations provide valuable insights for decision-making to improve the environmental performance of products and processes.


In [30]:
# baseline
response = base_query_engine.query(
    "What is zero carbon shipping?",
)
print(str(response))

Zero carbon shipping refers to the concept of operating ships without emitting carbon dioxide or other greenhouse gases into the atmosphere. This involves using energy sources that do not produce carbon emissions, such as sustainable fuels like biofuels, hydrogen, or ammonia, as well as implementing energy efficiency measures to reduce overall energy consumption. The goal of zero carbon shipping is to significantly reduce the environmental impact of the shipping industry and contribute to global efforts to combat climate change.


In [33]:
response = top_agent.query(
    "Compare the how different arguments are used for the nessecity to reach zero carbon shipping from the EU, the US and the IMO respectively?"
)

Added user message to memory: Compare the how different arguments are used for the nessecity to reach zero carbon shipping from the EU, the US and the IMO respectively?
=== Calling Function ===
Calling function: compare_tool with args: {"input":"How are different arguments used for the necessity to reach zero carbon shipping from the EU, the US, and the IMO respectively?"}
Generated 5 sub questions.
[tool_MMMCZCS_Assessing_impacts_EU_and_US_policies] Q: What are the benefits of EU policies in promoting the rapid adoption of alternative maritime fuels?
Added user message to memory: What are the benefits of EU policies in promoting the rapid adoption of alternative maritime fuels?
[tool_MMMCZCS_Assessing_impacts_EU_and_US_policies] Q: What are the challenges faced by the US policies in promoting the rapid adoption of alternative maritime fuels?
Added user message to memory: What are the challenges faced by the US policies in promoting t

In [34]:
print(response)

The EU emphasizes creating incentives like green premiums and sustainability reporting directives, the US focuses on bridging the cost gap and immediate action, while the IMO stresses ambitious targets, carbon intensity reduction goals, and mid-term measures for fuel standards and GHG emissions pricing in their arguments for the necessity to reach zero carbon shipping.


In [35]:
response = top_agent.query(
    "What are the main arguments the EU uses in regard to the nessecity of reaching zero carbon shipping?"
)

Added user message to memory: What are the main arguments the EU uses in regard to the nessecity of reaching zero carbon shipping?
=== Calling Function ===
Calling function: tool_MMMCZCS_Assessing_impacts_EU_and_US_policies with args: {"input":"EU arguments for zero carbon shipping"}
Added user message to memory: EU arguments for zero carbon shipping
=== Calling Function ===
Calling function: vector_tool_MMMCZCS_Assessing_impacts_EU_and_US_policies with args: {
  "input": "EU arguments for zero carbon shipping"
}
Got output: The EU emphasizes the importance of achieving net-zero emissions in the shipping industry by 2050 and aims for a 5-10% uptake of alternative fuels by 2030. To support these goals, EU policies focus on narrowing the cost gap between alternative fuels and conventional fossil fuels through subsidies on alternatives and penalties on traditional fuels. Additionally, the EU's FuelEU pooling mechanism offers time-limited first-mover rewards to incentivize early investment

In [None]:
print(str(response))


## Demo Interface

This section builds a small interactive interface for querying the top-level agent.



In [None]:
!pip install ipywidgets
import ipywidgets as widgets
from IPython.display import display
import asyncio

In [89]:
async def query_agent(question):
    response = top_agent.query(question)
    print("Question:\n", question)
    print("")
    print("Agent Response:\n", response)

In [100]:
# Text input for the question
question_input_style = {'description_width': 'initial'}


question_input = widgets.Text(
    value='',
    placeholder='You can also write your own question here...',
    style=question_input_style,
    disabled=False,
    layout={'width': '70%', 'height': '50px'}  # Adjust size as needed

)

# Predefined input button 1

predefined_question = "Compare the policies of the EU, US and IMO in regard to zero carbon shipping"
example_question_button = widgets.Button(
    description = predefined_question,
    button_style='info',  # Adds color styling to the button
    tooltip='Click to insert an example question',
    layout={'width': '70%', 'height': '40px'}  # Adjust size as needed
)


def insert_example_question(b):
    question_input.value = predefined_question

example_question_button.on_click(insert_example_question)


# Predefined input button 2

predefined_question2 = "What is the LCA methodology?"
example_question_button2 = widgets.Button(
    description = predefined_question2,
    button_style='info',  # Adds color styling to the button
    tooltip='Click to insert an example question',
    layout={'width': '70%', 'height': '40px'}  # Adjust size as needed
)

def insert_example_question2(b):
    question_input.value = predefined_question2

example_question_button2.on_click(insert_example_question2)

# Read-only text box listing the source publications
feedback_textbox = widgets.Textarea(
    value='\nThe following publications were used: \n\n- Assessing impacts of EU and US policies on accelerated deployment of alternative maritime fuels \n\n- MMMCZCS LCA Methodology for Calculating the GHG Intensity of Maritime Fuels \n\n- Implications of the 2023 IMO GHG Strategy for the Shipping Industry',
    placeholder='Type something',
    disabled=True,  # Make it read-only
    layout={'width': '100%', 'height': '140px'}  # Adjust size as needed
)

# Button to submit the question
submit_button = widgets.Button(
    description='Ask the MMMCZCS-Agent',
    disabled=False,
    button_style='success', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click to submit your question',
    layout={'width': '20%', 'height': '50px'}  # Adjust size as needed

)


# Output widget to display the response
output = widgets.Output()

display(example_question_button, example_question_button2, question_input, submit_button, feedback_textbox, output)

def on_button_clicked(b):
    # Display the question and response in the output widget
    with output:
        output.clear_output()
        print("Question:", question_input.value)
        # Asynchronously display the agent's response
        asyncio.run(query_agent(question_input.value))

submit_button.on_click(on_button_clicked)


Button(button_style='info', description='Compare the policies of the EU, US and IMO in regard to zero carbon s…

Button(button_style='info', description='What is the LCA methodology?', layout=Layout(height='40px', width='70…

Text(value='', layout=Layout(height='50px', width='70%'), placeholder='You can also write your own question he…

Button(button_style='success', description='Ask the MMMCZCS-Agent', layout=Layout(height='50px', width='20%'),…

Textarea(value='\nThe following publications were used: \n\n- Assessing impacts of EU and US policies on accelera…

Output()