# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

This notebook shows how to use an implementation of RAPTOR with llama-index, leveraging the RAPTOR llama-pack.

RAPTOR works by recursively clustering text chunks and summarizing each cluster, building a tree of layers that retrieval can then draw from.
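To make the recursion concrete, here is a minimal, self-contained sketch of the layer-building loop. It is not the RaptorPack implementation: `toy_cluster` (fixed-size buckets) and `toy_summarize` (first three words of each member) are hypothetical stand-ins for the embedding-based clustering and LLM summarization that RAPTOR actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)


def toy_cluster(nodes, size=2):
    """Stand-in for real clustering: group neighbours into fixed-size buckets."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def toy_summarize(cluster):
    """Stand-in for an LLM summary: keep the first three words of each member."""
    return " / ".join(" ".join(n.text.split()[:3]) for n in cluster)


def build_tree(chunks):
    """Return all layers, from the leaf chunks (level 0) up to a single root."""
    layer = [Node(text=c, level=0) for c in chunks]
    layers = [layer]
    while len(layer) > 1:
        level = layer[0].level + 1
        # Each cluster of the current layer becomes one summary node above it.
        layer = [
            Node(text=toy_summarize(cluster), level=level, children=cluster)
            for cluster in toy_cluster(layer)
        ]
        layers.append(layer)
    return layers


chunks = [
    "alpha beta gamma delta",
    "epsilon zeta eta theta",
    "iota kappa lambda mu",
    "nu xi omicron pi",
    "rho sigma tau upsilon",
]
layers = build_tree(chunks)
for depth, layer in enumerate(layers):
    print(f"level {depth}: {len(layer)} node(s)")
```

Every node above level 0 is a summary of its children, so higher levels cover more of the source text at a coarser granularity.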

There are two retrieval modes:
- `tree_traversal` -- traverse the tree of clusters, performing top-k retrieval at each level of the tree.
- `collapsed` -- treat the entire tree as one flat pool of nodes and perform a single top-k retrieval.
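The difference between the two modes can be sketched on a toy tree. Everything below is illustrative: the `TREE` dictionary, its 2-D "embeddings", and the helper functions are hypothetical and are not how `RaptorRetriever` stores or queries nodes.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical toy tree: node id -> (embedding, child ids). Leaves have no children.
TREE = {
    "root": ((1.0, 1.0), ["sum_a", "sum_b"]),
    "sum_a": ((1.0, 0.2), ["leaf_1", "leaf_2"]),
    "sum_b": ((0.2, 1.0), ["leaf_3", "leaf_4"]),
    "leaf_1": ((1.0, 0.0), []),
    "leaf_2": ((0.9, 0.4), []),
    "leaf_3": ((0.0, 1.0), []),
    "leaf_4": ((0.4, 0.9), []),
}


def top_k(query, ids, k):
    """Return the k ids whose embeddings are most similar to the query."""
    return sorted(ids, key=lambda i: cosine(query, TREE[i][0]), reverse=True)[:k]


def collapsed(query, k):
    """Rank every node in the tree at once, ignoring its structure."""
    return top_k(query, list(TREE), k)


def tree_traversal(query, k, root="root"):
    """Walk down from the root, keeping the top-k children at every level."""
    selected, frontier = [], [root]
    while frontier:
        children = [c for node in frontier for c in TREE[node][1]]
        frontier = top_k(query, children, k)
        selected.extend(frontier)
    return selected


query = (1.0, 0.1)
print("collapsed:     ", collapsed(query, k=2))
print("tree_traversal:", tree_traversal(query, k=2))
```

In this sketch, collapsed mode returns exactly `k` nodes and may mix summaries with leaves, while tree traversal returns up to `k` nodes per level, so its result grows with the depth of the tree.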

See [the paper](https://arxiv.org/abs/2401.18059) for full algorithm details.

## Setup

In [1]:
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface


In [1]:
!pip install llama-index ipywidgets


In [3]:
# !pip install llama-index llama-index-packs-raptor llama-index-vector-stores-chroma
# !pip install --upgrade transformers
!pip install llama-index-embeddings-huggingface




In [2]:
from llama_index.packs.raptor import RaptorPack

# optionally download the pack to inspect/modify it yourself!
# from llama_index.core.llama_pack import download_llama_pack
# RaptorPack = download_llama_pack("RaptorPack", "./raptor_pack")

In [5]:
!wget https://arxiv.org/pdf/2401.18059.pdf -O ./Raptor.pdf

/bin/bash: wget: command not found


In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Constructing the Clusters/Hierarchy Tree

Asynchronous (async) code lets a program start an operation, such as a network request, and continue with other work while waiting for the result, instead of blocking until the operation finishes. This keeps programs responsive during time-consuming operations.

An event loop is the construct that makes this possible: it continuously monitors and processes events or messages in a program. The event loop follows this basic pattern:
- Wait for something to happen (an event).
- Process that event when it occurs.
- Repeat: go back to waiting.

asyncio is Python's built-in library for writing asynchronous code. It allows you to write concurrent code using the async/await syntax.
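As a small illustration of the async/await syntax, the sketch below runs three hypothetical `slow_task` coroutines concurrently on one event loop. It is meant to be run as a plain Python script; inside a notebook cell, where a loop is already running, you would typically write `await main()` instead of `asyncio.run(main())`.

```python
import asyncio
import time


async def slow_task(name: str, delay: float) -> str:
    # `await` hands control back to the event loop while this task waits.
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> list:
    # gather() schedules all three coroutines and waits for every result.
    return await asyncio.gather(
        slow_task("a", 0.2),
        slow_task("b", 0.2),
        slow_task("c", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Because the three waits overlap, the total time is close to the longest single delay rather than the sum of all three.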

`nest_asyncio` is a library that patches asyncio to allow nested event loops. This is particularly important in Jupyter notebooks!

The problem:
- Jupyter notebooks already run their own event loop in the background.
- By default, asyncio does not allow starting or re-entering an event loop in a thread where one is already running.
- Calls such as `loop.run_until_complete()` then fail with errors like `RuntimeError: This event loop is already running`.

What does `nest_asyncio.apply()` do?
- Patches the asyncio event loop to allow nesting.
- Enables running async code inside environments that already have an event loop (like Jupyter).
- Prevents the `RuntimeError` when libraries try to run their own event loops.
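The restriction can be reproduced with the standard library alone. In this sketch, the hypothetical `notebook_like_cell` coroutine plays the role of code running inside Jupyter's loop and tries to start a second loop with `asyncio.run()`. Run it as a plain Python script, without `nest_asyncio` applied; the exact error text may differ between Python versions.

```python
import asyncio


async def fetch_answer() -> int:
    await asyncio.sleep(0)
    return 42


async def notebook_like_cell() -> str:
    # We are already inside a running loop here, like code in a Jupyter cell.
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_like_cell())
print(message)
```

After `nest_asyncio.apply()`, nested calls like the inner `asyncio.run()` are allowed, which is what lets libraries that run their own loop work inside a notebook.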

In [3]:
import nest_asyncio

nest_asyncio.apply()

The cell below first imports the `SimpleDirectoryReader` class, LlamaIndex's primary tool for loading various document types (PDFs, Word docs, text files, etc.) into a format that can be processed.

The commented-out line shows how to load a single PDF: `.load_data()` reads the file and converts it into LlamaIndex `Document` objects. The code that actually runs instead reads every file under a directory (recursively) with `iter_data()`, which yields the documents of one file at a time. Each `Document` contains:
- The text content extracted from the file
- Metadata (filename, page numbers, etc.)
- A unique document ID

The result, `documents`, is a list of `Document` objects. With the default PDF reader, a PDF typically yields one `Document` per page rather than a single `Document` for the whole file.

The loaded documents will then be:
- Split into chunks by the `SentenceSplitter`
- Embedded using the embedding model
- Clustered and summarized hierarchically by RAPTOR
- Stored in the vector database for retrieval

In [4]:
from llama_index.core import SimpleDirectoryReader

# documents = SimpleDirectoryReader(input_files=["./Raptor.pdf"]).load_data()

reader = SimpleDirectoryReader(
    input_dir="/storage/home/mfp5696/vxk_group/250630_nlp_hallucination/documents",
    recursive=True,
)

documents = []
for docs in reader.iter_data():  # one list of Documents per file
    documents.extend(docs)

print(len(documents))

3142


`SentenceSplitter` (imported from `llama_index.core.node_parser`) splits documents into smaller chunks, preferring to break at sentence boundaries.
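The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding window. This is only a sketch: the hypothetical `chunk_words` helper counts whitespace-separated words, whereas `SentenceSplitter` counts tokens and respects sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still seen in context by at least one chunk.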

Chroma is an open-source vector database designed for AI applications: a specialized database for storing and searching embeddings (numerical representations of text).

In [None]:
import os

# set before importing torch so the CUDA allocator picks it up
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import chromadb
import torch
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.vector_stores.chroma import ChromaVectorStore

# from llama_index.llms.openai import OpenAI
# from llama_index.embeddings.openai import OpenAIEmbedding

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():  # the device queries below fail without a GPU
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Create a database that persists on disk
client = chromadb.PersistentClient(path="./raptor_paper_db")

# Create (or reopen) a collection named "raptor"
collection = client.get_or_create_collection("raptor")

# LlamaIndex wrapper around Chroma; provides a unified interface for vector operations
vector_store = ChromaVectorStore(chroma_collection=collection)

# Llama models
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"

selected_model = LLAMA2_13B_CHAT


raptor_pack = RaptorPack(
    documents,
    embed_model=HuggingFaceEmbedding(
        model_name="intfloat/e5-base-v2",
        query_instruction="query: ",  # prefix used when embedding queries with E5 models
        text_instruction="passage: ",  # prefix used when embedding passages with E5 models
        # embed_batch_size=64
    ),  # used for embedding clusters
    # llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    llm=HuggingFaceLLM(
        context_window=4096,
        max_new_tokens=2048,
        generate_kwargs={"temperature": 0.1},
        # query_wrapper_prompt=query_wrapper_prompt,
        tokenizer_name=selected_model,
        model_name=selected_model,
        device_map="auto",
        # change these settings below depending on your GPU
        model_kwargs={"torch_dtype": torch.float16},
    ),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="collapsed",  # sets default mode
    transformations=[
        SentenceSplitter(chunk_size=400, chunk_overlap=50)
    ],  # transformations applied for ingestion
)

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


CUDA available: True
GPU: NVIDIA A100-PCIE-40GB
Memory: 42.41 GB


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generating embeddings for level 0.
Performing clustering for level 0.
Generating summaries for level 0 with 1227 clusters.


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


## Retrieval

In [None]:
nodes = raptor_pack.run("What baselines is raptor compared against?", mode="collapsed")
print(len(nodes))
print(nodes[0].text)

2
Specifically, RAPTOR’s F-1 scores are at least 1.8% points higher than DPR and at least 5.3% points
higher than BM25.
Retriever GPT-3 F-1 Match GPT-4 F-1 Match UnifiedQA F-1 Match
Title + Abstract 25.2 22.2 17.5
BM25 46.6 50.2 26.4
DPR 51.3 53.0 32.1
RAPTOR 53.1 55.7 36.6
Table 4: Comparison of accuracies on the QuAL-
ITY dev dataset for two different language mod-
els (GPT-3, UnifiedQA 3B) using various retrieval
methods. RAPTOR outperforms the baselines of
BM25 and DPR by at least 2.0% in accuracy.
Model GPT-3 Acc. UnifiedQA Acc.
BM25 57.3 49.9
DPR 60.4 53.9
RAPTOR 62.4 56.6
Table 5: Results on F-1 Match scores of various
models on the QASPER dataset.
Model F-1 Match
LongT5 XL (Guo et al., 2022) 53.1
CoLT5 XL (Ainslie et al., 2023) 53.9
RAPTOR + GPT-4 55.7Comparison to State-of-the-art Systems
Building upon our controlled comparisons,
we examine RAPTOR’s performance relative
to other state-of-the-art models.


In [None]:
nodes = raptor_pack.run(
    "What baselines is raptor compared against?", mode="tree_traversal"
)
print(len(nodes))
print(nodes[0].text)

Retrieved parent IDs from level 2: ['cc3b3f41-f4ca-4020-b11f-be7e0ce04c4f']
Retrieved 1 from parents at level 2.
Retrieved parent IDs from level 1: ['a4ca9426-a312-4a01-813a-c9b02aefc7e8']
Retrieved 2 from parents at level 1.
Retrieved parent IDs from level 0: ['63126782-2778-449f-99c0-1e8fd90caa36', 'd8f68d31-d878-41f1-aeb6-a7dde8ed5143']
Retrieved 4 from parents at level 0.
4
Specifically, RAPTOR’s F-1 scores are at least 1.8% points higher than DPR and at least 5.3% points
higher than BM25.
Retriever GPT-3 F-1 Match GPT-4 F-1 Match UnifiedQA F-1 Match
Title + Abstract 25.2 22.2 17.5
BM25 46.6 50.2 26.4
DPR 51.3 53.0 32.1
RAPTOR 53.1 55.7 36.6
Table 4: Comparison of accuracies on the QuAL-
ITY dev dataset for two different language mod-
els (GPT-3, UnifiedQA 3B) using various retrieval
methods. RAPTOR outperforms the baselines of
BM25 and DPR by at least 2.0% in accuracy.
Model GPT-3 Acc. UnifiedQA Acc.
BM25 57.3 49.9
DPR 60.4 53.9
RAPTOR 62.4 56.6
Table 5: Results on F-1 Match score

## Loading

Since we saved the tree to a vector store, we can load it again later! (For local vector stores, the retriever has `persist` and `from_persist_dir` methods.)

In [None]:
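from llama_index.llms.openai import OpenAI
from llama_index.packs.raptor import RaptorRetriever

retriever = RaptorRetriever(
    [],
    # must be the same embedding model that was used to build the index above,
    # otherwise query vectors will not match the stored vectors
    embed_model=HuggingFaceEmbedding(
        model_name="intfloat/e5-base-v2",
        query_instruction="query: ",
        text_instruction="passage: ",
    ),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
)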

In [None]:
# if using a default vector store
# retriever.persist("./persist")
# retriever = RaptorRetriever.from_persist_dir("./persist", ...)

## Query Engine

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1)
)

In [None]:
response = query_engine.query("What baselines was RAPTOR compared against?")

In [None]:
print(str(response))

BM25 and DPR
