# RAPTOR: Recursive abstractive processing and thematic organization for retrieval

In many real-world scenarios, documents are long and information-dense, yet users often ask very specific or high-level questions. Flat retrieval systems can either miss the big picture or drown the model in irrelevant details. RAPTOR (Recursive Abstractive Processing and Thematic Organization for Retrieval) addresses this by building a multi-level abstraction of the input documents, clustering semantically similar content, summarizing each cluster, and repeating this process hierarchically. This results in a tree of document abstractions — where high-level summaries can guide retrieval towards relevant details when answering a query.

This notebook demonstrates how to build a RAPTOR tree over long documents, organize that tree into a searchable vectorstore, retrieve relevant information hierarchically using both semantics and context, and generate concise, accurate answers using only the most useful content.


In [1]:
import os

# Limit OpenMP to one thread to avoid MKL threading issues on Windows
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import pandas as pd
from typing import List, Dict, Any
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from langchain.document_loaders import PyPDFLoader
from langchain.chains.llm import LLMChain
from langchain.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.schema import AIMessage
from langchain.docstore.document import Document

import matplotlib.pyplot as plt
import logging
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

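# Make sure the OpenAI API key is available (load_dotenv above reads it from .env if present)
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file")
os.environ["OPENAI_API_KEY"] = api_key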

### Initialize logging, embeddings, and language model

We initialize our logging, language model, and embedding model below.
- Logging: This helps us monitor the process, especially when building hierarchical trees or running long summarization tasks. It is also useful for debugging and understanding how the system behaves.
- Embeddings: Text embeddings are dense vector representations of text that capture semantic meaning. We will use them for clustering and similarity search.
- LLM: This will be the core engine for summarization, contextual compression, and final answer generation.

In [2]:
# Set up logging configuration to monitor pipeline stages and status messages
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Initialize embeddings for document and query representation
embeddings = OpenAIEmbeddings()

# Initialize chat model for summarization and generation
llm = ChatOpenAI(model_name="gpt-4o-mini-2024-07-18")

- The logging setup ensures that every major step—like embedding, clustering, summarizing—gets logged with a timestamp and severity level (info, warning, etc.).
- The `OpenAIEmbeddings` instance will later be used to embed both original documents and their summaries, so that we can organize and retrieve them efficiently using similarity search.
- The `ChatOpenAI` model will power all natural language tasks, such as generating summaries at each hierarchical level and eventually answering the user's query using compressed, relevant content.

### Helper functions

At the heart of RAPTOR is the repeated cycle of embedding, clustering, and summarizing at multiple hierarchical levels. To support this, we define a set of modular utility functions that each handle a core building block of the process.

These utilities will be reused throughout the tree-building pipeline, allowing us to abstract away the lower-level steps of transforming raw documents into structured, semantically meaningful summaries.

In [3]:
def extract_text(item):
    """
    Extracts the raw text content from either a plain string or a LangChain AIMessage object.
    This allows flexibility when working with both user input and LLM responses.
    """
    if isinstance(item, AIMessage):
        return item.content # Pull the actual text from the AI response
    return item  # Assume the input is already a plain string

def embed_texts(texts: List[str]) -> List[List[float]]:
    """Embed texts using OpenAIEmbeddings. Embeddings are later used for clustering and similarity-based retrieval."""
    # Log how many documents we are processing
    logging.info(f"Embedding {len(texts)} texts")
    # Converts a list of texts into their corresponding embedding vectors
    return embeddings.embed_documents([extract_text(text) for text in texts])

def perform_clustering(embeddings: np.ndarray, n_clusters: int = 10) -> np.ndarray:
    """
    Perform clustering on embeddings using Gaussian Mixture Model (GMM).
    Each cluster ideally represents a thematic grouping of semantically related texts.
    """
    # Log how many clusters we are trying to create
    logging.info(f"Performing clustering with {n_clusters} clusters")
    # Initialize the Gaussian Mixture model with a fixed random seed for reproducibility
    gm = GaussianMixture(n_components=n_clusters, random_state=42)
    # Fit the GMM to the data and return the predicted cluster labels for each embedding
    return gm.fit_predict(embeddings)

def summarize_texts(texts: List[str]) -> str:
    """
    Generates a concise summary of a list of texts using the LLM.
    This is used to create the higher-level nodes in the RAPTOR tree.
    """
    # Log the number of text chunks being summarized at this step
    logging.info(f"Summarizing {len(texts)} texts")
    # Define the prompt template for the LLM to summarize the input
    prompt = ChatPromptTemplate.from_template(
        "Summarize the following text concisely:\n\n{text}"
    )
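    # Combine the prompt template and the LLM into a runnable LangChain chain
    chain = prompt | llm
    # Join the cluster's texts into one block so the prompt does not receive a Python list repr
    input_data = {"text": "\n\n".join(extract_text(text) for text in texts)}
    # Run the chain and return the summary as a plain string
    return extract_text(chain.invoke(input_data))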

def visualize_clusters(embeddings: np.ndarray, labels: np.ndarray, level: int):
    """
    Uses PCA to reduce embedding dimensionality and plots a scatterplot of the clusters.
    Helpful for visually inspecting how well the semantic clustering performed at each level.
    """
    # Use PCA to reduce high-dimensional embeddings down to 2D for visualization
    pca = PCA(n_components=2)
    reduced_embeddings = pca.fit_transform(embeddings)

    # Create a scatter plot of points colored by cluster assignment
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=labels, cmap='viridis')
    plt.colorbar(scatter)  # Add a color legend for clusters
    plt.title(f'Cluster Visualization - Level {level}')
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.show()

Here, we define the modular operations that make RAPTOR's recursive structure work:
- The `extract_text` function gives us a unified way to handle inputs from different stages (e.g., strings or AIMessage outputs).
- `embed_texts` transforms raw or summarized text into dense vector representations, which are the backbone of clustering and retrieval in RAPTOR.
- `perform_clustering` applies a Gaussian Mixture Model to the embeddings to detect thematic groupings. These clusters will serve as the basis for summarizing and compressing information at each level.
- `summarize_texts` generates one summary per cluster, forming the input for the next level of the hierarchy.
- `visualize_clusters` helps us inspect whether the clusters formed are meaningful, which can be useful during development or troubleshooting.
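
To make the hand-off between `perform_clustering` and `summarize_texts` concrete, here is a minimal sketch of how an array of cluster labels partitions the texts into per-cluster groups. The texts and labels are made up for illustration, and `group_by_cluster` is a hypothetical helper that is not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(texts: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Partition texts by cluster label, preserving input order within each group."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    groups: Dict[int, List[str]] = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[int(label)].append(text)
    return dict(groups)


# Hypothetical clustering output for five chunks and two clusters
toy_texts = [
    "CO2 traps heat",
    "Methane traps heat",
    "Sea levels rise",
    "Ice sheets melt",
    "Glaciers retreat",
]
toy_labels = [0, 0, 1, 1, 1]

groups = group_by_cluster(toy_texts, toy_labels)
for cluster_id, members in sorted(groups.items()):
    # In the real pipeline, each group would be summarized into one parent node
    print(cluster_id, members)
```

Each group then becomes a single summary, so the number of nodes at the next level is at most the number of non-empty clusters.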

### Building the RAPTOR tree

This recursive process compresses large collections of documents into semantically meaningful summaries across multiple levels. It takes the original texts, embeds them, clusters them into thematic groups, summarizes each group, and repeats this process on the summaries. This builds a tree from raw content (level 0) up to abstracted summaries (level N).

At every iteration, we track metadata like cluster origin, parent-child relations, and summary lineage. This hierarchy is crucial for later tasks like drill-down exploration, retrieval, or visualizing abstraction paths.
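
As an illustration of why this lineage metadata matters, the sketch below resolves a summary node back to the level-0 chunks it ultimately covers. It assumes every node, including the original chunks, carries a unique `id` and that summaries list their children in `child_ids`. The node table and the `collect_leaves` helper are hypothetical and only mirror the field names used in this notebook.

```python
from typing import Dict, List

# Hypothetical node metadata keyed by node ID
nodes: Dict[str, dict] = {
    "doc_0": {"level": 0, "child_ids": []},
    "doc_1": {"level": 0, "child_ids": []},
    "doc_2": {"level": 0, "child_ids": []},
    "summary_1_0": {"level": 1, "child_ids": ["doc_0", "doc_1"]},
    "summary_1_1": {"level": 1, "child_ids": ["doc_2"]},
    "summary_2_0": {"level": 2, "child_ids": ["summary_1_0", "summary_1_1"]},
}


def collect_leaves(node_id: str, nodes: Dict[str, dict]) -> List[str]:
    """Return the IDs of all level-0 chunks that a node ultimately summarizes."""
    children = nodes[node_id].get("child_ids", [])
    if not children:
        return [node_id]  # a node without children is an original chunk
    leaves: List[str] = []
    for child_id in children:
        leaves.extend(collect_leaves(child_id, nodes))
    return leaves


print(collect_leaves("summary_2_0", nodes))  # ['doc_0', 'doc_1', 'doc_2']
print(collect_leaves("summary_1_1", nodes))  # ['doc_2']
```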


In [4]:
def build_raptor_tree(texts: List[str], max_levels: int = 3) -> Dict[int, pd.DataFrame]:
    """Build the RAPTOR tree structure with level metadata and parent-child relationships."""
    # Dictionary to store each level's DataFrame (text, embeddings, clusters, metadata)
    results = {}
    # Start by extracting raw text (in case some are AIMessage objects)
    current_texts = [extract_text(text) for text in texts]
    # Initialize metadata for original texts: level 0, no parents, labeled as "original", each with a unique ID
    current_metadata = [{"level": 0, "origin": "original", "parent_id": None, "id": f"doc_{i}"} for i in range(len(texts))]

    # Loop over levels to progressively build the tree from specific (level 0) to abstract (level N)
    for level in range(1, max_levels + 1):
        logging.info(f"Processing level {level}")

        # Step 1: Convert texts into embedding vectors
        embeddings = embed_texts(current_texts)

        # Step 2: Define number of clusters: at most 10, at most half the input size, never fewer than 1
        n_clusters = max(1, min(10, len(current_texts) // 2))

        # Step 3: Cluster the embeddings to group them into thematic units
        cluster_labels = perform_clustering(np.array(embeddings), n_clusters)

        # Step 4: Save this level's data into a structured DataFrame
        df = pd.DataFrame({
            'text': current_texts,
            'embedding': embeddings,
            'cluster': cluster_labels,
            'metadata': current_metadata
        })

        results[level-1] = df  # Store this level's data

        summaries = []  # To hold new summaries (higher-level concepts)
        new_metadata = []  # Metadata for the next level of summaries

        # Step 5: For each cluster, summarize its content and record lineage
        for cluster in df['cluster'].unique():
            cluster_docs = df[df['cluster'] == cluster]
            cluster_texts = cluster_docs['text'].tolist()
            cluster_metadata = cluster_docs['metadata'].tolist()

            # Summarize the grouped texts into a single higher-level abstraction
            summary = summarize_texts(cluster_texts)
            summaries.append(summary)

            # Build metadata that connects this summary to its child documents
            new_metadata.append({
                "level": level,
                "origin": f"summary_of_cluster_{cluster}_level_{level-1}",
                "child_ids": [meta.get('id') for meta in cluster_metadata],
                "id": f"summary_{level}_{cluster}"
            })

        # Prepare for next level: summaries become new input texts
        current_texts = summaries
        current_metadata = new_metadata

        # Stop if only one summary is left — no more meaningful abstraction possible
        if len(current_texts) <= 1:
            results[level] = pd.DataFrame({
                'text': current_texts,
                'embedding': embed_texts(current_texts),
                'cluster': [0],
                'metadata': current_metadata
            })
            logging.info(f"Stopping at level {level} as we have only one summary")
            break

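    else:
        # The loop ran through max_levels without collapsing to a single summary:
        # store the top-level summaries too so they are not lost
        if current_texts:
            results[max_levels] = pd.DataFrame({
                'text': current_texts,
                'embedding': embed_texts(current_texts),
                'cluster': [0] * len(current_texts),
                'metadata': current_metadata
            })

    return results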

This is the core logic that builds the RAPTOR architecture layer by layer — it constructs a hierarchical tree where each level is a progressively more abstract representation of the data below it.

At level 0, we just have the original raw text chunks. Each subsequent level embeds those texts into vectors, clusters them into topic groups, summarizes each group into a single node, and continues upward. Each summary node tracks which inputs it represents via metadata. This metadata is essential for backtracking or navigating the tree later — e.g., reconstructing all children from a parent, or tracing the lineage of a summary.

- It tracks each document and summary across levels using metadata, enabling easy traversal up and down the hierarchy.
- Embeddings are created only for the current working set of texts, allowing each clustering step to reflect only local semantic similarity.
- A Gaussian Mixture Model determines how documents cluster together, with the number of components set from how many texts are in play (`min(10, n // 2)`).
- The LLM summarization step distills each cluster into a higher-level concept, which then serves as input for the next iteration.
- The tree terminates early if only one summary remains—indicating that the abstraction has converged to a single high-level idea.
- The `results` dictionary effectively stores each "layer" of the tree, with DataFrames that hold the actual content, its representation, and metadata.

This modular and recursive structure is what allows RAPTOR to build layered, explainable summaries from complex document collections.
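
To see how quickly the tree narrows, the sketch below applies the `min(10, n // 2)` cluster rule level by level. It only computes upper bounds: the mixture model can leave some components empty, so the real tree may have fewer nodes per level. `max_nodes_per_level` is a hypothetical helper written for this illustration.

```python
from typing import List


def max_nodes_per_level(n_texts: int, max_levels: int = 3, cap: int = 10) -> List[int]:
    """Upper bound on node counts per level under the min(cap, n // 2) cluster rule."""
    sizes = [n_texts]
    current = n_texts
    for _ in range(max_levels):
        n_clusters = min(cap, current // 2)
        if n_clusters < 1:
            break  # fewer than two texts: nothing left to cluster
        current = n_clusters  # at most one summary per cluster
        sizes.append(current)
        if current <= 1:
            break  # converged to a single root summary
    return sizes


print(max_nodes_per_level(10))   # [10, 5, 2, 1]
print(max_nodes_per_level(100))  # [100, 10, 5, 2]
```

With 10 input chunks the tree can reach a single root at level 3, while 100 chunks still leave up to two summaries at the top when `max_levels` is 3.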

### Vectorstore construction
Now that the full RAPTOR tree is built — with hierarchical summaries and metadata — we can turn the entire structure into a searchable knowledge base. This next step embeds every node in the tree (original texts and all summaries) into vector space and stores them inside a FAISS vector index.

This enables efficient semantic retrieval: we can search for content by meaning, not just by keyword. And because each node still carries metadata like its abstraction level or parent lineage, we can filter our searches by depth, origin, or other properties.

In [5]:
def build_vectorstore(tree_results: Dict[int, pd.DataFrame]) -> FAISS:
    """Build a FAISS vectorstore from all texts in the RAPTOR tree."""
    all_texts = []  # To collect all node texts (raw and summarized)
    all_embeddings = []  # Corresponding embeddings
    all_metadatas = []  # Metadata for lineage, level, etc.

    # Loop over each level in the tree and extract its contents
    for level, df in tree_results.items():
        # Collect the actual text content (unwrap AIMessage objects so we store the text, not the object repr)
        all_texts.extend([str(extract_text(text)) for text in df['text'].tolist()])

        # Handle embeddings in both list and ndarray formats
        all_embeddings.extend([embedding.tolist() if isinstance(embedding, np.ndarray) else embedding for embedding in df['embedding'].tolist()])
        # Append the associated metadata
        all_metadatas.extend(df['metadata'].tolist())

    logging.info(f"Building vectorstore with {len(all_texts)} texts")

    # Create LangChain Document objects manually to ensure correct types
    documents = [Document(page_content=str(text), metadata=metadata)
                 for text, metadata in zip(all_texts, all_metadatas)]

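    # Build and return the FAISS index; from_documents re-embeds every document with the
    # embedding model, so the precomputed vectors collected above are not reused here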
    return FAISS.from_documents(documents, embeddings)

Now, the RAPTOR hierarchy is searchable. Each node is tied to its level and origin, enabling level-specific or lineage-based filtering.
- Every node in the tree, regardless of level, gets embedded into a vector space.
- Texts and their metadata are wrapped into standardized Document objects.
- FAISS — a high-performance similarity search library — indexes all vectors efficiently.
- The result is a powerful vectorstore where we can:
  - Search across the whole tree for semantically similar ideas
  - Filter results by level, origin cluster, or summary ID
  - Retrieve abstract summaries or drill down into specifics

### Hierarchical tree traversal retrieval
The RAPTOR tree is not just for summarization — it’s also designed for multi-level semantic retrieval. Instead of flattening the tree and searching all documents equally, this method queries the tree from the top down, traversing through its summary nodes. This way, we start with high-level concepts and drill down only into the most relevant clusters.

This approach has a number of benefits:
- Improves precision by narrowing the search to semantically aligned paths.
- Leverages the tree’s metadata (like levels and parent-child relationships).
- Allows retrieval to dynamically mix abstract summaries and detailed documents.

In the function below, we:
- Embed the user query once.
- Begin searching at the highest abstraction level.
- Recursively follow `child_ids` of top-level matches into deeper levels.
- Collect documents across levels and return them ordered from the most abstract level down, ranked by relevance within each level.

In [6]:
def tree_traversal_retrieval(query: str, vectorstore: FAISS, k: int = 3) -> List[Document]:
    """Perform tree traversal retrieval."""
    # Convert the query into its embedding form once
    query_embedding = embeddings.embed_query(query)

    def retrieve_level(level: int, allowed_ids: List[str] = None) -> List[Document]:
        """
        Recursively retrieve documents by traversing the RAPTOR tree downward.
        If allowed_ids are specified, limits the search to nodes with those IDs
        (the children of the matches found one level up).
        """
        if allowed_ids:
            # Restrict the search to this level and to the children selected above
            level_filter = lambda meta: meta.get('level') == level and meta.get('id') in allowed_ids
        else:
            # First level search: no filtering by lineage
            level_filter = lambda meta: meta.get('level') == level

        # FAISS returns (Document, score) pairs; keep only the documents
        scored_docs = vectorstore.similarity_search_with_score_by_vector(
            query_embedding,
            k=k,
            filter=level_filter
        )
        docs = [doc for doc, _ in scored_docs]

        # Base case: if we have reached level 0 or found no matches, return what we have
        if not docs or level == 0:
            return docs

        # Otherwise, go one level deeper by collecting child document IDs
        child_ids = [doc.metadata.get('child_ids', []) for doc in docs]
        child_ids = [item for sublist in child_ids for item in sublist]  # Flatten the list

        # Recursively collect results from lower level nodes
        child_docs = retrieve_level(level - 1, child_ids)
        # Combine current level docs with those retrieved below
        return docs + child_docs

    # Start from the highest available level in the tree
    # (InMemoryDocstore keeps its documents in the private `_dict` attribute)
    max_level = max(doc.metadata['level'] for doc in vectorstore.docstore._dict.values())
    return retrieve_level(max_level)

Instead of just searching over a flat document list, this retrieval strategy respects the hierarchy of our RAPTOR tree:
- The top-level summaries act as conceptual filters. They catch broad semantic matches.
- Once top-level nodes are selected, the algorithm descends into their child summaries or documents, limiting the search space to relevant subtrees.
- Each recursive pass refines the relevance by tracing semantic lineage, from abstract → specific.
- This method returns a blend of high-level and low-level nodes, all connected through the tree’s structure.

This hierarchical approach is especially useful when dealing with long documents, large collections, or any scenario where abstract themes and detailed evidence both matter. It is a key part of what makes RAPTOR not just a summarizer, but a retrieval-first architecture.
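
The top-down descent can be traced on a toy tree. The sketch below is an iterative variant of the same idea, using dot products over hand-written vectors instead of a vector store. The tree, the vectors, and the `traverse` helper are hypothetical and exist only to show how the candidate set shrinks at each level.

```python
from typing import Dict, List, Optional, Sequence, Set

# Hypothetical three-level tree; "vec" stands in for each node's embedding
tree: Dict[str, dict] = {
    "summary_2_0": {"level": 2, "vec": [0.7, 0.7], "child_ids": ["summary_1_0", "summary_1_1"]},
    "summary_1_0": {"level": 1, "vec": [0.9, 0.2], "child_ids": ["doc_0", "doc_1"]},
    "summary_1_1": {"level": 1, "vec": [0.1, 0.9], "child_ids": ["doc_2", "doc_3"]},
    "doc_0": {"level": 0, "vec": [1.0, 0.0], "child_ids": []},
    "doc_1": {"level": 0, "vec": [0.8, 0.3], "child_ids": []},
    "doc_2": {"level": 0, "vec": [0.0, 1.0], "child_ids": []},
    "doc_3": {"level": 0, "vec": [0.2, 0.8], "child_ids": []},
}


def score(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product as a simple relevance score."""
    return sum(x * y for x, y in zip(a, b))


def traverse(query_vec: Sequence[float], tree: Dict[str, dict], top_level: int, k: int = 1) -> List[str]:
    """Walk from top_level down to level 0, keeping the k best nodes per level."""
    visited: List[str] = []
    allowed: Optional[Set[str]] = None  # None means "no lineage restriction yet"
    for level in range(top_level, -1, -1):
        candidates = [
            node_id for node_id, node in tree.items()
            if node["level"] == level and (allowed is None or node_id in allowed)
        ]
        ranked = sorted(candidates, key=lambda nid: score(query_vec, tree[nid]["vec"]), reverse=True)[:k]
        if not ranked:
            break
        visited.extend(ranked)
        # Only children of the selected nodes remain eligible at the next level
        allowed = {child for nid in ranked for child in tree[nid]["child_ids"]}
    return visited


path = traverse([1.0, 0.0], tree, top_level=2, k=1)
print(path)  # ['summary_2_0', 'summary_1_0', 'doc_0']
```

With `k=1` the query follows a single branch and never scores `doc_2` or `doc_3`, which is the pruning effect the traversal is meant to provide.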


### Create contextual retriever
Retrieval alone can yield overly broad results. To tighten this up, we wrap the retriever in LangChain's `ContextualCompressionRetriever`, which improves relevance and conciseness by post-processing the retrieved documents with an LLM.
This component works like a semantic sieve:
- It takes in raw retrieved content.
- Runs it through a compression chain powered by a prompt + LLM.
- Outputs only the most contextually relevant snippets, filtered through the lens of the user’s question.

This is especially useful when:
- We are retrieving from large, verbose documents.
- We want faster downstream inference (e.g., in a QA chain).
- We need higher precision in multi-level chains.

In [7]:
def create_retriever(vectorstore: FAISS) -> ContextualCompressionRetriever:
    """
    Create a retriever with contextual compression.
    It wraps the FAISS-based retriever in a ContextualCompressionRetriever, which uses an LLM to extract only the parts of each document that are relevant to the current query.
    """
    logging.info("Creating contextual compression retriever")

    # Step 1: Create a basic retriever from the vectorstore
    base_retriever = vectorstore.as_retriever()

    # Step 2: Define a prompt that instructs the LLM to extract only what's relevant
    prompt = ChatPromptTemplate.from_template(
        "Given the following context and question, extract only the relevant information for answering the question:\n\n"
        "Context: {context}\n"
        "Question: {question}\n\n"
        "Relevant Information:"
    )

    # Step 3: Wrap the LLM with a chain that knows how to use the prompt
    extractor = LLMChainExtractor.from_llm(llm, prompt=prompt)

    # Step 4: Create the contextual compression retriever
    return ContextualCompressionRetriever(
        base_compressor=extractor,  # LLM-based filter
        base_retriever=base_retriever  # Initial document retriever
    )

This function upgrades a basic vector retriever into a smarter one that thinks before returning documents:
- We first wrap the FAISS vectorstore into a standard retriever object. This handles the initial similarity-based filtering based on the user’s query.
- Then we define a prompt template that tells the LLM what to do: from the retrieved context, extract only the parts that help answer the query.
- This prompt is wrapped into a LangChain `LLMChainExtractor`, which creates a callable component that compresses documents into shorter, more relevant chunks.
- Finally, we glue the extractor and retriever together into a `ContextualCompressionRetriever`.

At runtime, the retriever:
1. Performs a similarity search to get top-k document chunks.
2. Forwards those chunks along with the query to the LLM extractor.
3. Returns only the filtered, compressed pieces for downstream use (e.g., summarization, QA, synthesis).

This LLM-powered compression is essential when working with noisy or layered data (like RAPTOR trees), helping our agent stay focused and efficient.
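
The retrieve-then-compress flow can be illustrated offline. In the sketch below a simple keyword-overlap filter stands in for the LLM extractor. This is an assumption made purely so the example runs without an API key, and it is not how `LLMChainExtractor` decides relevance. `keyword_compress` and `compress_documents` are hypothetical names.

```python
import re
from typing import List

STOPWORDS = {"what", "is", "the", "a", "an", "of", "and", "to", "in"}


def keyword_compress(document: str, question: str) -> str:
    """Keep only sentences that share at least one content word with the question."""
    query_terms = {w for w in re.findall(r"[a-z0-9]+", question.lower()) if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if query_terms & set(re.findall(r"[a-z0-9]+", s.lower()))]
    return " ".join(kept)


def compress_documents(documents: List[str], question: str) -> List[str]:
    """Compress each retrieved document and drop those with nothing relevant left."""
    compressed = (keyword_compress(doc, question) for doc in documents)
    return [text for text in compressed if text]


retrieved = [
    "The greenhouse effect warms the planet. Coral reefs host many species.",
    "Wind turbines generate electricity. Solar panels are getting cheaper.",
]
compressed_docs = compress_documents(retrieved, "What is the greenhouse effect?")
print(compressed_docs)  # ['The greenhouse effect warms the planet.']
```

The second document is dropped entirely and the first is trimmed to its relevant sentence, which mirrors the shape of what the compression retriever returns.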

### Define hierarchical retrieval
RAPTOR doesn't just flatten a knowledge tree into embeddings; it respects and leverages hierarchy. So when it comes to retrieval, we don't search all nodes at once: we walk the tree from the top down, one level at a time.

This retrieval strategy starts at the highest level of abstraction (summaries), working down to the original source texts through lineage metadata. At each level, it:
- Retrieves documents matching the query.
- Traces the metadata of retrieved summaries to their child documents in lower levels.
- Dynamically updates the query to reflect the children of relevant results.
- Collects documents across levels to build a layered, comprehensive context set.

This design is powerful for semantic search where themes span different depths—from broad overviews to specific instances.

In [8]:
def hierarchical_retrieval(query: str, retriever: ContextualCompressionRetriever, max_level: int) -> List[Document]:
    """Perform hierarchical retrieval starting from the highest level, handling potential None values."""
    all_retrieved_docs = []  # Store results from all levels

    # Traverse levels from top (abstract summaries) to bottom (original docs)
    for level in range(max_level, -1, -1):
        # Retrieve documents at this level with the current query
        level_docs = retriever.invoke(
            query,
            filter=lambda meta: meta['level'] == level  # Ensure level matches
        )
        all_retrieved_docs.extend(level_docs)

        # If documents found and more levels exist below, retrieve their children from the next level down
        if level_docs and level > 0:
            # Gather all child IDs from retrieved documents' metadata
            child_ids = [doc.metadata.get('child_ids', []) for doc in level_docs]
            child_ids = [item for sublist in child_ids for item in sublist if item is not None]  # Flatten and filter None

            # Refine the query to guide the next retrieval level
            if child_ids:  # Only modify query if there are valid child IDs
                child_query = f" AND id:({' OR '.join(str(id) for id in child_ids)})"
                query += child_query  # Append ID-based constraint to the query

    return all_retrieved_docs

This function traverses the RAPTOR tree from abstract to concrete, collecting a blend of documents relevant to the query at multiple layers of semantic depth.
- It begins at the highest level (i.e., summaries), filtering documents based on their metadata level.
- Retrieved summaries are inspected for `child_ids`, which are metadata pointers to lower-level documents they represent.
- If child IDs are found, they are appended to the query string for the next level down. The vector store treats this suffix as ordinary query text rather than a structured filter, so it nudges retrieval toward the relevant subtree instead of strictly restricting it.
- This cycle repeats until it hits level 0 (raw text chunks), gathering documents at each step.

The end result is a multi-level result set that combines high-level abstractions with supporting details from the lower levels of the tree. This helps maintain both semantic alignment and contextual grounding in our final outputs. This pattern is especially useful when:
- We want to answer questions that require high-level synthesis and supporting evidence.
- We want to visualize or explain how an abstract conclusion was derived from concrete inputs.
- We are building chains that depend on hierarchical depth (e.g., summarization + QA).

By traversing intelligently rather than exhaustively, RAPTOR avoids noisy results and enables meaningful semantic zooming during retrieval.
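
To show what the query refinement step produces, the sketch below isolates the flatten-and-append logic from `hierarchical_retrieval`. `append_child_constraint` is a hypothetical helper and the IDs are made up. The resulting suffix is plain text that gets embedded along with the question.

```python
from typing import List, Optional


def append_child_constraint(query: str, child_id_lists: List[List[Optional[str]]]) -> str:
    """Flatten child IDs, drop None entries, and append them to the query as a text suffix."""
    child_ids = [cid for sublist in child_id_lists for cid in sublist if cid is not None]
    if not child_ids:
        return query  # nothing to add, keep the query unchanged
    return query + f" AND id:({' OR '.join(str(cid) for cid in child_ids)})"


refined = append_child_constraint(
    "What is the greenhouse effect?",
    [["summary_1_0", None], ["summary_1_2"]],
)
print(refined)
# What is the greenhouse effect? AND id:(summary_1_0 OR summary_1_2)
```

Because the suffix accumulates at every level, the query text grows as the traversal descends, which is worth keeping in mind when inspecting logs or the question passed to the compressor.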


### RAPTOR query pipeline (Online inference phase)
This is where everything comes together. Once the tree is built and the retriever is set up, the `raptor_query` function acts as the end-to-end orchestration point for answering user questions. It coordinates all the moving parts:
- Starts by running hierarchical retrieval to get semantically and structurally relevant documents.
- Formats these results with metadata for transparency and debugging.
- Uses a context-aware prompt to query the LLM for a grounded answer.
- Returns a structured payload including the answer, the exact context used, and metadata-rich document lineage.

This pattern allows for not just answer generation, but also deep introspection of how and why the model answered the way it did.

In [9]:
def raptor_query(query: str, retriever: ContextualCompressionRetriever, max_level: int) -> Dict[str, Any]:
    """Process a query using the RAPTOR system with hierarchical retrieval and LLM-based answer generation."""
    logging.info(f"Processing query: {query}")

    # Step 1: Retrieve relevant documents across the RAPTOR hierarchy
    relevant_docs = hierarchical_retrieval(query, retriever, max_level)

    # Step 2: Format the retrieved docs with metadata for inspection
    doc_details = []
    for i, doc in enumerate(relevant_docs, 1):
        doc_details.append({
            "index": i,
            "content": doc.page_content,
            "metadata": doc.metadata,
            "level": doc.metadata.get('level', 'Unknown'),
            "similarity_score": doc.metadata.get('score', 'N/A')  # May be unavailable
        })

    # Step 3: Combine the content into a single context string
    context = "\n\n".join([doc.page_content for doc in relevant_docs])

    # Step 4: Use the LLM to generate an answer based on the context
    prompt = ChatPromptTemplate.from_template(
        "Given the following context, please answer the question:\n\n"
        "Context: {context}\n\n"
        "Question: {question}\n\n"
        "Answer:"
    )
    chain = prompt | llm
    answer = chain.invoke({"context": context, "question": query})

    logging.info("Query processing completed")

    # Step 5: Return structured results
    result = {
        "query": query,
        "retrieved_documents": doc_details,
        "num_docs_retrieved": len(relevant_docs),
        "context_used": context,
        "answer": answer,
        "model_used": llm.model_name,
    }

    return result

def print_query_details(result: Dict[str, Any]):
    """Print detailed information about the query process, including tree level metadata."""
    print(f"Query: {result['query']}")
    print(f"\nNumber of documents retrieved: {result['num_docs_retrieved']}")
    print(f"\nRetrieved Documents:")
    for doc in result['retrieved_documents']:
        print(f"  Document {doc['index']}:")
        print(f"    Content: {doc['content'][:100]}...")  # Show first 100 characters
        print(f"    Similarity Score: {doc['similarity_score']}")
        print(f"    Tree Level: {doc['metadata'].get('level', 'Unknown')}")
        print(f"    Origin: {doc['metadata'].get('origin', 'Unknown')}")
        if 'child_ids' in doc['metadata']:
            print(f"    Number of Child Documents: {len(doc['metadata']['child_ids'])}")
        print()

    print(f"\nContext used for answer generation:")
    print(result['context_used'])

    print(f"\nGenerated Answer:")
    print(result['answer'].content)

    print(f"\nModel Used: {result['model_used']}")

This is the final pipeline that connects all parts of RAPTOR into a working system that can handle real-world queries. Here is what it is doing behind the scenes:
- It begins by invoking the hierarchical retriever, which explores the RAPTOR tree from top to bottom, surfacing documents whose lineage is aligned with the query.
- Each document's metadata—such as its abstraction level, origin cluster, or similarity score—is preserved. This allows us to trace answers back to specific sources or thematic clusters.
- The content of all retrieved docs is stitched together into a single context block.
- This context is passed into a prompt template and executed via a LangChain chain, producing an answer that is grounded in the retrieved material.
- All outputs are returned in a structured dictionary, making the results easy to inspect, display, or log.
- A helper function is included to pretty-print everything.

This method is a demonstration of RAPTOR’s hybrid strengths: structured semantics and LLM reasoning, working together.
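
The context-assembly step is easy to inspect in isolation. The sketch below joins retrieved content into one context block and fills the same prompt wording with plain `str.format`. Plain dictionaries stand in for LangChain `Document` objects, and `build_prompt` is a hypothetical helper, so this shows only the text the model would receive, not the chain itself.

```python
from typing import Any, Dict, List

PROMPT = (
    "Given the following context, please answer the question:\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_prompt(question: str, docs: List[Dict[str, Any]]) -> str:
    """Join document contents with blank lines and fill the answer prompt."""
    context = "\n\n".join(doc["page_content"] for doc in docs)
    return PROMPT.format(context=context, question=question)


# Hypothetical retrieval result mixing one summary and one original chunk
docs = [
    {"page_content": "Summary: greenhouse gases trap outgoing heat.", "metadata": {"level": 1}},
    {"page_content": "Chunk: CO2 and methane absorb infrared radiation.", "metadata": {"level": 0}},
]

prompt_text = build_prompt("What is the greenhouse effect?", docs)
print(prompt_text)
```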

### Example usage

#### Load and preprocess the source document
Before RAPTOR can summarize or retrieve anything, we first need to load and split the input content. In this example, we use a PDF.

In [10]:
# Load raw data from a PDF file
path = "Understanding_Climate_Change.pdf"

# Use a PDF loader to extract document chunks
loader = PyPDFLoader(path)
documents = loader.load()

# Extract plain text from the document objects
texts = [doc.page_content for doc in documents]

Here, we read a PDF document into memory and parse it into a list of raw text chunks. These chunks serve as the input to the RAPTOR tree builder. `PyPDFLoader` from LangChain handles PDF parsing under the hood and, by default, returns one `Document` per page, so each chunk here corresponds to a page.


#### Build the RAPTOR tree and components
Now we build the RAPTOR components step-by-step:
- First, we recursively summarize and cluster the document collection into a hierarchical tree.
- Then we store all nodes (texts and summaries) into a FAISS vectorstore for fast retrieval.
- Finally, we create a contextual retriever that uses LLM-based compression to extract only the most relevant information.

In [11]:
# Build the RAPTOR tree from the extracted texts
tree_results = build_raptor_tree(texts)

# Create a vectorstore to support similarity search across all tree levels
vectorstore = build_vectorstore(tree_results)

# Create retriever that compresses the retrieved context using the LLM
retriever = create_retriever(vectorstore)

2025-06-05 23:08:45,926 - INFO - Processing level 1
2025-06-05 23:08:45,930 - INFO - Embedding 10 texts
2025-06-05 23:08:46,907 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-06-05 23:08:47,070 - INFO - Performing clustering with 5 clusters
2025-06-05 23:08:55,935 - INFO - Summarizing 1 texts
2025-06-05 23:08:59,500 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:08:59,527 - INFO - Summarizing 2 texts
2025-06-05 23:09:02,689 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:02,694 - INFO - Summarizing 2 texts
2025-06-05 23:09:05,030 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:05,038 - INFO - Summarizing 3 texts
2025-06-05 23:09:09,741 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:09,749 - INFO - Summarizing 2 texts
2

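Under the hood, this step recursively abstracts the input into thematic layers: at each level the texts are embedded, clustered, and summarized, and the summaries become the input to the next level. Each node carries metadata about its origin, which allows precise traversal and retrieval. Because nodes from every level are stored in the same FAISS index, a single similarity search covers the whole tree. The contextual retriever then uses the language model to compress the retrieved results down to the passages relevant to the query.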
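
To make the recursion concrete, here is a toy sketch of the level-by-level structure. It is not the notebook's `build_raptor_tree`: clustering is replaced by fixed-size grouping of neighbouring texts, the summarizer is a placeholder that needs no LLM, and the names `build_toy_tree`, `first_words`, and the `origin` label format are hypothetical.

```python
from typing import Callable, Dict, List


def build_toy_tree(
    texts: List[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 2,
    max_levels: int = 3,
) -> Dict[int, List[dict]]:
    """Return {level: [node, ...]}, where each node records its text and origin."""
    tree: Dict[int, List[dict]] = {}
    current = [{"text": t, "level": 0, "origin": "original"} for t in texts]
    for level in range(max_levels + 1):
        tree[level] = current
        # Stop once a single root remains or the level cap is reached.
        if len(current) <= 1 or level == max_levels:
            break
        # Stand-in for clustering: group neighbours in fixed-size batches.
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        # Each group is summarized into one node of the next level.
        current = [
            {
                "text": summarize([node["text"] for node in group]),
                "level": level + 1,
                "origin": f"summary_of_group_{idx}_level_{level}",
            }
            for idx, group in enumerate(groups)
        ]
    return tree


def first_words(chunks: List[str]) -> str:
    """Placeholder summarizer: keep the first three words of each chunk."""
    return " / ".join(" ".join(c.split()[:3]) for c in chunks)


toy_tree = build_toy_tree([f"page {i} text about climate" for i in range(5)], first_words)
for level, nodes in toy_tree.items():
    print(level, len(nodes))
```

With five input texts and groups of two, the levels shrink from 5 to 3 to 2 to 1 nodes. A real RAPTOR build differs in that cluster membership comes from embeddings rather than position, and each summary comes from an LLM call.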


#### Run a query through the full RAPTOR pipeline


In [12]:
# Run an actual query through the full RAPTOR pipeline
max_level = 3  # The deepest level available in the tree
query = "What is the greenhouse effect?"

# Get results from RAPTOR
result = raptor_query(query, retriever, max_level)

print_query_details(result)

2025-06-05 23:09:24,892 - INFO - Processing query: What is the greenhouse effect?
2025-06-05 23:09:25,076 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-06-05 23:09:26,681 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:27,973 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:28,993 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:29,812 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:30,017 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-06-05 23:09:31,045 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:31,917 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-05 23:09:33,091 - 

Query: What is the greenhouse effect?

Number of documents retrieved: 16

Retrieved Documents:
  Document 1:
    Content: The greenhouse effect is the process by which greenhouse gases, such as carbon dioxide (CO2), methan...
    Similarity Score: N/A
    Tree Level: 0
    Origin: original

  Document 2:
    Content: The provided context does not include a definition or explanation of the greenhouse effect. Therefor...
    Similarity Score: N/A
    Tree Level: 0
    Origin: original

  Document 3:
    Content: The provided context does not contain any information specifically addressing the greenhouse effect....
    Similarity Score: N/A
    Tree Level: 0
    Origin: original

  Document 4:
    Content: The context provided does not contain any information about the greenhouse effect....
    Similarity Score: N/A
    Tree Level: 0
    Origin: original

  Document 5:
    Content: The greenhouse effect is a natural process where greenhouse gases, such as carbon dioxide (CO2), met...
    

Here, we perform a real query using the system. It:
- Starts retrieval from the highest summary level and walks down the tree.
- Uses metadata lineage to trace back the documents that informed the answer.
- Shows us which texts were considered relevant, where they came from in the hierarchy, and what the LLM ultimately answered based on that evidence.

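This gives us not just an answer, but also visibility into which parts of the hierarchy contributed to it. We can trace which documents were selected, at what level, and how they map to the original document. In the run shown above, the similarity score prints as N/A, which indicates that no score was available for these documents, so relevance has to be judged from the content and the level metadata.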
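
One quick way to check which parts of the hierarchy contributed is to count the retrieved documents per tree level and per origin. The sketch below assumes each retrieved entry is a dictionary with `level` and `origin` keys, matching the fields printed above; the exact key names in the notebook's result dictionary may differ, and `count_by` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List


def count_by(retrieved: List[Dict[str, Any]], key: str) -> Dict[Any, int]:
    """Count retrieved documents by a metadata field, using 'N/A' when it is missing."""
    return dict(Counter(doc.get(key, "N/A") for doc in retrieved))


sample_retrieved = [
    {"content": "The greenhouse effect is ...", "level": 0, "origin": "original"},
    {"content": "Greenhouse gases such as ...", "level": 0, "origin": "original"},
    {"content": "Overview of climate drivers ...", "level": 1, "origin": "summary"},
]

print(count_by(sample_retrieved, "level"))
print(count_by(sample_retrieved, "origin"))
```

If nearly all hits come from level 0, as in the truncated output above, the answer is drawing mostly on original pages rather than higher-level summaries.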