# Hierarchical indices in document retrieval

In this notebook, we implement a hierarchical indexing system to improve the relevance and efficiency of document retrieval. Unlike flat vector-based approaches that treat all content chunks equally, this method adds structure by first summarizing documents and then indexing those summaries alongside detailed chunks.

We use OpenAI’s GPT-4o-mini for summarization, FAISS for vector storage, and OpenAI Embeddings for representing text. The idea is to search summaries first to narrow down relevant sections, and then search within those sections for fine-grained detail. This is ideal when working with large documents or long-form content, where flat retrieval can miss context or feel too noisy.

In [14]:
import asyncio
import os
import random  # needed for the jitter in exponential_backoff
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.chains.summarize.chain import load_summarize_chain
from langchain.docstore.document import Document
from langchain.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import RateLimitError

# Load environment variables from a .env file
load_dotenv()

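# Access the API key and fail early with a clear message if it is missing
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key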

### Handling API rate limiting
When working with OpenAI's API (or any external service), there is a chance our code might hit a rate limit — which means we are sending too many requests too quickly. If this happens, the API won't return our result, and our app could break unless we handle it.

To fix this, we implement a retry mechanism that waits a bit and tries again, using exponential backoff with jitter:
- Exponential backoff means the wait doubles after each failure. With the function below, where attempts are counted from 0, that is about 1 second after the first failure, 2 seconds after the second, then 4, 8, and so on, giving the server a chance to breathe.
- Jitter adds a bit of randomness to that wait time, so that many clients do not retry at exactly the same moment (which could overwhelm the server again).
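
As a rough illustration of the schedule these two ideas produce, the sketch below computes the delay for the first few attempts. The helper name `backoff_delay` and its `base`, `max_jitter`, and `rng` parameters are assumptions made for this example only; attempts are counted from 0.

```python
import random

def backoff_delay(attempt, base=2, max_jitter=1.0, rng=random):
    """Return base**attempt seconds plus a random jitter in [0, max_jitter]."""
    return (base ** attempt) + rng.uniform(0, max_jitter)

# Print the delay for the first four retry attempts (0-indexed)
for attempt in range(4):
    delay = backoff_delay(attempt)
    low, high = 2 ** attempt, 2 ** attempt + 1
    print(f"attempt {attempt}: wait {delay:.2f}s (between {low}s and {high}s)")
```

The exact numbers differ on every run because of the jitter, but each delay stays within one second of the corresponding power of two.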

This function will wrap any async request we make (like summarizing with GPT or embedding text) and make sure that if we hit a temporary limit, the code won't crash — it'll just wait and try again.

#### Define the retry logic with exponential backoff + jitter

In [2]:
async def exponential_backoff(attempt):
    """
    Implements exponential backoff with a jitter.

    Args:
        attempt: The current retry attempt number.

    Waits for a period of time before retrying the operation.
    The wait time is calculated as (2^attempt) + a random fraction of a second.
    """
    # Calculate the wait time with exponential backoff and jitter
    wait_time = (2 ** attempt) + random.uniform(0, 1)
    print(f"Rate limit hit. Retrying in {wait_time:.2f} seconds...")

    # Asynchronously sleep for the calculated wait time
    await asyncio.sleep(wait_time)

This function handles the actual delay logic. We will use it inside our retry wrapper below.

#### Retry any async operation with backoff logic

In [3]:
async def retry_with_exponential_backoff(coroutine, max_retries=5):
    """
    Retries an async operation using exponential backoff upon encountering a RateLimitError.

    Args:
        coroutine: Either a zero-argument callable that returns a new awaitable
            on each call, or an awaitable. A coroutine object can be awaited
            only once, so pass a callable to get real retries.
        max_retries: The maximum number of retry attempts.

    Returns:
        The result of the operation if successful.

    Raises:
        The last encountered exception if all retry attempts fail.
    """
    for attempt in range(max_retries):
        try:
            # Build a fresh awaitable for this attempt when given a callable
            awaitable = coroutine() if callable(coroutine) else coroutine
            return await awaitable
        except RateLimitError:
            # If the last attempt also fails, re-raise the exception
            if attempt == max_retries - 1:
                raise

            # Wait for an exponential backoff period before retrying
            await exponential_backoff(attempt)

    # Only reached if max_retries is zero or negative
    raise Exception("Max retries reached")

What this is doing (step-by-step):
1. It defines an async function that wraps any other coroutine (an async task like sending a request). We pass in any async task — like a call to GPT or an embedding request.
2. If it runs fine, we get the result immediately.
3. If it fails due to a rate limit, the retry function:
   - Calculates how long to wait using `exponential_backoff`.
   - Waits asynchronously without blocking other tasks.
   - Tries again (up to `max_retries` times).
4. If it still fails after all retries, the error is raised so we can handle it (or crash gracefully).
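
One detail worth keeping in mind when writing such a wrapper: a Python coroutine object can be awaited only once, so retrying needs a fresh coroutine for every attempt. The sketch below shows one way to do that, by passing a zero-argument callable instead of the coroutine itself. The names `retry_async`, `FakeRateLimitError`, `flaky_call`, and `demo` are hypothetical, and the simulated error stands in for `RateLimitError` so the example runs without network access.

```python
import asyncio
import random

class FakeRateLimitError(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs offline (assumption)."""

async def retry_async(make_coroutine, max_retries=5, base_delay=0.01):
    """Retry make_coroutine() with exponential backoff and jitter.

    make_coroutine is a zero-argument callable that returns a fresh
    coroutine on every call, because a coroutine can be awaited only once.
    """
    for attempt in range(max_retries):
        try:
            return await make_coroutine()
        except FakeRateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # base_delay is tiny here so the demo finishes quickly
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

calls = {"count": 0}

async def flaky_call():
    """Hypothetical request that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError("simulated rate limit")
    return "ok"

async def demo():
    # The lambda creates a new flaky_call() coroutine for each attempt
    return await retry_async(lambda: flaky_call())
```

In a notebook, `await demo()` runs the example; in a plain script, use `asyncio.run(demo())`.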

### Load the PDF
We will use `PyPDFLoader` to extract text from the PDF. It reads the PDF page by page and stores the extracted text in a list of document objects, where each document contains the content of a single page.

In [4]:
# Path to the PDF document
path = "Understanding_Climate_Change.pdf"

# Define a coroutine to load the document
async def load_pdf(path):
    # Use the default event loop to run the load operation in a separate thread
    loop = asyncio.get_event_loop()
    loader = PyPDFLoader(path)
    
    # Run the loading operation in a separate thread
    documents = await loop.run_in_executor(None, loader.load)
    return documents

# Use the load_pdf coroutine to load the document
documents = await load_pdf(path)

### Summarize each page
So now that we have the full document loaded and split into separate pages (or logical chunks), the next smart move is to create a summary of each one. Why? Because it's a lot easier — and more efficient — to start searching through summaries than to jump straight into hundreds of raw text blocks. Think of it like scanning a table of contents before reading a whole book. We will create a high-level summary of each document using the GPT model.

In [5]:
# Create document-level summaries
summary_llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini-2024-07-18", max_tokens=4000) # Load the summarization LLM
# Load a built-in summarization chain using LangChain's map_reduce approach
summary_chain = load_summarize_chain(summary_llm, chain_type="map_reduce")

- We are setting up the language model we'll use to generate the summaries. `ChatOpenAI` loads the GPT-4o-mini model, and a `temperature` of 0 makes its output more deterministic and focused (not too creative). `max_tokens` caps each response at 4,000 tokens, which leaves plenty of room for longer summaries when needed.
- Then we load what LangChain calls a "summarization chain" — it is like a pre-built recipe that takes in a document and gives us back a nicely structured summary. We use the `"map_reduce"` type here, which is helpful when summarizing longer inputs: it processes parts individually and then combines the results.
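
To make the map-reduce idea concrete, here is a toy sketch in plain Python. It is not LangChain's implementation: in the real chain both the "map" and the "reduce" steps are LLM calls, whereas the hypothetical helpers `first_sentence` and `join_summaries` below are simple string functions chosen so the example runs offline.

```python
def map_reduce_summarize(parts, summarize, combine):
    """Summarize each part independently (map), then merge the results (reduce)."""
    partial_summaries = [summarize(part) for part in parts]
    return combine(partial_summaries)

def first_sentence(text):
    """Toy 'summarizer': keep only the first sentence."""
    return text.split(".")[0].strip() + "."

def join_summaries(summaries):
    """Toy 'reducer': concatenate the partial summaries."""
    return " ".join(summaries)

parts = [
    "Greenhouse gases trap heat. They include CO2 and methane.",
    "Sea levels are rising. Coastal cities are at risk.",
]
combined = map_reduce_summarize(parts, first_sentence, join_summaries)
print(combined)
```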

Now let’s actually loop through our pages, summarize each one, and collect the outputs:

In [6]:
# Process documents in smaller batches to avoid rate limits
batch_size = 5  # Adjust this based on your rate limits
# Collect summaries here
summaries = []

# Summarize each page asynchronously
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]

    # For each document in the batch, generate a summary
    batch_tasks = []
    for doc in batch:
        # Use our retry helper to avoid crashing on rate limits
        task = retry_with_exponential_backoff(summary_chain.ainvoke([doc]))
        batch_tasks.append(task)

    # Run the batch and gather summaries
    batch_summaries = await asyncio.gather(*batch_tasks)

    # Store summaries as Document objects
    for original_doc, summary_result in zip(batch, batch_summaries):
        summary = summary_result['output_text']
        summaries.append(Document(
            page_content=summary,
            metadata={
                "source": path,
                "page": original_doc.metadata["page"],
                "summary": True
            }
        ))

    # short pause to avoid hitting rate limits
    await asyncio.sleep(1)

- We go through the pages (or document chunks) in small groups of 5. That’s mostly to avoid triggering OpenAI’s rate limits — because asking it to summarize 50 pages at once is probably going to get us blocked or throttled.
- For each group, we create one task per document and wrap each in our `retry_with_exponential_backoff()` function, just in case the API complains. The tasks are awaited together with `asyncio.gather()`, so the requests in a batch run concurrently rather than one after another, and the results come back in the same order as the input pages.
- Once we have the summaries, we store each one as a new `Document`. We also add some metadata so we can track which summary belongs to which page later. That `"summary": True` tag will be important in the next steps when we build our hierarchical retrieval logic.
- We also add a little `await asyncio.sleep(1)` just to be nice to the API and avoid hammering it too fast.
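
The batching pattern described above can be isolated into a small generic helper. This is a minimal sketch under stated assumptions: `process_in_batches` and `fake_summarize` are hypothetical names, and the fake worker replaces the LLM call so the example runs without an API key.

```python
import asyncio

async def process_in_batches(items, worker, batch_size=5, pause=0.0):
    """Run worker(item) for every item, at most batch_size at a time."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # gather runs the batch concurrently and preserves input order
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
        # Pause between batches (skipped after the final batch)
        if pause and start + batch_size < len(items):
            await asyncio.sleep(pause)
    return results

async def fake_summarize(page_text):
    """Hypothetical stand-in for an LLM call: returns the first word."""
    await asyncio.sleep(0.01)
    return page_text.split()[0]

pages = [f"page{i} some text" for i in range(12)]
```

In a notebook, `await process_in_batches(pages, fake_summarize, batch_size=5, pause=1)` would process the twelve toy pages in three batches of 5, 5, and 2.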

### Create detailed chunks
Now that we have our document summarized into smaller, more digestible pieces (summaries), we can move on to the next part: splitting the original text into detailed chunks.

So why would we need this? Well, summaries give us a high-level view, but if we need more context or want to dive into specific sections, we need to break the document into smaller chunks — chunks that still contain enough detail for the user to understand without needing the entire page of text. These chunks will give us much more flexibility when we are performing searches later.

Now, instead of splitting the document at random, we want to split it based on the number of characters and overlap between chunks, ensuring no meaningful context is lost when we move from one chunk to another. For that, we will use LangChain's `RecursiveCharacterTextSplitter`.

In [8]:
# Define chunk size and overlap
chunk_size = 1000
chunk_overlap = 200

# Define the coroutine to split the documents
async def split_documents(docs, chunk_size, chunk_overlap):
    loop = asyncio.get_event_loop()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    
    # Run the split operation in a separate thread using the event loop's executor
    detailed_chunks = await loop.run_in_executor(None, text_splitter.split_documents, docs)
    return detailed_chunks

# Split the documents into detailed chunks
detailed_chunks = await split_documents(documents, chunk_size, chunk_overlap)

We start by setting up the `RecursiveCharacterTextSplitter`. Here, we define a couple of important things:
1. **`chunk_size`** — This tells the splitter how big we want each chunk of text to be. We set this based on how much content we think is manageable and still meaningful (in terms of information retention).
2. **`chunk_overlap`** — This is like padding between chunks. It ensures that the end of one chunk and the beginning of the next overlap just a little, so we don’t lose important context when moving between sections.
3. **`length_function`** — This just defines how we count the length of the text. We use the standard `len()` function, which counts characters.

By calling `split_documents`, we split the entire document into smaller, more manageable chunks. The output here will be a list of documents where each document is a chunk of the original text.
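
To see how `chunk_size` and `chunk_overlap` interact, consider this simplified fixed-window splitter. It is an illustration only and makes no attempt to reproduce `RecursiveCharacterTextSplitter`, which also tries to break on natural separators such as paragraphs and sentences; the function name `split_with_overlap` is an assumption for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of a chunk reappear as the first 4 characters of the next.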

Next, we want to update the metadata for each of these detailed chunks. This is important because later on, when we are searching through these chunks, we want to know which page each chunk came from. We will also tag these chunks as not being summaries, so we can tell summaries and detailed chunks apart.

In [9]:
# Update metadata for detailed chunks
for i, chunk in enumerate(detailed_chunks):
    chunk.metadata.update({
        "chunk_id": i,
        "summary": False,
        "page": int(chunk.metadata.get("page", 0))
    })

We loop through each of the `detailed_chunks` we just created. For each chunk, we update the metadata to:
- **Assign a unique `chunk_id`** — This is just a unique identifier for each chunk (it is an index in this case). It helps us keep track of which chunk is which.
- **Set `summary: False`** — This tells us that the chunk is not a summary but an actual piece of detailed content.
- **Record the `page` number** — The original document had page numbers, and we want to retain this information to reference the chunk’s location in the original text.

### Embed and create vector stores
Now that we have our summaries and detailed chunks ready, the next step is to transform these chunks into embeddings and store them in vector stores. We will be using OpenAI's embedding model to generate embeddings for both the summaries and the detailed chunks. After that, we will store the embeddings in vector stores for later retrieval.

In [10]:
# Initializes the OpenAI embeddings model
embeddings = OpenAIEmbeddings()

Next, we will create the vector store. We will use FAISS, a library for nearest neighbor search, to create the vector stores. This will allow us to find the most semantically similar chunks of text when performing a query. We need to embed both the summaries and detailed chunks, so we will create two separate vector stores — one for each type of chunk.

In [15]:
# Create vector stores asynchronously with rate limit handling
async def create_vectorstore(docs):
    """
    Creates a vector store from a list of documents with rate limit handling.

    Args:
        docs: The list of documents to be embedded.

    Returns:
        A FAISS vector store containing the embedded documents.
    """
    loop = asyncio.get_event_loop()

    # Use run_in_executor to run the FAISS embedding process in a separate thread
    return await retry_with_exponential_backoff(
        loop.run_in_executor(None, FAISS.from_documents, docs, embeddings)
    )

# Generate vector stores for summaries and detailed chunks concurrently
summary_vectorstore, detailed_vectorstore = await asyncio.gather(
    create_vectorstore(summaries),
    create_vectorstore(detailed_chunks)
)

Here we define an asynchronous helper function `create_vectorstore` that will:
1. Convert documents to embeddings: This is done by passing a batch of documents (summaries or detailed chunks) to the `FAISS.from_documents()` method along with the `embeddings` model.
2. Handle rate limits: Since embedding large batches of documents may hit API rate limits, we use our previously defined `retry_with_exponential_backoff` function to ensure that the embedding process retries automatically with an exponential delay in case it encounters a rate limit error.

With the helper defined, we create the vector stores for both the summaries and the detailed chunks. `asyncio.gather` runs the two `create_vectorstore` calls concurrently, so both stores are built at the same time rather than one after the other.
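
To make "most semantically similar" concrete, the sketch below runs a brute-force nearest neighbor search over a few toy vectors. The 3-dimensional vectors are made-up numbers rather than real embeddings, the helper names are hypothetical, and cosine similarity is used only as a familiar example of a similarity measure; the FAISS index itself may use a different metric and a far more efficient search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, items, k):
    """items: list of (label, vector). Return the k labels closest to query_vec."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]

# Toy "embeddings" (made-up numbers, not real model output)
store = [
    ("greenhouse gases", [0.9, 0.1, 0.0]),
    ("renewable energy", [0.1, 0.9, 0.1]),
    ("sea level rise", [0.2, 0.1, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], store, k=2))
```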

#### Save vector stores for reuse
In this step, we will save the vector stores for future use. Instead of recomputing the embeddings and vector stores every time we run the process, we can persist them on disk, so we can reuse them without the need for recalculation. This saves us time, computational resources, and makes our retrieval system more efficient in the long run.

In [16]:
summary_vectorstore.save_local("../vector_stores/summary_store")
detailed_vectorstore.save_local("../vector_stores/detailed_store")

We use the `.save_local()` method of LangChain's FAISS vector store to persist each store on disk. It writes the FAISS index, together with the stored documents and their metadata, to the specified directory.

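Or, if they already exist, load them instead. The stored documents are pickled, which is why `allow_dangerous_deserialization=True` is required; only load files you created yourself or otherwise trust: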

In [None]:
if os.path.exists("../vector_stores/summary_store") and os.path.exists("../vector_stores/detailed_store"):
    embeddings = OpenAIEmbeddings()
    summary_vectorstore = FAISS.load_local("../vector_stores/summary_store", embeddings, allow_dangerous_deserialization=True)
    detailed_vectorstore = FAISS.load_local("../vector_stores/detailed_store", embeddings, allow_dangerous_deserialization=True)

### Perform hierarchical retrieval
Now that we have created and saved the vector stores (both for summaries and detailed chunks), the next step is performing retrieval. The goal is to retrieve relevant information based on a query, but in a way that is both efficient and contextually relevant.

In [17]:
query = "What is the greenhouse effect?"
k_summaries = 3
k_chunks = 5

# Step 1: Search summaries
top_summaries = summary_vectorstore.similarity_search(query, k=k_summaries)

# Step 2: Drill down into relevant pages
relevant_chunks = []

for summary in top_summaries:
    # For each summary, retrieve relevant detailed chunks
    page_number = summary.metadata["page"]
    page_filter = lambda metadata: metadata["page"] == page_number
    page_chunks = detailed_vectorstore.similarity_search(
        query,
        k=k_chunks,
        filter=page_filter
    )
    relevant_chunks.extend(page_chunks)

1. Set the query and parameters: We define the query we are searching for. We also set `k_summaries = 3`, meaning we want to retrieve the top 3 relevant summaries, and `k_chunks = 5`, meaning we want up to 5 detailed chunks per relevant summary.
2. Search for relevant summaries: The first step in hierarchical retrieval is to search through the summary vector store (`summary_vectorstore`) to find the most relevant summaries to the query.
   - We use the `similarity_search()` method of the `summary_vectorstore`, which compares the query with each summary and returns the top `k_summaries` most similar to the query.
   - The result, `top_summaries`, is a list of the top 3 summaries that are most relevant to the query.
3. Filter by page and retrieve detailed chunks: For each of the top summaries retrieved, we extract the page number of that summary using `summary.metadata["page"]`. This is important because we want to ensure that we only retrieve detailed chunks from the same page as the relevant summary.
4. Search for detailed chunks: Using the `similarity_search()` method on the detailed vector store (`detailed_vectorstore`), we search for detailed chunks that are similar to the query.
   - We apply the `page_filter` to restrict the search to the chunks on the same page as the current summary.
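   - We retrieve up to `k_chunks` detailed chunks for each relevant summary. The filter is applied to a limited pool of nearest candidates (`fetch_k`, 20 by default in LangChain's FAISS wrapper), so a page can return fewer than `k_chunks` results.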
5. Combine all relevant chunks: For each summary, we retrieve detailed chunks and extend the list of relevant chunks (`relevant_chunks`) with these results. After looping through all the top summaries, `relevant_chunks` will contain all the relevant detailed chunks that are closely tied to the query.
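
The steps above can be sketched end to end with plain Python lists in place of the two vector stores. Everything here is illustrative: `hierarchical_search`, the dictionary layout, and the 2-dimensional toy vectors are assumptions for the example, and a simple dot product stands in for the similarity search.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_search(query_vec, summaries, chunks, k_summaries, k_chunks):
    """summaries/chunks: lists of dicts with 'page', 'vec' and 'text' keys."""
    # Stage 1: rank page summaries and keep the best k_summaries pages
    ranked = sorted(summaries, key=lambda s: dot(query_vec, s["vec"]), reverse=True)
    top_pages = [s["page"] for s in ranked[:k_summaries]]

    # Stage 2: within each selected page, rank only that page's chunks
    results = []
    for page in top_pages:
        page_chunks = [c for c in chunks if c["page"] == page]
        page_chunks.sort(key=lambda c: dot(query_vec, c["vec"]), reverse=True)
        results.extend(page_chunks[:k_chunks])
    return results

summaries = [
    {"page": 0, "vec": [1.0, 0.0], "text": "causes of climate change"},
    {"page": 1, "vec": [0.0, 1.0], "text": "renewable energy"},
]
chunks = [
    {"page": 0, "vec": [0.9, 0.1], "text": "greenhouse gases trap heat"},
    {"page": 0, "vec": [0.4, 0.2], "text": "orbital variations"},
    {"page": 1, "vec": [0.95, 0.9], "text": "solar panels"},
]
hits = hierarchical_search([1.0, 0.0], summaries, chunks, k_summaries=1, k_chunks=2)
print([h["text"] for h in hits])
```

Note that the "solar panels" chunk scores highest against the query on its own, yet it is never returned because its page summary was not selected in the first stage. That is the trade-off of the hierarchical approach: the summary search decides which pages the detailed search is allowed to look at.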

### Display the results
In this final step, we will present the relevant information we have retrieved.

In [18]:
# Print results
for chunk in relevant_chunks:
    print(f"Page: {chunk.metadata['page']}")
    print(f"Content: {chunk.page_content[:100]}...")  # Print first 100 characters
    print("---")

Page: 0
Content: Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is...
---
Page: 0
Content: Most of these climate changes are attributed to very small variations in Earth's orbit that 
change ...
---
Page: 0
Content: Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to si...
---
Page: 5
Content: Energy-efficient buildings use less energy for heating, cooling, and lighting. This can be 
achieved...
---
Page: 5
Content: a long time. These projects can help sequester carbon and provide new habitats for wildlife. 
Strate...
---
Page: 2
Content: development of eco-friendly fertilizers and farming techniques is essential for reducing the 
agricu...
---
Page: 2
Content: Heatwaves can lead to heat-related illnesses and exacerbate existing health conditions. 
Changing Se...
---
Page: 2
Content: Ruminant animals, such as cows and sheep, produce methane during digestion. Manure 
management pract...
---
