<a href="https://colab.research.google.com/github/jayVisaria/genai-agent-hub/blob/main/cookbook/pageindex_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Vectorless RAG with PageIndex

## PageIndex Introduction
PageIndex is a new **reasoning-based**, **vectorless RAG** framework that performs retrieval in two steps:  
1. Generate a tree structure index of documents  
2. Perform reasoning-based retrieval through tree search  

<div align="center">
  <img src="https://docs.pageindex.ai/images/cookbook/vectorless-rag.png" width="70%">
</div>

Compared to traditional vector-based RAG, PageIndex features:
- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.
- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.
- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.
- **Transparent Retrieval Process**: Retrieval is driven by explicit reasoning rather than approximate semantic search ("vibe retrieval").
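
To make the two retrieval steps more concrete, here is a minimal sketch of what a tree index might look like and how it can be flattened into a lookup table. The field names (`title`, `node_id`, `summary`, `nodes`) mirror the tree output printed later in this notebook. The `toy_tree` data and the `iter_nodes` and `flatten_tree` helpers are hypothetical illustrations, not part of the PageIndex library.

```python
# Hypothetical toy tree; field names follow the tree output shown in this notebook.
toy_tree = [
    {
        "title": "Paper",
        "node_id": "0000",
        "nodes": [
            {"title": "Abstract", "node_id": "0001", "summary": "Overview of the work."},
            {
                "title": "1. Introduction",
                "node_id": "0002",
                "nodes": [
                    {
                        "title": "1.1. Contributions",
                        "node_id": "0003",
                        "summary": "Main contributions.",
                    },
                ],
            },
        ],
    }
]


def iter_nodes(nodes):
    """Yield every node depth-first, which matches document order."""
    for node in nodes:
        yield node
        # Leaf nodes have no "nodes" key, so fall back to an empty list.
        yield from iter_nodes(node.get("nodes", []))


def flatten_tree(nodes):
    """Map each node_id to its node so selected IDs can be looked up directly."""
    return {node["node_id"]: node for node in iter_nodes(nodes)}


node_lookup = flatten_tree(toy_tree)
print(sorted(node_lookup))
print(node_lookup["0003"]["title"])
```

An LLM only needs to see the titles and summaries to pick node IDs; the lookup table then resolves those IDs back to full sections.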

In [1]:
%pip install -q --upgrade pageindex

In [2]:
from pageindex import PageIndexClient
import pageindex.utils as utils
from google.colab import userdata

# Get your PageIndex API key from https://dash.pageindex.ai/api-keys
PAGEINDEX_API_KEY = userdata.get("PAGEINDEX_API_KEY")
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

In [3]:
import google.generativeai as genai
from google.colab import userdata

# Get your Gemini API key from Google AI Studio
GEMINI_API_KEY = userdata.get("GOOGLE_API_KEY")
genai.configure(api_key=GEMINI_API_KEY)

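async def call_llm(prompt, model="gemini-2.5-flash", temperature=0):
    """Calls the Google Gemini API with the given text prompt."""
    client = genai.GenerativeModel(model)
    # Use the async variant so the event loop is not blocked, and pass temperature through
    response = await client.generate_content_async(
        prompt, generation_config={"temperature": temperature}
    )
    return response.text.strip()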

In [4]:
import os, requests

# You can also use our GitHub repo to generate a PageIndex tree
# https://github.com/VectifyAI/PageIndex

pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # fail early instead of saving an error page as a PDF
with open(pdf_path, "wb") as f:
    f.write(response.content)
print(f"Downloaded {pdf_url}")

doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)

Downloaded https://arxiv.org/pdf/2501.12948.pdf
Document Submitted: pi-cmgc8p5gu003a0bqsc56lzrbh


## Step 1: PageIndex Tree Generation
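
The cell below checks readiness once and asks you to rerun it if the document is still processing. As an alternative, a small polling helper with a timeout can wait for you. This is a minimal sketch: `wait_until_ready` and its parameters are hypothetical names introduced here, and the readiness check is passed in as a callable so the helper does not depend on any particular client.

```python
import time


def wait_until_ready(is_ready, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the resource became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a fake check that becomes ready on the third call.
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until_ready(fake_ready, timeout_s=5, interval_s=0, sleep=lambda s: None)
print(ready, calls["n"])

# In this notebook the check would presumably be wired up as:
# wait_until_ready(lambda: pi_client.is_retrieval_ready(doc_id))
```

Bounding the wait avoids a cell that spins forever if processing fails upstream.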

In [12]:
if pi_client.is_retrieval_ready(doc_id):
    tree = pi_client.get_tree(doc_id, node_summary=True)['result']
    print('Simplified Tree Structure of the Document:')
    utils.print_tree(tree)
else:
    print("Processing document, please try again later...")

Simplified Tree Structure of the Document:
[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...',
  'node_id': '0000',
  'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',
  'nodes': [{'title': 'Abstract',
             'node_id': '0001',
             'summary': 'The partial document introduces two reas...'},
            {'title': 'Contents',
             'node_id': '0002',
             'summary': 'This partial document outlines the struc...'},
            {'title': '1. Introduction',
             'node_id': '0003',
             'prefix_summary': 'The partial document introduces recent a...',
             'nodes': [{'title': '1.1. Contributions',
                        'node_id': '0004',
                        'summary': '### 1.1. Contributions\n'},
                       {'title': 'Post-Training: Large-Scale Reinforcement...',
                        'node_id': '0005',
                        'summary': 'This partial document discusses the appl...'},
                

## Step 2: Reasoning-Based Retrieval with Tree Search

In [13]:
import json

query = "What are the conclusions in this document?"

tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])

search_prompt = f"""
You are given a question and a tree structure of a document.
Each node contains a node id, node title, and a corresponding summary.
Your task is to find all nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{
    "thinking": "<Your thinking process on which nodes are relevant to the question>",
    "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
}}
Directly return the final JSON structure. Do not output anything else.
"""

tree_search_result = await call_llm(search_prompt)
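
The prompt above sends only node IDs, titles, and summaries to the LLM; the full section text is stripped first so the tree stays small. The sketch below shows one way such a recursive field removal could work. `strip_fields` is a hypothetical stand-in written for illustration and is not the actual implementation of `utils.remove_fields`.

```python
def strip_fields(tree, fields):
    """Return a new structure with the given keys removed at every nesting level.

    The input is left untouched, so the full text stays available for Step 3.
    """
    if isinstance(tree, list):
        return [strip_fields(item, fields) for item in tree]
    if isinstance(tree, dict):
        return {
            key: strip_fields(value, fields)
            for key, value in tree.items()
            if key not in fields
        }
    return tree


# Hypothetical two-level tree with full text attached to each node.
sample_tree = [
    {
        "node_id": "0000",
        "title": "Paper",
        "text": "full text of the paper",
        "nodes": [
            {"node_id": "0001", "title": "Abstract", "text": "full abstract text"},
        ],
    }
]

outline = strip_fields(sample_tree, fields=["text"])
print(outline)
```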

In [14]:
import json
import re

node_map = utils.create_node_mapping(tree)

# Strip an optional Markdown code fence (with or without a "json" tag) around the reply.
# A regex avoids cutting off real content when the closing fence is missing.
tree_search_result_cleaned = re.sub(
    r"^```(?:json)?\s*|\s*```$", "", tree_search_result.strip()
)


try:
    tree_search_result_json = json.loads(tree_search_result_cleaned)

    print('Reasoning Process:')
    utils.print_wrapped(tree_search_result_json['thinking'])

    print('\nRetrieved Nodes:')
    for node_id in tree_search_result_json["node_list"]:
        node = node_map.get(node_id)
        if node:
            print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}")
        else:
            print(f"Warning: Node ID {node_id} not found in node_map.")

except json.JSONDecodeError as e:
    print(f"Error decoding JSON: {e}")
    print("Raw LLM output:")
    print(tree_search_result)
except KeyError as e:
    print(f"Error accessing expected key in JSON: {e}")
    print("Parsed JSON object:")
    print(tree_search_result_json)

Reasoning Process:
The user is asking for the conclusions of the document. In academic papers, conclusions are
typically found in the Abstract, the dedicated Conclusion section, and sometimes summarized in the
Discussion section.
1. **Node 0021 (5. Conclusion, Limitations, and Future Work):** This is the primary location for the
explicit conclusions, summarizing the findings about DeepSeek-R1-Zero, DeepSeek-R1, and the
successful distillation process.
2. **Node 0001 (Abstract):** The abstract provides a high-level summary of the main conclusions,
including the comparative performance of DeepSeek-R1.
3. **Node 0020 (4. Discussion):** This section contains interpretive conclusions, specifically
comparing the effectiveness and economy of distillation versus reinforcement learning, which is a
major takeaway of the research.
I will select these three nodes as they collectively provide the comprehensive set of conclusions
and major findings.

Retrieved Nodes:
Node ID: 0001	 Page: 1	 Title: A
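
LLM replies do not always arrive as bare JSON: they may be wrapped in a Markdown code fence or surrounded by extra prose. The following is a minimal sketch of a more tolerant parser, under the assumption that the reply contains exactly one JSON object. `parse_llm_json` is a hypothetical helper, not part of PageIndex or the Gemini SDK.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built programmatically to keep this block readable
FENCE_RE = re.compile(rf"^{FENCE}[\w-]*\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_llm_json(raw):
    """Parse a JSON object from an LLM reply.

    Handles plain JSON, fenced JSON, and JSON embedded in surrounding prose.
    Raises json.JSONDecodeError if no JSON object can be recovered.
    """
    text = raw.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: take everything between the first "{" and the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start : end + 1])


fenced_reply = FENCE + 'json\n{"thinking": "t", "node_list": ["0001", "0021"]}\n' + FENCE
chatty_reply = 'Here is the result: {"thinking": "t", "node_list": []} Hope it helps.'

print(parse_llm_json(fenced_reply)["node_list"])
print(parse_llm_json(chatty_reply)["node_list"])
```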

## Step 3: Answer Generation
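
Before generating an answer, the text of the selected nodes is joined into one context string. The sketch below illustrates a slightly more defensive version of that step: it skips unknown or repeated node IDs and caps the total size. `build_context` and the `max_chars` budget are assumptions made for illustration; the notebook itself simply joins the node texts.

```python
def build_context(node_ids, node_lookup, max_chars=20000):
    """Join the text of the selected nodes, in order, within a character budget."""
    seen = set()
    parts = []
    used = 0
    for node_id in node_ids:
        node = node_lookup.get(node_id)
        # Skip IDs the LLM invented, duplicates, and nodes without text.
        if node is None or node_id in seen or not node.get("text"):
            continue
        remaining = max_chars - used
        if remaining <= 0:
            break
        seen.add(node_id)
        parts.append(node["text"][:remaining])
        used += len(parts[-1])
    return "\n\n".join(parts)


# Hypothetical lookup table standing in for the notebook's node map.
sample_lookup = {
    "a": {"text": "alpha"},
    "b": {"text": "beta"},
}

print(repr(build_context(["a", "zzz", "a", "b"], sample_lookup)))
print(repr(build_context(["a", "b"], sample_lookup, max_chars=7)))
```

The budget counts only node text, not the separators, so the final string can be slightly longer than `max_chars`.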

In [15]:
node_list = tree_search_result_json["node_list"]
# Skip any node IDs the LLM returned that do not exist in the tree
relevant_content = "\n\n".join(
    node_map[node_id]["text"] for node_id in node_list if node_id in node_map
)

print('Retrieved Context:\n')
utils.print_wrapped(relevant_content[:1000] + '...')

Retrieved Context:

## Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised
fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL,
DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.
However, it encounters challenges such as poor readability, and language mixing. To address these
issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates
multi-stage training and cold-start data before RL. DeepSeekR1 achieves performance comparable to
OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source
DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from
DeepSeek-R1 based on Qwen and Llama.


![img-0.jpeg](img-0.jpeg)

Figure 1 | Benchmark performance of DeepSeek-R1.



In [36]:
answer_prompt = f"""
Answer the question based on the context:

Question: {query}
Context: {relevant_content}

Provide a clear, concise answer based only on the context provided.
"""

print('Generated Answer:\n')
answer = await call_llm(answer_prompt)
utils.print_wrapped(answer)

Generated Answer:

The conclusions in this document are as follows:

**Regarding the DeepSeek Models:**

1.  **DeepSeek-R1-Zero** (the pure RL approach without cold-start data) achieved strong performance
across various tasks, demonstrating the potential of LLMs to develop reasoning capabilities without
supervised data. However, it encountered challenges such as poor readability and language mixing.
2.  **DeepSeek-R1** (leveraging cold-start data alongside iterative RL fine-tuning) is more powerful
and achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

**Regarding Distillation vs. Reinforcement Learning (RL):**

1.  Distilling more powerful models (like DeepSeek-R1) into smaller ones yields excellent and
effective results, with distilled models achieving impressive benchmark results, significantly
outperforming other instruction-tuned models.
2.  Smaller models relying on the large-scale RL discussed in the paper require enormous
computational power and may not eve

In [16]:
# Define the user query
query = "What is the Distilled Model Evaluation?"
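
The next cell reruns the whole pipeline for this query. The same flow (tree search, context retrieval, answer generation) can also be sketched as one function that takes the LLM as a parameter, which makes it easy to exercise with a stub. Everything here is hypothetical: `answer_query`, `fake_llm`, and the toy data are illustrations, and the prompts are shortened versions of the ones used in this notebook.

```python
import asyncio
import json


async def answer_query(query, outline, node_lookup, llm):
    """Tree search -> context retrieval -> answer generation with an injected async LLM."""
    search_prompt = (
        "Find all nodes that are likely to contain the answer to the question.\n"
        f"Question: {query}\n"
        f"Document tree structure:\n{json.dumps(outline)}\n"
        'Reply as JSON: {"thinking": "...", "node_list": ["..."]}'
    )
    search = json.loads(await llm(search_prompt))
    # Keep only node IDs that exist in the lookup table.
    node_ids = [n for n in search.get("node_list", []) if n in node_lookup]
    if not node_ids:
        return None
    context = "\n\n".join(node_lookup[n]["text"] for n in node_ids)
    answer_prompt = (
        "Answer the question based on the context:\n\n"
        f"Question: {query}\nContext: {context}"
    )
    return await llm(answer_prompt)


async def fake_llm(prompt):
    """Stub LLM: returns a canned node selection, then echoes the context it receives."""
    if "Document tree structure" in prompt:
        return json.dumps({"thinking": "stub reasoning", "node_list": ["0002", "9999"]})
    return "stub answer built from: " + prompt.split("Context: ", 1)[1]


toy_lookup = {
    "0001": {"title": "Abstract", "text": "We study X."},
    "0002": {"title": "Conclusion", "text": "X works."},
}
toy_outline = [{"node_id": k, "title": v["title"]} for k, v in toy_lookup.items()]

# In a notebook, use `await answer_query(...)` instead of asyncio.run(...).
result = asyncio.run(
    answer_query("What are the conclusions?", toy_outline, toy_lookup, fake_llm)
)
print(result)
```

Passing the LLM in as an argument keeps the retrieval logic testable without API keys or network access.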

In [23]:
import os
import requests
import json
import re
import asyncio

from pageindex import PageIndexClient
import pageindex.utils as utils
from google.colab import userdata
import google.generativeai as genai

# --- Step 1: Setup and Initialization ---
# Get your API keys from Google Colab secrets
PAGEINDEX_API_KEY = userdata.get("PAGEINDEX_API_KEY")
GEMINI_API_KEY = userdata.get("GOOGLE_API_KEY")

# Initialize PageIndex client
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

# Initialize Gemini client
genai.configure(api_key=GEMINI_API_KEY)

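async def call_llm(prompt, model="gemini-2.5-flash", temperature=0):
    """Calls the Google Gemini API with the given text prompt."""
    client = genai.GenerativeModel(model)
    # Use the async variant so the event loop is not blocked, and pass temperature through
    response = await client.generate_content_async(
        prompt, generation_config={"temperature": temperature}
    )
    return response.text.strip()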


# --- Step 2: Document Loading and Indexing ---
# Download the document
pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # fail early instead of saving an error page as a PDF
with open(pdf_path, "wb") as f:
    f.write(response.content)
print(f"Downloaded {pdf_url}")

# Submit document to PageIndex for indexing
doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)

# Poll until the document has been processed and is ready for retrieval
while not pi_client.is_retrieval_ready(doc_id):
    print("Processing document, please wait...")
    await asyncio.sleep(5)  # top-level await works in notebook cells

# --- Step 3: PageIndex Tree Retrieval ---
tree = pi_client.get_tree(doc_id, node_summary=True)['result']
node_map = utils.create_node_mapping(tree)
print('Simplified Tree Structure of the Document:')
utils.print_tree(tree)


# --- RAG Pipeline from Query to Answer ---

# The query is defined in a cell above (assuming 'query' variable exists)
# query = "What are the conclusions in this document?" # Uncomment and define query here if running as a standalone cell

# Step 4: Reasoning-Based Tree Search
# Use the LLM to find relevant nodes in the document tree
tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])

search_prompt = f"""
You are given a question and a tree structure of a document.
Each node contains a node id, node title, and a corresponding summary.
Your task is to find all nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{
    "thinking": "<Your thinking process on which nodes are relevant to the question>",
    "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
}}
Directly return the final JSON structure. Do not output anything else.
"""

tree_search_result = await call_llm(search_prompt)

# Strip an optional Markdown code fence (with or without a "json" tag) around the reply.
# A regex avoids cutting off real content when the closing fence is missing.
tree_search_result_cleaned = re.sub(
    r"^```(?:json)?\s*|\s*```$", "", tree_search_result.strip()
)

try:
    tree_search_result_json = json.loads(tree_search_result_cleaned)

    print('\nReasoning Process:')
    utils.print_wrapped(tree_search_result_json['thinking'])

    print('\nRetrieved Nodes:')
    for node_id in tree_search_result_json["node_list"]:
        node = node_map.get(node_id)
        if node:
            print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}")
        else:
            print(f"Warning: Node ID {node_id} not found in node_map.")

except json.JSONDecodeError as e:
    print(f"Error decoding JSON: {e}")
    print("Raw LLM output:")
    print(tree_search_result)
    tree_search_result_json = {}
except KeyError as e:
    print(f"Error accessing expected key in JSON: {e}")
    print("Parsed JSON object:")
    print(tree_search_result_json)
    tree_search_result_json = {}


# Step 5: Retrieve Relevant Content
if 'node_list' in tree_search_result_json:
    node_list = tree_search_result_json["node_list"]
    relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list if node_id in node_map)

    print('\nRetrieved Context:\n')
    utils.print_wrapped(relevant_content[:1000] + '...')
else:
    relevant_content = ""
    print("\nNo nodes were identified as relevant.")


# Step 6: Answer Generation
if relevant_content:
    answer_prompt = f"""
    Answer the question based on the context:

    Question: {query}
    Context: {relevant_content}

    Provide a clear, concise answer based only on the context provided.
    """

    print('\nGenerated Answer:\n')
    answer = await call_llm(answer_prompt)
    utils.print_wrapped(answer)
else:
    print("\nCannot generate answer as no relevant content was retrieved.")

Downloaded https://arxiv.org/pdf/2501.12948.pdf
Document Submitted: pi-cmgc8xv51016p0aqs01ch8ey6
Processing document, please wait...
Processing document, please wait...
Processing document, please wait...
Processing document, please wait...
Processing document, please wait...
Processing document, please wait...
Simplified Tree Structure of the Document:
[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...',
  'node_id': '0000',
  'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',
  'nodes': [{'title': 'Abstract',
             'node_id': '0001',
             'summary': 'The partial document introduces two reas...'},
            {'title': 'Contents',
             'node_id': '0002',
             'summary': 'The partial document outlines the struct...'},
            {'title': '1. Introduction',
             'node_id': '0003',
             'prefix_summary': 'The partial document introduces recent a...',
             'nodes': [{'title': '1.1. Contributions',
                