# RAPTOR Code Retrieval Demo

This notebook demonstrates hierarchical code retrieval using RAPTOR clustering:
1. Extract code from Python files
2. Apply RAPTOR clustering for multi-level code understanding
3. Store embeddings in FAISS vector store
4. Retrieve relevant code snippets via semantic search

**Key Features**:
- Tree-structured retrieval at different abstraction levels
- Semantic code search using natural language queries
- Hierarchical summaries for codebase understanding
- Local embeddings (no API costs)

## Setup

**Prerequisites:**
1. Run `uv sync` in project root
2. Copy `.env.example` to `.env` and configure API keys
3. Specify Python codebase path for analysis

In [2]:
import sys
import os
import glob
from pathlib import Path

# Add the parent directory to sys.path so the src package can be imported
sys.path.insert(0, os.path.abspath('..'))

# Drop cached src modules (the package itself and its submodules) to force a fresh import
modules_to_remove = [key for key in sys.modules if key == 'src' or key.startswith('src.')]
for module in modules_to_remove:
    del sys.modules[module]

# Import our modules
from src.config import Config
from src.raptor import RAPTORProcessor
from src.vector_store import FAISSVectorStore
from src.code_processor import CodeProcessor

print("Modules loaded successfully")


Modules loaded successfully


## Configuration

Choose the LLM provider used to generate code summaries: `"openai"` or `"gemini"`.

Embeddings are computed locally with sentence-transformers, so no embedding API calls are made.

In [3]:
# Configuration
LLM_PROVIDER = "gemini"  # or "openai"
USE_LOCAL_EMBEDDINGS = True  # Use free local embeddings

# Initialize configuration
config = Config(llm_provider=LLM_PROVIDER, use_local_embeddings=USE_LOCAL_EMBEDDINGS)
print(f"Configuration loaded: {config}")
print(f"Embeddings: {'Local (sentence-transformers)' if USE_LOCAL_EMBEDDINGS else f'{LLM_PROVIDER} API'}")

Configuration loaded: Config(provider=gemini, model=gemini-2.0-flash)
Embeddings: Local (sentence-transformers)


## Set Codebase Path

Specify the Python codebase directory to analyze (default: `../src`).

In [4]:
# Set your codebase path
CODEBASE_PATH = "../src"  # Modify as needed

# Output directories
VECTOR_STORE_DIR = "../data/code_vector_store"

# Create directories if they don't exist
os.makedirs(VECTOR_STORE_DIR, exist_ok=True)

print(f"Codebase path: {CODEBASE_PATH}")
print(f"Vector store directory: {VECTOR_STORE_DIR}")

Codebase path: ../src
Vector store directory: ../data/code_vector_store


## Step 1: Extract Code from Codebase

Process:
1. Find all Python files
2. Extract functions and classes with docstrings
3. Create structured code chunks for embedding
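
The extraction steps above can be sketched with the standard-library `ast` module. This is a minimal illustration, not the actual `CodeProcessor` implementation: the function name `extract_chunks_from_source` is hypothetical, and the `# File:` / `# Class:` header layout is only modeled on the sample chunks printed in this notebook.

```python
import ast
from typing import List


def extract_chunks_from_source(source: str, filename: str) -> List[str]:
    """Return one text chunk per module docstring, class, and top-level function.

    Hypothetical helper for illustration; not part of src.code_processor.
    """
    tree = ast.parse(source)
    chunks = []

    # Module-level docstring becomes its own chunk
    module_doc = ast.get_docstring(tree)
    if module_doc:
        chunks.append(f"# File: {filename}\n\n{module_doc}")

    # Only top-level classes and functions are considered here
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            kind = "Class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "Function"
        else:
            continue
        doc = ast.get_docstring(node) or "(no docstring)"
        code = ast.get_source_segment(source, node)
        chunks.append(f"# File: {filename}\n# {kind}: {node.name}\n\n{doc}\n\n{code}")
    return chunks


sample = '''"""Toy module."""

class Greeter:
    """Say hello."""
    def greet(self):
        return "hi"

def helper():
    return 1
'''

chunks = extract_chunks_from_source(sample, "toy.py")
print(f"{len(chunks)} chunks")
print(chunks[1])
```

Keeping the docstring next to the source text gives the embedding model both a natural-language description and the code itself to match against.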

In [5]:
# Initialize code processor
code_processor = CodeProcessor()

# Extract code chunks
code_chunks = code_processor.extract_code_chunks(CODEBASE_PATH)

print(f"\nExtracted {len(code_chunks)} code chunks")
print("\nSample code chunk:\n")
# Show at most the first 500 characters of the first chunk
print((code_chunks[0][:500] + "...") if code_chunks else "No chunks found")

Found 6 Python files

Extracted 12 code chunks

Sample code chunk:

# File: config.py

Configuration module for managing API keys and LLM provider selection.
...


## Step 2: Apply RAPTOR Clustering

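RAPTOR builds a hierarchical tree over the code chunks:
- **Level 0 (Leaf)**: Individual functions/classes
- **Level 1**: Summaries of related code groups
- **Level 2**: High-level module summaries
- **Level 3**: Overall codebase understanding

Leaf chunks and summaries are indexed together, so a query can match at whichever abstraction level fits it best. On a small codebase the tree may stop before Level 3, once a level collapses into a single cluster.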
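
The level structure above comes from repeating one step: group the current texts, summarize each group, and treat the summaries as the next level. The sketch below shows only that control flow. It is not the `RAPTORProcessor` implementation: `build_tree`, `chunk_pairs`, and `join_summary` are hypothetical names, the grouping is a fixed pairing rather than clustering over embeddings, and the summarizer is string concatenation rather than an LLM call.

```python
from typing import Callable, List


def build_tree(
    leaves: List[str],
    cluster: Callable[[List[str]], List[List[str]]],
    summarize: Callable[[List[str]], str],
    n_levels: int = 3,
) -> List[List[str]]:
    """Return [level0, level1, ...], stopping early once a level holds one text."""
    levels = [list(leaves)]
    current = list(leaves)
    for _ in range(n_levels):
        if len(current) <= 1:
            break  # nothing left to group
        summaries = [summarize(group) for group in cluster(current)]
        levels.append(summaries)
        current = summaries
    return levels


# Toy stand-ins (assumptions for illustration only)
def chunk_pairs(texts: List[str]) -> List[List[str]]:
    """Group neighbours in pairs; a real system would cluster embeddings."""
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]


def join_summary(group: List[str]) -> str:
    """Concatenate instead of calling an LLM."""
    return "summary(" + " + ".join(group) + ")"


levels = build_tree(["a", "b", "c", "d"], chunk_pairs, join_summary)
for depth, texts in enumerate(levels):
    print(f"Level {depth}: {texts}")

all_texts = [text for level in levels for text in level]
print(f"{len(all_texts)} total texts")
```

With four leaves the loop produces two Level 1 summaries and one Level 2 summary, then stops even though `n_levels=3`, which is the same early-stop pattern as a run that ends with a single cluster.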

In [6]:
# Initialize RAPTOR processor
raptor = RAPTORProcessor(config)

# Apply RAPTOR clustering (up to 3 levels of hierarchy)
print("Building RAPTOR tree structure...")
print("Creating hierarchical summaries of code.\n")

all_code_texts = raptor.process(texts=code_chunks, n_levels=3)

print(f"\nRAPTOR Results:")
print(f"  Original code chunks: {len(code_chunks)}")
print(f"  Total texts (with summaries): {len(all_code_texts)}")
print(f"  New summaries created: {len(all_code_texts) - len(code_chunks)}")

🔧 Using local embeddings (sentence-transformers/all-MiniLM-L6-v2)



Building RAPTOR tree structure...
Creating hierarchical summaries of code.


Building RAPTOR tree with 3 levels...
Starting with 12 leaf texts



  Level 1: Generated 2 clusters
  Level 2: Generated 1 clusters
  Level 1: Added 2 summaries
  Level 2: Added 1 summaries
RAPTOR processing complete: 15 total texts

RAPTOR Results:
  Original code chunks: 12
  Total texts (with summaries): 15
  New summaries created: 3


## Step 3: Create FAISS Vector Store

Store the code chunks and RAPTOR summaries in a FAISS vector store for semantic search.
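
Conceptually, the simplest kind of vector index compares the query vector with every stored vector and returns the closest ones. The pure-Python sketch below illustrates that idea with squared L2 distance. It does not use the FAISS API, and whether `FAISSVectorStore` uses an exhaustive L2 index is an assumption; `l2_squared` and `brute_force_search` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = [(i, l2_squared(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Three toy 2-D "embeddings"
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = brute_force_search([1.0, 0.0], vectors, k=2)
for index, distance in hits:
    print(f"vector {index}: distance {distance:.2f}")
```

Results come back ordered by increasing distance, so the first hit is the best match.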

In [7]:
# Initialize vector store
vector_store = FAISSVectorStore(config)

# Create vector store from all texts
print("Creating vector store from code embeddings...")
vector_store.create_from_texts(all_code_texts)

# Display stats
stats = vector_store.get_stats()
print(f"\nVector Store Stats:")
for key, value in stats.items():
    print(f"  {key}: {value}")

🔧 Using local embeddings (sentence-transformers/all-MiniLM-L6-v2)
Creating vector store from code embeddings...

Creating FAISS vector store from 15 texts...
Vector store created successfully

Vector Store Stats:
  status: initialized
  n_vectors: 15
  embedding_provider: gemini


### Save Vector Store

In [8]:
# Save vector store for later use
vector_store.save(VECTOR_STORE_DIR)
print(f"Vector store saved to {VECTOR_STORE_DIR}")


Saving vector store to ../data/code_vector_store...
Vector store saved successfully
Vector store saved to ../data/code_vector_store


## Step 4: Semantic Code Search

Search for code using natural language queries.

**Example queries:**
- "How to configure the LLM provider?"
- "Code for processing PDF files"
- "Functions that handle embeddings"
- "RAPTOR clustering implementation"
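
The scores printed by the search cells in this notebook grow as results get less relevant (for example 0.7581 before 0.9912), which suggests they are distances rather than similarities. As a sketch, and assuming the score is a squared L2 distance between unit-length embeddings (the notebook does not confirm either point), it can be mapped to cosine similarity with the identity ||a - b||^2 = 2 - 2 cos(a, b). The helper name `squared_l2_to_cosine` is hypothetical.

```python
def squared_l2_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Valid only under the assumption that both vectors have length 1.
    """
    return 1.0 - distance / 2.0


for distance in (0.0, 1.0, 2.0):
    print(f"distance {distance:.1f} -> cosine {squared_l2_to_cosine(distance):.1f}")
```

Under these assumptions a distance of 0 means identical direction, 2 means orthogonal vectors, and values above 2 mean the vectors point in opposing directions.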

In [9]:
# Example query
query = "How to configure embeddings and LLM provider?"  # Modify as needed

print(f"Query: {query}\n")

# Search for similar code
results = vector_store.similarity_search_with_score(query, k=5)

print(f"\nTop {len(results)} Results:\n")
print("=" * 80)
for i, (doc, score) in enumerate(results, 1):
    print(f"\nResult {i} (Distance: {score:.4f}, lower is closer):")
    print("-" * 80)
    print(doc.page_content)
    print("=" * 80)

Query: How to configure embeddings and LLM provider?


Top 5 Results:


Result 1 (Distance: 1.0741, lower is closer):
--------------------------------------------------------------------------------
# File: config.py
# Class: Config

Configuration class for managing LLM providers and API keys.

Methods: __init__, _validate_api_keys, get_embedding_model, get_llm_model, __repr__

```python
class Config:
    """Configuration class for managing LLM providers and API keys."""
    
    def __init__(self, llm_provider: Optional[str] = None, use_local_embeddings: bool = False):
        """
        Initialize configuration.
        
        Args:
            llm_provider: Either "openai" or "gemini". If None, reads from env.
            use_local_embeddings: If True, use local sentence-transformers instead of API embeddings
        """
        self.llm_provider = llm_provider or os.getenv("LLM_PROVIDER", "openai")
        self.llm_provider = self.llm_provider.lower()
        self.use_local_embeddings = 
```

### Additional Example Queries

Try different kinds of queries. Results can be either leaf-level code chunks or RAPTOR summaries, depending on which text is closest to the query:

In [10]:
# Try multiple queries
example_queries = [
    "PDF processing and extraction",
    "RAPTOR clustering algorithm",
    "Vector store implementation",
    "How to handle batch processing?",
]

for query in example_queries:
    print(f"\n{'='*80}")
    print(f"Query: {query}")
    print(f"{'='*80}\n")
    
    results = vector_store.similarity_search_with_score(query, k=2)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"Result {i} (Score: {score:.4f}):")
        # Show first 200 chars of the result
        preview = doc.page_content[:200].replace('\n', ' ')
        print(f"{preview}...")
        print()


Query: PDF processing and extraction

Result 1 (Score: 0.7581):
# File: pdf_processor.py  PDF processing module for extracting text, tables, and images. ...

Result 2 (Score: 0.9912):
This content describes two Python modules: `pdf_processor.py` and `code_processor.py`, along with an `__init__.py` file.  *   **`pdf_processor.py`**: This module contains the `PDFProcessor` class, des...


Query: RAPTOR clustering algorithm

Result 1 (Score: 0.8306):
# File: raptor.py # Class: RAPTORProcessor  RAPTOR hierarchical text clustering and summarization.  Methods: __init__, embed_texts, global_cluster_embeddings, local_cluster_embeddings, get_optimal_clu...

Result 2 (Score: 0.9739):
# File: raptor.py  RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) implementation. Performs hierarchical clustering and summarization of text documents. ...


Query: Vector store implementation

Result 1 (Score: 1.0766):
# File: vector_store.py  FAISS vector store module for efficient similar

## Advanced: Tree-Level Retrieval

RAPTOR enables retrieval at different abstraction levels:
- Higher-level results provide summaries
- Lower-level results provide specific code

In [11]:
def search_with_context(query: str, k: int = 3):
    """
    Search and display results with an estimated abstraction level.

    The level is a heuristic guess based on the text content, not stored metadata.
    """
    print(f"Query: {query}\n")
    
    results = vector_store.similarity_search_with_score(query, k=k)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"\nResult {i} (Distance: {score:.3f}, lower is closer):")
        print("-" * 80)
        
        # Determine if it's original code or a summary
        content = doc.page_content
        if "# File:" in content or "```python" in content:
            level = "Leaf (Original Code)"
        elif len(content) < 300:
            level = "High-level Summary"
        else:
            level = "Mid-level Summary"
        
        print(f"Level: {level}")
        print(f"\nContent:\n{content[:400]}...")
        print("-" * 80)

# Try it out
search_with_context("How does the configuration system work?")

Query: How does the configuration system work?


Result 1 (Distance: 1.206, lower is closer):
--------------------------------------------------------------------------------
Level: Mid-level Summary

Content:
The provided content describes three Python modules: `config.py`, `raptor.py`, and `vector_store.py`.

**config.py:** This module defines a `Config` class for managing LLM providers (either "openai" or "gemini") and API keys. The `Config` class initializes with an optional `llm_provider` argument (defaults to "openai" or reads from the environment variable `LLM_PROVIDER`) and a `use_local_embeddin...
--------------------------------------------------------------------------------

Result 2 (Distance: 1.280, lower is closer):
--------------------------------------------------------------------------------
Level: Mid-level Summary

Content:
The provided content describes five Python modules: `config.py`, `raptor.py`, `vector_store.py`, `pdf_processor.py`, and `code_processor.py`, along with an `__init__.py` f

## Interactive Code Search

Run this cell multiple times with different queries:

In [12]:
# Interactive search
your_query = "Show me error handling code"  # Modify as needed

results = vector_store.similarity_search(your_query, k=3)

print(f"Query: {your_query}\n")
print("=" * 80)

for i, doc in enumerate(results, 1):
    print(f"\nResult {i}:")
    print("-" * 80)
    print(doc.page_content)
    print("=" * 80)

Query: Show me error handling code


Result 1:
--------------------------------------------------------------------------------
# File: code_processor.py

Code processing module for extracting code from Python files.


Result 2:
--------------------------------------------------------------------------------
# File: code_processor.py
# Class: CodeProcessor

Extract and process code from Python codebases.

Methods: __init__, extract_code_chunks, _extract_from_file, _extract_class, _extract_function, get_file_stats

```python
class CodeProcessor:
    """Extract and process code from Python codebases."""
    
    def __init__(self):
        """Initialize code processor."""
        pass
    
    def extract_code_chunks(
        self, 
        codebase_path: str, 
        max_files: Optional[int] = None
    ) -> List[str]:
        """
        Extract code chunks from Python files in a codebase.
        
        Extracts:
        - Module-level docstrings
        - Classes with their docstri
```

## Load Existing Vector Store (Optional)

Load previously saved vector store to skip processing:

In [13]:
# Uncomment to load existing vector store
# vector_store_loaded = FAISSVectorStore(config)
# vector_store_loaded.load(VECTOR_STORE_DIR)

# print("Vector store loaded successfully")
# print(f"Stats: {vector_store_loaded.get_stats()}")
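
A common way to use a saved store is "load if present, otherwise build and save". The sketch below shows that pattern with plain callables so it runs without the project code. `load_or_build` is a hypothetical helper, and the stub `build` and `load` functions stand in for the notebook's build steps and `FAISSVectorStore.load`; which files the real store writes is not assumed.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(store_dir: str, load: Callable[[str], T], build: Callable[[str], T]) -> T:
    """Call load() if store_dir already contains files, otherwise call build()."""
    path = Path(store_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(str(path))
    return build(str(path))


# Stub callables for the demo (assumptions, not project code)
def build(directory: str) -> str:
    Path(directory, "index.bin").write_text("stub")  # pretend to save an index
    return "built"


def load(directory: str) -> str:
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, load, build)   # directory is empty, so build runs
    second = load_or_build(tmp, load, build)  # a file now exists, so load runs

print(first, second)
```

A non-empty directory is only a rough signal: a store built from a different codebase or embedding model would still be loaded, so a stale directory should be cleared by hand.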

## Summary

Successfully completed:
- Extracted code from Python codebase
- Applied RAPTOR hierarchical clustering
- Created FAISS vector store for code search
- Performed semantic code retrieval

### Next Steps:

1. **Try different codebases** - Point to any Python project
2. **Integrate with RAG** - Build code Q&A systems
3. **Code documentation** - Generate docs from RAPTOR summaries
4. **Code search tools** - Build IDE plugins
5. **Multi-language support** - Extend to JavaScript, Java, etc.

### Optimization Tips:

- **Larger codebases**: Use `max_files` parameter or filters
- **Better results**: Include imports and comments in extraction
- **Custom embeddings**: Try code-specific embedding models
- **Fine-tune retrieval**: Adjust `k` parameter and similarity thresholds
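
As a sketch of the thresholding tip, results can be filtered after the search by dropping any hit whose score exceeds a cutoff. This assumes the score is a distance (lower is better), as the ordering of scores in this notebook suggests. `filter_by_distance` is a hypothetical helper, and the scores and cutoff below are made-up values for illustration.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def filter_by_distance(
    results: List[Tuple[T, float]], max_distance: float
) -> List[Tuple[T, float]]:
    """Keep only (document, distance) pairs at or below max_distance."""
    return [(doc, score) for doc, score in results if score <= max_distance]


# Illustrative (document, distance) pairs; in the notebook these would come from
# vector_store.similarity_search_with_score(query, k=...)
results = [("config.py", 0.76), ("summary", 0.99), ("vector_store.py", 1.08)]
kept = filter_by_distance(results, max_distance=1.0)

for doc, score in kept:
    print(f"{doc}: {score:.2f}")
```

A reasonable cutoff depends on the embedding model and the codebase, so it is best chosen by inspecting scores for a few known-good and known-bad queries.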