# Contextual RAG System - Clean Demo

This notebook demonstrates the modular contextual retrieval system with three approaches:

- **Basic RAG**: Standard document chunking and embedding
- **Contextual RAG**: Enhanced embeddings with situational context  
- **Combined RAG**: Multi-document search with source attribution

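All functionality is separated into modules, so this notebook focuses on **demonstration and testing**. Only the Basic and Contextual approaches are exercised below; `answer_query_combined` is imported but not demonstrated.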
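
To give a rough idea of what "multi-document search with source attribution" can mean, here is a minimal sketch. It is not the logic inside `answer_query_combined`, which this notebook never shows. The helper name `merge_results_with_source` and the sample hits are hypothetical; the only thing borrowed from the notebook is the result shape (dicts with `similarity` and `metadata` keys).

```python
from typing import Any, Dict, List


def merge_results_with_source(
    results_by_source: Dict[str, List[Dict[str, Any]]], k: int = 3
) -> List[Dict[str, Any]]:
    """Merge per-source search results and keep the k most similar overall."""
    merged = []
    for source, results in results_by_source.items():
        for result in results:
            # Copy each result so the caller's dicts are not mutated.
            merged.append({**result, "source": source})
    merged.sort(key=lambda r: r["similarity"], reverse=True)
    return merged[:k]


# Hypothetical hits, shaped like the dicts returned by search() in this notebook.
handbook_hits = [
    {"similarity": 0.41, "metadata": {"heading": "Remote Work"}},
    {"similarity": 0.30, "metadata": {"heading": "PTO"}},
]
job_hits = [
    {"similarity": 0.37, "metadata": {"job_title": "Legal Intern"}},
]

top = merge_results_with_source(
    {"employee_handbook": handbook_hits, "job_descriptions": job_hits}, k=2
)
for hit in top:
    print(f"{hit['source']}: {hit['similarity']:.3f}")
```

Tagging each hit with its source before ranking means the answer step can cite where a passage came from, even after results from several databases are interleaved.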


## Setup & Imports


In [14]:
# Import the clean modular components
import json
import textwrap
from vector_db import VectorDB, ContextualVectorDB
from rag_operations import (
    answer_query_base, answer_query_contextual, answer_query_combined
)
from data_utils import (
    transform_data_for_vectordb, create_contextual_dataset, 
    create_combined_dataset
)
import config

from config import OPENAI_API_KEY, ANTHROPIC_API_KEY, COHERE_API_KEY, GREENHOUSE_API_KEY, ANSWER_MODEL

def wrap_text(text: str, width: int = 90) -> str:
    """Wrap text to specified width while preserving paragraph breaks."""
    if not text:
        return text
    
    # Split by double newlines to preserve paragraph breaks
    paragraphs = text.split('\n\n')
    wrapped_paragraphs = []
    
    for paragraph in paragraphs:
        # Remove single newlines within paragraphs and wrap
        cleaned_paragraph = paragraph.replace('\n', ' ').strip()
        if cleaned_paragraph:
            wrapped = textwrap.fill(cleaned_paragraph, width=width)
            wrapped_paragraphs.append(wrapped)
    
    return '\n\n'.join(wrapped_paragraphs)

print("✅ All modules loaded successfully")
print(f"📁 Data directory: {config.DATA_DIR}")


✅ All modules loaded successfully
📁 Data directory: ../data


## 1. Basic RAG Demo

Standard document chunking and embedding approach.
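
For readers new to the idea, the chunk, embed, and search loop can be sketched in a few lines. This is only an illustration under loud assumptions: the word-count "embedding" is a toy stand-in for a real embedding model, and `chunk_text`, `embed`, `cosine`, and `search` are hypothetical helpers, not the `VectorDB` implementation.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def chunk_text(text: str, max_words: int = 8) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: List[str], k: int = 2) -> List[Dict]:
    """Score every chunk against the query and return the k most similar."""
    q = embed(query)
    scored = [{"content": c, "similarity": cosine(q, embed(c))} for c in chunks]
    return sorted(scored, key=lambda r: r["similarity"], reverse=True)[:k]


# Hypothetical two-sentence document; each sentence is exactly eight words.
doc = (
    "Remote employees may expense a home office setup. "
    "Vacation requests go to your manager for approval."
)
chunks = chunk_text(doc)
best = search("home office", chunks, k=1)[0]
print(best["content"])
```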


In [15]:
# Load employee handbook data
print("📖 Loading employee handbook data...")

with open(config.EMPLOYEE_HANDBOOK_PATH, 'r') as f:
    employee_handbook_raw = json.load(f)

# Transform data for VectorDB
employee_handbook = transform_data_for_vectordb(employee_handbook_raw, "employee_handbook")

# Initialize and load VectorDB
basic_db = VectorDB("employee_handbook")
basic_db.load_data(employee_handbook)

print(f"✅ Loaded {len(employee_handbook[0]['chunks'])} chunks into basic VectorDB")


📖 Loading employee handbook data...
Loading vector database from disk.
✅ Loaded 77 chunks into basic VectorDB


In [17]:
# Test basic RAG with text wrapping
# Named `query` to avoid clashing with the test_query() helper defined later
query = "What are Uniswap's core values?"

print(f"🔍 Query: {query}")
print("="*60)

answer = answer_query_base(query, basic_db)
print(f"📝 Basic RAG Answer:\n{wrap_text(answer)}")


🔍 Query: What are Uniswap's core values?
📝 Basic RAG Answer:
Uniswap's core values are articulated through their operating principles called "Unicode,"
which serve as daily guideposts for how they interact with each other, users, and their
community.

**People First** represents their belief that easy, safe, fair value transfer on the
internet can improve people's lives. Access, security and experience is the center of
everything they do. By pursuing decentralization, interoperability, and durability they
align with their users over the long-term. Internally, people are their greatest asset,
and they strive for an environment where everyone can make an incredible impact. They
share direct, kind feedback so they can improve. When advocating for an idea, technical
tradeoff, or business goal, they start with why — why it's better for their user and
company.

**Simple** reflects their craft of keeping things simple. In a complex field, they create
clarity and simplicity. They write and bui

## 2. Contextual RAG Demo

Enhanced embeddings with situational context for better retrieval accuracy.
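
The core idea can be sketched as follows: a short situating context is prepended to each chunk before it is embedded, so a chunk that is ambiguous on its own carries information about where it came from. This is a sketch under assumptions, not the `ContextualVectorDB` code. The helper `build_text_to_embed` and the sample chunk are hypothetical; the `content` and `contextual_content` field names follow the metadata keys used elsewhere in this notebook.

```python
from typing import Dict


def build_text_to_embed(chunk: Dict[str, str]) -> str:
    """Prepend the situating context (if any) to the chunk text before embedding."""
    context = chunk.get("contextual_content", "").strip()
    content = chunk["content"].strip()
    return f"{context}\n\n{content}" if context else content


# Hypothetical chunk; in this notebook the context is pre-computed and stored.
chunk = {
    "heading": "Reimbursements",
    "content": "The company reimburses the cost up to $500 USD per month.",
    "contextual_content": (
        "From the remote work section; describes the coworking space allowance."
    ),
}

text = build_text_to_embed(chunk)
print(text)
```

On its own, the chunk never says what is being reimbursed. With the context attached, a query about coworking spaces has something to match against.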


In [20]:
# Load contextual dataset (pre-computed contextual information)
print("📖 Loading contextual employee handbook database...")

# This loads pre-created contextual embeddings
contextual_db = ContextualVectorDB("employee_handbook_contextual")
contextual_db.load_db()  # Load from existing pickle file

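# Check that embeddings were actually loaded; the object itself is always truthy
if len(contextual_db.embeddings) > 0:
    print("✅ Contextual VectorDB loaded successfully")
else:
    print("⚠️ Contextual VectorDB is empty; check that the pickle file exists")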


📖 Loading contextual employee handbook database...
✅ Contextual VectorDB loaded successfully


In [21]:
# Test contextual RAG with text wrapping
# Named `query` to avoid clashing with the test_query() helper defined later
query = "Do we have any work from home benefits"

print(f"🔍 Query: {query}")
print("="*60)

answer = answer_query_contextual(query, contextual_db)
print(f"🧠 Contextual RAG Answer:\n{wrap_text(answer)}")

print("\n" + "="*60)
print("🔍 Retrieved Chunks (for transparency):")
results = contextual_db.search(query, k=3)
for i, result in enumerate(results, 1):
    metadata = result['metadata']
    print(f"\n📄 Chunk {i} (Similarity: {result['similarity']:.3f})")
    print(f"   Section: {metadata.get('heading', 'N/A')}")
    context_preview = metadata.get('contextual_content', 'N/A')[:100]
    print(f"   Context: {context_preview}...")


🔍 Query: Do we have any work from home benefits
🧠 Contextual RAG Answer:
Uniswap Labs offers several work from home benefits for remote team members:

**Home Office Setup Reimbursement**: The company reimburses up to $2,000 USD to cover the
purchase of office supplies, productivity items, and anything else you might need to get
your home office set up.

**Coworking Space Allowance**: If you prefer to work from a co-working space instead of
home, Uniswap Labs reimburses the cost up to $500 USD per month.

**Flexible Remote Work Policy**: Remote working allows team members to work at home, on
the road, or in a satellite location for all or part of their workweek, whenever you
choose, unless otherwise decided upon for your role. The company considers telecommuting
to be a viable, flexible work option and doesn't require anyone to come into the office if
their preference is for remote working.

**Equipment Provision**: All team members receive a company-issued computer to do their
job, whi

## 3. Interactive Testing

Test your own queries with the clean RAG system. All text is wrapped at 90 characters for easy reading.


In [26]:
# Interactive query testing function with text wrapping
def test_query(query: str, approach: str = "all"):
    """Test a query with specified RAG approach(es). All text wrapped at 90 characters."""
    print(f"🔍 Query: {query}")
    print("="*60)
    
    if approach in ["basic", "all"]:
        print("\n📝 Basic RAG:")
        answer = answer_query_base(query, basic_db)
        print(wrap_text(answer))
    
    if approach in ["contextual", "all"]:
        print("\n🧠 Contextual RAG:")
        answer = answer_query_contextual(query, contextual_db)
        print(wrap_text(answer))

# Example usage - try your own queries!
test_query("How much vacation time do employees get?", "contextual")


🔍 Query: How much vacation time do employees get?

🧠 Contextual RAG:
Uniswap Labs provides Unlimited Paid Time Off (PTO) for eligible employees, meaning there
is no maximum number of days off you can reasonably be approved to take, though time off
should be taken in a way that is not disruptive to the business. The team historically
averages around 4 weeks per year as the norm.

PTO includes vacation, sick leave, bereavement leave, birthday reprieve, and any other
time you may just need a break. Additionally, Uniswap has an annual end of year break with
dates specified in the current year's holiday schedule.

All regular full-time team members are eligible for PTO. However, temporary team members
and regular part-time team members scheduled to work less than 20 hours per week are not
eligible for PTO, though they are provided with sick time if required by applicable state
or city law.

For planned absences like vacation, you should request PTO from your manager and record it
using the 

## Summary

This clean implementation demonstrates:

✅ **Modular Architecture**: All functionality separated into proper modules  
✅ **Two Main RAG Approaches**: Basic and Contextual retrieval  
✅ **Easy Testing**: Simple functions to test and compare approaches  
✅ **Clean Interface**: No cluttered inline functions or redundant code  
✅ **Error Handling**: Graceful fallbacks built into the modules  
✅ **Text Wrapping**: All answers wrapped at 90 characters for easy reading

### Next Steps:
- Test with your own queries using `test_query()`
- Compare basic vs contextual retrieval performance
- Explore the modular codebase in the separate `.py` files
- Extend the system toward production deployment or multi-agent use
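
One way to make the basic vs contextual comparison concrete is recall@k against a small set of hand-labeled relevant chunks. The sketch below assumes each search result can be mapped to a chunk id; `recall_at_k` and all ids shown are hypothetical and are not part of the project modules.

```python
from typing import List, Set


def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


# Hypothetical labels and rankings for a single query.
relevant = {"c2", "c7"}
basic_ranking = ["c1", "c2", "c3"]
contextual_ranking = ["c2", "c7", "c4"]

basic_score = recall_at_k(basic_ranking, relevant, k=3)
contextual_score = recall_at_k(contextual_ranking, relevant, k=3)
print(f"basic recall@3: {basic_score:.2f}")
print(f"contextual recall@3: {contextual_score:.2f}")
```

Averaging this score over a set of labeled queries gives a single number per approach, which is easier to compare than reading answers side by side.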

### Module Structure:
- `vector_db.py` - VectorDB and ContextualVectorDB classes
- `rag_operations.py` - RAG pipeline functions  
- `data_utils.py` - Data transformation utilities
- `config.py` - Configuration and API keys
- `prompts.py` - System prompt templates

### Text Wrapping:
All answers are wrapped at 90 characters, with paragraph breaks preserved.


## Testing: Job Description Search

In [33]:
# Load job descriptions data
print("📖 Loading job descriptions data...")

import re
from vector_db import VectorDB, ContextualVectorDB

def extract_job_info(content):
    """Extract job title and department from content."""
    title_match = re.search(r"Job Title:\s*(.+)", content)
    dept_match = re.search(r"Department:\s*(.+)", content)
    
    job_title = title_match.group(1).strip() if title_match else "Unknown"
    department = dept_match.group(1).strip() if dept_match else "Unknown"
    
    return job_title, department

# Initialize and load VectorDB
job_db = VectorDB("job_descriptions")
job_db.load_db()

print(f"✅ Loaded {len(job_db.embeddings)} chunks into job descriptions VectorDB")

📖 Loading job descriptions data...
✅ Loaded 944 chunks into job descriptions VectorDB
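
As a quick sanity check of the parsing logic, the snippet below repeats `extract_job_info` from the cell above and runs it on hypothetical chunk text. The sample strings are made up; they only mimic the `Job Title:` / `Department:` header lines that the regexes look for.

```python
import re


def extract_job_info(content):
    """Extract job title and department from content (same logic as the cell above)."""
    title_match = re.search(r"Job Title:\s*(.+)", content)
    dept_match = re.search(r"Department:\s*(.+)", content)

    job_title = title_match.group(1).strip() if title_match else "Unknown"
    department = dept_match.group(1).strip() if dept_match else "Unknown"

    return job_title, department


# Hypothetical chunk text with the expected header lines.
sample = "Job Title: Legal Counsel\nDepartment: Legal & Policy\n\nJob Description:\n..."
parsed = extract_job_info(sample)
print(parsed)

# Text without the header lines falls back to "Unknown" for both fields.
fallback = extract_job_info("no header here")
print(fallback)
```

Because `.` does not match a newline by default, each capture stops at the end of its header line.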


In [48]:
# Test search
query = "Legal"
results = job_db.search(query, k=15)

print(f"\n🔍 Search results for '{query}':")
for i, result in enumerate(results[:15], 1):
    content = result['metadata']['content']
    job_title, department = extract_job_info(content)
    similarity = result['similarity']
    print(f"  {i}. {job_title} ({department}) - similarity: {similarity:.3f}")


🔍 Search results for 'Legal':
  1. Legal Intern (Legal & Policy) - similarity: 0.374
  2. Legal Counsel (Customer Experience) - similarity: 0.353
  3. Commercial Contract Attorney (Legal & Policy) - similarity: 0.349
  4. Assistant General Counsel, Product (Legal & Policy) - similarity: 0.344
  5. Commercial Contract Attorney (Legal & Policy) - similarity: 0.341
  6. Legal Counsel (Customer Experience) - similarity: 0.341
  7. Legal Intern (Legal & Policy) - similarity: 0.340
  8. Accounting Manager (Legal & Policy) - similarity: 0.339
  9. Legal Program Manager (Legal & Policy) - similarity: 0.337
  10. Legal Intern (Legal & Policy) - similarity: 0.334
  11. Assistant General Counsel, Product (Legal & Policy) - similarity: 0.333
  12. Senior Policy Associate (Legal & Policy) - similarity: 0.332
  13. Associate General Counsel (Legal & Policy) - similarity: 0.331
  14. Head of US Policy (Legal & Policy) - similarity: 0.327
  15. Senior Counsel (Legal & Policy) - similarity: 0.327


In [37]:
# For contextual embeddings:
contextual_db = ContextualVectorDB("job_descriptions_contextual")
contextual_db.load_db()
print(f"✅ Loaded {len(contextual_db.embeddings)} chunks into contextual job descriptions VectorDB")

# Your databases are now ready for search:
# job_db.search("your query", k=5)
# contextual_db.search("your query", k=5) 

✅ Loaded 944 chunks into contextual job descriptions VectorDB


In [39]:
query = "Engineering manager"
results = contextual_db.search(query, k=5)

print(f"\n🔍 Search results for '{query}':")


🔍 Search results for 'Engineering manager':


## Testing 2: Building Job Description Embeddings

### Function declaration

In [3]:
import json
import os
from typing import List, Dict, Any
from pathlib import Path

from vector_db import VectorDB 
from config import OPENAI_API_KEY

# ============================================================================================
# Transform jobs for vector database
# ============================================================================================
def transform_jobs_for_vectordb(jobs_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Wrap all jobs in a single document, with one chunk per job."""
    doc = {
        "doc_id": "all_jobs",
        "original_uuid": "all_jobs_uuid",
        "chunks": []
    }
    
    for i, job in enumerate(jobs_data):
        full_content = create_full_job_text(job)
        doc["chunks"].append({
            "chunk_id": job['job_id'],
            "original_index": i,
            "content": full_content,
            "job_id": job['job_id'],
            "job_title": job['title'],
            "department": job['department']
        })
    
    return [doc]

# ============================================================================================
# Create full job text
# ============================================================================================
def create_full_job_text(job: Dict[str, Any]) -> str:
    """Create a comprehensive text representation of a job."""
    
    sections = []
    
    # Header
    sections.append(f"Job Title: {job['title']}")
    sections.append(f"Department: {job['department']}")
    
    # Metadata
    metadata = job.get('metadata', {})
    if metadata.get('seniority'):
        sections.append(f"Seniority Level: {metadata['seniority']}")
    if metadata.get('skills'):
        sections.append(f"Skills: {', '.join(metadata['skills'])}")
    
    sections.append("")  # Empty line
    
    # Introduction
    if job['sections'].get('intro'):
        sections.append("Job Description:")
        sections.append(job['sections']['intro'])
        sections.append("")
    
    # Responsibilities
    if job['sections'].get('responsibilities'):
        sections.append("Key Responsibilities:")
        for resp in job['sections']['responsibilities']:
            sections.append(f"• {resp}")
        sections.append("")
    
    # Requirements
    if job['sections'].get('requirements'):
        sections.append("Requirements:")
        for req in job['sections']['requirements']:
            sections.append(f"• {req}")
        sections.append("")
    
    # Nice to haves
    if job['sections'].get('nice_to_haves'):
        sections.append("Nice to Have:")
        for nice in job['sections']['nice_to_haves']:
            sections.append(f"• {nice}")
        sections.append("")
    
    return "\n".join(sections)

# ============================================================================================
# Create job embeddings
# ============================================================================================
def create_job_embeddings(input_file: str = "data/job_descriptions_conglomerate.json"):
    """Create vector embeddings for job descriptions."""
    
    print("🚀 Creating job description embeddings...\n")
    
    # Load job descriptions
    print(f"📖 Loading job descriptions from {input_file}")
    with open(input_file, 'r', encoding='utf-8') as f:
        jobs_data = json.load(f)
    
    print(f"✅ Loaded {len(jobs_data)} job descriptions")
    
    # Transform data for vector database
    print("\n🔄 Transforming job data for vector database...")
    transformed_data = transform_jobs_for_vectordb(jobs_data)
    
    total_chunks = sum(len(doc['chunks']) for doc in transformed_data)
    print(f"✅ Created 1 document with {total_chunks} total chunks")
    
    # Create vector database
    print(f"\n🗄️ Creating vector database: job_descriptions")
    db = VectorDB(name="job_descriptions")
    db.load_data(transformed_data)
    
    print(f"✅ Vector database created and saved!")
    
    # Test search functionality
    print(f"\n🔍 Testing search functionality...")
    test_queries = [
        "software engineer python",
        "marketing manager", 
        "product design",
        "business development sales"
    ]

    for query in test_queries:
        results = db.search(query, k=3)
        print(f"\nQuery: '{query}'")
        for i, result in enumerate(results[:2], 1):
            metadata = result['metadata']
            similarity = result['similarity']
            job_title = metadata.get('job_title', 'Unknown')
            department = metadata.get('department', 'Unknown')
            print(f"  {i}. {job_title} ({department}) (similarity: {similarity:.3f})")
    
    return db
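
The functions above expect each job record to have a particular shape. The sketch below shows a hypothetical record with the keys that the code reads: `job_id`, `title`, `department`, and `sections` are accessed directly, while `metadata` and the individual section entries are optional. The `missing_keys` helper and the sample values are illustrative assumptions, not part of the project.

```python
from typing import Any, Dict, List

# Keys read with job[...] in the functions above; a missing one raises KeyError.
REQUIRED_KEYS = ("job_id", "title", "department", "sections")


def missing_keys(job: Dict[str, Any]) -> List[str]:
    """Return the required top-level keys that are absent from a job record."""
    return [key for key in REQUIRED_KEYS if key not in job]


# Hypothetical job record matching the fields the code above reads.
sample_job = {
    "job_id": "job-001",
    "title": "Data Engineer",
    "department": "Engineering",
    "metadata": {"seniority": "mid", "skills": ["Python", "SQL"]},
    "sections": {
        "intro": "Build and maintain data pipelines.",
        "responsibilities": ["Design ETL jobs"],
        "requirements": ["3+ years of experience"],
        "nice_to_haves": ["Experience with streaming systems"],
    },
}

print(missing_keys(sample_job))
print(missing_keys({"title": "Incomplete"}))
```

Running a check like this before `transform_jobs_for_vectordb` can surface malformed records early instead of failing partway through embedding creation.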

### Creating the new pickle file and database

In [6]:
print("🎯 Job Descriptions Embedding Creator")
print("=" * 50)

print("\n1️⃣ Creating vector embeddings...")
db = create_job_embeddings()

print("\n✨ Job embedding creation complete!")

🎯 Job Descriptions Embedding Creator

1️⃣ Creating vector embeddings...
🚀 Creating job description embeddings...

📖 Loading job descriptions from data/job_descriptions_conglomerate.json
✅ Loaded 236 job descriptions

🔄 Transforming job data for vector database...
✅ Created 1 document with 236 total chunks

🗄️ Creating vector database: job_descriptions
Loading vector database from disk.
✅ Vector database created and saved!

🔍 Testing search functionality...

Query: 'software engineer python'
  1. Software Engineer (Engineering) (similarity: 0.420)
  2. Data Engineer (Engineering) (similarity: 0.392)

Query: 'marketing manager'
  1. Head of Marketing (Business Operations & Strategy) (similarity: 0.437)
  2. Product Manager (Product Management) (similarity: 0.419)

Query: 'product design'
  1. Product Designer (Product Management) (similarity: 0.394)
  2. Product Designer (Product Management) (similarity: 0.394)

Query: 'business development sales'
  1. Business Development (Business Deve

## Generating Job Descriptions

In [12]:
# Import and use directly
from misc.job_description_generator import JobGenerator, generate_job_description

# One-liner generation
job_desc = generate_job_description(
    job_title="Senior Python Developer",
    department="Engineering", 
    requirements=["5+ years Python", "API development", "Database design"],
    seniority="senior",
    skills=["Python", "FastAPI", "PostgreSQL"]
)

# Or use the class for more control
generator = JobGenerator(verbose=False)  # Silent mode
job_desc = generator.generate(
    job_title="Marketing Manager",
    department="Marketing",
    key_requirements=["3+ years marketing", "Campaign management"]
)

📖 Loading vector database: job_descriptions
✅ Loaded 236 job descriptions

📄 Senior Python Developer - Engineering

**Job Description:**
We're looking for an enthusiastic, self-motivated professional to join our team.

**Key Responsibilities:**
• Design and develop robust APIs using Python and FastAPI to support scalable applications
• Lead discussions with product and design to rapidly iterate, experiment and launch products
• Design and build systems with an eye for performance & scalability
• Own products end to end and work across the stack as needed, including backend, database design, API development, testing, etc.
• Mentor team members on engineering best practices and career development

**Requirements:**
• 5+ years of Python development experience
• Strong experience in API development and design
• Experience with database design and optimization
• Experience building and maintaining a production system at scale
• Experience with FastAPI, PostgreSQL or similar technologies
• D

In [27]:
from data_utils import setup_contextual_embeddings

# This will automatically create the contextual embeddings if they don't exist
raw_chunks, enhanced_chunks = setup_contextual_embeddings('employee_handbook')

📖 Loading Employee Handbook data for contextual RAG...
🔄 Creating contextual embeddings for Employee Handbook...
Processing 77 Employee Handbook chunks...


Adding contextual information to Employee Handbook: 100%|██████████| 77/77 [09:12<00:00,  7.17s/it]

✅ Contextual information added for Employee Handbook!
📄 Enhanced chunks saved to ../data/employee_handbook_with_context.json
✅ Created contextual embeddings with 77 chunks



