A comprehensive Retrieval-Augmented Generation (RAG) system that processes PDF documents and provides advanced search capabilities through hybrid semantic-keyword search, intelligent LLM reranking, adaptive query enhancement, and real-time confidence scoring.
- Hybrid Search: Intelligent combination of semantic + keyword search (default: 60%/40%)
- Semantic Search: OpenAI embeddings using text-embedding-3-large for conceptual understanding
- Keyword Search: SQLite FTS5 with BM25-style ranking for exact term matching
- Configurable Weights: Customize semantic/keyword balance for optimal results
- LLM Reranking: Intelligent result reordering using Gemini 2.0 Flash Lite (~2s processing)
- Adaptive Query Enhancement: AI-powered query classification with automatic enhancement calibration
- Confidence Scoring: Real-time quality assessment with visual indicators (🟢 HIGH, 🟡 MEDIUM, 🔴 LOW)
- Auto-Detection: Intelligent language identification from document content
- Manual Override: Force specific language responses (Italian, Spanish, French, English)
- Native Explanations: Contextual analysis in the document's natural language
| Feature | CLI (query.py) | Web (gradio_browser.py) |
|---|---|---|
| Search Methods | ✅ --hybrid, --semantic, --bm25 | ✅ Interactive dropdown |
| Weight Control | ✅ --semantic-weight, --keyword-weight | ✅ Auto-normalizing sliders |
| LLM Reranking | ✅ --rerank flag | ✅ Reranking checkbox |
| Language Control | ✅ --language italian | ✅ Language dropdown |
| Real-time Feedback | ✅ Verbose logging | ✅ Live progress + logs |
| Visual Highlighting | ✅ ANSI colors | ✅ Rich HTML styling |
- Smart Highlighting: AI-powered semantic text highlighting with explanations
- Multi-Level Analysis: Content → Relevance → LLM Analysis → Synthesis
- Dual Answer Mode: Direct answers + detailed search breakdowns
- Professional Styling: Color-coded sections with responsive design
- Python 3.8+
- Poetry (recommended) or pip
- Clone the repository:
git clone <repository-url>
cd llmrag
- Install dependencies:
# Using Poetry (recommended)
poetry install
poetry shell
# Or using pip
pip install -r requirements.txt
- Environment configuration:
cp .env.example .env
# Edit .env with your API keys
Required configuration:
# OpenRouter API key for LLM analysis and explanations
# Get your key from: https://openrouter.ai/keys
OPENROUTER_API_KEY=your_openrouter_api_key_here
# OpenAI API key for embeddings (text-embedding-3-large)
# Get your key from: https://platform.openai.com/api-keys
OPENAI_API_KEY=your_openai_api_key_here
Process a PDF document to extract text and generate embeddings:
python ingest.py path/to/your/document.pdf
Options:
- -v, --verbose: Enable verbose logging
- -p, --pages N: Process only the first N pages
- --from-page N: Start from page N
- --to-page N: End at page N
Output:
- Semantic embeddings stored in ChromaDB collection: pdf_{pdf_name}
- Keyword index stored in SQLite FTS5 database: ./hybrid_search.db
- Vector database location: ./chroma_db/
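As a rough illustration of what a keyword index like the one above involves, the following sketch builds an FTS5 table with Porter stemming and ranks matches with SQLite's built-in bm25() function. The table name `chunks`, its columns, and the `keyword_search` helper are hypothetical; the actual schema in sqlite_fts5.py may differ. It assumes the local SQLite build includes FTS5.

```python
import sqlite3

# In-memory database for illustration; the project keeps its index in ./hybrid_search.db.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per text chunk, tokenized with Porter stemming.
conn.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5(content, page UNINDEXED, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO chunks (content, page) VALUES (?, ?)",
    [
        ("Pricing strategies for competitive markets", 1),
        ("Neptune is the eighth planet from the Sun", 2),
        ("A pricing model compares costs and margins", 3),
    ],
)


def keyword_search(connection, query, top_k=3):
    """Return (content, page, score) rows ranked by FTS5's bm25()."""
    # bm25() yields lower (more negative) values for better matches,
    # so ascending order puts the best match first.
    return connection.execute(
        "SELECT content, page, bm25(chunks) AS score "
        "FROM chunks WHERE chunks MATCH ? ORDER BY score LIMIT ?",
        (query, top_k),
    ).fetchall()


results = keyword_search(conn, "pricing")
for content, page, score in results:
    print(f"page {page}: {content} (bm25={score:.3f})")
```

Because of stemming, a query for "planets" also matches the chunk that contains "planet".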
Search processed documents using hybrid, semantic, or keyword search:
python query.py "your search query"
Search Mode Options:
- --hybrid: Hybrid search combining semantic + keyword (default, 60% semantic + 40% keyword)
- --semantic: Semantic search only (ChromaDB embeddings)
- --bm25: Keyword search only (SQLite FTS5 BM25)
- --semantic-weight FLOAT: Weight for semantic search in hybrid mode (default: 0.6)
- --keyword-weight FLOAT: Weight for keyword search in hybrid mode (default: 0.4)
General Options:
- --pdf PDF_NAME: Search specific PDF (default: all)
- -k, --top-k N: Number of results to show (default: 3)
- -s, --min-similarity FLOAT: Minimum similarity threshold (default: 0.0)
- --language LANG: Force response language (italian, spanish, french, english; default: auto-detect)
- --enhancement MODE: Enhancement mode (auto, minimal, full, maximum, off; default: auto)
- --rerank: Enable LLM reranking for improved result quality (~2s)
- --no-enhancement: Legacy flag to disable enhancement (equivalent to --enhancement=off)
- --no-text: Hide text content
- --no-analysis: Disable LLM analysis
- --list: List available PDF collections
- -v, --verbose: Enable verbose logging
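For readers who want to see how the options above fit together, here is a minimal argparse sketch covering a subset of them. It is a reconstruction from the option list, not the parser in query.py: the function names `build_parser` and `resolve_enhancement` are hypothetical, and treating the three search modes as mutually exclusive is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a subset of the documented query.py options."""
    parser = argparse.ArgumentParser(prog="query.py")
    parser.add_argument("query", nargs="?", help="search query text")

    # Assumption: only one search mode may be selected at a time.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--hybrid", action="store_true")
    mode.add_argument("--semantic", action="store_true")
    mode.add_argument("--bm25", action="store_true")

    parser.add_argument("--semantic-weight", type=float, default=0.6)
    parser.add_argument("--keyword-weight", type=float, default=0.4)
    parser.add_argument("--pdf", default=None)
    parser.add_argument("-k", "--top-k", type=int, default=3)
    parser.add_argument("-s", "--min-similarity", type=float, default=0.0)
    parser.add_argument(
        "--language", choices=["italian", "spanish", "french", "english"], default=None
    )
    parser.add_argument(
        "--enhancement",
        choices=["auto", "minimal", "full", "maximum", "off"],
        default="auto",
    )
    parser.add_argument("--no-enhancement", action="store_true")
    parser.add_argument("--rerank", action="store_true")
    return parser


def resolve_enhancement(args):
    """The legacy --no-enhancement flag is documented as equivalent to --enhancement=off."""
    return "off" if args.no_enhancement else args.enhancement


args = build_parser().parse_args(["pricing models", "--bm25", "-k", "5", "--no-enhancement"])
print(args.top_k, resolve_enhancement(args))
```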
Examples:
# Default hybrid search (60% semantic + 40% keyword)
python query.py "pricing strategies"
# Semantic search only (concepts and context)
python query.py "market analysis" --semantic
# Keyword search only (exact terms and phrases)
python query.py "machine learning" --bm25
# Custom hybrid weighting (80% semantic, 20% keyword)
python query.py "artificial intelligence" --semantic-weight 0.8 --keyword-weight 0.2
# Search specific document with 5 results
python query.py "competitive advantage" --pdf business_plan -k 5
# Filter by similarity threshold
python query.py "pricing models" -s 0.3
# List available documents
python query.py --list
# Keyword search for exact terminology
python query.py "REST API endpoint" --bm25
# LLM reranking for improved result quality (~2s processing time)
python query.py "machine learning algorithms" --rerank
python query.py "market analysis" --semantic --rerank
python query.py "competitive strategies" --hybrid --rerank
# Confidence scoring examples (visual quality assessment)
python query.py "Neptune distance from sun" --dual-answer # → 97% HIGH confidence
python query.py "string theory cosmology" --dual-answer # → 62% MEDIUM confidence
python query.py "pasta recipe" --dual-answer # → No results (outside domain)
# Semantic search for conceptual understanding
python query.py "What can we learn from David's relationship with God?" --semantic
# Force Italian language responses
python query.py "strategie di marketing" --language italian
# Force English responses for multilingual documents
python query.py "análisis de mercado" --language english
# Adaptive query enhancement (automatic classification and calibration)
python query.py "Quanto è grande il Sole?" --enhancement=auto # Factual → minimal enhancement
python query.py "Cos'è una supernova?" --enhancement=auto # Conceptual → full enhancement
python query.py "Differenza tra pianeta e stella" --enhancement=auto # Comparative → maximum enhancement
# Manual enhancement control
python query.py "search query" --enhancement=minimal # Factual optimization
python query.py "search query" --enhancement=full # Balanced (previous default)
python query.py "search query" --enhancement=maximum # Comparative optimization
python query.py "search query" --enhancement=off # Disable enhancement
# Legacy query enhancement examples (automatic translation and expansion)
python query.py "Nettuno" --bm25 # → Enhanced to "Neptune planet eighth planet"
python query.py "strategie di marketing" --hybrid # → Enhanced to include "marketing strategies business promotional"
# Disable query enhancement for exact term matching (legacy)
python query.py "machine learning" --bm25 --no-enhancement
Launch the Gradio web interface:
python gradio_browser.py
Access at: http://localhost:7860
Port Management: If port 7860 is occupied, use the utility script:
./kill_port_7860.sh
Enhanced Web Interface Features:
- 🔍 Search Method Selection: Dropdown for Hybrid/Semantic/BM25 search modes
- ⚖️ Hybrid Weight Control: Interactive sliders for semantic/keyword balance (auto-normalizing to 1.0)
- 🔄 LLM Reranking: Checkbox to enable intelligent result reordering (~2s)
- 🌐 Language Selection: Dropdown for Auto-detect/English/Italian/Spanish/French
- 📚 Collection Filtering: Text input for specific document collections
- 🎯 Smart UI: Sliders only visible when Hybrid mode is selected
- Interactive Search Interface: Real-time document search with full CLI feature parity
- Confidence Scoring: Visual quality indicators (🟢 HIGH ≥80%, 🟡 MEDIUM 60-79%, 🔴 LOW <60%)
- Rich Text Highlighting: Advanced semantic highlighting with footnoted explanations
- Multi-Level Analysis: Each result includes:
  - 📖 Content: Highlighted text with semantic annotations
  - 💡 Relevance Analysis: Numbered explanations for each highlight
  - 🧠 LLM Analysis: AI-powered relevance assessment
  - 🔬 Comprehensive Synthesis: Cross-result analysis and insights
- Dual Answer Mode: Direct answers + detailed search result analysis with confidence assessment
- Responsive Design: Dark/light theme support with professional styling
- Live Feedback: Real-time logs showing search method, weights, and progress
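The confidence bands listed above (HIGH ≥80%, MEDIUM 60-79%, LOW <60%) can be expressed as a small mapping function. This is a sketch only: the name `confidence_label` is hypothetical, and it assumes the score arrives as a fraction between 0 and 1. How the project actually computes the score is not covered here.

```python
def confidence_label(score):
    """Map a confidence score in [0, 1] to the documented visual indicator."""
    if score >= 0.80:
        return "🟢 HIGH"
    if score >= 0.60:
        return "🟡 MEDIUM"
    return "🔴 LOW"


for value in (0.97, 0.62, 0.41):
    print(f"{value:.0%} {confidence_label(value)}")
```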
View database statistics:
python info.py
- ingest.py: Document processing pipeline
  - PDF text extraction using PyMuPDF
  - Smart text chunking (500 chars with 50 char overlap)
  - OpenAI embedding generation
  - ChromaDB storage with metadata
- query.py: Semantic search engine
  - ChromaDB similarity search
  - Multi-collection querying
  - LLM-powered result highlighting
  - Multilingual explanations
- llm_wrapper.py: API integration layer
  - OpenRouter API for LLM analysis
  - OpenAI API for embeddings
  - Comprehensive error handling
  - Text preprocessing and normalization
- gradio_browser.py: Enhanced web interface
  - Search Method Control: Full dropdown support for Hybrid/Semantic/BM25 modes
  - Interactive Weight Tuning: Real-time sliders for hybrid search balance (auto-normalizing)
  - LLM Reranking Control: Checkbox to enable intelligent result reordering
  - Language Selection: Dropdown for forced language responses
  - Rich semantic highlighting: Footnoted explanations with multi-level analysis
  - Professional UI: Smart controls with conditional visibility and live feedback
  - Full CLI Parity: All command-line features available in web interface
- info.py: Database utilities
  - Collection statistics
  - Database management tools
- Semantic Search: OpenAI text-embedding-3-large (3072 dimensions) with ChromaDB
- Keyword Search: SQLite FTS5 with BM25 ranking and Porter stemming
- Hybrid Search: Intelligent combination with configurable weighting
- LLM Analysis: OpenAI GPT-4.1 Nano via OpenRouter (configurable)
- LLM Reranking: Google Gemini 2.0 Flash Lite for intelligent result ordering
- PDF Processing: PyMuPDF for text extraction
- Web Interface: Gradio for interactive search
- Text Processing: Advanced UTF-8 handling and normalization
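To make the "configurable weighting" of hybrid search concrete, the sketch below combines two score sets with the default 60%/40% split. The fusion method is an assumption: it min-max normalizes each score set and takes a weighted sum, which may not match what query.py does. All function names are hypothetical, and keyword scores are assumed to be "higher is better" (for example, a negated bm25() value).

```python
def normalize_weights(semantic_weight=0.6, keyword_weight=0.4):
    """Scale the two weights so they sum to 1.0, like the auto-normalizing sliders."""
    total = semantic_weight + keyword_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return semantic_weight / total, keyword_weight / total


def min_max(scores):
    """Rescale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (value - low) / (high - low) for doc, value in scores.items()}


def hybrid_rank(semantic_scores, keyword_scores, semantic_weight=0.6, keyword_weight=0.4):
    """Return (doc_id, combined_score) pairs, best first."""
    w_sem, w_kw = normalize_weights(semantic_weight, keyword_weight)
    sem, kw = min_max(semantic_scores), min_max(keyword_scores)
    combined = {
        doc: w_sem * sem.get(doc, 0.0) + w_kw * kw.get(doc, 0.0)
        for doc in set(sem) | set(kw)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = hybrid_rank({"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "c": 3.0})
print(ranked)
```

In this toy input, document "b" overtakes "a" because it scores well in both searches, which is the behavior hybrid weighting is meant to produce.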
llmrag/
├── ingest.py # PDF processing and dual-database ingestion
├── query.py # Hybrid/semantic/keyword search engine
├── sqlite_fts5.py # SQLite FTS5 keyword search manager
├── llm_wrapper.py # API integration (OpenAI + OpenRouter)
├── gradio_browser.py # Web interface for document browsing
├── info.py # Database information and utilities
├── test_llm_wrapper.py # API connection tests
├── test_chunking.py # Text chunking tests
├── pyproject.toml # Project dependencies
├── .env.example # Environment configuration template
├── .env # Environment configuration (create from .env.example)
├── kill_port_7860.sh # Utility to free port 7860 (kill processes)
├── chroma_db/ # ChromaDB vector database (auto-created)
└── hybrid_search.db # SQLite FTS5 keyword database (auto-created)
- Chunk Size: 500 characters with 50 character overlap
- Word Boundaries: Preserves word integrity
- Context Preservation: Overlap maintains semantic continuity
- Metadata: Tracks chunk relationships and source pages
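The chunking parameters above can be illustrated with a short sketch. This is not the project's chunk_text() from ingest.py; it is one plausible way to get 500-character chunks with roughly 50 characters of overlap while cutting only at spaces. Metadata tracking is omitted.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters, cut at spaces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last space so a word is not cut in half.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Begin the next chunk about `overlap` characters before this one ended,
        # then move forward to the start of a whole word.
        next_start = max(end - overlap, start + 1)
        boundary = text.find(" ", next_start, end)
        start = boundary + 1 if boundary != -1 else end
    return chunks


sample = " ".join(f"word{i}" for i in range(300))
chunks = chunk_text(sample)
print(len(chunks), "chunks; first chunk ends with:", chunks[0][-20:])
```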
The system provides sophisticated text highlighting with explanations:
Terminal (query.py):
- Yellow Background: Semantically relevant text sections
- Green Text: Detailed explanations of relevance
- Structured Sections: Bordered content areas with emoji headers
- Footnoted Explanations: Numbered references with detailed analysis
Web Interface (gradio_browser.py):
- Yellow Highlights: Semantically relevant text with footnote numbers
- Explanation Cards: Numbered explanations in dedicated sections
- Multi-Level Analysis: Content → Relevance → LLM Analysis → Synthesis
- Professional Styling: Color-coded sections with responsive design
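The terminal styling described above (yellow background for relevant text, green numbered explanations) relies on standard ANSI escape codes. The following sketch shows the idea with hard-coded spans. In the project the spans and explanations come from the LLM, and the `highlight` helper here is hypothetical.

```python
YELLOW_BG = "\033[43m"  # ANSI: yellow background
GREEN = "\033[32m"      # ANSI: green foreground
RESET = "\033[0m"       # ANSI: reset all attributes


def highlight(text, spans):
    """Wrap each (phrase, explanation) pair in a highlight plus a footnote marker.

    Returns the highlighted text and a list of green footnote lines.
    Only the first occurrence of each phrase is highlighted.
    """
    footnotes = []
    for number, (phrase, explanation) in enumerate(spans, start=1):
        if phrase not in text:
            continue
        text = text.replace(phrase, f"{YELLOW_BG}{phrase}{RESET}[{number}]", 1)
        footnotes.append(f"{GREEN}[{number}] {explanation}{RESET}")
    return text, footnotes


highlighted, notes = highlight(
    "Neptune is the eighth planet.",
    [("eighth planet", "States the planet's position")],
)
print(highlighted)
print("\n".join(notes))
```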
Automatic Detection + Manual Override:
- Auto-Detection: Intelligent language identification from document content
- Manual Override: Force specific language via --language (CLI) or dropdown (Web)
Supported Languages:
- Italian: Spiegazioni in italiano naturale
- Spanish: Explicaciones en español natural
- French: Explications en français naturel
- English: Natural English explanations
Usage Examples:
# Auto-detect language from document
python query.py "strategia aziendale" # → Italian responses
# Force specific language
python query.py "business strategy" --language italian # → Italian responses
python query.py "estrategia empresarial" --language english # → English responses
Intelligent Query Classification + Automatic Enhancement Calibration:
The system now automatically classifies queries and adapts enhancement levels for optimal results:
Query Classification Types:
- Factual Queries: "quanto", "quando", "dove", "how big", "when did" → Minimal Enhancement
  - Example: "Quanto è grande il Sole?" → "How big Sun? size"
  - Strategy: Translation + 1-2 direct synonyms only
- Conceptual Queries: "cos'è", "come", "perché", "what is", "how does" → Full Enhancement
  - Example: "Cos'è una supernova?" → "What supernova? stellar explosion massive star core collapse"
  - Strategy: Balanced expansion with relevant terminology
- Comparative Queries: "differenza", "confronto", "versus", "compare" → Maximum Enhancement
  - Example: "Differenza tra pianeta e stella" → "Difference planet star? celestial bodies stellar objects formation"
  - Strategy: Extensive synonyms + related concepts + domain terminology
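The mapping from query type to enhancement level can be approximated with a keyword heuristic built from the cue words listed above. Note that the project describes its classifier as AI-powered, so this sketch is a simplified stand-in rather than the actual method. The function name, the order in which cue lists are checked, and the fallback to "full" are all assumptions.

```python
# Cue words taken from the classification list above.
FACTUAL_CUES = ("quanto", "quando", "dove", "how big", "when did")
CONCEPTUAL_CUES = ("cos'è", "come", "perché", "what is", "how does")
COMPARATIVE_CUES = ("differenza", "confronto", "versus", "compare")


def classify_query(query):
    """Return (query_type, enhancement_mode) using naive substring matching."""
    lowered = query.lower()
    # Assumption: comparative cues take priority when several lists match.
    if any(cue in lowered for cue in COMPARATIVE_CUES):
        return "comparative", "maximum"
    if any(cue in lowered for cue in FACTUAL_CUES):
        return "factual", "minimal"
    if any(cue in lowered for cue in CONCEPTUAL_CUES):
        return "conceptual", "full"
    # Assumed fallback: "full" is documented as the previous default mode.
    return "conceptual", "full"


for example in ("Quanto è grande il Sole?", "Cos'è una supernova?", "Differenza tra pianeta e stella"):
    print(example, "->", classify_query(example))
```

Substring matching is deliberately crude: a cue such as "come" would also fire on "become", which is one reason an LLM-based classifier is a reasonable choice.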
Enhancement Modes:
- --enhancement=auto (default): Automatic classification and calibration
- --enhancement=minimal: Factual queries optimization
- --enhancement=full: Balanced enhancement (previous default)
- --enhancement=maximum: Comparative queries optimization
- --enhancement=off: Disable enhancement completely
Adaptive Process:
- Query Classification: AI-powered analysis of query type and intent
- Adaptive Calibration: Enhancement level adjusted based on query characteristics
- Language Detection: Identifies non-English queries
- Translation: Converts to English for document matching
- Smart Expansion: Term expansion calibrated to query type
- Context Addition: Includes domain-specific terminology as appropriate
- Fallback Protection: Uses original if enhancement fails
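The last step above, fallback protection, is worth a brief sketch because it determines what happens when the LLM call fails. Here `enhancer` is a hypothetical callable standing in for whatever function performs translation and expansion. The wrapper simply returns the original query if that call raises or yields nothing usable.

```python
def enhance_query(query, enhancer, mode="auto"):
    """Return the enhanced query, or the original one if enhancement fails."""
    if mode == "off":
        return query
    try:
        enhanced = enhancer(query, mode)
    except Exception:
        # Any failure (network error, bad response) falls back to the original.
        return query
    if not isinstance(enhanced, str) or not enhanced.strip():
        return query
    return enhanced.strip()


def fake_enhancer(query, mode):
    """Stand-in for the LLM call; returns a canned expansion."""
    return "Neptune planet eighth planet"


def broken_enhancer(query, mode):
    """Stand-in that simulates an API failure."""
    raise TimeoutError("simulated API timeout")


print(enhance_query("Nettuno", fake_enhancer))
print(enhance_query("Nettuno", broken_enhancer))
```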
Benefits:
- Intelligent Trade-offs: Right-sized enhancement reduces noise for factual queries while maximizing recall for complex queries
- Cross-Language Search: Find English documents using non-English queries
- Adaptive Efficiency: Optimal enhancement level automatically selected
- Better Precision: LLM-guided term selection preserves relevance
- Domain Awareness: Adds field-specific terminology when appropriate
- Backward Compatibility: Legacy --no-enhancement flag continues to work
Control Options:
- Default: --enhancement=auto for intelligent adaptation
- Manual Override: Use specific modes when you know the query type
- Legacy Support: --no-enhancement still works for exact term matching
- Cosine Similarity: Precise semantic matching
- Normalized Embeddings: Consistent similarity ranges
- Threshold Filtering: Configurable relevance cutoffs
- Ranked Results: Best matches first with confidence scores
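The scoring steps above amount to computing cosine similarity, dropping results below a threshold, and sorting. The pure-Python sketch below shows that pipeline on toy 2-dimensional vectors. In the project, ChromaDB performs the search over 3072-dimensional embeddings, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, min_similarity=0.0, top_k=3):
    """Score every document, filter by threshold, and return the best top_k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    kept = [item for item in scored if item[1] >= min_similarity]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]


documents = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
ranked = rank_by_similarity([1.0, 0.0], documents, min_similarity=0.3)
print(ranked)
```

With the threshold at 0.3 (the value used in the `-s 0.3` example), document "c" is dropped because it is orthogonal to the query.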
Test your setup:
# Test API connections
python llm_wrapper.py
# Test with sample document
python ingest.py sample.pdf -v
python query.py "test query" -v
# Run test suites
python test_llm_wrapper.py
python test_chunking.py
| Variable | Description | Required |
|---|---|---|
| OPENROUTER_API_KEY | OpenRouter API key for LLM calls and reranking | Yes |
| OPENAI_API_KEY | OpenAI API key for embeddings | Yes |
| SEMANTIC_MODEL | LLM model for analysis | No (defaults to GPT-4.1 Nano) |
| RERANKING_MODEL | Model for intelligent result reranking | No (defaults to Gemini 2.0 Flash Lite) |
| EMBEDDING_MODEL | Embedding model | No (defaults to text-embedding-3-large) |
For SEMANTIC_MODEL:
- openai/gpt-4.1-nano (current default, fast & cost-effective)
- anthropic/claude-3-haiku:beta (excellent alternative)
- anthropic/claude-3-sonnet:beta (balanced performance for complex analysis)
For RERANKING_MODEL:
- google/gemini-2.0-flash-lite-001 (current default, optimized for reranking)
- anthropic/claude-3-haiku:beta (alternative for reranking)
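The configuration rules above (two required keys, three optional model names with defaults) could be loaded as in the sketch below. The `load_config` function is hypothetical and is not taken from llm_wrapper.py. It accepts any mapping so it can be tried without touching the real environment.

```python
import os

# Defaults as documented in the configuration table and model lists.
DEFAULTS = {
    "SEMANTIC_MODEL": "openai/gpt-4.1-nano",
    "RERANKING_MODEL": "google/gemini-2.0-flash-lite-001",
    "EMBEDDING_MODEL": "text-embedding-3-large",
}
REQUIRED = ("OPENROUTER_API_KEY", "OPENAI_API_KEY")


def load_config(env=None):
    """Build a config dict from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        # Empty or unset values fall back to the documented default.
        config[name] = env.get(name) or default
    return config


example_env = {
    "OPENROUTER_API_KEY": "your_openrouter_api_key_here",
    "OPENAI_API_KEY": "your_openai_api_key_here",
    "SEMANTIC_MODEL": "anthropic/claude-3-haiku:beta",
}
config = load_config(example_env)
print(config["SEMANTIC_MODEL"], config["RERANKING_MODEL"])
```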
- Batch Processing: Efficient embedding generation
- Persistent Storage: ChromaDB for fast retrieval
- Smart Chunking: Optimal text segmentation
- Caching: Model loading optimization
- Processing: ~2-3 seconds per PDF page
- Search: <1 second for typical queries
- Memory: ~200MB base + embedding cache
- Storage: ~1MB per document (embeddings + metadata)
- Missing API Keys:
  - Error: OPENAI_API_KEY not found
  - Solution: Add API keys to .env file
- Collection Not Found:
  - Error: No PDF collections found
  - Solution: Run ingest.py first to process documents
- Low Similarity Scores:
  - Issue: No relevant results
  - Solution: Try broader queries or lower threshold (-s 0.1)
- Empty Results:
  - Issue: Query returns no results
  - Solution: Check if documents are properly ingested with --list
Enable detailed logging:
# Set debug in .env
PAK_DEBUG=true
# Or export temporarily
export PAK_DEBUG=true
python ingest.py document.pdf -v
- Custom Models: Modify model selection in llm_wrapper.py
- Output Formats: Extend display_results() in query.py
- Language Support: Add language rules in highlight_relevant_text()
- Chunking Strategies: Modify chunk_text() in ingest.py
The system uses two APIs:
- OpenAI: For high-quality text embeddings
- OpenRouter: For LLM analysis and explanations
Both APIs are abstracted through llm_wrapper.py for easy modification.
The system excels at understanding conceptual queries:
Query: "What can we learn from David's relationship with God?"
Results Retrieved:
- David vs Goliath - Faith enabling impossible victories
- David and Bathsheba - Consequences of moral failure
- Saul's Relationship - Contrast in divine favor
Generated Analysis: Multi-colored highlighting showing:
- PRIMARY: Direct references to David and God
- SECONDARY: Concepts of faith, trust, and divine relationship
- CONTEXT: Historical and theological background
Query: "competitive pricing strategies"
Results: Semantic matching finds relevant content even without exact phrase matches:
- Market positioning discussions
- Pricing model comparisons
- Competitive analysis frameworks
For issues and questions:
- Check the troubleshooting section
- Review error logs with debug mode enabled
- Verify API key configuration
- Test with smaller documents first