Local-first RAG system based on modern retrieval research - query expansion, cross-encoder reranking, hybrid search (FAISS + BM25), and semantic chunking. Runs entirely on your machine via Ollama. No cloud APIs, no data leaving your network.
Also available as a .NET port: pyragix-net
PyRagix implements a multi-stage retrieval pipeline.
Query Pipeline:
User Query
↓
Multi-Query Expansion (3-5 variants via local LLM)
↓
Hybrid Search (FAISS semantic 70% + BM25 keyword 30%)
↓
Cross-Encoder Reranking (top-20 → top-7 by relevance)
↓
Answer Generation (local Ollama LLM)
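The query stages above can be sketched as plain function composition. This is an illustrative sketch only — every function name here is a hypothetical placeholder, not PyRagix's actual module API:

```python
# Sketch of the query pipeline as function composition.
# All names are hypothetical stand-ins, not PyRagix's real API.

def expand_query(query: str) -> list[str]:
    # Stand-in for LLM-based multi-query expansion (3-5 variants).
    return [query, f"what is {query}", f"explain {query}"]

def hybrid_search(queries: list[str], top_k: int = 20) -> list[str]:
    # Stand-in for fused FAISS + BM25 retrieval over all variants.
    return [f"chunk for '{q}'" for q in queries][:top_k]

def rerank(chunks: list[str], top_n: int = 7) -> list[str]:
    # Stand-in for cross-encoder re-scoring (top-20 -> top-7).
    return chunks[:top_n]

def answer_context(query: str) -> list[str]:
    # At most top_n chunks reach the LLM for answer generation.
    return rerank(hybrid_search(expand_query(query)))
```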
Ingestion Pipeline:
Document Input (PDF, HTML, Images)
↓
Text Extraction (PyMuPDF, BeautifulSoup, PaddleOCR)
↓
Semantic Chunking (sentence-boundary aware)
↓
Embedding Generation (local sentence-transformers)
↓
Dual Indexing (FAISS vector + BM25 keyword)
Query expansion helps with recall on vague or paraphrased questions. Reranking filters out keyword-matched junk. Hybrid search handles structured queries (names, dates, IDs) that pure semantic search misses.
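The 70/30 split amounts to a convex combination of normalized per-chunk scores. A minimal sketch of that fusion, with simplified min-max normalization (illustrative only, not the actual retrieval code):

```python
def fuse_scores(
    semantic: dict[str, float],
    keyword: dict[str, float],
    alpha: float = 0.7,
) -> list[tuple[str, float]]:
    """Blend per-chunk scores: alpha * semantic + (1 - alpha) * keyword."""

    def normalize(scores: dict[str, float]) -> dict[str, float]:
        # Min-max normalize so the two score scales are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    ids = set(sem) | set(kw)  # a chunk may appear in only one result set
    fused = {i: alpha * sem.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

With `alpha = 0.7`, a chunk found only by BM25 can still surface, but a strong semantic match dominates.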
- Query expansion - generates multiple query variants via the local LLM to improve recall
- Cross-encoder reranking - re-scores retrieved chunks with a dedicated relevance model
- Hybrid search - FAISS semantic search + BM25 keyword matching, weighted and fused
- Semantic chunking - splits at sentence boundaries instead of fixed character counts
- Multi-format ingestion - PDF, HTML, and images (via PaddleOCR)
- Incremental updates - add documents without reprocessing the whole corpus
- Web UI and console interface - FastAPI backend with a TypeScript frontend, or use the CLI
- Runs on Windows, Linux, and macOS
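Sentence-boundary chunking can be sketched as greedy packing of whole sentences. Note this is a simplified stand-in (the real pipeline uses langchain-text-splitters, and overlaps by characters rather than by whole sentences as below):

```python
import re

def semantic_chunks(text: str, max_size: int = 1600, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks up to max_size characters,
    carrying `overlap` trailing sentences into the next chunk."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for sent in sentences:
        if current and size + len(sent) > max_size:
            chunks.append(current)
            # Keep the last `overlap` sentences for context continuity.
            current = current[-overlap:] if overlap else []
            size = sum(len(s) for s in current)
        current.append(sent)
        size += len(sent)
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```

Unlike fixed character windows, no chunk ever starts or ends mid-sentence.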
The entire codebase passes `pyright --strict` with zero errors and zero `# type: ignore` comments. Python 3.13+ syntax throughout (`X | None`, `list[T]`, `dict[K, V]`).
Third-party C++ libraries (FAISS, PyMuPDF, PaddleOCR) are typed via Protocols and custom stubs:
```python
# ingestion/models.py
class PDFPage(Protocol):
    """Protocol for PyMuPDF Page objects."""

    def get_text(self, option: str) -> str: ...
    def get_pixmap(self, dpi: int) -> PDFPixmap: ...
```

Additional stubs for faiss, paddleocr, rank_bm25, sqlite_utils, and others live in `typings/`.
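Because Protocols are structural, any object with matching method signatures satisfies them — which is what lets strictly typed code call into untyped C++ bindings. A hypothetical usage sketch (the `FakePage` class and `extract_text` helper are illustrations, not project code):

```python
from typing import Protocol

class PDFPage(Protocol):
    """Structural type mirroring the stub above (get_text only, for brevity)."""
    def get_text(self, option: str) -> str: ...

def extract_text(page: PDFPage) -> str:
    # pyright checks this call against the Protocol, not the untyped binding.
    return page.get_text("text")

# Any object with a matching get_text() satisfies the Protocol structurally,
# which also makes test doubles trivial:
class FakePage:
    def get_text(self, option: str) -> str:
        return f"page text ({option})"
```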
All config and data models use Pydantic v2:
```python
# types_models.py
class MetadataDict(BaseModel):
    model_config = ConfigDict(frozen=True, validate_assignment=True)

    source: str
    chunk_index: int = Field(ge=0)
    total_chunks: int
    file_type: str
```

The codebase is split into three packages with explicit boundaries:
```python
from ingestion import (
    FAISSManager,   # Vector index management
    FileScanner,    # Document discovery
    MetadataStore,  # SQLite operations
    TextProcessor,  # Extraction pipeline
)

from rag import (
    RAGConfig,        # Configuration
    load_models,      # Model initialization
    hybrid_search,    # Multi-stage retrieval
    generate_answer,  # LLM generation
)

from utils import (
    BM25Index,      # Keyword search
    QueryExpander,  # Query rewriting
    Reranker,       # Cross-encoder scoring
)
```

- Python 3.13+ with the uv package manager (recommended) or pip
- Ollama for local LLM inference - download from ollama.com
- 8GB+ RAM (16GB+ recommended for optimal performance)
```shell
# Clone repository
git clone https://github.com/psarno/PyRagix.git
cd PyRagix

# Install dependencies with uv (recommended - fast and reliable)
uv sync

# Or with pip (installs from pyproject.toml)
pip install -e .

# Pull the Ollama model for local LLM inference
ollama pull qwen2.5:7b
ollama serve
```

```shell
# Ingest documents (builds FAISS + BM25 indexes)
uv run python ingest_folder.py --fresh ./docs

# Start the web interface (compiles TypeScript frontend and starts server)
./dev.sh
# Open http://localhost:8000/web/

# Or use the console interface
uv run python query_rag.py
```

PyRagix uses settings.toml for all configuration. The file is auto-generated on first run with defaults tuned to your system; a template is available at settings.example.toml.
All RAG features are off by default. Turn them on as needed:
```toml
[query_expansion]
ENABLE_QUERY_EXPANSION = true
QUERY_EXPANSION_COUNT = 3

[reranking]
ENABLE_RERANKING = true
RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
RERANK_TOP_K = 20

[hybrid_search]
ENABLE_HYBRID_SEARCH = true
HYBRID_ALPHA = 0.7  # 70% semantic, 30% keyword

[semantic_chunking]
ENABLE_SEMANTIC_CHUNKING = true
SEMANTIC_CHUNK_MAX_SIZE = 1600
SEMANTIC_CHUNK_OVERLAP = 200
```

- Query expansion generates variant phrasings of your query. Helps most with vague or ambiguous questions. `QUERY_EXPANSION_COUNT` controls how many variants (default 3).
- Reranking re-scores the top candidates with a cross-encoder, filtering out chunks that matched on keywords but aren't actually relevant. `RERANK_TOP_K` sets the candidate pool (default 20).
- Hybrid search fuses FAISS and BM25 results. Mostly useful for structured queries (names, dates, IDs) that pure vector search misses. `HYBRID_ALPHA` controls the weight split.
- Semantic chunking splits at sentence boundaries instead of fixed character counts, preserving context better.
Enabling everything adds a few hundred ms per query, which is small compared to LLM generation time.
For memory-constrained systems (8-12GB RAM):
```toml
[embeddings]
BATCH_SIZE = 8

[threading]
TORCH_NUM_THREADS = 4

[pdf]
BASE_DPI = 100
```

For high-performance systems (32GB+ RAM):

```toml
[embeddings]
BATCH_SIZE = 32

[threading]
TORCH_NUM_THREADS = 12

[pdf]
BASE_DPI = 200

[faiss]
NLIST = 2048
NPROBE = 32
```

Customize the Ollama model and generation parameters:
```toml
[llm]
OLLAMA_MODEL = "qwen2.5:7b"
TEMPERATURE = 0.1
TOP_P = 0.9
MAX_TOKENS = 500
REQUEST_TIMEOUT = 180

[retrieval]
DEFAULT_TOP_K = 7
```

Add new documents without reprocessing:
```shell
# Initial ingestion
uv run python ingest_folder.py ./docs

# Later: add more documents (automatically skips processed files)
uv run python ingest_folder.py ./more_docs
```

Skip specific file types or patterns:

```toml
[pdf]
SKIP_FILES = ["*.tmp", "backup_*", "archive/*"]
```

PyRagix uses IVF (Inverted File) indexing by default for fast search on large corpora:
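Glob-style skip patterns like these can be evaluated with the standard library's fnmatch module. A sketch of the idea (not the actual file_filters implementation; `should_skip` is a hypothetical name):

```python
from fnmatch import fnmatch

def should_skip(path: str, patterns: list[str]) -> bool:
    """Return True if the (forward-slash) relative path matches any glob pattern."""
    return any(fnmatch(path, pat) for pat in patterns)

skip = ["*.tmp", "backup_*", "archive/*"]
```

Note that fnmatch's `*` matches across path separators, so `archive/*` also matches nested files like `archive/2023/old.pdf`.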
```toml
[faiss]
INDEX_TYPE = "ivf"
NLIST = 1024
NPROBE = 16
```

- `NLIST`: number of clusters (default 1024). Increase for larger datasets (10k+ chunks).
- `NPROBE`: clusters searched per query (default 16). Higher values improve recall at the cost of speed.
The system automatically falls back to flat indexing for small collections (< 2048 chunks), then upgrades to IVF as your corpus grows.
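That switchover can be sketched as a simple size check. The 2048 threshold mirrors the text above; the function name is a hypothetical illustration, not PyRagix's API:

```python
def choose_index_type(n_chunks: int, min_ivf_size: int = 2048) -> str:
    """IVF needs enough vectors to train its cluster centroids well;
    below the threshold, exact flat search is both simpler and accurate."""
    return "ivf" if n_chunks >= min_ivf_size else "flat"
```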
GPU is auto-detected with CPU fallback:
```toml
[gpu]
GPU_ENABLED = true
GPU_DEVICE = 0
GPU_MEMORY_FRACTION = 0.8
```

GPU FAISS requires a separate installation. CPU-only FAISS works fine and is the default.
```
PyRagix/
├── ingest_folder.py         # Document ingestion CLI (thin wrapper)
├── query_rag.py             # Console query CLI (thin wrapper)
├── web_server.py            # FastAPI web server
├── dev.sh                   # Development script (compiles TypeScript + starts server)
├── config.py                # Configuration management
├── settings.toml            # User configuration (auto-generated, TOML format)
├── settings.example.toml    # Configuration template
├── types_models.py          # Shared Pydantic models (MetadataDict, etc.)
│
├── ingestion/               # Document Processing Pipeline (11 modules)
│   ├── __init__.py          # Package exports
│   ├── cli.py               # CLI argument parsing
│   ├── environment.py       # Environment setup (torch, GPU detection)
│   ├── faiss_manager.py     # FAISS index management (IVF, flat)
│   ├── file_filters.py      # File type detection and filtering
│   ├── file_scanner.py      # Recursive document discovery
│   ├── metadata_store.py    # SQLite metadata database
│   ├── models.py            # Protocol definitions (PDFPage, OCRProcessorProtocol, etc.)
│   ├── pipeline.py          # Main ingestion orchestration
│   ├── stale_cleaner.py     # Remove outdated chunks
│   └── text_processing.py   # Text extraction (PDF, HTML, OCR)
│
├── rag/                     # Query Pipeline (5 modules)
│   ├── __init__.py          # Package exports
│   ├── configuration.py     # RAGConfig Pydantic model
│   ├── embeddings.py        # Embedding model initialization
│   ├── llm.py               # Ollama LLM client
│   ├── loader.py            # FAISS/BM25 index loading
│   └── retrieval.py         # Multi-stage retrieval (hybrid, rerank)
│
├── utils/                   # RAG Utilities (3 modules)
│   ├── __init__.py          # Package exports
│   ├── bm25_index.py        # BM25 keyword search
│   ├── query_expander.py    # Multi-query expansion via LLM
│   └── reranker.py          # Cross-encoder reranking
│
├── classes/                 # Core Processing Classes
│   ├── ProcessingConfig.py  # Ingestion configuration dataclass
│   └── OCRProcessor.py      # PaddleOCR wrapper
│
├── typings/                 # Type Stubs for Third-Party Libraries
│   ├── faiss/               # FAISS C++ bindings
│   ├── fitz/                # PyMuPDF (fitz)
│   ├── paddleocr/           # PaddleOCR
│   ├── rank_bm25/           # BM25 library
│   ├── sklearn/             # scikit-learn
│   ├── sqlite_utils/        # SQLite utilities
│   ├── lxml/                # XML/HTML parser
│   └── umap/                # UMAP dimensionality reduction
│
├── tests/                   # Pytest Test Suite
│   ├── conftest.py          # Shared fixtures (temp dirs, mocks)
│   ├── test_config.py       # Configuration validation tests
│   ├── test_environment.py  # Environment setup tests
│   ├── test_faiss_manager.py # FAISS indexing tests
│   ├── test_file_filters.py # File type detection tests
│   ├── test_file_scanner.py # Document discovery tests
│   └── test_text_processing.py # Text extraction tests
│
├── web/                     # Web Interface (TypeScript)
│   ├── index.html           # Main UI page
│   ├── style.css            # Responsive styling
│   ├── script.ts            # TypeScript source (type-safe API client)
│   └── tsconfig.json        # TypeScript configuration (compile with dev.sh)
│
├── pyrightconfig.json       # Pyright strict type checking config
└── uv.lock                  # Dependency lock file
```
Managed via pyproject.toml. Requires Python 3.13+.
Core ML/AI:
- torch (2.9+): Embedding model backend with CUDA support
- sentence-transformers: Dense embeddings and cross-encoder reranking
- transformers: HuggingFace model infrastructure
- faiss-cpu (1.12+): High-performance vector search with IVF indexing
- rank-bm25: BM25 keyword search for hybrid retrieval
Document Processing:
- paddleocr: OCR for images and scanned documents
- paddlepaddle (3.2+): PaddleOCR backend
- pymupdf: PDF text extraction
- beautifulsoup4: HTML parsing
- langchain-text-splitters: Semantic chunking with sentence boundaries
- pillow: Image processing
Data & Infrastructure:
- fastapi: Web API and UI server
- uvicorn: ASGI server with WebSockets
- sqlite-utils: Metadata database management
- pydantic: Data validation and settings management
- numpy: Numerical operations
Utilities:
- scikit-learn: ML utilities (used by reranker)
- umap-learn: Dimensionality reduction (visualization)
- psutil: System resource monitoring
- requests: HTTP client
Development Tools:
- pyright: Strict static type checking
- ruff: Fast Python linter and formatter
- pytest: Testing framework
Installation:
```shell
# Recommended: use uv for fast, reliable dependency management
uv sync

# Alternative: traditional pip installation
pip install -e .

# Development dependencies
uv sync --dev
```

pyproject.toml specifies minimum versions for all dependencies; uv.lock pins exact versions.
GitHub Actions runs pyright --strict, ruff, and pytest on every push and PR.
Contributions are welcome.
Development Setup:
```shell
git clone https://github.com/psarno/PyRagix.git
cd PyRagix
uv sync
```

Rules:
- All code must pass `pyright --strict` with zero errors
- No `# type: ignore` - use stubs or `cast()` instead
- No `Any` types except for legitimate sentinel values and validators
- Modern syntax: `X | None`, `list[T]`, `dict[K, V]` (not `Optional`, `List`, `Dict`)
- Pydantic v2 for data models, Protocols for third-party library interfaces
- Tests for new features using fixtures from `tests/conftest.py`
Workflow:
```shell
# Type check (must pass before committing)
uv run pyright

# Run tests
uv run pytest

# Lint and format
uv run ruff check .
uv run ruff format .
```

If you're adding a new third-party library feature, update the type stubs. Look at ingestion/ and rag/ for the existing patterns.
MIT License - see LICENSE for details.
Built on FAISS, Sentence Transformers, Ollama, PaddleOCR, and LangChain.