Retrieval-Augmented Generation (RAG) pipeline for ingesting and querying PDF documents using ChromaDB and local language models.
Processes PDF documents, generates local embeddings and stores them in ChromaDB. Supports natural language queries with result reranking via CrossEncoder.
Main flow:
- run_ingestor.py → src/pdf_ingestor.py: ingests PDFs, generates embeddings and stores them in ChromaDB
- prompting-chromadb-reranker.py: queries the vector store with CrossEncoder reranking
- prompting-improved-direct-chromadb-context-collection-works-works.py: simplified query version
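The reranking step in the flow above can be sketched as follows. This is a minimal illustration, not the code in prompting-chromadb-reranker.py: the names `rerank`, `Scorer`, and `overlap_scorer` are hypothetical, and the word-overlap scorer is only a toy stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence

# A scorer takes (query, passage) pairs and returns one relevance score per pair.
Scorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(query: str, passages: Sequence[str], scorer: Scorer,
           top_k: int = 3) -> list[tuple[str, float]]:
    """Re-order retrieved passages by scorer relevance, highest score first."""
    if not passages:
        return []
    pairs = [(query, passage) for passage in passages]
    scores = scorer(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_k]]


def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: fraction of query words that also occur in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(query.lower().split())
        p_words = set(passage.lower().split())
        scores.append(len(q_words & p_words) / max(len(q_words), 1))
    return scores


if __name__ == "__main__":
    # Hypothetical candidates, as if returned by a vector-store query.
    candidates = [
        "embeddings are stored in chromadb",
        "the cat sat",
        "how pdf text is extracted",
    ]
    for passage, score in rerank("how are embeddings stored", candidates,
                                 overlap_scorer, top_k=2):
        print(f"{score:.2f}  {passage}")
```

To use a real cross-encoder, the `predict` method of a sentence-transformers `CrossEncoder` instance can be passed as `scorer`, since it accepts a list of (query, passage) pairs and returns one score per pair. The model name is not stated here; `cross-encoder/ms-marco-MiniLM-L-6-v2` is a common choice but is an assumption.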
- Python 3.10+
- ChromaDB installed locally
# 1. Copy and edit the config file
cp config.example.py config.py
# 2. Edit config.py with your local paths and LM Studio URL
# 3. Install dependencies
pip install -r requirements.txt
# Ingest PDFs
python run_ingestor.py
# Query with reranking
python prompting-chromadb-reranker.py
# Simplified query
python "prompting-improved-direct-chromadb-context-collection-works-works.py"
RAG/
├── src/
│ └── pdf_ingestor.py # Core ingestion engine
├── run_ingestor.py # Ingestion entry point
├── prompting-chromadb-reranker.py # Query with reranking
├── prompting-improved-direct-chromadb-context-collection-...py # Simplified query
├── ingestion-multi-pdf-PyMuPDF-direct Chroma-testqueries-works.py # Alternative ingestion
├── config.example.py # Config template (copy to config.py)
└── requirements.txt
- PDF ingestion with batching and logging
- Local embeddings with sentence-transformers
- Query with CrossEncoder reranking
- User interface
- Support for other document formats
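As a rough illustration of "ingestion with batching and logging", the sketch below splits extracted text into overlapping chunks and hands them to a storage callback in batches. It is not taken from src/pdf_ingestor.py: the function names, chunk size, overlap, and batch size are all assumptions chosen for the example.

```python
import logging
from typing import Callable, Iterator, Sequence

logger = logging.getLogger("ingest_sketch")


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not a tail already covered
    # by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest(chunks: Sequence[str],
           store_batch: Callable[[Sequence[str]], None],
           batch_size: int = 64) -> int:
    """Pass chunks to store_batch one batch at a time and log progress.

    Returns the number of chunks stored.
    """
    stored = 0
    for number, batch in enumerate(batched(chunks, batch_size), start=1):
        store_batch(batch)
        stored += len(batch)
        logger.info("batch %d: stored %d/%d chunks", number, stored, len(chunks))
    return stored


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    demo_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
    # A list's append method stands in for the real embedding + storage step.
    sink: list = []
    ingest(demo_chunks, sink.append, batch_size=2)
```

In a real pipeline, `store_batch` would embed each batch and write it to the vector store, for example with a ChromaDB collection's `add` method. Batching keeps memory use bounded on large PDFs and gives a natural point to log progress.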
- Embeddings generated with sentence-transformers/all-MiniLM-L6-v2
- Vector store persisted in local ChromaDB (path configurable in config.py)
- Local paths and LM Studio URL are configured in config.py (see config.example.py)
- Iterative development scripts are preserved in mabaeyens/RAG-experiments (private)
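For orientation, a config.py along the lines described in the notes above might look like the sketch below. Every variable name and path here is hypothetical; the actual config.example.py in the repository is the reference and may be organized differently. Only the embedding model name comes from this README.

```python
# config.py: hypothetical sketch. The real config.example.py may use
# different names and values.
from pathlib import Path

# Folder scanned for PDFs during ingestion (assumed name and location).
PDF_DIR = Path("./data/pdfs")

# Directory where ChromaDB persists the vector store (assumed name and location).
CHROMA_PATH = Path("./chroma_db")

# Name of the ChromaDB collection (assumed).
COLLECTION_NAME = "pdf_chunks"

# Embedding model named in this README.
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Base URL of the LM Studio local server. Port 1234 is the usual default,
# but confirm the address shown by your own LM Studio instance.
LM_STUDIO_URL = "http://localhost:1234/v1"


def validate_config() -> list[str]:
    """Return a list of readable problems; an empty list means nothing was flagged."""
    problems = []
    if not PDF_DIR.is_dir():
        problems.append(f"PDF_DIR does not exist: {PDF_DIR}")
    if not LM_STUDIO_URL.startswith(("http://", "https://")):
        problems.append(f"LM_STUDIO_URL is not an HTTP(S) URL: {LM_STUDIO_URL}")
    return problems
```

Calling `validate_config()` before ingestion is one way to fail early on a missing folder or a mistyped URL, rather than partway through a long run.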
This project is the result of a strategic collaboration between human design and AI-assisted code generation.
- Architecture & Logic: Fully defined by the author. This includes system structure, business rules, data flow, and implementation strategy.
- Code Generation: The syntactic implementation and line-by-line code writing were performed by Claude Code, following precise and iterative instructions provided by the author.
- Supervision & Refinement: All code was manually reviewed, tested, and adjusted to ensure quality, consistency, and compliance with project standards.
This approach demonstrates the ability to direct advanced AI tools to accelerate development without sacrificing creative control or technical quality.
This project is licensed under the MIT License. You can find the full text in the LICENSE file.
Note on authorship: Although much of the source code was generated by an AI, the creative direction, architecture, and final integration are human work. Usage rights are granted under the terms of the MIT License.
Feel free to fork this project!
- If you find a bug, open an issue.
- If you have an improvement, submit a Pull Request.
- Feel free to use this code in your own projects!