An end-to-end Retrieval-Augmented Generation (RAG) pipeline that ingests arXiv research papers, stores vector embeddings in PostgreSQL, and exposes a Streamlit GUI for natural-language querying. For a full write-up of this work, see https://tronghien.com/blog/building-a-rag-pipeline
[arXiv API]
↓
[Ingestion] → fetch metadata + PDFs → parse text → chunk (512 tokens)
↓
[Embedding] → sentence-transformers (all-MiniLM-L6-v2) → 384-dim vectors
↓
[PostgreSQL + pgvector] → cosine similarity search
↓
[Generation] → Anthropic Claude API (claude-sonnet-4-6)
↓
[Streamlit GUI] → chat interface with source citations
Chat interface showing a query about ingested arXiv papers, with source chunk citations in the sidebar.
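The retrieval stage in the diagram ranks stored chunk vectors by cosine similarity to the query embedding. As a rough illustration of that ranking, independent of PostgreSQL, the sketch below scores a few toy 3-dimensional vectors (the pipeline itself uses 384-dimensional ones). The function names `cosine_similarity` and `top_k` are hypothetical and are not taken from the project's retriever.py.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=5):
    """Rank (chunk_id, vector) pairs by similarity to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    toy_chunks = [
        ("chunk-a", [1.0, 0.0, 0.0]),
        ("chunk-b", [0.7, 0.7, 0.0]),
        ("chunk-c", [0.0, 0.0, 1.0]),
    ]
    for chunk_id, score in top_k([1.0, 0.1, 0.0], toy_chunks, k=2):
        print(chunk_id, round(score, 3))
```

In pgvector, the `<=>` operator returns cosine distance (1 minus cosine similarity), so ordering by ascending distance gives the same ranking as ordering by descending similarity here.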
| Tool | Version | Notes |
|---|---|---|
| Python | 3.11+ | |
| Docker Desktop | latest | Must be running before any Docker commands |
| PostgreSQL client | any | For psql CLI (comes with PostgreSQL install) |
| Anthropic API key | — | Required for answer generation |
Windows note: Local PostgreSQL 18 occupies port 5432. The Docker container runs on port 5433 to avoid conflicts.
git clone <repo-url>
cd rag_application
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
pip install -r requirements.txt
Copy the example file and fill in your values:
cp .env.example .env
Open .env and set your Anthropic API key:
ANTHROPIC_API_KEY=your_key_here
DATABASE_URL=postgresql://rag_user:rag_user@localhost:5433/rag_db
SUPERUSER_DATABASE_URL=postgresql://postgres:postgres@localhost:5433/rag_db
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHUNK_SIZE=512
CHUNK_OVERLAP=50
RETRIEVAL_TOP_K=5
Make sure Docker Desktop is running, then:
docker-compose up -d
This starts a PostgreSQL 16 + pgvector container on port 5433.
Run once after the container starts:
PGPASSWORD=postgres psql -U postgres -h localhost -p 5433 \
-c "CREATE DATABASE rag_db;" \
-c "CREATE USER rag_user WITH PASSWORD 'rag_user';" \
-c "GRANT ALL PRIVILEGES ON DATABASE rag_db TO rag_user;"
PGPASSWORD=postgres psql -U postgres -h localhost -p 5433 -d rag_db \
-c "GRANT ALL ON SCHEMA public TO rag_user;"
python scripts/setup_db.py
This creates the documents and document_chunks tables, enables the pgvector extension, and creates an HNSW index for fast similarity search.
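To make the setup step more concrete, here is a minimal sketch of the kind of statements such a script might issue, written as a Python function that only builds the SQL strings. The column names, the index name, and the helper `build_setup_statements` are assumptions for illustration; the actual schema lives in the project's SQLAlchemy models and may differ. Only the table names, the pgvector extension, the 384-dimension vectors, and the HNSW index come from this README.

```python
EMBEDDING_DIM = 384  # dimension produced by all-MiniLM-L6-v2


def build_setup_statements(dim=EMBEDDING_DIM):
    """Return illustrative DDL statements in the order they would need to run."""
    return [
        # The extension must exist before any vector column is declared.
        "CREATE EXTENSION IF NOT EXISTS vector;",
        """CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY,
            arxiv_id TEXT UNIQUE NOT NULL,  -- hypothetical column
            title TEXT NOT NULL             -- hypothetical column
        );""",
        f"""CREATE TABLE IF NOT EXISTS document_chunks (
            id SERIAL PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id),
            content TEXT NOT NULL,          -- hypothetical column
            embedding vector({dim})
        );""",
        # vector_cosine_ops makes the index usable for cosine-distance queries.
        "CREATE INDEX IF NOT EXISTS document_chunks_embedding_hnsw "
        "ON document_chunks USING hnsw (embedding vector_cosine_ops);",
    ]


if __name__ == "__main__":
    for statement in build_setup_statements():
        print(statement)
        print()
```

Creating the extension first matters because the `vector(384)` column type does not exist until pgvector is enabled, which is also why that step needs superuser rights.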
Download the model defined in EMBEDDING_MODEL (.env) before running ingestion or the GUI. This avoids mid-run interruptions and caches the model for all future runs.
python scripts/download_model.py
The model (all-MiniLM-L6-v2, ~90 MB) is saved to ~/.cache/huggingface/ and reused automatically. The script retries up to 5 times on network errors.
First run note: The embedding model (all-MiniLM-L6-v2, ~90 MB) is downloaded from HuggingFace automatically on first run and cached at ~/.cache/huggingface/. This may take a few minutes depending on your connection. Subsequent runs load from cache instantly.
# Small test run (recommended first)
python scripts/ingest.py --category cs.AI --max-results 5
# Full ingestion (500 papers — takes several minutes)
python scripts/ingest.py --category cs.AI --max-results 500
Available categories: cs.AI, cs.CL, cs.LG, cs.CV, cs.NE, etc.
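During ingestion, parsed text is split into chunks governed by CHUNK_SIZE (512) and CHUNK_OVERLAP (50). The sketch below shows one common way to do this, a sliding window in which consecutive chunks share the overlap tokens. It assumes that interpretation of CHUNK_OVERLAP and works on any token list; the project's chunker.py uses tiktoken, which this example deliberately leaves out to stay dependency-free. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer in this sketch.
    words = ("lorem ipsum " * 600).split()
    pieces = chunk_tokens(words)
    print(len(words), "tokens ->", [len(p) for p in pieces])
```

The overlap trades a little storage for context: a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.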
streamlit run src/gui/app.py
Open your browser at http://localhost:8501 and start asking questions about the ingested papers.
rag_pipeline/
├── docker-compose.yml # PostgreSQL + pgvector container (port 5433)
├── requirements.txt
├── .env.example
├── src/
│   ├── database/
│   │   ├── models.py  # SQLAlchemy ORM models (Document, DocumentChunk)
│   │   └── session.py  # Engine + get_db() context manager
│   ├── ingestion/
│   │   ├── fetcher.py  # arXiv API client
│   │   ├── parser.py  # PDF download + text extraction
│   │   └── chunker.py  # 512-token chunking with tiktoken
│   ├── embedding/
│   │   └── embedder.py  # sentence-transformers wrapper
│   ├── retrieval/
│   │   └── retriever.py  # pgvector cosine similarity search
│   ├── generation/
│   │   ├── prompts.py  # All prompt templates
│   │   └── generator.py  # Anthropic Claude API integration
│   └── gui/
│       └── app.py  # Streamlit chat interface
├── scripts/
│   ├── download_model.py  # Pre-download embedding model from .env
│   ├── setup_db.py  # One-time DB initialisation
│   └── ingest.py  # Ingestion pipeline CLI
└── tests/
    ├── test_chunker.py
    ├── test_embedder.py
    └── test_retriever.py
pytest tests/
Your Python app is connecting to a PostgreSQL instance without pgvector. Check that:
- Docker Desktop is running (docker ps)
- The container is on port 5433 (docker-compose ps)
- DATABASE_URL in .env uses port 5433
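The last check in the list above can also be done programmatically. The following sketch uses only the standard library to read the port out of a connection URL; the helper name `database_port` and the fallback URL are illustrative assumptions, and the snippet reads DATABASE_URL from the process environment rather than parsing .env itself.

```python
import os
from urllib.parse import urlsplit

EXPECTED_PORT = 5433  # the port the Docker container is mapped to


def database_port(url, default=5432):
    """Return the port in a database URL, or PostgreSQL's default if absent."""
    port = urlsplit(url).port
    return port if port is not None else default


if __name__ == "__main__":
    # Falls back to the example value from this README when the variable is unset.
    url = os.environ.get(
        "DATABASE_URL",
        "postgresql://rag_user:rag_user@localhost:5433/rag_db",
    )
    port = database_port(url)
    if port == EXPECTED_PORT:
        print(f"OK: DATABASE_URL points at port {port}")
    else:
        print(f"Warning: DATABASE_URL uses port {port}, expected {EXPECTED_PORT}")
```

A URL with no explicit port resolves to 5432 here, which is the case where the app would reach the local PostgreSQL install instead of the container.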
Docker Desktop is not running. Open it from the Start menu and wait for the whale icon to show "Docker Desktop is running", then retry.
Local PostgreSQL occupies port 5432. This project's Docker container is intentionally mapped to 5433. Do not change this unless you stop the local PostgreSQL service first.
CREATE EXTENSION requires superuser privileges. Set SUPERUSER_DATABASE_URL in .env pointing to the postgres superuser, or run manually:
PGPASSWORD=postgres psql -U postgres -h localhost -p 5433 -d rag_db \
-c "CREATE EXTENSION IF NOT EXISTS vector;"
The HuggingFace model download was interrupted, leaving a corrupted cache. Clear it and re-run:
# Windows (PowerShell)
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2"
# macOS / Linux
rm -rf ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2
Then retry python scripts/ingest.py. The model will re-download cleanly.
This is a transient network issue with the HuggingFace CDN. The download will retry up to 5 times automatically. If it ultimately fails, just re-run the ingest command — the download resumes from where it left off once connectivity improves.
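The README does not say how the automatic retry is implemented. As one plausible shape for it, here is a minimal sketch of a bounded retry loop with exponential backoff. The helper name `retry`, the backoff schedule, and the choice of `OSError` as the retryable exception are all assumptions, not the behaviour of download_model.py.

```python
import time


def retry(func, attempts=5, base_delay=1.0, exceptions=(OSError,), sleep=time.sleep):
    """Call func until it succeeds or `attempts` tries have failed.

    Waits base_delay, then 2x, 4x, ... between tries; re-raises the last error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of tries: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_download():
        # Stand-in for a network call: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated network drop")
        return "model ready"

    # ConnectionError is a subclass of OSError, so it is retried.
    print(retry(flaky_download, base_delay=0.01), "after", calls["count"], "tries")
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the requested delays instead of actually waiting.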
