RAG Pipeline

An end-to-end Retrieval-Augmented Generation (RAG) pipeline that ingests arXiv research papers, stores vector embeddings in PostgreSQL, and exposes a Streamlit GUI for natural-language querying. All details about this work can be seen via: https://tronghien.com/blog/building-a-rag-pipeline

Architecture

[arXiv API]
     ↓
[Ingestion]  → fetch metadata + PDFs → parse text → chunk (512 tokens)
     ↓
[Embedding]  → sentence-transformers (all-MiniLM-L6-v2) → 384-dim vectors
     ↓
[PostgreSQL + pgvector]  → cosine similarity search
     ↓
[Generation] → Anthropic Claude API (claude-sonnet-4-6)
     ↓
[Streamlit GUI]  → chat interface with source citations

GUI Preview

Chat interface showing a query about ingested arXiv papers, with source chunk citations in the sidebar.

Prerequisites

Tool	Version	Notes
Python	3.11+
Docker Desktop	latest	Must be running before any Docker commands
PostgreSQL client	any	For `psql` CLI (comes with PostgreSQL install)
Anthropic API key	—	Required for answer generation

Windows note: Local PostgreSQL 18 occupies port 5432. The Docker container runs on port 5433 to avoid conflicts.

Installation

1. Clone the repository

git clone <repo-url>
cd rag_application

2. Create and activate a virtual environment

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

3. Install Python dependencies

pip install -r requirements.txt

4. Configure environment variables

Copy the example file and fill in your values:

cp .env.example .env

Open .env and set your Anthropic API key:

ANTHROPIC_API_KEY=your_key_here
DATABASE_URL=postgresql://rag_user:rag_user@localhost:5433/rag_db
SUPERUSER_DATABASE_URL=postgresql://postgres:postgres@localhost:5433/rag_db
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHUNK_SIZE=512
CHUNK_OVERLAP=50
RETRIEVAL_TOP_K=5

5. Start the database

Make sure Docker Desktop is running, then:

docker-compose up -d

This starts a PostgreSQL 16 + pgvector container on port 5433.

6. Create the database user

Run once after the container starts:

PGPASSWORD=postgres psql -U postgres -h localhost -p 5433 \
  -c "CREATE DATABASE rag_db;" \
  -c "CREATE USER rag_user WITH PASSWORD 'rag_user';" \
  -c "GRANT ALL PRIVILEGES ON DATABASE rag_db TO rag_user;"

PGPASSWORD=postgres psql -U postgres -h localhost -p 5433 -d rag_db \
  -c "GRANT ALL ON SCHEMA public TO rag_user;"

7. Initialize the database schema

python scripts/setup_db.py

This creates the documents and document_chunks tables, enables the pgvector extension, and creates an HNSW index for fast similarity search.

8. Download the embedding model

Download the model defined in EMBEDDING_MODEL (.env) before running ingestion or the GUI. This avoids mid-run interruptions and caches the model for all future runs.

python scripts/download_model.py

The model (all-MiniLM-L6-v2, ~90 MB) is saved to ~/.cache/huggingface/ and reused automatically. The script retries up to 5 times on network errors.

Usage

Ingest arXiv papers

First run note: The embedding model (all-MiniLM-L6-v2, ~90 MB) is downloaded from HuggingFace automatically on first run and cached at ~/.cache/huggingface/. This may take a few minutes depending on your connection. Subsequent runs load from cache instantly.

# Small test run (recommended first)
python scripts/ingest.py --category cs.AI --max-results 5

# Full ingestion (500 papers — takes several minutes)
python scripts/ingest.py --category cs.AI --max-results 500

Available categories: cs.AI, cs.CL, cs.LG, cs.CV, cs.NE, etc.

Launch the GUI

streamlit run src/gui/app.py

Open your browser at http://localhost:8501 and start asking questions about the ingested papers.

Project Structure

rag_pipeline/
├── docker-compose.yml       # PostgreSQL + pgvector container (port 5433)
├── requirements.txt
├── .env.example
├── src/
│   ├── database/
│   │   ├── models.py        # SQLAlchemy ORM models (Document, DocumentChunk)
│   │   └── session.py       # Engine + get_db() context manager
│   ├── ingestion/
│   │   ├── fetcher.py       # arXiv API client
│   │   ├── parser.py        # PDF download + text extraction
│   │   └── chunker.py       # 512-token chunking with tiktoken
│   ├── embedding/
│   │   └── embedder.py      # sentence-transformers wrapper
│   ├── retrieval/
│   │   └── retriever.py     # pgvector cosine similarity search
│   ├── generation/
│   │   ├── prompts.py       # All prompt templates
│   │   └── generator.py     # Anthropic Claude API integration
│   └── gui/
│       └── app.py           # Streamlit chat interface
├── scripts/
│   ├── download_model.py    # Pre-download embedding model from .env
│   ├── setup_db.py          # One-time DB initialisation
│   └── ingest.py            # Ingestion pipeline CLI
└── tests/
    ├── test_chunker.py
    ├── test_embedder.py
    └── test_retriever.py

Running Tests

pytest tests/

Troubleshooting

`extension "vector" is not available`

Your Python app is connecting to a PostgreSQL instance without pgvector. Check that:

Docker Desktop is running (docker ps)
The container is on port 5433 (docker-compose ps)
DATABASE_URL in .env uses port 5433

`unable to get image ... dockerDesktopLinuxEngine`

Docker Desktop is not running. Open it from the Start menu and wait for the whale icon to show "Docker Desktop is running", then retry.

Port 5432 conflict

Local PostgreSQL occupies port 5432. This project's Docker container is intentionally mapped to 5433. Do not change this unless you stop the local PostgreSQL service first.

`permission denied to create extension "vector"`

CREATE EXTENSION requires superuser privileges. Set SUPERUSER_DATABASE_URL in .env pointing to the postgres superuser, or run manually:

PGPASSWORD=postgres psql -U postgres -h localhost -p 5433 -d rag_db \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"

`FileNotFoundError: 1_Pooling/config.json` during ingestion

The HuggingFace model download was interrupted, leaving a corrupted cache. Clear it and re-run:

# Windows
rm -rf "$env:USERPROFILE\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2"

# macOS / Linux
rm -rf ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2

Then retry python scripts/ingest.py. The model will re-download cleanly.

HuggingFace download keeps retrying / `ConnectionResetError`

This is a transient network issue with the HuggingFace CDN. The download will retry up to 5 times automatically. If it ultimately fails, just re-run the ingest command — the download resumes from where it left off once connectivity improves.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Pipeline

Architecture

GUI Preview

Prerequisites

Installation

1. Clone the repository

2. Create and activate a virtual environment

3. Install Python dependencies

4. Configure environment variables

5. Start the database

6. Create the database user

7. Initialize the database schema

8. Download the embedding model

Usage

Ingest arXiv papers

Launch the GUI

Project Structure

Running Tests

Troubleshooting

`extension "vector" is not available`

`unable to get image ... dockerDesktopLinuxEngine`

Port 5432 conflict

`permission denied to create extension "vector"`

`FileNotFoundError: 1_Pooling/config.json` during ingestion

HuggingFace download keeps retrying / `ConnectionResetError`

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
docs		docs
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

RAG Pipeline

Architecture

GUI Preview

Prerequisites

Installation

1. Clone the repository

2. Create and activate a virtual environment

3. Install Python dependencies

4. Configure environment variables

5. Start the database

6. Create the database user

7. Initialize the database schema

8. Download the embedding model

Usage

Ingest arXiv papers

Launch the GUI

Project Structure

Running Tests

Troubleshooting

extension "vector" is not available

unable to get image ... dockerDesktopLinuxEngine

Port 5432 conflict

permission denied to create extension "vector"

FileNotFoundError: 1_Pooling/config.json during ingestion

HuggingFace download keeps retrying / ConnectionResetError

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`extension "vector" is not available`

`unable to get image ... dockerDesktopLinuxEngine`

`permission denied to create extension "vector"`

`FileNotFoundError: 1_Pooling/config.json` during ingestion

HuggingFace download keeps retrying / `ConnectionResetError`

Packages