obj809/rag-context-pipeline

RAG Context Pipeline

A minimal Retrieval-Augmented Generation (RAG) pipeline in Python. Loads a PDF, splits it into chunks, embeds each chunk locally, and answers questions about the document by retrieving the most relevant chunks and sending them to an OpenAI model.

Backed by Postgres + pgvector. PyMuPDF4LLM extracts the PDF to Markdown, LangChain splits it and composes the query side as an LCEL chain — but embeddings and vector search stay hand-rolled (see Architecture). Answers cite the page each fact came from. (Charts/figures aren't extracted — see Known limitations.)

How it works

PDF  →  chunks  →  embeddings  →  Postgres/pgvector  →  retrieval  →  OpenAI  →  answer
        └─────── indexing/ ─────┘                       └──── querying/ ─────┘
              (run once)                                  (interactive REPL)

The two halves communicate only through a Postgres table (chunks), backed by the pgvector extension. Indexing writes the chunks and their embeddings once per source document; querying runs as an interactive REPL and retrieves the top-k chunks with a SQL cosine-distance search.
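As a rough illustration of what a cosine-distance top-k search computes, the sketch below ranks rows in plain Python. The SQL string is only a guess at the query shape: the column names (content, page, embedding) are assumptions, not taken from the repository. The `<=>` operator is pgvector's cosine-distance operator.

```python
import math

# Hypothetical query shape; column names are assumptions, not the repo's schema.
# pgvector's <=> operator returns cosine distance (1 - cosine similarity).
TOP_K_SQL = """
SELECT content, page
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, rows, k):
    """Rank (content, page, embedding) rows by distance to query_vec.

    This is the in-memory equivalent of ORDER BY ... LIMIT k above.
    """
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vec, row[2]))
    return [(content, page) for content, page, _ in ranked[:k]]


rows = [
    ("chunk a", 1, [1.0, 0.0]),
    ("chunk b", 2, [0.0, 1.0]),
    ("chunk c", 3, [1.0, 1.0]),
]
nearest = top_k([1.0, 0.0], rows, k=2)
```

In the database the same ordering is done by pgvector, so the Python ranking here is only for intuition.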

The query side runs as a LangChain LCEL chain — retriever | prompt | llm | parser. The retriever (querying/retriever.py) is a thin BaseRetriever wrapping the same raw pgvector SQL and sentence-transformers embedding; it does not use LangChain's PGVector vectorstore, so the chunks table schema stays under our control. Only gpt-5.4-nano is called through LangChain (ChatOpenAI).
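One way to picture the prompt step of that chain is as a function that labels each retrieved chunk with its page before handing the text to the model. The sketch below uses plain Python rather than LangChain; the chunk shape (dicts with page and text keys) and the prompt wording are assumptions for illustration, not the repository's actual prompt.

```python
def format_context(chunks):
    """Join chunks into one context string, each prefixed by its page label.

    chunks: list of dicts with 'page' and 'text' keys (hypothetical shape).
    """
    return "\n\n".join(f"[page {c['page']}]\n{c['text']}" for c in chunks)


def build_prompt(question, chunks):
    """Assemble an illustrative prompt that asks for [page N] citations."""
    return (
        "Answer using only the context below. "
        "Cite the page for each fact as [page N].\n\n"
        f"Context:\n{format_context(chunks)}\n\n"
        f"Question: {question}"
    )


example_chunks = [
    {"page": 3, "text": "First retrieved passage."},
    {"page": 12, "text": "Second retrieved passage."},
]
prompt = build_prompt("What does the report say?", example_chunks)
```

Keeping the page label next to each chunk is what lets the model attach a page to every fact it uses.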

Known limitation — charts and figures. PyMuPDF4LLM extracts text, tables, and headings well (the glossary becomes clean Markdown tables), but it discards data labels embedded in charts/figures as picture content. This report is chart-heavy, so some figures' numbers (e.g. the methane-by-sector pie chart) are not in the index. Recovering them would require vision-based extraction. Affected eval questions are flagged with "chart_dependent": true in eval/dataset.jsonl.
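When scoring retrieval, it can be useful to separate out the chart-dependent questions so they don't mask how well text retrieval works. A minimal sketch, assuming each line of eval/dataset.jsonl is a JSON object; only the chart_dependent key comes from this README, while the other field names in the sample are hypothetical.

```python
import json


def load_questions(lines, include_chart_dependent=False):
    """Parse JSONL lines, optionally dropping records flagged chart_dependent."""
    questions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("chart_dependent") and not include_chart_dependent:
            continue
        questions.append(record)
    return questions


# Sample records; "question" and "expected_pages" are assumed field names.
sample = [
    '{"question": "Q1", "expected_pages": [4]}',
    '{"question": "Q2", "expected_pages": [9], "chart_dependent": true}',
    "",
]
text_only = load_questions(sample)
everything = load_questions(sample, include_chart_dependent=True)
```

With a real file, the lines argument can simply be the open file object.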

Quick start

Prerequisites: Python 3.9+, Docker, and an OpenAI API key.

# 1. Virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 2. Config
echo "OPENAI_API_KEY=sk-..." > .env
echo "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag" >> .env

# 3. Start Postgres + pgvector
docker compose up -d

# 4. Build the index
#    (first run downloads the ~130MB embedding model to ~/.cache/huggingface/)
python indexing/build_index.py

# 5. Ask questions — either the REPL...
python querying/ask.py

# ...or the HTTP API (interactive docs at http://localhost:8000/docs)
uvicorn api.main:app --reload

Query the API:

curl -s localhost:8000/ask -H 'content-type: application/json' \
  -d '{"question": "What is Australia'\''s 2035 emissions target?"}'

To use a different PDF, drop it in data/ and update PDF_PATH in indexing/build_index.py.

HTTP API

The FastAPI service (api/main.py) exposes the same retrieval + answer chain over HTTP. Run it from the project root with uvicorn api.main:app --reload; it listens on http://localhost:8000 (interactive docs at /docs) and needs the same prerequisites as the REPL — index built, Postgres up, .env set.

| Endpoint | Description |
| --- | --- |
| GET /health | Liveness check plus a real DB round-trip; returns { "status": "ok" }. |
| POST /ask | Body { "question": string, "k"?: int } (k defaults to 6, range 1–20). Returns the question and an answer with inline [page N] citations. |

Validation errors (empty question, k out of range) return 422 without making an LLM call; upstream DB/OpenAI failures return 502 with the error message in the response's detail field. Runnable curl examples live in curl-commands.md.
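For callers who prefer Python to curl, a request to POST /ask can be built with the standard library alone. This is a sketch under the assumption that the service is running at the default http://localhost:8000; the helper name build_ask_request is hypothetical.

```python
import json
import urllib.request


def build_ask_request(question, k=None, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /ask request with a JSON body."""
    body = {"question": question}
    if k is not None:
        body["k"] = k  # optional; the server default applies when omitted
    return urllib.request.Request(
        f"{base_url}/ask",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ask_request("What is Australia's 2035 emissions target?", k=4)
```

Passing the request to urllib.request.urlopen sends it; the 422 and 502 responses described above surface as urllib.error.HTTPError, whose body can be read to inspect the error.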

Stack

| Layer | Choice |
| --- | --- |
| Orchestration (query side) | LangChain LCEL (langchain-core + langchain-openai) |
| PDF extraction (index side) | pymupdf4llm (per-page Markdown) |
| Chunking | RecursiveCharacterTextSplitter (langchain-text-splitters) |
| Answer generation | OpenAI (gpt-5.4-nano) via ChatOpenAI |
| Embeddings | sentence-transformers + BAAI/bge-small-en-v1.5 (local, no API key) |
| Vector store | Postgres + pgvector (Dockerized), via psycopg |
| HTTP API | fastapi + uvicorn, pooled via psycopg-pool |
| Env loading | python-dotenv |

Project structure

indexing/
├── load_pdf.py       # PDF → per-page Markdown Documents (PyMuPDF4LLM)
├── chunker.py        # Documents → overlapping chunks (RecursiveCharacterTextSplitter)
├── embed.py          # chunks → numpy array of embeddings
└── build_index.py    # orchestrator → Postgres (chunks table, incl. page)

querying/
├── load_index.py     # Postgres → embedding-model name
├── retriever.py      # query → top-k chunks as LangChain Documents (pgvector SQL)
├── chain.py          # retriever + LLM → LCEL chain (prompt | model | parser)
└── ask.py            # orchestrator → interactive REPL (builds + invokes the chain)

api/
└── main.py           # FastAPI service (POST /ask, GET /health) over the same chain

eval/
├── dataset.jsonl     # ground-truth questions → expected pages
└── run_eval.py       # retrieval eval (hit-rate@k / MRR)

db/
└── init.sql          # CREATE EXTENSION vector (runs on first container start)

docker-compose.yml    # Postgres + pgvector service

data/
└── net-zero-report.pdf    # source document
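The two retrieval metrics named for eval/run_eval.py can be computed as follows. This is a sketch of the standard definitions of hit-rate@k and MRR over page numbers, not the repository's actual script; the input shape (pairs of retrieved pages and expected pages) is an assumption.

```python
def reciprocal_rank(retrieved_pages, expected_pages):
    """Return 1/rank of the first retrieved page that is expected, else 0."""
    for rank, page in enumerate(retrieved_pages, start=1):
        if page in expected_pages:
            return 1.0 / rank
    return 0.0


def evaluate(results, k):
    """Compute hit-rate@k and MRR.

    results: non-empty list of (retrieved_pages, expected_pages) pairs,
    with retrieved_pages ordered from most to least relevant.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = 0
    rr_total = 0.0
    for retrieved, expected in results:
        rr = reciprocal_rank(retrieved[:k], set(expected))
        if rr > 0:
            hits += 1  # at least one expected page appeared in the top k
        rr_total += rr
    return {"hit_rate": hits / len(results), "mrr": rr_total / len(results)}


scores = evaluate([([4, 7, 2], [7]), ([1, 2, 3], [9])], k=3)
```

Hit-rate only asks whether a correct page appears in the top k, while MRR also rewards ranking it earlier.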

Tuning

The values worth experimenting with:

| Constant | Location | Default |
| --- | --- | --- |
| CHUNK_SIZE | indexing/build_index.py | 1200 characters |
| CHUNK_OVERLAP | indexing/build_index.py | 150 characters |
| EMBEDDING_MODEL | indexing/build_index.py | BAAI/bge-small-en-v1.5 |
| TOP_K | querying/ask.py | 6 retrieved chunks per question |
| OPENAI_MODEL | querying/ask.py | gpt-5.4-nano |

Changing EMBEDDING_MODEL requires re-running python indexing/build_index.py, because the embeddings already stored in Postgres were computed with the old model and aren't comparable to vectors from a new one.
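To see how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below uses a fixed sliding window over characters. It is a deliberate simplification: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and lines, so real chunk boundaries will differ. Only the size and overlap arithmetic carries over.

```python
def chunk_text(text, chunk_size=1200, chunk_overlap=150):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


small = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
default_count = len(chunk_text("x" * 3000))
```

With the defaults, each window advances 1050 characters, so a larger overlap means more chunks (and more rows to embed) for the same document.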

About

Minimal Python RAG pipeline over a net-zero policy document using local embeddings. FastAPI | LangChain | Hugging Face
