A minimal Retrieval-Augmented Generation (RAG) pipeline in Python. Loads a PDF, splits it into chunks, embeds each chunk locally, and answers questions about the document by retrieving the most relevant chunks and sending them to an OpenAI model.
Backed by Postgres + pgvector. PyMuPDF4LLM extracts the PDF to Markdown, LangChain splits it and composes the query side as an LCEL chain — but embeddings and vector search stay hand-rolled (see Architecture). Answers cite the page each fact came from. (Charts/figures aren't extracted — see Known limitations.)
PDF → chunks → embeddings → Postgres/pgvector → retrieval → OpenAI → answer
└─────── indexing/ ─────┘                         └──── querying/ ─────┘
       (run once)                                   (interactive REPL)
The two halves communicate only through a Postgres table (chunks), backed by the pgvector extension. Indexing writes the chunks and their embeddings once per source document; querying runs as an interactive REPL and retrieves the top-k chunks with a SQL cosine-distance search.
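To make the retrieval step concrete, here is a minimal pure-Python sketch of what a cosine-distance top-k search computes. It is not the project's code: the real search runs inside Postgres, where pgvector's <=> operator returns cosine distance (1 minus cosine similarity). The function names, the (text, page, embedding) row shape, and the toy three-dimensional vectors are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical row shape: (chunk text, page number, embedding vector).
Row = Tuple[str, int, Sequence[float]]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumes both vectors are non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: List[Row], k: int) -> List[Row]:
    """Return the k rows closest to the query, nearest first."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[2]))[:k]


# Toy data: placeholder texts and made-up 3-d embeddings.
rows = [
    ("chunk A", 3, [1.0, 0.0, 0.0]),
    ("chunk B", 12, [0.0, 1.0, 0.0]),
    ("chunk C", 4, [0.9, 0.1, 0.0]),
]
nearest = top_k([1.0, 0.0, 0.0], rows, k=2)
print([(text, page) for text, page, _ in nearest])
```

In SQL the same ordering would come from an ORDER BY on the <=> distance with a LIMIT of k; the exact column names in the chunks table are not shown here and would need to be taken from indexing/build_index.py.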
The query side runs as a LangChain LCEL chain: retriever | prompt | llm | parser. The retriever (querying/retriever.py) is a thin BaseRetriever wrapping the same raw pgvector SQL and sentence-transformers embedding; it does not use LangChain's PGVector vectorstore, so the chunks table schema stays under our control. The only model called through LangChain is gpt-5.4-nano (via ChatOpenAI); embeddings are computed directly with sentence-transformers.
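Between the retriever and the prompt, the retrieved chunks have to be flattened into one context string that keeps each chunk's page number, otherwise the model has nothing to cite. The following is a minimal sketch of that step in plain Python, so it runs without LangChain installed. The Chunk dataclass, the format_context name, and the exact "[page N]" prefix layout are assumptions, not the code in querying/chain.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved LangChain Document."""
    text: str
    page: int


def format_context(chunks: List[Chunk]) -> str:
    """Join chunks into one prompt context, each prefixed with its page label."""
    return "\n\n".join(f"[page {c.page}] {c.text.strip()}" for c in chunks)


# Placeholder chunk texts; real chunks would come from the retriever.
chunks = [
    Chunk("First retrieved chunk. ", page=7),
    Chunk("Second retrieved chunk.", page=12),
]
print(format_context(chunks))
```

Keeping the page label next to each chunk, rather than in a separate list, is one simple way to let the model copy the label into its answer.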
Known limitation — charts and figures. PyMuPDF4LLM extracts text, tables, and headings well (the glossary becomes clean Markdown tables), but it discards data labels embedded in charts/figures as picture content. This report is chart-heavy, so some figures' numbers (e.g. the methane-by-sector pie chart) are not in the index. Recovering them would require vision-based extraction. Affected eval questions are flagged with "chart_dependent": true in eval/dataset.jsonl.
Prerequisites: Python 3.9+, Docker, and an OpenAI API key.
# 1. Virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# 2. Config
echo "OPENAI_API_KEY=sk-..." > .env
echo "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag" >> .env
# 3. Start Postgres + pgvector
docker compose up -d
# 4. Build the index
# (first run downloads the ~130MB embedding model to ~/.cache/huggingface/)
python indexing/build_index.py
# 5. Ask questions — either the REPL...
python querying/ask.py
# ...or the HTTP API (interactive docs at http://localhost:8000/docs)
uvicorn api.main:app --reload
Query the API:
curl -s localhost:8000/ask -H 'content-type: application/json' \
  -d '{"question": "What is Australia'\''s 2035 emissions target?"}'
To use a different PDF, drop it in data/ and update PDF_PATH in indexing/build_index.py.
The FastAPI service (api/main.py) exposes the same retrieval + answer chain over HTTP. Run it from the project root with uvicorn api.main:app --reload; it listens on http://localhost:8000 (interactive docs at /docs) and needs the same prerequisites as the REPL — index built, Postgres up, .env set.
| Endpoint | Description |
|---|---|
| GET /health | Liveness check plus a real DB round-trip; returns { "status": "ok" }. |
| POST /ask | Body { "question": string, "k"?: int } (k defaults to 6, range 1–20). Returns the question and an answer with inline [page N] citations. |
Validation errors (empty question, k out of range) return 422 (no LLM call); upstream DB/OpenAI failures return 502 with the error in detail. Runnable curl examples live in curl-commands.md.
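For callers that prefer Python to curl, the request can be sketched with the standard library alone. This is an illustration rather than part of the project: build_ask_request is a hypothetical helper, and its client-side checks only approximate the documented 422 rules (empty question, k outside 1–20). The request is built but not sent, so the snippet runs even when the API is down.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # default uvicorn address from this README


def build_ask_request(question: str, k: int = 6) -> urllib.request.Request:
    """Build (but do not send) a POST /ask request with basic client-side checks."""
    if not question.strip():
        raise ValueError("question must not be empty")
    if not 1 <= k <= 20:
        raise ValueError("k must be between 1 and 20")
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ask_request("What is Australia's 2035 emissions target?", k=4)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (requires the API to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Failing fast on the client avoids a round-trip for inputs the server would reject anyway; the server's own validation remains the source of truth.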
| Layer | Choice |
|---|---|
| Orchestration (query side) | LangChain LCEL — langchain-core + langchain-openai |
| PDF extraction (index side) | pymupdf4llm (per-page Markdown) |
| Chunking | RecursiveCharacterTextSplitter (langchain-text-splitters) |
| Answer generation | OpenAI (gpt-5.4-nano) via ChatOpenAI |
| Embeddings | sentence-transformers + BAAI/bge-small-en-v1.5 (local, no API key) |
| Vector store | Postgres + pgvector (Dockerized), via psycopg |
| HTTP API | fastapi + uvicorn, pooled via psycopg-pool |
| Env loading | python-dotenv |
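To illustrate what "overlapping chunks" means for the chunking row, here is a deliberately simplified sliding-window splitter. It is a swapped-in technique, not what the project runs: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line boundaries, whereas this sketch cuts at fixed character offsets. The sliding_chunks name is hypothetical; the 1200/150 values are the defaults this README lists for CHUNK_SIZE and CHUNK_OVERLAP.

```python
from typing import List

CHUNK_SIZE = 1200     # characters, the README default
CHUNK_OVERLAP = 150   # characters shared between neighbouring chunks


def sliding_chunks(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


page_text = "x" * 3000  # placeholder for one page of extracted Markdown
chunks = sliding_chunks(page_text)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.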
indexing/
├── load_pdf.py # PDF → per-page Markdown Documents (PyMuPDF4LLM)
├── chunker.py # Documents → overlapping chunks (RecursiveCharacterTextSplitter)
├── embed.py # chunks → numpy array of embeddings
└── build_index.py # orchestrator → Postgres (chunks table, incl. page)
querying/
├── load_index.py # Postgres → embedding-model name
├── retriever.py # query → top-k chunks as LangChain Documents (pgvector SQL)
├── chain.py # retriever + LLM → LCEL chain (prompt | model | parser)
└── ask.py # orchestrator → interactive REPL (builds + invokes the chain)
api/
└── main.py # FastAPI service (POST /ask, GET /health) over the same chain
eval/
├── dataset.jsonl # ground-truth questions → expected pages
└── run_eval.py # retrieval eval (hit-rate@k / MRR)
db/
└── init.sql # CREATE EXTENSION vector (runs on first container start)
docker-compose.yml # Postgres + pgvector service
data/
└── net-zero-report.pdf # source document
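The two retrieval metrics named for eval/run_eval.py have standard definitions, sketched below under assumptions: each question is represented by the ranked list of pages its retrieved chunks came from, plus the set of expected pages. The function names and the toy data are hypothetical and do not reflect the actual dataset.jsonl schema or the script's implementation.

```python
from typing import List, Set


def hit_rate_at_k(retrieved: List[List[int]], expected: List[Set[int]], k: int) -> float:
    """Fraction of questions with at least one expected page in the top-k results."""
    hits = sum(
        1
        for pages, gold in zip(retrieved, expected)
        if any(page in gold for page in pages[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: List[List[int]], expected: List[Set[int]]) -> float:
    """Average of 1/rank of the first expected page (0 when none is retrieved)."""
    total = 0.0
    for pages, gold in zip(retrieved, expected):
        for rank, page in enumerate(pages, start=1):
            if page in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Toy data: ranked pages per question, and the pages each question expects.
retrieved = [[4, 9, 2], [7, 1, 3], [5, 6, 8]]
expected = [{4}, {3}, {10}]
print(hit_rate_at_k(retrieved, expected, k=2))
print(mean_reciprocal_rank(retrieved, expected))
```

Hit-rate@k only asks whether a correct page appears at all in the top k, while MRR also rewards ranking it early, so the two can move independently when TOP_K or the chunking parameters change.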
The values worth experimenting with:
| Constant | Location | Default |
|---|---|---|
| CHUNK_SIZE | indexing/build_index.py | 1200 characters |
| CHUNK_OVERLAP | indexing/build_index.py | 150 characters |
| EMBEDDING_MODEL | indexing/build_index.py | BAAI/bge-small-en-v1.5 |
| TOP_K | querying/ask.py | 6 retrieved chunks per question |
| OPENAI_MODEL | querying/ask.py | gpt-5.4-nano |
Changing EMBEDDING_MODEL requires re-running python indexing/build_index.py: the embeddings already stored in the chunks table were computed with the old model and aren't comparable to vectors from a new one.
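Since querying/load_index.py reads the embedding-model name back from Postgres, one way to catch a stale index early is to compare that name with the configured one before answering any question. The guard below is a sketch of the idea only; check_embedding_model is a hypothetical function, and whether the project performs such a check is not stated here.

```python
def check_embedding_model(indexed_model: str, configured_model: str) -> None:
    """Raise if the query-time model differs from the one used to build the index."""
    if indexed_model != configured_model:
        raise RuntimeError(
            f"Index was built with {indexed_model!r} but EMBEDDING_MODEL is "
            f"{configured_model!r}; re-run python indexing/build_index.py."
        )


# Matching names pass silently.
check_embedding_model("BAAI/bge-small-en-v1.5", "BAAI/bge-small-en-v1.5")

# A mismatch is reported instead of silently returning poor matches.
try:
    check_embedding_model("BAAI/bge-small-en-v1.5", "some-other-model")
except RuntimeError as error:
    print(error)
```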