RAG Context Pipeline

A minimal Retrieval-Augmented Generation (RAG) pipeline in Python. Loads a PDF, splits it into chunks, embeds each chunk locally, and answers questions about the document by retrieving the most relevant chunks and sending them to an OpenAI model.

Backed by Postgres + pgvector. PyMuPDF4LLM extracts the PDF to Markdown, LangChain splits it and composes the query side as an LCEL chain — but embeddings and vector search stay hand-rolled (see Architecture). Answers cite the page each fact came from. (Charts/figures aren't extracted — see Known limitations.)

How it works

PDF  →  chunks  →  embeddings  →  Postgres/pgvector  →  retrieval  →  OpenAI  →  answer
        └─────── indexing/ ─────┘                       └──── querying/ ─────┘
              (run once)                                  (interactive REPL)

The two halves communicate only through a Postgres table (chunks), backed by the pgvector extension. Indexing writes the chunks and their embeddings once per source document. Querying runs as an interactive REPL and retrieves the top-k chunks for each question with a SQL cosine-distance search.
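As a rough illustration of the cosine-distance search, the sketch below builds the kind of query the retriever might send. The chunks table name comes from this README; the column names (content, page, embedding) and the helper names are assumptions, not the project's actual schema or code. `<=>` is pgvector's cosine-distance operator.

```python
from typing import Dict, Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Render an embedding in pgvector's text input format, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Column names are assumed for illustration; only the table name `chunks`
# is stated in the README. Smaller cosine distance means more similar.
TOP_K_SQL = """
SELECT content, page, embedding <=> %(query)s::vector AS distance
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def build_top_k_query(embedding: Sequence[float], k: int) -> Tuple[str, Dict[str, object]]:
    """Return the SQL text and the named parameters for a top-k search."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return TOP_K_SQL, {"query": to_vector_literal(embedding), "k": k}
```

With a psycopg connection, the pair could be passed along as `conn.execute(sql, params).fetchall()`; the query embedding has to come from the same model that produced the stored vectors.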

The query side runs as a LangChain LCEL chain: retriever | prompt | llm | parser. The retriever (querying/retriever.py) is a thin BaseRetriever around the hand-rolled pgvector SQL and sentence-transformers embedding; it does not use LangChain's PGVector vectorstore, so the chunks table schema stays under our control. The only model call that goes through LangChain is the gpt-5.4-nano request (via ChatOpenAI).
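Between the retriever and the prompt, the retrieved chunks have to be flattened into prompt text without losing their page numbers, since the answer is expected to cite them. The sketch below shows one plausible way to do that; the Chunk class and format_context are hypothetical names, not taken from querying/chain.py.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved chunk and its source page."""
    text: str
    page: int


def format_context(chunks: Iterable[Chunk]) -> str:
    """Join chunks into one context string, tagging each with its page."""
    return "\n\n".join(f"[page {c.page}]\n{c.text.strip()}" for c in chunks)
```

Tagging every chunk with the same `[page N]` marker the answers use gives the model something concrete to copy into its citations.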

Known limitation: charts and figures. PyMuPDF4LLM extracts text, tables, and headings well (the glossary becomes clean Markdown tables), but it treats charts and figures as picture content and discards the data labels embedded in them. The source report is chart-heavy, so some figures' numbers (e.g. the methane-by-sector pie chart) are not in the index. Recovering them would require vision-based extraction. Affected eval questions are flagged with "chart_dependent": true in eval/dataset.jsonl.

Quick start

Prerequisites: Python 3.9+, Docker, and an OpenAI API key.

# 1. Virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 2. Config
echo "OPENAI_API_KEY=sk-..." > .env
echo "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag" >> .env

# 3. Start Postgres + pgvector
docker compose up -d

# 4. Build the index
#    (first run downloads the ~130MB embedding model to ~/.cache/huggingface/)
python indexing/build_index.py

# 5. Ask questions — either the REPL...
python querying/ask.py

# ...or the HTTP API (interactive docs at http://localhost:8000/docs)
uvicorn api.main:app --reload

Query the API:

curl -s localhost:8000/ask -H 'content-type: application/json' \
  -d '{"question": "What is Australia'\''s 2035 emissions target?"}'

To use a different PDF, drop it in data/ and update PDF_PATH in indexing/build_index.py.

HTTP API

The FastAPI service (api/main.py) exposes the same retrieval + answer chain over HTTP. Run it from the project root with uvicorn api.main:app --reload; it listens on http://localhost:8000 (interactive docs at /docs) and needs the same prerequisites as the REPL — index built, Postgres up, .env set.

| Endpoint | Description |
| --- | --- |
| `GET /health` | Liveness check plus a real DB round-trip; returns `{ "status": "ok" }`. |
| `POST /ask` | Body `{ "question": string, "k"?: int }` (`k` defaults to 6, range 1–20). Returns the `question` and an `answer` with inline `[page N]` citations. |

Validation errors (empty question, k out of range) return 422 (no LLM call); upstream DB/OpenAI failures return 502 with the error in detail. Runnable curl examples live in curl-commands.md.
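For callers that prefer Python over curl, the sketch below builds a POST /ask request with the standard library and mirrors the documented validation rules on the client side. The function name is hypothetical, and treating a whitespace-only question as empty is an assumption about how the server behaves.

```python
import json
import urllib.request
from typing import Optional

API_URL = "http://localhost:8000/ask"  # default address from this README


def build_ask_request(question: str, k: Optional[int] = None,
                      url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /ask request."""
    # Assumption: whitespace-only questions are rejected like empty ones.
    if not question.strip():
        raise ValueError("question must not be empty")
    payload = {"question": question}
    if k is not None:
        # The README documents k as an integer in the range 1-20.
        if not 1 <= k <= 20:
            raise ValueError("k must be between 1 and 20")
        payload["k"] = k
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_ask_request("..."))` and decoding the JSON body, which should carry the `question` and `answer` fields described in the table above.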

Stack

| Layer | Choice |
| --- | --- |
| Orchestration (query side) | LangChain LCEL (`langchain-core` + `langchain-openai`) |
| PDF extraction (index side) | `pymupdf4llm` (per-page Markdown) |
| Chunking | `RecursiveCharacterTextSplitter` (`langchain-text-splitters`) |
| Answer generation | OpenAI (`gpt-5.4-nano`) via `ChatOpenAI` |
| Embeddings | `sentence-transformers` + `BAAI/bge-small-en-v1.5` (local, no API key) |
| Vector store | Postgres + `pgvector` (Dockerized), via `psycopg` |
| HTTP API | `fastapi` + `uvicorn`, pooled via `psycopg-pool` |
| Env loading | `python-dotenv` |

Project structure

indexing/
├── load_pdf.py       # PDF → per-page Markdown Documents (PyMuPDF4LLM)
├── chunker.py        # Documents → overlapping chunks (RecursiveCharacterTextSplitter)
├── embed.py          # chunks → numpy array of embeddings
└── build_index.py    # orchestrator → Postgres (chunks table, incl. page)

querying/
├── load_index.py     # Postgres → embedding-model name
├── retriever.py      # query → top-k chunks as LangChain Documents (pgvector SQL)
├── chain.py          # retriever + LLM → LCEL chain (prompt | model | parser)
└── ask.py            # orchestrator → interactive REPL (builds + invokes the chain)

api/
└── main.py           # FastAPI service (POST /ask, GET /health) over the same chain

eval/
├── dataset.jsonl     # ground-truth questions → expected pages
└── run_eval.py       # retrieval eval (hit-rate@k / MRR)

db/
└── init.sql          # CREATE EXTENSION vector (runs on first container start)

docker-compose.yml    # Postgres + pgvector service

data/
└── net-zero-report.pdf    # source document
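The two retrieval metrics named for eval/run_eval.py, hit-rate@k and MRR, can be sketched as follows. This is a minimal illustration, assuming each eval question maps to a set of expected pages and that retrieval yields a ranked list of pages; the function names and data shapes are hypothetical, not the script's actual code.

```python
from typing import List, Sequence, Set, Tuple


def first_hit_rank(retrieved_pages: Sequence[int],
                   expected_pages: Set[int], k: int) -> int:
    """1-based rank of the first expected page in the top k, or 0 if none."""
    for rank, page in enumerate(retrieved_pages[:k], start=1):
        if page in expected_pages:
            return rank
    return 0


def hit_rate_and_mrr(results: List[Tuple[Sequence[int], Set[int]]],
                     k: int) -> Tuple[float, float]:
    """Compute hit-rate@k and mean reciprocal rank over all questions."""
    if not results:
        raise ValueError("results must not be empty")
    ranks = [first_hit_rank(pages, expected, k) for pages, expected in results]
    hit_rate = sum(1 for r in ranks if r) / len(ranks)
    # Questions with no hit in the top k contribute 0 to the MRR sum.
    mrr = sum(1.0 / r for r in ranks if r) / len(ranks)
    return hit_rate, mrr
```

Hit-rate@k only asks whether a correct page appears anywhere in the top k, while MRR also rewards ranking it near the top, so the two can diverge when relevant chunks are retrieved but ranked low.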

Tuning

The values worth experimenting with:

| Constant | Location | Default |
| --- | --- | --- |
| `CHUNK_SIZE` | `indexing/build_index.py` | 1200 characters |
| `CHUNK_OVERLAP` | `indexing/build_index.py` | 150 characters |
| `EMBEDDING_MODEL` | `indexing/build_index.py` | `BAAI/bge-small-en-v1.5` |
| `TOP_K` | `querying/ask.py` | 6 retrieved chunks per question |
| `OPENAI_MODEL` | `querying/ask.py` | `gpt-5.4-nano` |

Changing EMBEDDING_MODEL requires re-running python indexing/build_index.py: the embeddings already stored in Postgres were computed with the old model and are not comparable to query vectors from a new one.
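To get a feel for how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification and not what RecursiveCharacterTextSplitter does (that splitter prefers to break on separators such as paragraph and line boundaries), but the size and overlap arithmetic is similar. The function name is hypothetical.

```python
from typing import List

CHUNK_SIZE = 1200     # characters, default from the table above
CHUNK_OVERLAP = 150   # characters, default from the table above


def sliding_window_chunks(text: str, size: int = CHUNK_SIZE,
                          overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, each new window starts 1050 characters after the previous one, so a 3000-character page yields three chunks. Changing either constant changes the stored chunks, so the index has to be rebuilt afterwards as well.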
