A small, clear reference project that demonstrates a full RAG (Retrieval-Augmented Generation) workflow: ingest PDFs, text files, and raw strings; chunk and embed; store in ChromaDB; and answer questions via an LLM using retrieved context, orchestrated with LangGraph.
RAG combines retrieval (finding relevant pieces of your data) with generation (an LLM producing an answer). Instead of relying only on the model’s training data, you:
- Ingest your content (PDFs, text files, raw text).
- Chunk it into smaller segments and embed each chunk.
- Store embeddings in a vector store (here, ChromaDB).
- On each query, embed the question, retrieve the most relevant chunks, and pass them as context to the LLM.
- The LLM answers using only (or mainly) that context, with citations.
In short: chunking and embeddings let you search by meaning (similarity) and keep context size manageable; ChromaDB stores those embeddings and performs fast similarity search; LangGraph makes the pipeline explicit (state → nodes → edges) so you can see and change each step.
LangGraph models the pipeline as a graph: state (query, retrieved docs, context, answer) flows through nodes (retrieve, build context, generate) and edges (including a conditional branch when no documents are found). That makes the flow easy to follow and extend (e.g. add a guardrail, extra filters, or hybrid retrieval).
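As a rough, library-free illustration of that state → nodes → edges flow (plain Python stand-ins, not the repo's actual LangGraph code):

```python
# Illustrative sketch only: state flows through retrieve -> build_context ->
# generate, with a conditional branch to a guardrail node when nothing is
# retrieved. The real graph lives in src/graph/ and uses LangGraph.
def retrieve(state):
    # A real node would run a Chroma similarity search here.
    state["docs"] = [] if "unknown" in state["query"] else ["RAG = retrieval + generation"]
    return state

def build_context(state):
    state["context"] = "\n".join(state["docs"])
    return state

def generate(state):
    state["answer"] = f"Based on context: {state['context']}"
    return state

def guardrail(state):
    state["answer"] = "I don't have any relevant documents."
    return state

def run(query):
    state = retrieve({"query": query})
    if not state["docs"]:  # conditional edge: no documents found
        return guardrail(state)
    return generate(build_context(state))
```

Each function corresponds to a node; the `if not state["docs"]` check plays the role of the conditional edge mentioned above.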
```
README.md
src/
  config.py      # Env vars, constants (chunk size, k, model names)
  loaders/       # PDF, text file, raw string → LangChain Documents
  chunking/      # Recursive character splitter, chunk_id metadata
  vectorstore/   # Chroma init, upsert, persist, load
  rag/           # Retriever, prompt builder, LLM, citations; schema_retriever, sql_prompting
  graph/         # LangGraph state, nodes, compiled graph
  db/            # Schema extraction (pg_catalog), read-only query execution
  cli.py         # ingest, query, ingest-schema, ask-db, benchmark commands
data/
  sql/           # Optional: football schema (competitions, seasons, teams, etc.)
  pdfs/          # Sample PDFs (add your own)
  texts/         # Sample .txt (e.g. sample.txt)
.env.example
pyproject.toml
tests/
```
- Clone and enter the repo.
- Create a virtual environment and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate   # Windows: .venv\Scripts\activate
  pip install -e ".[dev]"
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Edit .env and set OPENAI_API_KEY=sk-your-key
  ```

  Optional: set `CHROMA_PERSIST_DIR` (default: `./chroma_db`). For the ask-db feature (natural-language to SQL), also set `DATABASE_URL` (e.g. `postgresql://user:pass@localhost:5432/football`).
Index PDFs, text files, and optionally a raw string:
```bash
python -m src.cli ingest --pdf-dir data/pdfs --text-dir data/texts
```

With a raw string:

```bash
python -m src.cli ingest --text-dir data/texts --raw "Your extra content here."
```

- PDFs: one Document per page; metadata `source_type=pdf`, `file_name`, `page`.
- Text files: one Document per file; metadata `source_type=text`, `file_name`.
- Raw: one Document; metadata `source_type=raw`, `name`/`id`.
Content is chunked (recursive character splitter), embedded with OpenAI, and upserted into Chroma with dedupe by content+metadata hash.
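The dedupe-by-hash idea can be pictured as deriving a stable id from each chunk's content plus metadata and using it as the Chroma upsert id (an illustrative sketch; the repo's actual hashing may differ):

```python
# Illustrative sketch (not the repo's exact code): a stable id from chunk
# content + metadata means re-ingesting the same chunk upserts in place
# instead of creating a duplicate entry.
import hashlib
import json

def chunk_id(text: str, metadata: dict) -> str:
    # sort_keys makes the id independent of metadata key order
    payload = text + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = chunk_id("RAG combines retrieval...", {"source_type": "text", "file_name": "sample.txt"})
b = chunk_id("RAG combines retrieval...", {"file_name": "sample.txt", "source_type": "text"})
assert a == b  # same content + metadata -> same id, regardless of key order
```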
Ask a question; the app retrieves relevant chunks and calls the LLM with that context:
```bash
python -m src.cli query "What is RAG?"
```

Example output:

```
Answer: RAG stands for Retrieval-Augmented Generation. It combines a retriever
that finds relevant documents with a language model that generates answers.
The model is given the retrieved context so it can answer using your data...

Citations:
- sample.txt
```
If nothing is retrieved, you get a short “I don’t have any relevant documents” style message (guardrail node).
You can query a PostgreSQL database in plain English: the app retrieves relevant schema (table and column descriptions) from ChromaDB, sends it to the LLM to generate a SELECT query, runs the query read-only, and prints the result. The pipeline uses a separate Chroma collection for schema (so it does not mix with document RAG).
How it works
- Schema ingest (`ingest-schema`): Connect to Postgres, read table and column metadata (including `COMMENT ON TABLE` / `COMMENT ON COLUMN` from `pg_catalog`), build one document per table, embed them, and store in the schema collection (e.g. `schema_football`).
- Ask (`ask-db`): Your question is embedded; the top relevant schema documents are retrieved (using `SCHEMA_RETRIEVAL_K` and optional `SCHEMA_RETRIEVAL_MAX_DISTANCE`). That schema context plus your question go to the LLM, which returns a single `SELECT` (structured output). The app runs only that SELECT and prints the result.
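The per-table documents built by schema ingest might be assembled like this (an illustrative sketch; the function name and exact format here are assumptions, not the repo's actual API):

```python
# Illustrative sketch (not the repo's code): turn table metadata into one
# plain-text document per table, ready for embedding into the schema collection.
def build_table_doc(table, description, columns):
    """columns: list of (name, sql_type, comment) tuples."""
    lines = [f"Table: {table}", f"Description: {description}", "Columns:"]
    for name, sql_type, comment in columns:
        lines.append(f"- {name} ({sql_type}): {comment}")
    return "\n".join(lines)

doc = build_table_doc(
    "competitions",
    "Master list of competitions/tournaments.",
    [("id", "bigint", "Primary key"), ("name", "text", "Competition name")],
)
```

One document per table keeps retrieval coarse enough that the LLM always sees a table's full column list alongside its name and description.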
Prerequisites
- A PostgreSQL database with tables (and optionally comments). The repo includes a sample schema in `sql/football.sql` (competitions, seasons, teams, players, games, appearances).
- In `.env`: `DATABASE_URL=postgresql://user:password@localhost:5432/yourdb` and `OPENAI_API_KEY` (for embeddings and for the LLM when using OpenAI).
1. Ingest the schema (once, or after DDL changes)
```bash
python -m src.cli ingest-schema
```

This pulls the public schema from `DATABASE_URL`, builds one document per table (name, description, columns with types and descriptions), and upserts them into the schema Chroma collection.
2. Ask a question in natural language
```bash
python -m src.cli ask-db "Which team scored the most goals at home?"
```

The CLI prints all steps so you can see what was retrieved, what SQL was generated, and the result:
- Step 1 – Question: Your question.
- Step 2 – Retrieved schema: The table(s) and columns (with similarity distance) that were sent to the LLM.
- Step 3 – Generated SQL: The generated `SELECT` and, if present, a short explanation.
- Step 4 – Result: The result set in tabular form, or "(No rows)" if empty.
Only single SELECT queries are executed; anything else is rejected.
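A minimal guard for that rule could look like the following sketch (illustrative only; the repo's actual validation may be stricter, e.g. also running the statement in a read-only transaction):

```python
# Illustrative guard (not the repo's code): accept only a single SELECT
# statement before it is handed to the database.
def is_safe_select(sql: str) -> bool:
    stripped = sql.strip().rstrip(";")
    if not stripped or ";" in stripped:  # empty, or more than one statement
        return False
    return stripped.split(None, 1)[0].upper() == "SELECT"

assert is_safe_select("SELECT * FROM teams;")
assert not is_safe_select("DROP TABLE teams;")
assert not is_safe_select("SELECT 1; DELETE FROM teams")
```

Note this naive check would also reject legitimate CTEs (`WITH ... SELECT`); a production guard would combine statement parsing with database-level read-only permissions.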
Optional configuration (ask-db only)
In .env you can set:
- `CHROMA_SCHEMA_COLLECTION` — Chroma collection name for schema (default: `schema_football`).
- `SCHEMA_RETRIEVAL_K` — Number of schema documents to retrieve (default: same as `RETRIEVAL_K`).
- `SCHEMA_RETRIEVAL_MAX_DISTANCE` — Max L2 distance for schema similarity (0 = off). Use e.g. `1.0` or `2.0` if you get no results, or too many.
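The distance cutoff behaves like this sketch (illustrative, not the repo's code): retrieved schema documents whose L2 distance exceeds the threshold are dropped, and `0` disables the filter entirely.

```python
# Illustrative sketch of SCHEMA_RETRIEVAL_MAX_DISTANCE filtering:
# keep only hits at or below the cutoff; 0 means no filtering.
def filter_by_distance(results, max_distance: float):
    """results: list of (doc, distance) pairs from a similarity search."""
    if max_distance == 0:
        return results
    return [(doc, d) for doc, d in results if d <= max_distance]

hits = [("competitions", 0.4), ("teams", 1.7), ("players", 2.3)]
assert filter_by_distance(hits, 0) == hits
assert filter_by_distance(hits, 2.0) == [("competitions", 0.4), ("teams", 1.7)]
```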
Example (after loading sql/football.sql and running ingest-schema)
```bash
python -m src.cli ask-db "List all competitions"
```

Example output shape:
```
--- Step 1: Question ---
List all competitions

--- Step 2: Retrieved schema ---
[Table 1] competitions (distance=0.xxxx)
Table: competitions
Description: Master list of competitions/tournaments...
Columns:
- id (bigint): Primary key...
...

--- Step 3: Generated SQL ---
SELECT id, name, country, competition_type FROM competitions;
Explanation: Lists all rows from the competitions table.

--- Step 4: Result ---
id  name                   country   competition_type
--  ---------------------  --------  ----------------
 1  Premier League         England   league
 2  UEFA Champions League            international
```
You can run 15 fixed NLP queries in a single process to compare performance when ChromaDB stays warm (no per-query boot). Useful to see how much time is spent in schema retrieval vs. LLM vs. database.
Run with full output (user query, execution time, and database response per query):
```bash
python -m src.cli benchmark
```

For each of the 15 queries you get: the original user question, total execution time for that query, and the result table from PostgreSQL. At the end, the total execution time for all 15 is printed.
Run with step timings only (--timing-only):
```bash
python -m src.cli benchmark --timing-only
```

With `--timing-only`, the CLI prints only per-step times for each query (no user query text or result table):

- chroma — time for schema retrieval (ChromaDB similarity search)
- LLM — time for the structured LLM call (SQL generation)
- DB — time for executing the generated `SELECT` on PostgreSQL
Example output:
```
Query 1: chroma 0.15s, LLM 1.42s, DB 0.01s
Query 2: chroma 0.08s, LLM 1.38s, DB 0.01s
...
Query 15: chroma 0.07s, LLM 1.45s, DB 0.02s
Total execution time: 45.23s
```
Run `ingest-schema` at least once before using `benchmark`; the same schema collection and `DATABASE_URL` are used.
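The per-step breakdown can be produced with a small timing helper along these lines (a sketch with stand-in steps, not the repo's benchmark code):

```python
# Illustrative sketch: wrap each pipeline step with time.perf_counter to get
# the chroma / LLM / DB breakdown that --timing-only prints.
import time

def timed(step_fn, *args):
    start = time.perf_counter()
    result = step_fn(*args)
    return result, time.perf_counter() - start

# Stand-in steps; the real ones would call Chroma, the LLM, and Postgres.
docs, t_chroma = timed(lambda q: ["schema doc"], "List all competitions")
sql, t_llm = timed(lambda d: "SELECT * FROM competitions;", docs)
rows, t_db = timed(lambda s: [(1, "Premier League")], sql)

print(f"chroma {t_chroma:.2f}s, LLM {t_llm:.2f}s, DB {t_db:.2f}s")
```

`perf_counter` is preferred over `time.time` here because it is monotonic and has higher resolution for short intervals.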
You can run the RAG graph under the LangGraph dev server and inspect node executions in the web UI (LangGraph Studio), and test with multiple queries via the API or script.
- Install dev dependencies (includes LangGraph CLI and SDK):

  ```bash
  python -m venv .venv
  pip install -e ".[dev]"
  ```

- Ingest some documents (so the graph has something to retrieve):

  ```bash
  python -m src.cli ingest --text-dir data/texts
  ```
  Paths like `data/texts` are resolved from the project root (where `langgraph.json` lives), so you can run ingest from any directory. The log will show `action=load_docs source=text count=N files=[...]` so you can confirm every `.txt` file (e.g. `sample.txt`, `dog.txt`) was loaded. After adding or changing files in `data/texts` or `data/pdfs`, run ingest again or the new content won't appear in retrieval.

  Important: The LangGraph server only runs the query graph (retrieve → generate). It does not load or ingest files. ChromaDB is updated only when you run `ingest` from the CLI. If you add or change files in `data/texts` (or `data/pdfs`), run `ingest` again from the repo root; both the CLI and the LangGraph API will then use the same ChromaDB (stored under the project's `chroma_db` directory). If `langgraph dev` doesn't see your data, check the server logs for `[vectorstore.chroma] action=init persist_dir=...` and `collection_doc_count=...` to confirm the path and that the collection has documents; you can set `CHROMA_PERSIST_DIR` in `.env` to an absolute path (e.g. `C:\...\chroma_db`) to force the same DB for both CLI and server.

- Start the LangGraph dev server (from the repo root). Activate the virtual environment first (required so the `langgraph` CLI is on your PATH):

  ```bash
  # Windows (Git Bash or WSL): source .venv/Scripts/activate
  # Windows (CMD):             .venv\Scripts\activate.bat
  # Windows (PowerShell):      .venv\Scripts\Activate.ps1
  source .venv/Scripts/activate   # or use the command for your shell
  langgraph dev
  ```
When ready you'll see:
- API: http://localhost:2024
- Docs: http://localhost:2024/docs
- LangGraph Studio: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
- Open LangGraph Studio from the URL above. Use the rag assistant. Send input as custom state with a `query` field, for example:

  ```json
  {"query": "What is RAG?"}
  ```

  You can run multiple queries; each run will show the retrieve → build_context → generate (or guardrail) nodes in the UI.
- Test with multiple queries from the command line (with `langgraph dev` running):

  ```bash
  python scripts/run_queries_via_api.py
  ```

  Or pass your own questions:

  ```bash
  python scripts/run_queries_via_api.py "What is RAG?" "What is ChromaDB?" "What is LangGraph?"
  ```

  Optional: set `BASE_URL` if your server is not at `http://localhost:2024`.
You can use a local Mistral model instead of OpenAI for the generation step (after ChromaDB retrieval). Ingest still uses OpenAI embeddings unless you change that separately.
- Start Mistral with the provided Compose file (requires Docker and an NVIDIA GPU):

  ```bash
  docker compose -f docker-compose.mistral.yml up -d
  ```

  Wait until the model has finished loading (logs will show when the server is ready). The first run downloads the model into a Docker volume.

- Configure the app in `.env`:

  ```
  LLM_PROVIDER=mistral_local
  LLM_BASE_URL=http://localhost:8000/v1
  LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
  ```

- Query as usual (no OpenAI key needed for the query step):

  ```bash
  python -m src.cli query "What is RAG?"
  ```

  For gated Hugging Face models, set `HF_TOKEN` in your environment or in a `.env` used by Docker (e.g. pass it when running `docker compose`). vLLM is GPU-oriented; on a machine without an NVIDIA GPU, consider alternatives such as Ollama for CPU-friendlier local inference.
After ingesting `data/texts/sample.txt`:

- Query: `What is RAG?` → Answer: Explains RAG; Citations: `sample.txt`.
- Query: `What is ChromaDB?` → Answer: Explains ChromaDB's role; Citations: `sample.txt`.
- Query: `What is LangGraph?` → Answer: Explains orchestration as a graph; Citations: `sample.txt`.
Citations are derived from document metadata (`file_name`, `page`, `source_type`).
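Deriving a citation string from that metadata might look like this sketch (illustrative; the function name is an assumption, not the repo's actual helper):

```python
# Illustrative sketch (not the repo's code): build a human-readable citation
# from a chunk's metadata (file_name, page, source_type).
def format_citation(metadata: dict) -> str:
    name = metadata.get("file_name") or metadata.get("name", "unknown")
    if metadata.get("source_type") == "pdf" and "page" in metadata:
        return f"{name} (page {metadata['page']})"
    return name

assert format_citation({"source_type": "text", "file_name": "sample.txt"}) == "sample.txt"
assert format_citation({"source_type": "pdf", "file_name": "a.pdf", "page": 3}) == "a.pdf (page 3)"
```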
- **OPENAI_API_KEY is not set**
  Required for ingest (embeddings) and for query when `LLM_PROVIDER=openai`. Create a `.env` file with `OPENAI_API_KEY=sk-...` (see `.env.example`) or export it. For query with `LLM_PROVIDER=mistral_local` you do not need an OpenAI key.
- **No documents to ingest**
  Ensure `--pdf-dir` and/or `--text-dir` exist and contain at least one `.pdf` or `.txt`, or use `--raw "..."`.
- **Empty or irrelevant answers**
  Run ingest first; then query. Increase retrieval `k` or chunk size/overlap in `src/config.py` if needed.
- **DATABASE_URL is not set**
  Required for ingest-schema and ask-db. Set it in `.env` (e.g. `postgresql://postgres:password@localhost:5432/football`). Ensure the database exists and the user has permission to read `pg_catalog` and your tables.
- **No schema documents retrieved (ask-db)**
  Run ingest-schema first so the schema collection is populated. If you still see no results, try increasing `SCHEMA_RETRIEVAL_MAX_DISTANCE` (e.g. `2.0`) or set it to `0` to disable distance filtering.
From the repo root:
```bash
pytest
```

Tests check that:
- Ingestion produces non-empty documents (with mocked or env API key).
- Vectorstore persists and reloads (or that the store can be built).
- A query returns an answer and citations when the index is populated.
- Hybrid retrieval: Keep the current retriever interface; add a keyword/BM25 path and merge results before building context.
- More loaders: Add new modules under `src/loaders/` that yield `Document` with the same metadata conventions.
- Graph: Add nodes (e.g. re-ranking, fact-check) and wire them in `src/graph/graph.py` with new edges.
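For the hybrid-retrieval idea, the merge step could be as simple as this sketch (illustrative; hit shapes and function names are assumptions):

```python
# Illustrative sketch (not the repo's code): merge vector-similarity hits with
# keyword/BM25 hits, de-duplicating by chunk id while keeping vector hits first.
def merge_results(vector_hits, keyword_hits):
    """Each hit is a (chunk_id, text) pair; vector hits take priority."""
    seen, merged = set(), []
    for chunk_id, text in list(vector_hits) + list(keyword_hits):
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append((chunk_id, text))
    return merged

vec = [("c1", "RAG combines retrieval..."), ("c2", "Chroma stores embeddings")]
kw = [("c2", "Chroma stores embeddings"), ("c3", "LangGraph models the pipeline")]
assert [cid for cid, _ in merge_results(vec, kw)] == ["c1", "c2", "c3"]
```

A fancier merge (e.g. reciprocal rank fusion) can replace this concatenate-and-dedupe step without changing the retriever interface.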
Use and extend as you like for learning and reference.