A production RAG pipeline in ~130 lines of Python: Unstructured parsing → Chonkie SDPM semantic chunking → Anthropic Contextual Retrieval → BGE-m3 dense + Qdrant/bm25 sparse hybrid index → BGE-reranker-v2-m3 rerank → tenant-filtered retrieval. Shipped with a one-file docker-compose (Qdrant + Redis Stack) and pinned dependencies.
Companion article (in Spanish): Context engineering: why your RAG breaks at 50k tokens.
Naive RAG (dense-only, fixed-size chunks, no rerank) plateaus around Recall@5 ≈ 0.6 on real corpora. Every component here closes a specific failure mode:
| Failure mode | Countermeasure here |
|---|---|
| Mid-sentence chunk cuts | chonkie.SDPMChunker(chunk_size=512, threshold=0.75) (Semantic Double-Pass Merging) |
| Chunk lacks document-level grounding | Contextual Retrieval via claude-haiku-4-5 (1–3 framing sentences prepended per chunk) |
| Keyword queries miss with dense-only | Hybrid: BAAI/bge-m3 dense + Qdrant/bm25 sparse, fused with Fusion.RRF (Reciprocal Rank Fusion) |
| Top-k contaminated by near-miss candidates | FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True) rerank over top-40 |
| Cross-tenant leakage | Filter(must=[FieldCondition(key="tenant_id", match=TENANT_ID)]) on every query |
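The chunking countermeasure, for example, is only a few lines with Chonkie. A minimal sketch, assuming Chonkie's SDPMChunker API as pinned in requirements.txt; the input text is a placeholder:

```python
# Sketch: semantic chunking with Chonkie's SDPMChunker (Semantic Double-Pass
# Merging). Values mirror the table above; the input text is a placeholder.
from chonkie import SDPMChunker

chunker = SDPMChunker(
    embedding_model="BAAI/bge-m3",  # same embedding family as the dense index
    chunk_size=512,                 # max tokens per chunk
    threshold=0.75,                 # similarity threshold for merging sentences
)

raw_text = (
    "Returns are accepted within 30 days of purchase. "
    "Refunds are issued to the original payment method within 5 business days."
)
for chunk in chunker.chunk(raw_text):
    print(chunk.token_count, chunk.text[:80])
```

The ablation below shows what each layer buys: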
| Config | Recall@5 | Faithfulness |
|---|---|---|
| Naive dense-only | 0.62 | 0.71 |
| + SDPM chunking | 0.71 | 0.76 |
| + Contextual Retrieval | 0.78 | 0.83 |
| + Hybrid (dense + BM25, RRF) | 0.85 | 0.85 |
| + BGE-reranker-v2-m3 (this stack) | 0.91 | 0.90 |
Numbers come from internal client evaluations — measured with Ragas 0.2.6. Your mileage will vary; re-measure on your own corpus.
| Concern | rag-production-stack | Rolling your own | LangChain off-the-shelf | Managed SaaS |
|---|---|---|---|---|
| Dependency count | 11 pinned | grows unbounded | ~40+ transitive | 0 (but vendor lock-in) |
| Hybrid retrieval | ✅ built in | manual | partial | ✅ |
| Contextual Retrieval | ✅ (Anthropic recipe) | manual | ❌ | |
| Tenant isolation | ✅ Qdrant filter on every query | DIY | DIY | ✅ |
| Reranker | ✅ BGE-v2-m3 fp16 | DIY | plugin | ✅ |
| Self-hostable on one droplet | ✅ | depends | ❌ heavy | ❌ |
| Open-source | ✅ Apache 2.0 | — | ✅ | ❌ |
git clone https://github.com/numoru-ia/rag-production-stack.git
cd rag-production-stack
docker compose up -d # Qdrant v1.12.5 + Redis Stack 7.4.0
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# index a corpus
export ANTHROPIC_API_KEY=...
export TENANT_ID=acme
python -m rag.index /path/to/docs/
# query it
python -m rag.query "¿cuál es el período de devolución?"

flowchart LR
subgraph Ingest["Ingestion (rag/index.py)"]
Docs[(Documents on disk)]
Part[unstructured.partition.auto]
Chunk[chonkie.SDPMChunker<br/>BAAI/bge-m3 · size=512 · t=0.75]
Ctx[Contextual Retrieval<br/>claude-haiku-4-5 · max_tokens=200]
Dense[fastembed TextEmbedding<br/>BAAI/bge-m3 · size 1024]
Sparse[fastembed SparseTextEmbedding<br/>Qdrant/bm25]
Upsert[qdrant_client.upsert<br/>batch 128]
end
Docs --> Part --> Chunk --> Ctx
Ctx --> Dense & Sparse --> Upsert
Upsert --> Q[(Qdrant v1.12.5<br/>collection "kb"<br/>dense + bm25 named vectors<br/>payload: text, context, source, tenant_id)]
subgraph Query["Query (rag/query.py)"]
Q2[Query string] --> D2[bge-m3 dense]
Q2 --> S2[Qdrant/bm25 sparse]
D2 & S2 --> Hybrid[query_points<br/>Prefetch dense 40 + bm25 40<br/>FusionQuery RRF<br/>tenant filter]
Hybrid --> Rank[FlagReranker<br/>bge-reranker-v2-m3 fp16]
Rank --> TopK[top_k=6]
end
Q --> Hybrid
style Q fill:#fee2e2,stroke:#dc2626
style TopK fill:#dcfce7,stroke:#16a34a
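In qdrant-client terms, the embed-and-upsert end of that pipeline looks roughly like the sketch below. The named vectors ("dense" + "bm25") and payload keys mirror the diagram; the hard-coded chunk, context and source path are placeholders, not the literal contents of rag/index.py:

```python
# Sketch of the embed + upsert stage from the diagram: one collection with
# named dense ("dense") + sparse ("bm25") vectors and tenant_id in the payload.
import os

from fastembed import SparseTextEmbedding, TextEmbedding
from qdrant_client import QdrantClient, models

qc = QdrantClient(url=os.getenv("QDRANT_URL", "http://localhost:6333"))
collection = os.getenv("QDRANT_COLLECTION", "kb")

if not qc.collection_exists(collection):
    qc.create_collection(
        collection_name=collection,
        vectors_config={"dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
        sparse_vectors_config={"bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)},
    )

dense_model = TextEmbedding("BAAI/bge-m3")
sparse_model = SparseTextEmbedding("Qdrant/bm25")

# A contextualized chunk: the framing sentences are prepended before embedding.
chunk_text = "Refunds are issued within 30 days of purchase."
context = "This chunk is from the ACME returns policy, section 'Refund window'."
embed_input = f"{context}\n\n{chunk_text}"

dense_vec = next(iter(dense_model.embed([embed_input])))
sparse_vec = next(iter(sparse_model.embed([embed_input])))  # has .indices / .values

qc.upsert(
    collection_name=collection,
    points=[
        models.PointStruct(
            id=0,
            vector={
                "dense": dense_vec.tolist(),
                "bm25": models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist(),
                ),
            },
            payload={
                "text": chunk_text,
                "context": context,
                "source": "docs/returns.md",
                "tenant_id": os.getenv("TENANT_ID", "default"),
            },
        )
    ],
)
```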
stateDiagram-v2
[*] --> Unseen: file on disk
Unseen --> Parsed: unstructured.partition.auto
Parsed --> Chunked: SDPMChunker.chunk(text)
Chunked --> Contextualized: contextualize() via claude-haiku-4-5
Contextualized --> Embedded: bge-m3 dense + Qdrant/bm25 sparse
Embedded --> Indexed: qc.upsert (batch=128)
Indexed --> Queryable
Queryable --> Stale: source mtime changes (manual re-run)
Stale --> Parsed
Queryable --> Deleted: operator removes from source dir
Deleted --> [*]
sequenceDiagram
autonumber
participant User as User / app
participant Q as rag.query.run(query, top_k=6)
participant FE as fastembed (bge-m3, Qdrant/bm25)
participant Qd as Qdrant (collection "kb")
participant RR as FlagReranker<br/>bge-reranker-v2-m3 fp16
User->>Q: run("¿cuál es el período de devolución?")
Q->>FE: dense + sparse embedding of query
FE-->>Q: dense vec (1024), sparse vec
Q->>Qd: query_points<br/>Prefetch(dense, 40) + Prefetch(bm25, 40)<br/>FusionQuery(Fusion.RRF)<br/>Filter(tenant_id == TENANT_ID)<br/>limit=40
Qd-->>Q: 40 candidate points
Q->>RR: compute_score([[query, text]] * 40, normalize=True)
RR-->>Q: scores
Q-->>User: top-6 {text, score, source}
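The same sequence expressed with qdrant-client and FlagEmbedding; a sketch that mirrors the diagram rather than a verbatim copy of rag/query.py (variable names and payload keys are assumptions):

```python
# Sketch of the query path: hybrid prefetch (40 per leg) fused with RRF,
# tenant filter on both legs, then cross-encoder rerank down to top 6.
import os

from fastembed import SparseTextEmbedding, TextEmbedding
from FlagEmbedding import FlagReranker
from qdrant_client import QdrantClient, models

qc = QdrantClient(url=os.getenv("QDRANT_URL", "http://localhost:6333"))
dense_model = TextEmbedding("BAAI/bge-m3")
sparse_model = SparseTextEmbedding("Qdrant/bm25")
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "¿cuál es el período de devolución?"
dense_q = next(iter(dense_model.embed([query]))).tolist()
sparse_q = next(iter(sparse_model.embed([query])))

tenant_filter = models.Filter(must=[
    models.FieldCondition(
        key="tenant_id",
        match=models.MatchValue(value=os.getenv("TENANT_ID", "default")),
    )
])

candidates = qc.query_points(
    collection_name=os.getenv("QDRANT_COLLECTION", "kb"),
    prefetch=[
        models.Prefetch(query=dense_q, using="dense", limit=40, filter=tenant_filter),
        models.Prefetch(
            query=models.SparseVector(
                indices=sparse_q.indices.tolist(), values=sparse_q.values.tolist()
            ),
            using="bm25",
            limit=40,
            filter=tenant_filter,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # reciprocal rank fusion
    limit=40,
    with_payload=True,
).points

# Cross-encoder rerank over the fused candidates, keep the best 6.
pairs = [[query, point.payload["text"]] for point in candidates]
scores = reranker.compute_score(pairs, normalize=True)
top_6 = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:6]
for point, score in top_6:
    print(round(score, 3), point.payload["source"], point.payload["text"][:80])
```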
Runtime knobs are environment variables; model names and other fixed settings are hardcoded in rag/index.py and rag/query.py.
| Env var | Default | Purpose |
|---|---|---|
| QDRANT_URL | http://localhost:6333 | Qdrant endpoint |
| QDRANT_API_KEY | "" | Qdrant API key (required for cloud) |
| QDRANT_COLLECTION | kb | Target collection name |
| TENANT_ID | default | Payload filter applied on both ingest and query |
| ANTHROPIC_API_KEY | — | Required during ingest for Contextual Retrieval |
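A sketch of how these defaults might be consumed; the actual lookups live in rag/index.py and rag/query.py:

```python
# Sketch: env vars with the defaults from the table above.
import os

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "")
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "kb")
TENANT_ID = os.getenv("TENANT_ID", "default")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")  # no default: ingest-only requirement
```

Everything else is fixed in code: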
| Where | Value |
|---|---|
| SDPMChunker | embedding_model=BAAI/bge-m3, chunk_size=512, threshold=0.75 |
| Dense vectors | BAAI/bge-m3, 1024 dims, cosine |
| Sparse vectors | Qdrant/bm25 with Modifier.IDF |
| Contextual LLM | claude-haiku-4-5, max_tokens=200, document truncated to first 8000 chars |
| Hybrid prefetch | limit=40 on each leg |
| Fusion | models.Fusion.RRF |
| Rerank | FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True) |
| Default top_k | 6 |
| Upsert batch | 128 |
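The Contextual Retrieval step is one Messages API call per chunk. A sketch using the settings above; the prompt wording is paraphrased from Anthropic's published recipe, not copied from rag/index.py:

```python
# Sketch: one Claude call per chunk (max_tokens=200, document truncated to its
# first 8000 characters). Prompt wording is an assumption.
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

def contextualize(document: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{document[:8000]}\n</document>\n\n"
        f"<chunk>\n{chunk}\n</chunk>\n\n"
        "Write 1-3 short sentences situating this chunk within the overall "
        "document to improve search retrieval. Answer with only the sentences."
    )
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```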
Ragas 0.2.6 is pinned in requirements.txt. Typical metrics to wire into CI:
| Metric | Why | Minimum threshold we use |
|---|---|---|
| context_recall | Did retrieval find the right chunks? | ≥ 0.80 |
| faithfulness | Did the answer stay grounded? | ≥ 0.85 |
| answer_relevancy | Did the answer address the question? | ≥ 0.80 |
| context_precision | Are the returned chunks actually useful? | ≥ 0.70 |
# evaluation sketch — drop into a CI step
from ragas import evaluate
from ragas.metrics import context_recall, faithfulness, answer_relevancy
scores = evaluate(
dataset, # huggingface Dataset: question, answer, contexts, ground_truth
metrics=[context_recall, faithfulness, answer_relevancy],
)
assert scores["context_recall"] >= 0.80

deepeval==2.3.0 is also pinned for LLM-judge style tests, and langfuse==2.58.1 for tracing.
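A minimal tracing sketch with the pinned langfuse 2.x decorator API; the wrapper function and span name are illustrative:

```python
# Sketch: a Langfuse span around the query entry point. Assumes
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set.
from langfuse.decorators import observe

from rag import query as rag_query

@observe(name="rag.query")
def traced_query(question: str, top_k: int = 6):
    # One trace per call; decorate rag.query internals for separate
    # retrieval / rerank spans.
    return rag_query.run(question, top_k=top_k)

traced_query("¿cuál es el período de devolución?")
```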
- Set TENANT_ID per request. The filter is already applied in rag/query.py; just make sure you rotate the env var per tenant context.
- Cache expensive query + rerank results behind RedisVL (redisvl==0.3.7 is pinned; ship your own wrapper around rag.query.run, as sketched after this list).
- Re-index after document changes rather than patching in place; chunk boundaries are content-dependent.
- Keep the Anthropic ingest budget in mind: Contextual Retrieval is one Claude Haiku call per chunk. Pre-summarize long docs or batch via the Anthropic batch API.
- Pin package versions in requirements.txt. Embedding weights matter; a minor fastembed bump can silently change recall.
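A minimal cache wrapper along those lines, assuming RedisVL's SemanticCache extension and the Redis Stack container from docker-compose on its default port; cache name, threshold and wrapper function are illustrative:

```python
# Sketch: semantic cache in front of rag.query.run using redisvl's SemanticCache.
import json

from redisvl.extensions.llmcache import SemanticCache

from rag import query as rag_query

cache = SemanticCache(
    name="rag_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,  # how close a new query must be to count as a hit
)

def cached_query(question: str, top_k: int = 6):
    if hit := cache.check(prompt=question):
        return json.loads(hit[0]["response"])  # cached top-k payloads
    results = rag_query.run(question, top_k=top_k)
    cache.store(prompt=question, response=json.dumps(results))
    return results
```

On the roadmap: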
- First-class RedisVL semantic cache layer around rag.query.run
- firecrawl-py web ingestion path wired end-to-end (pinned, unused)
- Built-in Langfuse spans on every query + rerank pair
- Shard by tenant with a payload index, auto-create on first write
- Batch-mode Contextual Retrieval via Anthropic Message Batches API
Apache 2.0 — see LICENSE.