- Project Overview
- Why This Project
- Architecture
- How It Works
- Project Structure
- Setup Instructions
- Document Ingestion
- Chat API
- Health & Readiness
- Web Client
- Further Documentation
- Observability
- Core System Design
- Security
- Security Audit History
- Key Features
- TODO
- Tech Stack
- Contributing
- Changelog
- Fork History
- Summary
This project is a production-grade AI chatbot backend built using:
- 🧠 LangGraph for conversation orchestration (8-node pipeline)
- 🔍 RAG pipeline using ChromaDB with mode-aware score gating, context-aware query rewriting, and groundedness verification
- 💬 Multi-LLM support (14 providers via universal OpenAI-compatible adapter)
- ⚡ FastAPI for backend APIs with auth, rate limiting, and observability
- 🧠 Redis for memory storage, rate limiting, and ingestion registry
- 🔒 Security-hardened: SSRF protection, API key auth, proxy-aware rate limiting, structured logging
It supports:
- Conversational memory with summarization
- Document-based Q&A (RAG) — ingest PDF / TXT / Markdown / DOCX / HTML, from a URL or by uploading a local file (privacy-friendly)
- Four chat modes: strict (knowledge-base-only), open (free interaction), learning (auto-growing KB), learning_review (KB growth gated by human review)
- 14 LLM providers (OpenAI, Anthropic, Google, Groq, Ollama, DeepSeek, Together, Mistral, Fireworks, OpenRouter, vLLM, LM Studio, llama.cpp)
- 7+ embedding models via FastEmbed (ONNX-based, zero CVEs)
- Self-ingestion in learning mode — auto-saves synthesized answers with provenance metadata
- API key authentication on destructive ingest endpoints
- SSRF protection — blocks private IPs and cloud metadata endpoints
- Proxy-aware rate limiting (X-Forwarded-For handling with CIDR trust)
- Structured JSON logging with correlation ID tracing
- Fully local deployment with Ollama (zero cloud API keys)
- Multilingual responses (Arabic / English / European Portuguese)
- Versioned, typed HTTP API (`/api/v1`) with RFC 9457 `problem+json` errors and per-request controls
- SSE token streaming (`/api/v1/chat/stream`) with structured citations (label, doc_id, score, page, snippet)
- Context-aware query rewriting — condenses multi-turn follow-ups into a standalone search query before retrieval
- Groundedness verification — surfaces `meta.grounded`; strict mode refuses answers unsupported by the retrieved chunks
- Persistent feedback (`POST /api/v1/feedback`) feeding a moderator queue and the RAGAS golden set
- Provider resilience — transient 429/5xx retries with backoff + an in-process circuit breaker
- Durable, retryable ingestion (`INGEST_MODE=queue`) with a worker that survives restarts
- Configurable persona / domain / refusal copy — deploy as your own assistant with three env vars
- Optional hybrid retrieval (dense + BM25 via RRF) for lexical recall of acronyms / SKUs / exact phrases
- Guardrails — input prompt-injection blocking + output PII masking / length cap (config-toggleable)
- RAGAS evaluation harness (`eval/`) — offline faithfulness / relevancy / context precision+recall
- Reference web client (`web/`) — Vite + React + TypeScript SPA demonstrating the v1 API, incl. a reviewer panel
- Kubernetes-ready health/ready probes
- Scalable backend design
Most chatbot APIs are stateless and cannot maintain long-term context.
This project solves that by combining:
- Stateful memory (Redis)
- Long-term summarization
- RAG-based knowledge retrieval
- LangGraph orchestration
Making it suitable for real-world SaaS integrations.
┌─────────────────────────────────────────────────────┐
│ User Query │
└────────────────────────┬────────────────────────────┘
│
▼
FastAPI POST /api/chat
│
▼
┌──────────────────────────────┐
│ LangGraph Orchestrator │
│ │
│ 1. load_memory (Redis) │
│ 2. condense_query (LLM) │ ← context-aware rewrite (multi-turn)
│ 3. retrieve_context (Chroma)│ ← mode-aware gate + mmr/hybrid
│ 4. generate_answer (LLM) │ ← mode-specific prompt
│ 5. verify_answer │ ← groundedness gate (strict refuses)
│ 6. self_ingest (Chroma) │ ← learning mode only
│ 7. summarize │
│ 8. store_memory (Redis) │
└──────────────────────────────┘
│
▼
Response to User
- User sends a question
- System loads conversation history from Redis
- Relevant documents are retrieved from ChromaDB (RAG) — behavior depends on `CHAT_MODE`
- LangGraph orchestrates the flow:
  - memory → condense query → retrieval → reasoning → verify groundedness → self-ingest (if learning) → response
- On multi-turn follow-ups, the question is condensed into a standalone search query before retrieval
- LLM generates a final contextual answer — mode-specific prompt controls behavior
- The answer's groundedness is verified against the retrieved chunks — strict mode refuses when unsupported
- In learning mode, synthesized answers are auto-ingested into ChromaDB as new knowledge
- Conversation is updated + summarized for future use
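The flow above can be pictured as a fixed sequence of nodes that each read and update a shared state dictionary. The following is only a minimal sketch using stub nodes: `make_node`, `run_pipeline` and the `trace` key are hypothetical names, the real graph is built with LangGraph in `graph/builder.py`, and whether the real graph skips `self_ingest` or runs it as a no-op outside the learning modes is an assumption here.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]

# Node order taken from the architecture diagram.
PIPELINE: List[str] = [
    "load_memory", "condense_query", "retrieve_context", "generate_answer",
    "verify_answer", "self_ingest", "summarize", "store_memory",
]
LEARNING_MODES = {"learning", "learning_review"}


def make_node(name: str) -> Node:
    """Build a stub node that only records that it ran (no Redis/Chroma/LLM calls)."""
    def node(state: State) -> State:
        state["trace"] = [*state.get("trace", []), name]
        return state
    return node


def run_pipeline(question: str, chat_mode: str) -> State:
    """Run the stub nodes in order over one shared state dict."""
    state: State = {"question": question, "chat_mode": chat_mode, "trace": []}
    for name in PIPELINE:
        # Assumption: self_ingest is skipped entirely outside the learning modes.
        if name == "self_ingest" and chat_mode not in LEARNING_MODES:
            continue
        state = make_node(name)(state)
    return state
```

Calling `run_pipeline("what is the return policy?", "strict")["trace"]` lists seven node names, while the same call with `"learning"` lists all eight.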
chat-bot/
├── controllers/ # Route handler logic (chat, ingest endpoints)
├── middlewares/ # Rate limiting middleware
├── db/ # Redis and ChromaDB clients
├── graph/
│ ├── builder.py # LangGraph pipeline definition (8 nodes + edges)
│ ├── state.py # State TypedDict (chat_mode, best_score, search_query, grounded, …)
│ └── nodes/ # Individual graph nodes
│ ├── load_memory.py # Load conversation history from Redis
│ ├── condense_query.py # Rewrite follow-ups into a standalone search query (#1)
│ ├── retrieve_context.py # Mode-aware score gate + MMR / hybrid retrieval
│ ├── generate_answer.py # Mode-specific prompt → LLM call (resilient)
│ ├── verify_answer.py # Groundedness verification + strict refusal (#2)
│ ├── self_ingest.py # Auto-ingest synthesized answers (learning mode)
│ ├── store_memory.py # Save conversation to Redis
│ └── summarize.py # Conversation summarization
├── ingest/ # Document ingestion pipeline
│ ├── loaders.py # Multi-format loader registry (PDF/TXT/MD/DOCX/HTML)
│ ├── retrieval.py # Hybrid (dense + BM25) retrieval + RRF fusion (Phase 4)
│ ├── queue.py # Durable Redis-backed ingest queue (#4)
│ └── worker.py # `python -m ingest.worker` — durable ingest worker
├── feedback/ # Feedback Redis keys (#3)
├── prompts/
│ ├── answer.py # 3 mode-specific prompt builders (config-driven persona)
│ ├── condense.py # Query-rewrite prompt
│ ├── verify.py # LLM groundedness-judge prompt
│ └── summarize.py # Conversation summarization prompt
├── schemas/
│ ├── chat.py # ChatRequest schema
│ ├── feedback.py # Feedback request/response schemas
│ └── ingest.py # IngestRequest schema
├── services/
│ ├── chat_service.py # Runs the graph / streaming path; returns grounded + self_ingested
│ └── feedback_service.py # Persist feedback + export downvotes to the golden set
├── utils/
│ └── resilience.py # Provider retry/backoff + circuit breaker (#14)
├── eval/ # RAGAS harness + corpus seeder + golden set
├── tests/ # Pytest test suite (300+ tests, 97% coverage)
├── main.py # App entrypoint
├── config.py # Settings (pydantic-settings) — all feature flags
├── pytest.ini # Test configuration
├── requirements.txt
├── requirements-dev.txt # Test dependencies (pytest, fakeredis, responses, fpdf2)
├── docker-compose.yml # Cloud deployment (API + Redis)
└── docker-compose.local.yml # Local deployment (API + Redis + Ollama)
bash ~/Downloads/Miniconda3-*.sh
source ~/miniconda3/bin/activate
Full guide: https://www.anaconda.com/docs/getting-started/miniconda/install/mac-cli-install
conda create -n chat-bot python=3.10
conda activate chat-bot
git clone https://github.com/pt-act/chat-bot.git
cd chat-bot
pip install -r requirements.txt
Copy the example file and fill in your values:
cp .env.example .env
Key variables:
OPENAI_API_KEY=your_openai_key_here # https://platform.openai.com/account/api-keys
LLM_PROVIDER=openai # openai | anthropic | google | groq | ollama | openrouter | together | deepseek | mistral
LLM_MODEL=gpt-4o-mini
# Override base URL for any OpenAI-compatible endpoint:
# LLM_BASE_URL=http://localhost:11434/v1 # Ollama
# LLM_BASE_URL=https://openrouter.ai/api/v1 # OpenRouter
REDIS_HOST=localhost
REDIS_PORT=6379
RETRIEVAL_SCORE_THRESHOLD=0.3 # raise to 0.7 for stricter grounding
CHAT_MODE=strict # strict | open | learning | learning_review — see Chat Modes section
SELF_INGEST_MIN_LENGTH=50 # minimum answer length for auto-ingest in learning modes
GUARDRAILS_ENABLED=true # input prompt-injection blocking + output guards
GUARDRAILS_BLOCK_INJECTION=true # reject likely prompt-injection / jailbreak inputs (400)
GUARDRAILS_MASK_PII=false # mask emails/phone/card-like numbers in output
GUARDRAILS_MAX_ANSWER_CHARS=4000 # hard cap on answer length (0 disables)
See `.env.example` for the full list of options.
The system uses a universal OpenAI-compatible adapter — most modern providers expose an OpenAI-compatible API, so we support them with a single code path.
Native providers:
- `openai` — GPT-4o, GPT-4o-mini, etc.
- `anthropic` — Claude 3.5 Sonnet, Haiku, etc.
- `google` — Gemini models (requires `GOOGLE_API_KEY`)
OpenAI-compatible (use LLM_BASE_URL override):
- `ollama` — Local models (Llama, Mistral, etc.)
- `openrouter` — Route to 100+ models
- `together` — Together AI
- `groq` — Groq (also works natively)
- `deepseek` — DeepSeek models
- `fireworks` — Fireworks AI
- `mistral` — Mistral AI
- `vllm` — vLLM self-hosted
- `lmstudio` — LM Studio local
- `llamacpp` — llama.cpp local
All OpenAI-compatible providers use the same langchain_openai.ChatOpenAI client. Just set LLM_BASE_URL to point to your endpoint. Local providers (Ollama, LM Studio, vLLM) don't need an API key.
Provider aliases: claude → anthropic, gpt / chatgpt → openai, llama → ollama, gemini → google.
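Alias handling like this usually comes down to a small lookup applied before the provider is chosen. A minimal sketch, assuming case-insensitive matching; `ALIASES` and `resolve_provider` are hypothetical names, and only the alias pairs themselves come from the line above:

```python
# Alias -> canonical provider name, as listed above.
ALIASES = {
    "claude": "anthropic",
    "gpt": "openai",
    "chatgpt": "openai",
    "llama": "ollama",
    "gemini": "google",
}


def resolve_provider(name: str) -> str:
    """Map an alias to its canonical provider; unknown names pass through unchanged."""
    key = name.strip().lower()  # assumption: matching ignores case and surrounding spaces
    return ALIASES.get(key, key)
```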
| Provider | Type | Latency | Cost (per 1M tokens) | Best For | API Key |
|---|---|---|---|---|---|
| OpenAI | Cloud API | ~1s | Input $0.15 / Output $0.60 (gpt-4o-mini) | General production use | OPENAI_API_KEY |
| Anthropic | Cloud API | ~1.5s | Input $0.25 / Output $1.25 (claude-3.5-haiku) | Long-context reasoning, safety | ANTHROPIC_API_KEY |
| Google Gemini | Cloud API | ~1s | Free tier: 15 RPM; Paid ~$0.075/1M (gemini-2.0-flash) | Cost-effective, multimodal | GOOGLE_API_KEY |
| Groq | Cloud API | ~0.3s | Free tier available; Paid ~$0.05/1M | Fastest inference, real-time chat | GROQ_API_KEY + LLM_BASE_URL |
| DeepSeek | Cloud API | ~2s | Input $0.14 / Output $0.28 (deepseek-chat) | Budget-friendly, strong coding | OPENAI_API_KEY + LLM_BASE_URL |
| Together | Cloud API | ~1s | Varies by model (~$0.10–$0.80/1M) | Open-source model access | OPENAI_API_KEY + LLM_BASE_URL |
| Mistral | Cloud API | ~1s | Input $0.10 / Output $0.30 (mistral-small) | European data compliance | OPENAI_API_KEY + LLM_BASE_URL |
| Fireworks | Cloud API | ~0.5s | ~$0.20/1M (open-source models) | Fast open-source inference | OPENAI_API_KEY + LLM_BASE_URL |
| OpenRouter | Cloud proxy | Varies | Varies by model + 5% surcharge | Single API for 100+ models | OPENAI_API_KEY + LLM_BASE_URL |
| Ollama | Local | ~2–10s | Free (own hardware) | Full privacy, air-gapped, zero cost | None (local) |
| vLLM | Local | ~1–5s | Free (own hardware) | High-throughput self-hosted | None (local) |
| LM Studio | Local | ~2–10s | Free (own hardware) | Desktop dev/testing | None (local) |
| llama.cpp | Local | ~3–15s | Free (own hardware) | Minimal hardware, CPU-only | None (local) |
When to use local vs cloud: Use local providers (Ollama/vLLM) when data privacy is paramount, for air-gapped deployments, or to avoid API costs. Use cloud providers for production reliability, lower latency, and models that exceed local hardware capacity. Groq is the fastest cloud option; DeepSeek and Gemini Flash are the cheapest.
The default embedding provider is OpenAI (no extra dependencies).
Recommended for local embeddings — FastEmbed (ONNX):
# Already included in requirements.txt
# Set EMBEDDING_PROVIDER=fastembed
# Set EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
FastEmbed uses ONNX Runtime (no torch dependency):
- ~50MB download vs ~2GB for torch-based alternatives
- Zero CVEs — pure Python + ONNX
- Supports any FastEmbed-compatible model — unknown models trigger a warning but still load
Alternative — HuggingFace (torch-based):
pip install langchain-huggingface sentence-transformers transformers numpy
⚠️ `sentence-transformers` and `transformers` pull in `torch`, which has known CVEs on older versions. Only install these if you explicitly need HuggingFace-specific models not available in FastEmbed.
| Model | Provider | Dimensions | Download | Context | Best For |
|---|---|---|---|---|---|
| `text-embedding-3-small` | OpenAI | 1536 | API-only | 8191 | Default, production reliability |
| `text-embedding-3-large` | OpenAI | 3072 | API-only | 8191 | Maximum accuracy, higher cost |
| `BAAI/bge-small-en-v1.5` | FastEmbed | 384 | ~50MB | 512 | Prototyping, small datasets, low memory |
| `BAAI/bge-base-en-v1.5` | FastEmbed | 768 | ~120MB | 512 | Balanced speed/quality (recommended) |
| `BAAI/bge-large-en-v1.5` | FastEmbed | 1024 | ~430MB | 512 | Highest local quality, slower inference |
| `sentence-transformers/all-MiniLM-L6-v2` | FastEmbed | 384 | ~30MB | 256 | Fast semantic search, versatile |
| `sentence-transformers/all-MiniLM-L12-v2` | FastEmbed | 384 | ~60MB | 256 | Slightly better quality than L6 |
| `BAAI/bge-m3` | FastEmbed | 1024 | ~570MB | 8192 | Arabic/English mixed content, multilingual |
| `nomic-ai/nomic-embed-text-v1.5` | FastEmbed | 768 | ~130MB | 8192 | Long documents (>256 tokens) |
| `sentence-transformers/all-MiniLM-L6-v2` | HuggingFace | 384 | ~2GB+ | 256 | Same model, torch-based (AVOID if FastEmbed works) |
Choosing an embedding model: If you need Arabic+English support, use `BAAI/bge-m3`. For most English-only use cases, `BAAI/bge-base-en-v1.5` offers the best balance. For zero-cost local deployment, any FastEmbed model works without API keys. OpenAI embeddings are best when you don't want to manage local inference.
Switching models requires re-ingesting: Embedding models produce different vectors — you must delete existing documents and re-ingest after changing `EMBEDDING_MODEL`.
Production deployments must configure these security settings in .env:
# API Key Authentication (recommended for production)
API_KEY=your-secret-api-key-here # Set to enable auth on ingest endpoints
REQUIRE_AUTH_FOR_INGEST=true # Require API key for POST /api/ingest and GET /api/ingest/docs
# CORS — production should never use "*"
CORS_ORIGINS=["https://your-domain.com"] # Empty list [] disables CORS entirely
# Rate Limiting (behind reverse proxy)
TRUSTED_PROXIES=["10.0.0.0/8", "172.16.0.0/12"] # CIDR ranges of trusted load balancers
ALLOWED_HOSTS=["*"] # SSRF protection: whitelist download hosts or ["*"] to allow all public hosts
Authentication behavior:
- `DELETE /api/ingest/{doc_id}` always requires the `X-API-Key` header when `API_KEY` is set
- Other ingest endpoints require it only when `REQUIRE_AUTH_FOR_INGEST=true`
- When `API_KEY` is empty, auth is skipped (backward-compatible dev mode)
Rate limiting: 60 requests/minute per IP. If running behind Cloudflare/nginx, configure TRUSTED_PROXIES so the real client IP is used instead of the proxy's IP.
LangSmith (optional): Tracing is disabled by default (`LANGSMITH_TRACING=false`). To enable it, set `LANGSMITH_TRACING=true` and provide a valid `LANGSMITH_API_KEY` from smith.langchain.com.
uvicorn main:app --reload
Server runs at http://127.0.0.1:8000
Or run with Docker (includes Redis):
docker-compose up --build
The compose stack runs three services — `api`, a durable ingest `worker` (`python -m ingest.worker`), and `redis` — with `INGEST_MODE=queue` so ingestion survives API restarts (see Durable ingestion). For a single-process setup, set `INGEST_MODE=inline` (the code default) and drop the worker.
For air-gapped, privacy-first, or zero-cost deployment, use the local compose file with Ollama:
# One command — Ollama pulls llama3.2 and nomic-embed-text on first start
docker-compose -f docker-compose.local.yml up --build
This starts:
- Ollama on port 11434 — pulls `llama3.2` for chat and `nomic-embed-text` for embeddings (if you prefer FastEmbed)
- Redis on port 6379 — conversation memory with AOF persistence
- API on port 8000 — configured for fully local operation
All settings are pre-configured in docker-compose.local.yml:
- `LLM_PROVIDER=ollama`, `LLM_BASE_URL=http://ollama:11434/v1`
- `EMBEDDING_PROVIDER=fastembed`, `EMBEDDING_MODEL=BAAI/bge-small-en-v1.5`
- No cloud API keys needed
Hardware requirements: Ollama with llama3.2 needs ~4GB RAM. For larger models (llama3.1-70b), you need ~40GB RAM or a GPU. See Ollama model list for options.
Customizing the model: Edit `LLM_MODEL` and the `ollama pull` command in `docker-compose.local.yml` to use any Ollama-supported model. For multilingual support, use `bge-m3` as the embedding model and a multilingual LLM like `mistral` or `qwen2`.
On /api/v1, ingestion is asynchronous: the request returns 202 Accepted with
status=queued and a Location header, then processes in the background — poll the
status endpoint for progress. (Legacy /api/ingest remains synchronous.)
Supported formats: PDF, TXT, Markdown (.md/.markdown), DOCX, HTML (.html/.htm).
The format is inferred from the file/URL extension and dispatched to the matching loader
(ingest/loaders.py) — adding another format is a one-line entry there.
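Extension-based dispatch of this kind is typically a dictionary from suffix to loader. The sketch below illustrates the idea only: `LOADERS`, `pick_loader` and the stub loaders are hypothetical and do not reflect the actual contents of `ingest/loaders.py`.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict
from urllib.parse import urlparse

Loader = Callable[[str], str]


def _stub_loader(kind: str) -> Loader:
    """Stand-in for a real loader; returns a tag instead of parsing the file."""
    return lambda path: f"{kind}:{path}"


# One entry per supported extension; adding a format is one more line here.
LOADERS: Dict[str, Loader] = {
    ".pdf": _stub_loader("pdf"),
    ".txt": _stub_loader("text"),
    ".md": _stub_loader("markdown"),
    ".markdown": _stub_loader("markdown"),
    ".docx": _stub_loader("docx"),
    ".html": _stub_loader("html"),
    ".htm": _stub_loader("html"),
}


def pick_loader(file_or_url: str) -> Loader:
    """Infer the format from the path's extension, ignoring any URL query string."""
    suffix = PurePosixPath(urlparse(file_or_url).path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported format: {suffix or '(none)'}")
    return LOADERS[suffix]
```

Parsing the URL first keeps a signed link such as `https://your-host/terms.pdf?sig=abc` from being misread as an unknown extension.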
curl -X POST "http://127.0.0.1:8000/api/v1/ingest" \
-H "Content-Type: application/json" \
-d '{"file_name": "terms_conditions", "s3_url": "https://your-host/terms.pdf"}'
# → 202 {"doc_id": "terms_conditions", "status": "queued"} # .pdf/.txt/.md/.docx/.html
For a fully local / on-prem setup, push a file straight from your machine — it never has to be hosted anywhere. Pair with FastEmbed + Ollama for a zero-cloud pipeline.
curl -X POST "http://127.0.0.1:8000/api/v1/ingest/upload" \
-F "file=@/path/to/return_policy.pdf" # or .txt, .md, .docx, .html
# → 202 {"doc_id": "return_policy", "status": "queued"}
# optional explicit doc id (otherwise derived + sanitized from the filename):
curl -X POST "http://127.0.0.1:8000/api/v1/ingest/upload" \
-F "file=@./Q3 Report.docx" -F "file_name=q3_report"
Multipart `file` must be a supported format (validated by extension; PDFs also get a `MAX_FILE_SIZE_MB`. Same async contract as URL ingest: poll `GET /ingest/status/{doc_id}`. The web client exposes this as an Upload doc button.
curl http://127.0.0.1:8000/api/v1/ingest/status/terms_conditions
# → {"doc_id": "...", "status": "done", "added": 12, "total": 12, "version": "..."}
curl "http://127.0.0.1:8000/api/v1/ingest/docs?limit=50&cursor=0"
# → {"total": N, "docs": [...], "next_cursor": "50"}
curl -X DELETE http://127.0.0.1:8000/api/v1/ingest/terms_conditions
Ingest management endpoints honor API-key auth — `DELETE` always requires `X-API-Key`, and the others require it when `REQUIRE_AUTH_FOR_INGEST=true`.
# List entries awaiting review
curl "http://127.0.0.1:8000/api/v1/review/pending?limit=50&cursor=0"
# → {"total": N, "pending": [{"entry_id": "synthesized:…", "question": "…", "answer": "…", ...}], "next_cursor": null}
# Approve → embeds into the synthesized_answers collection (retrievable in learning mode)
curl -X POST http://127.0.0.1:8000/api/v1/review/synthesized:1a2b3c4d5e6f/approve
# Reject → discards without embedding
curl -X POST http://127.0.0.1:8000/api/v1/review/synthesized:1a2b3c4d5e6f/reject
Review endpoints honor the same API-key dependency as ingest (gated by `REQUIRE_AUTH_FOR_INGEST`). The queue is only populated when requests run in `learning_review` mode.
The system supports four interaction modes via CHAT_MODE in .env:
| Mode | Behavior | When no docs match | Self-ingest | Use case |
|---|---|---|---|---|
| strict (default) | Knowledge-base-only | Refuses: "I don't have information..." | No | Legal, medical, regulated domains |
| open | Free interaction | Uses general knowledge, honest about provenance | No | General assistants, brainstorming |
| learning | Free interaction + growing KB | Synthesizes answer, embeds immediately into ChromaDB | Yes (≥50 chars, no docs found) | Knowledge-building, research assistants |
| learning_review | Same as learning, human-gated | Synthesizes answer, queues for review (not embedded) | Queued for approval (≥50 chars, no docs found) | Curated KB growth with a moderator in the loop |
Learning quality gate: Both learning modes only act on a response when (1) no documents matched the question (filling a knowledge gap) and (2) the answer is ≥50 characters. Synthesized entries live in a separate ChromaDB collection (`synthesized_answers`) consulted only in the learning modes — they never pollute `strict`/`open` retrieval.
`learning` vs `learning_review` (two-phase ingest): In `learning`, a passing answer is embedded into `synthesized_answers` immediately. In `learning_review`, it is instead queued in Redis for human review — a moderator lists entries (`GET /api/v1/review/pending`) and approves (embeds into `synthesized_answers`, making it retrievable) or rejects (discards). This keeps unverified model output out of the vector store until a human signs off.
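The quality gate and the two-phase split can be summarized as one small decision function. This is a sketch under stated assumptions: `self_ingest_action` and its return labels are hypothetical, and measuring length on the whitespace-stripped answer is a guess rather than documented behavior.

```python
LEARNING_MODES = {"learning", "learning_review"}


def self_ingest_action(chat_mode: str, docs_matched: bool, answer: str,
                       min_length: int = 50) -> str:
    """Return 'embed', 'queue_for_review', or 'skip' for a generated answer."""
    if chat_mode not in LEARNING_MODES:
        return "skip"  # strict/open never self-ingest
    if docs_matched:
        return "skip"  # no knowledge gap to fill
    if len(answer.strip()) < min_length:
        return "skip"  # too short to be worth keeping (SELF_INGEST_MIN_LENGTH)
    # learning embeds immediately; learning_review waits for a moderator.
    return "embed" if chat_mode == "learning" else "queue_for_review"
```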
CHAT_MODE sets the server default; clients can override it per request (see below).
# In .env
CHAT_MODE=strict # or open, learning, learning_review
SELF_INGEST_MIN_LENGTH=50
The API is versioned. New clients should use /api/v1; the unversioned /api/*
paths still work but are deprecated (responses carry Deprecation/Sunset/Link
headers). Errors everywhere use RFC 9457 application/problem+json.
| | `/api/v1` (recommended) | `/api` (legacy, deprecated) |
|---|---|---|
| Chat response | `{ answer, sources[], meta }` (typed) | `{ status, data, sources[] }` |
| Sources | structured objects (label, doc_id, score, page, snippet) | bare label strings |
| Streaming | `POST /api/v1/chat/stream` (SSE) | — |
| Ingest | 202 + background + poll | synchronous |
Request (q required; the rest are optional per-request overrides):
curl -X POST "http://127.0.0.1:8000/api/v1/chat" \
-H "Content-Type: application/json" -H "X-User-Id: user_123" \
-d '{"q":"what is the return policy?"}'
Response:
{
"answer": "Returns are accepted within 30 days of purchase.",
"sources": [
{"label": "return_policy.pdf", "doc_id": "return_policy", "score": 0.82, "page": 3, "snippet": "Customers may return..."}
],
"meta": {"mode": "strict", "lang": "en", "self_ingested": false, "grounded": "supported", "grounded_score": 0.83, "correlation_id": "…", "model": "gpt-4o-mini"}
}
Streams the answer token-by-token. Same request body. Event sequence:
event: token data: {"delta": "Returns"}
event: token data: {"delta": " are"}
event: sources data: {"sources": [ … ]}
event: done data: {"meta": { … }}
On failure an event: error frame is emitted (no internal detail). Each response
includes a X-Correlation-Id header and rate-limit headers
(X-RateLimit-Limit/Remaining/Reset, plus Retry-After on 429).
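A client consuming this stream has to split the body into frames and react to each event type. The sketch below assumes the standard SSE wire format (`event:` and `data:` on separate lines, frames separated by a blank line) and that every `data:` payload is a JSON object; `parse_sse` and `collect_answer` are hypothetical helper names.

```python
import json
from typing import Dict, Iterator, List, Tuple


def parse_sse(raw: str) -> Iterator[Tuple[str, Dict]]:
    """Yield (event, data) pairs from a raw SSE body with blank-line-separated frames."""
    for frame in raw.replace("\r\n", "\n").split("\n\n"):
        event, data_lines = "message", []
        for line in frame.split("\n"):
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:  # skip empty frames such as the trailing one
            yield event, json.loads("\n".join(data_lines))


def collect_answer(raw: str) -> Tuple[str, List, Dict]:
    """Fold token/sources/done events into (answer, sources, meta)."""
    tokens: List[str] = []
    sources: List = []
    meta: Dict = {}
    for event, data in parse_sse(raw):
        if event == "token":
            tokens.append(data["delta"])
        elif event == "sources":
            sources = data["sources"]
        elif event == "done":
            meta = data["meta"]
        elif event == "error":
            raise RuntimeError("stream reported an error")
    return "".join(tokens), sources, meta
```

In a real client the frames would be read incrementally from the HTTP response rather than from one string, so tokens can be rendered as they arrive.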
`X-User-Id` scopes per-user conversation memory (defaults to `anonymous`). It is validated (`[A-Za-z0-9_.@-]{1,128}`) and namespaced internally.
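The validation pattern quoted above can be applied with a full-string match. A minimal sketch: `normalize_user_id` is a hypothetical helper, and rejecting invalid values with `ValueError` (rather than some other error path) is an assumption.

```python
import re
from typing import Optional

# Pattern copied from the documented X-User-Id rule.
USER_ID_RE = re.compile(r"[A-Za-z0-9_.@-]{1,128}")


def normalize_user_id(header_value: Optional[str]) -> str:
    """Return a validated user id, falling back to 'anonymous' when the header is absent."""
    if header_value is None:
        return "anonymous"
    # fullmatch ensures the whole value fits the pattern, not just a prefix.
    if not USER_ID_RE.fullmatch(header_value):
        raise ValueError("invalid X-User-Id")
    return header_value
```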
Groundedness (`meta.grounded`): when documents back the answer, `verify_answer` reports `supported | partial | unsupported` (+ `grounded_score`). In strict mode an `unsupported` answer is replaced by the "Not in the knowledge base" refusal and its sources cleared (`STRICT_REFUSE_ON_UNGROUNDED`). On the streaming path this applies to the stored answer + `done` meta (tokens already streamed cannot be retracted).
Capture 👍/👎 on an answer (open to end users; the turn's correlation id is captured automatically). Downvotes can be exported into the RAGAS golden set to close the loop.
curl -X POST "http://127.0.0.1:8000/api/v1/feedback" \
-H "Content-Type: application/json" \
-d '{"rating":"down","reason":"cited the wrong policy","question":"return window?"}'
# → 201 {"feedback_id":"a1b2c3d4e5f6","rating":"down","status":"recorded"}
# Operators list feedback (API-key gated, like review/ingest):
curl "http://127.0.0.1:8000/api/v1/feedback?rating=down&limit=50&cursor=0"
curl http://127.0.0.1:8000/health
Returns `ok` or `degraded` based on Redis and ChromaDB connectivity at startup time.
curl http://127.0.0.1:8000/ready
Returns 200 with `{"status": "ready"}` only if both Redis and ChromaDB respond right now. Returns 503 with dependency-specific error details if either is down. Use this for Kubernetes readiness probes or load balancer health checks.
A reference single-page client lives in web/ (Vite + React + TypeScript). It
demonstrates the v1 API end-to-end: streaming chat (SSE), per-request mode/
language selectors, structured citations, RTL/Arabic rendering, accessibility
(aria-live, keyboard send, focus rings), and a backend health badge.
cd web
bun install # or npm install
bun run dev # http://localhost:5173 (proxies /api and /health to :8000)
bun run build # production bundle → web/dist/
See `web/README.md` for details.
- `user_guidelines.md` — end-user / API-consumer guide (modes, languages, citations, errors, rate limits, streaming, worked examples).
- `PTD.md` — Project Technical Document (architecture, data flow, components, storage, security, observability, testing, CI/CD, deployment, design decisions).
- `CONTRIBUTING.md` · `CHANGELOG.md`
Every request is tagged with a correlation ID (X-Correlation-Id) that propagates through all log calls, graph nodes, and service layers. This enables tracing a single request across Redis operations, ChromaDB retrievals, LLM calls, and ingest pipeline steps.
- Correlation ID middleware — injects or preserves `X-Correlation-Id` header, stores in `contextvars.ContextVar` for async-safe propagation
- Request timing middleware — logs method, path, status code, duration (ms), and correlation ID
- Structured JSON logging — set `LOG_FORMAT=json` for Datadog, CloudWatch, or ELK ingestion. Each log line includes `timestamp`, `level`, `correlation_id`, and message fields.
- Log rotation — `logs/app.log` capped at 10 MB with 5 rotated backups
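The combination of a context variable and JSON log lines can be reproduced with the standard library alone. This is a sketch of the technique, not the project's middleware: `bind_correlation_id`, `CorrelationFilter`, `JsonFormatter`, `build_logger` and the logger name are all hypothetical.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional, TextIO

# Holds the current request's id; ContextVar keeps values separate per async task.
correlation_id: contextvars.ContextVar = contextvars.ContextVar("correlation_id", default="-")


def bind_correlation_id(incoming: Optional[str] = None) -> str:
    """Preserve an incoming X-Correlation-Id or generate a new one."""
    cid = incoming or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


class CorrelationFilter(logging.Filter):
    """Copy the current correlation id onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "correlation_id": getattr(record, "correlation_id", "-"),
            "message": record.getMessage(),
        })


def build_logger(stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines with correlation ids to `stream`."""
    logger = logging.getLogger("chatbot.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A middleware would call `bind_correlation_id(request_header_value)` once per request; every later log call in that request then carries the same id without passing it around explicitly.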
LOG_FORMAT=json # "text" (default) or "json" for log aggregators
Redis stores:
- Full conversation history
- Running conversation summary (for long-term context)
- TTL-based expiry (configurable via `REDIS_TTL_SECONDS`)
POST /api/ingest { file_name, s3_url }
│
▼
┌─────────────────────────────────────────────────┐
│ STEP 1 — Download │
│ requests.get(s3_url, stream=True) │
│ • Enforce MAX_FILE_SIZE_MB limit │
│ • Write to temp .pdf file on disk │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ STEP 2 — File-level dedup check │
│ SHA-256(file) → new_file_hash │
│ │
│ Redis: GET ingest_status:{doc_id}.file_hash │
│ ┌─ same hash? ──────────────────────────────┐ │
│ │ return { status: "skipped", │ │
│ │ reason: "file unchanged" } │ │
│ └───────────────────────────────────────────┘ │
│ │
│ Redis: HGET ingest:content_hashes new_hash │
│ ┌─ hash owned by different doc? ────────────┐ │
│ │ return { status: "skipped", │ │
│ │ reason: "duplicate content │ │
│ │ already ingested as '{doc}'" } │ │
│ └───────────────────────────────────────────┘ │
└────────────────────┬────────────────────────────┘
│ (file is new or changed)
▼
┌─────────────────────────────────────────────────┐
│ STEP 3 — Parse & chunk │
│ PyPDFLoader → raw pages │
│ RecursiveCharacterTextSplitter │
│ chunk_size=800, overlap=100 │
│ separators: [\n\n, \n, ., " ", ""] │
│ _clean_text() — collapse extra whitespace │
│ MD5(chunk_text) → chunk_hash per chunk │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ STEP 4 — Chunk-level diff │
│ │
│ old_hashes = Redis SMEMBERS doc_chunks:{id} │
│ new_hashes = set of MD5s from step 3 │
│ │
│ stale = old_hashes − new_hashes │
│ └─► delete those chunk IDs from ChromaDB │
│ │
│ fresh = new_hashes − old_hashes │
│ └─► embed + add only these to ChromaDB │
│ │
│ unchanged = intersection → skip (no API call) │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ STEP 5 — Update Redis registry │
│ DEL doc_chunks:{doc_id} │
│ SADD doc_chunks:{doc_id} ← new_hashes │
│ SADD ingest:doc_ids ← doc_id │
│ HDEL ingest:content_hashes old_file_hash │
│ HSET ingest:content_hashes new_hash → doc_id │
│ HSET ingest_status:{doc_id} status/hash/counts │
└────────────────────┬────────────────────────────┘
│
▼
{ status: "done", added, removed, total }
Redis keys used by the ingest pipeline:
| Key | Type | Purpose |
|---|---|---|
| `ingest_status:{doc_id}` | Hash | Status, file hash, version, chunk counts per document |
| `doc_chunks:{doc_id}` | Set | MD5 hash of every chunk in the current version |
| `ingest:doc_ids` | Set | Global list of all ingested doc IDs |
| `ingest:content_hashes` | Hash | Maps `file_hash → doc_id` — catches same PDF under a different filename |
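The chunk-level diff in step 4 is plain set arithmetic over chunk hashes. A minimal sketch with in-memory sets standing in for the Redis `doc_chunks:{doc_id}` set; `chunk_hash` and `diff_chunks` are hypothetical names.

```python
import hashlib
from typing import Iterable, Set, Tuple


def chunk_hash(text: str) -> str:
    """MD5 of the chunk text, used only as a content fingerprint (not for security)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def diff_chunks(old_hashes: Set[str], new_chunks: Iterable[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (stale, fresh, unchanged) hash sets for a re-ingested document."""
    new_hashes = {chunk_hash(chunk) for chunk in new_chunks}
    stale = old_hashes - new_hashes      # delete these from the vector store
    fresh = new_hashes - old_hashes      # embed and add only these
    unchanged = old_hashes & new_hashes  # skip: no embedding call needed
    return stale, fresh, unchanged
```

Only the `fresh` set triggers embedding calls, which is where re-ingesting a lightly edited document saves cost.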
Retrieval runs in two steps, with behavior depending on CHAT_MODE:
User question
│
▼
Step 1 — Relevance gate
similarity_search k=1 → score < threshold?
│ │
│ ┌─── CHAT_MODE ──────┤
│ │ │
│ │ strict: │ open/learning:
│ │ docs = "" │ docs = best available
│ │ → refusal prompt │ → "use general knowledge"
│ │ │ prompt
│ │ │
│ score ≥ threshold (question is on-topic)
▼
Step 2 — MMR retrieval
max_marginal_relevance_search k=3, fetch_k=10
│
▼
3 diverse, relevant chunks → LLM → grounded answer
Step 1 — Score gate: fetches the single closest chunk and checks its cosine similarity score. Behavior depends on CHAT_MODE:
- Strict mode: If even the best match is below threshold, the question is off-topic and no context is sent to the LLM (refusal prompt).
- Open/learning mode: Below-threshold matches are still provided to the LLM as weak grounding signals. The prompt instructs the LLM to use general knowledge when context is weak, and to be honest about provenance.
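The mode-dependent gate can be expressed as a small pure function. This is an illustrative sketch only: `GateResult` and `relevance_gate` are hypothetical names, and the real logic lives in `graph/nodes/retrieve_context.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GateResult:
    on_topic: bool                                     # best score reached the threshold
    run_mmr: bool                                      # whether step 2 should run
    context: List[str] = field(default_factory=list)   # weak context passed without MMR


def relevance_gate(chat_mode: str, best_score: float, best_chunk: str,
                   threshold: float = 0.3) -> GateResult:
    """Decide what happens after the k=1 similarity probe."""
    if best_score >= threshold:
        # On-topic: step 2 (MMR) fetches the actual context.
        return GateResult(on_topic=True, run_mmr=True)
    if chat_mode == "strict":
        # Off-topic in strict mode: no context, the refusal prompt is used.
        return GateResult(on_topic=False, run_mmr=False)
    # open / learning modes: hand over the best available chunk as a weak signal.
    return GateResult(on_topic=False, run_mmr=False, context=[best_chunk])
```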
Step 2 — MMR: only runs when step 1 passes the threshold. Fetches 10 candidates and picks the 3 that are both relevant AND diverse — avoiding 3 near-identical paragraphs being sent to the LLM.
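To make the relevance-versus-diversity trade-off concrete, here is a generic maximal marginal relevance selection in plain Python. It is a textbook-style sketch, not the vector store's implementation; `mmr_select`, `cosine`, and the `lambda_mult=0.5` default are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query: Sequence[float], candidates: List[Sequence[float]],
               k: int = 3, lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick k candidate indices, balancing relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With query `[1, 1, 0]` and candidates `[[1, 0.9, 0], [1, 0.8, 0], [0.5, 1, 0.5]]`, `mmr_select(..., k=2)` returns `[0, 2]`: the second candidate is more relevant than the third but is nearly identical to the first, so the more distinct third one is preferred.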
Step 3 — Self-ingest (learning modes only): If the retrieval score was below threshold (knowledge gap) and the LLM's answer is ≥50 characters, the answer is captured with source_type=synthesized metadata. In learning mode it is embedded into the synthesized_answers collection immediately; in learning_review mode it is queued for human review instead, and only embedded once a moderator approves it (see Chat Modes and the /api/v1/review/* endpoints). This grows the knowledge base — optionally with a human in the loop.
ChromaDB is configured with cosine distance (hnsw:space: cosine) — the correct metric for text embeddings. Without this, scores are L2-based and can go negative, making the threshold meaningless.
Threshold is configurable via `RETRIEVAL_SCORE_THRESHOLD` in `.env` (default `0.3`). Raise to `0.7` for stricter grounding, lower to `0.2` if too many valid questions are being rejected.
Configurable via environment variables and chat mode:
- `CHAT_MODE=strict` — Knowledge-base-only. Refuses outside topics.
- `CHAT_MODE=open` — Free interaction. Uses general knowledge when no documents match.
- `CHAT_MODE=learning` — Free interaction + auto-ingests synthesized answers into ChromaDB.
- `CHAT_MODE=learning_review` — Like learning, but synthesized answers are queued for human approval (`/api/v1/review/*`) before being embedded.
- Uses conversation summary (long-term memory)
- Uses recent messages (short-term memory)
- Uses retrieved context (RAG)
- Generates final response in the user's language (Arabic / English / European Portuguese)
Language detection (`lang: "auto"`): Arabic is detected by script. A fast, dependency-free heuristic (distinctive Portuguese diacritics + a stopword-frequency ratio) handles the vast majority of inputs; only short, unaccented, genuinely ambiguous fragments fall through to the `lingua` EN/PT statistical model. Selecting `pt` (or `en`/`ar`) explicitly bypasses detection entirely. Portuguese output uses Portugal spelling/vocabulary (pt-PT).
API key authentication via FastAPI dependency injection (middlewares/auth.py):
- `DELETE /api/ingest/{doc_id}` always requires `X-API-Key` when `API_KEY` is set
- Other ingest endpoints require it only when `REQUIRE_AUTH_FOR_INGEST=true`
- Empty `API_KEY` skips auth (backward-compatible dev mode)
curl -X DELETE http://127.0.0.1:8000/api/ingest/doc_id \
-H "X-API-Key: your-secret-api-key"
`utils/security.py` blocks downloads from:
- Private IP ranges (10/8, 172.16/12, 192.168/16, 127/8)
- Link-local addresses (169.254/16, fe80::/10)
- Cloud metadata endpoints (169.254.169.254, metadata.google.internal)
- Loopback (::1)
Controlled via ALLOWED_HOSTS env var. ["*"] allows all public hosts (still blocks private IPs).
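The blocklist above can be approximated with the standard-library `ipaddress` module. The sketch below checks a literal host only; `is_blocked_host` and `BLOCKED_HOSTNAMES` are hypothetical names, and a complete check would also resolve hostnames and test every returned address, which is omitted here.

```python
import ipaddress

BLOCKED_HOSTNAMES = {"metadata.google.internal"}


def is_blocked_host(host: str) -> bool:
    """Return True if the host is a blocked name or a non-public IP literal."""
    name = host.strip().lower().rstrip(".")
    if name in BLOCKED_HOSTNAMES:
        return True
    try:
        ip = ipaddress.ip_address(name.strip("[]"))  # tolerate bracketed IPv6
    except ValueError:
        # Not an IP literal. A real implementation must resolve the name and
        # apply the same test to each resolved address.
        return False
    # Covers private ranges, loopback (127/8, ::1), and link-local
    # (169.254/16, fe80::/10), which includes 169.254.169.254.
    return ip.is_private or ip.is_loopback or ip.is_link_local


for candidate in ("169.254.169.254", "192.168.1.10", "8.8.8.8"):
    print(candidate, is_blocked_host(candidate))
```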
60 requests/minute per IP, Redis-backed. Behind reverse proxies:
- `TRUSTED_PROXIES`: CIDR ranges of trusted load balancers (e.g., `["10.0.0.0/8", "172.16.0.0/12"]`)
- `X-Forwarded-For` and `X-Real-IP` header handling for real client IP extraction
- Fails open on Redis error (availability > strict enforcement)
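One common way to extract the real client IP behind trusted proxies is to honour `X-Forwarded-For` only when the direct peer is trusted, then walk the header from right to left until the first untrusted hop. The sketch below assumes that approach; `resolve_client_ip` is a hypothetical name and the project's limiter may differ in detail.

```python
import ipaddress
from typing import Iterable, Optional


def resolve_client_ip(
    peer_ip: str,
    forwarded_for: Optional[str],
    trusted_proxies: Iterable[str],
) -> str:
    """Pick the address to rate-limit on."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_proxies]

    def is_trusted(ip) -> bool:
        return any(ip in net for net in networks)

    try:
        peer = ipaddress.ip_address(peer_ip)
    except ValueError:
        return peer_ip
    # Ignore the header unless the request really came through a trusted proxy.
    if not forwarded_for or not is_trusted(peer):
        return peer_ip
    hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            hop_ip = ipaddress.ip_address(hop)
        except ValueError:
            return peer_ip  # malformed header: fall back to the peer
        if not is_trusted(hop_ip):
            return hop  # first hop not controlled by our own proxies
    return peer_ip


trusted = ["10.0.0.0/8", "172.16.0.0/12"]
print(resolve_client_ip("10.0.0.5", "198.51.100.7, 10.0.0.2", trusted))
```

Walking from the right means a client cannot spoof its address by prepending fake entries to the header.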
CORS origins default to `[]`; production must explicitly opt in. Never use `["*"]` in production.

    CORS_ORIGINS=["https://your-domain.com"]

A Pydantic `model_validator` ensures required API keys are present for the chosen `LLM_PROVIDER`. It raises `ValueError` at startup instead of failing at runtime with opaque errors.
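The startup check can be pictured as a lookup from provider to required key. The sketch below uses only the standard library rather than Pydantic so it runs anywhere; `REQUIRED_KEYS`, `validate_provider_settings`, and the environment variable names are assumptions, not the project's actual settings model.

```python
from typing import Mapping

# Assumed mapping; the real settings class may use different names.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}


def validate_provider_settings(settings: Mapping[str, str]) -> None:
    """Raise ValueError early if the chosen provider lacks its API key."""
    provider = settings.get("LLM_PROVIDER", "").lower()
    required = REQUIRED_KEYS.get(provider)
    if required and not settings.get(required):
        raise ValueError(f"LLM_PROVIDER={provider!r} requires {required} to be set")


# Local providers such as Ollama need no cloud key in this sketch.
validate_provider_settings({"LLM_PROVIDER": "ollama"})
validate_provider_settings({"LLM_PROVIDER": "openai", "OPENAI_API_KEY": "sk-test"})
```

Inside a Pydantic model, the same logic would sit in a method decorated with `model_validator(mode="after")`.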
| Date | Score | Grade | Scope |
|---|---|---|---|
| 2026-05-28 (initial) | 72/100 | C+ | Identified 3 critical, 4 high, 5 medium, 5 low, 5 informational findings |
| 2026-05-28 (post-elevation) | 95/100 | A+ | All critical/high findings resolved; auth, SSRF, rate limiting, CORS, logging, CI hardened |
Critical findings resolved:
- Dependency hell (numpy/torch incompatibility) → pinned compatible versions
- Deserialization vulnerability (langchain-core CVE-2026-44843) → upgraded to ≥1.3.3
- Permissive CORS (`["*"]`) → default `[]`
- Missing auth on DELETE endpoint → API key dependency
- SSRF in download pipeline → private IP + metadata blocking
- Rate limiter proxy blindness → trusted proxy CIDR support
- Broad exception handling → specific exception classes
- Missing startup validation → Pydantic validators
Full audit reports: audit_artifacts/AUDIT_REPORT.md and audit_artifacts/FINAL_AUDIT.md
- ✅ Conversational memory (short + long-term via Redis)
- ✅ RAG retrieval with mode-aware score gate + MMR diversity ranking
- ✅ Context-aware query rewriting: condenses multi-turn follow-ups into a standalone search query before retrieval (`QUERY_REWRITE_ENABLED`)
- ✅ Groundedness / faithfulness verification: `meta.grounded` (`supported`/`partial`/`unsupported`); strict mode refuses answers unsupported by the retrieved chunks (`GROUNDEDNESS_ENABLED`)
- ✅ Persistent feedback loop: `POST /api/v1/feedback` (👍/👎 + reason) → operator queue + export of downvotes to the RAGAS golden set
- ✅ Provider resilience: retry transient 429/5xx/timeouts with backoff + in-process circuit breaker (`utils/resilience.py`)
- ✅ Durable, retryable ingestion: `INGEST_MODE=queue` with a Redis-backed worker that survives restarts (idempotent, retrying)
- ✅ Configurable persona / domain / refusal copy: rebrand the assistant with `ASSISTANT_NAME` / `KNOWLEDGE_DOMAIN` / `ESCALATION_MESSAGE` (defaults unchanged)
- ✅ Optional hybrid retrieval: dense + BM25 fused via RRF (`RETRIEVAL_STRATEGY=hybrid`) for acronym/SKU/exact-phrase recall
- ✅ Four chat modes: strict (knowledge-base-only), open (general knowledge), learning (growing KB), learning_review (review-gated KB growth)
- ✅ Self-ingest in learning mode — synthesized answers captured with provenance metadata
- ✅ `learning_review` mode: two-phase ingest, with synthesized answers queued for human approve/reject before embedding (`/api/v1/review/*`)
- ✅ Guardrails: input prompt-injection/jailbreak blocking + output PII masking and length cap (config-toggleable)
- ✅ RAGAS evaluation harness: offline faithfulness / relevancy / context precision+recall (`eval/`)
- ✅ Versioned, typed API (`/api/v1`): Pydantic response envelopes with RFC 9457 `application/problem+json` errors; unversioned `/api/*` still works but is deprecated
- ✅ SSE streaming: `POST /api/v1/chat/stream` emits `token` → `sources` → `done` (and `error`) Server-Sent Events
- ✅ Per-request controls: `mode`, `lang`, `top_k`, `score_threshold` override server defaults per call
- ✅ Reference web client (`web/`): Vite + React + TypeScript SPA with streaming chat, mode/language selectors, structured citations, RTL/Arabic rendering, health badge, and a `learning_review` reviewer panel
- ✅ 14 LLM providers with universal OpenAI-compatible adapter + provider aliases
- ✅ 7+ embedding models via FastEmbed (ONNX, ~50MB, zero CVEs) + model registry
- ✅ API key authentication (FastAPI dependency injection) on destructive ingest endpoints
- ✅ SSRF protection — blocks private IPs and cloud metadata endpoints
- ✅ Proxy-aware rate limiting — 60 requests/minute per IP with CIDR trust configuration
- ✅ CORS hardened: default `[]`, production must opt in
- ✅ Observability: correlation ID tracing, request timing, structured JSON logging
- ✅ Kubernetes-ready: cached `/health` + live `/ready` probes with dependency-specific error details
- ✅ Structured citations: each source carries `label`, `doc_id`, `score`, `page`, and `snippet` (v1 API; legacy `/api/chat` keeps bare label strings)
- ✅ Conversational follow-ups: context-aware replies when no document match exists
- ✅ Multi-format ingestion: PDF, TXT, Markdown, DOCX, HTML via a pluggable loader registry (`ingest/loaders.py`)
- ✅ Two ingest paths: remote URL pull or local file upload (multipart), so documents can stay on-prem / private
- ✅ Incremental ingestion — only re-embeds changed chunks, not the whole document
- ✅ Ingestion safeguards — duplicate submission protection, file size limits, status polling endpoint
- ✅ Global duplicate detection — same PDF under different names caught via content hash
- ✅ Multilingual responses (Arabic / English / European Portuguese) — hybrid auto-detection: Arabic by script + a dependency-free Portuguese diacritic/stopword heuristic, with a lingua EN/PT statistical fallback for short ambiguous inputs
- ✅ LangGraph workflow orchestration (8-node pipeline)
- ✅ FastAPI production API layer
- ✅ Dockerized — cloud deployment (docker-compose.yml) + local deployment (docker-compose.local.yml with Ollama)
- ✅ Structured logging to console + rotating file (logs/app.log, 10 MB cap)
- ✅ 300+ tests covering adapters, graph nodes, builder, resilience, groundedness, feedback, ingest queue, rate limiter, security, API endpoints (97% coverage) + a hermetic retrieval-regression test
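The hybrid retrieval item in the list above fuses dense and BM25 results with reciprocal rank fusion (RRF), which scores each document as the sum of `1 / (k + rank)` over the rankings it appears in. A minimal sketch follows; `rrf_fuse` is a hypothetical name and `k = 60` is the commonly used constant, not necessarily the value this project uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]  # made-up dense (embedding) ranking
bm25 = ["c", "a", "d"]   # made-up keyword ranking
print(rrf_fuse([dense, bm25]))
```

Documents that appear in both lists (`a` and `c` here) rise above those found by only one retriever, which is what helps exact-phrase and acronym queries.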
- Multi-provider LLM support (14 providers)
- FastEmbed local embeddings (7+ models, ONNX-based)
- Chat modes (strict, open, learning with self-ingest, learning_review with two-phase ingest)
- Local deployment (Ollama + FastEmbed, zero API keys)
- Provider comparison documentation
- CI/CD pipeline (ruff, bandit, pip-audit, coverage, Docker build)
- Security elevation (auth, SSRF, rate limiting, CORS, observability) — 72→95 audit score
- Guardrails (input prompt-injection blocking + output PII masking / length cap)
- Evaluation (RAGAS): offline harness (`eval/`)
- Learning mode review workflow (two-phase ingest for synthesized entries)
- Context-aware query rewriting (condense follow-ups before retrieval)
- Groundedness / faithfulness verification (strict-mode anti-hallucination gate)
- Persistent feedback + closed quality loop (feedback → review queue / golden set)
- Provider retry/backoff + circuit breaker
- Durable, retryable ingestion (queue + worker)
- Configurable persona / domain / refusal copy
- Hermetic retrieval-regression test + opt-in scheduled eval CI
- Reviewer UI for the learning_review queue (web)
- Hybrid retrieval + reranking: scaffolding shipped (`RETRIEVAL_STRATEGY`); promote to default once the eval proves a lift
- Backend: FastAPI
- LLM: 14 providers — OpenAI, Anthropic, Google, Groq, Ollama, DeepSeek, Together, Mistral, Fireworks, OpenRouter, vLLM, LM Studio, llama.cpp
- Embeddings: OpenAI / FastEmbed (7+ models) / HuggingFace
- Orchestration: LangGraph
- Framework: LangChain
- Vector DB: ChromaDB (dense) + BM25 (`rank-bm25`, optional hybrid)
- Cache / Memory / Queue: Redis (memory, rate limiting, ingest registry + durable ingest queue)
- Resilience: tenacity (retry/backoff) + in-process circuit breaker
- Runtime: Python 3.10+
- Container: Docker + Docker Compose (API + worker + Redis; local + cloud)
See CONTRIBUTING.md for the full guide — setup, code standards, testing, security, and PR workflow.
Quick start:

    conda create -n chat-bot python=3.10 && conda activate chat-bot
    pip install -r requirements.txt -r requirements-dev.txt
    pytest  # 300+ tests, 97% coverage

Good first contributions:
- Add a new document loader (DOCX, TXT, HTML) in `ingest/`
- Expand the guardrail patterns (`guardrails/`) or the RAGAS golden set (`eval/golden.jsonl`)
- Wire a concrete reranker into `ingest/retrieval.rerank` (FastEmbed / cross-encoder) behind `RETRIEVAL_STRATEGY=hybrid_rerank`
- Grow the eval golden set (`eval/golden.jsonl`) or the retrieval-regression corpus (`tests/test_retrieval_regression.py`)
- Add a new FastEmbed model to the registry in `utils/embedding_adapter.py`
Please open an issue before starting large changes so we can discuss approach first.
This project demonstrates a real-world production architecture for AI chatbots combining:
RAG + Memory + LLM + Backend Engineering + Security Hardening
See CHANGELOG.md for version history, detailed changes, and the v1.0.0 → v2.0.0 comparison table.
This repository (pt-act/chat-bot) is a hardened fork of hasandeveloper/chat-bot with significant enhancements:
| Dimension | Upstream (v1.0.0) | This Fork (v2.4.0) |
|---|---|---|
| LLM providers | 3 (OpenAI, Anthropic, Groq) | 14 (universal adapter) |
| Chat modes | 1 (strict only) | 4 (strict, open, learning, learning_review) |
| Multi-turn retrieval | Raw last message | Context-aware query rewriting (condense follow-ups) |
| Answer faithfulness | Unverified | Groundedness verification + strict-mode refusal of unsupported answers |
| Feedback | None | Persistent 👍/👎 (/api/v1/feedback) → review queue + golden-set export |
| Provider failures | Fail the turn | Retry transient 429/5xx with backoff + circuit breaker |
| Ingestion durability | In-process `BackgroundTasks` | Optional durable Redis queue + worker (survives restarts, retries) |
| Persona / branding | Hard-coded "our company" | Configurable name / domain / refusal copy (defaults unchanged) |
| Retrieval strategy | Dense MMR | Dense MMR + optional hybrid (BM25 + RRF) |
| Self-ingestion | None | Learning mode with quality gate + two-phase review (learning_review: queued for human approve/reject before embedding) |
| Languages | 2 (English / Arabic) | 3 (English / Arabic / European Portuguese) + hybrid auto-detection (script + diacritic/stopword heuristic + lingua fallback) |
| Document formats | PDF only | PDF / TXT / Markdown / DOCX / HTML (pluggable loader registry) |
| Ingest sources | URL / S3 pull | URL pull + local file upload (multipart, on-prem friendly) |
| HTTP API | Unversioned `/api/*` | Versioned `/api/v1` (typed envelopes) + RFC 9457 `problem+json` errors |
| Streaming | None | SSE token streaming (POST /api/v1/chat/stream) |
| Guardrails | None | Input prompt-injection blocking + output PII masking / length cap |
| Evaluation | None | RAGAS offline harness (faithfulness / relevancy / context precision+recall) |
| Web client | None | Vite + React + TypeScript reference SPA (web/) |
| Authentication | None | API key (FastAPI DI) |
| SSRF protection | None | Private IP + metadata blocking |
| Rate limiting | Direct IP only | Proxy-aware (CIDR, X-Forwarded-For) |
| Observability | None | Correlation ID + request timing |
| Logging | Text only | Text + JSON (structured) |
| Health probes | `/health` (static) | `/health` (cached) + `/ready` (live) |
| CI/CD | Basic (ruff + pytest) | Full (ruff + bandit + pip-audit + coverage + Docker) |
| Test count | ~10 | 300+ (97% coverage) + hermetic retrieval-regression |
| Audit score | 72/100 (C+) | 95/100 (A+) |
| Local deployment | None | docker-compose.local.yml (Ollama + FastEmbed) |
Contributions to both repositories are welcome.
Example request body with per-request controls:

    {
      "q": "what is the return policy?",
      "mode": "strict",        // strict | open | learning (overrides server default)
      "lang": "auto",          // auto | en | ar | pt (auto-detects Arabic/Portuguese vs English; pt = European Portuguese)
      "top_k": 3,              // 1..10 chunks to retrieve
      "score_threshold": 0.3   // 0..1 minimum relevance
    }
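A client can check these ranges before sending the request. The helper below is a sketch: `build_chat_request` is a hypothetical name, the allowed values are copied from the comments in the body above, and the server's own validation remains authoritative. Note that the `//` comments are annotations only and must not appear in the JSON actually sent.

```python
import json
from typing import Any, Dict

ALLOWED_MODES = {"strict", "open", "learning"}  # as listed in the request body comments
ALLOWED_LANGS = {"auto", "en", "ar", "pt"}


def build_chat_request(
    q: str,
    mode: str = "strict",
    lang: str = "auto",
    top_k: int = 3,
    score_threshold: float = 0.3,
) -> Dict[str, Any]:
    """Validate per-request controls and return a JSON-serialisable body."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if lang not in ALLOWED_LANGS:
        raise ValueError(f"lang must be one of {sorted(ALLOWED_LANGS)}")
    if not 1 <= top_k <= 10:
        raise ValueError("top_k must be between 1 and 10")
    if not 0.0 <= score_threshold <= 1.0:
        raise ValueError("score_threshold must be between 0 and 1")
    return {
        "q": q,
        "mode": mode,
        "lang": lang,
        "top_k": top_k,
        "score_threshold": score_threshold,
    }


body = build_chat_request("what is the return policy?")
print(json.dumps(body))
```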