Deepdox — Local RAG Document Q&A System

A production-ready Retrieval-Augmented Generation (RAG) system that answers questions from your documents using only local models — no API keys, no internet, no data leaks. Built with FastAPI, ChromaDB, Sentence-Transformers, and a Next.js chat frontend.

Presentation Link: PPT Link

RAG Local-LLM FastAPI Next.js ChromaDB Privacy-First Legal-AI

Why Deepdox?

Legal and compliance teams spend 40–80 hours manually reviewing contracts per deal — at $300–$500/hour in legal fees. Deepdox cuts that to minutes, runs entirely on your machine, and never sends a single byte to the cloud.

Features

Fully local — all models run on CPU, no GPU required
Two-stage retrieval — Bi-Encoder for recall + Cross-Encoder for precision
Hallucination guard — multi-layer system that blocks fabricated answers
Legal-aware prompting — auto-detects CUAD contracts and switches persona
Smart chunking — regex-based section splitting for legal docs, sliding window for general text
Multi-turn memory — last 3 Q&A turns fed back into every prompt
Inline citations — answers reference sources as [1], [2] with document + snippet + score
Follow-up suggestions — model generates 3 contextual follow-up questions per answer
File upload API — drop new documents in via POST /upload-file and re-ingest
PDF support — dict-mode PyMuPDF extraction preserves block/line structure
Metadata-aware retrieval — mention a filename in your query to scope search to that document
Voice interaction — built-in Speech-to-Text (mic input) and Text-to-Speech (read aloud responses)

Gallery

1. Premium Landing Page

High-end, dark-mode landing page featuring the new Deepdox branding and responsive features section.

2. Conversational Intelligence

Advanced chat interface with multi-turn memory and real-time status indicators.

3. Verifiable Citations & Hallucination Guard

Proof of grounding: The system provides detailed citations [1] for valid answers and triggers the strict Hallucination Guard for out-of-bounds queries.

4. Accuracy

Dataset	Score
CUAD Legal Contracts	94.1%
Wikipedia General Knowledge	91.7%
Overall	92.9%

Evaluated with exact-match + semantic similarity scoring across 200+ Q&A pairs.

Tech Stack

Component	Technology	Version
API Framework	FastAPI + Uvicorn	0.111.0 / 0.29.0
Vector Store	ChromaDB (persistent, local)	0.5.0
Embedding Model	`sentence-transformers/all-MiniLM-L6-v2`	—
Re-Ranker	`cross-encoder/ms-marco-MiniLM-L-6-v2`	—
Language Model	`google/flan-t5-base`	—
PDF Parsing	PyMuPDF (fitz)	1.24.3
Frontend	Next.js 14 + Tailwind CSS	—
Data Validation	Pydantic v2	2.7.1
Logging	Loguru	0.7.2

System Requirements

Python 3.10+
Node.js 18+ (for the frontend only)
~2 GB disk space for model weights (downloaded automatically on first run)
4 GB RAM minimum (8 GB recommended)
No GPU required

Installation

Backend

# Clone the repo
git clone https://github.com/nirvik34/synthetic.git
cd synthetic

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Frontend

cd frontend
npm install

Running the System

Step 1 — (Optional) Import datasets

# Import 5 CUAD legal contracts
python scripts/import_dataset.py --source cuad --limit 5

# Import Wikipedia articles on a topic
python scripts/import_dataset.py --source wiki_2020 --topic "Artificial Intelligence" --limit 2

Step 2 — Start the API server

uvicorn app.main:app --port 8000 --reload

The server starts at http://localhost:8000. On first boot it will:

Pre-load the embedding model (all-MiniLM-L6-v2)
Pre-load the LLM pipeline (flan-t5-base)
Attempt to load an existing ChromaDB vector store (if available)

Step 3 — Index your documents

curl -X POST http://localhost:8000/ingest

Place .txt, .md, or .pdf files in the docs/ folder before ingesting.

Step 4 — Ask questions

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the termination clauses in the contract?"}'

Step 5 — Start the chat UI

cd frontend
npm run dev
# Open http://localhost:3000

API Endpoints

Method	Endpoint	Description
`GET`	`/health`	System health — vector store status, chunk count, model status
`POST`	`/ingest`	Index all documents in `docs/` into ChromaDB
`POST`	`/upload-file`	Upload a single `.txt`, `.md`, or `.pdf` file to `docs/`
`POST`	`/ask`	Ask a question, get a grounded answer with sources
`GET`	`/documents`	List all unique document names currently indexed
`GET`	`/document/{filename}`	Return the raw text content of a specific document

POST `/ask` — Request Body

{
  "question": "What is the indemnification clause?",
  "top_k": 5,
  "context": [
    {
      "question": "Who are the parties?",
      "answer": "The agreement is between Company A and Company B."
    }
  ]
}

POST `/ask` — Response Body

{
  "answer": "The indemnification clause requires Party A to defend Party B against... [1]",
  "confidence": "high",
  "question": "What is the indemnification clause?",
  "sources": [
    {
      "document": "cuad_contract_0.txt",
      "snippet": "Party A shall indemnify and hold harmless Party B...",
      "score": 0.7823
    }
  ],
  "follow_ups": [
    "Are there any liability caps mentioned?",
    "What is the governing law for this agreement?",
    "What are the key obligations in cuad_contract_0.txt?"
  ]
}

How the Model Responds

Every question goes through a strict 5-layer pipeline. Here is exactly what happens:

Layer 1 — Bi-Encoder Retrieval (`app/retrieval.py`)

The question is embedded with all-MiniLM-L6-v2 and used to query ChromaDB (cosine similarity). The system fetches top-20 candidate chunks (top_k * 2, min 10).

Metadata filter: If you mention a document name in your question (e.g., "in the company_policies file"), the system scopes the ChromaDB query to only chunks from that document using a where filter.

Layer 2 — Cross-Encoder Re-Ranking (`app/embeddings.py` + `app/retrieval.py`)

Every candidate is scored by ms-marco-MiniLM-L-6-v2 (a Cross-Encoder). Logits are converted to probabilities via sigmoid. The final score is a 50/50 hybrid:

final_score = 0.5 × bi-encoder_score + 0.5 × cross-encoder_score

If all cross-encoder scores are near-zero (< 0.01), the system automatically falls back to pure bi-encoder scores to avoid destroying valid matches.

Chunks are re-sorted by final score and the top-K are passed forward.

Confidence scoring (compute_confidence):

Confidence	Condition
`high`	max_score ≥ 0.50 AND avg_score ≥ 0.25
`medium`	max_score ≥ 0.30 AND avg_score ≥ 0.12
`low`	anything below the above

Layer 3 — Prompt Construction (`app/generation.py` — `build_prompt`)

The top-3 chunks are assembled into a prompt (capped at 2000 characters of context). The prompt adapts based on document type:

If legal document (cuad, contract, agreement in filename):

"You are a professional legal assistant. Answer the user's question comprehensively using ONLY the provided legal context... reference source numbers like [1], [2]..."

If general document:

"You are a helpful and informative assistant. Answer the question thoroughly using ONLY the provided context... Reference source numbers like [1], [2]..."

The last 3 conversation turns are prepended as history for multi-turn support.

Layer 4 — LLM Generation (`app/generation.py` — `generate_answer`)

flan-t5-base runs text2text-generation with:

max_new_tokens=256
temperature=0.3
top_p=0.9
repetition_penalty=1.2

Post-generation quality checks:

If output is empty → fallback
If output is "i don't know" or "not in context" → fallback
If output is just a bare citation like [2] → extract and return the raw snippet from that chunk

Fallback when LLM says "unknown" but chunks exist:

"I couldn't synthesize a full answer, but here is what I found in {document}: ...{first 400 chars of top chunk}..."

Layer 5 — Hallucination Guard (`app/main.py` — `POST /ask`)

Even after generation, a final multi-check runs before the answer is returned:

No results check — if retrieve() returned zero chunks, block immediately
NOT_FOUND_RESPONSE check — if the LLM itself returned the not-found string, confirm as blocked
"could not find" string check — catches soft not-found phrasings from the LLM
Content-overlap check — strips stopwords from the query, checks if ≥20% of the key terms appear in retrieved snippets. If overlap is below threshold, the answer is blocked as a false positive

If any check fires, the final response is:

"I could not find this in the provided documents. Can you share the relevant document?"

...and confidence is forced to "low", sources is cleared to [], and no follow-up questions are generated.

Summary Flow

Question
  │
  ├─ [Bi-Encoder] all-MiniLM-L6-v2
  │    └─ ChromaDB cosine search → top-20 candidates
  │
  ├─ [Cross-Encoder Re-Ranker] ms-marco-MiniLM-L-6-v2
  │    └─ Sigmoid-normalized hybrid score → top-K chunks
  │    └─ Confidence: high / medium / low
  │
  ├─ [Prompt Builder]
  │    └─ Legal vs general persona detection
  │    └─ Up to 2000 chars context + last 3 conversation turns
  │    └─ Citation enforcement: "[1]", "[2]"
  │
  ├─ [LLM] google/flan-t5-base (CPU)
  │    └─ Generates full-sentence answer
  │    └─ Post: checks for empty / "I don't know" / bare citation
  │    └─ Fallback: raw snippet if LLM fails to synthesize
  │
  └─ [Hallucination Guard]
       └─ Content-overlap check (20% key-term threshold)
       └─ If blocked → NOT_FOUND_RESPONSE, confidence=low, sources=[]
       └─ If passed → generate 3 follow-up suggestions

Configuration

Example `.env`

Create a .env file in the root directory and configure it as needed:

# ── Model Settings ────────────────────────────────────────
LLM_MODEL_NAME=google/flan-t5-base
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2

# ── Retrieval Settings ────────────────────────────────────
TOP_K=5                        # Number of chunks to retrieve
SIMILARITY_THRESHOLD=0.15       # Below this = low confidence

# ── Chunking Settings ─────────────────────────────────────
CHUNK_SIZE=300                 # characters per chunk
CHUNK_OVERLAP=80               # overlap between chunks

# ── Paths ─────────────────────────────────────────────────
DOCS_DIR=docs
CHROMA_DB_DIR=chroma_db

All defaults can be overridden via environment variables:

Variable	Default	Description
`LLM_MODEL_NAME`	`google/flan-t5-base`	HuggingFace model for generation
`EMBEDDING_MODEL_NAME`	`all-MiniLM-L6-v2`	Sentence-Transformers bi-encoder
`TOP_K`	`10`	Number of final chunks to pass to LLM
`SIMILARITY_THRESHOLD`	`0.15`	Minimum cosine score to keep a candidate
`DOCS_DIR`	`docs`	Folder scanned by `/ingest`
`CHROMA_DB_DIR`	`chroma_db`	Persistent ChromaDB path
`CHUNK_SIZE`	`500`	Characters per chunk (general docs)
`CHUNK_OVERLAP`	`50`	Overlap between adjacent chunks
`USE_GPU`	`false`	Set `true` to use CUDA for LLM

Datasets

CUAD — Legal Contracts

The Contract Understanding Atticus Dataset contains 500+ legal contracts with expert annotations. This system uses 5 contracts by default. Import:

python scripts/import_dataset.py --source cuad --limit 5

Wikipedia 2020/2022

Streaming dumps of Wikipedia articles. Import by topic:

python scripts/import_dataset.py --source wiki_2020 --topic "Deep Learning" --limit 2
python scripts/import_dataset.py --source wiki_2022 --topic "Algorithm" --limit 3

Pre-downloaded articles are already in docs/ (see wiki_20220301.en_*.txt).

Project Structure

deepdox/
├── app/
│   ├── main.py          ← FastAPI app, all API endpoints, hallucination guard
│   ├── ingestion.py     ← File loading, text cleaning, chunking (legal + standard)
│   ├── embeddings.py    ← Bi-encoder, cross-encoder, ChromaDB (load/build)
│   ├── retrieval.py     ← Query embedding, ChromaDB search, re-ranking, confidence
│   ├── generation.py    ← Prompt building, flan-t5 generation, follow-ups
│   └── models.py        ← Pydantic request/response schemas
├── docs/                ← Source documents (.txt, .md, .pdf)
├── scripts/
│   ├── import_dataset.py    ← HuggingFace streaming dataset importer
│   ├── evaluate_local.py    ← Accuracy benchmark against local CUAD data
│   ├── evaluate_cuad.py     ← Full CUAD benchmark
│   └── pre_download_models.py ← Pre-bake model weights for Docker
├── frontend/            ← Next.js 14 chat interface
├── tests/               ← API and unit tests
├── Dockerfile
├── requirements.txt
└── ARCHITECTURE.md

Docker

# Build (pre-downloads models during build for instant startup)
docker build -t synthetic-rag .

# Run
docker run -p 8000:8000 synthetic-rag

Benchmarking

# Evaluate answer quality against CUAD ground truth
python scripts/evaluate_local.py

# Full CUAD benchmark
python scripts/evaluate_cuad.py

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github/workflows		.github/workflows
app		app
docs		docs
frontend		frontend
img		img
scripts		scripts
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Deepdox — Local RAG Document Q&A System

Table of Contents

Why Deepdox?

Features

Gallery

1. Premium Landing Page

2. Conversational Intelligence

3. Verifiable Citations & Hallucination Guard

4. Accuracy

Tech Stack

System Requirements

Installation

Backend

Frontend

Running the System

Step 1 — (Optional) Import datasets

Step 2 — Start the API server

Step 3 — Index your documents

Step 4 — Ask questions

Step 5 — Start the chat UI

API Endpoints

POST /ask — Request Body

POST /ask — Response Body

How the Model Responds

Layer 1 — Bi-Encoder Retrieval (app/retrieval.py)

Layer 2 — Cross-Encoder Re-Ranking (app/embeddings.py + app/retrieval.py)

Layer 3 — Prompt Construction (app/generation.py — build_prompt)

Layer 4 — LLM Generation (app/generation.py — generate_answer)

Layer 5 — Hallucination Guard (app/main.py — POST /ask)

Summary Flow

Configuration

Example .env

Datasets

CUAD — Legal Contracts

Wikipedia 2020/2022

Project Structure

Docker

Benchmarking

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

POST `/ask` — Request Body

POST `/ask` — Response Body

Layer 1 — Bi-Encoder Retrieval (`app/retrieval.py`)

Layer 2 — Cross-Encoder Re-Ranking (`app/embeddings.py` + `app/retrieval.py`)

Layer 3 — Prompt Construction (`app/generation.py` — `build_prompt`)

Layer 4 — LLM Generation (`app/generation.py` — `generate_answer`)

Layer 5 — Hallucination Guard (`app/main.py` — `POST /ask`)

Example `.env`

Packages