Try it: npx pdf-to-rag ingest ./path/to/pdfs, then npx pdf-to-rag query "Your natural-language question" (requires Node.js 18+; first run may download the default embedding model).
What it is: pdf-to-rag builds a local-first index over folders of PDFs—default embeddings use Transformers.js on your machine (no paid API keys)—and query returns ranked verbatim passages with file name and page so you can cite sources instead of trusting a paraphrase.
Who it’s for: developers and researchers who want a small, auditable pipeline plus CLI, library, and MCP (stdio or HTTP) surfaces—without committing to a large agent framework.
- Researchers and builders who want searchable PDF corpora without sending document text to a cloud embedding API.
- Citation-aware retrieval: each hit is a stored chunk string plus metadata (fileName, page, score, chunkId), suitable for footnotes or UI snippets.
- Not in scope: a single LLM “answer” that merges hits; hosts or models compose answers from the returned excerpts if needed.
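To make the hit shape concrete, here is a small TypeScript sketch. The field names (fileName, page, score, chunkId, plus the chunk text) come from the list above, but the QueryHit interface and formatCitation helper are illustrative, not part of the package's API.

```typescript
// Illustrative hit shape; the real type is defined by pdf-to-rag itself.
interface QueryHit {
  fileName: string;
  page: number;
  score: number;
  chunkId: string;
  text: string; // the stored chunk string, quoted verbatim
}

// Render a hit as a citation-style footnote or UI snippet.
function formatCitation(hit: QueryHit): string {
  return `"${hit.text.trim()}" (${hit.fileName}, p. ${hit.page})`;
}

const hit: QueryHit = {
  fileName: "paper.pdf",
  page: 3,
  score: 0.87,
  chunkId: "paper.pdf#p3-0",
  text: "Retrieval quality depends on chunking.",
};
console.log(formatCitation(hit));
// → "Retrieval quality depends on chunking." (paper.pdf, p. 3)
```

Because each hit carries the verbatim excerpt and its location, a host can show quotes with sources instead of an unverifiable paraphrase.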
Requirements: Node.js 18+. First run may download a default embedding model (Transformers.js) unless you use Ollama (see expandable section below).
From npm (when published):
npm install -g pdf-to-rag
pdf-to-rag ingest ./path/to/pdfs
pdf-to-rag query "your natural language question"
pdf-to-rag inspect
From this repository:
git clone <repo-url> && cd pdf-to-rag
npm install
npm run build
node dist/cli.js ingest ./path/to/pdfs
node dist/cli.js query "your natural language question"
node dist/cli.js inspect
Common flags: --store-dir (index directory, default .pdf-to-rag), ingest --chunk-size / --overlap / --no-recursive / --no-strip-margins, query --top-k. Index file: <store-dir>/index.json.
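The ingest --chunk-size and --overlap flags suggest a sliding-window chunker. The sketch below shows how such parameters typically interact; it illustrates the idea only and is not pdf-to-rag's actual chunking code.

```typescript
// Illustrative sliding-window chunker: each window is chunkSize
// characters long and the next window starts (chunkSize - overlap)
// characters later, so consecutive chunks share `overlap` characters.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunk size");
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = chunkText("a".repeat(25), 10, 3);
console.log(chunks.length); // 4 windows over 25 chars with step 7
```

Larger overlap reduces the chance a sentence is split across a chunk boundary at the cost of a bigger index, which is why both knobs are exposed.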
Verify the toolchain
- Smoke (smallest PDF in examples/): npm run examples:smoke
- JSON NL + quotation checks: npm run examples:fixtures (see examples/README.md)
- MCP install: npm run mcp:smoke after npm run build
Optional git hooks: npm run hooks:install — pre-commit runs npm run build and npm test (see .hooks/README.md).
Architecture (diagrams)
Entry points are thin; CLI, library, and MCP all call the same application layer (runIngest, runQuery, runInspect). The pipeline stays in src/ modules (PDF → text → chunks → embeddings → JSON vector store).
flowchart TB
subgraph Surfaces
CLI[CLI cli.ts]
MCP[MCP server.ts]
LIB[Library index.ts]
end
subgraph Application
F[createAppDeps]
RI[runIngest]
RQ[runQuery]
RS[runInspect]
end
subgraph Pipeline
PDF[pdf]
NORM[normalization]
CHUNK[chunking metadata]
INGP[ingestion pipeline]
EMB[embedding]
STO[storage]
QRY[query]
end
CLI --> RI
CLI --> RQ
CLI --> RS
MCP --> RI
MCP --> RQ
MCP --> RS
LIB --> F
F --> RI
F --> RQ
F --> RS
F --> EMB
F --> STO
RI --> INGP
RQ --> QRY
RS --> STO
INGP --> PDF
INGP --> NORM
INGP --> CHUNK
INGP --> EMB
INGP --> STO
QRY --> EMB
QRY --> STO
Ingest and query data flow:
sequenceDiagram
participant User
participant App
participant Pipeline
participant Embedder
participant Store
participant Retriever
User->>App: ingest or query
App->>Pipeline: PDFs to chunks
Pipeline->>Embedder: embed batch
Pipeline->>Store: write chunks
App->>Retriever: query plus deps
Retriever->>Embedder: embed query
Retriever->>Store: topK search
Retriever-->>User: ranked hits
Layer rules (where code belongs)
- src/commands/ — Commander only; prints results; calls run*.
- src/mcp/ — stdio MCP, Zod tools, path allowlist; calls the same run*.
- src/application/ — orchestration, AppDeps, hooks.
- src/domain/ — types only; no I/O.
- Pipeline — ingestion/, pdf/, normalization/, chunking/, metadata/, embeddings.ts, embedding/, storage/, query/.
Full file-level map: docs/architecture/overview.md.
| Resource | Description |
|---|---|
| docs/README.md | Doc index |
| docs/use/cli-library.md | CLI and library ↔ src/ |
| docs/use/comparison.md | vs LangChain, LlamaIndex, Unstructured (positioning) |
| docs/use/mcp.md | MCP tools, env, security |
| docs/onboarding/mcp.md | First-time MCP setup |
| docs/architecture/overview.md | Deeper diagrams and tables |
| docs/management/roadmap.md | Roadmap |
| docs/management/requirements.md | Requirements (F/N/D) |
| docs/contributing/agents.md | Cursor agents, /pdf-* commands |
MCP server (AI tools)
After npm run build, run pdf-to-rag-mcp (stdio) or pdf-to-rag-mcp-http (HTTP/SSE on port 3000) for hosts that cannot use stdio. Configure corpus access with PDF_TO_RAG_CWD, PDF_TO_RAG_ALLOWED_DIRS, and optionally PDF_TO_RAG_SOURCE_DIR. The query tool accepts an optional hypotheticalAnswer (HyDE) — pass a caller-generated hypothetical answer to improve retrieval for short or abstract questions. Enable cross-encoder reranking by setting PDF_TO_RAG_RERANK_MODEL to a Hugging Face cross-encoder id. Start here: docs/onboarding/mcp.md; full reference: docs/use/mcp.md.
Browser UI (HTTP server): npm run mcp:http serves multi-page static content from public/—home (/), setup (/setup.html), about (/about.html), and an interactive demo (/demo.html) that talks to /mcp on the same origin. See docs/use/mcp.md § Web demo UI.
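For orientation, a call from the demo page to /mcp might look like the sketch below. The query tool name and its hypotheticalAnswer option are documented above, but the JSON-RPC envelope, the query argument name, and the omission of headers/session handling are assumptions following MCP's generic tools/call convention.

```typescript
// Build the JSON-RPC payload for an MCP tools/call. Kept as a separate
// function so the shape can be inspected without a running server.
function buildQueryCall(question: string, hypotheticalAnswer?: string) {
  return {
    jsonrpc: "2.0",
    id: 1,
    method: "tools/call",
    params: {
      name: "query",
      arguments: {
        query: question,
        ...(hypotheticalAnswer ? { hypotheticalAnswer } : {}),
      },
    },
  };
}

// Same-origin call as the demo page would make (sketch only).
async function mcpQuery(question: string): Promise<unknown> {
  const res = await fetch("/mcp", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildQueryCall(question)),
  });
  return res.json();
}
```

See docs/use/mcp.md for the server's actual request handling.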
Cursor: rules in .cursor/rules/, skills in .cursor/skills/ (pdf-rag-*), commands in .cursor/commands/.
CLI commands (reference)
pdf-to-rag ingest ./docs # index corpus (recursive by default; unchanged PDFs skipped)
pdf-to-rag query "your question"   # semantic search; summary line + passages + citations
pdf-to-rag inspect                 # chunk count / files (no embedder load)
Options: --store-dir, ingest --chunk-size, --overlap, --no-recursive, --no-strip-margins, query --top-k.
Library (programmatic)
import {
defaultConfig,
createAppDeps,
runIngest,
runQuery,
runInspect,
createNoOpHooks,
} from "pdf-to-rag";
const cwd = process.cwd();
const config = defaultConfig({});
const hooks = createNoOpHooks();
const deps = await createAppDeps(cwd, config);
await runIngest("./my-pdfs", cwd, deps, hooks);
const hits = await runQuery("your natural-language question", deps, hooks);
for (const h of hits) {
console.log(h.fileName, h.page, h.text);
}
const stats = await runInspect(cwd, config);
console.log(stats.chunkCount, stats.files);
Use a custom Hooks object for beforeIngest, afterChunking, afterIndexing, beforeQuery. runInspect does not load the embedding model.
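As a sketch of the hooks surface, the object below records pipeline progress. The four hook names come from the text above; the callback signatures are assumptions, so check the package's exported Hooks type for the real ones.

```typescript
// Hypothetical Hooks object: the hook names match the docs, but these
// callback parameter types are assumptions, not pdf-to-rag's types.
const events: string[] = [];
const loggingHooks = {
  beforeIngest: async (sourceDir: string) => { events.push(`ingest:${sourceDir}`); },
  afterChunking: async (chunkCount: number) => { events.push(`chunks:${chunkCount}`); },
  afterIndexing: async () => { events.push("indexed"); },
  beforeQuery: async (question: string) => { events.push(`query:${question}`); },
};

// In real use you would pass loggingHooks to runIngest/runQuery; here
// the callbacks are exercised directly to show the shape.
void loggingHooks.beforeIngest("./my-pdfs");
void loggingHooks.afterChunking(42);
console.log(events); // the pushes run synchronously before any await
```

A hooks object like this is handy for progress bars or timing instrumentation without touching the pipeline itself.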
Embeddings: Transformers.js (default) and Ollama (optional)
Default: Transformers.js / ONNX in Node — no paid API; suitable for small corpora; large multi-PDF folders can be slow on CPU. Model download on first use (~tens of MB); set TRANSFORMERS_CACHE to pin cache location.
Optional fast path: PDF_TO_RAG_EMBED_BACKEND=ollama, OLLAMA_EMBED_MODEL (e.g. nomic-embed-text after ollama pull), optional OLLAMA_HOST (default http://127.0.0.1:11434). Batching: OLLAMA_EMBED_BATCH_SIZE, OLLAMA_EMBED_CONCURRENCY. With GPU or Apple Metal, full ingest of an examples/-scale tree can be on the order of minutes; CPU-only may be much slower.
Important: re-ingest after switching backend or model; the index records embeddingModel (e.g. ollama:nomic-embed-text). Query embedding dimension must match the index.
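The mismatch failure mode can be illustrated as follows; the function and names are hypothetical, and the point is only that vectors from different embedding models generally differ in dimension and must not be compared.

```typescript
// Illustrative guard: a query vector embedded with a different model
// than the one recorded in the index will usually have a different
// dimension, so the comparison is rejected instead of silently wrong.
function assertDimensionMatch(indexDim: number, queryVec: number[]): void {
  if (queryVec.length !== indexDim) {
    throw new Error(
      `query embedding dim ${queryVec.length} != index dim ${indexDim}; ` +
        `re-ingest after switching embedding backend or model`,
    );
  }
}

assertDimensionMatch(384, new Array(384).fill(0)); // same model family: ok
// assertDimensionMatch(384, new Array(768).fill(0)) would throw.
```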
See docs/management/requirements.md (F7) and .cursor/commands/pdf-embeddings.md.
Examples and tests
- Sample PDFs (or your own) under examples/ — see examples/README.md
- npm run examples:smoke — minimal ingest + query
- npm run examples:fixtures — examples/query-fixtures.json NL + substring suite
- npm run demo:papers — longer scripted demo (see examples/README.md)
- npm run eval:generate / eval:run / eval:compare — synthetic gold-set generation, retrieval metrics, and A/B diff (see docs/management/project.md § Testing and evaluation methodology)
Limitations (current MVP)
- Ingest skips PDFs whose mtime+size match the index (F13); new or changed files are re-chunked and re-embedded. Switching embedding backend or model still requires a full re-ingest so vectors stay in one space (F7).
- Embeddings are stored as raw Float32Array in a binary index.bin sidecar (schema v3). Search is linear cosine below PDF_TO_RAG_HNSW_THRESHOLD (default 2000 chunks) and switches to HNSW approximate nearest-neighbor above it (hnswlib-node).
- Scanned PDFs without extractable text are not a primary target.
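For small indexes, a linear cosine scan amounts to scoring every stored vector against the query vector and keeping the top k. A minimal sketch of that idea (not the project's Float32Array/HNSW code):

```typescript
// Linear top-k cosine search over an in-memory store (illustrative).
type Chunk = { text: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(queryVec: number[], store: Chunk[], k: number): Chunk[] {
  return [...store]
    .map((c) => ({ c, score: cosine(queryVec, c.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k)
    .map((s) => s.c);
}

const store: Chunk[] = [
  { text: "dogs", vector: [1, 0] },
  { text: "cats", vector: [0.9, 0.1] },
  { text: "planes", vector: [0, 1] },
];
console.log(topK([1, 0], store, 2).map((c) => c.text)); // ["dogs", "cats"]
```

This scan is exact but O(n) per query, which is why the project swaps it for HNSW approximate search above the configured chunk threshold.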