pdf-to-rag

Try it: npx pdf-to-rag ingest ./path/to/pdfs then npx pdf-to-rag query "Your natural-language question" (requires Node.js 18+; first run may download the default embedding model).

What it is: pdf-to-rag builds a local-first index over folders of PDFs—default embeddings use Transformers.js on your machine (no paid API keys)—and query returns ranked verbatim passages with file name and page so you can cite sources instead of trusting a paraphrase.

Who it’s for: developers and researchers who want a small, auditable pipeline plus CLI, library, and MCP (stdio or HTTP) surfaces—without committing to a large agent framework.

(Terminal-style preview: ingest then query with citations.)


What it’s for

  • Researchers and builders who want searchable PDF corpora without sending document text to a cloud embedding API.
  • Citation-aware retrieval: each hit is a stored chunk string plus metadata (fileName, page, score, chunkId), suitable for footnotes or UI snippets.
  • Not in scope: a single LLM “answer” that merges hits; hosts or models compose answers from the returned excerpts if needed.
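
For illustration, a hit with the fields listed above might look like this in TypeScript. Only the field names (fileName, page, score, chunkId and the chunk text) come from this README; the interface and helper below are assumptions, not the package's exported types:

```typescript
// Hypothetical shape of a retrieval hit; field names come from the
// README above, the exact exported type may differ.
interface RetrievalHit {
  text: string;     // verbatim chunk from the PDF
  fileName: string; // source file, for the citation
  page: number;     // page number within that file
  score: number;    // similarity score, higher is better
  chunkId: string;  // stable id for deduplication or linking
}

// Format a hit as a footnote-style citation string.
function toCitation(hit: RetrievalHit): string {
  return `"${hit.text}" (${hit.fileName}, p. ${hit.page})`;
}
```

A UI could render `toCitation(hit)` directly as a snippet, or use `chunkId` to link back into the stored index.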

Install and use

Requirements: Node.js 18+. First run may download a default embedding model (Transformers.js) unless you use Ollama (see expandable section below).

From npm (when published):

npm install -g pdf-to-rag
pdf-to-rag ingest ./path/to/pdfs
pdf-to-rag query "your natural language question"
pdf-to-rag inspect

From this repository:

git clone <repo-url> && cd pdf-to-rag
npm install
npm run build
node dist/cli.js ingest ./path/to/pdfs
node dist/cli.js query "your natural language question"
node dist/cli.js inspect

Common flags: --store-dir (index directory, default .pdf-to-rag), ingest --chunk-size / --overlap / --no-recursive / --no-strip-margins, query --top-k. Index file: <store-dir>/index.json.

Verify the toolchain
  • Smoke (smallest PDF in examples/): npm run examples:smoke
  • JSON NL + quotation checks: npm run examples:fixtures (see examples/README.md)
  • MCP install: npm run mcp:smoke after npm run build

Optional git hooks: npm run hooks:install — pre-commit runs npm run build and npm test (see .hooks/README.md).


Architecture (diagrams)

Entry points are thin; CLI, library, and MCP all call the same application layer (runIngest, runQuery, runInspect). The pipeline stays in src/ modules (PDF → text → chunks → embeddings → JSON vector store).

flowchart TB
  subgraph Surfaces
    CLI[CLI cli.ts]
    MCP[MCP server.ts]
    LIB[Library index.ts]
  end
  subgraph Application
    F[createAppDeps]
    RI[runIngest]
    RQ[runQuery]
    RS[runInspect]
  end
  subgraph Pipeline
    PDF[pdf]
    NORM[normalization]
    CHUNK[chunking metadata]
    INGP[ingestion pipeline]
    EMB[embedding]
    STO[storage]
    QRY[query]
  end
  CLI --> RI
  CLI --> RQ
  CLI --> RS
  MCP --> RI
  MCP --> RQ
  MCP --> RS
  LIB --> F
  F --> RI
  F --> RQ
  F --> RS
  F --> EMB
  F --> STO
  RI --> INGP
  RQ --> QRY
  RS --> STO
  INGP --> PDF
  INGP --> NORM
  INGP --> CHUNK
  INGP --> EMB
  INGP --> STO
  QRY --> EMB
  QRY --> STO

Ingest and query data flow:

sequenceDiagram
  participant User
  participant App
  participant Pipeline
  participant Embedder
  participant Store
  participant Retriever
  User->>App: ingest or query
  App->>Pipeline: PDFs to chunks
  Pipeline->>Embedder: embed batch
  Pipeline->>Store: write chunks
  App->>Retriever: query plus deps
  Retriever->>Embedder: embed query
  Retriever->>Store: topK search
  Retriever-->>User: ranked hits
Layer rules (where code belongs)
  • src/commands/ — Commander only; prints results; calls run*.
  • src/mcp/ — stdio MCP, Zod tools, path allowlist; calls the same run*.
  • src/application/ — orchestration, AppDeps, hooks.
  • src/domain/ — types only; no I/O.
  • Pipeline — ingestion/, pdf/, normalization/, chunking/, metadata/, embeddings.ts, embedding/, storage/, query/.

Full file-level map: docs/architecture/overview.md.


Documentation

  • docs/README.md — Doc index
  • docs/use/cli-library.md — CLI and library ↔ src/
  • docs/use/comparison.md — vs LangChain, LlamaIndex, Unstructured (positioning)
  • docs/use/mcp.md — MCP tools, env, security
  • docs/onboarding/mcp.md — First-time MCP setup
  • docs/architecture/overview.md — Deeper diagrams and tables
  • docs/management/roadmap.md — Roadmap
  • docs/management/requirements.md — Requirements (F/N/D)
  • docs/contributing/agents.md — Cursor agents, /pdf-* commands

MCP server (AI tools)

After npm run build, run pdf-to-rag-mcp (stdio) or pdf-to-rag-mcp-http (HTTP/SSE on port 3000) for hosts that cannot use stdio. Configure corpus access with PDF_TO_RAG_CWD, PDF_TO_RAG_ALLOWED_DIRS, and optionally PDF_TO_RAG_SOURCE_DIR. The query tool accepts an optional hypotheticalAnswer (HyDE) — pass a caller-generated hypothetical answer to improve retrieval for short or abstract questions. Enable cross-encoder reranking by setting PDF_TO_RAG_RERANK_MODEL to a Hugging Face cross-encoder id. Start here: docs/onboarding/mcp.md; full reference: docs/use/mcp.md.
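
The HyDE option above can be pictured with a small sketch. This is illustrative only, not the package's internals: the idea is that a short or abstract question embeds poorly, so when the caller supplies a fuller hypothetical answer, that richer text is embedded instead:

```typescript
// Minimal HyDE sketch (illustrative, not the package's implementation).
// `embed` is assumed to return one vector per input text.
async function hydeQueryVector(
  question: string,
  hypotheticalAnswer: string | undefined,
  embed: (text: string) => Promise<Float32Array>,
): Promise<Float32Array> {
  // Prefer the richer hypothetical answer when the caller supplies one;
  // its embedding tends to land closer to the relevant passage than the
  // embedding of a terse question.
  return embed(hypotheticalAnswer ?? question);
}
```

An MCP host would generate the hypothetical answer itself (e.g. with its own model) and pass it through the tool's `hypotheticalAnswer` argument.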

Browser UI (HTTP server): npm run mcp:http serves multi-page static content from public/—home (/), setup (/setup.html), about (/about.html), and an interactive demo (/demo.html) that talks to /mcp on the same origin. See docs/use/mcp.md § Web demo UI.

Cursor: rules in .cursor/rules/, skills in .cursor/skills/ (pdf-rag-*), commands in .cursor/commands/.

CLI commands (reference)
pdf-to-rag ingest ./docs          # index corpus (recursive by default; unchanged PDFs skipped)
pdf-to-rag query "your question"    # semantic search; summary line + passages + citations
pdf-to-rag inspect               # chunk count / files (no embedder load)

Options: --store-dir, ingest --chunk-size, --overlap, --no-recursive, --no-strip-margins, query --top-k.

Library (programmatic)
import {
  defaultConfig,
  createAppDeps,
  runIngest,
  runQuery,
  runInspect,
  createNoOpHooks,
} from "pdf-to-rag";

const cwd = process.cwd();
const config = defaultConfig({});
const hooks = createNoOpHooks();
const deps = await createAppDeps(cwd, config);

await runIngest("./my-pdfs", cwd, deps, hooks);
const hits = await runQuery("your natural-language question", deps, hooks);
for (const h of hits) {
  console.log(h.fileName, h.page, h.text);
}

const stats = await runInspect(cwd, config);
console.log(stats.chunkCount, stats.files);

Use a custom Hooks object for beforeIngest, afterChunking, afterIndexing, beforeQuery. runInspect does not load the embedding model.
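
A custom hooks object might look like the sketch below. The four hook names come from this README, but the payload types are assumptions; check the package's exported Hooks type for the real signatures before relying on them:

```typescript
// Collected events make the hooks easy to test or forward to a logger.
const events: string[] = [];

// Hedged sketch of a custom Hooks object; payload types are assumed.
const loggingHooks = {
  beforeIngest: async (sourceDir: string) => {
    events.push(`ingest:${sourceDir}`);
  },
  afterChunking: async (chunkCount: number) => {
    events.push(`chunks:${chunkCount}`);
  },
  afterIndexing: async () => {
    events.push("indexed");
  },
  beforeQuery: async (question: string) => {
    events.push(`query:${question}`);
  },
};
```

Pass an object like this in place of `createNoOpHooks()` to observe each pipeline stage.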

Embeddings: Transformers.js (default) and Ollama (optional)

Default: Transformers.js / ONNX in Node — no paid API; suitable for small corpora; large multi-PDF folders can be slow on CPU. Model download on first use (~tens of MB); set TRANSFORMERS_CACHE to pin cache location.

Optional fast path: PDF_TO_RAG_EMBED_BACKEND=ollama, OLLAMA_EMBED_MODEL (e.g. nomic-embed-text after ollama pull), optional OLLAMA_HOST (default http://127.0.0.1:11434). Batching: OLLAMA_EMBED_BATCH_SIZE, OLLAMA_EMBED_CONCURRENCY. With GPU or Apple Metal, full ingest of an examples/-scale tree can be on the order of minutes; CPU-only may be much slower.
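
How the two batching knobs interact can be sketched as follows. This is an illustrative model, not the package's implementation: texts are grouped into batches of OLLAMA_EMBED_BATCH_SIZE, and up to OLLAMA_EMBED_CONCURRENCY batches are in flight at once, with result order preserved:

```typescript
// Illustrative batching sketch: `embedBatch` stands in for one call to
// the embedding backend and returns one vector per input text.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 16,
  concurrency = 2,
): Promise<number[][]> {
  // Group texts into fixed-size batches.
  const batches: string[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  const results: number[][][] = new Array(batches.length);
  let next = 0;
  // Each worker pulls the next unclaimed batch until none remain;
  // `next++` is safe because JS is single-threaded between awaits.
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < batches.length) {
      const i = next++;
      results[i] = await embedBatch(batches[i]);
    }
  });
  await Promise.all(workers);
  return results.flat();
}
```

Larger batch sizes amortize per-request overhead; higher concurrency helps when the backend can serve parallel requests (e.g. on a GPU).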

Important: re-ingest after switching backend or model; the index records embeddingModel (e.g. ollama:nomic-embed-text). Query embedding dimension must match the index.
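
A guard like the following shows why the dimensions must agree; it is a hypothetical helper, not part of the package's API:

```typescript
// Fail fast when a query embedding's dimension does not match the
// vectors already in the index (e.g. after switching backends without
// re-ingesting). Comparing vectors of different dimensions is meaningless.
function assertDimensionMatch(queryVec: Float32Array, indexDim: number): void {
  if (queryVec.length !== indexDim) {
    throw new Error(
      `embedding dimension ${queryVec.length} != index dimension ${indexDim}; ` +
        "re-ingest after changing the embedding backend or model",
    );
  }
}
```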

See docs/management/requirements.md (F7) and .cursor/commands/pdf-embeddings.md.

Examples and tests
  • PDFs (or your own) under examples/ (see examples/README.md)
  • npm run examples:smoke — minimal ingest + query
  • npm run examples:fixtures — examples/query-fixtures.json NL + substring suite
  • npm run demo:papers — longer scripted demo (see examples/README.md)
  • npm run eval:generate / eval:run / eval:compare — synthetic gold-set generation, retrieval metrics, and A/B diff (see docs/management/project.md § Testing and evaluation methodology)
Limitations (current MVP)
  • Ingest skips PDFs whose mtime+size match the index (F13); new or changed files are re-chunked and re-embedded. Switching embedding backend or model still requires a full re-ingest so vectors stay in one space (F7).
  • Embeddings are stored as raw Float32Array in a binary index.bin sidecar (schema v3). Search is linear cosine below PDF_TO_RAG_HNSW_THRESHOLD (default 2000 chunks) and switches to HNSW approximate nearest-neighbor above it (hnswlib-node).
  • Scanned PDFs without extractable text are not a primary target.
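
The linear search path mentioned above can be sketched as plain cosine scoring over the stored vectors, returning the top-k indices. This brute-force version is what the HNSW index replaces above the threshold; it is an illustrative sketch, not the package's code:

```typescript
// Cosine similarity between two dense vectors of equal length.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Linear top-k: score every stored vector, sort descending, keep k.
// O(n · d) per query, which is why an ANN index pays off on large corpora.
function topK(query: Float32Array, vectors: Float32Array[], k: number): number[] {
  return vectors
    .map((v, i) => ({ i, score: cosine(query, v) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((h) => h.i);
}
```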

License

MIT
