doc2graph

Turn documents into queryable knowledge graphs — no LLM required.

When you need to answer questions over long files, reports, or multi-document corpora, naive chunking loses structure. doc2graph extracts entities, relationships, and context as a graph, ranks relevant nodes with Personalized PageRank, and returns a ranked subgraph of nodes and edges to pass to your LLM.

Pure Python. No LLM dependency. Bring your own model.

Why graph-based context?

| Approach | What you lose |
| --- | --- |
| Fixed-size chunking | Sentence boundaries, section context, cross-references |
| Embedding search | Exact match, structural relationships, citation graphs |
| doc2graph | Nothing: relationships are explicit edges |

The graph knows that a claim is supported by evidence in a specific section, which cites a reference, which is authored by a specific person. Flat chunks don't.

Quick start

pip install docs2graph

# Extract a knowledge graph from a paper
docs2graph paper.md --graph knowledge --output paper.graph.json

# Extract ADR-style decisions from architecture docs
docs2graph architecture.md --graph decision --output decisions.graph.json

# Process an entire documentation corpus
docs2graph ./docs --graph all --output corpus.graph.json

from doc2graph import DocumentGraph

# Single document
g = DocumentGraph.from_document("report.pdf", graph_type="knowledge")

# Rank nodes relevant to a query
context = g.rank("what are the key risks?", k=10)
# Pass context["nodes"] + context["edges"] to your LLM

Installation

Core (Markdown, plain text, HTML, JSON, CSV, code files):

pip install docs2graph

With PDF support:

pip install "docs2graph[pdf]"

With Word / PowerPoint support:

pip install "docs2graph[docx,pptx]"

With OCR (images, scanned PDFs):

pip install "docs2graph[ocr]"
# Also requires: apt install tesseract-ocr  (Ubuntu/Debian)
#                brew install tesseract     (macOS)

Everything:

pip install "docs2graph[all]"

How it works

Document / File / Corpus
         │
         ▼
   Format loader ──► Text + structure
         │
         ▼
   Extractor ──► Nodes (entities, sections, claims, ...)
         │         └─► Edges (contains, references, defines, ...)
         ▼
   Knowledge graph (plain JSON)
         │
         ▼
   query → Personalized PageRank → Ranked subgraph
         │
         ▼
   Your LLM prompt

Load — auto-detects format, handles encoding, extracts clean text and structure
Extract — turns structure into typed graph nodes and labeled edges
Rank — Personalized PageRank starting from query-matched nodes surfaces the most relevant subgraph
Use — pass context["nodes"] + context["edges"] to any LLM
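
The Rank step can be illustrated with a small, self-contained sketch. This is not doc2graph's engine: it is a plain power-iteration Personalized PageRank over the `nodes`/`edges` shape used elsewhere in this README, and it assumes edges are treated as undirected, that seed nodes share the restart mass equally, and that `alpha=0.85` is a reasonable damping factor. The function names `personalized_pagerank` and `top_k` are hypothetical.

```python
from collections import defaultdict


def personalized_pagerank(nodes, edges, seeds, alpha=0.85, iterations=50):
    """Power-iteration Personalized PageRank over an undirected view of the graph.

    nodes: list of {"id": ...}; edges: list of {"from": ..., "to": ...}.
    seeds: ids of query-matched nodes; all restart mass goes to them.
    """
    ids = list(dict.fromkeys(n["id"] for n in nodes))
    known = set(ids)
    neighbors = defaultdict(set)
    for e in edges:
        a, b = e["from"], e["to"]
        # Assumption: ignore self-loops and edges that point at unknown nodes.
        if a in known and b in known and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seeds = [s for s in dict.fromkeys(seeds) if s in known]
    if not seeds:
        return {i: 0.0 for i in ids}
    restart = {i: 0.0 for i in ids}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)

    score = dict(restart)
    for _ in range(iterations):
        # Every step, a (1 - alpha) fraction of the mass jumps back to the seeds.
        nxt = {i: (1.0 - alpha) * restart[i] for i in ids}
        for i in ids:
            out = neighbors.get(i)
            if out:
                share = alpha * score[i] / len(out)
                for j in out:
                    nxt[j] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += alpha * score[i] / len(seeds)
        score = nxt
    return score


def top_k(score, k):
    """Return the k highest-scoring node ids, ties broken by id for determinism."""
    return sorted(score, key=lambda i: (-score[i], i))[:k]
```

Nodes that are unreachable from every seed end with a score of zero, which is why a ranked subgraph stays focused on the neighborhood of the query-matched nodes.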

Graph types

`knowledge` — for research papers, reports, documentation

Extracts: documents, sections, concepts, definitions, claims, evidence, tables, citations, references, URLs

docs2graph paper.md --graph knowledge --output paper.graph.json

g = DocumentGraph.from_document("paper.md", graph_type="knowledge")
context = g.rank("graph-based context ranking", k=15)

Relationships: contains, references, defines, defined_by, supports, cites, resolves_to, links_to

Inline citations ([1], (Smith, 2024)) are resolved to matching # References entries. Claim-to-evidence support links are deterministic and conservative: a claim links only to evidence in the same section or to evidence that shares meaningful terms with it.
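
As a rough illustration of citation resolution, the sketch below handles only numeric markers such as `[1]`. It assumes reference entries start with the same bracketed number, and it uses hypothetical node ids (`line:N`, `reference:N`). The `resolves_to` label is taken from the relationship list above, but which edge label doc2graph actually assigns to this link is an assumption.

```python
import re

# Numeric inline markers like [1]; author-year forms are out of scope here.
CITATION = re.compile(r"\[(\d+)\]")
# Assumed reference entry shape: "[1] Author. Title."
REFERENCE = re.compile(r"^\s*\[(\d+)\]\s+(.*\S)\s*$")


def resolve_numeric_citations(body_lines, reference_lines):
    """Match [n] markers in the body to [n] entries in the reference list."""
    refs = {}
    for line in reference_lines:
        m = REFERENCE.match(line)
        if m:
            refs[m.group(1)] = m.group(2)

    edges = []
    for lineno, line in enumerate(body_lines, start=1):
        for m in CITATION.finditer(line):
            key = m.group(1)
            # Conservative: markers without a matching entry produce no edge.
            if key in refs:
                edges.append({
                    "from": f"line:{lineno}",      # hypothetical id scheme
                    "to": f"reference:{key}",      # hypothetical id scheme
                    "label": "resolves_to",
                })
    return refs, edges
```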

`decision` — for ADRs and architecture documents

Extracts: problems, context/drivers, options, pros, cons, tradeoffs, decisions, consequences, confidence

docs2graph architecture.md --graph decision --output decisions.graph.json

decisions = DocumentGraph.from_document("adr.md", graph_type="decision")

Recognizes ADR-style headings (## Decision, ## Options, ## Consequences), standalone prefixed lines (Constraint:, Assumption:, Rationale:), and Markdown option tables. Context bullets link to decisions with informed_by edges so the reasoning trail is traversable.
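
A minimal sketch of how this kind of recognition could work, using only the headings and prefixes named above. The regular expressions and the `scan_adr` helper are hypothetical, not the extractor's real rules, and Markdown option tables are left out.

```python
import re

# Only the heading names and prefixes mentioned in this README are covered.
HEADING = re.compile(r"^##\s+(Decision|Options|Consequences)\s*$", re.IGNORECASE)
PREFIXED = re.compile(r"^(Constraint|Assumption|Rationale):\s*(.+)$")


def scan_adr(text):
    """Return (kind, current_section, content) tuples in document order."""
    found = []
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        heading = HEADING.match(stripped)
        if heading:
            section = heading.group(1).lower()
            found.append(("heading", section, ""))
            continue
        prefixed = PREFIXED.match(stripped)
        if prefixed:
            # Keep the enclosing section so the line can later be linked to it.
            found.append((prefixed.group(1).lower(), section, prefixed.group(2)))
    return found
```

Tracking the enclosing section for each prefixed line is what would let a later step attach it to the right decision node.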

`schema` — for data dictionaries and schema docs

Extracts table and entity graphs from schema documentation for text-to-SQL context.

`media` — for images and charts

Extracts image metadata, OCR text, and chart signal nodes.

`all` — merged graph from all extractors

docs2graph ./docs --graph all --output corpus.graph.json

Multi-document corpora

Directory input is first-class. doc2graph walks supported formats and emits a corpus root with folder/file provenance nodes. It resolves explicit relative links ([ADR](adr/cache.md)) into links_to edges, and adds deterministic cross-document mentions edges when one file explicitly names another's title, section, decision, or path-derived stem.

# Process entire knowledge base
docs2graph ./knowledge-base --graph all --output corpus.graph.json

# Filter to ADRs only
docs2graph ./knowledge-base --graph decision --include "adr/**" --output adr.graph.json

# Limit corpus size
docs2graph ./exports --graph all --max-files 500 --max-file-bytes 10485760

# Bounded traversal for huge trees
docs2graph ./exports --graph all --max-depth 2 --max-total-bytes 1073741824

# Audit corpus before extraction (no files loaded)
docs2graph ./exports --graph all --scan-only --output corpus.scan.graph.json

# Cache extraction results across runs
docs2graph ./docs --graph all --cache .doc2graph-cache.json --output corpus.graph.json
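
The relative-link resolution described in this section can be sketched as follows. This is an illustration under assumptions, not the library's code: paths are POSIX-style and relative to the corpus root, node ids use a hypothetical `file:` prefix, and only links whose target is a known corpus file produce an edge.

```python
import posixpath
import re

# Markdown inline links: [text](target)
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_edges(source_path, text, known_files):
    """Resolve relative Markdown links in one file against the corpus file set."""
    edges = []
    base = posixpath.dirname(source_path)
    for m in MD_LINK.finditer(text):
        target = m.group(2).split("#", 1)[0]  # drop any #fragment
        # Skip pure fragments, absolute URLs, and root-absolute paths.
        if not target or "://" in target or target.startswith("/"):
            continue
        resolved = posixpath.normpath(posixpath.join(base, target))
        if resolved in known_files:
            edges.append({
                "from": f"file:{source_path}",  # hypothetical id scheme
                "to": f"file:{resolved}",       # hypothetical id scheme
                "label": "links_to",
            })
    return edges
```

Dropping links that do not resolve to a known file keeps the output deterministic: the same directory contents always yield the same edges.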

Corpus limits reference

| Flag | Default | Description |
| --- | --- | --- |
| `--max-files N` | unlimited | Select at most N files; continues scanning for skip counts |
| `--stop-after-max-files` | off | Stop scanning at first file beyond `--max-files` |
| `--max-file-bytes N` | 5 MB | Skip files larger than N bytes |
| `--max-total-bytes N` | unlimited | Stop extracting after N cumulative bytes |
| `--max-depth N` | unlimited | Bound recursive descent by subdirectory depth |
| `--max-scan-entries N` | unlimited | Stop directory walk after N filesystem entries |
| `--include PATTERN` | all | Repeatable glob filter (e.g. `--include "adr/**"`) |
| `--exclude PATTERN` | none | Repeatable glob exclusion |
| `--extension EXT` | all | Repeatable suffix allowlist (e.g. `--extension md`) |
| `--scan-only` | off | Build scan graph without loading any files |
| `--follow-symlinks` | off | Extract symlinked files (symlinked dirs always skipped) |
| `--cache PATH` | none | Reuse unchanged per-file extractions across runs |
| `--refresh-cache` | off | Rebuild all cache entries |
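
To make the interaction of a few of these limits concrete, here is a hedged sketch of a bounded directory walk. It mirrors `--max-depth`, `--max-files`, and `--max-file-bytes` only, and several details are assumptions: files directly under the root count as depth 0, "5 MB" is read as 5 * 1024 * 1024 bytes, and symlinked files are silently skipped. The `bounded_walk` name and the skip-counter keys are hypothetical.

```python
import os


def bounded_walk(root, max_depth=None, max_files=None,
                 max_file_bytes=5 * 1024 * 1024):
    """Select files under root while counting what the limits skipped."""
    selected = []
    skipped = {"too_large": 0, "over_max_files": 0}
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Assumption: root is depth 0, its direct subdirectories are depth 1.
        if dirpath == root:
            depth = 0
        else:
            depth = os.path.relpath(dirpath, root).count(os.sep) + 1
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # pruning in place stops os.walk from descending
        dirnames.sort()       # sorted traversal keeps the result deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            if os.path.getsize(path) > max_file_bytes:
                skipped["too_large"] += 1
            elif max_files is not None and len(selected) >= max_files:
                # Keep scanning after the cap so skip counts stay complete.
                skipped["over_max_files"] += 1
            else:
                selected.append(path)
    return selected, skipped
```

Continuing the scan after the file cap is reached matches the `--max-files` description above; an early exit would correspond to `--stop-after-max-files`.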

Supported formats

| Format | Extensions | Extra install |
| --- | --- | --- |
| Markdown | `.md`, `.mdx` | none |
| Plain text | `.txt` | none |
| HTML | `.html` | none |
| JSON / JSONL | `.json`, `.jsonl` | none |
| CSV / TSV | `.csv`, `.tsv` | none |
| Source code | `.py`, `.js`, `.ts`, `.sql`, `.yaml`, `.toml`, `.sh`, ... | none |
| PDF | `.pdf` | `pip install "docs2graph[pdf]"` |
| Word | `.docx` | `pip install "docs2graph[docx]"` |
| PowerPoint | `.pptx` | `pip install "docs2graph[pptx]"` |
| Images / OCR | `.png`, `.jpg`, `.gif`, `.tif`, `.bmp`, `.webp` | `pip install "docs2graph[ocr]"` + tesseract |
| URLs | `https://...` | none |
| Google Docs/Sheets/Slides | public export URLs | `GOOGLE_DOCS_BEARER_TOKEN` for private |

Python API

Single document

from doc2graph import DocumentGraph

# Auto-detect format
g = DocumentGraph.from_document("paper.pdf", graph_type="knowledge")

# Markdown
g = DocumentGraph.from_markdown("notes.md", graph_type="all")

# Plain text
g = DocumentGraph.from_text("My text content...", graph_type="knowledge")

Directory corpus

g = DocumentGraph.from_directory(
    "./docs",
    graph_type="all",
    max_depth=3,
    max_files=500,
    cache=".doc2graph-cache.json",
)

Query and rank

context = g.rank("what are the main risks?", k=10)
# Returns {"nodes": [...], "edges": [...]}

# Pass to any LLM
prompt = f"Context:\n{context}\n\nQuestion: what are the main risks?"
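
Interpolating the dict directly puts its Python repr into the prompt. One alternative is to render the context as compact text first. The sketch below assumes the ranked nodes and edges carry the same fields as the graph output format documented in this README (`id`, `label`, `content`, `attributes.type`, and `from`/`to`/`label`). The `render_context` helper is hypothetical and not part of the package.

```python
def render_context(context, max_chars=400):
    """Turn {"nodes": [...], "edges": [...]} into readable prompt text."""
    labels = {n["id"]: n.get("label", n["id"]) for n in context["nodes"]}
    lines = ["Nodes:"]
    for n in context["nodes"]:
        kind = n.get("attributes", {}).get("type", "node")
        body = n.get("content", "")[:max_chars]  # truncate long node content
        lines.append(f"- [{kind}] {labels[n['id']]}: {body}")
    lines.append("Edges:")
    for e in context["edges"]:
        # Fall back to the raw id when an endpoint is not among the ranked nodes.
        src = labels.get(e["from"], e["from"])
        dst = labels.get(e["to"], e["to"])
        lines.append(f"- {src} --{e['label']}--> {dst}")
    return "\n".join(lines)
```

The rendered string can then replace `{context}` in the f-string above.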

Build and export

# Export
g.to_json("graph.json")           # plain JSON
g.to_graphml("graph.graphml")     # GraphML for Gephi / yEd
graph_dict = g.to_dict()          # raw {"nodes": [...], "edges": [...]}

# Inspect
print(len(g.nodes))
print(len(g.edges))

Graph output format

{
  "nodes": [
    {
      "id": "claim:this_paper_proposes_a_graph_based_approach",
      "label": "This paper proposes a graph based approach",
      "content": "This paper proposes a graph based approach...",
      "attributes": {
        "type": "claim",
        "source": "paper.md",
        "extraction_method": "static"
      }
    }
  ],
  "edges": [
    {
      "from": "section:0_abstract",
      "to": "claim:this_paper_proposes_a_graph_based_approach",
      "label": "contains"
    }
  ]
}
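
Because the output is plain JSON, it can be consumed without doc2graph installed. The sketch below indexes a graph of this shape with the standard library only; the `index_graph` helper is hypothetical, and the field names are the ones shown in the example above.

```python
import json
from collections import defaultdict


def index_graph(graph):
    """Build lookup tables over a {"nodes": [...], "edges": [...]} dict."""
    nodes_by_id = {n["id"]: n for n in graph["nodes"]}
    ids_by_type = defaultdict(list)
    for n in graph["nodes"]:
        node_type = n.get("attributes", {}).get("type", "unknown")
        ids_by_type[node_type].append(n["id"])
    outgoing = defaultdict(list)
    for e in graph["edges"]:
        outgoing[e["from"]].append((e["label"], e["to"]))
    return nodes_by_id, dict(ids_by_type), dict(outgoing)


def load_graph(path):
    """Read a graph file such as the one written by --output."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `index_graph(load_graph("paper.graph.json"))` would give direct access to all claim ids and to the edges leaving each section.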

CLI reference

docs2graph <source> [options]

Arguments:
  source          File path, directory path, or URL

Options:
  --graph TYPE    Graph type: knowledge, decision, schema, media, all (default: all)
  --output PATH   Output JSON file (default: stdout)
  --max-files N   Maximum files to extract from a directory
  --max-depth N   Maximum directory recursion depth
  --cache PATH    Cache file for incremental corpus runs
  --scan-only     Build scan graph without loading files
  --include GLOB  Include pattern (repeatable)
  --exclude GLOB  Exclude pattern (repeatable)
  --extension EXT File extension filter (repeatable)
  -h, --help      Show help

Use cases

RAG over technical docs — extract section/concept graph, rank on query, pass subgraph as focused context instead of raw chunks
Research paper analysis — extract entity/citation graph, find what a paper claims and what evidence it cites
Architecture review — extract decision graphs from ADRs, trace the reasoning behind every architectural choice
Contract review — extract clause relationships, identify obligations and conditions
Code understanding — combine with code2graph for cross-document + cross-code context
Text-to-SQL — combine with graph2sql for schema-aware query generation

Design principles

Pure Python — no LLM, no cloud service, no database required
No LLM dependency — extraction is deterministic and static; LLM enrichment is opt-in and labeled extraction_method: llm_inferred
Deterministic outputs — same input always produces the same graph, making corpus runs reproducible and diffable
Works with any model — output is plain JSON; pass to GPT-4, Claude, Llama, Mistral, or any other model
Pluggable — add your own loader or extractor without touching core code
Shared core — same Personalized PageRank engine as graph2sql

Related projects

| Package | What it does |
| --- | --- |
| graph2sql | Graph-based schema analysis for text-to-SQL (same PPR core) |
| code2graph | Code repository to knowledge graph (modules, classes, dependencies) |

Contributing

Contributions are welcome. See CONTRIBUTING.md for guidelines.

git clone https://github.com/jw-open/doc2graph
cd doc2graph
pip install -e ".[dev]"
pytest tests/ -v

License

Apache-2.0 — see LICENSE
