Turn documents into queryable knowledge graphs — no LLM required.
When you need to answer questions over long files, reports, or multi-document corpora, naive chunking loses structure. doc2graph extracts entities, relationships, and context as a graph, ranks relevant nodes with Personalized PageRank, and hands you exactly what your LLM needs.
Pure Python. No LLM dependency. Bring your own model.
| Approach | What you lose |
|---|---|
| Fixed-size chunking | Sentence boundaries, section context, cross-references |
| Embedding search | Exact match, structural relationships, citation graphs |
| doc2graph | Nothing — relationships are explicit edges |
The graph knows that a claim is supported by evidence in a specific section, which cites a reference, which is authored by a specific person. Flat chunks don't.
```bash
pip install docs2graph

# Extract a knowledge graph from a paper
docs2graph paper.md --graph knowledge --output paper.graph.json

# Extract ADR-style decisions from architecture docs
docs2graph architecture.md --graph decision --output decisions.graph.json

# Process an entire documentation corpus
docs2graph ./docs --graph all --output corpus.graph.json
```

```python
from doc2graph import DocumentGraph

# Single document
g = DocumentGraph.from_document("report.pdf", graph_type="knowledge")

# Rank nodes relevant to a query
context = g.rank("what are the key risks?", k=10)

# Pass context["nodes"] + context["edges"] to your LLM
```

Core (Markdown, plain text, HTML, JSON, CSV, code files):
```bash
pip install docs2graph
```

With PDF support:

```bash
pip install "docs2graph[pdf]"
```

With Word / PowerPoint support:

```bash
pip install "docs2graph[docx,pptx]"
```

With OCR (images, scanned PDFs):

```bash
pip install "docs2graph[ocr]"
# Also requires: apt install tesseract-ocr   (Ubuntu/Debian)
#                brew install tesseract      (macOS)
```

Everything:

```bash
pip install "docs2graph[all]"
```

```text
Document / File / Corpus
        │
        ▼
  Format loader ──► Text + structure
        │
        ▼
  Extractor ──► Nodes (entities, sections, claims, ...)
        │   └─► Edges (contains, references, defines, ...)
        ▼
  Knowledge graph (plain JSON)
        │
        ▼
  query → Personalized PageRank → Ranked subgraph
        │
        ▼
  Your LLM prompt
```
- Load: auto-detects format, handles encoding, extracts clean text and structure
- Extract: turns structure into typed graph nodes and labeled edges
- Rank: Personalized PageRank, seeded from query-matched nodes, surfaces the most relevant subgraph
- Use: pass `context["nodes"]` + `context["edges"]` to any LLM
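The Rank step can be pictured with a small pure-Python sketch. It is not doc2graph's actual implementation: the function names (`match_seeds`, `personalized_pagerank`, `rank`), the substring-based seed matching, the undirected treatment of edges, and the `alpha=0.85` damping value are all assumptions chosen for illustration. It only relies on the `{"nodes": [...], "edges": [...]}` shape with `id`, `from`, and `to` keys.

```python
import re
from collections import defaultdict


def match_seeds(nodes, query):
    """Return ids of nodes whose label or content mentions a query term."""
    terms = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if len(t) > 2}
    seeds = []
    for node in nodes:
        text = f"{node.get('label', '')} {node.get('content', '')}".lower()
        if any(term in text for term in terms):
            seeds.append(node["id"])
    return seeds


def personalized_pagerank(nodes, edges, seeds, alpha=0.85, iterations=50):
    """Power iteration over an undirected view of the graph.

    With probability (1 - alpha) the random walk restarts at a seed node.
    """
    ids = [node["id"] for node in nodes]
    known = set(ids)
    seeds = [s for s in seeds if s in known]
    if not seeds:
        return {}

    neighbors = defaultdict(set)
    for edge in edges:
        a, b = edge["from"], edge["to"]
        if a in known and b in known and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    restart = {i: 0.0 for i in ids}
    for s in seeds:
        restart[s] += 1.0 / len(seeds)

    scores = dict(restart)
    for _ in range(iterations):
        nxt = {i: (1.0 - alpha) * restart[i] for i in ids}
        for i in ids:
            mass = alpha * scores[i]
            targets = neighbors.get(i)
            if targets:
                for j in targets:
                    nxt[j] += mass / len(targets)
            else:
                # Dangling node: hand its mass back to the seeds.
                for s in seeds:
                    nxt[s] += mass / len(seeds)
        scores = nxt
    return scores


def rank(graph, query, k=10):
    """Keep the k best-scoring nodes and the edges between them."""
    seeds = match_seeds(graph["nodes"], query)
    scores = personalized_pagerank(graph["nodes"], graph["edges"], seeds)
    keep = set(sorted(scores, key=lambda i: (-scores[i], i))[:k])
    return {
        "nodes": [n for n in graph["nodes"] if n["id"] in keep],
        "edges": [e for e in graph["edges"]
                  if e["from"] in keep and e["to"] in keep],
    }
```

Nodes that are not connected to any seed receive a score of zero, which is why unrelated sections drop out of the ranked subgraph.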
The `knowledge` graph extracts: documents, sections, concepts, definitions, claims, evidence, tables, citations, references, URLs

```bash
docs2graph paper.md --graph knowledge --output paper.graph.json
```

```python
from doc2graph import DocumentGraph

g = DocumentGraph.from_document("paper.md", graph_type="knowledge")
context = g.rank("graph-based context ranking", k=15)
```

Relationships: `contains`, `references`, `defines`, `defined_by`, `supports`, `cites`, `resolves_to`, `links_to`
Inline citations (`[1]`, `(Smith, 2024)`) are resolved to matching entries under the `# References` heading. Claim-to-evidence support links are deterministic and conservative: a claim is linked only to evidence in the same section, or to evidence that shares meaningful terms with the claim.
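To make the citation step concrete, here is a minimal sketch of resolving both citation styles against a reference list. It is an illustration under assumptions, not the library's code: the helper names (`parse_references`, `resolve_citations`), the regular expressions, and the `citation:` / `reference:` node id scheme are hypothetical. Only the `resolves_to` label and the `from`/`to`/`label` edge shape come from this README.

```python
import re

NUMERIC = re.compile(r"\[(\d+)\]")
AUTHOR_YEAR = re.compile(r"\(([A-Z][A-Za-z'\-]+)(?: et al\.)?,\s*(\d{4})\)")
REF_PREFIX = re.compile(r"^\[?(\d+)[\].]\s+(.*)$")


def parse_references(lines):
    """Index reference entries by number and by (surname, year)."""
    by_number, by_author_year = {}, {}
    for position, line in enumerate(lines, start=1):
        text = line.strip()
        prefix = REF_PREFIX.match(text)
        # Fall back to list position when the entry carries no number.
        number = int(prefix.group(1)) if prefix else position
        body = prefix.group(2) if prefix else text
        ref_id = f"reference:{number}"  # hypothetical id scheme
        by_number[number] = ref_id
        surname = re.match(r"([A-Z][A-Za-z'\-]+)", body)
        year = re.search(r"\b(\d{4})\b", body)
        if surname and year:
            by_author_year[(surname.group(1).lower(), year.group(1))] = ref_id
    return by_number, by_author_year


def resolve_citations(text, by_number, by_author_year):
    """Emit one resolves_to edge per citation that matches a reference."""
    edges = []
    for match in NUMERIC.finditer(text):
        target = by_number.get(int(match.group(1)))
        if target:
            edges.append({"from": f"citation:{match.group(0)}",
                          "to": target, "label": "resolves_to"})
    for match in AUTHOR_YEAR.finditer(text):
        key = (match.group(1).lower(), match.group(2))
        target = by_author_year.get(key)
        if target:
            edges.append({"from": f"citation:{match.group(0)}",
                          "to": target, "label": "resolves_to"})
    return edges
```

Citations with no matching entry simply produce no edge, which keeps the resolution conservative in the same spirit as the support links.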
The `decision` graph extracts: problems, context/drivers, options, pros, cons, tradeoffs, decisions, consequences, confidence

```bash
docs2graph architecture.md --graph decision --output decisions.graph.json
```

```python
from doc2graph import DocumentGraph

decisions = DocumentGraph.from_document("adr.md", graph_type="decision")
```

Recognizes ADR-style headings (`## Decision`, `## Options`, `## Consequences`), standalone prefixed lines (`Constraint:`, `Assumption:`, `Rationale:`), and Markdown option tables. Context bullets link to decisions with `informed_by` edges so the reasoning trail is traversable.
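A rough sketch of how ADR headings and prefixed lines could be turned into a decision graph follows. Everything here is illustrative: `parse_adr` and `build_decision_graph` are hypothetical names, the `## Context` heading and the node id scheme are assumptions, and the direction of the `informed_by` edge (decision to context item) is a guess, since the README does not state it. Option tables are left out for brevity.

```python
import re

HEADING = re.compile(r"^##\s+(.+?)\s*$")
PREFIXED = re.compile(r"^(Constraint|Assumption|Rationale):\s*(.+)$")
KNOWN_SECTIONS = {"context", "decision", "options", "consequences"}


def parse_adr(markdown):
    """Group body lines by ADR section and collect prefixed lines."""
    sections, prefixed, current = {}, [], None
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if heading:
            name = heading.group(1).lower()
            current = name if name in KNOWN_SECTIONS else None
            if current:
                sections.setdefault(current, [])
            continue
        tagged = PREFIXED.match(line)
        if tagged:
            prefixed.append((tagged.group(1).lower(), tagged.group(2)))
            continue
        if current and line:
            # Drop leading bullet markers such as "- " or "* ".
            sections[current].append(line.lstrip("-* ").strip())
    return sections, prefixed


def build_decision_graph(sections, prefixed):
    """Create nodes for the decision, its context, and prefixed lines."""
    nodes, edges = [], []
    decision = " ".join(sections.get("decision", []))
    if decision:
        nodes.append({"id": "decision:0", "label": decision,
                      "attributes": {"type": "decision"}})
    for i, item in enumerate(sections.get("context", [])):
        nodes.append({"id": f"context:{i}", "label": item,
                      "attributes": {"type": "context"}})
        if decision:
            # Assumed direction: the decision is informed by the context.
            edges.append({"from": "decision:0", "to": f"context:{i}",
                          "label": "informed_by"})
    for i, (kind, text) in enumerate(prefixed):
        nodes.append({"id": f"{kind}:{i}", "label": text,
                      "attributes": {"type": kind}})
    return {"nodes": nodes, "edges": edges}
```

Following the `informed_by` edges from a decision node then yields the context items that motivated it.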
The `schema` graph extracts table and entity graphs from schema documentation for text-to-SQL context.
The `media` graph extracts image metadata, OCR text, and chart signal nodes.
```bash
docs2graph ./docs --graph all --output corpus.graph.json
```

Directory input is first-class. doc2graph walks supported formats, emits a corpus root with folder/file provenance nodes, resolves explicit relative links (`[ADR](adr/cache.md)`) into `links_to` edges, and adds deterministic cross-document `mentions` edges when one file explicitly names another's title, section, decision, or path-derived stem.
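Relative-link resolution can be sketched in a few lines. This is an assumption-laden illustration rather than doc2graph's code: `link_edges` is a hypothetical helper, the `file:` id prefix is invented, and the input is simplified to a dict that maps corpus-relative POSIX paths to Markdown text.

```python
import posixpath
import re

MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def link_edges(files):
    """Turn relative Markdown links between corpus files into edges.

    files maps a corpus-relative POSIX path to that file's Markdown text.
    """
    edges = []
    for source, text in sorted(files.items()):
        base = posixpath.dirname(source)
        for _label, target in MARKDOWN_LINK.findall(text):
            # Skip absolute URLs, mail links, and in-page anchors.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            path = target.split("#", 1)[0]
            resolved = posixpath.normpath(posixpath.join(base, path))
            # Only link to files that are actually part of the corpus.
            if resolved in files and resolved != source:
                edges.append({"from": f"file:{source}",
                              "to": f"file:{resolved}",
                              "label": "links_to"})
    return edges
```

Links that point outside the corpus or to missing files are dropped, so every emitted edge has both endpoints in the graph.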
```bash
# Process entire knowledge base
docs2graph ./knowledge-base --graph all --output corpus.graph.json

# Filter to ADRs only
docs2graph ./knowledge-base --graph decision --include "adr/**" --output adr.graph.json

# Limit corpus size
docs2graph ./exports --graph all --max-files 500 --max-file-bytes 10485760

# Bounded traversal for huge trees
docs2graph ./exports --graph all --max-depth 2 --max-total-bytes 1073741824

# Audit corpus before extraction (no files loaded)
docs2graph ./exports --graph all --scan-only --output corpus.scan.graph.json

# Cache extraction results across runs
docs2graph ./docs --graph all --cache .doc2graph-cache.json --output corpus.graph.json
```

| Flag | Default | Description |
|---|---|---|
| `--max-files N` | unlimited | Select at most N files; continues scanning for skip counts |
| `--stop-after-max-files` | off | Stop scanning at first file beyond `--max-files` |
| `--max-file-bytes N` | 5 MB | Skip files larger than N bytes |
| `--max-total-bytes N` | unlimited | Stop extracting after N cumulative bytes |
| `--max-depth N` | unlimited | Bound recursive descent by subdirectory depth |
| `--max-scan-entries N` | unlimited | Stop directory walk after N filesystem entries |
| `--include PATTERN` | all | Repeatable glob filter (e.g. `--include "adr/**"`) |
| `--exclude PATTERN` | none | Repeatable glob exclusion |
| `--extension EXT` | all | Repeatable suffix allowlist (e.g. `--extension md`) |
| `--scan-only` | off | Build scan graph without loading any files |
| `--follow-symlinks` | off | Extract symlinked files (symlinked dirs always skipped) |
| `--cache PATH` | none | Reuse unchanged per-file extractions across runs |
| `--refresh-cache` | off | Rebuild all cache entries |
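How several of these limits interact during a directory walk can be sketched with the standard library. The `scan` function below is hypothetical and covers only depth, include/exclude globs, per-file size, and file count. It assumes that depth 0 means files directly under the root, that 5 MB means 5 × 1024 × 1024 bytes, and that globs are matched against the corpus-relative POSIX path. The real CLI may differ on each point.

```python
import fnmatch
import os


def scan(root, max_depth=None, max_files=None,
         max_file_bytes=5 * 1024 * 1024, include=(), exclude=()):
    """Walk root and return (selected relative paths, skip counts)."""
    root = os.path.abspath(root)
    selected = []
    skipped = {"filtered": 0, "too_large": 0, "over_max_files": 0}

    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == os.curdir else rel_dir.count(os.sep) + 1
        dirnames.sort()  # deterministic traversal order
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further

        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if os.path.islink(full):
                skipped["filtered"] += 1
                continue
            if include and not any(fnmatch.fnmatchcase(rel, p) for p in include):
                skipped["filtered"] += 1
                continue
            if any(fnmatch.fnmatchcase(rel, p) for p in exclude):
                skipped["filtered"] += 1
                continue
            if os.path.getsize(full) > max_file_bytes:
                skipped["too_large"] += 1
                continue
            if max_files is not None and len(selected) >= max_files:
                # Keep scanning so skip counts stay complete.
                skipped["over_max_files"] += 1
                continue
            selected.append(rel)
    return selected, skipped
```

Continuing the walk after the file limit is reached mirrors the documented behavior of `--max-files`. Stopping early would correspond to `--stop-after-max-files`.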
| Format | Extensions | Extra install |
|---|---|---|
| Markdown | `.md`, `.mdx` | none |
| Plain text | `.txt` | none |
| HTML | `.html` | none |
| JSON / JSONL | `.json`, `.jsonl` | none |
| CSV / TSV | `.csv`, `.tsv` | none |
| Source code | `.py`, `.js`, `.ts`, `.sql`, `.yaml`, `.toml`, `.sh`, ... | none |
| PDF | `.pdf` | `pip install "docs2graph[pdf]"` |
| Word | `.docx` | `pip install "docs2graph[docx]"` |
| PowerPoint | `.pptx` | `pip install "docs2graph[pptx]"` |
| Images / OCR | `.png`, `.jpg`, `.gif`, `.tif`, `.bmp`, `.webp` | `pip install "docs2graph[ocr]"` + tesseract |
| URLs | `https://...` | none |
| Google Docs/Sheets/Slides | public export URLs | `GOOGLE_DOCS_BEARER_TOKEN` for private documents |
```python
from doc2graph import DocumentGraph

# Auto-detect format
g = DocumentGraph.from_document("paper.pdf", graph_type="knowledge")

# Markdown
g = DocumentGraph.from_markdown("notes.md", graph_type="all")

# Plain text
g = DocumentGraph.from_text("My text content...", graph_type="knowledge")
```

```python
g = DocumentGraph.from_directory(
    "./docs",
    graph_type="all",
    max_depth=3,
    max_files=500,
    cache=".doc2graph-cache.json",
)
```

```python
context = g.rank("what are the main risks?", k=10)
# Returns {"nodes": [...], "edges": [...]}

# Pass to any LLM
prompt = f"Context:\n{context}\n\nQuestion: what are the main risks?"
```

```python
# Export
g.to_json("graph.json")          # plain JSON
g.to_graphml("graph.graphml")    # GraphML for Gephi / yEd
graph_dict = g.to_dict()         # raw {"nodes": [...], "edges": [...]}

# Inspect
print(len(g.nodes))
print(len(g.edges))
```

```json
{
  "nodes": [
    {
      "id": "claim:this_paper_proposes_a_graph_based_approach",
      "label": "This paper proposes a graph based approach",
      "content": "This paper proposes a graph based approach...",
      "attributes": {
        "type": "claim",
        "source": "paper.md",
        "extraction_method": "static"
      }
    }
  ],
  "edges": [
    {
      "from": "section:0_abstract",
      "to": "claim:this_paper_proposes_a_graph_based_approach",
      "label": "contains"
    }
  ]
}
```

```text
docs2graph <source> [options]

Arguments:
  source              File path, directory path, or URL

Options:
  --graph TYPE        Graph type: knowledge, decision, schema, media, all (default: all)
  --output PATH       Output JSON file (default: stdout)
  --max-files N       Maximum files to extract from a directory
  --max-depth N       Maximum directory recursion depth
  --cache PATH        Cache file for incremental corpus runs
  --scan-only         Build scan graph without loading files
  --include GLOB      Include pattern (repeatable)
  --exclude GLOB      Exclude pattern (repeatable)
  --extension EXT     File extension filter (repeatable)
  -h, --help          Show help
```

- RAG over technical docs: extract section/concept graph, rank on query, pass subgraph as focused context instead of raw chunks
- Research paper analysis: extract entity/citation graph, find what a paper claims and what evidence it cites
- Architecture review: extract decision graphs from ADRs, trace the reasoning behind every architectural choice
- Contract review: extract clause relationships, identify obligations and conditions
- Code understanding: combine with code2graph for cross-document + cross-code context
- Text-to-SQL: combine with graph2sql for schema-aware query generation
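For the RAG use case, the ranked subgraph usually reads better in a prompt as short lines than as a raw dict dump. The sketch below assumes only the documented output shape (`id`, `label`, `content`, `attributes.type` on nodes; `from`, `to`, `label` on edges). The helper names `render_context` and `build_prompt` and the `max_chars` budget are hypothetical.

```python
def render_context(context, max_chars=4000):
    """Render ranked nodes and edges as compact, readable lines."""
    lines = ["Nodes:"]
    for node in context["nodes"]:
        kind = node.get("attributes", {}).get("type", "node")
        text = node.get("content") or node.get("label", "")
        lines.append(f"- [{kind}] {node['id']}: {text}")
    lines.append("Edges:")
    for edge in context["edges"]:
        lines.append(f"- {edge['from']} --{edge['label']}--> {edge['to']}")
    # Crude character budget; a tokenizer-based limit would be more precise.
    return "\n".join(lines)[:max_chars]


def build_prompt(context, question):
    """Combine the rendered subgraph with the user's question."""
    return f"Context:\n{render_context(context)}\n\nQuestion: {question}"
```

With a graph `g` loaded as shown earlier, `build_prompt(g.rank("what are the main risks?", k=10), "what are the main risks?")` would produce a prompt that any chat model can consume.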
- Pure Python: no LLM, no cloud service, no database required
- No LLM dependency: extraction is deterministic and static; LLM enrichment is opt-in and labeled `extraction_method: llm_inferred`
- Deterministic outputs: same input always produces the same graph, making corpus runs reproducible and diffable
- Works with any model: output is plain JSON; pass to GPT-4, Claude, Llama, Mistral, or any other model
- Pluggable: add your own loader or extractor without touching core code
- Shared core: same Personalized PageRank engine as graph2sql
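Because outputs are deterministic, two runs over a corpus can be compared structurally. Here is a minimal sketch of such a diff, assuming both graphs use the documented `{"nodes": [...], "edges": [...]}` shape. `diff_graphs` is a hypothetical helper, not part of the package.

```python
def diff_graphs(old, new):
    """Report node and edge changes between two graph dicts."""
    old_nodes = {n["id"]: n for n in old["nodes"]}
    new_nodes = {n["id"]: n for n in new["nodes"]}

    def edge_keys(graph):
        # An edge is identified by its endpoints and label.
        return {(e["from"], e["to"], e["label"]) for e in graph["edges"]}

    old_edges, new_edges = edge_keys(old), edge_keys(new)
    shared = old_nodes.keys() & new_nodes.keys()
    return {
        "added_nodes": sorted(new_nodes.keys() - old_nodes.keys()),
        "removed_nodes": sorted(old_nodes.keys() - new_nodes.keys()),
        "changed_nodes": sorted(i for i in shared
                                if old_nodes[i] != new_nodes[i]),
        "added_edges": sorted(new_edges - old_edges),
        "removed_edges": sorted(old_edges - new_edges),
    }
```

Loading two exported files with `json.load` and passing them to `diff_graphs` gives a quick summary of what a documentation change did to the graph.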
| Package | What it does |
|---|---|
| graph2sql | Graph-based schema analysis for text-to-SQL — same PPR core |
| code2graph | Code repository → knowledge graph (modules, classes, dependencies) |
Contributions are welcome. See CONTRIBUTING.md for guidelines.
```bash
git clone https://github.com/jw-open/doc2graph
cd doc2graph
pip install -e ".[dev]"
pytest tests/ -v
```

Licensed under Apache-2.0; see LICENSE.