newelh/udoc: A unified document extraction toolkit. CLI-friendly, LLM-friendly, production-pipeline-friendly.

Dependency-free extraction from documents.

Extract text, tables, JSON, or rendered pages. Available as a CLI, as Python bindings, and as a pure Rust library. No external parsers, libraries, or system packages are required. Provides hooks for OCR, layout detection, and entity extraction. Permissively dual-licensed under MIT / Apache-2.0.

Supports PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, ODT, ODS, ODP, RTF, and Markdown.

Try it out using uv, no install required:

curl -sL https://arxiv.org/pdf/1706.03762 \
  | uvx udoc - | grep -A 18 '^Abstract'

Installation

# uv
uv add udoc

# pip
pip install udoc

# cargo (coming soon)

To build from source, see Compiling from source.

Highlights

One Document model across formats. A content spine of Block and Inline nodes, plus optional presentation, relationships, and interactions overlays. Disable any overlay via Config.
Legacy binary Office. Native parsers for .doc, .xls, and .ppt. Per-format details in the format guides.
Streaming page-by-page. The Extractor defers per-page work. A 10 GB PDF does not have to fit in memory.
Typed diagnostics. Recoverable issues become structured warnings filterable by kind. Examples: font fallbacks, malformed xref, stream-length mismatches.
Hooks for OCR, layout, and annotation. JSONL protocol for Tesseract, cloud OCR APIs, DocLayout-YOLO, GLM-OCR, vision-language models, NER, or any subprocess that reads JSON line-by-line.
LLM tool use. The Agent instructions page is a paste-into-context description of udoc's CLI for assistants.
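
The hooks highlight above mentions a JSONL protocol for any subprocess that reads JSON line-by-line. A minimal sketch of such a subprocess loop follows. The record keys ("page", "text") and the function names are hypothetical placeholders; this page does not describe udoc's hook schema, so take the real field names from the hook documentation.

```python
import json
import sys
from typing import Iterable, Iterator


def handle_request(request: dict) -> dict:
    """Build one response record for one request record.

    The keys "page" and "text" are hypothetical, not udoc's documented schema.
    A real hook would run OCR, layout detection, or NER here.
    """
    return {"page": request.get("page"), "text": ""}


def run_hook(lines: Iterable[str]) -> Iterator[str]:
    """Read one JSON object per input line and yield one JSON line per request."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.dumps(handle_request(json.loads(line)))


def main() -> None:
    """Wire the loop to stdin/stdout, flushing after every record."""
    for out in run_hook(sys.stdin):
        # Flushing keeps the parent process from waiting on a full buffer.
        print(out, flush=True)
```

Calling main() from a script turns it into a line-oriented filter. Keeping the per-record logic in handle_request makes the hook testable without spawning a process.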

Usage

CLI

udoc paper.pdf                     # text to stdout
udoc -j paper.pdf                  # full document as JSON
udoc -J paper.pdf                  # streaming JSONL (one record per page)
udoc -t spreadsheet.xlsx           # tables only as TSV
udoc -p 1-5,10 paper.pdf           # page range
udoc render paper.pdf -o ./pages   # rasterise PDF pages to PNG
cat paper.pdf | udoc -             # read from stdin
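
The -t flag above emits tables as TSV. A minimal sketch of reading that output back into rows with Python's standard csv module follows. The sample string is a made-up stand-in for udoc's stdout; how udoc separates multiple tables or escapes embedded tabs is not covered here and is left as an assumption to check against the CLI reference.

```python
import csv
import io
from typing import List


def parse_tsv(text: str) -> List[List[str]]:
    """Split TSV text into rows of cell strings, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Stand-in for the stdout of `udoc -t spreadsheet.xlsx`; the values are invented.
sample = "name\tqty\nwidget\t3\ngadget\t7\n"
rows = parse_tsv(sample)
header, body = rows[0], rows[1:]
print(header)
print(body)
```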

A few real-world piping recipes:

curl -sL https://arxiv.org/pdf/1706.03762 | udoc - | head -40
udoc paper.pdf | grep -i 'attention'
udoc -J docs/*.pdf | jq '.metadata.title'

By default, udoc writes plain text to stdout; flags select structured output. Stderr is silent unless you pass -v. The full flag list lives in the CLI reference.
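
For scripts that prefer to shell out to the CLI, the sketch below builds an argument list from the flags shown above (-j, -p, -v) and runs it with subprocess. The helper names are hypothetical. Two assumptions are baked in: that -j and -p can be combined, and that udoc exits with a non-zero status on failure. Neither is confirmed on this page.

```python
import subprocess
from typing import List, Optional


def build_udoc_command(
    path: str,
    *,
    json_output: bool = False,
    pages: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble a udoc argument list using only flags shown in this README."""
    cmd = ["udoc"]
    if json_output:
        cmd.append("-j")  # full document as JSON
    if pages is not None:
        cmd += ["-p", pages]  # passed through unchanged, e.g. "1-5,10"
    if verbose:
        cmd.append("-v")  # stderr is silent without it
    cmd.append(path)
    return cmd


def run_udoc(cmd: List[str]) -> str:
    """Run the command and return stdout.

    check=True raises CalledProcessError on a non-zero exit status
    (assumes udoc signals failure that way).
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


print(build_udoc_command("paper.pdf", pages="1-5,10"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.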

Python

import udoc

# One-shot extraction. Format detected from magic bytes.
doc = udoc.extract("paper.pdf")
print(doc.metadata.title)
for block in doc.blocks():
    print(block.text)

# Stream page by page; large documents do not have to fit in memory.
with udoc.stream("large.pdf") as ext:
    for i in range(len(ext)):
        print(f"page {i}: {ext.page_text(i)[:80]}")

# In-memory bytes with options.
with open("encrypted.pdf", "rb") as f:
    doc = udoc.extract_bytes(f.read(), password="secret")
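
The comment above notes that the format is detected from magic bytes. As an illustration of the general technique (not udoc's actual detector), the sketch below checks the well-known leading signatures of the container families behind the supported formats. The function and table names are hypothetical.

```python
from typing import Optional

# Leading byte signatures of common container families.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc, .xls, .ppt
    (b"PK\x03\x04", "zip"),  # .docx, .xlsx, .pptx and .odt, .ods, .odp
    (b"{\\rtf", "rtf"),
]


def sniff_container(data: bytes) -> Optional[str]:
    """Return the container family for the leading bytes, or None if unknown."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None  # e.g. Markdown, which has no signature


print(sniff_container(b"%PDF-1.7\n"))
print(sniff_container(b"# A Markdown heading\n"))
```

A signature only identifies the container. Telling .docx from .xlsx, or .doc from .xls, requires looking inside the ZIP archive or OLE2 compound file.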

PDF table detection and reading order are heuristic. Born-digital documents with clean ruling and standard column flow extract cleanly out of the box; the PDF format guide covers the failure modes and when to attach a layout-detection or OCR hook.

The Guide walks through configuration, overlays, diagnostics, chunking, and batch processing. The Python Library reference lists every function, class, and exception.
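
Page-by-page streaming pairs naturally with batching. The sketch below groups page text into fixed-size batches using only the two operations the streaming example above relies on: len(ext) and ext.page_text(i). It is a plain helper, not udoc's built-in chunking. PageSource, iter_page_batches, and FakeExtractor are hypothetical names, and FakeExtractor stands in for a real extractor so the example runs without udoc installed.

```python
from typing import Iterator, List, Protocol


class PageSource(Protocol):
    """Anything that reports a page count and returns text for one page."""

    def __len__(self) -> int: ...

    def page_text(self, index: int) -> str: ...


def iter_page_batches(ext: PageSource, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size page texts, in page order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for i in range(len(ext)):
        batch.append(ext.page_text(i))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


class FakeExtractor:
    """In-memory stand-in for a streaming extractor (illustration only)."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    def __len__(self) -> int:
        return len(self._pages)

    def page_text(self, index: int) -> str:
        return self._pages[index]


for batch in iter_page_batches(FakeExtractor(["a", "b", "c"]), 2):
    print(batch)
```

Only one batch is held at a time, so memory use is bounded by batch_size rather than by document length.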

Rust

let doc = udoc::extract("paper.pdf")?;
println!("{:?}", doc.metadata.title);
for block in &doc.content {
    println!("{}", block.text());
}

The Rust facade mirrors the Python shape. Document is udoc_core::document::Document; iteration is by direct field access (doc.content, doc.metadata, doc.images). The Rust Library reference covers the facade, the per-format backends, configuration presets, diagnostics, and the trait that backends implement.

The full hosted manual lives at https://newelh.github.io/udoc.

Status

This is the initial alpha release of udoc. APIs and outputs are subject to change. Bug reports, ergonomic suggestions, and notes on format quirks are welcome on the issue tracker.

Security

Report vulnerabilities through GitHub Security Advisories (preferred) or, if that is not workable for you, email me@newel.dev.

See SECURITY.md for the disclosure process and docs/security.md for the unsafe-code policy and audit.

Contributing

Issues are welcome. Pull requests are not currently accepted on this repository; udoc is solo-maintained during the alpha period. File an issue describing the change you would like to see; if it is a good fit, it will land in a future release. The full policy is in CONTRIBUTING.md.

Licence

Dual-licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

