tablecodec

One lossless Internal Representation (IR) for image-based table-recognition datasets, plus a registry of codecs that translate between the IR and the fragmented public formats — PubTabNet, FinTabNet, OTSL, TableFormer, DocTags-tables, PubTables-1M, TableBank.

Read any of them into one neutral shape, validate it, convert between formats, and get a static, data-free loss report for any conversion before you run it. The core has zero third-party runtime dependencies — import tablecodec works on a bare Python 3.11+; heavier features (TEDS, CLI, HF streaming) are opt-in extras.

docs/spec.md is the source of truth. The 0.x line makes no API-stability promises; the public surface freezes at 1.0 (SPEC §14).

Install

pip install tablecodec            # stdlib-only core
pip install "tablecodec[cli]"     # + command-line interface (click)
pip install "tablecodec[teds]"    # + TEDS similarity metric (apted, lxml)

Quick start

import tablecodec
from tablecodec import codecs, validate, profiles, analyze_loss
from tablecodec.codecs.pubtabnet import PubTabNet20Codec

# Register a codec (the CLI self-registers the built-ins; in library use you
# register the ones you need).
codecs.register(PubTabNet20Codec())

# Stream-read a dataset into the neutral IR (constant memory).
with open("pubtabnet_val.jsonl", encoding="utf-8") as f:
    for sample in codecs.get("pubtabnet-2.0.0").read(f):
        errors = validate(sample, profile=profiles.DEFAULT)
        if errors:
            print(sample.filename, errors)

# Static, data-free loss analysis between two formats.
report = analyze_loss(source="pubtabnet-2.0.0", target="otsl-1.0.0")
print(report.round_trip_classification)  # "structure-preserving"

Supported

Verified in CI (see .github/workflows/ci.yaml).

Component	Supported	Notes
Python	3.11 – 3.14	core is stdlib-only (zero runtime deps, SPEC §13)
Codecs	9 built-in	`pubtabnet-1.0.0/2.0.0`, `otsl-1.0.0`, `fintabnet`, `fintabnet-otsl`, `tableformer`, `tablebank`, `pubtables-1m`, `doctags-tables`
Extras	`[cli]` `[teds]` `[hf]`	click · apted+lxml · datasets (occasional/local e2e)
Bridge	`docling-tables`	a separate `tablecodec-docling` package (`packages/`, own version)

Auto-generated capability tables: format support · loss matrix. Dependency bumps within these ranges are tracked by Dependabot.

TEDS similarity (`[teds]` extra)

A Tree-Edit-Distance-based Similarity score between two samples. It lives outside the core (it imports apted/lxml), so import it from its submodule:

from tablecodec.teds import teds

score = teds(pred_sample, true_sample)                        # 0.0 .. 1.0
struct = teds(pred_sample, true_sample, structure_only=True)  # ignore cell text

CLI (`[cli]` extra)

tablecodec codecs list
tablecodec analyze-loss --from pubtabnet-2.0.0 --to otsl-1.0.0
tablecodec validate path/to/dataset.jsonl --codec pubtabnet-2.0.0 --profile DEFAULT
tablecodec stats path/to/dataset.jsonl --codec pubtabnet-2.0.0 --json
tablecodec convert in.jsonl out.jsonl --from pubtabnet-2.0.0 --to otsl-1.0.0
tablecodec convert in.jsonl /dev/null --from pubtabnet-2.0.0 --to otsl-1.0.0 --dry-run
tablecodec diff a.jsonl b.jsonl --codec pubtabnet-2.0.0

All commands stream their input; exit codes are non-zero on validation failures or diffs (suitable for CI / data pipelines).

End-to-end check against real datasets

scripts/e2e_hf_check.py streams real datasets through the codecs and validates the resulting IR. It is occasional / local-only (network + multi-GB datasets), not part of CI. Every shipped codec gets at least one official-corpus check, from three sources:

the Docling OTSL family (docling-project/{PubTabNet,FinTabNet,PubTables-1M,SynthTabNet}_OTSL) — a uniform converted schema that feeds all nine codecs;
the native first-published PubTabNet annotation (apoidea/pubtabnet-html) fed unmodified to the pubtabnet codecs;
the native PubTables-1M PASCAL VOC structure annotation (bsmock/pubtables-1m, download-only) with the logical grid reconstructed for the pubtables-1m codec.

just e2e-selftest              # network-free adapter smoke test
just e2e 200                   # 200 randomly-sampled rows per check (needs [hf] extra)
just e2e-fetch-pubtables1m     # download native PubTables-1M VOC (~30MB) into input/

Rows are sampled randomly and each run prints its --seed, so repeated runs progressively cover the corpora and any finding is reproducible. Failures are appended to output/e2e_findings/ (gitignored) with a replayable payload. See ADR 0003 and ADR 0004 for the data-source decisions and the canonical-vs-real-shape caveats.

Documentation

docs/spec.md — Specification (the single source of truth).
docs/glossary.md — Precise vocabulary: terms tablecodec defines vs. borrows (e.g. "loss" vs a "degenerate" bbox).
docs/intent.md — Implementation brief and roadmap (milestones, quality bar, §8 future work).
docs/adr/ — the decisions and their reasoning (the "Why").
CHANGELOG.md — Keep a Changelog format.

Development

just install      # editable install with dev + cli + teds extras
just ci           # lint + pyright (strict) + pytest + semgrep + docs-check
just docs         # regenerate the codec/loss tables (docs-check enforces freshness)
just ci-all       # core + the in-repo tablecodec-docling bridge

Releases are published from GitHub Actions via PyPI OIDC Trusted Publishing (no long-lived token), carrying PEP 740 attestations and a SLSA build provenance (ADR 0014).

License

MIT. See LICENSE. The OTSL grid-reconstruction logic and the TEDS metric are adapted (with attribution) from upstream MIT / Apache-2.0 sources — see THIRD_PARTY_NOTICES.md.

Name		Name	Last commit message	Last commit date
Latest commit History 148 Commits
.github		.github
.semgrep		.semgrep
conformance		conformance
docs		docs
packages/tablecodec-docling		packages/tablecodec-docling
scripts		scripts
src/tablecodec		src/tablecodec
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
THIRD_PARTY_NOTICES.md		THIRD_PARTY_NOTICES.md
justfile		justfile
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tablecodec

Install

Quick start

Supported

TEDS similarity (`[teds]` extra)

CLI (`[cli]` extra)

End-to-end check against real datasets

Documentation

Development

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

tablecodec

Install

Quick start

Supported

TEDS similarity ([teds] extra)

CLI ([cli] extra)

End-to-end check against real datasets

Documentation

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

TEDS similarity (`[teds]` extra)

CLI (`[cli]` extra)

Packages