Docler

Markdown Conventions for OCR Output

This project uses Markdown as the primary, self-contained format for storing OCR results and their metadata. The goal is a single, versionable, human-readable file per processed document, which simplifies pipeline management and data provenance.

We employ a hybrid approach, using different mechanisms for different types of metadata:

1. Metadata Comments (for Non-Visual Markers)

For metadata that should not affect the visual rendering of the Markdown (like page boundaries or page-level information), we use specially formatted HTML/XML comments.

Format:

<!-- prefix:data_type {compact_json_payload} -->

prefix: A namespace identifier to prevent clashes. Defaults to docler.
data_type: A string indicating the kind of metadata (e.g., page_break, page_meta).
{compact_json_payload}: A standard JSON object containing the metadata key-value pairs, serialized compactly (no unnecessary whitespace, keys sorted).
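As a rough illustration of the format above, the comment string can be assembled with the standard library alone. This is a minimal sketch, not the project's implementation: the function name make_metadata_comment is hypothetical, and only the prefix, data type, and compact sorted JSON rules come from the description above.

```python
import json


def make_metadata_comment(data_type: str, payload: dict, prefix: str = "docler") -> str:
    """Build a comment of the form <!-- prefix:data_type {json} --> (hypothetical helper)."""
    # separators=(",", ":") drops optional whitespace; sort_keys=True gives stable key order.
    compact = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    return f"<!-- {prefix}:{data_type} {compact} -->"


comment = make_metadata_comment("page_meta", {"page_num": 1, "width": 612, "height": 792})
print(comment)
# <!-- docler:page_meta {"height":792,"page_num":1,"width":612} -->
```

Sorting the keys keeps the serialized comment identical across runs, which keeps diffs small when the file is versioned. Note that a string value containing --> would end the comment early; this sketch does not guard against that.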

Defined Types:

page_break: Marks the transition to the specified page number. Placed immediately before the content of the new page.
- Example Payload: {"next_page": 2}
- Example Comment: <!-- docler:page_break {"next_page":2} -->
page_meta: Contains metadata specific to a page (e.g., dimensions, confidence). Often placed near the beginning of the page's content or alongside the page_break comment.
- Example Payload: {"page_num": 1, "width": 612, "height": 792, "confidence": 0.98}
- Example Comment: <!-- docler:page_meta {"confidence":0.98,"height":792,"page_num":1,"width":612} -->
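Reading these comments back out of a document might look like the following. It is a sketch under stated assumptions: the regular expression and the find_metadata name are illustrative and are not taken from the library, and the payload is assumed to sit on a single line, as compact serialization implies.

```python
import json
import re

# Matches <!-- prefix:data_type {json} -->; the payload is assumed to be on one line.
COMMENT_RE = re.compile(
    r"<!--\s*(?P<prefix>[\w-]+):(?P<type>[\w-]+)\s+(?P<payload>\{.*?\})\s*-->"
)


def find_metadata(markdown: str, prefix: str = "docler") -> list[tuple[str, dict]]:
    """Return (data_type, payload) pairs for every comment with the given prefix."""
    found = []
    for match in COMMENT_RE.finditer(markdown):
        if match.group("prefix") == prefix:
            found.append((match.group("type"), json.loads(match.group("payload"))))
    return found


sample = (
    '<!-- docler:page_meta {"confidence":0.98,"height":792,"page_num":1,"width":612} -->\n'
    "First page text.\n"
    '<!-- docler:page_break {"next_page":2} -->\n'
    "Second page text.\n"
)
for data_type, payload in find_metadata(sample):
    print(data_type, payload)
```

Filtering on the prefix is what lets comments from other tools coexist in the same file without being misread as Docler metadata.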

2. HTML Figures (for Images and Diagrams)

For visual elements like images or diagrams, especially when they require richer metadata (like source code or bounding boxes), we use standard HTML structures within the Markdown. This allows direct association of metadata and handles complex data like code snippets gracefully.

Structure:

We typically use an HTML <figure> element:

<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
  <img src="images/system_architecture.png"
       alt="System Architecture Diagram"
       data-page-num="5"
       style="max-width: 100%; height: auto;"
       >
  <figcaption>Figure 2: High-level system data flow.</figcaption>
  <script type="text/docler-mermaid">
    graph LR
        A[Data Ingest] --> B(Processing Queue);
        B --> C{Main Processor};
        D --> F(API Endpoint);
  </script>
</figure>

<figure>: The container element.
- data-docler-type: Indicates the type of figure (e.g., image, diagram).
- Other data-* attributes can be added for figure-level metadata.
<img>: The visual representation.
- src, alt: Standard attributes.
- data-*: Used for image-specific metadata like data-page-num.
- style: Optional for basic presentation.
<figcaption>: Optional standard HTML caption.
<script type="text/docler-...">: Used to embed source code or other complex textual data.
- The type attribute is custom (e.g., text/docler-mermaid, text/docler-latex) so browsers ignore it.
- The raw code/text is placed inside, preserving formatting.
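To show how the embedded source could be recovered from such a figure, here is a small sketch built on Python's standard html.parser module. The class name FigureSourceExtractor is hypothetical and this is not the library's own parser; it only relies on the text/docler-... type convention described above.

```python
from html.parser import HTMLParser

TYPE_PREFIX = "text/docler-"


class FigureSourceExtractor(HTMLParser):
    """Collect the text of <script type="text/docler-..."> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []   # list of (kind, source) pairs, e.g. ("mermaid", "graph LR ...")
        self._kind = None   # set while inside a matching <script>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and script_type.startswith(TYPE_PREFIX):
            self._kind = script_type[len(TYPE_PREFIX):]
            self._buffer = []

    def handle_data(self, data):
        # Script content is delivered as raw text, so "-->" arrows survive unchanged.
        if self._kind is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._kind is not None:
            self.sources.append((self._kind, "".join(self._buffer)))
            self._kind = None


figure_html = """
<figure data-docler-type="diagram" data-diagram-id="sysarch-01">
  <img src="images/system_architecture.png" alt="System Architecture Diagram" data-page-num="5">
  <script type="text/docler-mermaid">
    graph LR
        A[Data Ingest] --> B(Processing Queue);
  </script>
</figure>
"""

extractor = FigureSourceExtractor()
extractor.feed(figure_html)
extractor.close()
kind, source = extractor.sources[0]
print(kind)
print(source.strip())
```

Because the parser treats script content as raw text, the diagram source comes back with its original indentation and line breaks.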

Rationale

Comments are used for page breaks and metadata because they are guaranteed not to interfere with Markdown rendering, ensuring purely structural information remains invisible.
HTML Figures are used for images/diagrams because HTML provides standard ways (data-*, nested elements like <script>) to directly associate rich, potentially complex or multi-line metadata (like source code) with the visual element itself.

Utilities

Helper functions for creating and parsing these metadata comments and structures are available in mkdown.

Standardized Metadata Types

The library provides standardized metadata types for common use cases:

Page Breaks: Use the create_page_break() function to create page transitions:

from mkdown import create_page_break

# Create a page break marker for page 2
page_break = create_page_break(next_page=2)
# <!-- docler:page_break {"next_page":2} -->
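Once page break markers are in place, a consumer can cut the document back into pages. The sketch below uses only the standard library rather than mkdown's own parsing helpers, whose names are not shown here. The split_pages function is hypothetical, and it assumes that any text before the first marker belongs to page 1.

```python
import json
import re

PAGE_BREAK_RE = re.compile(r"<!--\s*docler:page_break\s+(\{.*?\})\s*-->")


def split_pages(markdown: str, first_page: int = 1) -> dict[int, str]:
    """Map page numbers to the text found between page_break markers."""
    pages = {}
    current = first_page  # assumption: content before the first marker is page 1
    position = 0
    for match in PAGE_BREAK_RE.finditer(markdown):
        pages[current] = markdown[position:match.start()].strip()
        current = json.loads(match.group(1))["next_page"]
        position = match.end()
    pages[current] = markdown[position:].strip()
    return pages


document = 'Page one.\n<!-- docler:page_break {"next_page":2} -->\nPage two.\n'
print(split_pages(document))
```

Taking the page number from the payload instead of counting markers keeps the mapping correct even if a page is missing from the source.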

Chunk Boundaries: Use the create_chunk_boundary() function to mark semantic chunks in a document:

from mkdown import create_chunk_boundary

# Create a chunk boundary marker with metadata
chunk_marker = create_chunk_boundary(
    chunk_id=1,
    start_line=10,
    end_line=25,
    keywords=["introduction", "overview"],
    token_count=350,
)
# <!-- docler:chunk_boundary {"chunk_id":1,"end_line":25,"keywords":["introduction","overview"],"start_line":10,"token_count":350} -->

SOON:

FastAPI demo (bring your own keys) on https://contexter.net
