MarkFlow

MarkFlow is a provider-agnostic, OCR-first extraction pipeline that converts mixed-quality PDFs into structured Markdown documents and auditable JSON reports. It combines dynamic model discovery, benchmark-driven selection, and intelligent OCR routing to deliver production-grade document extraction.

Key Features

OCR-First Architecture: Prioritizes document understanding quality through benchmark-driven model selection
Provider Agnostic: Works with OpenAI, Anthropic, Gemini, OpenRouter, Z.AI, and custom OpenAI-compatible endpoints
Interactive Setup (TUI): User-friendly terminal interface for model discovery, configuration, and recommendations
Local OCR Support: EasyOCR, RapidOCR, and Tesseract with intelligent fallback routing
Audit-Ready Output: Detailed JSON reports with confidence scores, extraction metadata, and processing decisions
Medical-Grade Quality: Fail-closed validation mode for regulated document workflows

Quick Start

# Clone and setup
git clone https://github.com/yourusername/markflow.git
cd markflow
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

# Run interactive setup
python app.py --tui

# Extract a document
python app.py --input ./documents --output-dir ./out --routing-mode balanced

OCR-First Philosophy

MarkFlow is optimized for document transcription and OCR-extraction workloads where OCR quality is the primary determinant of downstream quality.

Design Priorities:

OCR/document understanding quality as primary signal
Structured extraction quality for forms, tables, and PDF artifacts
Long-context stability for multi-page parsing
Cost and latency as secondary optimization signals

Architecture

app.py                             # CLI/TUI launcher
services/
├── frontend/                      # Frontend runtime/build/deploy assets
├── api/                           # API runtime/build/deploy assets
└── worker/                        # Worker runtime/build/deploy assets
markflow/
├── extraction/                    # OCR/extraction core domain (CLI/core compatibility)
└── ...                            # Core benchmark/routing/security modules

Architectural Separation:

Service Isolation: frontend, API, and worker are independent runtime/build/deploy units
OCR Core: Document analysis, confidence scoring, page-level routing
LLM Abstraction: Provider-agnostic client with async support
Benchmarking: Canonical OCR benchmark integration and model normalization
Selection Engine: Scoring, ranking, and recommendation with explainability
Routing: Task classification and intelligent escalation
TUI Layer: Interactive setup, discovery, and user guidance

Installation

Prerequisites

Python 3.10+
Optional: Tesseract binary for local OCR (Windows: choco install tesseract or download MSI)

Windows (PowerShell)

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Linux / macOS

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configuration

MarkFlow loads configuration from environment variables. Copy .env.example to .env and configure:

Core Settings

LLM_API_KEY (required) – API key for your LLM provider
LLM_BASE_URL (required) – OpenAI-compatible endpoint base URL
LLM_PROVIDER_PRESET (optional) – Preset: custom, openai, anthropic, gemini, openrouter, z-ai
LLM_PROVIDER_NAME (optional) – Human-readable provider label
LLM_MODEL (optional) – Fixed model ID (disables dynamic selection if set)

Provider-Specific Keys

The application auto-detects provider via preset and looks for official environment variables:

Provider	Variable	Format
OpenAI	`OPENAI_API_KEY`	Bearer token
Anthropic	`ANTHROPIC_API_KEY`	API key with `x-api-key` header
Gemini	`GEMINI_API_KEY`	Bearer token
OpenRouter	`OPENROUTER_API_KEY`	Bearer token
Z.AI	`ZAI_API_KEY`	Bearer token

Fallback: Uses LLM_API_KEY if provider-specific variable is not set.

Optional Settings

TESSERACT_CMD – Path to Tesseract binary (Windows: C:\Program Files\Tesseract-OCR\tesseract.exe)
LLM_ZAI_PLAN – Z.AI plan: general or coding (default: general)

Usage

CLI

Extract a single document:

python app.py --input ./documents/report.pdf --output-dir ./out

Batch extract with explicit provider and routing:

python app.py \
  --input ./documents \
  --output-dir ./out \
  --llm-provider-preset openai \
  --llm-base-url "https://api.openai.com" \
  --routing-mode high-accuracy-ocr

Extract with Z.AI coding endpoint:

python app.py \
  --input ./documents \
  --output-dir ./out \
  --llm-provider-preset z-ai \
  --zai-plan coding \
  --routing-mode balanced

Force a specific model (bypass dynamic selection):

python app.py --input ./documents --output-dir ./out --llm-model gpt-4-vision

List all available CLI options:

python app.py --help

Web UI

The browser UI in services/frontend/ talks to the isolated API service via HTTP/SSE contracts.

Run the backend in one terminal:

uvicorn app:create_app --factory --reload --app-dir services/api

Run the frontend in another terminal:

cd services/frontend
copy .env.local.example .env.local
npm install
npm run dev

Open the UI at:

http://localhost:3000

Open the API docs at:

http://127.0.0.1:8000/docs

Local Services (Docker Compose)

Run all services with one command from repository root:

docker compose up --build

Services:

Frontend: http://127.0.0.1:3000
API: http://127.0.0.1:8000
Redis: 127.0.0.1:6379
Worker health: http://127.0.0.1:8001/health

Local Services (Independent)

Start API:

uvicorn app:create_app --factory --host 0.0.0.0 --port 8000 --app-dir services/api

Start worker loop:

python services/worker/entrypoint.py

Start frontend:

cd services/frontend
npm ci
npm run dev

Interactive Mode (TUI)

Launch the interactive setup wizard:

python app.py --tui

The TUI guides you through:

API Key Input – Securely enter credentials (input is masked)
Provider Selection – Choose from presets or custom endpoint
Model Discovery – Auto-fetch available models from /v1/models
OCR Benchmark Ingestion – Score models against OCRBench v2 signals
Routing Mode Selection – Choose optimization strategy
Recommendation Output – View why each model was ranked, with tradeoff analysis

Execution Modes

Select with --mode:

Mode	Use Case
`auto`	Adaptive behavior for mixed document quality (default)
`fast`	Speed-optimized; minimal post-processing for high-throughput
`quality`	Fail-closed with strict confidence enforcement and recovery
`local`	No remote LLM calls; text extraction and local OCR only
`remote`	Remote-first with intelligent fallback to local methods

Routing Modes

Select with --routing-mode:

Mode	Optimization	Best For
`fast`	Lowest latency and cost	Cost-sensitive, high-volume batches
`balanced`	OCR quality ↔ cost ↔ speed	General-purpose extraction
`high-accuracy-ocr`	Maximum document understanding quality	Medical, legal, regulated documents

Benchmark-Driven Model Selection

MarkFlow uses the canonical OCRBench v2 benchmark to inform model selection:

Canonical Benchmark Source: https://99franklin.github.io/ocrbench_v2/

Selection Strategy:

Discover available models from provider via /v1/models
Normalize model IDs against benchmark dataset
Score on OCR quality, structured extraction, and long-context signals
Weight by routing mode (e.g., high-accuracy-ocr emphasizes document understanding)
Rank and return with explainable reasoning

Confidence Handling:

Models with strong benchmark evidence → high confidence in ranking
Models without benchmark data → conservative fallback scoring with warnings
Missing fields → heuristic proxies (context length, inference latency)

Output Formats

Markdown Output

<filename>.canonical.md – Clean, readable document in Markdown format.

# Document Title

## Section 1

Content extracted and formatted as structured Markdown.

## Tables

| Column 1 | Column 2 |
|----------|----------|
| Value    | Value    |

JSON Report

<filename>.canonical.report.json – Detailed audit trail and metadata.

{
  "document_id": "report_001",
  "processing_summary": {
    "total_pages": 10,
    "extraction_confidence": 0.94,
    "llm_review_required_pages": 2,
    "llm_review_passed_pages": 1,
    "needs_reprocess_pages": 1
  },
  "pages": [
    {
      "page_number": 1,
      "status": "accepted",
      "confidence": 0.98,
      "extraction_source": "text-layer",
      "cache_hit": false
    }
  ]
}

Security

API Key Handling

MarkFlow implements strict API key security:

No Persistence: API key is never written to disk, logs, or output files
Masked Input: TUI masks API key entry in terminal
Memory-Only: Credentials kept in process memory only
No CLI Arguments: API key is never accepted as CLI argument (prevents shell history leaks)
Error Redaction: Error messages redact token-like values
HTTPS Enforced: LLM base URLs restricted to HTTPS (HTTP allowed only on localhost)
Redirect Disabled: Authenticated requests block redirects to prevent credential forwarding

Important Limitations

While MarkFlow implements best-effort API key handling, no software can guarantee 100% security under every threat model. Use in secure environments and follow your organization's credential management policies.

Service Isolation Rules

This repository enforces strict service boundaries for frontend, API, and worker slices.

Policy: docs/architecture/service-isolation-policy.md
Event contracts: docs/architecture/event-contracts.md
HTTP contracts: services/api/contracts/http/
Machine-readable event schemas: services/api/contracts/events/ and services/worker/contracts/events/
Deterministic boundary check command: python scripts/check_service_boundaries.py

Development

Setup

python -m venv .venv
source .venv/bin/activate  # or .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt -r requirements-dev.txt

Quality Gates

Before submitting a PR, run:

python -m compileall -q app.py markflow tests
python -m ruff check .
python -m black --check .
python -m flake8 app.py markflow tests
python -m mypy markflow
pytest -m unit -q --no-cov
pytest -m "integration and not slow" -q --no-cov
pytest -m functional -q --no-cov
pytest -m spec -q --no-cov
pytest -q
python -m coverage report --fail-under=100

Standards

PEP 8 compliance with 100-char line length
Docstrings required for all public functions (PEP 257)
Provider Neutrality: No vendor-specific assumptions in core layers
Architecture Boundaries: Clear separation across OCR, LLM, benchmarking, routing, and TUI layers
Data Privacy: Never commit personal or sensitive sample documents

Contributing

See CONTRIBUTING.md for detailed development workflow and PR guidelines.

GSD Workflow

This repository is migrated to the GSD spec-driven workflow for Codex.

Local GSD runtime: .codex/
Project context: .planning/PROJECT.md
Requirements: .planning/REQUIREMENTS.md
Roadmap/state: .planning/ROADMAP.md, .planning/STATE.md
Codebase map: .planning/codebase/
Machine-readable feature specs: .planning/specs/features.json
Acceptance matrix: .planning/specs/feature_acceptance_matrix.md

The spec traceability gate is automated:

pytest -m spec -q --no-cov

That test verifies every v1 requirement has exactly one feature spec, every spec lists implementation and test paths, every acceptance section exists, every referenced test file contains executable tests, every referenced path exists, and every production module under markflow/ is mapped to at least one feature spec. Use the installed local GSD commands in Codex with $gsd-help, $gsd-progress, $gsd-plan-phase, and $gsd-verify-work.

Copilot and Codex Integration

MarkFlow supports deterministic cross-agent continuity between GitHub Copilot and Codex.

Copilot instructions: .github/copilot-instructions.md
Skill mapping and fallback matrix: .planning/SKILLS_FOR_COPILOT.md
Session checkpoint template: .planning/copilot-state.md
Handoff protocol: .planning/COPILOT_HANDOFF_PROTOCOL.md
Shared context baseline: .planning/CONTEXT_FOR_AGENTS.md
Operational runbook: docs/INTEGRATION_COPILOT_CODEX.md

Recommended flow:

Start from .planning/STATE.md and .planning/decisions/DECISION_LOG.md.
Execute with Copilot for interactive planning/review.
Hand off to Codex when orchestration or multi-wave execution is needed.
Resume in Copilot using the saved checkpoint and Codex completion block.

Governance Artifacts

Release changelog: CHANGELOG.md
Decision rationale changelog: docs/CHANGELOG_DECISOES.md
Canonical decision ledger: .planning/decisions/DECISION_LOG.md
Decision ledger skill: .codex/skills/gsd-decision-ledger/SKILL.md

License

MIT License. See LICENSE for details.

Need Help?

Check EXECUTION_MODES.md for detailed mode and routing documentation
See CONTRIBUTING.md for development setup
Open an issue with reproduction steps and MarkFlow version (cat VERSION)

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.codex		.codex
.github		.github
.planning		.planning
docs		docs
markflow		markflow
scripts		scripts
services		services
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.flake8		.flake8
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
EXECUTION_MODES.md		EXECUTION_MODES.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
VERSION		VERSION
app.py		app.py
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
render.yaml		render.yaml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

MarkFlow

Key Features

Table of Contents

Quick Start

OCR-First Philosophy

Architecture

Installation

Prerequisites

Windows (PowerShell)

Linux / macOS

Configuration

Core Settings

Provider-Specific Keys

Optional Settings

Usage

CLI

Web UI

Local Services (Docker Compose)

Local Services (Independent)

Interactive Mode (TUI)

Execution Modes

Routing Modes

Benchmark-Driven Model Selection

Output Formats

Markdown Output

JSON Report

Security

API Key Handling

Important Limitations

Service Isolation Rules

Development

Setup

Quality Gates

Standards

Contributing

GSD Workflow

Copilot and Codex Integration

Governance Artifacts

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages