ocrbase

Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models.

Features

Best-in-class OCR - PaddleOCR-VL-0.9B for accurate text extraction
Structured extraction - Define schemas, get JSON back
Built for scale - Queue-based processing for thousands of documents
Real-time updates - WebSocket notifications for job progress
Self-hostable - Run on your own infrastructure

API Usage

# Parse a document
curl -X POST https://api.ocrbase.dev/api/parse \
  -H "Authorization: Bearer sk_xxx" \
  -F "file=@document.pdf"

# Extract with schema
curl -X POST https://api.ocrbase.dev/api/extract \
  -H "Authorization: Bearer sk_xxx" \
  -F "file=@invoice.pdf" \
  -F "schemaId=inv_schema_123"
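
The same two requests can be made from TypeScript. The following is a minimal sketch, not an official client: the base URL, endpoint paths, bearer header, and the `file` and `schemaId` form fields are taken from the curl calls above, while `postDocument` and `FetchLike` are hypothetical names introduced for illustration. The response format is not described here, so the parsed JSON is returned as `unknown`.

```typescript
// Sketch of a fetch-based helper mirroring the curl examples.
// Only the URL, paths, header, and form field names come from the README;
// everything else (function name, error handling) is an assumption.
const BASE_URL = "https://api.ocrbase.dev";

// Injectable fetch so the helper can be exercised without network access.
type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

async function postDocument(
  path: "/api/parse" | "/api/extract",
  apiKey: string,
  file: Blob,
  fileName: string,
  extraFields: Record<string, string> = {},
  fetchImpl: FetchLike = fetch,
): Promise<unknown> {
  const form = new FormData();
  form.append("file", file, fileName);
  for (const [key, value] of Object.entries(extraFields)) {
    form.append(key, value);
  }

  // Content-Type is left unset on purpose: fetch adds the multipart boundary.
  const response = await fetchImpl(`${BASE_URL}${path}`, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });

  if (!response.ok) {
    throw new Error(`ocrbase request failed: HTTP ${response.status}`);
  }
  // Response shape is undocumented here, so callers must narrow it themselves.
  return response.json();
}
```

Under Bun, `Bun.file("invoice.pdf")` is already a `Blob`, so a call equivalent to the second curl example would be `await postDocument("/api/extract", "sk_xxx", Bun.file("invoice.pdf"), "invoice.pdf", { schemaId: "inv_schema_123" })`.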

Important: Jobs are processed asynchronously. Poll the job status, or subscribe over WebSocket for real-time updates.
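
The polling option could look roughly like the sketch below. It is written under assumptions: the job status endpoint and its response are not documented in this README, so the `Job` shape, the status names (`pending`, `processing`, `completed`, `failed`), and the helper `pollUntilDone` are all hypothetical. The status lookup is passed in as a function so the loop does not depend on any particular endpoint.

```typescript
// Hypothetical job shape and status values; verify against real API responses.
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
}

interface PollOptions {
  intervalMs?: number; // delay between status checks
  timeoutMs?: number; // overall budget before giving up
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls getJob repeatedly until the job completes, fails, or the timeout passes.
async function pollUntilDone(
  getJob: () => Promise<Job>,
  { intervalMs = 2000, timeoutMs = 120_000 }: PollOptions = {},
): Promise<Job> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const job = await getJob();
    if (job.status === "completed") return job;
    if (job.status === "failed") throw new Error(`Job ${job.id} failed`);
    // Stop early if the next check would land after the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Timed out waiting for job ${job.id}`);
    }
    await sleep(intervalMs);
  }
}
```

In practice `getJob` would wrap an authenticated request to whatever status endpoint the API exposes. A fixed interval keeps the sketch short; for long-running jobs, a growing delay between checks or the WebSocket notifications would reduce request volume.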

LLM Integration

Best practice: Parse documents with ocrbase before sending them to LLMs. Sending raw PDF binary wastes tokens and produces poor results.
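
As an illustration of this workflow, the sketch below builds a chat-style prompt from already-parsed document text instead of the PDF bytes. It assumes the parse result has been reduced to a plain string. The `ChatMessage` shape follows the common role/content convention and is not tied to a specific LLM provider. `buildMessages` and the `maxChars` budget are hypothetical, and character count is only a crude stand-in for a token limit.

```typescript
// Generic role/content message shape; adapt to the LLM SDK actually in use.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Builds a prompt from parsed text, truncating to a rough character budget.
function buildMessages(
  parsedText: string,
  question: string,
  maxChars = 20_000, // assumed budget, tune for the target model
): ChatMessage[] {
  const truncated = parsedText.length > maxChars;
  const documentText = truncated ? parsedText.slice(0, maxChars) : parsedText;
  const label = truncated ? "Document (truncated)" : "Document";

  return [
    {
      role: "system",
      content: "Answer using only the document provided by the user.",
    },
    {
      role: "user",
      content: `${label}:\n${documentText}\n\nQuestion: ${question}`,
    },
  ];
}
```

The returned array can be passed as the message list to most chat completion APIs. The model then reads extracted text rather than encoded binary.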

Self-Hosting

See the Self-Hosting Guide for deployment instructions.

Requirements: Docker, Bun

Architecture

License

MIT - See LICENSE for details.

Contact

For API access, on-premise deployment, or questions: adammajcher20@gmail.com
