Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models.
- Best-in-class OCR - PaddleOCR-VL-0.9B for accurate text extraction
- Structured extraction - Define schemas, get JSON back
- Built for scale - Queue-based processing for thousands of documents
- Real-time updates - WebSocket notifications for job progress
- Self-hostable - Run on your own infrastructure
# Parse a document
curl -X POST https://api.ocrbase.dev/api/parse \
-H "Authorization: Bearer sk_xxx" \
-F "file=@document.pdf"
# Extract with schema
curl -X POST https://api.ocrbase.dev/api/extract \
-H "Authorization: Bearer sk_xxx" \
-F "file=@invoice.pdf" \
-F "schemaId=inv_schema_123"

Important: Jobs are processed asynchronously. Poll the job status or use WebSocket for real-time updates.
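Because jobs run asynchronously, a client usually needs a small wait loop. The following is a minimal sketch of the polling approach, not the documented ocrbase API: the status endpoint path (`/api/jobs/<id>`), the `status` response field, and the state names `completed` and `failed` are all assumptions made for illustration, and `fetch_status` and `wait_for_job` are hypothetical helper names. Check the API reference for the real endpoint and states before using it.

```shell
#!/bin/sh
# Sketch only. ASSUMPTIONS (hypothetical, not taken from the ocrbase docs):
#   - status endpoint:  GET $API_BASE/api/jobs/<job id>
#   - response field:   "status"
#   - terminal states:  "completed" and "failed"

API_BASE="${API_BASE:-https://api.ocrbase.dev}"
API_KEY="${API_KEY:-sk_xxx}"

# Print the current status string of a job (hypothetical endpoint and field).
# The sed expression pulls the value of "status" out of a JSON response.
fetch_status() {
  curl -sS -H "Authorization: Bearer ${API_KEY}" "${API_BASE}/api/jobs/$1" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage: wait_for_job JOB_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Prints the final state; returns 0 on completed, 1 on failed, 2 on timeout.
wait_for_job() {
  wfj_id="$1"
  wfj_max="${2:-30}"
  wfj_delay="${3:-2}"
  wfj_try=0
  while [ "$wfj_try" -lt "$wfj_max" ]; do
    wfj_try=$((wfj_try + 1))
    wfj_status="$(fetch_status "$wfj_id")"
    case "$wfj_status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed"; return 1 ;;
    esac
    # Any other state (or an empty response) counts as "still running".
    sleep "$wfj_delay"
  done
  echo "timeout"
  return 2
}
```

With the functions loaded, `wait_for_job "$JOB_ID" 60 5` would check up to 60 times at 5-second intervals. A fixed delay keeps the sketch short; for large batches, the WebSocket notifications avoid repeated requests altogether.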
Best practice: Parse documents with ocrbase before sending them to LLMs. Passing raw PDF binary to a model wastes tokens and produces poor results.
See Self-Hosting Guide for deployment instructions.
Requirements: Docker, Bun
License: MIT. See LICENSE for details.
For API access, on-premise deployment, or questions: adammajcher20@gmail.com