New OCR & Embedding Providers

@MartimChaves released this 06 Feb 18:12
14710c7

Added

  • DatalabOCR: New OCR processor that uses the Datalab API with the marker model
    • Supports three processing modes: fast, balanced, and accurate
    • Page-range filtering and a maximum-page limit
    • Image extraction with optional captions
    • Cost tracking per page processed
  • OpenAIEmbedder: New embedder using OpenAI's embedding models
    • Support for text-embedding-3-small (1536 dimensions)
    • Support for text-embedding-3-large (3072 dimensions)
    • Normalized vector embeddings with L2 norm ≈ 1
    • Token usage tracking for embedding operations
  • Comprehensive integration tests for both new components
    • Regular functionality tests
    • Behavior tests that check embedding quality and OCR accuracy
    • Validation of embedding dimensions, normalization, and similarity properties
  • Updated README with examples for DatalabOCR and OpenAIEmbedder
  • Added section on using alternative OCR and embedding providers
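
The embedding checks listed above (dimensions, normalization, similarity) can be illustrated with a minimal sketch. This is not the project's test suite or the OpenAIEmbedder API: the helper names `l2_norm`, `cosine_similarity`, and `validate_embedding` are hypothetical, the tolerance is an assumed value, and a synthetic unit vector stands in for a real API response. Only the model names and dimensions come from these notes.

```python
import math

# Dimensions as stated in these release notes.
EXPECTED_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def l2_norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / denom


def validate_embedding(vec, model, tol=1e-3):
    """Hypothetical helper: return a list of problems; an empty list means OK.

    `tol` is an assumed tolerance for the "L2 norm ≈ 1" property.
    """
    problems = []
    expected = EXPECTED_DIMS[model]
    if len(vec) != expected:
        problems.append(f"expected {expected} dimensions, got {len(vec)}")
    norm = l2_norm(vec)
    if abs(norm - 1.0) > tol:
        problems.append(f"L2 norm {norm:.4f} is not within {tol} of 1")
    return problems


if __name__ == "__main__":
    # Synthetic stand-in for an embedding returned by the API.
    dim = EXPECTED_DIMS["text-embedding-3-small"]
    unit = [1.0 / math.sqrt(dim)] * dim
    print(validate_embedding(unit, "text-embedding-3-small"))
```

For unit-length vectors, cosine similarity reduces to a plain dot product, which is one reason a normalization check is useful before comparing embeddings.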