Added
SentenceChunker: Sentence-aware sliding-window chunker with configurable sentences_per_chunk, sentence_overlap, and min_chunk_size. Pure-Python regex splitting, no external dependencies.
RecursiveMarkdownChunker: Hierarchical chunker that splits by heading level (H1→H2→H3→H4), then paragraph, then sentence, with overlap merging.
VoyageAIEmbedder: Embedder using Voyage AI models — voyage-3, voyage-3-large, voyage-3-lite.
CohereEmbedder: Embedder using Cohere models — embed-v4.0, embed-english-v3.0, embed-multilingual-v3.0. Supports configurable input_type for search documents vs queries.
TableOfContentsRefiner: Refiner that detects and removes the table of contents section, storing it in extracted_data["toc_markdown"].
Example scripts (examples/): 4 runnable Python scripts demonstrating basic pipeline usage, component comparison, step-by-step execution, and cost tracking.
Jupyter notebooks (notebooks/): 3 interactive notebooks — getting started walkthrough, chunker comparison, and a component explorer exercising every component with all valid configurations.
README component reference tables: OCR, Refiners, Chunkers, and Embedders with models, parameters, defaults, and costs per 1M tokens.
README examples & notebooks section linking to all new files with descriptions.
Changed
Updated README package layout tree to include examples/, notebooks/, and the expanded src/ragbandit/ subdirectories.