This release establishes core API parity with the HuggingFace Python
library, introducing the standard `load_dataset` interface, streaming
capabilities, and native support for vision datasets. It removes synthetic
data generation, replaces the synthetic-fallback tests with `Bypass` HTTP
stubs, and integrates `vix` (libvips) for image decoding.

New Features:
- HuggingFace Parity API: Added `CrucibleDatasets.load_dataset/2`
supporting `repo_id`, `config`, `split`, and `streaming` options.
- Dataset Types: Implemented `DatasetDict` for multi-split handling and
`IterableDataset` for lazy loading and streaming.
- Data Discovery: Introduced `DataFiles` resolver to automatically infer
configs and map remote files to splits using `HfHub` (v0.1.1) metadata.
- Streaming Support: Added line-buffered JSONL streaming and batched
Parquet iteration to support large-scale datasets.
- Vision Support: Added `CrucibleDatasets.Loader.Vision` for standard
benchmarks (Caltech101, Oxford Flowers 102, Oxford-IIIT Pet, Stanford Cars).
- Media Decoding: Integrated `vix` (libvips) for image decoding. Added
`Image` feature type with automatic decoding configuration.
- Real Loaders: Replaced synthetic implementations of MMLU and HumanEval
with real HuggingFace loaders.
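
For reviewers unfamiliar with the line-buffered JSONL streaming mentioned above, the general technique can be sketched roughly as follows. This is an illustrative, self-contained example and not the code shipped in this release: the module name `LineBufferSketch` and the function `lines/1` are hypothetical, and the sketch assumes the input is an enumerable of binary chunks that may end in the middle of a line.

```elixir
defmodule LineBufferSketch do
  # Hypothetical helper for illustration only; not part of CrucibleDatasets.
  # Turns a stream of arbitrary binary chunks into a lazy stream of complete
  # lines, carrying any trailing partial line over to the next chunk.
  def lines(chunks) do
    chunks
    # Append a sentinel so the last buffered line is flushed at end of input.
    |> Stream.concat([:eof])
    |> Stream.transform("", fn
      :eof, "" ->
        {[], ""}

      :eof, buffer ->
        {[buffer], ""}

      chunk, buffer ->
        parts = String.split(buffer <> chunk, "\n")
        # Everything except the last element is a complete line; the last
        # element is the (possibly empty) start of the next line.
        {complete, [rest]} = Enum.split(parts, -1)
        {complete, rest}
    end)
    # Skip blank lines, which carry no JSON record.
    |> Stream.reject(&(&1 == ""))
  end
end

# A record split across two chunks is reassembled before being emitted.
["{\"a\":1}\n{\"b\"", ":2}\n"]
|> LineBufferSketch.lines()
|> Enum.each(&IO.puts/1)
```

Each emitted line would then be handed to a JSON decoder. Because the pipeline is built from `Stream` functions, only one chunk plus the current partial line needs to be held in memory at a time.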

Changes:
- Synthetic Removal: Removed synthetic data generation and the
`fallback_to_synthetic` configuration. All loaders now target real data.
- Testing Infrastructure: Replaced synthetic fallback tests with `Bypass` HTTP
stubs and deterministic local fixtures (`HfStub`, `HfCase`).
- Registry: Expanded registry to include 24 standard benchmark datasets covering
Chat, Math, Code, Preference, Reasoning, and Vision domains.
- Dataset Struct: Added a `features` field to `Dataset`, with schema
inference in `from_list` and `from_dataframe`.
- Dataset Operations: Added `select/2` (index/range), `num_rows/1`, and
`column_names/1` to match Python API behavior.
- Dependencies: Updated `hf_hub` to `~> 0.1.1` for archive extraction;
updated `explorer` to `~> 0.11.1`; added `vix` and `bypass`.

Fixes:
- Parquet Stability: Applied `rechunk: true` to Explorer reads to mitigate
Polars NIF panics on specific schemas.
- Path Resolution: Fixed `Fetcher.HuggingFace` to use resolved paths from
`DataFiles`, supporting extraction of archives (zip/tar).

Documentation:
- Updated README with new Quick Start examples, system dependencies (libvips),
and feature matrix.
- Added comprehensive examples for `load_dataset`, `DatasetDict`, streaming,
and vision tasks.