v0.4.1

@nshkrdotcom nshkrdotcom tagged this 22 Dec 07:34
This release establishes core API parity with the HuggingFace `datasets`
Python library, introducing the standard `load_dataset` interface, streaming
capabilities, and native support for vision datasets. It removes synthetic
test data in favor of robust HTTP stubs and integrates `vix` for
high-performance image decoding.

New Features:
- HuggingFace Parity API: Added `CrucibleDatasets.load_dataset/2`
  supporting `repo_id`, `config`, `split`, and `streaming` options.
- Dataset Types: Implemented `DatasetDict` for multi-split handling and
  `IterableDataset` for lazy loading and streaming.
- Data Discovery: Introduced `DataFiles` resolver to automatically infer
  configs and map remote files to splits using `HfHub` (v0.1.1) metadata.
- Streaming Support: Added line-buffered JSONL streaming and batched
  Parquet iteration to support large-scale datasets.
- Vision Support: Added `CrucibleDatasets.Loader.Vision` for standard
  benchmarks (Caltech101, Oxford Flowers 102, Oxford-IIIT Pet, Stanford Cars).
- Media Decoding: Integrated `vix` (libvips) for image decoding. Added
  `Image` feature type with automatic decoding configuration.
- Real Loaders: Replaced synthetic implementations of MMLU and HumanEval
  with real HuggingFace loaders.
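
The line-buffered JSONL streaming mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the module name `LineBufferSketch` and the `:eof` sentinel are hypothetical, chosen only to show how partial lines can be carried across chunk boundaries so that each emitted element is one complete JSONL record.

```elixir
defmodule LineBufferSketch do
  # Hypothetical illustration, not part of CrucibleDatasets.
  # Turns an enumerable of arbitrary binary chunks (for example, HTTP body
  # fragments) into a lazy stream of complete lines. A partial trailing line
  # is kept in the accumulator and prepended to the next chunk.
  def lines(chunks) do
    chunks
    # The :eof sentinel lets the reducer flush whatever is left in the buffer.
    |> Stream.concat([:eof])
    |> Stream.transform("", fn
      :eof, "" ->
        {[], ""}

      :eof, rest ->
        {[rest], ""}

      chunk, buffer ->
        parts = String.split(buffer <> chunk, "\n")
        # Everything except the last part is a complete line.
        {complete, [rest]} = Enum.split(parts, -1)
        # Blank lines are skipped, since they carry no JSONL record.
        {Enum.reject(complete, &(&1 == "")), rest}
    end)
  end
end

# Records split across chunk boundaries are reassembled lazily.
chunks = ["{\"a\":1}\n{\"b\"", ":2}\n{\"c\":3}"]
first_two = chunks |> LineBufferSketch.lines() |> Enum.take(2)
IO.inspect(first_two)
```

Because the result is a `Stream`, only as many chunks as needed are consumed; each emitted line could then be passed to a JSON decoder of your choice.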

Changes:
- Synthetic Removal: Completely removed synthetic data generation and the
  `fallback_to_synthetic` configuration. All loaders now target real data.
- Testing Infrastructure: Replaced synthetic fallback tests with `Bypass` HTTP
  stubs and deterministic local fixtures (`HfStub`, `HfCase`).
- Registry: Expanded registry to include 24 standard benchmark datasets covering
  Chat, Math, Code, Preference, Reasoning, and Vision domains.
- Dataset Struct: Added `features` field to `Dataset` with schema inference
  logic (`from_list`, `from_dataframe`).
- Dataset Operations: Added `select/2` (index/range), `num_rows/1`, and
  `column_names/1` to match Python API behavior.
- Dependencies: Updated `hf_hub` to `~> 0.1.1` for archive extraction;
  updated `explorer` to `~> 0.11.1`; added `vix` and `bypass`.
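
As a rough picture of the Python-style semantics behind `select/2`, `num_rows/1`, and `column_names/1`, here is a toy sketch over a list of maps. `ToyDataset` is a hypothetical stand-in and makes no claim about how the real `Dataset` struct stores rows or infers columns; it only assumes that `select/2` accepts either a range or a list of zero-based indices, as the entry above describes.

```elixir
defmodule ToyDataset do
  # Hypothetical stand-in for illustration; not CrucibleDatasets.Dataset.
  defstruct items: []

  # Number of rows, like len(dataset) / dataset.num_rows in Python.
  def num_rows(%__MODULE__{items: items}), do: length(items)

  # Column names taken from the first row (assumes all rows share keys).
  def column_names(%__MODULE__{items: []}), do: []

  def column_names(%__MODULE__{items: [first | _]}) do
    first |> Map.keys() |> Enum.map(&to_string/1) |> Enum.sort()
  end

  # Select by range: keeps a contiguous slice of rows.
  def select(%__MODULE__{items: items} = ds, %Range{} = range) do
    %{ds | items: Enum.slice(items, range)}
  end

  # Select by index list: keeps rows in the order the indices are given.
  def select(%__MODULE__{items: items} = ds, indices) when is_list(indices) do
    %{ds | items: Enum.map(indices, &Enum.at(items, &1))}
  end
end

ds = %ToyDataset{
  items: [
    %{id: 0, text: "a"},
    %{id: 1, text: "b"},
    %{id: 2, text: "c"}
  ]
}

IO.inspect(ToyDataset.column_names(ds))
IO.inspect(ToyDataset.select(ds, 1..2))
IO.inspect(ToyDataset.select(ds, [2, 0]))
```

Selecting with an index list reorders rows, while a range keeps them in their original order.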

Fixes:
- Parquet Stability: Applied `rechunk: true` to Explorer reads to mitigate
  Polars NIF panics on specific schemas.
- Path Resolution: Fixed `Fetcher.HuggingFace` to use resolved paths from
  `DataFiles`, supporting extraction of archives (zip/tar).

Documentation:
- Updated README with new Quick Start examples, system dependencies (libvips),
  and feature matrix.
- Added comprehensive examples for `load_dataset`, `DatasetDict`, streaming,
  and vision tasks.