This release rearchitects the CrucibleDatasets library, turning it from a simple
fetch wrapper into a full data processing system. The work focuses on
modularity, type safety, and support for large-scale datasets through streaming
and lazy evaluation.

Foundational Architecture Changes:

A new Source abstraction layer decouples data retrieval logic from specific
providers. The Source behaviour defines a unified interface for file listing,
downloading, and streaming, with implementations for both local filesystems and
the HuggingFace Hub. Because both implementations share one interface, callers
can switch between local and remote data sources without changing downstream
code.
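
As a rough illustration of the behaviour-based design described above, the
sketch below defines a cut-down behaviour and a local implementation. The
module names (SourceSketch, SourceSketch.Local, SourceSketch.Reader) and the
callback signatures are assumptions made for this example; the library's
actual Source callbacks may differ.

```elixir
defmodule SourceSketch do
  # Hypothetical behaviour: the real Source callbacks may have other names
  # and arities.
  @callback list_files(location :: String.t()) ::
              {:ok, [String.t()]} | {:error, term()}
  @callback stream(location :: String.t(), path :: String.t()) :: Enumerable.t()
end

defmodule SourceSketch.Local do
  @behaviour SourceSketch

  @impl true
  def list_files(dir) do
    # Sort so the listing is deterministic across filesystems.
    case File.ls(dir) do
      {:ok, names} -> {:ok, Enum.sort(names)}
      {:error, reason} -> {:error, reason}
    end
  end

  @impl true
  def stream(dir, path) do
    # File.stream!/1 yields the file lazily, one line at a time.
    File.stream!(Path.join(dir, path))
  end
end

defmodule SourceSketch.Reader do
  # Callers depend only on the behaviour, so the implementing module is
  # passed in as an argument and can be swapped freely.
  def first_line(source_mod, location, path) do
    source_mod.stream(location, path) |> Enum.take(1) |> List.first()
  end
end
```

A remote implementation would satisfy the same two callbacks, so
SourceSketch.Reader would not need to change.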

Complementing the Source layer is the new Format parser layer, a standardized
interface for parsing JSONL, JSON, CSV, and Parquet files. It detects the format
from the file extension and supports streaming parsing where applicable, which
allows processing files larger than available memory.
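
Extension-based detection and line-oriented streaming could look roughly like
the sketch below. FormatSketch, detect/1, and stream_records/2 are hypothetical
names, and the decoder is passed in by the caller so the example does not
assume any particular JSON or CSV library.

```elixir
defmodule FormatSketch do
  # Hypothetical mapping from file extension to format atom.
  @formats %{
    ".jsonl" => :jsonl,
    ".json" => :json,
    ".csv" => :csv,
    ".parquet" => :parquet
  }

  # Returns {:ok, format} for a known extension, :error otherwise.
  def detect(path) do
    ext = path |> Path.extname() |> String.downcase()
    Map.fetch(@formats, ext)
  end

  # Lazily decodes one record per non-blank line. Nothing is read or decoded
  # until the returned stream is consumed.
  def stream_records(lines, decode_line) when is_function(decode_line, 1) do
    lines
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Stream.map(decode_line)
  end
end
```

Feeding File.stream!/1 into stream_records/2 keeps only one line in memory at a
time, which is what makes larger-than-memory files workable for line-oriented
formats.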

Dataset Operations and Protocols:

The Dataset struct gains a suite of eager transformation functions: functional
transformations (map, filter, shuffle), structural operations (batch, concat,
split, shard, slice), and column-wise manipulations (select, rename_column,
add_column, remove_columns). It also implements Elixir's Enumerable protocol and
the Access behaviour, enabling idiomatic iteration and bracket-based indexing.
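
To show what implementing Enumerable and Access involves, here is a minimal
sketch on a stand-in struct. DatasetSketch and its items field are assumptions
for illustration, as is the choice of integer row indices as Access keys; the
real Dataset struct may key Access differently.

```elixir
defmodule DatasetSketch do
  # Hypothetical stand-in for the Dataset struct.
  defstruct items: []

  @behaviour Access

  # ds[index] goes through fetch/2; out-of-range indices yield :error,
  # which Access turns into nil.
  @impl Access
  def fetch(%__MODULE__{items: items}, index) when is_integer(index) do
    Enum.fetch(items, index)
  end

  @impl Access
  def get_and_update(%__MODULE__{items: items} = ds, index, fun)
      when is_integer(index) do
    current = Enum.at(items, index)

    case fun.(current) do
      {get, update} -> {get, %{ds | items: List.replace_at(items, index, update)}}
      :pop -> {current, %{ds | items: List.delete_at(items, index)}}
    end
  end

  @impl Access
  def pop(%__MODULE__{items: items} = ds, index) when is_integer(index) do
    {Enum.at(items, index), %{ds | items: List.delete_at(items, index)}}
  end
end

defimpl Enumerable, for: DatasetSketch do
  def count(%DatasetSketch{items: items}), do: {:ok, length(items)}

  # Returning {:error, __MODULE__} tells Enum to fall back to reduce/3.
  def member?(_ds, _value), do: {:error, __MODULE__}
  def slice(_ds), do: {:error, __MODULE__}

  # Delegate iteration to the built-in list implementation.
  def reduce(%DatasetSketch{items: items}, acc, fun) do
    Enumerable.List.reduce(items, acc, fun)
  end
end

ds = %DatasetSketch{items: [%{q: "2+2", a: "4"}, %{q: "3+3", a: "6"}]}
answers = Enum.map(ds, & &1.a)
first_row = ds[0]
```

With these two implementations in place, any Enum or Stream function accepts
the struct directly, and bracket syntax works without extra helpers.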

New Data Structures:

To handle multi-split datasets, a new DatasetDict module has been added. It
serves as a container for split-specific datasets (e.g., train, test,
validation) and supports bulk operations, applying transformations across all
splits simultaneously.
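
The bulk-operation idea can be sketched with a plain map of split name to rows.
SplitsSketch and map_all/2 are hypothetical names used only for this example
and are not taken from the DatasetDict API.

```elixir
defmodule SplitsSketch do
  # splits is assumed to be a map such as %{"train" => rows, "test" => rows}.
  # The same function is applied to every row of every split.
  def map_all(splits, fun) when is_map(splits) and is_function(fun, 1) do
    Map.new(splits, fn {name, rows} -> {name, Enum.map(rows, fun)} end)
  end
end

splits = %{
  "train" => [%{text: "Hello"}, %{text: "World"}],
  "test" => [%{text: "Bye"}]
}

lowered =
  SplitsSketch.map_all(splits, fn row ->
    %{row | text: String.downcase(row.text)}
  end)
```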

For scenarios requiring memory efficiency, the new IterableDataset module enables
lazy, streaming processing. It allows users to build chains of lazy
transformations (map, filter, batch, buffered shuffle) that are only executed
when the data is consumed.
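
The deferred-execution model can be approximated with Elixir's Stream module,
as in the sketch below. LazySketch and its functions are hypothetical, and
windowed_shuffle/2 is a simplified stand-in that shuffles within fixed-size
windows; it is not necessarily how the library's buffered shuffle works.

```elixir
defmodule LazySketch do
  # Builds a lazy map -> filter -> batch chain. No row is touched until the
  # returned stream is consumed.
  def pipeline(rows, batch_size) do
    rows
    |> Stream.map(fn row -> Map.put(row, :length, String.length(row.text)) end)
    |> Stream.filter(fn row -> row.length > 3 end)
    |> Stream.chunk_every(batch_size)
  end

  # Simplified stand-in for a buffered shuffle: only buffer_size rows are
  # held in memory at once, and each window is shuffled independently.
  def windowed_shuffle(rows, buffer_size) do
    rows
    |> Stream.chunk_every(buffer_size)
    |> Stream.flat_map(&Enum.shuffle/1)
  end
end

# A large lazy source; only the rows needed for the first batch are produced.
rows = Stream.map(1..1_000_000, fn i -> %{text: String.duplicate("a", rem(i, 6))} end)

[first_batch] = rows |> LazySketch.pipeline(2) |> Enum.take(1)
```

Because Enum.take/2 halts the stream after the first batch, the remaining rows
of the source are never generated.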

Features and Type System:

A new Features schema system provides type definitions and validation for
dataset columns. It supports scalar Values, categorical ClassLabels, nested
Sequences, and specialized media types for Images and Audio, and it enables
schema inference and explicit data casting for improved type safety.
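
A heavily simplified sketch of schema-driven casting follows. FeaturesSketch
and the tuple-based schema encoding are assumptions made for this example and
do not reflect the library's Features structs; only string and integer values
and class labels are covered.

```elixir
defmodule FeaturesSketch do
  # Hypothetical schema shape:
  #   %{column => {:value, :string | :integer} | {:class_label, [name]}}
  # Returns {:ok, cast_row} or {:error, {:invalid, column}}.
  def cast(row, schema) do
    Enum.reduce_while(schema, {:ok, %{}}, fn {col, type}, {:ok, acc} ->
      case cast_value(Map.get(row, col), type) do
        {:ok, value} -> {:cont, {:ok, Map.put(acc, col, value)}}
        :error -> {:halt, {:error, {:invalid, col}}}
      end
    end)
  end

  defp cast_value(v, {:value, :string}) when is_binary(v), do: {:ok, v}
  defp cast_value(v, {:value, :integer}) when is_integer(v), do: {:ok, v}

  # Class labels given by name are encoded as their index in the name list.
  defp cast_value(v, {:class_label, names}) when is_binary(v) do
    case Enum.find_index(names, &(&1 == v)) do
      nil -> :error
      index -> {:ok, index}
    end
  end

  # Class labels already given as an index are range-checked.
  defp cast_value(v, {:class_label, names})
       when is_integer(v) and v >= 0 and v < length(names),
       do: {:ok, v}

  defp cast_value(_value, _type), do: :error
end

schema = %{"text" => {:value, :string}, "label" => {:class_label, ["neg", "pos"]}}
result = FeaturesSketch.cast(%{"text" => "great", "label" => "pos"}, schema)
```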

Dependency Integration:

The library now depends on hf_hub (v0.1.0) for all HuggingFace interactions.
This dependency provides API access, content-addressed caching, and resumable
downloads.

Loader Enhancements:

Loader coverage has been expanded with new modules for Reasoning datasets
(OpenThoughts3, DeepMath) and Rubric-based evaluation datasets
(Feedback-Collection). Existing loaders for MMLU, HumanEval, and others have
been updated to leverage the new Source/Format infrastructure. A global
configuration option has been added to enable automatic fallback to synthetic
data, facilitating offline development and resilient testing.
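
The fallback pattern might be wired up roughly as follows. The application name
:sketch_app, the key :fallback_to_synthetic, and the FallbackSketch module are
all hypothetical; the library's actual configuration key is not stated here.

```elixir
defmodule FallbackSketch do
  # fetch_fun returns {:ok, rows} or {:error, reason}; synthetic_fun returns
  # generated rows. Both are supplied by the caller in this sketch.
  def load(fetch_fun, synthetic_fun) do
    case fetch_fun.() do
      {:ok, rows} ->
        {:ok, rows}

      {:error, reason} ->
        # Hypothetical global switch, read at call time and off by default.
        if Application.get_env(:sketch_app, :fallback_to_synthetic, false) do
          {:ok, synthetic_fun.()}
        else
          {:error, reason}
        end
    end
  end
end

# Simulates a loader that cannot reach the network.
offline_fetch = fn -> {:error, :nxdomain} end
synthetic_rows = fn -> [%{question: "synthetic question", answer: "A"}] end
```

Keeping the switch off by default means a failed download still surfaces as an
error unless the caller has opted in to synthetic data.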

Testing Infrastructure:

The testing setup has been improved with a new mix task for running integration
tests against live data sources, while keeping the default test suite fast and
network-independent through synthetic fallbacks.

This release establishes the necessary foundation for advanced data processing
workflows and multimodal support in Elixir.