✨ Add 'SparseCsv' builder and 'sparse_collate_fn' for efficient high-dimensional sparse data loading #7993

Open
Ebraheem1 wants to merge 1 commit into huggingface:main from Ebraheem1:sparse-csv-loader
Ebraheem1 commented Feb 4, 2026

This PR introduces a new dataset builder, SparseCsv, for "wide" tabular datasets (e.g., 100k+ columns, as is common in transcriptomics, sparse NLP features, and recommender systems) that are typically too large to load into memory as dense Arrow tables.
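
To illustrate why a sparse encoding helps with wide tables, here is a minimal sketch that turns dense CSV rows into index/value pairs. The names `row_to_sparse` and `iter_sparse_rows` and the `indices`/`values`/`size` fields are hypothetical and are not taken from this PR; the actual SparseCsv builder may use a different schema and reading strategy.

```python
import csv
import io


def row_to_sparse(row):
    """Keep only the non-zero cells of one dense CSV row.

    Hypothetical encoding (an assumption, not the PR's schema):
    column positions in "indices", their values in "values",
    and the original row width in "size".
    """
    indices, values = [], []
    for col, cell in enumerate(row):
        # Treat empty cells and explicit zeros as "not stored".
        if cell != "" and float(cell) != 0.0:
            indices.append(col)
            values.append(float(cell))
    return {"indices": indices, "values": values, "size": len(row)}


def iter_sparse_rows(text):
    """Stream a CSV document row by row, yielding sparse examples."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    for row in reader:
        yield row_to_sparse(row)


# Tiny stand-in for a wide table: 5 columns, mostly zeros.
demo_csv = "g0,g1,g2,g3,g4\n0,0,3.5,0,0\n1,0,0,0,2\n0,0,0,0,0\n"
sparse_rows = list(iter_sparse_rows(demo_csv))
```

With this encoding, storage per row grows with the number of non-zero cells rather than with the number of columns, which is the property that matters at 100k+ columns.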

It also adds a utility function, sparse_collate_fn, that converts these sparse examples into torch.sparse or scipy.sparse matrices during training.
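
As a rough sketch of what such a collate step involves, the function below merges a batch of sparse examples into COO triplets (row index, column index, value). It is not the PR's `sparse_collate_fn`: the name `collate_to_coo`, its `num_columns` parameter, and the `indices`/`values` example fields are assumptions made for illustration only.

```python
def collate_to_coo(batch, num_columns):
    """Merge sparse examples into COO triplets for one batch.

    Each example is assumed (hypothetically) to be a dict with
    "indices" (column positions) and "values" (matching non-zeros).
    Returns (rows, cols, vals, shape).
    """
    rows, cols, vals = [], [], []
    for row_id, example in enumerate(batch):
        for col, value in zip(example["indices"], example["values"]):
            if not 0 <= col < num_columns:
                raise ValueError(f"column index {col} out of range")
            rows.append(row_id)  # position of the example in the batch
            cols.append(col)     # feature (column) position
            vals.append(value)
    return rows, cols, vals, (len(batch), num_columns)


demo_batch = [
    {"indices": [2], "values": [3.5]},
    {"indices": [0, 4], "values": [1.0, 2.0]},
    {"indices": [], "values": []},  # an all-zero row contributes nothing
]
rows, cols, vals, shape = collate_to_coo(demo_batch, num_columns=5)
```

The triplets map directly onto the existing constructors of both libraries: `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=shape)` or `torch.sparse_coo_tensor([rows, cols], vals, size=shape)`. A function of this shape could be passed as `collate_fn` to a `torch.utils.data.DataLoader` after wrapping it to fix `num_columns`.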

This PR should fix #7377


Development

Successfully merging this pull request may close these issues.

Support for sparse arrays with the Arrow Sparse Tensor format?
