Open
Description
Describe the bug
When concatenating or interleaving different datasets, I hit an error because the features can't be aligned. After some investigation, I found that the audio arrays have different dtypes, namely float32 and float64, so the datasets cannot be merged.
Steps to reproduce the bug
For example, for facebook/voxpopuli and mozilla-foundation/common_voice_11_0:
from datasets import load_dataset, interleave_datasets
covost = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
voxpopuli = load_dataset("facebook/voxpopuli", "nl", split="train", streaming=True)
sample_cv, = covost.take(1)
sample_vp, = voxpopuli.take(1)
assert sample_cv["audio"]["array"].dtype == sample_vp["audio"]["array"].dtype
# Fails
dataset = interleave_datasets([covost, voxpopuli])
# ValueError: The features can't be aligned because the key audio of features {'audio_id': Value(dtype='string', id=None), 'language': Value(dtype='int64', id=None), 'audio': {'array': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'path': Value(dtype='string', id=None), 'sampling_rate': Value(dtype='int64', id=None)}, 'normalized_text': Value(dtype='string', id=None), 'gender': Value(dtype='string', id=None), 'speaker_id': Value(dtype='string', id=None), 'is_gold_transcript': Value(dtype='bool', id=None), 'accent': Value(dtype='string', id=None), 'sentence': Value(dtype='string', id=None)} has unexpected type - {'array': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'path': Value(dtype='string', id=None), 'sampling_rate': Value(dtype='int64', id=None)} (expected either Audio(sampling_rate=16000, mono=True, decode=True, id=None) or Value("null").
Expected behavior
The audio should be decoded to arrays with a single, consistent dtype (I guess float32).
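To illustrate what a consistent dtype would mean, here is a minimal sketch that casts the decoded audio array of a sample to float32. The helper name `to_float32` is hypothetical (it is not part of `datasets`), and the two samples are synthetic stand-ins rather than real examples from either dataset.

```python
import numpy as np


def to_float32(sample):
    """Return a copy of `sample` whose audio array is float32 (hypothetical helper)."""
    audio = dict(sample["audio"])  # shallow copy so the input sample is not mutated
    audio["array"] = np.asarray(audio["array"], dtype=np.float32)
    return {**sample, "audio": audio}


# Synthetic stand-ins for one decoded example per dataset (assumed shapes, not real audio).
sample_cv = {"audio": {"array": np.zeros(4, dtype=np.float32), "sampling_rate": 16000}}
sample_vp = {"audio": {"array": np.zeros(4, dtype=np.float64), "sampling_rate": 16000}}

fixed_cv, fixed_vp = to_float32(sample_cv), to_float32(sample_vp)

# After the cast, both arrays share the same dtype.
assert fixed_cv["audio"]["array"].dtype == fixed_vp["audio"]["array"].dtype == np.float32
```

On streaming datasets, such a function could presumably be applied with `IterableDataset.map`, for example `voxpopuli.map(to_float32)`. Whether that alone resolves the feature-alignment error above is untested, since the error is raised when comparing the declared features rather than the sample values.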
Environment info
- `datasets` version: 2.7.1.dev0
- Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.15
- PyArrow version: 10.0.1
- Pandas version: 1.5.2