Issues with concatenating datasets #4666

ChenghaoMou · 2022-07-09T17:45:14Z

Describe the bug

It is impossible to concatenate datasets if a feature is a sequence of dicts in one dataset and a dict of sequences in another. According to the documentation, however, the former should be converted automatically:

A datasets.Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don’t want this behavior, you can use a Python list instead of the datasets.Sequence.
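To make the two layouts concrete, here is a minimal plain-Python sketch of the conversion the documentation describes. It is not the datasets implementation: the helper name `list_of_dicts_to_dict_of_lists` is hypothetical and the answer values are made up for illustration.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Turn [{'k': v1}, {'k': v2}] into {'k': [v1, v2]}."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Create the column on first sight, then append in row order.
            columns.setdefault(key, []).append(value)
    return columns


# "Sequence of dicts" layout: one dict per answer (made-up values).
answers_as_rows = [
    {"text": "foo", "answer_start": 3},
    {"text": "bar", "answer_start": 7},
]

# "Dict of sequences" layout: one list per field.
answers_as_columns = list_of_dicts_to_dict_of_lists(answers_as_rows)
print(answers_as_columns)
```

Both layouts carry the same information, which is why the two feature definitions in the error message below look like they should be interchangeable.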

Steps to reproduce the bug

from datasets import concatenate_datasets, load_dataset

squad = load_dataset("squad_v2")
squad["train"].to_json("output.jsonl", lines=True)

temp = load_dataset("json", data_files={"train": "output.jsonl"})
concatenate_datasets([temp["train"], squad["train"]])

Expected results

The code runs without raising an error.

Actual results

ValueError: The features can't be aligned because the key answers of features {'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)} has unexpected type - Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None) (expected either {'text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'answer_start': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)} or Value("null").

Environment info

datasets version: 2.3.2
Platform: macOS-12.4-arm64-arm-64bit
Python version: 3.8.11
PyArrow version: 6.0.1
Pandas version: 1.3.5

mariosasko · 2022-07-12T17:09:17Z

Hi! I agree we should improve the features equality checks to account for this particular case. However, your code fails because answer_start has the dtype int64 instead of int32 after loading from JSON. It's not possible to embed type precision info into a JSON file (save_to_disk does that for Arrow files), and the mismatch leads to the concatenation error because PyArrow does not support this sort of type promotion. This can be fixed by passing the original features when loading:

temp = load_dataset("json", data_files={"train": "output.jsonl"}, features=squad["train"].features)

ChenghaoMou · 2022-07-12T17:16:14Z

That makes sense. I totally missed the int64 and int32 part. Thanks for pointing it out! Will close this issue for now.

ChenghaoMou added the bug label Jul 9, 2022

ChenghaoMou closed this as completed Jul 12, 2022
