Issues with concatenating datasets #4666

Closed
ChenghaoMou opened this issue Jul 9, 2022 · 2 comments
Labels
bug Something isn't working


@ChenghaoMou

Describe the bug

Datasets cannot be concatenated when a feature is a sequence of dicts in one dataset and a dict of sequences in the other. According to the documentation, however, the former should be converted automatically:

A datasets.Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don’t want this behavior, you can use a python list instead of the datasets.Sequence.
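
To make the quoted conversion concrete, here is a minimal plain-Python sketch of turning a list of dicts into a dict of lists. The helper name `to_dict_of_lists` is hypothetical and this is not the library's implementation; it only illustrates the shape change that makes the two `answers` features in this report look different.

```python
def to_dict_of_lists(rows):
    """Turn a list of dicts into a dict of lists (illustration only)."""
    keys = list(rows[0].keys()) if rows else []
    return {key: [row[key] for row in rows] for key in keys}


# One SQuAD-style "answers" value written as a sequence of dicts ...
answers_as_rows = [
    {"text": "in the late 1990s", "answer_start": 269},
    {"text": "late 1990s", "answer_start": 276},
]

# ... and the same value as a dict of sequences.
answers_as_columns = to_dict_of_lists(answers_as_rows)
print(answers_as_columns)
```

The sample answer strings and offsets are made up for the illustration.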

Steps to reproduce the bug

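from datasets import concatenate_datasets, load_dataset

# Load SQuAD v2 and export the train split as JSON Lines.
squad = load_dataset("squad_v2")
squad["train"].to_json("output.jsonl", lines=True)

# Reload the exported file; the features are now inferred from the JSON.
temp = load_dataset("json", data_files={"train": "output.jsonl"})
# Raises the ValueError shown under "Actual results".
concatenate_datasets([temp["train"], squad["train"]])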

Expected results

The code runs without error.

Actual results

ValueError: The features can't be aligned because the key answers of features {'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)} has unexpected type - Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None) (expected either {'text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'answer_start': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)} or Value("null").

Environment info

  • datasets version: 2.3.2
  • Platform: macOS-12.4-arm64-arm-64bit
  • Python version: 3.8.11
  • PyArrow version: 6.0.1
  • Pandas version: 1.3.5
@ChenghaoMou ChenghaoMou added the bug Something isn't working label Jul 9, 2022
@mariosasko
Collaborator

mariosasko commented Jul 12, 2022

Hi! I agree we should improve the features equality checks to account for this particular case. However, your code fails because answer_start has the dtype int64 instead of int32 after loading from JSON (it's not possible to embed type precision info in a JSON file; save_to_disk does that for Arrow files). This leads to the concatenation error, as PyArrow does not support this sort of type promotion. It can be fixed by passing the original features when loading:

temp = load_dataset("json", data_files={"train": "output.jsonl"}, features=squad["train"].features)
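
The point about JSON can be seen without the library. The sketch below uses only the standard `json` module; the `declared_dtypes` mapping and the `infer_dtype` helper are hypothetical stand-ins for a dataset schema and for a loader's type inference (assumed here to default to 64-bit integers, as in the error message above). It shows that the integer width is absent from the serialized text, so it has to be supplied again at load time.

```python
import json

# Hypothetical schema kept alongside the data (not part of the JSON itself).
declared_dtypes = {"answer_start": "int32"}

record = {"answer_start": [269, 276]}
line = json.dumps(record)  # the text holds the numbers only, no dtype


def infer_dtype(values):
    """Assumed loader behaviour: integers with no declared width become int64."""
    if all(isinstance(v, int) for v in values):
        return "int64"
    return "string"


restored = json.loads(line)
inferred = {key: infer_dtype(values) for key, values in restored.items()}

# Inference alone disagrees with the original schema ...
print(inferred == declared_dtypes)  # False

# ... so the schema is applied explicitly, as with features= above.
resolved = {key: declared_dtypes.get(key, dtype) for key, dtype in inferred.items()}
print(resolved == declared_dtypes)  # True
```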

@ChenghaoMou
Author

That makes sense. I totally missed the int64 and int32 part. Thanks for pointing it out! Will close this issue for now.
