Backwards compatibility broken for cached datasets that use .filter() #2943

Closed
anton-l opened this issue Sep 19, 2021 · 6 comments · Fixed by #2947

anton-l commented Sep 19, 2021

Describe the bug

After upgrading to datasets 1.12.0, some cached .filter() steps from 1.11.0 started failing with
ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}

Related feature: #2836

❓ This is probably a wontfix bug, since it can be solved by simply cleaning the related cache dirs, but the workaround could be useful for someone googling the error :)

Workaround

Remove the cache for the given dataset, e.g. rm -rf ~/.cache/huggingface/datasets/librispeech_asr.
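
For a scripted cleanup, the same removal can be done from Python. A minimal sketch, assuming the default cache location (the HF_DATASETS_CACHE environment variable overrides it if set):

import os
import shutil

# Resolve the datasets cache root; HF_DATASETS_CACHE takes precedence if set.
cache_root = os.environ.get(
    "HF_DATASETS_CACHE",
    os.path.expanduser("~/.cache/huggingface/datasets"),
)

# Remove only the cache for the affected dataset, leaving other datasets intact.
dataset_cache = os.path.join(cache_root, "librispeech_asr")
if os.path.isdir(dataset_cache):
    shutil.rmtree(dataset_cache)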

Steps to reproduce the bug

  1. Delete ~/.cache/huggingface/datasets/librispeech_asr if it exists.

  2. pip install datasets==1.11.0 and run the following snippet:

from datasets import load_dataset

ids = ["1272-141231-0000"]
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.filter(lambda x: x["id"] in ids)
  3. pip install datasets==1.12.1 and re-run the same snippet. (A cache-bypassing variant is sketched below.)
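
If clearing the cache isn't an option, the stale file can also be bypassed per call: filter accepts a load_from_cache_file flag, so forcing a recompute should sidestep the error. A minimal sketch (the flag exists in these versions; whether it fully avoids the stale file in 1.12.1 is untested here):

from datasets import load_dataset

ids = ["1272-141231-0000"]
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
# Recompute the filter instead of loading the (stale) cached result.
ds = ds.filter(lambda x: x["id"] in ids, load_from_cache_file=False)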

Expected results

Same result as with the previous datasets version.

Actual results

Reusing dataset librispeech_asr (./.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/468ec03677f46a8714ac6b5b64dba02d246a228d92cbbad7f3dc190fa039eab1)
Loading cached processed dataset at ./.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/468ec03677f46a8714ac6b5b64dba02d246a228d92cbbad7f3dc190fa039eab1/cache-cd1c29844fdbc87a.arrow
Traceback (most recent call last):
  File "./repos/transformers/src/transformers/models/wav2vec2/try_dataset.py", line 5, in <module>
    ds = ds.filter(lambda x: x["id"] in ids)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/fingerprint.py", line 398, in wrapper
    out = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2169, in filter
    indices = self.map(
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1686, in map
    return self._map_single(
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/fingerprint.py", line 398, in wrapper
    out = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1896, in _map_single
    return Dataset.from_file(cache_file_name, info=info, split=self.split)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 343, in from_file
    return cls(
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 282, in __init__
    self.info.features = self.info.features.reorder_fields_as(inferred_features)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/features.py", line 1151, in reorder_fields_as
    return Features(recursive_reorder(self, other))
  File "./envs/transformers/lib/python3.8/site-packages/datasets/features.py", line 1140, in recursive_reorder
    raise ValueError(f"Keys mismatch: between {source} and {target}" + stack_position)
ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}

Process finished with exit code 1

Environment info

  • datasets version: 1.12.1
  • Platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.17
  • Python version: 3.8.10
  • PyArrow version: 5.0.0
anton-l added the bug label on Sep 19, 2021
lhoestq self-assigned this on Sep 20, 2021

lhoestq commented Sep 20, 2021

Hi ! I guess the caching mechanism should have considered the new filter to be different from the old one, and not reuse the cached results from the old filter.
To avoid other users hitting this issue, we could make the caching differentiate the two. What do you think?
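
(For context: a transform's cached result is looked up via a fingerprint derived from the parent dataset's fingerprint and the transform's arguments. The sketch below only illustrates the general idea with a hypothetical helper; it is not the actual change made in #2947.)

from datasets.fingerprint import Hasher

# Hypothetical helper, for illustration only: keying the cache on the parent
# fingerprint, the transform, its arguments, and a version salt would keep a
# new implementation of filter from picking up cache files written by an old one.
def transform_fingerprint(parent_fingerprint, transform_name, args, version_salt):
    return Hasher.hash((parent_fingerprint, transform_name, args, version_salt))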


anton-l commented Sep 20, 2021

If it's easy enough to implement, then yes please 😄 But this issue can be low-priority, since I've only encountered it in a couple of transformers CI tests.


lhoestq commented Sep 20, 2021

Well, it can cause issues for anyone who updates datasets and re-runs code that uses filter, so I'm creating a PR.


lhoestq commented Sep 20, 2021

I just merged a fix, let me know if you're still having this kind of issue :)

We'll do a release soon to make this fix available


anton-l commented Sep 20, 2021

Definitely works on several manual cases with our dummy datasets, thank you @lhoestq !

albertvillanova commented

Fixed by #2947.
