Backwards compatibility broken for cached datasets that use .filter() #2943

Closed
anton-l opened this issue Sep 19, 2021 · 6 comments · Fixed by #2947

anton-l commented Sep 19, 2021

Describe the bug

After upgrading to datasets 1.12.0, some cached .filter() steps from 1.11.0 started failing with
ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}

Related feature: #2836

❓ This is probably a wontfix bug, since it can be solved by simply cleaning the related cache dirs, but the workaround could be useful for someone googling the error :)

Workaround

Remove the cache for the given dataset, e.g. rm -rf ~/.cache/huggingface/datasets/librispeech_asr.
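
For a scripted cleanup, the same removal can be done from Python. A minimal sketch, assuming the default cache location (the HF_DATASETS_CACHE environment variable overrides it if set):

import os
import shutil

# Resolve the datasets cache root; HF_DATASETS_CACHE takes precedence if set.
cache_root = os.environ.get(
    "HF_DATASETS_CACHE",
    os.path.expanduser("~/.cache/huggingface/datasets"),
)

# Remove only the cache for the affected dataset, leaving other datasets intact.
dataset_cache = os.path.join(cache_root, "librispeech_asr")
if os.path.isdir(dataset_cache):
    shutil.rmtree(dataset_cache)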

Steps to reproduce the bug

  1. Delete ~/.cache/huggingface/datasets/librispeech_asr if it exists.

  2. pip install datasets==1.11.0 and run the following snippet:

from datasets import load_dataset

ids = ["1272-141231-0000"]
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.filter(lambda x: x["id"] in ids)
  3. pip install datasets==1.12.1 and re-run the same snippet. (A cache-bypassing variant is sketched below.)
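
If clearing the cache isn't an option, the stale file can also be bypassed per call: filter accepts a load_from_cache_file flag, so forcing a recompute should sidestep the error. A minimal sketch (the flag exists in these versions; whether it fully avoids the stale file in 1.12.1 is untested here):

from datasets import load_dataset

ids = ["1272-141231-0000"]
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
# Recompute the filter instead of loading the (stale) cached result.
ds = ds.filter(lambda x: x["id"] in ids, load_from_cache_file=False)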

Expected results

Same result as with the previous datasets version.

Actual results

Reusing dataset librispeech_asr (./.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/468ec03677f46a8714ac6b5b64dba02d246a228d92cbbad7f3dc190fa039eab1)
Loading cached processed dataset at ./.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/468ec03677f46a8714ac6b5b64dba02d246a228d92cbbad7f3dc190fa039eab1/cache-cd1c29844fdbc87a.arrow
Traceback (most recent call last):
  File "./repos/transformers/src/transformers/models/wav2vec2/try_dataset.py", line 5, in <module>
    ds = ds.filter(lambda x: x["id"] in ids)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/fingerprint.py", line 398, in wrapper
    out = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2169, in filter
    indices = self.map(
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1686, in map
    return self._map_single(
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/fingerprint.py", line 398, in wrapper
    out = func(self, *args, **kwargs)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1896, in _map_single
    return Dataset.from_file(cache_file_name, info=info, split=self.split)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 343, in from_file
    return cls(
  File "./envs/transformers/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 282, in __init__
    self.info.features = self.info.features.reorder_fields_as(inferred_features)
  File "./envs/transformers/lib/python3.8/site-packages/datasets/features.py", line 1151, in reorder_fields_as
    return Features(recursive_reorder(self, other))
  File "./envs/transformers/lib/python3.8/site-packages/datasets/features.py", line 1140, in recursive_reorder
    raise ValueError(f"Keys mismatch: between {source} and {target}" + stack_position)
ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}

Process finished with exit code 1

Environment info

  • datasets version: 1.12.1
  • Platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.17
  • Python version: 3.8.10
  • PyArrow version: 5.0.0
anton-l added the bug label on Sep 19, 2021
lhoestq self-assigned this on Sep 20, 2021

lhoestq commented Sep 20, 2021

Hi ! I guess the caching mechanism should have considered the new filter to be different from the old one, and not reuse the cached results from the old filter.
To avoid other users hitting this issue, we could make the caching differentiate the two. What do you think?
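
(For context: a transform's cached result is looked up via a fingerprint derived from the parent dataset's fingerprint and the transform's arguments. The sketch below only illustrates the general idea with a hypothetical helper; it is not the actual change made in #2947.)

from datasets.fingerprint import Hasher

# Hypothetical helper, for illustration only: keying the cache on the parent
# fingerprint, the transform, its arguments, and a version salt would keep a
# new implementation of filter from picking up cache files written by an old one.
def transform_fingerprint(parent_fingerprint, transform_name, args, version_salt):
    return Hasher.hash((parent_fingerprint, transform_name, args, version_salt))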


anton-l commented Sep 20, 2021

If it's easy enough to implement, then yes please 😄 But this issue can be low-priority, since I've only encountered it in a couple of transformers CI tests.


lhoestq commented Sep 20, 2021

Well, it can cause issues for anyone who updates datasets and re-runs code that uses filter, so I'm creating a PR.


lhoestq commented Sep 20, 2021

I just merged a fix, let me know if you're still having this kind of issue :)

We'll do a release soon to make this fix available


anton-l commented Sep 20, 2021

Definitely works on several manual cases with our dummy datasets, thank you @lhoestq !

albertvillanova commented

Fixed by #2947.
