Latest fsspec==2023.10.0 issue with streaming datasets #6330

ZachNagengast opened this issue Oct 22, 2023 · 8 comments · Fixed by #6331 or #6334

Describe the bug

Loading a streaming dataset with this version of fsspec fails with the following error:

NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.

I suspect the issue is with this PR


Steps to reproduce the bug

  1. Upgrade fsspec to version 2023.10.0
  2. Attempt to load a streaming dataset e.g. load_dataset("laion/gpt4v-emotion-dataset", split="train", streaming=True)
  3. Observe the following exception:
  File "/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/datasets/", line 2146, in load_dataset
    return builder_instance.as_streaming_dataset(split=split)
  File "/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/datasets/", line 1318, in as_streaming_dataset
    raise NotImplementedError(
NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.

Expected behavior

Should stream the dataset as normal.

Environment info


I also encountered a similar error, shown below.
I would appreciate it if the team could shed some light on this issue.

NotImplementedError                       Traceback (most recent call last)
/home/ubuntu/work/EveryDream2trainer/prepare_dataset.ipynb Cell 1 line 4
      1 from datasets import load_dataset, load_dataset
      3 # ds = load_dataset("parquet", data_dir="/home/ubuntu/work/EveryDream2trainer/datasets/monse_v1/data")
----> 4 ds = load_dataset("Raspberry-ai/monse-v1")

File /opt/conda/envs/everydream/lib/python3.10/site-packages/datasets/, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1800 # Build dataset for splits
   1801 keep_in_memory = (
   1802     keep_in_memory if keep_in_memory is not None else is_small_dataset(
   1803 )
-> 1804 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
   1805 # Rename and cast features to match task schema
   1806 if task is not None:

File /opt/conda/envs/everydream/lib/python3.10/site-packages/datasets/, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1106 is_local = not is_remote_filesystem(self._fs)
   1107 if not is_local:
-> 1108     raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
   1109 if not os.path.exists(self._output_dir):
   1110     raise FileNotFoundError(
   1111         f"Dataset {}: could not find data in {self._output_dir}. Please make sure to call "
   1112         "builder.download_and_prepare(), or use "
   1113         "datasets.load_dataset() before trying to access the Dataset object."
   1114     )

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

Code to reproduce the issue:

from datasets import load_dataset

ds = load_dataset("Raspberry-ai/monse-v1")


Package                   Version
------------------------- ------------
absl-py                   2.0.0
accelerate                0.23.0
aiohttp                   3.8.4
aiosignal                 1.3.1
antlr4-python3-runtime    4.9.3
anyio                     4.0.0
appdirs                   1.4.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.0
async-lru                 2.0.4
async-timeout             4.0.3
attrs                     23.1.0
Babel                     2.13.0
backcall                  0.2.0
beautifulsoup4            4.12.2
bitsandbytes              0.41.1
bleach                    6.1.0
braceexpand               0.1.7
cachetools                5.3.1
certifi                   2023.7.22
cffi                      1.16.0
charset-normalizer        3.3.1
click                     8.1.7
cmake                     3.27.7
colorama                  0.4.6
comm                      0.1.4
compel                    1.1.6
datasets                  2.11.0
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
diffusers                 0.18.0
dill                      0.3.6
docker-pycreds            0.4.0
dowg                      0.3.1
einops                    0.7.0
einops-exts               0.0.4
exceptiongroup            1.1.3
executing                 2.0.0
fastjsonschema            2.18.1
filelock                  3.12.4
fqdn                      1.5.1
frozenlist                1.4.0
fsspec                    2023.10.0
ftfy                      6.1.1
gitdb                     4.0.11
GitPython                 3.1.40
google-auth               2.23.3
google-auth-oauthlib      1.1.0
grpcio                    1.59.0
huggingface-hub           0.18.0
idna                      3.4
importlib-metadata        6.8.0
inflection                0.5.1
ipykernel                 6.25.2
ipython                   8.16.1
isoduration               20.11.0
jedi                      0.19.1
Jinja2                    3.1.2
joblib                    1.3.2
json5                     0.9.14
jsonpointer               2.4
jsonschema                4.19.1
jsonschema-specifications 2023.7.1
jupyter_client            8.4.0
jupyter_core              5.4.0
jupyter-events            0.8.0
jupyter-lsp               2.2.0
jupyter_server            2.8.0
jupyter_server_terminals  0.4.4
jupyterlab                4.0.7
jupyterlab-pygments       0.2.2
jupyterlab_server         2.25.0
lightning-utilities       0.9.0
lion-pytorch              0.1.2
lit                       17.0.3
Markdown                  3.5
MarkupSafe                2.1.3
matplotlib-inline         0.1.6
mistune                   3.0.2
more-itertools            10.1.0
mpmath                    1.3.0
multidict                 6.0.4
multiprocess              0.70.14
mypy-extensions           1.0.0
nbclient                  0.8.0
nbconvert                 7.9.2
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.2
nltk                      3.8.1
notebook_shim             0.2.3
numpy                     1.23.5
oauthlib                  3.2.2
omegaconf                 2.2.3
open-clip-torch           2.22.0
open-flamingo             2.0.0
overrides                 7.4.0
packaging                 23.2
pandas                    2.1.1
pandocfilters             1.5.0
parso                     0.8.3
pathtools                 0.1.2
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    10.1.0
pip                       23.3.1
platformdirs              3.11.0
prometheus-client         0.17.1
prompt-toolkit            3.0.39
protobuf                  3.20.1
psutil                    5.9.6
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   13.0.0
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pycparser                 2.21
pyDeprecate               0.3.2
Pygments                  2.16.1
pynvml                    11.4.1
pyparsing                 3.1.1
pyre-extensions           0.0.29
python-dateutil           2.8.2
python-json-logger        2.0.7
pytorch-lightning         1.6.5
pytz                      2023.3.post1
PyYAML                    6.0.1
pyzmq                     25.1.1
referencing               0.30.2
regex                     2023.10.3
requests                  2.31.0
requests-oauthlib         1.3.1
responses                 0.18.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.10.6
rsa                       4.9
safetensors               0.4.0
scipy                     1.11.3
Send2Trash                1.8.2
sentencepiece             0.1.98
sentry-sdk                1.32.0
setproctitle              1.3.3
setuptools                68.2.2
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.0
soupsieve                 2.5
stack-data                0.6.3
sympy                     1.12
tensorboard               2.15.0
tensorboard-data-server   0.7.1
terminado                 0.17.1
timm                      0.9.8
tinycss2                  1.2.1
tokenizers                0.13.3
tomli                     2.0.1
torch                     2.0.1+cu118
torchmetrics              1.2.0
torchvision               0.15.2+cu118
tornado                   6.3.3
tqdm                      4.66.1
traitlets                 5.11.2
transformers              4.29.2
triton                    2.0.0
typing_extensions         4.8.0
typing-inspect            0.9.0
tzdata                    2023.3
uri-template              1.3.0
urllib3                   2.0.7
wandb                     0.15.12
wcwidth                   0.2.8
webcolors                 1.13
webdataset                0.2.62
webencodings              0.5.1
websocket-client          1.6.4
Werkzeug                  3.0.0
wheel                     0.41.2
xformers                  0.0.20
xxhash                    3.4.1
yarl                      1.9.2
zipp                      3.17.0

Contributor Author

@humpydonkey FWIW setting fsspec down to 2023.9.2 fixed the issue

pip install fsspec==2023.9.2

got it, thanks @ZachNagengast

@albertvillanova albertvillanova self-assigned this Oct 23, 2023
Thanks for reporting and for the investigation, @ZachNagengast! 🤗

We are investigating the root cause of the issue. In the meantime, we are going to pin fsspec < 2023.10.0.
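
For anyone who wants to check whether a given fsspec version falls outside this pin, here is a minimal sketch. The helper names `parse_version` and `violates_pin` are made up for illustration and are not part of fsspec or datasets. The comparison is numeric on purpose: a plain string comparison would rank "2023.9.2" above "2023.10.0".

```python
import re

# First fsspec release reported in this thread as breaking streaming.
FIRST_AFFECTED = (2023, 10, 0)


def parse_version(version: str) -> tuple:
    """Hypothetical helper: turn '2023.10.0' into (2023, 10, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as '2023.10' so tuple comparison stays fair.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def violates_pin(version: str) -> bool:
    """True when the version is outside the 'fsspec < 2023.10.0' pin."""
    return parse_version(version) >= FIRST_AFFECTED


print(violates_pin("2023.9.2"))   # False
print(violates_pin("2023.10.0"))  # True
```

Whether a version outside the pin actually fails also depends on the installed datasets release, so treat the result as a hint rather than a diagnosis.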

ap-- added a commit to ap--/datasets that referenced this issue Oct 23, 2023
Close huggingface#6330

`LocalFileSystem.protocol` was changed from `str` "file" to `tuple[str,...]` ("file", "local")
in `fsspec>=2023.10.0`

This commit supports both styles.
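
A rough sketch of what "supports both styles" can look like, assuming the value that changed type is the local filesystem's protocol. The function name `is_local_protocol` and the stand-in values are assumptions for illustration; this is not the code from the referenced commit. It normalizes the protocol to a tuple before testing membership, so a bare string and a tuple are handled the same way.

```python
def is_local_protocol(protocol) -> bool:
    """Hypothetical helper: accept both the old str and the new tuple form."""
    protocols = (protocol,) if isinstance(protocol, str) else tuple(protocol)
    return "file" in protocols


old_style = "file"             # str form, per the commit message
new_style = ("file", "local")  # tuple form, per the commit message

# A naive inequality test misreads the tuple form as "not local".
print(new_style != "file")           # True
print(is_local_protocol(old_style))  # True
print(is_local_protocol(new_style))  # True
print(is_local_protocol("s3"))       # False
```

The naive comparison on the first `print` line would explain why a local cache could be reported as unsupported, though that reading is an inference from the error message rather than something stated in this thread.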
lhoestq pushed a commit that referenced this issue Oct 23, 2023
Close #6330

`LocalFileSystem.protocol` was changed from `str` "file" to `tuple[str,...]` ("file", "local")
in `fsspec>=2023.10.0`

This commit supports both styles.
albertvillanova pushed a commit that referenced this issue Oct 24, 2023
Close #6330

`LocalFileSystem.protocol` was changed from `str` "file" to `tuple[str,...]` ("file", "local")
in `fsspec>=2023.10.0`

This commit supports both styles.
finger92 added a commit to protagolabs/NMP-GPT2-Tutorial that referenced this issue Nov 1, 2023
chuanli11 added a commit to LambdaLabsML/DeepSpeedExamples that referenced this issue Nov 1, 2023
Fix for "NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet."

huggingface/datasets#6330 (comment)
RubTalha commented Nov 7, 2023

lhoestq commented Nov 7, 2023

You can also update datasets:

pip install -U datasets

This will also update fsspec to a compatible version.
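
To confirm which versions ended up installed after upgrading, something like the following may help. It is only a sketch: `installed_version` is a hypothetical helper name, and it relies on nothing beyond the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the two packages discussed in this issue.
for name in ("datasets", "fsspec"):
    print(name, installed_version(name))
```

Including this output in a follow-up report makes it easier to tell whether the environment still has the older datasets release paired with a newer fsspec.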

bgamari added a commit to bgamari/nixpkgs that referenced this issue Jan 21, 2024
This seems to work fine in 2.19.0. Hopefully it will not break again

sunosa commented May 8, 2024

not working for 2.19.1 !
