Latest fsspec==2023.10.0 issue with streaming datasets #6330

Closed
ZachNagengast opened this issue Oct 22, 2023 · 8 comments · Fixed by #6331 or #6334

@ZachNagengast
Contributor

Describe the bug

Loading a streaming dataset with this version of fsspec fails with the following error:

NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.

I suspect the issue was introduced by this PR:

fsspec/filesystem_spec#1381

Steps to reproduce the bug

  1. Upgrade fsspec to version 2023.10.0
  2. Attempt to load a streaming dataset e.g. load_dataset("laion/gpt4v-emotion-dataset", split="train", streaming=True)
  3. Observe the following exception:
  File "/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/datasets/load.py", line 2146, in load_dataset
    return builder_instance.as_streaming_dataset(split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/datasets/builder.py", line 1318, in as_streaming_dataset
    raise NotImplementedError(
NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.

Expected behavior

Should stream the dataset as normal.

Environment info

datasets@main
fsspec==2023.10.0

@humpydonkey

I encountered a similar error, shown below.
I would appreciate it if the team could shed some light on this issue.

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/home/ubuntu/work/EveryDream2trainer/prepare_dataset.ipynb Cell 1 line 4
      1 from datasets import load_dataset, load_dataset
      3 # ds = load_dataset("parquet", data_dir="/home/ubuntu/work/EveryDream2trainer/datasets/monse_v1/data")
----> 4 ds = load_dataset("Raspberry-ai/monse-v1")

File /opt/conda/envs/everydream/lib/python3.10/site-packages/datasets/load.py:1804, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1800 # Build dataset for splits
   1801 keep_in_memory = (
   1802     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1803 )
-> 1804 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
   1805 # Rename and cast features to match task schema
   1806 if task is not None:

File /opt/conda/envs/everydream/lib/python3.10/site-packages/datasets/builder.py:1108, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1106 is_local = not is_remote_filesystem(self._fs)
   1107 if not is_local:
-> 1108     raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
   1109 if not os.path.exists(self._output_dir):
   1110     raise FileNotFoundError(
   1111         f"Dataset {self.name}: could not find data in {self._output_dir}. Please make sure to call "
   1112         "builder.download_and_prepare(), or use "
   1113         "datasets.load_dataset() before trying to access the Dataset object."
   1114     )

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

Code to reproduce the issue:

from datasets import load_dataset

ds = load_dataset("Raspberry-ai/monse-v1")

Dependencies:

Package                   Version
------------------------- ------------
absl-py                   2.0.0
accelerate                0.23.0
aiohttp                   3.8.4
aiosignal                 1.3.1
antlr4-python3-runtime    4.9.3
anyio                     4.0.0
appdirs                   1.4.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.0
async-lru                 2.0.4
async-timeout             4.0.3
attrs                     23.1.0
Babel                     2.13.0
backcall                  0.2.0
beautifulsoup4            4.12.2
bitsandbytes              0.41.1
bleach                    6.1.0
braceexpand               0.1.7
cachetools                5.3.1
certifi                   2023.7.22
cffi                      1.16.0
charset-normalizer        3.3.1
click                     8.1.7
cmake                     3.27.7
colorama                  0.4.6
comm                      0.1.4
compel                    1.1.6
datasets                  2.11.0
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
diffusers                 0.18.0
dill                      0.3.6
docker-pycreds            0.4.0
dowg                      0.3.1
einops                    0.7.0
einops-exts               0.0.4
exceptiongroup            1.1.3
executing                 2.0.0
fastjsonschema            2.18.1
filelock                  3.12.4
fqdn                      1.5.1
frozenlist                1.4.0
fsspec                    2023.10.0
ftfy                      6.1.1
gitdb                     4.0.11
GitPython                 3.1.40
google-auth               2.23.3
google-auth-oauthlib      1.1.0
grpcio                    1.59.0
huggingface-hub           0.18.0
idna                      3.4
importlib-metadata        6.8.0
inflection                0.5.1
ipykernel                 6.25.2
ipython                   8.16.1
isoduration               20.11.0
jedi                      0.19.1
Jinja2                    3.1.2
joblib                    1.3.2
json5                     0.9.14
jsonpointer               2.4
jsonschema                4.19.1
jsonschema-specifications 2023.7.1
jupyter_client            8.4.0
jupyter_core              5.4.0
jupyter-events            0.8.0
jupyter-lsp               2.2.0
jupyter_server            2.8.0
jupyter_server_terminals  0.4.4
jupyterlab                4.0.7
jupyterlab-pygments       0.2.2
jupyterlab_server         2.25.0
lightning-utilities       0.9.0
lion-pytorch              0.1.2
lit                       17.0.3
Markdown                  3.5
MarkupSafe                2.1.3
matplotlib-inline         0.1.6
mistune                   3.0.2
more-itertools            10.1.0
mpmath                    1.3.0
multidict                 6.0.4
multiprocess              0.70.14
mypy-extensions           1.0.0
nbclient                  0.8.0
nbconvert                 7.9.2
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.2
nltk                      3.8.1
notebook_shim             0.2.3
numpy                     1.23.5
oauthlib                  3.2.2
omegaconf                 2.2.3
open-clip-torch           2.22.0
open-flamingo             2.0.0
overrides                 7.4.0
packaging                 23.2
pandas                    2.1.1
pandocfilters             1.5.0
parso                     0.8.3
pathtools                 0.1.2
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    10.1.0
pip                       23.3.1
platformdirs              3.11.0
prometheus-client         0.17.1
prompt-toolkit            3.0.39
protobuf                  3.20.1
psutil                    5.9.6
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   13.0.0
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pycparser                 2.21
pyDeprecate               0.3.2
Pygments                  2.16.1
pynvml                    11.4.1
pyparsing                 3.1.1
pyre-extensions           0.0.29
python-dateutil           2.8.2
python-json-logger        2.0.7
pytorch-lightning         1.6.5
pytz                      2023.3.post1
PyYAML                    6.0.1
pyzmq                     25.1.1
referencing               0.30.2
regex                     2023.10.3
requests                  2.31.0
requests-oauthlib         1.3.1
responses                 0.18.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.10.6
rsa                       4.9
safetensors               0.4.0
scipy                     1.11.3
Send2Trash                1.8.2
sentencepiece             0.1.98
sentry-sdk                1.32.0
setproctitle              1.3.3
setuptools                68.2.2
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.0
soupsieve                 2.5
stack-data                0.6.3
sympy                     1.12
tensorboard               2.15.0
tensorboard-data-server   0.7.1
terminado                 0.17.1
timm                      0.9.8
tinycss2                  1.2.1
tokenizers                0.13.3
tomli                     2.0.1
torch                     2.0.1+cu118
torchmetrics              1.2.0
torchvision               0.15.2+cu118
tornado                   6.3.3
tqdm                      4.66.1
traitlets                 5.11.2
transformers              4.29.2
triton                    2.0.0
types-python-dateutil     2.8.19.14
typing_extensions         4.8.0
typing-inspect            0.9.0
tzdata                    2023.3
uri-template              1.3.0
urllib3                   2.0.7
wandb                     0.15.12
wcwidth                   0.2.8
webcolors                 1.13
webdataset                0.2.62
webencodings              0.5.1
websocket-client          1.6.4
Werkzeug                  3.0.0
wheel                     0.41.2
xformers                  0.0.20
xxhash                    3.4.1
yarl                      1.9.2
zipp                      3.17.0

@ZachNagengast
Contributor Author

@humpydonkey FWIW, downgrading fsspec to 2023.9.2 fixed the issue for me:

pip install fsspec==2023.9.2
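
A minimal sketch for checking whether a given fsspec release string falls below the affected 2023.10.0 release. The helper names `parse_calver` and `below_affected_release` are hypothetical, and the parsing assumes purely numeric, dot-separated versions (no `rc` or `.post` suffixes). The comparison is done on integer tuples because a plain string comparison would rank "2023.9.2" above "2023.10.0".

```python
def parse_calver(version: str) -> tuple[int, ...]:
    """Turn a release string such as '2023.10.0' into (2023, 10, 0).

    Assumption: every dot-separated part is a plain integer.
    """
    return tuple(int(part) for part in version.split("."))


# First fsspec release reported as failing in this thread.
FIRST_AFFECTED = parse_calver("2023.10.0")


def below_affected_release(version: str) -> bool:
    """Return True if `version` is older than the first affected release."""
    return parse_calver(version) < FIRST_AFFECTED


print(below_affected_release("2023.9.2"))   # True
print(below_affected_release("2023.10.0"))  # False
```
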

@humpydonkey

Got it, thanks @ZachNagengast!

albertvillanova self-assigned this Oct 23, 2023
@albertvillanova
Member

Thanks for reporting and for the investigation, @ZachNagengast! 🤗

We are investigating the root cause of the issue. In the meantime, we are going to pin fsspec < 2023.10.0.

ap-- added a commit to ap--/datasets that referenced this issue Oct 23, 2023
Close huggingface#6330

`fsspec.implementations.LocalFilesystem.protocol`
was changed from `str` "file" to `tuple[str,...]` ("file", "local")
in `fsspec>=2023.10.0`

This commit supports both styles.
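
To illustrate the commit message above, here is a hedged sketch of why a `str`-only comparison misclassifies the local filesystem once `protocol` becomes a tuple, and one way to accept both styles. It is not the actual patch: the three classes are hypothetical stand-ins for filesystem objects (so the snippet runs without fsspec installed), and `naive_is_remote` is only an assumed shape for the check behind the `is_remote_filesystem` call seen in the traceback.

```python
class OldStyleLocalFS:
    """Stand-in for a local filesystem before fsspec 2023.10.0."""
    protocol = "file"


class NewStyleLocalFS:
    """Stand-in for a local filesystem from fsspec 2023.10.0 onwards."""
    protocol = ("file", "local")


class FakeRemoteFS:
    """Stand-in for any non-local filesystem (protocol names are illustrative)."""
    protocol = ("s3", "s3a")


def naive_is_remote(fs) -> bool:
    # Assumed shape of the original check: only correct when protocol is a str.
    return fs.protocol != "file"


def is_remote(fs) -> bool:
    # Normalise a str or a tuple of str to a tuple, then test membership.
    protocol = fs.protocol
    protocols = (protocol,) if isinstance(protocol, str) else tuple(protocol)
    return "file" not in protocols


for fs in (OldStyleLocalFS(), NewStyleLocalFS(), FakeRemoteFS()):
    print(type(fs).__name__, naive_is_remote(fs), is_remote(fs))
```

With the tuple-valued protocol, the naive comparison reports the local filesystem as remote, which matches the `NotImplementedError` in this issue; the normalised check treats both styles as local.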
lhoestq pushed a commit that referenced this issue Oct 23, 2023
Close #6330

`fsspec.implementations.LocalFilesystem.protocol`
was changed from `str` "file" to `tuple[str,...]` ("file", "local")
in `fsspec>=2023.10.0`

This commit supports both styles.
albertvillanova pushed a commit that referenced this issue Oct 24, 2023
Close #6330

`fsspec.implementations.LocalFilesystem.protocol`
was changed from `str` "file" to `tuple[str,...]` ("file", "local")
in `fsspec>=2023.10.0`

This commit supports both styles.
finger92 added a commit to protagolabs/NMP-GPT2-Tutorial that referenced this issue Nov 1, 2023
chuanli11 added a commit to LambdaLabsML/DeepSpeedExamples that referenced this issue Nov 1, 2023
Fix for "NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet."

huggingface/datasets#6330 (comment)
@RubTalha

RubTalha commented Nov 7, 2023

https://stackoverflow.com/questions/77433096/notimplementederror-loading-a-dataset-cached-in-a-localfilesystem-is-not-suppor/77433141#77433141

@lhoestq
Member

lhoestq commented Nov 7, 2023

You can also update datasets:

pip install -U datasets

This will also update fsspec to a compatible version.
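
After upgrading, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the function name `installed_versions` is hypothetical, and a package that is not installed is reported as `None` rather than raising.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("datasets", "fsspec")):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```
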

bgamari added a commit to bgamari/nixpkgs that referenced this issue Jan 21, 2024
@dhruv-anand-aintech

This seems to work fine in 2.19.0. Hopefully it will not break again.

@sunosa

sunosa commented May 8, 2024

Not working for 2.19.1!
