
Unable to load dataset that was saved with save_to_disk #6703

Closed
casper-hansen opened this issue Mar 1, 2024 · 8 comments

casper-hansen commented Mar 1, 2024

Describe the bug

I get the following error message: You are trying to load a dataset that was saved using save_to_disk. Please use load_from_disk instead.

Steps to reproduce the bug

  1. Save a dataset with save_to_disk
  2. Try to load it with load_dataset (see the sketch below)
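
A minimal sketch of the reproduction (paths are illustrative):

from datasets import Dataset, load_dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
ds.save_to_disk("my/local/dir")

# later: this raises the error above instead of loading the dataset
reloaded = load_dataset("my/local/dir")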

Expected behavior

I expect to be able to load the dataset again with load_dataset, which most packages use rather than load_from_disk. I want a workaround that lets me create the same indexing that push_to_hub creates for you before using save_to_disk - how can that be achieved?

Environment info

datasets 2.17.1, python 3.10

lhoestq (Member) commented Mar 2, 2024

save_to_disk uses a special serialization that can only be read using load_from_disk.

Contrary to load_dataset, load_from_disk directly loads Arrow files and uses the dataset directory as cache.

On the other hand, load_dataset does a conversion step to get Arrow files from the raw data files (which could be JSON, CSV, Parquet, etc.) and caches them in the datasets cache directory (default is ~/.cache/huggingface/datasets). We haven't implemented any logic in load_dataset to support datasets saved with save_to_disk because they don't use the same cache.
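
For illustration, the two entry points look like this (paths below are made up):

from datasets import load_from_disk, load_dataset

ds = load_from_disk("path/to/save_to_disk/output")        # Arrow files + metadata written by save_to_disk
ds = load_dataset("csv", data_files="path/to/data.csv")   # raw files, converted and cached in ~/.cache/huggingface/datasets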

EDIT: note that you can save your dataset in Parquet format locally using .to_parquet() (make sure to shard your dataset into multiple files if it's multiple GBs - you can use .shard() + .to_parquet() to do that) and you'll be able to reload it using load_dataset

casper-hansen (Author) commented:

@lhoestq, so do I understand correctly that if I run to_parquet() and then save_to_disk(), I can load it with load_dataset? If yes, that would resolve this issue (and should probably be documented somewhere 😄)

lhoestq (Member) commented Mar 2, 2024

Here is an example:

ds.to_parquet("my/local/dir/data.parquet")

# later
ds = load_dataset("my/local/dir")

and for bigger datasets:

num_shards = 1024  # set number of files to save (e.g. try to have files smaller than 5GB)
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(f"my/local/dir/{shard_idx:05d}.parquet")  # 00000.parquet to 01023.parquet

# later
ds = load_dataset("my/local/dir")

I hope this helps :)

casper-hansen (Author) commented:

Thanks for helping out! Does this approach work with s3fs? e.g. something like this:

import s3fs
s3 = s3fs.S3FileSystem(anon=True)
with s3.open('mybucket/new-file.parquet', 'wb') as f:  # Parquet is binary, so 'wb'
    ds.to_parquet(f)

This would replace save_to_disk for saving to an S3 bucket.

Otherwise, I am not sure how to make this work when saving the dataset to an S3 bucket. Would dataset.set_format("arrow") work as a replacement?

lhoestq (Member) commented Mar 2, 2024

load_dataset doesn't support S3 buckets unfortunately :/

casper-hansen (Author) commented:

load_dataset doesn't support S3 buckets unfortunately :/

I am aware, but I have some code that downloads it to disk before using that method. The most important part is to store it in a format that load_dataset is compatible with.

lhoestq (Member) commented Mar 2, 2024

Feel free to use Parquet then :)

casper-hansen (Author) commented:

I ended up with this. It's not ideal to have to save to local disk, but it works and loads via load_dataset after downloading from S3 with another method.

import tempfile

from datasets.utils.py_utils import convert_file_size_to_int

# ds, max_shard_size (e.g. "500MB"), fs (an s3fs.S3FileSystem) and s3_path are defined elsewhere
with tempfile.TemporaryDirectory() as dir:
    # estimate how many shards are needed so each file stays under max_shard_size
    dataset_nbytes = ds._estimate_nbytes()
    max_shard_size_local = convert_file_size_to_int(max_shard_size)
    num_shards = int(dataset_nbytes / max_shard_size_local) + 1

    # write each shard as a separate Parquet file
    for shard_idx in range(num_shards):
        shard = ds.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"{dir}/{shard_idx:05d}.parquet")

    # upload the whole directory of Parquet files to S3
    fs.upload(
        lpath=dir,
        rpath=s3_path,
        recursive=True,
    )
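
For completeness, a rough sketch of the read-back side, assuming the same fs and s3_path as above (names are illustrative):

import tempfile

from datasets import load_dataset

with tempfile.TemporaryDirectory() as local_dir:
    # pull the Parquet shards back down from S3
    fs.download(rpath=s3_path, lpath=local_dir, recursive=True)
    # let the parquet builder pick up every shard under the directory
    ds = load_dataset("parquet", data_dir=local_dir)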
