
Unable to load dataset that was saved with save_to_disk #6703

Closed
casper-hansen opened this issue Mar 1, 2024 · 8 comments

casper-hansen commented Mar 1, 2024

Describe the bug

I get the following error message: You are trying to load a dataset that was saved using save_to_disk. Please use load_from_disk instead.

Steps to reproduce the bug

  1. Save a dataset with save_to_disk
  2. Try to load it with load_dataset (see the sketch below)
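
A minimal sketch of the reproduction (paths are illustrative):

from datasets import Dataset, load_dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
ds.save_to_disk("my/local/dir")

# later: this raises the error above instead of loading the dataset
reloaded = load_dataset("my/local/dir")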

Expected behavior

I expect to be able to load the dataset again with load_dataset, which most packages use rather than load_from_disk. I want a workaround that lets me create the same indexing that push_to_hub creates for you before using save_to_disk - how can that be achieved?

Environment info

datasets 2.17.1, python 3.10

lhoestq (Member) commented Mar 2, 2024

save_to_disk uses a special serialization that can only be read using load_from_disk.

Contrary to load_dataset, load_from_disk directly loads Arrow files and uses the dataset directory as cache.

On the other hand, load_dataset does a conversion step to get Arrow files from the raw data files (which could be JSON, CSV, Parquet, etc.) and caches them in the datasets cache directory (default is ~/.cache/huggingface/datasets). We haven't implemented any logic in load_dataset to support datasets saved with save_to_disk because they don't use the same cache.
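
For illustration, the two entry points look like this (paths below are made up):

from datasets import load_from_disk, load_dataset

ds = load_from_disk("path/to/save_to_disk/output")        # Arrow files + metadata written by save_to_disk
ds = load_dataset("csv", data_files="path/to/data.csv")   # raw files, converted and cached in ~/.cache/huggingface/datasets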

EDIT: note that you can save your dataset in Parquet format locally using .to_parquet() (make sure to shard your dataset into multiple files if it's multiple GBs - you can use .shard() + .to_parquet() to do that) and you'll be able to reload it using load_dataset

casper-hansen (Author) commented:

@lhoestq, so do I understand correctly that if I run to_parquet() and then save_to_disk(), I can load it with load_dataset? If yes, that would resolve this issue (and should probably be documented somewhere 😄)

lhoestq (Member) commented Mar 2, 2024

Here is an example:

ds.to_parquet("my/local/dir/data.parquet")

# later
ds = load_dataset("my/local/dir")

and for bigger datasets:

num_shards = 1024  # set number of files to save (e.g. try to have files smaller than 5GB)
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(f"my/local/dir/{shard_idx:05d}.parquet")  # 00000.parquet to 01023.parquet

# later
ds = load_dataset("my/local/dir")

I hope this helps :)

casper-hansen (Author) commented:

Thanks for helping out! Does this approach work with s3fs? e.g. something like this:

import s3fs
s3 = s3fs.S3FileSystem(anon=True)
with s3.open('mybucket/new-file.parquet', 'wb') as f:  # Parquet is binary, so 'wb'
    ds.to_parquet(f)

This would replace save_to_disk for saving to an S3 bucket.

Otherwise, I am not sure how to make this work when saving the dataset to an S3 bucket. Would dataset.set_format("arrow") work as a replacement?

lhoestq (Member) commented Mar 2, 2024

load_dataset doesn't support S3 buckets unfortunately :/

casper-hansen (Author) commented:

load_dataset doesn't support S3 buckets unfortunately :/

I am aware, but I have some code that downloads it to disk before using that method. The most important part is to store it in a format that load_dataset is compatible with.

lhoestq (Member) commented Mar 2, 2024

Feel free to use Parquet then :)

casper-hansen (Author) commented:

I ended up with this. It's not ideal to have to save to local disk, but it works and loads via load_dataset after downloading from S3 with another method.

import tempfile

from datasets.utils.py_utils import convert_file_size_to_int

# ds, max_shard_size (e.g. "500MB"), fs (an s3fs.S3FileSystem) and s3_path are defined elsewhere
with tempfile.TemporaryDirectory() as dir:
    # estimate how many shards are needed so each file stays under max_shard_size
    dataset_nbytes = ds._estimate_nbytes()
    max_shard_size_local = convert_file_size_to_int(max_shard_size)
    num_shards = int(dataset_nbytes / max_shard_size_local) + 1

    # write each shard as a separate Parquet file
    for shard_idx in range(num_shards):
        shard = ds.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"{dir}/{shard_idx:05d}.parquet")

    # upload the whole directory of Parquet files to S3
    fs.upload(
        lpath=dir,
        rpath=s3_path,
        recursive=True,
    )
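
For completeness, a rough sketch of the read-back side, assuming the same fs and s3_path as above (names are illustrative):

import tempfile

from datasets import load_dataset

with tempfile.TemporaryDirectory() as local_dir:
    # pull the Parquet shards back down from S3
    fs.download(rpath=s3_path, lpath=local_dir, recursive=True)
    # let the parquet builder pick up every shard under the directory
    ds = load_dataset("parquet", data_dir=local_dir)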
