Unable to load dataset that was saved with save_to_disk
#6703
Comments
EDIT: note that you can save your dataset in Parquet format locally using `to_parquet`.
@lhoestq, so is it correctly understood that if I save the dataset with `to_parquet`, I can load it again with `load_dataset`?
Here is an example:

```python
ds.to_parquet("my/local/dir/data.parquet")

# later
ds = load_dataset("my/local/dir")
```

and for bigger datasets:

```python
num_shards = 1024  # set number of files to save (e.g. try to have files smaller than 5GB)
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(f"my/local/dir/{shard_idx:05d}.parquet")  # 00000.parquet to 01023.parquet

# later
ds = load_dataset("my/local/dir")
```

I hope this helps :)
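A side note (an addition for clarity, not part of the original reply): when `load_dataset` is pointed at a directory of Parquet files without a `split` argument, it returns a `DatasetDict`, with the data under a `"train"` split by default:

```python
from datasets import load_dataset

ds = load_dataset("my/local/dir")  # DatasetDict({"train": Dataset(...)})
train_ds = ds["train"]
# or request the split directly and get a Dataset back:
train_ds = load_dataset("my/local/dir", split="train")
```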
Thanks for helping out! Does this approach work with `s3fs`?

```python
import s3fs

s3 = s3fs.S3FileSystem(anon=True)
# Parquet is a binary format, so open the S3 object in "wb" rather than "w"
with s3.open('mybucket/new-file.parquet', 'wb') as f:
    ds.to_parquet(f)
```

This would be instead of saving to a local path first. Otherwise, I am not sure how to make this work when saving the dataset to an S3 bucket.
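For larger datasets, the sharding pattern from the earlier reply could be combined with `s3fs` in the same way. A minimal sketch, assuming `ds` is the dataset from this thread, a hypothetical bucket prefix `mybucket/data/`, and AWS credentials available in the environment (`anon=True` only works for publicly accessible buckets):

```python
import s3fs

s3 = s3fs.S3FileSystem()  # picks up ambient AWS credentials

num_shards = 64  # hypothetical shard count; choose so each file stays well below ~5GB
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    # to_parquet accepts a file-like object; Parquet is binary, hence "wb"
    with s3.open(f"mybucket/data/{shard_idx:05d}.parquet", "wb") as f:
        shard.to_parquet(f)
```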
I am aware, but I have some code that downloads it to disk before using that method. The most important part is to store it in a format that `load_dataset` is compatible with.
Feel free to use Parquet then :)
I ended up with this. Not ideal to have to save to local disk first, but it works, and the result loads via `load_dataset`:

```python
import tempfile

from datasets.utils.py_utils import convert_file_size_to_int

max_shard_size = "500MB"  # example value; not defined in the original snippet

with tempfile.TemporaryDirectory() as tmp_dir:
    dataset_nbytes = ds._estimate_nbytes()
    max_shard_size_local = convert_file_size_to_int(max_shard_size)
    num_shards = int(dataset_nbytes / max_shard_size_local) + 1
    for shard_idx in range(num_shards):
        shard = ds.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"{tmp_dir}/{shard_idx:05d}.parquet")
    # fs is an s3fs.S3FileSystem; upload the whole directory of shards in one call
    fs.upload(
        lpath=tmp_dir,
        rpath=s3_path,
        recursive=True,
    )
```
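For completeness, a sketch of the read side under the same assumptions (`fs` is the same `s3fs.S3FileSystem` and `s3_path` points at the uploaded shards); since `load_dataset` is given local files here, the shards are pulled down first:

```python
import tempfile

from datasets import load_dataset

with tempfile.TemporaryDirectory() as tmp_dir:
    # download/get are the standard fsspec calls for recursive copies
    fs.download(rpath=s3_path, lpath=tmp_dir, recursive=True)
    ds = load_dataset(tmp_dir, split="train")
```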
Describe the bug
I get the following error message:

> You are trying to load a dataset that was saved using `save_to_disk`. Please use `load_from_disk` instead.

Steps to reproduce the bug
1. Save a dataset with `save_to_disk`.
2. Try to load it again with `load_dataset` (see the snippet below).
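A minimal reproduction of these steps (a reconstruction, not the reporter's exact code):

```python
from datasets import Dataset, load_dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
ds.save_to_disk("my_dataset")     # writes Arrow shards plus state.json / dataset_info.json
ds2 = load_dataset("my_dataset")  # raises the error quoted above
```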
Expected behavior
I am able to load the dataset again with `load_dataset`, which most packages use over `load_from_disk`. I want a workaround that allows me to create the same indexing that `push_to_hub` creates for you before using `save_to_disk` - how can that be achieved?

Environment info
datasets 2.17.1, python 3.10