
Possible issue with latest s3fs (2024.3.0): max_concurrency kwarg #80

Closed · ryan-williams opened this issue Mar 17, 2024 · 6 comments · Fixed by fsspec/s3fs#863
Labels: bug (Something isn't working), p1-important

@ryan-williams

s3fs 2024.3.0 (released yesterday) added a `max_concurrency` kwarg (fsspec/s3fs#848), and today I have a job failing during a `dvc pull` from S3, referencing an unexpected keyword argument 'max_concurrency':

ERROR: failed to transfer '54a252d859eea2207da0fb933661dca0' - S3FileSystem._get_file() got an unexpected keyword argument 'max_concurrency'
ERROR: failed to pull data from the cloud - 1 files failed to download

(GHA link)

I was unable to repro it locally (with most of the same relevant versions: dvc{,_s3}, *boto*, s3fs), but pinning s3fs<=2024.2 allowed the same dvc pull to succeed (GHA link).

Mentioning here in case others run into it / can better triage.

shcheklein added a commit to shcheklein/s3fs that referenced this issue Mar 17, 2024
Fixes iterative/dvc-s3#80 

It is similar to `gcsfs` and `adlfs`.

On our end, it seems `max_concurrency` is passed here: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py#L210. Since the new s3fs version has this attribute, we now pass it, which most likely leads to this error.

I'm not sure why the `_get_file` part was not yet implemented. @pmrowla might have a better idea on whether it was complicated or just lower priority. It seems like the natural next step.
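
For context, a minimal sketch of the failure mode and of the shape of the fix that landed in fsspec/s3fs#863 (illustrative stubs, not the actual s3fs code):

```python
# Before the fix, _get_file declared no **kwargs, so any extra keyword
# argument (such as max_concurrency) raised TypeError.
class Before:
    def _get_file(self, rpath, lpath):
        print(f"downloading {rpath} -> {lpath}")

# After the fix, extra kwargs are accepted and simply ignored until
# chunked downloads are actually implemented for S3.
class After:
    def _get_file(self, rpath, lpath, **kwargs):
        print(f"downloading {rpath} -> {lpath}")

After()._get_file("s3://bucket/key", "/tmp/key", max_concurrency=4)   # works
Before()._get_file("s3://bucket/key", "/tmp/key", max_concurrency=4)  # TypeError
```
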
@shcheklein
Member

Thanks @ryan-williams. Should be fixed by fsspec/s3fs#863.

For now, we should probably also add a workaround here in dvc-objects (in the case of S3, avoid passing `max_concurrency` for get file); a possible shape is sketched below. Also, we should see why it was not implemented in the first place - @pmrowla, if you could give some context, that would be helpful - was it involved and/or not needed?
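
A possible shape for that workaround (hypothetical helper, not the actual dvc-objects code): only forward `max_concurrency` when the filesystem's `_get_file` can accept it.

```python
import inspect

def get_file_kwargs(fs, max_concurrency):
    # Forward max_concurrency only if fs._get_file declares it explicitly
    # or accepts arbitrary **kwargs; otherwise drop it to avoid TypeError.
    params = inspect.signature(fs._get_file).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if "max_concurrency" in params or accepts_var_kw:
        return {"max_concurrency": max_concurrency}
    return {}
```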

@pmrowla
Contributor

pmrowla commented Mar 17, 2024

Supporting concurrent chunked downloads is more complex than uploads, so it was not implemented due to time constraints.

For uploads (fsspec `_put_file`), the S3 API inherently supports uploading chunks out of order in multipart uploads; re-assembling them in order is handled on the Amazon/server end.

For downloads (fsspec `_get_file`), doing a chunked/concurrent download requires downloading the required byte ranges and then re-assembling them locally. There is no built-in support for the out-of-order re-assembly step in the S3 API or in the botocore libraries. (For adlfs, the Microsoft Azure Python SDK does include built-in support for concurrent chunked downloads, which is why it is supported in adlfs `_get_file`.)

Since there is no guarantee that downloads will be completed in order, this means completed chunks need to either be kept in memory or written out to temp files on disk before re-assembling them at the end of the download operation (once all chunks are available, or at least when the "next" sequential chunk is available).

This is something that can be implemented in s3fs, but IMO it would be better for it to be implemented at the outer client level (i.e. the client calling fsspec, in this case dvc-objects) which would make chunked downloads supported for all filesystems that support downloading a specific byte range (which is essentially every fsspec implementation).
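
A minimal sketch of that caller-level approach, assuming only fsspec's standard `size()` and ranged `cat_file(path, start=, end=)`; the chunk size and worker count are illustrative choices:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def ranged_get_file(fs, rpath, lpath, chunk_size=8 * 2**20, max_concurrency=4):
    size = fs.size(rpath)

    def fetch(start):
        end = min(start + chunk_size, size)
        # Ranged read: supported by essentially every fsspec implementation.
        return start, fs.cat_file(rpath, start=start, end=end)

    with open(lpath, "wb") as f, ThreadPoolExecutor(max_concurrency) as pool:
        futures = [pool.submit(fetch, s) for s in range(0, size, chunk_size)]
        for fut in as_completed(futures):  # completion order is arbitrary
            start, data = fut.result()
            f.seek(start)                  # re-assemble chunks in place
            f.write(data)
```

All writes go through one handle on the consuming thread, so each completed chunk is flushed to its offset as soon as it arrives rather than being held until the end.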

@martindurant

> Since there is no guarantee that downloads will be completed in order, this means completed chunks need to either be kept in memory or written out to temp files on disk before re-assembling them at the end of the download operation

I think all local filesystems support seeking beyond the end of the file to write data, so reassembly should not be hard. On Linux this even produces "sparse" files (on Windows you get padding), which is not important in this case because we intend to fill all the gaps.
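
A quick illustration of that point (the path is illustrative):

```python
import os

# Seek past EOF and write: the gap reads back as zeros, and on Linux it
# is typically stored as a hole (sparse file) rather than allocated bytes.
with open("/tmp/sparse_demo.bin", "wb") as f:
    f.seek(1_000_000)   # well beyond the (empty) file's end
    f.write(b"tail")

print(os.path.getsize("/tmp/sparse_demo.bin"))          # 1000004
print(os.stat("/tmp/sparse_demo.bin").st_blocks * 512)  # Unix-only attribute;
# on Linux this is typically far smaller than the apparent size.
```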

@dberenbaum
Contributor

dberenbaum commented Mar 19, 2024

Should we create a separate issue for concurrent chunked downloads?

Edit: And can we then close this issue? I see @martindurant released 2024.3.1.

@martindurant

> Should we create a separate issue for concurrent chunked downloads?

It's worth making a note, but someone should benchmark whether it actually makes a difference. Currently we go concurrent over all files (subject to a throttle limiting the number of file descriptors), but each file streams. Although Windows does allow seeking beyond a file's end to extend it, I wonder if it allows multiple writers on a single file. If not, each task would need to open/seek/write/close every time, as sketched below.
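
A sketch of that fallback (hypothetical helper): each task opens the target itself, writes its range, and closes, so no handle is ever shared between writers.

```python
def write_chunk(lpath, start, data):
    # Assumes the target file was created (and optionally pre-sized)
    # before the download tasks start.
    with open(lpath, "r+b") as f:  # open / seek / write / close per task
        f.seek(start)
        f.write(data)
```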

@skshetry
Member

skshetry commented Mar 19, 2024

Closed by fsspec/s3fs#863, and released in s3fs==2024.3.1.

clrpackages pushed a commit to clearlinux-pkgs/pypi-fsspec that referenced this issue Mar 19, 2024
…rsion 2024.3.1

commit efbe1e4c23a06e65b3df6a82f28fc49bab0dbd78
Author: Martin Durant <martindurant@users.noreply.github.com>
Date:   Mon Mar 18 15:42:28 2024 -0400

    changelog (#864)

commit 5cf759d2e670eb4cb79d978491bf42ed0eff23a5
Author: Ivan Shcheklein <shcheklein@gmail.com>
Date:   Mon Mar 18 07:40:19 2024 -0700

    fix(core): accept kwargs in get file (#863)

    Fixes iterative/dvc-s3#80