
Possible issue with latest s3fs (2024.3.0): max_concurrency kwarg #80

Closed · ryan-williams opened this issue Mar 17, 2024 · 6 comments · Fixed by fsspec/s3fs#863
Labels: bug (Something isn't working), p1-important

@ryan-williams

s3fs 2024.3.0 (released yesterday) added a `max_concurrency` kwarg (fsspec/s3fs#848), and today I have a job failing during a `dvc pull` from S3, referencing an unexpected keyword argument 'max_concurrency':

ERROR: failed to transfer '54a252d859eea2207da0fb933661dca0' - S3FileSystem._get_file() got an unexpected keyword argument 'max_concurrency'
ERROR: failed to pull data from the cloud - 1 files failed to download

(GHA link)

I was unable to repro it locally (with most of the same relevant versions: dvc{,_s3}, *boto*, s3fs), but pinning s3fs<=2024.2 allowed the same dvc pull to succeed (GHA link).

Mentioning here in case others run into it / can better triage.

shcheklein added a commit to shcheklein/s3fs that referenced this issue Mar 17, 2024
Fixes iterative/dvc-s3#80 

It is similar to `gcsfs` and `adlfs`.

On our end, it seems `max_concurrency` is passed here: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py#L210. Since the new s3fs version has this attribute, we now pass it, which most likely leads to this error.

I'm not sure why the `_get_file` part was not yet implemented. @pmrowla might have a better idea on whether it was complicated or just lower priority. It seems like the natural next step.
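
For context, a minimal sketch of the failure mode and of the shape of the fix that landed in fsspec/s3fs#863 (illustrative stubs, not the actual s3fs code):

```python
# Before the fix, _get_file declared no **kwargs, so any extra keyword
# argument (such as max_concurrency) raised TypeError.
class Before:
    def _get_file(self, rpath, lpath):
        print(f"downloading {rpath} -> {lpath}")

# After the fix, extra kwargs are accepted and simply ignored until
# chunked downloads are actually implemented for S3.
class After:
    def _get_file(self, rpath, lpath, **kwargs):
        print(f"downloading {rpath} -> {lpath}")

After()._get_file("s3://bucket/key", "/tmp/key", max_concurrency=4)   # works
Before()._get_file("s3://bucket/key", "/tmp/key", max_concurrency=4)  # TypeError
```
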
@shcheklein
Member

Thanks @ryan-williams. Should be fixed by fsspec/s3fs#863.

For now, we should probably also add a workaround here in dvc-objects (in the case of S3, avoid passing `max_concurrency` for get file); a possible shape is sketched below. Also, we should see why it was not implemented in the first place - @pmrowla, if you could give some context, that would be helpful - was it involved and/or not needed?
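
A possible shape for that workaround (hypothetical helper, not the actual dvc-objects code): only forward `max_concurrency` when the filesystem's `_get_file` can accept it.

```python
import inspect

def get_file_kwargs(fs, max_concurrency):
    # Forward max_concurrency only if fs._get_file declares it explicitly
    # or accepts arbitrary **kwargs; otherwise drop it to avoid TypeError.
    params = inspect.signature(fs._get_file).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if "max_concurrency" in params or accepts_var_kw:
        return {"max_concurrency": max_concurrency}
    return {}
```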

@pmrowla
Contributor

pmrowla commented Mar 17, 2024

Supporting concurrent chunked downloads is more complex than uploads, so it was not implemented due to time constraints.

For uploads (fsspec `_put_file`), the S3 API inherently supports uploading chunks out of order in multipart uploads; re-assembling them in order is handled on the Amazon/server end.

For downloads (fsspec `_get_file`), doing a chunked/concurrent download requires downloading the required byte ranges and then re-assembling them locally. There is no built-in support for the out-of-order re-assembly step in the S3 API or in the botocore libraries. (For adlfs, the Microsoft Azure Python SDK does include built-in support for concurrent chunked downloads, which is why it is supported in adlfs `_get_file`.)

Since there is no guarantee that downloads will be completed in order, this means completed chunks need to either be kept in memory or written out to temp files on disk before re-assembling them at the end of the download operation (once all chunks are available, or at least when the "next" sequential chunk is available).

This is something that can be implemented in s3fs, but IMO it would be better for it to be implemented at the outer client level (i.e. the client calling fsspec, in this case dvc-objects) which would make chunked downloads supported for all filesystems that support downloading a specific byte range (which is essentially every fsspec implementation).
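
A minimal sketch of that caller-level approach, assuming only fsspec's standard `size()` and ranged `cat_file(path, start=, end=)`; the chunk size and worker count are illustrative choices:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def ranged_get_file(fs, rpath, lpath, chunk_size=8 * 2**20, max_concurrency=4):
    size = fs.size(rpath)

    def fetch(start):
        end = min(start + chunk_size, size)
        # Ranged read: supported by essentially every fsspec implementation.
        return start, fs.cat_file(rpath, start=start, end=end)

    with open(lpath, "wb") as f, ThreadPoolExecutor(max_concurrency) as pool:
        futures = [pool.submit(fetch, s) for s in range(0, size, chunk_size)]
        for fut in as_completed(futures):  # completion order is arbitrary
            start, data = fut.result()
            f.seek(start)                  # re-assemble chunks in place
            f.write(data)
```

All writes go through one handle on the consuming thread, so each completed chunk is flushed to its offset as soon as it arrives rather than being held until the end.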

@martindurant

> Since there is no guarantee that downloads will be completed in order, this means completed chunks need to either be kept in memory or written out to temp files on disk before re-assembling them at the end of the download operation

I think all local filesystems support seeking beyond the end of the file to write data, so reassembly should not be hard. On Linux this even produces "sparse" files (on Windows you get padding), which is not important in this case because we intend to fill all the gaps.
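
A quick illustration of that point (the path is illustrative):

```python
import os

# Seek past EOF and write: the gap reads back as zeros, and on Linux it
# is typically stored as a hole (sparse file) rather than allocated bytes.
with open("/tmp/sparse_demo.bin", "wb") as f:
    f.seek(1_000_000)   # well beyond the (empty) file's end
    f.write(b"tail")

print(os.path.getsize("/tmp/sparse_demo.bin"))          # 1000004
print(os.stat("/tmp/sparse_demo.bin").st_blocks * 512)  # Unix-only attribute;
# on Linux this is typically far smaller than the apparent size.
```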

@dberenbaum
Contributor

dberenbaum commented Mar 19, 2024

Should we create a separate issue for concurrent chunked downloads?

Edit: And can we then close this issue? I see @martindurant released 2024.3.1.

@martindurant

> Should we create a separate issue for concurrent chunked downloads?

It's worth making a note, but someone should benchmark whether it actually makes a difference. Currently we go concurrent over all files (subject to a throttle limiting the number of file descriptors), but each file streams. Although Windows does allow seeking beyond a file's end to extend it, I wonder if it allows multiple writers on a single file. If not, each task would need to open/seek/write/close every time, as sketched below.
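
A sketch of that fallback (hypothetical helper): each task opens the target itself, writes its range, and closes, so no handle is ever shared between writers.

```python
def write_chunk(lpath, start, data):
    # Assumes the target file was created (and optionally pre-sized)
    # before the download tasks start.
    with open(lpath, "r+b") as f:  # open / seek / write / close per task
        f.seek(start)
        f.write(data)
```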

@skshetry
Member

skshetry commented Mar 19, 2024

Closed by fsspec/s3fs#863, and released in s3fs==2024.3.1.

clrpackages pushed a commit to clearlinux-pkgs/pypi-fsspec that referenced this issue Mar 19, 2024
…rsion 2024.3.1

commit efbe1e4c23a06e65b3df6a82f28fc49bab0dbd78
Author: Martin Durant <martindurant@users.noreply.github.com>
Date:   Mon Mar 18 15:42:28 2024 -0400

    changelog (#864)

commit 5cf759d2e670eb4cb79d978491bf42ed0eff23a5
Author: Ivan Shcheklein <shcheklein@gmail.com>
Date:   Mon Mar 18 07:40:19 2024 -0700

    fix(core): accept kwargs in get file (#863)

    Fixes iterative/dvc-s3#80