
Using GCS multipart upload is not supported. #217

Closed
s4mur4i opened this issue Aug 21, 2020 · 6 comments

Comments

@s4mur4i
Contributor

s4mur4i commented Aug 21, 2020

Uploading a file larger than the part size causes s5cmd to use multipart uploads, which GCS does not support.
The current workaround is to set:

-c=1 -p=1000000

when copying, to prevent the file from being split into parts.

The error is also misleading:

InvalidArgument: Invalid argument. status code: 400, request id: , host id:
@doit-mattporter

Encountering this issue as well, specifically when uploading to a GCS bucket. The following fails with the misleading InvalidArgument error:

s5cmd --endpoint-url https://storage.googleapis.com cp test_30GB_file s3://test-gcs-bucket/

while the following succeeds:

s5cmd --endpoint-url https://storage.googleapis.com cp -c=1 -p=1000000 test_30GB_file s3://test-gcs-bucket/

@igungor
Member

igungor commented Sep 1, 2020

Thanks for the report.

s5cmd treats S3 as a first-class citizen because it was designed to communicate with S3, and it naturally uses the official AWS SDK to do so. s5cmd can access GCS through GCS's (mostly) S3-compatible API gateway.

When you upload a file to an object store, s5cmd splits the file and uploads the parts in parallel to achieve maximum throughput, following the S3 Multipart Upload API contract. The problem is that GCS does not support multipart uploads in its S3-compatible API, and the misleading error you're encountering is the result of that missing support.
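For reference, the contract in question is the usual three-step S3 multipart sequence, sketched below with the aws CLI against the same endpoint (the upload id, the part1 file, and the parts.json file listing part numbers and ETags are placeholders, not values from this thread):

aws s3api create-multipart-upload --bucket test-gcs-bucket --key test_30GB_file --endpoint-url https://storage.googleapis.com
aws s3api upload-part --bucket test-gcs-bucket --key test_30GB_file --part-number 1 --body part1 --upload-id <upload-id> --endpoint-url https://storage.googleapis.com
aws s3api complete-multipart-upload --bucket test-gcs-bucket --key test_30GB_file --upload-id <upload-id> --multipart-upload file://parts.json --endpoint-url https://storage.googleapis.com

GCS's rejection of this sequence is what surfaces as the misleading InvalidArgument error above.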

An excerpt from the official GCS docs:
(screenshot of the GCS documentation stating that multipart uploads are not supported)

So, multipart upload is not supported out of the box. If you use -c=1 -p=<number-higher-than-filesize-in-mb>, the SDK will upload the file in a single pass using the PutObject API call, which GCS does support. For larger files, though, the operation will not be as performant as a multipart upload.
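As a concrete illustration of that flag (reusing the file and bucket names from the earlier report): a 30 GB file is 30720 MB, so any part size above that, e.g. -p=31000, forces the single-request PutObject path:

s5cmd --endpoint-url https://storage.googleapis.com cp -c=1 -p=31000 test_30GB_file s3://test-gcs-bucket/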

@doit-mattporter

Native s5cmd support for GCP could be massively beneficial for that cloud, as GCP's native gsutil tooling offers very poor download performance compared to s5cmd. gsutil upload performance is not great either; however, it can't be improved by using s5cmd until multipart upload support is added. See this benchmarking deep dive I wrote on the subject:
https://blog.doit-intl.com/optimize-data-transfer-between-compute-engine-and-cloud-storage-9a1ecd030e30

@rosibaj

rosibaj commented May 9, 2022

Is this still an issue?

@igungor
Member

igungor commented Jul 25, 2023

It seems that GCS now supports both the ListObjectsV2 and S3 multipart upload protocols, according to their changelog.

I can't test it right now, but if someone could test it and report back, that would be very helpful. I'm closing the issue. Please feel free to re-open if you see any problem with GCS multipart uploads.
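One quick way to verify, for anyone able to test (assuming a writable GCS bucket with HMAC credentials for its S3-compatible API; the file and bucket names here are placeholders): upload a file well above the part size without the -c/-p workaround, so that s5cmd takes the multipart path:

dd if=/dev/zero of=multipart_test_file bs=1M count=500
s5cmd --endpoint-url https://storage.googleapis.com cp multipart_test_file s3://test-gcs-bucket/

If multipart uploads now work, the copy should complete without the InvalidArgument error seen above.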

@igungor igungor closed this as completed Jul 25, 2023