S3 uploads do not appear to run in parallel #8343

Closed · storcium opened this issue Feb 1, 2024 · 13 comments

storcium commented Feb 1, 2024

Describe the bug

Using s3ng, uploads to S3 commence after files leave quarantine, but they do not appear to be parallelized, resulting in far lower throughput than expected.

Steps to reproduce

  1. Run the Wasabi s3-benchmark against the same S3 backend with 10 threads and 16 MB objects, e.g.:
    s3-benchmark -a -s -b -t 10 -z 16M -u
  2. Throughput on the S3 side shows 381 MiB/s = 3 Gbps at 28 requests/sec.
  3. Upload a 5 GB folder (35 x 140 MB files of random data) to an oCIS space, using the same network, compute and storage resources.
  4. Throughput shows 60 MiB/s = 480 Mbps at 5 requests/sec.

Expected behavior

Similar performance between the parallel benchmark and oCIS.

Actual behavior

oCIS is about 6 times slower

Setup

Note that we are measuring peak sustained throughput, not the timing of quarantine, postprocessing, etc. This has been tested both as a binary and in a Kubernetes setup, with similar results.

STORAGE_USERS_OCIS_ASYNC_UPLOADS is set to true

micbar (Contributor) commented Feb 2, 2024

What exactly is the problem you are trying to solve?

We deliberately designed oCIS to place uploads in a quarantine area on the local disk to boost upload performance.
Our clients close the connection after the last byte has been uploaded to oCIS; this keeps the frontend upload speed as fast as possible.

As a second step, we do all kinds of asynchronous postprocessing on that file: virus scanning, content indexing, OCR on images, and so on.

After the postprocessing, the files are finally uploaded to S3.

storcium (Author) commented Feb 2, 2024

Hi Michael, thank you for the response. Please see the graphs attached. Loading 10 parallel streams into S3 maxes out throughput at a much higher rate than oCIS achieves during its S3 upload stage. Does oCIS parallelize the uploads to the S3 back end, or does it run them sequentially? Is there a way to parallelize them?

The graph below shows max load with everything else being the same: OS, network, storage.

[graph: s3 load test]

micbar (Contributor) commented Feb 2, 2024

I still don’t understand the problem you are trying to solve 🤔

tbsbdr (Contributor) commented Feb 6, 2024

@nicholas-wilson-au

Once a user has uploaded multiple large files and the files have completed postprocessing, the upload to S3 appears to run sequentially rather than using multipart uploading, which significantly delays the availability of the files. This was implemented in OC10 as listed above. Can we investigate whether this is possible to implement in oCIS?

butonic (Member) commented Feb 7, 2024

oCIS copies the blob into S3 whenever the postprocessing event is received. Depending on the configured concurrency, multiple goroutines may run in parallel, each trying to upload a blob.

For every upload, the github.com/minio/minio-go/v7 package we use makes a single PutObject call, which internally decides whether to do a multipart upload or not. The default size that switches to multipart uploads is 16 MB.
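For illustration, here is a minimal sketch of that call; the client construction, bucket, and key are placeholders, not the actual ocis wiring:

package main

import (
	"context"
	"os"

	"github.com/minio/minio-go/v7"
)

// uploadBlob sketches the single PutObject call: minio-go internally
// decides between a plain PUT and a multipart upload based on the
// given size and the configured part size.
func uploadBlob(client *minio.Client, bucket, key, path string, size int64) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = client.PutObject(context.Background(), bucket, key, f, size,
		minio.PutObjectOptions{ContentType: "application/octet-stream"})
	return err
}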

I can see three multipart put implementations that might be called from putObjectCommon:

  • if c.overrideSignerType.IsV2() -> putObjectMultipart; we should be using V4, so I will not follow this
  • if size < 0 (size is unknown) -> putObjectMultipartStreamNoLength; ocis always sets a size -> irrelevant
  • finally putObjectMultipartStream -> internally checks whether the reader implements ReadAt; if so, it uses a parallel uploader via putObjectMultipartStreamFromReadAt, otherwise it does a sequential upload with putObjectMultipartStreamOptionalChecksum

ocis passes an os.File as the Reader, which does implement ReadAt, so we should be doing parallel uploads. putObjectMultipartStreamFromReadAt internally calls getNumThreads which, if not given as an option (we currently don't set one), defaults to totalWorkers = 4.
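A quick compile-time check (a sketch, not code from the repo) confirms the ReadAt part:

package main

import (
	"io"
	"os"
)

// *os.File satisfies io.ReaderAt, which is the precondition for
// minio-go's parallel putObjectMultipartStreamFromReadAt path.
var _ io.ReaderAt = (*os.File)(nil)

func main() {}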

We could make PutObjectOptions.NumThreads configurable, but I wonder how we can verify that parallel multipart uploads are actually being started.

👀 it seems we are not using parallel uploads because we set opts.SendContentMd5 to true...
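The relevant branch in minio-go's putObjectMultipartStream looks roughly like this (paraphrased from the v7 sources, so treat the exact conditions as approximate):

// The parallel ReadAt-based uploader is only chosen when SendContentMd5
// is NOT set; with MD5 enabled each part's checksum is computed while
// reading sequentially instead.
if !isObject(reader) && isReadAt(reader) && !opts.SendContentMd5 {
	// parallel uploader with getNumThreads workers (default 4)
	info, err = c.putObjectMultipartStreamFromReadAt(ctx, bucketName, objectName,
		reader.(io.ReaderAt), size, opts)
} else {
	// sequential upload
	info, err = c.putObjectMultipartStreamOptionalChecksum(ctx, bucketName, objectName,
		reader, size, opts)
}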

butonic (Member) commented Feb 7, 2024

ok so this diff will use concurrent uploads:

diff --git a/pkg/storage/fs/s3ng/blobstore/blobstore.go b/pkg/storage/fs/s3ng/blobstore/blobstore.go
index 9c744e754..4fa391ebb 100644
--- a/pkg/storage/fs/s3ng/blobstore/blobstore.go
+++ b/pkg/storage/fs/s3ng/blobstore/blobstore.go
@@ -71,7 +71,7 @@ func (bs *Blobstore) Upload(node *node.Node, source string) error {
        }
        defer reader.Close()
 
-       _, err = bs.client.PutObject(context.Background(), bs.bucket, bs.path(node), reader, node.Blobsize, minio.PutObjectOptions{ContentType: "application/octet-stream", SendContentMd5: true})
+       _, err = bs.client.PutObject(context.Background(), bs.bucket, bs.path(node), reader, node.Blobsize, minio.PutObjectOptions{ContentType: "application/octet-stream"})
 
        if err != nil {
                return errors.Wrapf(err, "could not store object '%s' into bucket '%s'", bs.path(node), bs.bucket)

Let me see what SendContentMd5 is used for exactly.

butonic (Member) commented Feb 7, 2024

IIRC we needed that parameter for a specific S3 implementation and just hardcoded it. So we need to make it configurable. 🥳 Probably together with NumThreads.
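As a hedged sketch of what the configurable version of the Upload call could look like (the Blobstore fields sendContentMd5 and uploadThreads are hypothetical names, not actual ocis config keys):

// Hypothetical wiring: keep SendContentMd5 available for bucket policies
// such as S3 Object Lock, and expose NumThreads for the parallel multipart
// workers (minio-go falls back to 4 workers when it is left at 0).
_, err = bs.client.PutObject(context.Background(), bs.bucket, bs.path(node), reader, node.Blobsize,
	minio.PutObjectOptions{
		ContentType:    "application/octet-stream",
		SendContentMd5: bs.sendContentMd5, // hypothetical field
		NumThreads:     bs.uploadThreads,  // hypothetical field
	})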

butonic (Member) commented Feb 7, 2024

Adding this to the project board so someone can prioritize.

micbar (Contributor) commented Feb 7, 2024

The mentioned parameter is needed to fulfil the bucket policy.

It was brought up by @wkloucek due to an incident.

butonic (Member) commented Feb 7, 2024

yup:

The Content-MD5 header is required for any request to upload an object with a retention period configured using Amazon S3 Object Lock. For more information about Amazon S3 Object Lock, see Amazon S3 Object Lock Overview in the Amazon S3 User Guide.

micbar (Contributor) commented Feb 19, 2024

Needs dependency bump and re-testing.

butonic (Member) commented Feb 19, 2024

needs

butonic closed this as completed Feb 21, 2024
micbar added this to the Release 5.0.0 milestone Feb 28, 2024