
For files over 50 GB, S3_complete_multipart executes prematurely. #2114

Closed
JustinKyleJames opened this issue May 5, 2023 · 6 comments

@JustinKyleJames
Contributor

JustinKyleJames commented May 5, 2023

  • main
  • 4-2-stable

When using streaming mode but with a local cache file, flushing the cache file runs up to S3_MPU_THREADS upload threads simultaneously (if the file is large enough). Normally there is a one-to-one mapping between parts and threads.

However, if each thread would have to handle more than S3_MAX_UPLOAD_SIZE megabytes, more parts than threads are needed, so the one-to-one mapping no longer holds.

With the defaults of S3_MPU_THREADS = 10 and S3_MAX_UPLOAD_SIZE = 5 GB, this happens for any file over 50 GB (10 threads × 5 GB = 50 GB).

The logic is roughly as follows:

  1. Figure out number of required parts.
  2. Perform min(S3_MPU_THREADS, parts_left) S3_upload_part calls in parallel.
  3. Wait for all active upload threads to complete.
  4. If more parts need to be uploaded, continue with 2.
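
To make the failure mode concrete, here is a minimal sketch of that loop, using boost::asio::thread_pool as a stand-in for the plugin's thread_pool wrapper; the function names and constants are illustrative, not the plugin's actual symbols. Note that the pool is constructed once, before the loop:

```cpp
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <algorithm>
#include <cstdint>

// Hypothetical stand-ins for the plugin's real settings and per-part upload call.
constexpr std::int64_t S3_MPU_THREADS = 10;

void s3_upload_part(std::int64_t part_number)
{
    // ... PUT this part to S3 (placeholder) ...
    (void)part_number;
}

void upload_parts(std::int64_t total_parts)          // 1. parts computed from file size
{
    boost::asio::thread_pool pool(S3_MPU_THREADS);   // pool created once, up front

    std::int64_t next_part = 1;
    while (next_part <= total_parts) {
        const auto batch = std::min(S3_MPU_THREADS, total_parts - next_part + 1);

        // 2. launch min(S3_MPU_THREADS, parts_left) uploads in parallel
        for (std::int64_t i = 0; i < batch; ++i) {
            boost::asio::post(pool, [p = next_part + i] { s3_upload_part(p); });
        }

        // 3. wait for this batch to finish
        pool.join();   // after the first join() the pool's threads have exited, so
                       // work posted on the next iteration never runs and this
                       // join() returns immediately the second time around

        next_part += batch;                          // 4. more parts? loop back to 2
    }

    // The caller then issues S3_complete_multipart(), possibly too early.
}
```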

The problem is that I was reusing an existing thread_pool for the second iteration of steps 2-4. Because of that, the wait in step 3 returned immediately and the complete multipart executed prematurely.

The fix is to instantiate a new version of thread_pool on each iteration.

As an example, if the file is 55 GB, there will be 11 parts, each 5 GB in size. The first 10 parts will upload, and S3_complete_multipart() will execute before the 11th part is uploaded.
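
With the fix, the loop body in the sketch above changes so that a fresh pool is constructed for every batch (again illustrative, not the plugin's actual code):

```cpp
    std::int64_t next_part = 1;
    while (next_part <= total_parts) {
        const auto batch = std::min(S3_MPU_THREADS, total_parts - next_part + 1);

        // A new pool per iteration: this join() genuinely waits for this batch.
        boost::asio::thread_pool pool(static_cast<std::size_t>(batch));
        for (std::int64_t i = 0; i < batch; ++i) {
            boost::asio::post(pool, [p = next_part + i] { s3_upload_part(p); });
        }
        pool.join();

        next_part += batch;
    }
    // Every part has now been uploaded before S3_complete_multipart() runs.
```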

@JustinKyleJames
Contributor Author

Note that this only happens when a cache file is required. That generally happens after a redirect from one server to another, because the S3 plugin doesn't know the file size, which it would need in order to use streaming mode.

@korydraughn
Contributor

The fix is to instantiate a new version of thread_pool on each iteration.

That sounds pretty costly. Can the existing thread pool be reused?

@JustinKyleJames JustinKyleJames self-assigned this May 12, 2023
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue May 12, 2023
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue May 12, 2023
alanking pushed a commit that referenced this issue May 12, 2023
@JustinKyleJames
Contributor Author

JustinKyleJames commented May 17, 2023

The fix is to instantiate a new version of thread_pool on each iteration.

That sounds pretty costly. Can the existing thread pool be reused?

It is costly, but for a file of 50 GiB and 10 upload threads there is only one extra batch of parts, so we are talking about 1 re-instantiation. With the change for 5119, it would be 4 re-instantiations.

@trel
Member

trel commented May 17, 2023

right, marginally zero cost if you're moving 50 GiB.

JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jun 21, 2023
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jul 5, 2023
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jul 5, 2023
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jul 5, 2023
alanking pushed a commit that referenced this issue Jul 5, 2023
@alanking
Contributor

alanking commented Jul 5, 2023

@JustinKyleJames - Please close if complete. Thanks!

@JustinKyleJames
Contributor Author

Closing
