Skip to content

Optimize MD5 checksum calculation #10278

@cyberduck

Description

@cyberduck

4d59c66 created the issue

Two suggestions to optimize checksum calculation while uploading to S3.

I frequently upload very large files (75-100GB) to S3 and the checksum calculation adds a significant delay in a time sensitive workflow. I was just uploading a 75GB file, and the checksum calculation took 10min before the actual upload started. Actual upload time is 32min, so that adds a 33% time penalty in uploading, which is significant and very unfortunate.

  • Compute the checksum during the upload, rather than a separate pre-calc pass. Yes, that reduces redundancy of the checksum because it becomes a single read, but errors are more likely during upload than local disk read.
  • The algorithm for reading the file for checksum calculation seems slow. My primary storage (RAID5) supports read bandwidth in excess of 400MB/s, yet during the calculation of the checksum the read speed never exceeds 120MB/s, so checksum calculation is limited by code not I/O bandwidth.

Metadata

Metadata

Assignees

Labels

s3AWS S3 Protocol Implementation

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions