Two suggestions to optimize checksum calculation while uploading to S3.
I frequently upload very large files (75–100 GB) to S3, and the checksum calculation adds a significant delay to a time-sensitive workflow. I was just uploading a 75 GB file, and the checksum calculation took 10 minutes before the actual upload started. The actual upload took 32 minutes, so the checksum adds a 33% time penalty, which is significant and very unfortunate.
Compute the checksum during the upload rather than in a separate pre-calculation pass. Yes, a single read reduces the redundancy the checksum provides, but errors are more likely during the upload than during a local disk read.
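The single-pass idea can be sketched with the JDK's `DigestInputStream`, which accumulates the digest as bytes are read, so the hash is ready the moment the last byte goes out. This is only an illustration, not Cyberduck's actual code; the `OutputStream` sink stands in for the S3 transfer, and the method name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class StreamingChecksum {
    // Wrap the source stream in a DigestInputStream so the SHA-256 is
    // accumulated while the bytes flow to the sink: no separate pre-read pass.
    static String uploadWithChecksum(InputStream source, OutputStream sink) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (DigestInputStream in = new DigestInputStream(source, md)) {
            byte[] buf = new byte[1 << 20]; // 1 MiB read buffer
            int n;
            while ((n = in.read(buf)) != -1) {
                sink.write(buf, 0, n); // stands in for the S3 part upload
            }
        }
        return HexFormat.of().formatHex(md.digest()); // checksum available immediately after upload
    }
}
```

With this shape, the checksum could still be sent to S3 as trailing metadata or verified against the server's response after the transfer completes.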
The algorithm that reads the file for checksum calculation seems slow. My primary storage (RAID 5) supports read bandwidth in excess of 400 MB/s, yet during checksum calculation the read speed never exceeds 120 MB/s, so the calculation is limited by code, not I/O bandwidth.
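One common cause of a ceiling like this is a small read buffer (8 KiB is a frequent library default), which pays per-call overhead far more often than a multi-megabyte buffer would. A hedged sketch, assuming the bottleneck is the read loop rather than the digest itself; the digest result is identical either way, only the loop overhead changes:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class BufferedDigest {
    // Digest a stream using a caller-chosen buffer size. Timing this with
    // 8 KiB vs. 4 MiB buffers on a fast array is an easy way to check whether
    // the checksum pass is loop-bound rather than disk-bound.
    static byte[] md5(InputStream in, int bufferSize) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[bufferSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        return md.digest();
    }
}
```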
PS: (looking through some code changes, I see you already know this) but although the ETag calculation is not officially documented by Amazon, the resulting ETag of a completed multipart upload is an MD5 of each part's MD5, followed by "-" and the number of parts. This could be verified if you're paranoid, though I guess it would have to be recomputed if Cyberduck is restarted in the middle of a multipart upload.
We already compute the returned concatenated MD5 hash for multipart uploads (see S3MultipartUploadService).
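The multipart ETag rule described above can be sketched as follows: hash each part with MD5, feed the raw (binary, not hex) part digests into an outer MD5, then append "-" and the part count. This is a sketch of the undocumented behavior, not S3MultipartUploadService's implementation:

```java
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;

public class MultipartEtag {
    // ETag of a completed multipart upload: MD5 over the concatenated raw
    // MD5 digests of each part, hex-encoded, then "-" plus the part count.
    static String etag(List<byte[]> parts) throws Exception {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        for (byte[] part : parts) {
            outer.update(MessageDigest.getInstance("MD5").digest(part));
        }
        return HexFormat.of().formatHex(outer.digest()) + "-" + parts.size();
    }
}
```

Note the subtlety that makes this recomputable only with the part boundaries in hand: restarting mid-upload loses the earlier parts' digests unless they were persisted, which matches the caveat above.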