Conversation

@woodshop (Contributor)

When executing `boto3.s3.Client.Copy`, large files are split into
multiple parts; the default multipart threshold and part size are both
8 MB. This caused an issue when caching an object larger than 8 MB
whose original had not been uploaded in multiple parts: the original
object and the cached copy would end up with different ETags, raising
an exception. This PR fixes the issue by dynamically setting the
multipart threshold to one byte larger than the size of the original
object, thereby avoiding multipart copies.

Fixes #3174 (remote s3 caching uses multipart copy which may cause an ETag mismatch)
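
For illustration, a minimal sketch of that approach with a plain boto3 client (the helper name and bucket/key variables are hypothetical; `TransferConfig` and the client's `copy` method are standard boto3 APIs):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

def copy_singlepart(from_bucket, from_key, to_bucket, to_key):
    # Hypothetical helper illustrating the fix: look up the source
    # object's size so the multipart threshold can exceed it.
    size = s3.head_object(Bucket=from_bucket, Key=from_key)["ContentLength"]

    # A threshold one byte past the object's size forces a single-part
    # copy, so the destination ETag matches the source's (assuming the
    # source itself was not uploaded in multiple parts).
    config = TransferConfig(multipart_threshold=size + 1)

    s3.copy(
        CopySource={"Bucket": from_bucket, "Key": from_key},
        Bucket=to_bucket,
        Key=to_key,
        Config=config,
    )
```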

  • [x] ❗ Have you followed the guidelines in the Contributing to DVC list?

  • [x] 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • [x] ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings advisory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop (Contributor) commented Jan 17, 2020

Thank you @woodshop ! 🙏

@efiop efiop merged commit 0bd433d into iterative:master Jan 17, 2020
The review thread below refers to this hunk from the diff (module-level imports):

```python
import logging
import os
import threading
from boto3.s3.transfer import TransferConfig
```
@skshetry (Collaborator)

Shouldn't this import be inside _copy()?
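
For context, moving it inside would be the usual deferred-import pattern, roughly like this (the `_copy` signature and the `from_info`/`to_info` shapes are guesses for illustration only):

```python
def _copy(s3, from_info, to_info, size):
    # Deferred import: boto3's transfer machinery is only loaded when a
    # copy actually runs, instead of at module import time.
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(multipart_threshold=size + 1)
    s3.copy(
        {"Bucket": from_info["bucket"], "Key": from_info["key"]},
        to_info["bucket"],
        to_info["key"],
        Config=config,
    )
```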

Member

good catch @skshetry ! I think it's better to move inside indeed. Mind creating a PR for this?


@skshetry , got you) #3195

Good catch, indeed!

Contributor

Totally missed that one! Thank you @skshetry !
