remote s3 caching uses multipart copy which may cause an ETag mismatch #3174

@woodshop

Description

DVC version: 0.81.3, ubuntu 18.04 LTS, pip install, python 3.7.4

When a remote object on S3 is specified as a cached output, an ETag mismatch error is sometimes raised during the caching stage. For example, see the following debug output:

DEBUG: Removing s3://XXXXX/artifacts/test/train/models/model.01.ckpt.data-00000-of-00002
DEBUG: Created 'copy': s3://XXXXX/s3cache/ee/28f5ee86fc9aaca1aca65e64abde58 -> s3://XXXXX/artifacts/test/train/models/model.01.ckpt.data-00000-of-00002
DEBUG: cache 's3://XXXXX/s3cache/82/ed1912264aec805954939a0c84b5ff' expected '82ed1912264aec805954939a0c84b5ff' actual 'None'
DEBUG: SELECT count from state_info WHERE rowid=?
DEBUG: fetched: [(429,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: failed to run command - ETag mismatch detected when copying file to cache! (expected: '82ed1912264aec805954939a0c84b5ff', actual: '284b1e7130c3689baad589f9de093810-11')

In this example, s3://XXXXX/artifacts/test/train/models/model.01.ckpt.data-00000-of-00002 is 81.4 MB. When DVC attempts to cache the file, it performs a multipart copy, and the copied object ends up with a mismatched ETag.

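I believe this is because the s3.copy call

s3.copy(source, to_info.bucket, to_info.path, ExtraArgs=extra_args)

uses boto3's default multipart threshold and chunk size of 8 MB. Any file larger than that is copied in parts, and S3 then assigns the copy a multipart-style ETag (note the '-11' suffix in the error above) that no longer matches the expected one.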
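To illustrate why the two ETags differ, here is a minimal sketch of the commonly observed (not formally specified) way S3 derives the ETag of a multipart object: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The function names `plain_etag`, `multipart_etag`, and `part_count` are hypothetical helpers for illustration only, and the sketch assumes an unencrypted object or one using SSE-S3.

```python
import hashlib

DEFAULT_CHUNK = 8 * 1024 * 1024  # boto3's default multipart chunk size (8 MB)


def plain_etag(data: bytes) -> str:
    """ETag typically seen for an object written in a single request: its MD5."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, chunk_size: int = DEFAULT_CHUNK) -> str:
    """ETag typically seen for a multipart object (assumed scheme, see lead-in)."""
    parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # MD5 over the concatenated binary digests of each part, not over the data.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


def part_count(size: int, chunk_size: int = DEFAULT_CHUNK) -> int:
    """Number of parts a file of `size` bytes is split into (ceiling division)."""
    return -(-size // chunk_size)
```

If the reported 81.4 MB is read as MiB, `part_count(int(81.4 * 1024**2))` gives 11 parts, which would be consistent with the `-11` suffix in the error message.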

I've patched the problem locally for myself with the following lines:

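from boto3.s3.transfer import TransferConfig

# Raise the multipart threshold to 1 GiB so that smaller files are
# copied in a single request; larger files still use a multipart copy.
s3.copy(
    source, to_info.bucket, to_info.path, ExtraArgs=extra_args,
    Config=TransferConfig(multipart_threshold=1024**3),
)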

There are probably more elegant methods to avoid a multipart copy. For instance, https://stackoverflow.com/a/38058798 suggests using put_object.
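One possible alternative, which is an assumption on my part and not taken from the linked answer, is the client-level `copy_object` call, which copies in a single request and so should not produce a multipart ETag. The sketch below uses a hypothetical helper, `copy_without_multipart`, that takes a boto3 S3 client; it assumes the source object is within the 5 GB single-copy limit and that any `extra_args` are valid `copy_object` parameters.

```python
MAX_SINGLE_COPY_BYTES = 5 * 1024**3  # S3 limit for one CopyObject request


def copy_without_multipart(s3_client, src_bucket, src_key,
                           dst_bucket, dst_key, extra_args=None):
    """Copy an object in a single request and return the new ETag.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    size = s3_client.head_object(Bucket=src_bucket, Key=src_key)["ContentLength"]
    if size > MAX_SINGLE_COPY_BYTES:
        # Objects above the limit can only be copied with a multipart copy.
        raise ValueError(f"{src_key} is too large for a single-request copy")

    response = s3_client.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
        **(extra_args or {}),
    )
    # S3 returns the ETag wrapped in double quotes.
    return response["CopyObjectResult"]["ETag"].strip('"')
```

Whether the returned ETag equals the source ETag still depends on how the source object was originally uploaded, so the comparison against the expected value would need to stay in place.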

Metadata

Labels: awaiting response, bug
