remote: getting a directory checksum requires uploading a file to an external cache #2647

@ghost

Description

Version: 0.62.1

Description: The current approach to calculating a directory checksum is to collect the checksums of its contents, "upload" that list to the remote, and use the checksum the remote reports for the uploaded file:

https://github.com/iterative/dvc/blob/a091b67341b00fea8ef43906f59c88cb38f384ff/dvc/remote/base.py#L246-L256

For example, with an S3 remote, DVC creates a file in the S3 cache containing an "index" of the directory (its entries and their checksums) and then reads the ETag of that file.

This is an easy way to check whether the directory has changed. Since the "directory index" is already in the cache, you are one HEAD request away from knowing if the directory changed. Otherwise, you would need to recompute the index on the spot.
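To make the discussion concrete, here is a minimal sketch of what a "directory index" could look like and how its checksum could be computed locally instead of by uploading it. This is not DVC's actual code: `build_dir_index`, `local_dir_checksum`, and the `relpath`/`checksum` field names are hypothetical, and the choice of MD5 is an assumption made only for illustration.

```python
import hashlib
import json


def build_dir_index(entries):
    """Serialize a directory "index" deterministically.

    `entries` maps each file's relative path to the checksum reported
    for it (for an S3 remote this would be the file's ETag).
    Hypothetical helper, not part of DVC.
    """
    index = [
        {"relpath": relpath, "checksum": checksum}
        for relpath, checksum in sorted(entries.items())  # stable order
    ]
    return json.dumps(index, sort_keys=True).encode("utf-8")


def local_dir_checksum(entries):
    """Hash the index locally, with no upload to an external cache."""
    return hashlib.md5(build_dir_index(entries)).hexdigest()


def dir_changed(entries, saved_checksum):
    """Compare a freshly computed directory checksum with a saved one."""
    return local_dir_checksum(entries) != saved_checksum
```

The trade-off is the one described above: hashing locally avoids writing to the remote, but the index has to be rebuilt from a full listing every time, whereas an uploaded index can be checked with a single request.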

Right now we don't require setting up a cache to use external dependencies, but, as described above, one is needed when the dependency is a directory.

Let's discuss whether there is a way to avoid uploading files, or whether we should settle on the current implementation and update the code and docs accordingly.

Related: #1654

Labels: discussion, p2-medium, research
