unnecessary recalc of hashes on first checkout of a clone #3009

@vincilab

If you have a project set up to use symlinks onto a NAS:

$ cat .dvc/config
[core]
analytics = false
remote = myremote
['remote "myremote"']
url = /net/dvc_test
[cache]
dir = /net/dvc_cache
type = "hardlink,symlink"
protected = true

If you clone that repo and run dvc checkout, it correctly symlinks to the files on the NAS, but it also recalculates the md5 hashes of all the DVC-managed files, which is unnecessary. With a very large dataset this takes a long time, which makes the setup impractical on machines that need quick access to the files, such as a build agent: every build does a git clone of the repo and should be able to start churning through the dataset without waiting for md5 hashes of 2TB of data to be computed.
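To illustrate why the recalculation looks avoidable, here is a minimal sketch, not DVC's actual code. It assumes the cache stores each file under <cache dir>/<first two hex characters of the md5>/<remaining 30 characters>, so a symlink into the cache already names its hash. The function md5_from_cache_link is a hypothetical helper introduced only for this example.

```python
import os
import re
from typing import Optional

# A full md5 digest is 32 lowercase hex characters.
_MD5_RE = re.compile(r"^[0-9a-f]{32}$")


def md5_from_cache_link(link_path: str, cache_dir: str) -> Optional[str]:
    """Return the md5 implied by a symlink's target path, or None.

    Hypothetical helper: it assumes a cache layout of
    <cache_dir>/<2 hex chars>/<30 hex chars> and never reads file contents.
    """
    if not os.path.islink(link_path):
        return None

    # Read the link target without opening the (possibly huge) file.
    target = os.readlink(link_path)
    if not os.path.isabs(target):
        target = os.path.join(os.path.dirname(link_path), target)
    target = os.path.abspath(target)
    cache_dir = os.path.abspath(cache_dir)

    # Only trust links that point inside the cache directory.
    if os.path.commonpath([target, cache_dir]) != cache_dir:
        return None

    prefix = os.path.basename(os.path.dirname(target))
    rest = os.path.basename(target)
    candidate = prefix + rest
    return candidate if _MD5_RE.match(candidate) else None
```

With the config above, a workspace file linked to /net/dvc_cache/d4/1d8cd98f00b204e9800998ecf8427e would yield d41d8cd98f00b204e9800998ecf8427e without reading any data. Whether trusting the path this way is acceptable depends on how much the cache contents can be trusted (the protected = true setting is meant to guard against accidental modification).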
