We are using DVC for version management of experimental data. This data is usually located on the measurement PCs in a main folder, where it is then sorted into subfolders according to users, projects, etc.
To simplify usage, we just run a dvc add data command for the main folder after measuring and evaluating new data. On huge repositories this can take a very long time, because it seems that even unchanged files are updated in the working directory. We see this problem on both Windows and Linux, on local file systems.
Reproduce
Modify/Rename one file in an already added directory and compare timings for:
dvc add directory
dvc commit directory
dvc commit directory/changed_file -f
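The comparison can be scripted so the timings are collected the same way on every machine. The following is only a minimal sketch: time_command and REPRO_COMMANDS are hypothetical names, and directory and changed_file are placeholders for the real paths. It assumes dvc is on the PATH and the script runs from the repository root.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,   # never block on an interactive prompt
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, result.returncode


# The three commands from the reproduction steps (paths are placeholders).
REPRO_COMMANDS = [
    ["dvc", "add", "directory"],
    ["dvc", "commit", "directory"],
    ["dvc", "commit", "directory/changed_file", "-f"],
]


def time_all(commands):
    """Time each command in order and return a list of (command, seconds, return_code)."""
    results = []
    for cmd in commands:
        elapsed, code = time_command(cmd)
        results.append((" ".join(cmd), elapsed, code))
    return results
```

Calling time_all(REPRO_COMMANDS) after renaming one file gives one row per command, which makes the gap between the directory-level and the file-level commit easy to see.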
Example:
For testing I've used a repo containing a data directory with 204 files and 360 MB that was already added to DVC. After renaming 1 file, testing and timing some commands gives the following results on Windows:
time dvc data status --granular: ~2.3s
time dvc add data: ~10.2s (the same even if there are no changes)
time dvc commit data -f: ~10.2s (the same even if there are no changes)
time dvc commit data/changed_file -f: 2.8s
After add/commit of the complete data directory, the last modification date of all files in the workspace is updated. If only the changed file is committed, then only this file is updated in the workspace. For a larger repo with approx. 40,000 files and 10 GB I haven't timed it exactly, but checking the status takes 1 to 2 minutes, whereas adding/committing the entire data directory (even with nothing changed) takes more than 1 hour.
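One way to confirm which workspace files were rewritten is to snapshot modification times before and after running the command. This is a rough sketch, not part of DVC: snapshot_mtimes and touched_files are hypothetical helper names, and it uses os.lstat so that a replaced symlink is noticed as well.

```python
import os


def snapshot_mtimes(root):
    """Map each file path (relative to root) to its modification time in nanoseconds."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # lstat looks at the workspace entry itself, not at a symlink target
            mtimes[os.path.relpath(full, root)] = os.lstat(full).st_mtime_ns
    return mtimes


def touched_files(before, after):
    """Return the sorted paths that are new or whose mtime differs between snapshots."""
    return sorted(path for path, mtime in after.items() if before.get(path) != mtime)
```

Taking one snapshot of data before dvc add data and one after, then passing both to touched_files, lists every file whose timestamp changed. With the behaviour described above, that list would contain all files rather than only the renamed one.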
Expected
The add/commit commands for an entire directory should only update/check out the changed files, which are already known from dvc data status.
We should probably just stop relinking by default in add/commit and introduce an explicit --relink flag, like in checkout. This force-relinking is a bit of a legacy thing from the times when hardlink/symlink was the default, and these days it is only useful when you have reflink support on your filesystem and want to remove duplicates. The latter is a perfectly valid thing to try to do, and in an ideal world we would be able to easily tell that a file is already a reflink and skip relinking it. Unfortunately that is not possible (reflink detection is a tricky thing to do and is probably not worth it), and it doesn't make sense to waste so much time relinking stuff by default.
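To illustrate why reflinks are the awkward case, here is a small sketch that classifies a workspace file relative to its cache file using only standard stat information. It is not how DVC does it; link_type is a hypothetical helper. Symlinks and hardlinks show up in the metadata, but a reflink has its own inode and therefore looks exactly like a plain copy.

```python
import os


def link_type(workspace_path, cache_path):
    """Classify workspace_path relative to cache_path using stat metadata only."""
    if os.path.islink(workspace_path):
        # Checked first, because samefile() below follows symlinks.
        # This does not verify that the link points at cache_path.
        return "symlink"
    if os.path.samefile(workspace_path, cache_path):
        # Same device and inode: both names refer to the same file.
        return "hardlink"
    # Separate inode: a full copy or a reflink, indistinguishable from here.
    return "copy or reflink"
```

Telling the last two apart would need filesystem-specific extent queries, which is the part described above as tricky and probably not worth it.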
Looking into introducing explicit --relink flags for add/commit. I think it is worth breaking backward compatibility here now because it improves normal use cases significantly.
One challenge here is handling permissions for the symlink/hardlink cases, because a mismatch could lead to data corruption. So we need to catch mismatched permissions during the checkout phase of add/commit and relink those files even without --relink. We do collect that info in meta, so it should be doable.
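The permission check could look roughly like the sketch below. This is an illustration under assumptions, not DVC's implementation: needs_relink is a hypothetical helper, and recorded_mode stands in for whatever permission bits are stored in meta. On Windows the permission bits reported by os.stat are limited, so a real check would likely need more care there.

```python
import os
import stat


def needs_relink(path, recorded_mode):
    """Return True if the permission bits of path differ from recorded_mode.

    recorded_mode is assumed to be the mode stored when the file was last
    checked out, e.g. a read-only mode for files linked to the cache.
    """
    # os.stat follows symlinks, so for a symlink this inspects the target,
    # which is the file whose contents could be corrupted by a write.
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    return current_mode != recorded_mode
```

Files for which this returns True would be relinked even without --relink, while all other unchanged files could be skipped.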
Environment information
Output of dvc doctor: