Add/commit: Using on directories also updates unchanged files in the working directory #8806

Open
spinnau opened this issue Jan 13, 2023 · 2 comments
Labels
A: data-management Related to dvc add/checkout/commit/move/remove

Comments


spinnau commented Jan 13, 2023

Bug Report

Description

We are using DVC for version management of experimental data. This data is usually located on the measurement PCs in a main folder, where it is then sorted into subfolders according to users, projects, etc.

To simplify usage, we just run a dvc add data command for the main folder after measuring and evaluating new data. On huge repositories this can take a very long time, because it seems that even unchanged files are updated in the working directory. We see this problem on both Windows and Linux with local file systems.

Reproduce

Modify/Rename one file in an already added directory and compare timings for:

  1. dvc add directory
  2. dvc commit directory
  3. dvc commit directory/changed_file -f

Example:

For testing I've used a repo containing a data directory with 204 files and 360 MB that was already added to DVC. After renaming 1 file, testing and timing some commands gives the following results on Windows:

  1. time dvc data status --granular: ~2.3s
  2. time dvc add data: ~10.2s (this is even the same if there are no changes)
  3. time dvc commit data -f: ~10.2s (this is even the same if there are no changes)
  4. time dvc commit data/changed_file -f: 2.8s
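The timings above could be collected with a small script along these lines. This is only a sketch and not part of the original measurements: the helper name time_command is hypothetical, and it assumes dvc is on PATH and that the script is run from the repo root with a tracked directory called data.

```python
import shutil
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


# Commands taken from the list above; the paths are placeholders.
COMMANDS = [
    ["dvc", "data", "status", "--granular"],
    ["dvc", "add", "data"],
    ["dvc", "commit", "data", "-f"],
    ["dvc", "commit", "data/changed_file", "-f"],
]


def main():
    # Bail out early instead of failing inside subprocess if dvc is missing.
    if shutil.which("dvc") is None:
        print("dvc not found on PATH, nothing to time")
        return
    for cmd in COMMANDS:
        elapsed, code = time_command(cmd)
        print(f"{' '.join(cmd)}: {elapsed:.1f}s (exit code {code})")


if __name__ == "__main__":
    main()
```

Note that each command changes the repo state, so the changed file needs to be modified or renamed again between runs to get comparable numbers.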

After add/commit of the complete data directory, the last modification date of all files in the workspace is updated. If only the changed file is committed, then only this file is updated in the workspace. For a larger repo with approx. 40,000 files and 10 GB I haven't timed it exactly, but checking the status takes 1 to 2 minutes, whereas adding/committing the entire data directory (even without any changes) takes more than 1 hour.
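One way to confirm which workspace files get touched is to snapshot modification times before and after the command and compare them. The following is a rough sketch, not something from the original measurements; the function names snapshot_mtimes and touched_files are made up for illustration.

```python
import os


def snapshot_mtimes(root):
    """Map each file path under root (relative to root) to its mtime in ns.

    os.lstat is used so that a symlinked workspace entry reports the time of
    the link itself, not of the cache file it points to.
    """
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            mtimes[os.path.relpath(full, root)] = os.lstat(full).st_mtime_ns
    return mtimes


def touched_files(before, after):
    """Return sorted paths that are new or whose modification time differs."""
    return sorted(
        path for path, mtime in after.items() if before.get(path) != mtime
    )
```

Taking a snapshot of data, running dvc add data, taking a second snapshot and passing both to touched_files would then list every file whose timestamp changed.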

Expected

The add/commit commands for an entire directory should only update/check out the changed files, which are already known to dvc data status.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.41.1 (exe)
---------------------------------
Platform: Python 3.10.9 on Windows-10-10.0.10240-SP0
Subprojects:

Supports:
        azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2022.11.0),
        hdfs (fsspec = 2022.11.0, pyarrow = 10.0.1),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2022.11.0)
Cache types: hardlink, symlink
Cache directory: NTFS on C:\
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git
$ dvc doctor
DVC version: 2.41.1 (pip)
---------------------------------
Platform: Python 3.10.9 on Linux-6.1.4-arch1-1-x86_64-with-glibc2.36
Subprojects:
	dvc_data = 0.29.0
	dvc_objects = 0.14.1
	dvc_render = 0.0.17
	dvc_task = 0.1.9
	dvclive = 1.3.1
	scmrepo = 0.1.5
Supports:
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	ssh (sshfs = 2022.6.0)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sdb1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sdb1
Repo: dvc, git
@daavoo daavoo added the A: data-management Related to dvc add/checkout/commit/move/remove label Jan 16, 2023
@efiop efiop self-assigned this Feb 15, 2023
Member

efiop commented Feb 21, 2023

We should probably just stop relinking by default in add/commit and introduce an explicit --relink flag like in checkout. This force-relinking is a bit of a legacy thing from the times when hardlink/symlink was the default, and these days it is only useful when you have reflink support on your filesystem and want to remove duplicates. The latter is a perfectly valid thing to try to do, and in an ideal world we would be able to easily tell that a file is already a reflink and skip relinking it. Unfortunately that is not possible (reflink detection is a tricky thing to do and is probably not worth it), and it doesn't make sense to waste so much time relinking stuff by default.

Looking into introducing explicit --relink flags for add/commit. I think it is worth breaking backward compatibility here now because it improves normal use cases significantly.

Member

efiop commented Feb 21, 2023

One challenge here is handling permissions for the symlink/hardlink cases, because a mismatch could lead to data corruption. So we need to catch mismatched permissions during the checkout phase of add/commit and relink that stuff even without --relink. We do collect the info about it in meta, so it should be doable.
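To make the idea concrete, the decision could look roughly like the sketch below. This is not DVC code: the function needs_relink, its parameters, and the notion of an expected_mode read from the recorded meta are all assumptions made for illustration.

```python
import os
import stat


def needs_relink(path, expected_mode, link_type, relink=False):
    """Decide whether a workspace file should be relinked during add/commit.

    path:          workspace file to check
    expected_mode: permission bits assumed to be recorded in meta (hypothetical)
    link_type:     "hardlink", "symlink", "reflink" or "copy"
    relink:        value of the proposed explicit --relink flag
    """
    if relink:
        # The user asked for it explicitly, so always relink.
        return True
    if link_type not in ("hardlink", "symlink"):
        # Copies and reflinks do not share data with the cache in a way that
        # a workspace edit could corrupt, so skip them by default.
        return False
    # For hardlinks/symlinks, relink when permissions differ from what was
    # recorded, since a writable link could let edits reach the cache.
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    return current_mode != expected_mode
```

With something like this, unchanged files with matching permissions would be skipped entirely, which is where the time savings for large directories would come from.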
