Add an option to not follow symbolic links when hashing directory, or an option to ignore broken symlinks #9971

Open
jnareb opened this issue Sep 23, 2023 · 0 comments
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove), feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)

jnareb commented Sep 23, 2023

Justification for the feature

I am working on a machine learning project in the Mining Software Repositories (MSR) field. One of the stages clones a repository. With DVC, I tried to add the cloned repository (or rather, a directory containing a set of cloned repositories) as an external output (not cached). The contents of the cloned repository, i.e. its checked-out files, are not under the control of the cloning stage.

With this setup, when I run dvc repro or dvc stage, DVC fails with the following error:

ERROR: unexpected error - [Errno 2] No such file or directory:
 '/mnt/data/repositories/freeradius-server/src/tests/salt-test-server/salt/ldap/freeradius.ldif'

The error message is a bit misleading: the path mentioned in the error message does exist, but it is a broken symbolic link. The link itself exists; the file it references does not.
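To illustrate the distinction, here is a minimal sketch using only the Python standard library. It is not DVC code: the helper name describe_path and the file names are made up for the example. It shows that opening a dangling symlink raises the same Errno 2, even though the link itself is present on disk.

```python
import errno
import os
import tempfile


def describe_path(path):
    """Classify a path as 'missing', 'broken symlink', 'symlink' or 'not a symlink'.

    Hypothetical helper for illustration only.
    """
    if not os.path.lexists(path):  # lexists() does not follow symlinks
        return "missing"
    if os.path.islink(path):
        # exists() follows the link, so it is False when the target is gone
        return "symlink" if os.path.exists(path) else "broken symlink"
    return "not a symlink"


with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "freeradius.ldif")
    # Create a link whose target does not exist (a dangling link).
    os.symlink(os.path.join(tmp, "no-such-target"), link)

    state = describe_path(link)
    try:
        open(link, "rb").close()
        open_errno = None
    except FileNotFoundError as exc:
        open_errno = exc.errno  # ENOENT, reported as "[Errno 2]"
```

The error is raised for the link's path, which is why the message names a file that a directory listing still shows.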

Option to not follow symbolic links

I propose that DVC add an option to not follow symbolic links when hashing directories or files. With such an option (whether it is named --no-follow-symlinks, --no-dereference, or something else), DVC would hash the contents of the symbolic link itself (the path it stores), not the file it points to.

By the way, as far as I understand it, this is the default behavior for 'tar' and similar tools.

If this option / feature is enabled, DVC would have to check if the file is a symbolic link with os.path.islink() or Path.is_symlink(), and then instead of using open to read its contents (or rather the contents of the file it points to), use os.readlink() or Path.readlink().

This could be encapsulated in a custom open function and/or context manager.
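A rough sketch of what such a wrapper could look like follows. The names open_no_follow and hash_no_follow are hypothetical and not part of DVC, and SHA-256 is an arbitrary choice for the example; how DVC would actually integrate this is an open question.

```python
import hashlib
import io
import os
import tempfile


def open_no_follow(path):
    """Open `path` for binary reading without dereferencing symlinks.

    For a symlink, return an in-memory stream over the link's stored target
    (what os.readlink() reports); otherwise open the file normally.
    Hypothetical helper, not a DVC API.
    """
    if os.path.islink(path):
        return io.BytesIO(os.fsencode(os.readlink(path)))
    return open(path, "rb")


def hash_no_follow(path):
    """Return a hex digest of the bytes yielded by open_no_follow()."""
    with open_no_follow(path) as stream:
        return hashlib.sha256(stream.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "broken.ldif")
    os.symlink("missing-target", link)  # dangling on purpose
    link_digest = hash_no_follow(link)  # no FileNotFoundError is raised
```

In this sketch a symlink hashes the same as a regular file whose bytes equal the link target, so a real implementation would probably also need to record that the entry is a link.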

Option to ignore broken symbolic links

If this option / feature is enabled, DVC would catch the exception raised when trying to open a broken symbolic link, determine the path that was being opened (by extracting it from the exception or by remembering it beforehand), and ignore the error if that path itself exists as a symbolic link.

I think this is the more backward-compatible solution, and it might be easier to implement.
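The catch-and-check approach could be sketched roughly as below. The function name read_ignoring_broken_links and the choice to return None for a skipped entry are assumptions made for the example, not DVC behaviour.

```python
import os
import tempfile


def read_ignoring_broken_links(path):
    """Return the file's bytes, or None if `path` is a broken symlink.

    Hypothetical helper: any other missing path still raises.
    """
    try:
        with open(path, "rb") as stream:
            return stream.read()
    except FileNotFoundError:
        # The open failed, but the link itself is present: a dangling link.
        if os.path.islink(path):
            return None
        raise


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.txt"), "wb") as f:
        f.write(b"data")
    os.symlink("missing-target", os.path.join(tmp, "broken.ldif"))

    contents, skipped = {}, []
    for name in sorted(os.listdir(tmp)):
        data = read_ignoring_broken_links(os.path.join(tmp, name))
        if data is None:
            skipped.append(name)  # broken link, ignored
        else:
            contents[name] = data

    # A path that does not exist at all is still reported as an error.
    try:
        read_ignoring_broken_links(os.path.join(tmp, "absent"))
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```

Valid symlinks are still followed here, so existing behaviour changes only for links whose targets are missing.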

@pmrowla pmrowla added feature request Requesting a new feature A: data-management Related to dvc add/checkout/commit/move/remove labels Sep 25, 2023
@dberenbaum dberenbaum added the p3-nice-to-have It should be done this or next sprint label Oct 6, 2023

3 participants