add: deletes files from the cache on NAS servers on Windows when duplicate images are present #6368
Comments
Hi @louistransfer ! Could you try |
Hi @efiop ! We executed the command, then the command The output has slightly changed however, as the WinErrors "Accès refusé" have turned into "[Errno 13] Permission denied":
|
@louistransfer Looks like you don't have rights to write files in that location. Could you check if you can create files in |
@efiop I managed to create a new txt file in the directory, and I also managed to move image files into it, so I do have the rights to write files in that location. Maybe Python uses different permissions? |
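One way to check whether Python sees the same permissions as the file explorer is to run, from Python, the kind of file operations a cache move needs. The sketch below is a hypothetical diagnostic, not part of DVC: the `probe_directory` name and the step names are made up for illustration. It creates a file, renames it, replaces an existing file and deletes it, recording any `OSError` per step.

```python
import os
import tempfile
import uuid


def probe_directory(directory):
    """Run basic file operations in `directory`.

    Returns a dict mapping each step name to None on success,
    or to the error text if the step raised an OSError.
    """
    results = {}
    tag = uuid.uuid4().hex
    first = os.path.join(directory, f"probe-{tag}.tmp")
    second = os.path.join(directory, f"probe-{tag}.dst")

    def create(path):
        with open(path, "wb") as fobj:
            fobj.write(b"probe")

    def attempt(step, func):
        try:
            func()
            results[step] = None
        except OSError as exc:
            results[step] = f"{type(exc).__name__}: {exc}"

    attempt("create", lambda: create(first))
    # Rename to a name that does not exist yet.
    attempt("rename_to_new_name", lambda: os.rename(first, second))
    attempt("create_again", lambda: create(first))
    # Overwrite a file that already exists.
    attempt("replace_existing", lambda: os.replace(first, second))
    attempt("delete", lambda: os.remove(second))

    # Best-effort cleanup in case a step failed half-way.
    for leftover in (first, second):
        try:
            os.remove(leftover)
        except OSError:
            pass
    return results


if __name__ == "__main__":
    # Demo on a local temporary directory; point it at the cache
    # directory on the NAS to compare the results.
    with tempfile.TemporaryDirectory() as directory:
        for step, error in probe_directory(directory).items():
            print(step, "ok" if error is None else error)
```

If every step reports success on a local disk but some step fails inside the cache directory on the NAS, that would suggest the share treats renames or overwrites differently from plain file creation.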
I am also running into this problem. I can reproduce the error as described and I get the same verbose output as @louistransfer. I've done some debugging and I'm thinking it is some sort of a race condition. While debugging I saw this error which wasn't shown in the verbose output: [WinError 183] Cannot create a file when that file already exists: '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350.dKvQKmJbSYaXVe6gD9k2j6' -> '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350'
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\shutil.py", line 791, in move
os.rename(src, real_dst)
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\site-packages\dvc\utils\fs.py", line 114, in move
shutil.move(tmp, dst)
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\site-packages\dvc\fs\local.py", line 97, in move
move(from_info, to_info)
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\site-packages\dvc\objects\db\base.py", line 78, in add
self.fs.move(path_info, cache_info)
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\concurrent\futures\thread.py", line 80, in _worker
work_item.run()
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Note that the file already existed as I had duplicate files that I was adding) I saw in the verbose stack trace that Instead I put a breakpoint inside of
Environment
DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.10 on Windows-10-10.0.19042-SP0
Supports:
http (requests = 2.25.1),
https (requests = 2.25.1)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: local, local, local, local, local
Workspace directory: ('unknown', 'none')
Repo: dvc, git
Also, I've run |
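For context on the `[WinError 183]` reported above: on Windows, `os.rename` raises `FileExistsError` when the destination already exists, whereas `os.replace` overwrites an existing file on both Windows and POSIX. The sketch below is only an illustration of that difference with several threads publishing identical content to the same path; `publish` and `publish_duplicates` are hypothetical names and this is not DVC's code. It also does not explain the "Access is denied" errors on the NAS, and on Windows `os.replace` may still fail if another handle has the destination open.

```python
import os
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor


def publish(content, dst):
    """Write `content` under a unique temporary name, then move it onto `dst`."""
    tmp = f"{dst}.{uuid.uuid4().hex}"
    with open(tmp, "wb") as fobj:
        fobj.write(content)
    # os.rename(tmp, dst) would raise FileExistsError on Windows as soon as
    # another worker has already created dst; os.replace overwrites instead.
    os.replace(tmp, dst)
    return dst


def publish_duplicates(content, dst, workers=4):
    """Simulate several workers adding the same (duplicate) file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(publish, content, dst) for _ in range(workers)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache:
        target = os.path.join(cache, "5bcf2182da5af309d2b30c77f79350")
        publish_duplicates(b"same image bytes", target)
        print(os.listdir(cache))
```

Because every worker writes the same bytes, it does not matter which one wins the final replace; the destination ends up with the expected content either way.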
@cubrink Are you also using nfs? |
Regarding the deletion, But it is more important to debug the errors you guys ran into, as that is the main problem that prevents |
@efiop Just spoke to IT as I couldn't speak authoritatively about the network configuration. I was told accessing the NAS is via SMB protocol, not NFS |
@cubrink Could you show us full log, please? |
@efiop, I ran some tests and have logged the output. If there are other logs that you need just let me know.
Setup
These tests are with 3 total files.
I ran the following tests, which vary by the state of the cache:
1. Empty cache
Results
All images in
Output
2021-08-04 13:22:47,985 DEBUG: Check for update is enabled.
2021-08-04 13:22:48,071 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2021-08-04 13:22:48,118 DEBUG: Spawned '['daemon', '-q', 'updater']'
Adding... 2021-08-04 13:22:48,454 DEBUG: state save (8262666151072389689, 1628101306255478528, 7563) b8cf30f4746145cefc80f3dce8e0aee9 ?md5/s]
2021-08-04 13:22:48,459 DEBUG: state save (3369592122099693323, 1628101306224222976, 367529) f65bcf2182da5af309d2b30c77f79350
2021-08-04 13:22:48,462 DEBUG: state save (641099020852704712, 1628101306208571136, 367529) f65bcf2182da5af309d2b30c77f79350
2021-08-04 13:22:48,490 DEBUG: state save (9219199384940878093, 1628101368989038592, 205) 582ccb789a45fc816efc5f5ed47ada9e.dir
2021-08-04 13:22:48,492 DEBUG: state save (8189424156304719068, 7ed664205d3177de31678d30c57a25d7, 742621) 582ccb789a45fc816efc5f5ed47ada9e.dir
2021-08-04 13:22:48,495 DEBUG: {'images': 'modified'}
2021-08-04 13:22:48,506 DEBUG: Computed stage: 'images.dvc' md5: 'None'
2021-08-04 13:22:48,554 DEBUG: state save (8262666151072389689, 1628101306255478528, 7563) b8cf30f4746145cefc80f3dce8e0aee9 ?file/s]
2021-08-04 13:22:48,558 DEBUG: state save (641099020852704712, 1628101306208571136, 367529) f65bcf2182da5af309d2b30c77f79350
Adding...
2021-08-04 13:22:48,590 ERROR: unexpected error - [Errno 13] Permission denied: '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350'
------------------------------------------------------------
Traceback (most recent call last):
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 791, in move
os.rename(src, real_dst)
PermissionError: [WinError 5] Access is denied: '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350.cJGKLGRMxxHxi39ZLzvE6R' -> '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\main.py", line 55, in main
ret = cmd.do_run()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\command\base.py", line 50, in do_run
return self.run()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\command\add.py", line 21, in run
self.repo.add(
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\utils\collections.py", line 128, in inner
result = func(*ba.args, **ba.kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\__init__.py", line 51, in wrapper
return f(repo, *args, **kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\scm_context.py", line 14, in run
return method(repo, *args, **kw)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\add.py", line 187, in add
stage.commit()
File "C:\Users\cbrinker\AppData\Roaming\Python\Python38\site-packages\funcy\decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
return call()
File "C:\Users\cbrinker\AppData\Roaming\Python\Python38\site-packages\funcy\decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\stage\__init__.py", line 507, in commit
out.commit(filter_info=filter_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\output.py", line 576, in commit
objects.save(self.odb, obj)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\objects\__init__.py", line 39, in save
future.result()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\_base.py", line 437, in result
return self.__get_result()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\_base.py", line 389, in __get_result
raise self._exception
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\objects\db\base.py", line 78, in add
self.fs.move(path_info, cache_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\fs\local.py", line 97, in move
move(from_info, to_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\utils\fs.py", line 114, in move
shutil.move(tmp, dst)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 811, in move
copy_function(src, real_dst)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 435, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350'
------------------------------------------------------------
2021-08-04 13:22:50,402 DEBUG: Version info for developers:
DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.10 on Windows-10-10.0.19042-SP0
Supports:
http (requests = 2.25.1),
https (requests = 2.25.1)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: local, local, local, local, local
Workspace directory: ('unknown', 'none')
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2021-08-04 13:22:50,415 DEBUG: Analytics is enabled.
2021-08-04 13:22:50,417 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\cbrinker\\AppData\\Local\\Temp\\tmp0ra57hku']'
2021-08-04 13:22:50,471 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\cbrinker\\AppData\\Local\\Temp\\tmp0ra57hku']'
2. Cache already contains duplicate file
Results
Only duplicate files in
Output
2021-08-04 13:27:45,325 DEBUG: Check for update is enabled.
2021-08-04 13:27:45,408 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2021-08-04 13:27:45,473 DEBUG: Spawned '['daemon', '-q', 'updater']'
Adding... 2021-08-04 13:27:45,780 DEBUG: Adding 'images' to '.gitignore'.
2021-08-04 13:27:45,823 DEBUG: state save (3134965523380195232, 1627999366736115456, 367529) f65bcf2182da5af309d2b30c77f79350 ?md5/s]
2021-08-04 13:27:45,826 DEBUG: state save (5810031232835155488, 1628000036176760576, 7563) b8cf30f4746145cefc80f3dce8e0aee9
2021-08-04 13:27:45,830 DEBUG: state save (8074608159519882170, 1627999366740106240, 367529) f65bcf2182da5af309d2b30c77f79350
2021-08-04 13:27:45,858 DEBUG: state save (39950517512335132, 1628101666359935232, 274) 559e3d5b4ba59eb701d29bd45dd9bdef.dir
2021-08-04 13:27:45,861 DEBUG: state save (8189424156304719068, 103d3b5ea78f3d05f4a38a7dfce13db9, 757981) 559e3d5b4ba59eb701d29bd45dd9bdef.dir
2021-08-04 13:27:45,863 DEBUG: {'images': 'modified'}
2021-08-04 13:27:45,875 DEBUG: Computed stage: 'images.dvc' md5: 'None'
2021-08-04 13:27:45,919 DEBUG: state save (5810031232835155488, 1628000036176760576, 7563) b8cf30f4746145cefc80f3dce8e0aee9
Adding...
2021-08-04 13:27:45,962 ERROR: unexpected error - [WinError 32] The process cannot access the file because it is being used by another process: '<path_to_my_dvc_project>\\images\\Thumbs.db'
------------------------------------------------------------
Traceback (most recent call last):
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 791, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '<path_to_my_dvc_project>\\images\\Thumbs.db' -> '<path_to_my_dvc_project>\\.dvc\\cache\\39\\efca5cf287d61175024b42a2ca8527.2fUsc9jKPw8SkWgAoFsE26'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\main.py", line 55, in main
ret = cmd.do_run()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\command\base.py", line 50, in do_run
return self.run()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\command\add.py", line 21, in run
self.repo.add(
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\utils\collections.py", line 128, in inner
result = func(*ba.args, **ba.kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\__init__.py", line 51, in wrapper
return f(repo, *args, **kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\scm_context.py", line 14, in run
return method(repo, *args, **kw)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\add.py", line 187, in add
stage.commit()
File "C:\Users\cbrinker\AppData\Roaming\Python\Python38\site-packages\funcy\decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
return call()
File "C:\Users\cbrinker\AppData\Roaming\Python\Python38\site-packages\funcy\decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\stage\__init__.py", line 507, in commit
out.commit(filter_info=filter_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\output.py", line 576, in commit
objects.save(self.odb, obj)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\objects\__init__.py", line 39, in save
future.result()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\_base.py", line 437, in result
return self.__get_result()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\_base.py", line 389, in __get_result
raise self._exception
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\objects\db\base.py", line 78, in add
self.fs.move(path_info, cache_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\fs\local.py", line 97, in move
move(from_info, to_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\utils\fs.py", line 112, in move
shutil.move(src, tmp)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 812, in move
os.unlink(src)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '<path_to_my_dvc_project>\\images\\Thumbs.db'
------------------------------------------------------------
2021-08-04 13:27:46,204 DEBUG: Version info for developers:
DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.10 on Windows-10-10.0.19042-SP0
Supports:
http (requests = 2.25.1),
https (requests = 2.25.1)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: local, local, local, local, local
Workspace directory: ('unknown', 'none')
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2021-08-04 13:27:46,215 DEBUG: Analytics is enabled.
2021-08-04 13:27:46,218 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\cbrinker\\AppData\\Local\\Temp\\tmpuw_myffg']'
2021-08-04 13:27:46,263 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\cbrinker\\AppData\\Local\\Temp\\tmpuw_myffg']'
3. Cache contains only non-duplicate files
Results
Only duplicate files in
Output
2021-08-04 13:30:26,125 DEBUG: Check for update is enabled.
2021-08-04 13:30:26,213 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2021-08-04 13:30:26,259 DEBUG: Spawned '['daemon', '-q', 'updater']'
Adding... 2021-08-04 13:30:26,564 DEBUG: Adding 'images' to '.gitignore'.
2021-08-04 13:30:26,651 DEBUG: state save (232056546587901106, 1628000036176760576, 7563) b8cf30f4746145cefc80f3dce8e0aee9 ?md5/s]
2021-08-04 13:30:26,680 DEBUG: state save (3860289204771833624, 1628101827168513536, 274) 559e3d5b4ba59eb701d29bd45dd9bdef.dir
2021-08-04 13:30:26,683 DEBUG: state save (8189424156304719068, 103d3b5ea78f3d05f4a38a7dfce13db9, 757981) 559e3d5b4ba59eb701d29bd45dd9bdef.dir
2021-08-04 13:30:26,685 DEBUG: {'images': 'modified'}
2021-08-04 13:30:26,697 DEBUG: Computed stage: 'images.dvc' md5: 'None'
2021-08-04 13:30:26,743 DEBUG: state save (8074608159519882170, 1627999366740106240, 367529) f65bcf2182da5af309d2b30c77f79350
Adding...
2021-08-04 13:30:26,785 ERROR: unexpected error - [WinError 32] The process cannot access the file because it is being used by another process: '<path_to_my_dvc_project>\\images\\Thumbs.db'
------------------------------------------------------------
Traceback (most recent call last):
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 791, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '<path_to_my_dvc_project>\\images\\Thumbs.db' -> '<path_to_my_dvc_project>\\.dvc\\cache\\39\\efca5cf287d61175024b42a2ca8527.iwJpd79FSwcKJkh9uHuVmH'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\main.py", line 55, in main
ret = cmd.do_run()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\command\base.py", line 50, in do_run
return self.run()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\command\add.py", line 21, in run
self.repo.add(
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\utils\collections.py", line 128, in inner
result = func(*ba.args, **ba.kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\__init__.py", line 51, in wrapper
return f(repo, *args, **kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\scm_context.py", line 14, in run
return method(repo, *args, **kw)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\repo\add.py", line 187, in add
stage.commit()
File "C:\Users\cbrinker\AppData\Roaming\Python\Python38\site-packages\funcy\decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
return call()
File "C:\Users\cbrinker\AppData\Roaming\Python\Python38\site-packages\funcy\decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\stage\__init__.py", line 507, in commit
out.commit(filter_info=filter_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\output.py", line 576, in commit
objects.save(self.odb, obj)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\objects\__init__.py", line 39, in save
future.result()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\_base.py", line 437, in result
return self.__get_result()
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\_base.py", line 389, in __get_result
raise self._exception
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\objects\db\base.py", line 78, in add
self.fs.move(path_info, cache_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\fs\local.py", line 97, in move
move(from_info, to_info)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\site-packages\dvc\utils\fs.py", line 112, in move
shutil.move(src, tmp)
File "c:\users\cbrinker\anaconda3\envs\dvc-test\lib\shutil.py", line 812, in move
os.unlink(src)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '<path_to_my_dvc_project>\\images\\Thumbs.db'
------------------------------------------------------------
2021-08-04 13:30:27,027 DEBUG: Version info for developers:
DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.10 on Windows-10-10.0.19042-SP0
Supports:
http (requests = 2.25.1),
https (requests = 2.25.1)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: local, local, local, local, local
Workspace directory: ('unknown', 'none')
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2021-08-04 13:30:27,038 DEBUG: Analytics is enabled.
2021-08-04 13:30:27,041 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\cbrinker\\AppData\\Local\\Temp\\tmpctiymb6u']'
2021-08-04 13:30:27,088 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\cbrinker\\AppData\\Local\\Temp\\tmpctiymb6u']' |
Hm, still not sure what's going on there and not able to reproduce 🙁 Closing as stale for now, we'll definitely improve rollback though in #6387 |
Bug Report
Description
With @anasitomtn, we have been working on using DVC on a Windows NAS server with an NTFS file system. One of our data scientists reported a strange issue when he started using DVC: files that were supposed to be copied to the cache disappeared completely when he ran the dvc add command.
We managed to narrow the issue down. Initially we could not reproduce it with the same DVC and Python versions on Windows using different images. However, when we used the same images as him, the issue appeared again. Reducing the images folder to only 2 duplicate images still triggered the bug.
We are also investigating an issue with links on a Windows NAS which may or may not be linked to this.
There appears to be an issue with os.rename (see the "Additional Information" section). Our theory is that, when duplicates are present, DVC creates a cache file named after the hash of the first duplicate image. DVC seems to assume that all hashes are unique when building the cache, so when it tries to create the cache file for the second duplicate, it fails because it lacks the permissions to replace the existing cache file with the new one (which has exactly the same name, since the hash is deterministic). This hypothesis still needs to be confirmed. Please note that all files are removed from the original folder in any case, not only the duplicates.
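To make the theory concrete, here is a minimal sketch of how duplicate files collide on a single cache path. It assumes the hash is a plain MD5 of the file bytes and that the cache path is built from the first two hex characters followed by the rest, which matches the layout seen in the logs of this thread (hash f65bcf2182da5af309d2b30c77f79350 stored under .dvc\cache\f6\5bcf2182da5af309d2b30c77f79350). The function names are hypothetical and this is not DVC's implementation.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5):
    """Assumed layout: <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return os.path.join(cache_dir, md5[:2], md5[2:])


def plan_cache_moves(paths, cache_dir):
    """Group workspace files by the cache path they would be stored under."""
    plan = defaultdict(list)
    for path in paths:
        plan[cache_path(cache_dir, file_md5(path))].append(path)
    return dict(plan)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        contents = {"a.jpg": b"same", "b.jpg": b"same", "c.jpg": b"other"}
        paths = []
        for name, data in contents.items():
            path = os.path.join(workspace, name)
            with open(path, "wb") as fobj:
                fobj.write(data)
            paths.append(path)
        for target, sources in plan_cache_moves(paths, ".dvc/cache").items():
            print(target, "<-", [os.path.basename(s) for s in sources])
```

Any cache path with more than one source is a place where the second move has to cope with a destination that already exists.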
This issue from 2019 appears to involve a similar configuration, although it was run from Ubuntu rather than Windows (in our case, there are no issues on Linux). I do not think the original issue was fixed; instead, the ticket was closed when another bug, related to "dvc version" and mentioned in the ticket thread, was fixed.
Fortunately, all of our tests were run on test data, but we believe that this bug can be very dangerous for data scientists who want to run experiments on production data on a NAS, as it can happen at any time in the DVC workflow (before any push to an S3 remote, for instance).
Even worse, the bug wipes any batch of images from the workspace as long as they are included in the new dvc add: if you run "dvc add images_folder" after adding 1000 images containing only two duplicates to a folder that is already tracked by DVC, the 1000 images will be deleted from the workspace and will not be added to the cache. If a lot of images are already present in the workspace, the data scientist may never notice that the new images have disappeared. If a production pipeline runs dvc add commands for ML experiments on a Windows NAS, some images could disappear silently.
Reproduce
The two scenarios give the same result; we include both for easier reproduction.
1 - Initial situation
git init
dvc init
The project should look like this:
dvc add images/ -v
2 - Adding a new batch of images
The project should look like this:
or
dvc add images/ -v
Expected
The images should be added to the cache without being removed from the workspace.
At the very least, DVC should report an error when it fails to copy the files to the cache, and it should not touch any of the original files.
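The expected behaviour could be sketched as "copy first, never delete the source". The example below is a hedged illustration under assumptions, not DVC's implementation: `add_file` and `add_directory` are hypothetical names, the hash is assumed to be a plain MD5 of the file bytes, and the cache layout is the two-character prefix directory seen in the logs. A file is copied to a temporary name inside the cache and then moved into place; if anything fails, the error propagates and the workspace file is left as it was.

```python
import hashlib
import os
import shutil
import tempfile
import uuid


def file_md5(path):
    """Return the hex MD5 digest of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_file(src, cache_dir):
    """Copy `src` into the cache; the source file is never modified or removed."""
    md5 = file_md5(src)
    dst = os.path.join(cache_dir, md5[:2], md5[2:])
    if os.path.exists(dst):
        # A duplicate was cached already: nothing to do.
        return dst
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    tmp = f"{dst}.{uuid.uuid4().hex}"
    try:
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)
    finally:
        # Remove the temporary copy if the move did not happen.
        if os.path.exists(tmp):
            os.remove(tmp)
    return dst


def add_directory(directory, cache_dir):
    """Cache every regular file in `directory`; map source path -> cache path."""
    cached = {}
    for name in sorted(os.listdir(directory)):
        src = os.path.join(directory, name)
        if os.path.isfile(src):
            cached[src] = add_file(src, cache_dir)
    return cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        images = os.path.join(root, "images")
        os.mkdir(images)
        for name, data in [("a.jpg", b"same"), ("b.jpg", b"same")]:
            with open(os.path.join(images, name), "wb") as fobj:
                fobj.write(data)
        print(add_directory(images, os.path.join(root, "cache")))
        print(sorted(os.listdir(images)))
```

Replacing the workspace files with links to the cache, if wanted, would be a separate step that only runs once every file has been cached successfully.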
Environment information
Any version of DVC and Python running on Windows, on a NAS server.
Additional Information (if any):
Here are the logs (note: "accès refusé" means "access denied"):