Windows: removing of files / folder can fail randomly on Windows #4487

Open
MSoegtropIMC opened this issue Dec 31, 2020 · 5 comments


@MSoegtropIMC

There is one stability issue with using opam on Windows: removing files or folders is never guaranteed to work, because some other process (Explorer, the Windows indexer, a virus scanner, you name it) might hold a handle on the file, and then the remove fails. Unlike Unix, where open files are referenced via inode and deleting a file doesn't prevent another process from keeping it open, on Windows open files are referenced by name and any handle to an open file makes it impossible to delete it. I know that this is a design bug of Windows, but I still have frequent opam failures because removing some temporary file failed, even though I have a decent virus scanner and have configured it to leave the relevant folders alone. It is hard to track what happens: if I run Procmon in parallel, the issue is much harder to reproduce (I have opam-based scripts which fail about 50% of the time otherwise).

I know it is a grotesque hack, but would it be possible to just retry if a remove of a file or folder fails? This is really the only stability issue I have on Windows and I waste a lot of CPU and personal time with this and it is also a major issue for CI.

@dra27
Member

dra27 commented Mar 17, 2021

Sorry for the slow reply. This "feature" of Windows regularly gets in the way, yes! Retrying is definitely a worthwhile option, but it's also worth trying to get to a place where this doesn't matter as much. One way, for example, is by not depending on precise temporary file names. I remember having to do that in the OCaml testsuite: all the tests created a program called "program", but you got much more reliable performance on Windows by using a different name for each and trusting that at some point Windows would finally erase program1.exe, program2.exe once the virus scanner and whatever else had had enough! Often these "zombie" files can be renamed and even moved, but not deleted, so another option in this case would be to move them to a "trash" folder and retry another time.
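For illustration only, the "different name for each" idea could look roughly like the following Python sketch. This is not what the OCaml testsuite or opam actually does; `unique_name`, its parameters, and the counter are hypothetical names invented for the example.

```python
import itertools
import os

# Hypothetical process-wide counter; the numbering scheme is an assumption.
_counter = itertools.count(1)


def unique_name(directory, stem="program", ext=".exe"):
    """Return a path in `directory` that does not exist yet.

    A leftover "zombie" file from an earlier run that Windows has not
    erased yet is skipped instead of being deleted and reused.
    """
    while True:
        candidate = os.path.join(directory, f"{stem}{next(_counter)}{ext}")
        if not os.path.lexists(candidate):
            return candidate
```

The point of the design is that nothing ever needs to delete a file before the next step can proceed; stale files are simply stepped over.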

@MSoegtropIMC
Author

The removals which get in the way are typically those of build folders after a build has finished successfully. I guess the main reason to remove these in opam is to save disk space, so renaming wouldn't help. In my experience the removal typically works 1s later.

@MSoegtropIMC
Author

P.S.: this experience comes from a set of meta build shell scripts I maintained before switching to opam. There I had a removal retry with a 1s sleep, and the 2nd try usually worked. If I remember right I had a timeout of 5 minutes, which was never exceeded unless I had e.g. a file in the build folder open in an editor. In that case it would make sense to show a message "Cannot delete XYZ - do you have it open in an editor?". Not nice, but better than failing for reasons users don't understand.
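A rough sketch of such a retry loop, written in Python purely for illustration (the original scripts were shell scripts, and this is not opam code). The function name `remove_with_retry` and its parameters are made up for the example; the 1s delay and 5 minute timeout are the values mentioned above.

```python
import os
import shutil
import time


def remove_with_retry(path, delay=1.0, timeout=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Remove a file or directory tree, retrying on failure.

    Returns True once the path is gone, False if it still cannot be
    removed when `timeout` seconds have passed.
    """
    deadline = clock() + timeout
    while True:
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            elif os.path.lexists(path):
                os.remove(path)
            return True
        except OSError:
            # PermissionError (a locked file on Windows) is an OSError.
            if clock() >= deadline:
                print(f"Cannot delete {path} - do you have it open in an editor?")
                return False
            sleep(delay)
```

If a directory removal fails halfway, the next attempt simply continues with whatever is left, so partial progress is not lost.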

@dra27
Member

dra27 commented Mar 18, 2021

Indeed - the renaming part of it is because I'd prefer (Windows) opam to waste space rather than waste time - in other words, if the delete fails, I'd just like something which "marks" the file/directory for future garbage collection without blocking other opam operations from proceeding.
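As a hedged sketch of that "mark for later garbage collection" idea, in Python for illustration only (not opam's implementation): `remove_or_defer`, `collect_trash`, and the `trash_dir` layout are hypothetical. It assumes the trash folder is on the same volume as the path, since the move is a plain rename, and that the locked entry can in fact be renamed, which is often but not always the case.

```python
import os
import shutil
import uuid


def _remove(path):
    """Delete a file or a directory tree; raises OSError on failure."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    else:
        os.remove(path)


def remove_or_defer(path, trash_dir, remove=_remove):
    """Try to delete `path`; if that fails, move it into `trash_dir`.

    Returns None if the delete succeeded, otherwise the new location.
    An OSError from the rename itself is not caught.
    """
    try:
        remove(path)
        return None
    except OSError:
        os.makedirs(trash_dir, exist_ok=True)
        name = os.path.basename(os.path.normpath(path))
        # A random suffix keeps repeated deferrals of the same name apart.
        target = os.path.join(trash_dir, f"{name}.{uuid.uuid4().hex}")
        os.rename(path, target)
        return target


def collect_trash(trash_dir, remove=_remove):
    """Retry deleting deferred entries; return the ones still stuck."""
    leftover = []
    if not os.path.isdir(trash_dir):
        return leftover
    for name in os.listdir(trash_dir):
        entry = os.path.join(trash_dir, name)
        try:
            remove(entry)
        except OSError:
            leftover.append(entry)
    return leftover
```

Something like `collect_trash` could then run at a later point, for example at the start of the next command, so that the operation that hit the lock is not blocked.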

@MSoegtropIMC
Author

Yes, although I believe that dependability is more important than speed. A failed remove is quite rare on a per-package basis, but it happens frequently when I build, say, 100 opam packages in a batch. Such a build takes 2 hours anyway, and if it takes a few seconds longer, I don't really care. If you implement a retry with a 1s wait, I would guess that on average the delay per package is around 10ms.
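A back-of-the-envelope check of that guess, as a small Python sketch. The assumptions are mine: one 1s retry is always enough, failures are independent between packages, and the implied per-package failure rate is simply the 10ms guess divided by the 1s wait.

```python
retry_wait_s = 1.0       # sleep before the single retry
avg_delay_s = 0.010      # guessed average delay per package
packages = 100           # size of the batch

# Average delay = failure probability * wait, so the implied rate is:
p_fail = avg_delay_s / retry_wait_s

# Chance that at least one removal fails somewhere in the batch.
p_batch_fails = 1 - (1 - p_fail) ** packages

# Expected extra time for the whole batch with the retry in place.
extra_time_s = packages * avg_delay_s

print(f"per-package failure rate: {p_fail:.1%}")
print(f"batch hits at least one failure: {p_batch_fails:.1%}")
print(f"expected extra time per batch: {extra_time_s:.1f}s")
```

Under these assumptions a failure rate of about 1% per package is rare per package but makes a failure in a 100 package batch more likely than not, while the retry costs about one second on a two hour build.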

@dra27 dra27 added the KIND: BUG label Jul 9, 2021