Different stale detection to avoid >1 workers in directory. #2082
CI doesn't seem to get any macOS runners.
@vdbergh The CI has a big backlog and workers are stuck here: https://github.com/official-stockfish/fishtest/actions/runs/9603511710. Some of your CI runs seem to be stuck too, with CI times of 3+ hours.
Yes, some of my older commits had a bug making the tests hang. But I thought they had all been cancelled by now.
We touch the lock file every two seconds and check its age when we start the worker. If the lock file is too old, we consider it stale. For the actual lock we open the lock file with the flags O_EXCL|O_CREAT. This is atomic and fails if the lock file already exists. The logic has been refactored into a separate package, packages.openlock. Not tested on Windows, but we don't use any Windows-specific APIs.
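The scheme described above can be sketched roughly as follows. This is only an illustration, not the code in packages.openlock: the function names (`try_acquire`, `is_stale`, `acquire`, `start_heartbeat`) are hypothetical, and the staleness threshold `STALE_AFTER` is an assumed value since the description does not state one.

```python
import os
import threading
import time

TOUCH_INTERVAL = 2.0  # seconds between touches, as in the description
STALE_AFTER = 10.0    # ASSUMPTION: the PR text does not give the threshold


def is_stale(path, stale_after=STALE_AFTER):
    """Return True if the lock file exists and was not touched recently."""
    try:
        age = time.time() - os.stat(path).st_mtime
    except FileNotFoundError:
        return False
    return age > stale_after


def try_acquire(path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def acquire(path, stale_after=STALE_AFTER):
    """Try to take the lock, removing a stale lock file first if needed."""
    if try_acquire(path):
        return True
    if is_stale(path, stale_after):
        # This step is NOT race-free: several processes may all decide the
        # file is stale, and one may remove a lock another just created.
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
        return try_acquire(path)
    return False


def start_heartbeat(path, interval=TOUCH_INTERVAL):
    """Touch the lock file periodically; set the returned event to stop."""
    stop = threading.Event()

    def run():
        while not stop.wait(interval):
            try:
                os.utime(path)  # sets atime and mtime to the current time
            except FileNotFoundError:
                break  # the lock file was removed, nothing left to touch

    threading.Thread(target=run, daemon=True).start()
    return stop
```

A worker would call `acquire("worker.lock")` at startup, exit if it returns False, and otherwise call `start_heartbeat("worker.lock")` and keep the returned event so it can stop the touching thread before removing the lock file on shutdown.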
Converted this to a draft since there is still a race condition.
The race condition involves three processes attempting to acquire the lock
The result is that process 3 thinks it holds the lock while in reality it is open. I do not see a solution for this. It seems atomic file system operations (which are ubiquitous) do not really help with handling stale lock files (the goal being that in the common case, where there are no stale lock files, no timeouts are used).
@vdbergh We have just moved PROD and the net server to a new server (8 threads, 32 GB RAM, 438 GB hard disk space), running Ubuntu 22.04 and MongoDB 7.
Great. Although I liked the constrained environment of the previous server :) That makes it more pleasant to look for optimizations...
@ppigazzini A question about this PR. The worker uses the following command on Windows:
However, on CI this command seems to return nothing, presumably because getting the command line of a process requires elevated privileges. Does it work on a normal Windows installation where the user has default privileges?
We kept the same configuration and minimalistic philosophy. I simply dropped |
I can do some tests. Unfortunately, in my experience PowerShell command permissions are a moving target across Windows versions, so local tests with a VM are not really conclusive.
Hmm. This command is what protects the worker (on master) against being started multiple times in the same directory (on Windows).
(Get-CimInstance Win32_Process -Filter "ProcessId = $pid") in PowerShell works for me in a very non-admin-restricted environment.
Yes I know. But |
The point is the |
@vdbergh Your cmdline has a wrong backtick (typo?). I successfully ran the command in a normal cmd prompt, on a clean Windows 11 VM, with a standard user.
Thanks. It's good to hear that it works! Now I wonder how to deal with this in CI (where the command seems to fail silently)... |
Actually the command in your example does not have arguments. Could you test it with a running worker ( |
The culprit could be the Windows Server 2022 used by GitHub as the runner.
There is also a bug in the worker, since it looks for PID + command in the returned string, but the returned string does not contain the PID... I think I am just going to look for the string "python" (+PID) in the process list. There will not be many Python processes on Windows.
Thanks! Still, as long as it does not work in CI it is inconvenient (difficult to validate changes)...
I will double check if it really doesn't work.
Actually it does work. I guess I was confused by the PID bug. Sorry for the noise.