Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different stale detection to avoid >1 workers in directory. #2082

Closed
wants to merge 1 commit into from

Conversation

vdbergh
Copy link
Contributor

@vdbergh vdbergh commented Jun 20, 2024

We touch the lock file every two seconds and check the age of the lock file when we start the worker. If the lock file is too old then we consider it as stale.

For the actual lock we open the lock file with the flags O_EXCL|O_CREAT. This is atomic and will fail if the lock file already exists.

The logic has been refactored and is contained in a separate package packages.openlock.

Not tested on Windows but it does not use any window specific apis.

@vdbergh vdbergh force-pushed the openlock branch 17 times, most recently from 2180770 to b462180 Compare June 21, 2024 05:05
@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 21, 2024

CI doesn't seem to get any MacOS runners.

@vdbergh vdbergh force-pushed the openlock branch 2 times, most recently from 3fa9a90 to a64d7f9 Compare June 21, 2024 06:15
@Disservin
Copy link
Member

@vdbergh the ci has a big backlog and workers are stuck here https://github.com/official-stockfish/fishtest/actions/runs/9603511710, + some of your ci seems to be stuck too, ci times of 3+ hours

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 21, 2024

Yes some of my older commits had a bug making the tests hang. But I thought they had all been cancelled now.

@vdbergh vdbergh force-pushed the openlock branch 3 times, most recently from 6e6f36c to 2016ff6 Compare June 21, 2024 09:34
We touch the lock file every two seconds and check the age
of the lock file when we start the worker. If the lock
file is too old then we consider it as stale.

For the actual lock we open the lock file with the flags
O_EXCL|O_CREAT. This is atomic and will fail if the lock file
already exists.

The logic has been refactored and is contained in a separate
package packages.openlock.

Not tested on Windows but we don't use any windows specific
apis.
@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 21, 2024

Drafted this since there is still a race condition.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 22, 2024

The race condition involves three processes attempting to acquire the lock

PID Step
1 detects stale lock file
2 detects stale lock file
2 deletes lock file
3 creates lock file atomically
1 deletes lock file

The result is that process 3 thinks it holds the lock while in reality it is open.

I do not see a solution for this. It seems atomic file system operations (which are ubiquitous) do not really help with handling stale lock files (the goal being that in the common case, where there are no stale lock files, no timeouts are being used).

@vdbergh vdbergh closed this Jun 22, 2024
@ppigazzini
Copy link
Collaborator

@vdbergh We have just moved PROD and the net server on a new server (8 threads, 32 GB RAM, 438 GB hard disk space), running with Ubuntu 22.04 and MongoDB 7.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

@vdbergh We have just moved PROD and the net server on a new server (8 threads, 32 GB RAM, 438 GB hard disk space), running with Ubuntu 22.04 and MongoDB 7.

Great. Although I liked the constrained environment of the previous server :) That makes it more pleasant to look for optimizations...

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

@ppigazzini A question about this PR. The worker uses the following command on Windows:

"powershell (Get-CimInstance` Win32_Process -Filter 'ProcessId = {pid}').CommandLine".format(pid)

However on CI this commands seems to return nothing - presumably because getting the command line of a process requires elevated privileges.

Does it work on a normal Windows where the user has default privileges?

@ppigazzini
Copy link
Collaborator

We kept the same configuration and minimalistic philosophy. I simply dropped cpulimit from the scheduled task because we can use 100% of the CPU without the risk to be stopped by the ISP for abuse. The only real bottleneck was the HD space for the pgns with a fleet. Now that constrain is dropped and no fleet will join anytime soon.

@ppigazzini
Copy link
Collaborator

Does it work on a normal Windows where the user has default privileges?

I can do some tests. Unfortunately, in my experience powershell command permission is a moving target with Windows versions, so local tests with VM are not really conclusive.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

Does it work on a normal Windows where the user has default privileges?

I can do some tests. Unfortunately, in my experience powershell command permission is a moving target with Windows versions, so local tests with VM are not really conclusive.

Hmm. This command is what protects the worker (on master) against being started multiple times in the same directory (on Windows).

@peregrineshahin
Copy link
Contributor

peregrineshahin commented Jun 24, 2024

(Get-CimInstance Win32_Process -Filter "ProcessId = $pid") on Powershell works for me in a very non-admin-restricted environment.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

Get-CimInstancenWin32_Process -Filter 'ProcessId = {pid}' on Powershell works for me in a very non-admin-restricted environment.

Yes I know. But (Get-CimInstancenWin32_Process -Filter 'ProcessId = {pid}').CommandLine is what does not work on CI...

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

The point is the (...).CommandLine suffix.

@ppigazzini
Copy link
Collaborator

ppigazzini commented Jun 24, 2024

@vdbergh Your cmdline has a wrong backtick (typo?).

I ran successfully the command in a normal cmd prompt, on a clean Windows 11 VM, with a standard user

"powershell (Get-CimInstance Win32_Process -Filter 'ProcessId = 8912').CommandLine"

image

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

Thanks. It's good to hear that it works!

Now I wonder how to deal with this in CI (where the command seems to fail silently)...

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

Actually the command in your example does not have arguments. Could you test it with a running worker (python worker.py)?

@ppigazzini
Copy link
Collaborator

Now I wonder how to deal with this in CI (where the command seems to fail silently)...

The culprit could be the Windows Server 2022 used by GitHub as runner.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

There is also a bug in the worker since it is looking for PID + command in the returned string. But the returned string does not contain the PID...

I think I am just going to look for the string "python" (+PID) in the process list. There will not be many python processes on windows.

@ppigazzini
Copy link
Collaborator

"C:/msys64/ucrt64/bin\python3.11.exe" -i worker.py is the print of command after fixing the condition for the pid in the stdout and starting a second worker in the same folder.

______ _     _     _            _                        _
|  ___(_)   | |   | |          | |                      | |
| |_   _ ___| |__ | |_ ___  ___| |_  __      _____  _ __| | _____ _ __
|  _| | / __| '_ \| __/ _ \/ __| __| \ \ /\ / / _ \| '__| |/ / _ \ '__|
| |   | \__ \ | | | ||  __/\__ \ |_   \ V  V / (_) | |  |   <  __/ |
\_|   |_|___/_| |_|\__\___||___/\__|   \_/\_/ \___/|_|  |_|\_\___|_|

Worker started in C:\fishtest-worker-setup-winget\worker ... (PID=2092)
"C:/msys64/ucrt64/bin\python3.11.exe" -i worker.py


*** Worker (PID=2092) stopped! ***
Another worker (PID=10300) is already running in this directory, using the lock file:
C:\fishtest-worker-setup-winget\worker\worker.lock
Traceback (most recent call last):
  File "C:\fishtest-worker-setup-winget\worker\worker.py", line 1706, in <module>
    sys.exit(worker())
SystemExit: 1
>>>

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

Thanks! Still as long as it does not work in CI it is inconvenient (difficult to validate changes)...

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

I will double check if it really doesn't work.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 24, 2024

Actually it does work. I guess I was confused by the PID bug. Sorry for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants