Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Every three minutes validate a random run from the cache. #2029

Merged
merged 1 commit into from
May 31, 2024

Conversation

vdbergh
Copy link
Contributor

@vdbergh vdbergh commented May 20, 2024

The reason for this PR is that before introducing more dynamic updates we would like to verify that the dynamic updates of run["workers"] and run["cores"] introduced in #2010 are bug free.

In order to facilitate the use of periodic timers in Fishtest we have introduced a new scheduler in utils.py which may be interesting in its own right.

Currently the timers defined are the cache flush timer and the timer introduced in this PR.

Note that it would be best to test this PR on PROD before merging it. If it turns out that it floods the event log with messages about non-validated runs then obviously there is something wrong with #2010 .

@vdbergh
Copy link
Contributor Author

vdbergh commented May 20, 2024

An example of the use of the scheduler

start = datetime.now(timezone.utc)
s = Scheduler()

def two_second_timer():
    now = datetime.now(timezone.utc)
    print(now - start, "I execute every two seconds")

def three_second_timer():
    now = datetime.now(timezone.utc)
    print(now - start, "I execute every three seconds")

def ten_second_timer():
    now = datetime.now(timezone.utc)
    print(now - start, "I execute every ten seconds")

def eleven_second_timer():
    now = datetime.now(timezone.utc)
    print(now - start, "I execute every eleven seconds")


s.add_task(2, two_second_timer)
s.add_task(3, three_second_timer)
s.add_task(10, ten_second_timer)
s.add_task(11, eleven_second_timer)
0:00:02.000511 I execute every two seconds
0:00:03.001490 I execute every three seconds
0:00:04.001044 I execute every two seconds
0:00:06.000971 I execute every two seconds
0:00:06.001702 I execute every three seconds
0:00:08.001197 I execute every two seconds
0:00:09.001299 I execute every three seconds
0:00:10.001017 I execute every two seconds
0:00:10.001852 I execute every ten seconds
0:00:11.001272 I execute every eleven seconds
0:00:12.000971 I execute every two seconds
0:00:12.001793 I execute every three seconds
0:00:14.000914 I execute every two seconds
0:00:15.001382 I execute every three seconds
0:00:16.000916 I execute every two seconds
0:00:18.000956 I execute every two seconds
0:00:18.001788 I execute every three seconds
0:00:20.001445 I execute every two seconds
0:00:20.002234 I execute every ten seconds
0:00:21.001342 I execute every three seconds
0:00:22.000998 I execute every two seconds
0:00:22.001817 I execute every eleven seconds
0:00:24.000961 I execute every two seconds
0:00:24.001741 I execute every three seconds
0:00:26.000851 I execute every two seconds
0:00:27.001338 I execute every three seconds

@vdbergh
Copy link
Contributor Author

vdbergh commented May 20, 2024

The scheduler is created via

scheduler = Scheduler(jitter=0.0, auto_start=True)

The full signature of add_task is

scheduler.add_task(period, worker, initial_delay=None, one_shot=False, jitter=0.0, args=(), kwargs={})

To stop the scheduler (e.g. in a SIGINT/SIGTERM-handler) do

scheduler.stop()

@vdbergh vdbergh force-pushed the scheduler branch 8 times, most recently from bb83e8b to ed283d3 Compare May 20, 2024 18:14
@vdbergh
Copy link
Contributor Author

vdbergh commented May 21, 2024

It is perhaps useful to observe that currently flush_buffers() has rather confusing code. It does 3 essentially different things;

  • write the oldest dirty run to disk (and then declare it clean);
  • scan the clean runs in the cache for dead tasks and delete them;
  • evict old clean runs from the cache.

With the scheduler contained in this PR it would be trivial to split this into 3 different tasks, with independent periods.

@vdbergh vdbergh force-pushed the scheduler branch 3 times, most recently from 498ed94 to 330249d Compare May 21, 2024 04:12
@vondele
Copy link
Member

vondele commented May 21, 2024

concerning flush_buffers, this is also one of the functions that is high up in the sample based profiling. So if some work doesn't need to be scheduled every second, that's a good thing. Do we actually need it to be every second?

Actually, that makes me wonder about another feature of the timer: Instead of scheduling it regularly, have a function that schedules the next wake up. Two example use cases:

  • if (in the worker) we want to sent a beat every two minutes, but we see that the last beat is 30s ago, there is no need to check every second, but just waking up in 90s for the next check would work.
  • for e.g. flush_buffers, one could measure the time the function takes (e.g. 0.5s) and decide to wake up in 10x that time (5s), so that the function takes less than 10% of CPU time independent on its execution length.

Thinking a bit more about it. Even for regular scheduled tasks with N sec wait time, it would probably be good that wake up N sec after the scheduled task completed, and not every N sec. This would avoid that for tasks taking more than N seconds a long queue of pending tasks builds up.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 21, 2024

concerning flush_buffers, this is also one of the functions that is high up in the sample based profiling. So if some work doesn't need to be scheduled every second, that's a good thing. Do we actually need it to be every second?

Maybe not. Using a scheduler we can tune this.

Actually, that makes me wonder about another feature of the timer: Instead of scheduling it regularly, have a function that schedules the next wake up. Two example use cases:

If you mean a "one_shot" timer. This is already provided.

* if (in the worker) we want to sent a beat every two minutes, but we see that the last beat is 30s ago, there is no need to check every second, but just waking up in 90s for the next check would work.

* for e.g. flush_buffers, one could measure the time the function takes (e.g. 0.5s) and decide to wake up in 10x that time (5s), so that the function takes less than 10% of CPU time independent on its execution length.

Thinking a bit more about it. Even for regular scheduled tasks with N sec wait time, it would probably be good that wake up N sec after the scheduled task completed, and not every N sec. This would avoid that for tasks taking more than N seconds a long queue of pending tasks builds up.

Well for now I would prefer simple code so I would only implement such features when needed. Currently I foresee only applications for tasks that execute much quicker than their scheduling period, so that this would not matter. Having a good scheduler would already be important progress compared to the current situation.

@vdbergh vdbergh marked this pull request as draft May 21, 2024 11:02
@vdbergh vdbergh marked this pull request as ready for review May 21, 2024 13:17
@vdbergh vdbergh marked this pull request as draft May 21, 2024 16:32
@vdbergh
Copy link
Contributor Author

vdbergh commented May 21, 2024

Drafting again since I found a better way to code the scheduler using threading.Event.

@vdbergh vdbergh marked this pull request as ready for review May 21, 2024 18:40
@vdbergh vdbergh marked this pull request as draft May 21, 2024 22:24
@vdbergh vdbergh marked this pull request as ready for review May 22, 2024 02:05
@vdbergh vdbergh marked this pull request as draft May 22, 2024 02:34
@vdbergh vdbergh marked this pull request as ready for review May 22, 2024 02:40
@vdbergh vdbergh force-pushed the scheduler branch 11 times, most recently from 693b952 to 205d681 Compare May 28, 2024 07:57
@vdbergh vdbergh force-pushed the scheduler branch 2 times, most recently from 592b6cc to 7e80d4a Compare May 30, 2024 19:26
@ppigazzini ppigazzini added enhancement server server side changes labels May 31, 2024
@ppigazzini
Copy link
Collaborator

Running on PROD right now after some testing on DEV.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 31, 2024

So far no event log messages... A fleet visit would be the ultimate stress test...

Do the following messages appear in the server log?

 print(f"Validate_random_run: validated aggregated data in cache run {run_id}...", flush=True)

Just to verify that the scheduler is working...

@vondele
Copy link
Member

vondele commented May 31, 2024

yes, they are in the log.
Some more load later today/tomorrow, i would guess.

The reason for this PR is that before introducing more dynamic
updates we would like to verify that the dynamic updates of
run["workers"] and run["cores"] introduced in
official-stockfish#2010
are bug free.

In order to facilitate the use of periodic timers in Fishtest
we have introduced a new scheduler in utils.py which may be
interesting in its own right.

Currently the timers defined are the cache flush timer and the
timer introduced in this PR.
@vdbergh
Copy link
Contributor Author

vdbergh commented May 31, 2024

Still no event log messages....

Last push is a rebase.

@ppigazzini
Copy link
Collaborator

$ sudo journalctl -u fishtest@6543 --since "120 minutes ago" | grep random -A1
May 31 12:53:30 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6657f5c86b0e318cefa8d7f3...
May 31 12:54:30 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 12:56:22 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 665937e16b0e318cefa905e4...
May 31 12:56:34 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 12:59:19 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6658fe7b6b0e318cefa90232...
May 31 12:59:37 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:02:11 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6659a8adea624d64ea5f3206...
May 31 13:02:20 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:05:16 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 66568064a86388d5e27dd93e...
May 31 13:05:17 tests.stockfishchess.org pserve[8772]: dead task: run: https://tests.stockfishchess.org/tests/view/665949bc6b0e318cefa9064b task_id: 166 worker: vdv-7cores-3cab8986-91b7*
--
May 31 13:08:15 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 66591f436b0e318cefa902ad...
May 31 13:08:15 tests.stockfishchess.org pserve[8772]: Update_task: task 665946f26b0e318cefa90639/606 is not active
--
May 31 13:11:12 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 66593e196b0e318cefa90608...
May 31 13:11:15 tests.stockfishchess.org pserve[8772]: Update_task: task 6659bf52a27a71165d2475ae/291 is not active
--
May 31 13:14:16 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6657f6e46b0e318cefa8d7f9...
May 31 13:14:27 tests.stockfishchess.org pserve[8772]: Update_task: task 665946f26b0e318cefa90639/575 is not active
--
May 31 13:17:17 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 664e5741928b1fb18de4e345...
May 31 13:17:29 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:20:22 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6658d1aa6b0e318cefa90122...
May 31 13:20:33 tests.stockfishchess.org pserve[8772]: Update_task: task 665946f26b0e318cefa90639/587 is not active
--
May 31 13:23:22 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6659bf52a27a71165d2475ae...
May 31 13:24:22 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:26:22 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 665945666b0e318cefa90631...
May 31 13:27:14 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:29:28 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 66599441e4856a090bda09c8...
May 31 13:29:42 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:32:28 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6657f6e46b0e318cefa8d7f9...
May 31 13:32:46 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:35:24 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6659bd76a27a71165d2475a2...
May 31 13:35:56 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:38:30 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6640edea70e85e1f2ce321f9...
May 31 13:39:01 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue
--
May 31 13:41:27 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6659a8d2ea624d64ea5f320e...
May 31 13:42:04 tests.stockfishchess.org pserve[8772]: Failed_task: failure for: https://tests.stockfishchess.org/tests/view/6659ca56f426908fcc6b5fc4, task_id: 28, worker: vdv-7cores-d06e5604-6ad4*, reason: 'Get request to https://api.github.com/repos/FauziAkram/Stockfish/zipball/4fd04aa13a58c65cd5b63eabab6f936aee0b0f42 failed'
--
May 31 13:44:29 tests.stockfishchess.org pserve[8772]: Validate_random_run: validated aggregated data in cache run 6659481b6b0e318cefa9063e...
May 31 13:45:01 tests.stockfishchess.org pserve[8772]: Request_task: refresh queue

@vdbergh
Copy link
Contributor Author

vdbergh commented May 31, 2024

The validations nicely happen every 3 minutes (+- 5% jitter).

@vondele
Copy link
Member

vondele commented May 31, 2024

so, no errors in the logs as far as I can see.

Copy link
Collaborator

@ppigazzini ppigazzini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on PROD

@ppigazzini ppigazzini merged commit 4b461fe into official-stockfish:master May 31, 2024
19 checks passed
@ppigazzini
Copy link
Collaborator

Thank you @vdbergh

@vondele
Copy link
Member

vondele commented Jun 1, 2024

Meanwhile logs show quite a few validation errors (passed through cut -c16- | sort | uniq -c ):

      1  tests.stockfishchess.org pserve[10329]: The run object 655a9d74136acbc57353fb25 does not validate: run (value:{'_id': ObjectId('655a9d74136acbc57353fb25'), 'args': {'base_tag': 'master', 'new_tag': 'probcutDA6...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 73
      1  tests.stockfishchess.org pserve[12453]: The run object 6407ee412644b62c3393db12 does not validate: run (value:{'_id': ObjectId('6407ee412644b62c3393db12'), 'args': {'base_tag': 'cdec775a-c1', 'new_tag': 'cdec7...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 39
      5  tests.stockfishchess.org pserve[12453]: The run object 646f085268661bfd9843262e does not validate: run (value:{'_id': ObjectId('646f085268661bfd9843262e'), 'args': {'base_tag': 'master', 'new_tag': 'df0fb847-3...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 184
      4  tests.stockfishchess.org pserve[12453]: The run object 64992b43dc7002ce609cfd20 does not validate: run (value:{'_id': ObjectId('64992b43dc7002ce609cfd20'), 'args': {'base_tag': 'master', 'new_tag': 'nn-1b951f8...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 1023
      2  tests.stockfishchess.org pserve[12453]: The run object 655a9d74136acbc57353fb25 does not validate: run (value:{'_id': ObjectId('655a9d74136acbc57353fb25'), 'args': {'base_tag': 'master', 'new_tag': 'probcutDA6...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 73
      3  tests.stockfishchess.org pserve[12453]: The run object 655ad4e9136acbc57354001f does not validate: run (value:{'_id': ObjectId('655ad4e9136acbc57354001f'), 'args': {'base_tag': 'master', 'new_tag': 'ssgc', 'ba...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 223
      2  tests.stockfishchess.org pserve[12453]: The run object 655b6ad4136acbc573540d1e does not validate: run (value:{'_id': ObjectId('655b6ad4136acbc573540d1e'), 'args': {'base_tag': 'master', 'new_tag': 'recapture_...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 189
      2  tests.stockfishchess.org pserve[12453]: The run object 655b96bb136acbc5735410da does not validate: run (value:{'_id': ObjectId('655b96bb136acbc5735410da'), 'args': {'base_tag': 'master', 'new_tag': 'singular_o...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 167
      3  tests.stockfishchess.org pserve[12453]: The run object 655c5ac9136acbc5735420ce does not validate: run (value:{'_id': ObjectId('655c5ac9136acbc5735420ce'), 'args': {'base_tag': 'master', 'new_tag': 'ext_fail_p...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 149
      2  tests.stockfishchess.org pserve[12453]: The run object 655cc2b7136acbc5735428ca does not validate: run (value:{'_id': ObjectId('655cc2b7136acbc5735428ca'), 'args': {'base_tag': 'master', 'new_tag': 'singular_o...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 376
      5  tests.stockfishchess.org pserve[12453]: The run object 655d21c7136acbc5735431a6 does not validate: run (value:{'_id': ObjectId('655d21c7136acbc5735431a6'), 'args': {'base_tag': 'master', 'new_tag': 'useStatice...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 162
      6  tests.stockfishchess.org pserve[12453]: The run object 655d609f136acbc573543710 does not validate: run (value:{'_id': ObjectId('655d609f136acbc573543710'), 'args': {'base_tag': 'master', 'new_tag': 'b4e9ee72_e...[TRUNCATED]...}) is not of type 'cores_must_match': Cores mismatch. Cores from tasks: 0. Cores from run: 38

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement server server side changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants