Eio.Workpool #584
Conversation
I just made a few tweaks: fixed a race condition and removed the Core-style …
(force-pushed from 16d21de to f0d31ae)
I'm marking it as Ready For Review since I've fixed the race condition (using …
(force-pushed from 9706699 to 4afaf1d)
Is there a way to test this under dscheck? I looked over in tests/dscheck and I didn't see any use of domains or …
Is there a way to test this under dscheck?
This isn't doing any lock-free stuff itself, so dscheck isn't necessary. Instead, you can use the mock domain manager to make the tests deterministic. That currently lives in network.md, but it's generally useful and should be moved to the eio.mock library.
if Atomic.compare_and_set instance.is_terminating false true
then (
  (* Instruct workers to shutdown *)
  Promise.resolve w1 (Quit { atomic = Atomic.make 1; target = instance.domain_count; all_done = w2 });
This seems a bit over-complicated. I was expecting it to write instance.domain_count quit messages to the stream. That would avoid the need for a second channel for the quit message and the n_any.
The reason it's this way is due to a combination of factors:
- I don't want the Quit messages to have to wait for the other enqueued jobs to run and complete ahead of them.
- As much as possible, calling terminate should "immediately" halt starting new jobs.
The obvious solution is to make terminate reject all queued jobs before enqueueing Quit messages, but that doesn't work well with the post-termination background rejection loop, because terminate becomes both a producer and a consumer while the workers are still consumers. It can be made to work, but the end result was more complex and less predictable.
I don't want the Quit messages to have to wait for the other enqueued jobs to run and complete ahead of them.
The workers can still check is_terminating before running a job and reject the job if it's set. I think that would have the same effect.
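Something like this (a rough sketch with hypothetical names, adding a per-job rejection callback):

type job = Run of { run : unit -> unit; reject : unit -> unit } | Quit

type t = {
  stream : job Eio.Stream.t;
  is_terminating : bool Atomic.t;
}

let rec worker t =
  match Eio.Stream.take t.stream with
  | Quit -> ()
  | Run { run; reject } ->
    (* Check the flag just before running, so jobs already queued when
       termination starts are rejected rather than executed. *)
    (if Atomic.get t.is_terminating then reject () else run ());
    worker t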
After trying it out, I'm remembering why it's not that way.
It's minor but I think it makes a difference.
By using a second channel, we're able to start the background rejection loop immediately. Otherwise, if all workers are fully occupied at the time terminate is called, we have to wait for a worker to be available to start rejecting jobs.
An alternative I've also explored is to immediately (T0) reject all queued jobs before enqueueing n Quit messages (T1), but that leaves jobs enqueued between T0 and T1 to hang until the background rejection job starts (which can only happen after all workers have exited, so their Quit messages don't get dropped). This inconsistent behavior can be patched over by checking is_terminating (a bool Atomic.t) when submitting a new job, but I'm trying to avoid all unnecessary thread coordination in the hot path...
I think we have to do that anyway, otherwise this could happen:
- Client checks is_terminating and sees it's still OK.
- Pool gets terminated; all workers finish; switch ends.
- Client adds job to queue.
It's not clear what behaviour we really want for terminate though. e.g. why do we want to reject jobs that were submitted before terminate was called? Then there's no way for a client to behave "correctly" (so that its jobs never get rejected) in the graceful shutdown case.
It would probably be helpful to have an example of a program that needs support for this kind of graceful shutdown (rather than just finishing the switch, which I suspect will cover most real uses). Or possibly we should make an initial version without terminate and add it later if/when it's needed.
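For illustration, the switch-scoped style being suggested might look roughly like this; the Workpool.create and Workpool.submit signatures here are guesses, not the PR's actual API:

let () =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  (* The pool's lifetime is tied to [sw]; no explicit terminate call. *)
  let pool =
    Workpool.create ~sw ~domain_count:2 ~domain_concurrency:1
      (Eio.Stdenv.domain_mgr env)
  in
  let result = Workpool.submit pool (fun () -> 40 + 2) in
  ignore result
  (* When [Switch.run] returns, the workers shut down with it. *)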
let is_terminating { terminating = p, _; _ } = Promise.is_resolved p

let is_terminated { terminated = p, _; _ } = Promise.is_resolved p
Do we need this? The caller can just wait for their switch to finish.
It's there to allow for uncommon use cases where terminate is called before the switch release. My original idea for these 2 functions was to allow for rare use cases without forcing those users to keep track of all this through side channels.
But now I'm starting to think maybe having these 2 functions is a mistake because without them the user would be encouraged to create finer lifetimes instead of reusing the same (overly) long-lived Switch.
What do you think?
Edit: I'm using them in tests and would like to continue doing so. Maybe they should go into a Private submodule?
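Roughly this shape, say (a hypothetical sketch of the .mli, not the PR's actual interface):

(* workpool.mli (hypothetical shape) *)
type t

(* ... public API: create, submit, terminate, ... *)

(* Test-only introspection, kept out of the documented public surface. *)
module Private : sig
  val is_terminating : t -> bool
  val is_terminated : t -> bool
end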
I've moved it out in #610.
Thanks for the thorough review. I've either made the requested change or left a comment/question above. I'm now working on using a mock clock in tests. Edit: The tests are now 100% deterministic! Mock clocks, mock domains, mock backend.
Fiber.yield ();
Fiber.yield ();
Fiber.yield ();
Fiber.yield ();
Fiber.yield ();
Fiber.yield ();
Fiber.yield ();
Is there a "yield until idle" function? If not, this is the precise number of yields needed...
At the moment, if, for example, the implementation of Stream.take were to change, it could require an additional yield. That would be quite confusing for the person making the change! Obviously I could just yield 50 times to be safe, but I'm sure there's a better way hidden somewhere deep in the internals of Eio, right?
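Failing that, a crude stop-gap would be to factor the guess into one hypothetical helper (not something Eio provides), so only one place needs updating:

(* Yield a generous, fixed number of times so the test does not depend
   on the exact number of scheduling steps an operation takes. *)
let yield_until_settled ?(rounds = 50) () =
  for _ = 1 to rounds do
    Eio.Fiber.yield ()
  done

The test above would then call yield_until_settled () once instead of a hand-counted run of Fiber.yield ().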
There isn't. It needs support in Eio_mock.Backend. I was thinking of having a Backend.run_full that provides an environment with a mock clock that auto-advances whenever the run-queue is empty.
e.g.
Eio_mock.Backend.run @@ fun env ->
let clock = env#clock in
...
That would be fantastic to have! While reading other tests to see how mock clocks were used, I saw multiple opportunities where it could have been used.
We also have mixed feelings about those explicit yields in the tests. A purist argument would be that it forces the writer to clearly express the semantics of a function... but in practice it is quite a headache.
(force-pushed from d2ae2a6 to e06618e)
Closing in favor of #639.
For your consideration, this PR is an invitation to discuss the user-facing API and the implementation of an Eio.Workpool module.
Let's start with some general "guiding principles" that underpin the current design. I'll gladly make changes to the design if the maintainers invalidate any of these principles.
Principles
P1: A workpool should not pay attention to recommended_domain_count. Users know their own workloads (we don't); they will tell us how many domains to use.
P2: We need to support m concurrent jobs per domain. An efficient design is one that fully utilizes each core. For CPU-bound workloads, that means 1 job per thread; for IO-bound workloads it's a lot more. There are also hybrid workloads, with CPU-demanding processing interspersed between IO calls. For those, the right number is 1/p, where p is the proportion of each job that is CPU-bound (see the small worked example after these principles).
P3: The user knows the right moment to start and shut down their workpools. They want the threads to be ready to go by the time the jobs come flying in. Having to spawn threads on the fly (lazily) should be avoided.
Tests
We've got them!
They test durations using monotonic clocks to validate that certain tasks execute concurrently. Despite it being an obvious race condition, the tests are solid and consistent, not flaky at all. I hope they'll behave that way in CI testing too.
Edit: The tests are now fully deterministic, using mock clocks, mock domains, and the mock backend.
Caveats
This PR uses Fiber.n_any, which is only added as part of #587, so it obviously has to wait until the other one is merged.
The code is fairly short and I've added plenty of comments to help the reviewers. I'll trim some comments once it's been reviewed.
Thank you for your time.