fixes in async_semaphore for concurrency ID calculation #10436

gshuflin · 2020-07-23T20:18:13Z

Problem

This is an attempt to fix an issue where the concurrency ID passed into processes executed under the BoundedCommandRunner is being limited to a smaller set of IDs than the limit of the BoundedCommandRunner.

We were previously using the (usize) value of available_permits on the inner, locked portion of the AsyncSemaphor data structure that BoundedCommandRunner uses internally, as the value for the concurrency ID. However, this resulted in undesirable behavior where IDs passed to processes would count down from the local concurrency limit value (e.g. 16) to 1, and then never rise above 1 again. Every time a task finished, it would increment available_permits from 0 to 1, and then wake up the next process waiting on the queue, which would decrement available_permits again when it began to run and use 1 as its ID.

There was also an issue with always waking up the task in the front position in the queue. In the case where a large number of processes were all scheduled at once, they might all call wake_by_id on the same waiting task, meaning that any other waiting tasks after the first would wouldn't get awoken until either the first task, or a newly-scheduled task, completed and woke up tasks in the queue beyond the first.

Solution

To solve the concurrency ID issue, we add a new queue data structure to Inner, that keeps track of discrete IDs. When a permit to run is issued (in PermitFuture::poll), an ID is popped from the front of this queue, and when a task completes (in Permit::drop) that ID is pushed onto the back of the queue. So, a limited number of IDs will always be cycling through the AsyncSemaphor as long as it is running, and we won't see the behavior. This change also has the effect of causing IDs to count up from 1 rather than down from the local concurrency limit, which is only an implementation detail, since these IDs are meant to be opaque identifiers.

When an asynchronous task completes, instead of calling wake_by_ref() on the waiter at the front of the queue, use available_permits to index into the waiters queue and wake up a task. This ensures that each completing task will wake up at least one yet-unwoken waiting task (if there are any such waiting tasks), which will cause those waiting tasks to get scheduled immediately and take up a concurrency slot.

Also, a debug logger with the concurrency ID is added in this commit, to facilitate debugging any further concurrency-ID-related issues we may run into.

coveralls · 2020-07-23T21:37:03Z

Coverage remained the same at 0.0% when pulling 0d623e9 on gshuflin:fixing_concurrency_id into ef79642 on pantsbuild:master.

gshuflin · 2020-07-23T23:41:34Z

As an illustration of the changes this patch makes, here is some log output from pants running ./pants --no-pantsd --local-execution-root-dir=/home/gregs/large-partition/ test src/python/pants:: on my system with "info" level logging of the concurrency ID:

Old:
old-code.log

New:
newer-code.log

stuhood

Thanks!

There was also an issue with always waking up the task in the front position in the queue. In the case where a large number of processes were all scheduled at once, they might all call wake_by_id on the same waiting task, meaning that any other waiting tasks after the first would wouldn't get awoken until either the first task, or a newly-scheduled task, completed and woke up tasks in the queue beyond the first.

Good fix. But IMO, we should really, really switch to tokio semaphore. I'm fine with not doing it in this PR as long as you're confident in the tests.

stuhood · 2020-07-24T00:59:37Z

src/rust/engine/async_semaphore/src/lib.rs

@@ -158,7 +166,8 @@ impl Future for PermitFuture {
              // queue, so we don't have to waste time searching for it in the Drop
              // handler.
              self.waiter_id = None;
-              Some(inner.available_permits)
+              let id = inner.available_ids.pop_front().unwrap_or(0); // The unwrap_or case should never happen.


It would probably be less error prone for this to either:

replace the inner.available_permits counter entirely, so that we can be sure that we don't promise a permit unless there is an available_id.

be decoupled from the semaphore and completely independent (such that it generated tokens lazily if it was empty, counting upward or something).

But if you trust that the test is sufficient, it's not a blocker per-se.

src/rust/engine/process_execution/src/lib.rs

gshuflin force-pushed the fixing_concurrency_id branch from 4a3fdd2 to 620e923 Compare July 23, 2020 20:40

gshuflin changed the title ~~WIP testing~~ async_semaphor: wake more than a single waiting process on Drop Jul 23, 2020

gshuflin requested review from tdyas and stuhood July 23, 2020 21:23

gshuflin marked this pull request as ready for review July 23, 2020 21:23

Changes to async_semaphor

0887573

gshuflin force-pushed the fixing_concurrency_id branch from 620e923 to 0887573 Compare July 23, 2020 23:17

gshuflin changed the title ~~async_semaphor: wake more than a single waiting process on Drop~~ fixes in async_semaphor for concurrency ID caculation Jul 23, 2020

stuhood changed the title ~~fixes in async_semaphor for concurrency ID caculation~~ fixes in async_semaphore for concurrency ID caculation Jul 24, 2020

stuhood changed the title ~~fixes in async_semaphore for concurrency ID caculation~~ fixes in async_semaphore for concurrency ID calculation Jul 24, 2020

stuhood approved these changes Jul 24, 2020

View reviewed changes

Spell semaphore correctly

0d623e9

stuhood merged commit 4441dd3 into pantsbuild:master Jul 24, 2020

gshuflin deleted the fixing_concurrency_id branch July 24, 2020 17:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixes in async_semaphore for concurrency ID calculation #10436

fixes in async_semaphore for concurrency ID calculation #10436

gshuflin commented Jul 23, 2020 •

edited

coveralls commented Jul 23, 2020 •

edited

gshuflin commented Jul 23, 2020

stuhood left a comment

stuhood Jul 24, 2020

fixes in async_semaphore for concurrency ID calculation #10436

fixes in async_semaphore for concurrency ID calculation #10436

Conversation

gshuflin commented Jul 23, 2020 • edited

Problem

Solution

coveralls commented Jul 23, 2020 • edited

gshuflin commented Jul 23, 2020

stuhood left a comment

Choose a reason for hiding this comment

stuhood Jul 24, 2020

Choose a reason for hiding this comment

gshuflin commented Jul 23, 2020 •

edited

coveralls commented Jul 23, 2020 •

edited