
Test task should use workers that were busy when the test task starts but later become idle #3955

Closed
adammurdoch opened this issue Jan 4, 2018 · 19 comments

@adammurdoch
Member

Given the simplistic way that the test task assigns tests to test workers during test discovery, the test task will effectively ignore workers that were busy (or recently idle) during test discovery. If one of these workers becomes idle after test discovery has completed, it will not run any tests even though it could do so.

For example, I see this pretty much every time I run integration tests on the gradle/gradle build for more than one project. When one test suite is fast and the other is slow, the slow test suite can end up using just one worker even after the fast test suite has completed.

Expected Behavior

The test task should use up to the specified maximum number of workers, picking them up as they become idle at any point during the test task's execution.

Current Behavior

The test task can get stuck using a small number of workers while other workers are idle.

Steps to Reproduce (for bugs)

Running `./gradlew langNat:check ideNat:check` in the gradle/gradle build often leads to this behaviour.

@ldaley
Member

ldaley commented Jan 4, 2018

Do we now (conceptually) use the worker pool? When did that happen?

@oehme
Contributor

oehme commented Jan 4, 2018

Depends on what you mean by "the worker pool". Tests have always respected "max workers". There is no single "worker pool" in terms of actual processes. Each worker process can only be reused for the same kind of work (i.e. a compiler worker can only compile). So you can have many more processes than "max workers", but only "max workers" of them will be busy at any given time. Tests also use a different kind of worker (use-once) than all other kinds of work (e.g. compilation which is reusable).
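
For illustration, here's roughly how those knobs interact in a build (a hedged Kotlin DSL sketch; the task selection and numbers are just examples, not a recommendation):

```kotlin
// maxParallelForks asks for up to 4 test worker processes for this task,
// but the build-wide --max-workers limit caps how many workers (of any
// kind) can be busy at the same time.
tasks.withType<Test>().configureEach {
    maxParallelForks = 4
}
// e.g. `./gradlew test --max-workers=2` -> at most 2 workers busy at once
```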

@ldaley
Member

ldaley commented Jan 4, 2018

> Tests have always respected "max workers".

Pretty sure that's not the case. They may have respected it for some time, but it wasn't always like that.

At least some time in the past, running two tests tasks concurrently with maxParallelForks = 4 would result in 8 test execution “workers”, regardless of --max-workers. I didn't realise that had changed.

> There is no single "worker pool" in terms of actual processes.

Hence the (conceptually).

@oehme
Contributor

oehme commented Jan 4, 2018

I think there was a bug where you could end up with 2*max-workers some time ago, which Gary and Paul fixed in late 2016/early 2017. So it's been reliable since at least Gradle 3.4.

@ldaley
Copy link
Member

ldaley commented Jan 5, 2018

9445ef4

So this change happened in Mar 2017 and shipped in Gradle 3.5.

@oehme
Contributor

oehme commented Jan 5, 2018

That was a different bug where Gradle would deadlock when maxParallelForks was larger than the number of available workers.

@eskatos
Member

eskatos commented Jan 5, 2018

The Test task started honoring max-workers in 3.0: https://docs.gradle.org/3.0/release-notes.html#parallel-task-execution-improvements. The deadlock you mention was then fixed in 3.5.

I think what Adam describes is something else: test dispatching limits the number of forked test workers when the max-workers cap is hit, but it doesn't fork new test workers later when slots become available. Or something along those lines.

@ldaley
Member

ldaley commented Jan 7, 2018

If we were going to do this, it would make sense to consider the following related improvements:

  1. Test workers should always be busy as long as there are pending tests (e.g. balance tests across workers on demand; see the sketch after this list)
  2. Test workers with no remaining work should return immediately, freeing a worker for the rest of the build
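
A minimal sketch of that on-demand model in plain Kotlin (hypothetical code, not Gradle's actual scheduler; `runWorker` and `execute` are made-up names):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Each test worker pulls the next pending test class from a shared queue:
// workers stay busy while anything is pending (point 1) and return the
// moment the queue drains (point 2), freeing their slot for the rest of
// the build.
fun runWorker(pending: ConcurrentLinkedQueue<String>, execute: (String) -> Unit) {
    while (true) {
        val testClass = pending.poll() ?: return  // queue empty: return immediately
        execute(testClass)
    }
}
```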

@oehme
Contributor

oehme commented Jan 8, 2018

These points would "just work" if we let the scheduling threads block for the worker to complete the test before sending over the next one. That might come at the expense of slower unit tests though, where the turnaround between daemon and worker is in the same ballpark as executing the test itself.
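
From the scheduling side, that blocking hand-off would look something like this (again a hypothetical sketch, not Gradle's code):

```kotlin
// The scheduling thread sends one test class at a time and waits for the
// result before dispatching the next. Work stays perfectly balanced, but
// every hand-off costs a daemon/worker round trip, which can dominate
// when the tests themselves are fast.
fun dispatchBlocking(tests: List<String>, sendAndAwait: (String) -> Unit) {
    for (testClass in tests) {
        sendAndAwait(testClass)
    }
}
```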

@ldaley
Member

ldaley commented Jan 8, 2018

For the absolute best of both worlds we'd probably want to ship work in small batches and/or add work stealing. That could reduce the bookkeeping overhead while still avoiding imbalance when things aren't actually balanced.
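
As an illustration of the batching half of that idea (hypothetical; the batch size and names are made up):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// A worker pulls a small batch of test classes instead of a single one,
// amortising the round-trip cost across several tests while still
// rebalancing whenever a worker runs dry.
fun pullBatch(pending: ConcurrentLinkedQueue<String>, batchSize: Int = 5): List<String> {
    val batch = mutableListOf<String>()
    while (batch.size < batchSize) {
        batch += pending.poll() ?: break  // queue drained mid-batch
    }
    return batch
}
```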

@bmuschko bmuschko added in:core DO NOT USE from:member a:feature A new functionality labels Jan 8, 2018
@tomwidmer

@ldaley Your point 1 is #2669 that I raised last year. Work stealing sounds good if you know how to do it - I assume this means that when a test worker becomes idle, it would steal some tests (or just 1 test class at a time) from another test worker. My company suffers from this issue, so they might be willing to allocate a few hours of my time to help out.

@ldaley
Member

ldaley commented Jan 12, 2018

Hi @tomwidmer,

I can't see a solution that is not complex. As @oehme points out here, the simple approach of just pulling one class at a time is going to reduce throughput, and the reduction is likely to be non-negligible. So, something more complex is needed, where test workers coordinate only when they have run out of work. A viable solution is probably going to take a good many hours to implement.

@oehme
Contributor

oehme commented Jan 12, 2018

You could certainly try the simplistic blocking approach and measure it. We have performance tests for executing fast-running tests, so any regression would quickly be caught.

@adammurdoch
Member Author

I'd be keen for this issue not to turn into a "solve the universe" type issue, so that we can address the specific problem in the title in a simple way without necessarily having to fix everything.

For example, we might continue to assume that we'll have a fixed number of buckets of tests and assign tests to these buckets up front. Each available worker can grab a bucket and work on it exclusively. When a new worker becomes available, it can grab an unassigned bucket and work on it. When a worker completes a bucket, it can grab another bucket or exit. We might use n * max-parallel-forks buckets, or a maximum number of test requests per bucket, or some other very simple strategy to assign tests to buckets.

We can add work stealing later, if it turns out that fine-tuning this simple strategy isn't enough.
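
To make that concrete, a hedged sketch of the bucket strategy (hypothetical names; how tests are partitioned into buckets is left out):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Tests are partitioned up front into more buckets than workers. Any
// worker that becomes available, even late in the run, claims a whole
// unassigned bucket and works on it exclusively; when no buckets remain,
// it exits.
fun claimAndRunBuckets(buckets: ConcurrentLinkedQueue<List<String>>, execute: (String) -> Unit) {
    while (true) {
        val bucket = buckets.poll() ?: return  // no unassigned bucket: exit
        bucket.forEach(execute)
    }
}
```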

@tomwidmer

@adammurdoch I agree something simple would be good. #2669 has been raised in JIRA and on forums in the past (as I've found out when trying to work out how to fix it), so I think it's important to address that issue as well.

@ldaley As long as the behaviour is opt-in (in the test task config?), I don't think scheduling overhead need be a concern for an initial implementation. Also, whole test classes will be scheduled, not individual tests, so there's an element of batching already built into it. Work stealing or something more complex could wait.

@ldaley
Member

ldaley commented Jan 29, 2018

@tomwidmer it shouldn't be opt-in; that's unnecessary configuration. The initial implementation must not be non-negligibly worse than the current behaviour.

That's not to say that the proposed initial solution won't be good enough.

@tcfurrer

Any idea when this issue might get some development action?

I ask because this issue can be a real disappointment for developers who are migrating to Gradle specifically to benefit from its excellent parallelism.

Especially as both the number of workers and test class runtime increase, the unnecessary increase in wall clock time due to this issue can become embarrassingly bad. My daily work server has 32 threads, and some individual tests take hours, so this issue causes a real productivity loss for my team. I guess with such long-running tests, I could play around with forkEvery to get some benefit (sketched below)... but that shouldn't be required.
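
For reference, that workaround would look something like this in the Kotlin DSL (a hedged sketch; the values are illustrative only, and whether it actually helps with this issue is untested):

```kotlin
// forkEvery restarts the test worker JVM after it has run N test classes;
// the idea is that shorter-lived workers give the scheduler more chances
// to spread the remaining classes around, at the cost of JVM startup time.
tasks.withType<Test>().configureEach {
    maxParallelForks = 32
    setForkEvery(1L)
}
```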

@stale

stale bot commented Jul 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. Given the limited bandwidth of the team, it will be automatically closed if no further activity occurs. If you're interested in how we try to keep the backlog in a healthy state, please read our blog post on how we refine our backlog. If you feel this is something you could contribute, please have a look at our Contributor Guide. Thank you for your contribution.

@stale stale bot added the stale label Jul 13, 2020
@stale

stale bot commented Aug 3, 2020

This issue has been automatically closed due to inactivity. If you can reproduce this on a recent version of Gradle or if you have a good use case for this feature, please feel free to reopen the issue with steps to reproduce, a quick explanation of your use case or a high-quality pull request.

@stale stale bot closed this as completed Aug 3, 2020
@wolfs wolfs closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 16, 2022