
Test task should use workers that were busy when the test task starts but later become idle #3955

Closed
adammurdoch opened this issue Jan 4, 2018 · 19 comments

@adammurdoch
Member

Given the simplistic way that the test task assigns tests to test workers during test discovery, the test task will effectively ignore workers that were busy (or recently idle) during test discovery. If one of these workers becomes idle after test discovery has completed, it will not run any tests even though it could do so.

For example, I see this pretty much every time I run integration tests on the gradle/gradle build for more than one project. When one test suite is fast and the other is slow, the slow test suite can end up using just one worker even after the fast test suite has completed.

Expected Behavior

The test task should use up to the specified maximum number of workers, picking them up as they become idle at any point during the test task's execution.

Current Behavior

The test task can get stuck using a small number of workers while other workers are idle.

Steps to Reproduce (for bugs)

Running `./gradlew langNat:check ideNat:check` in the gradle/gradle build often leads to this behaviour.

@ldaley
Member

ldaley commented Jan 4, 2018

Do we now (conceptually) use the worker pool? When did that happen?

@oehme
Contributor

oehme commented Jan 4, 2018

Depends on what you mean by "the worker pool". Tests have always respected "max workers". There is no single "worker pool" in terms of actual processes. Each worker process can only be reused for the same kind of work (i.e. a compiler worker can only compile). So you can have many more processes than "max workers", but only "max workers" of them will be busy at any given time. Tests also use a different kind of worker (use-once) than all other kinds of work (e.g. compilation which is reusable).
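
For illustration, here's roughly how those knobs interact in a build (a hedged Kotlin DSL sketch; the task selection and numbers are just examples, not a recommendation):

```kotlin
// maxParallelForks asks for up to 4 test worker processes for this task,
// but the build-wide --max-workers limit caps how many workers (of any
// kind) can be busy at the same time.
tasks.withType<Test>().configureEach {
    maxParallelForks = 4
}
// e.g. `./gradlew test --max-workers=2` -> at most 2 workers busy at once
```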

@ldaley
Member

ldaley commented Jan 4, 2018

> Tests have always respected "max workers".

Pretty sure that's not the case. They may have respected it for some time, but it wasn't always like that.

At least some time in the past, running two tests tasks concurrently with maxParallelForks = 4 would result in 8 test execution “workers”, regardless of --max-workers. I didn't realise that had changed.

> There is no single "worker pool" in terms of actual processes.

Hence the (conceptually).

@oehme
Contributor

oehme commented Jan 4, 2018

I think there was a bug where you could end up with 2*max-workers some time ago, which Gary and Paul fixed in late 2016/early 2017. So it's been reliable since at least Gradle 3.4.

@ldaley
Copy link
Member

ldaley commented Jan 5, 2018

9445ef4

So this change happened in Mar 2017 and shipped in Gradle 3.5.

@oehme
Contributor

oehme commented Jan 5, 2018

That was a different bug where Gradle would deadlock when maxParallelForks was larger than the number of available workers.

@eskatos
Member

eskatos commented Jan 5, 2018

The Test task started honoring max-workers in 3.0: https://docs.gradle.org/3.0/release-notes.html#parallel-task-execution-improvements. The deadlock you mention was then fixed in 3.5.

I think what Adam describes is something else: test dispatching limits the number of forked test workers when the max-workers cap is hit, but it doesn't fork new test workers later when slots become available. Or something along those lines.

@ldaley
Member

ldaley commented Jan 7, 2018

If we were going to do this, it would make sense to consider the following related improvements:

  1. Test workers should always be busy as long as there are pending tests (e.g. balance tests across workers on demand; see the sketch after this list)
  2. Test workers with no remaining work should return immediately, freeing a worker for the rest of the build
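
A minimal sketch of that on-demand model in plain Kotlin (hypothetical code, not Gradle's actual scheduler; `runWorker` and `execute` are made-up names):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Each test worker pulls the next pending test class from a shared queue:
// workers stay busy while anything is pending (point 1) and return the
// moment the queue drains (point 2), freeing their slot for the rest of
// the build.
fun runWorker(pending: ConcurrentLinkedQueue<String>, execute: (String) -> Unit) {
    while (true) {
        val testClass = pending.poll() ?: return  // queue empty: return immediately
        execute(testClass)
    }
}
```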

@oehme
Contributor

oehme commented Jan 8, 2018

These points would "just work" if we let the scheduling threads block for the worker to complete the test before sending over the next one. That might come at the expense of slower unit tests though, where the turnaround between daemon and worker is in the same ballpark as executing the test itself.
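
From the scheduling side, that blocking hand-off would look something like this (again a hypothetical sketch, not Gradle's code):

```kotlin
// The scheduling thread sends one test class at a time and waits for the
// result before dispatching the next. Work stays perfectly balanced, but
// every hand-off costs a daemon/worker round trip, which can dominate
// when the tests themselves are fast.
fun dispatchBlocking(tests: List<String>, sendAndAwait: (String) -> Unit) {
    for (testClass in tests) {
        sendAndAwait(testClass)
    }
}
```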

@ldaley
Member

ldaley commented Jan 8, 2018

For the absolute best of both worlds we'd probably want to ship work in small batches and/or add work stealing. That could reduce the bookkeeping overhead while still avoiding imbalance when things aren't actually balanced.
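
As an illustration of the batching half of that idea (hypothetical; the batch size and names are made up):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// A worker pulls a small batch of test classes instead of a single one,
// amortising the round-trip cost across several tests while still
// rebalancing whenever a worker runs dry.
fun pullBatch(pending: ConcurrentLinkedQueue<String>, batchSize: Int = 5): List<String> {
    val batch = mutableListOf<String>()
    while (batch.size < batchSize) {
        batch += pending.poll() ?: break  // queue drained mid-batch
    }
    return batch
}
```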

@bmuschko bmuschko added in:core DO NOT USE from:member a:feature A new functionality labels Jan 8, 2018
@tomwidmer

@ldaley Your point 1 is #2669 that I raised last year. Work stealing sounds good if you know how to do it - I assume this means that when a test worker becomes idle, it would steal some tests (or just 1 test class at a time) from another test worker. My company suffers from this issue, so they might be willing to allocate a few hours of my time to help out.

@ldaley
Member

ldaley commented Jan 12, 2018

Hi @tomwidmer,

I can't see a solution that is not complex. As @oehme points out here, the simple approach of just pulling one class at a time is going to reduce throughput, and the reduction is likely to be non-negligible. So, something more complex is needed, where test workers coordinate only when they have run out of work. A viable solution is probably going to take a good many hours to implement.

@oehme
Contributor

oehme commented Jan 12, 2018

You could certainly try the simplistic blocking approach and measure it. We have performance tests for executing fast-running tests, so any regression would quickly be caught.

@adammurdoch
Member Author

I'd be keen for this issue not to turn into a "solve the universe" type issue, so that we can address the specific problem in the title in a simple way without necessarily having to fix everything.

For example, we might continue to assume that we'll have a fixed number of buckets of tests and assign tests to these buckets up front. Each available worker can grab a bucket and work on it exclusively. When a new worker becomes available, it can grab an unassigned bucket and work on it. When a worker completes a bucket, it can grab another bucket or exit. We might use n * max-parallel-forks buckets, or a maximum number of test requests per bucket, or some other very simple strategy to assign tests to buckets.

We can add work stealing later, if it turns out that fine-tuning this simple strategy isn't enough.
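
To make that concrete, a hedged sketch of the bucket strategy (hypothetical names; how tests are partitioned into buckets is left out):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Tests are partitioned up front into more buckets than workers. Any
// worker that becomes available, even late in the run, claims a whole
// unassigned bucket and works on it exclusively; when no buckets remain,
// it exits.
fun claimAndRunBuckets(buckets: ConcurrentLinkedQueue<List<String>>, execute: (String) -> Unit) {
    while (true) {
        val bucket = buckets.poll() ?: return  // no unassigned bucket: exit
        bucket.forEach(execute)
    }
}
```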

@tomwidmer

@adammurdoch I agree something simple would be good. #2669 has been raised in JIRA and on forums in the past (as I've found out when trying to work out how to fix it), so I think it's important to address that issue as well.

@ldaley As long as the behaviour is opt-in (in the test task config?), I don't think scheduling overhead need be a concern for an initial implementation. Also, whole test classes will be scheduled, not individual tests, so there's an element of batching already built into it. Work stealing or something more complex could wait.

@ldaley
Member

ldaley commented Jan 29, 2018

@tomwidmer it shouldn't be opt-in; that's unnecessary configuration. The initial implementation must not be non-negligibly worse than the current behaviour.

That's not to say that the proposed initial solution won't be good enough.

@tcfurrer

Any idea when this issue might get some development action?

I ask because this issue can be a real disappointment for developers who are migrating to Gradle specifically to benefit from its excellent parallelism.

Especially as both the number of workers and test class runtime increase, the unnecessary increase in wall clock time due to this issue can become embarrassingly bad. My daily work server has 32 threads, and some individual tests take hours, so this issue causes a real productivity loss for my team. I guess with such long-running tests, I could play around with forkEvery to get some benefit (sketched below)... but that shouldn't be required.
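
For reference, that workaround would look something like this in the Kotlin DSL (a hedged sketch; the values are illustrative only, and whether it actually helps with this issue is untested):

```kotlin
// forkEvery restarts the test worker JVM after it has run N test classes;
// the idea is that shorter-lived workers give the scheduler more chances
// to spread the remaining classes around, at the cost of JVM startup time.
tasks.withType<Test>().configureEach {
    maxParallelForks = 32
    setForkEvery(1L)
}
```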

@stale

stale bot commented Jul 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. Given the limited bandwidth of the team, it will be automatically closed if no further activity occurs. If you're interested in how we try to keep the backlog in a healthy state, please read our blog post on how we refine our backlog. If you feel this is something you could contribute, please have a look at our Contributor Guide. Thank you for your contribution.

@stale stale bot added the stale label Jul 13, 2020
@stale

stale bot commented Aug 3, 2020

This issue has been automatically closed due to inactivity. If you can reproduce this on a recent version of Gradle or if you have a good use case for this feature, please feel free to reopen the issue with steps to reproduce, a quick explanation of your use case or a high-quality pull request.

@stale stale bot closed this as completed Aug 3, 2020
@wolfs wolfs closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 16, 2022