-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test task should use workers that were busy when the test task starts but later become idle #3955
Comments
Do we now (conceptually) use the worker pool? When did that happen? |
Depends on what you mean by "the worker pool". Tests have always respected "max workers". There is no single "worker pool" in terms of actual processes. Each worker process can only be reused for the same kind of work (i.e. a compiler worker can only compile). So you can have many more processes than "max workers", but only "max workers" of them will be busy at any given time. Tests also use a different kind of worker (use-once) than all other kinds of work (e.g. compilation which is reusable). |
Pretty sure that's not the case. They may have respected it for some time, but it wasn't always like that. At least some time in the past, running two tests tasks concurrently with
Hence the |
I think there was a bug where you could end up with 2*max-workers some time ago, which Gary and Paul fixed late 2016/early 2017. So it's been reliable since at least Gradle 3.4 |
So this changed happened Mar 2017 and in Gradle 3.5. |
That was a different bug where Gradle would deadlock when maxParallelForks was larger than the number of available workers. |
The I think what Adam describes is something else: test dispatching limits the number of forked test workers when |
If we were going to do this, it would make sense to consider the following related improvements:
|
These points would "just work" if we let the scheduling threads block for the worker to complete the test before sending over the next one. That might come at the expense of slower unit tests though, where the turnaround between daemon and worker is in the same ballpark as executing the test itself. |
For the absolute best of both worlds we’d probably want to either ship work in small batches, and/or add work stealing. That could reduce the book keeping overhead but avoid imbalance when things aren’t actually balanced. |
@ldaley Your point 1 is #2669 that I raised last year. Work stealing sounds good if you know how to do it - I assume this means that when a test worker becomes idle, it would steal some tests (or just 1 test class at a time) from another test worker. My company suffers from this issue, so they might be willing to allocate a few hours of my time to help out. |
Hi @tomwidmer, I can't see a solution that is not complex. As @oehme points out here, the simple approach of just pulling one class at a time is going to reduce throughput and is likely to be a non negligible reduction. So, something more complex is needed where test workers coordinate only when they have run out of work. A viable solution is probably going to take a good many hours to implement. |
You could certainly try the simplistic blocking approach and measure it. We have performance tests for executing fast-running tests, so any regression would quickly be caught. |
I'd be keen for this issue to not turn into a "solve the universe" type issue, so that we can address the specific problem in the title in a simple way without necessarily having to fix everything. For example, we might continue to assume that we'll have a fixed number of buckets of tests and assign tests to these buckets up front. Each available worker can grab a bucket and work on it exclusively. When a new worker becomes available, it can grab an unassigned bucket and work on it. When a worker completes a bucket, it can grab another bucket or exit. We might use n * max-parallel-forks buckets or a maximum number of test requests per bucket or some other very simple strategy to assign tests to buckets. We can add work stealing later, if it turns out that fine tuning the current strategy isn't enough. |
@adammurdoch I agree something simple would be good. #2669 has been raised in JIRA and on forums in the past (as I've found out when trying to work out how to fix it), so I think it's important to address that issue as well. @ldaley As long as the behaviour is opt-in (in the test task config?), I don't think scheduling overhead need be a concern for an initial implementation. Also, whole test classes will be scheduled, not individual tests, so there's an element of batching already built into it. Work stealing or something more complex could wait. |
@tomwidmer it shouldn't be opt-in; that's unnecessary configuration. The initial implementation has to be not be non-negligibly worse than the current behaviour. That's not a claim that what has been proposed as the initial solution will not be good enough. |
Any idea when this issue might get some development action? I ask because this issue can be a real disappointment for developers who are migrating to Gradle specifically to benefit from its excellent parallelism. Especially as both the number of workers and test class runtime increases, the unnecessary increase in wall clock time due to this issue can become embarrassingly bad. My daily work server has 32 threads, with some individual tests taking hours. So this issue can result in a real productivity loss for my team. I guess with such long running tests, I could play around with forkEvery to get some benefit... but that shouldn't be required. |
This issue has been automatically marked as stale because it has not had recent activity. Given the limited bandwidth of the team, it will be automatically closed if no further activity occurs. If you're interested in how we try to keep the backlog in a healthy state, please read our blog post on how we refine our backlog. If you feel this is something you could contribute, please have a look at our Contributor Guide. Thank you for your contribution. |
This issue has been automatically closed due to inactivity. If you can reproduce this on a recent version of Gradle or if you have a good use case for this feature, please feel free to reopen the issue with steps to reproduce, a quick explanation of your use case or a high-quality pull request. |
Given the simplistic way that the test task assigns tests to test workers during test discovery, the test task will effectively ignore workers that were busy (or recently idle) during test discovery. If one of these worker becomes idle after test discovery has completed, it will not run any tests even though it could do so.
For example, I see this pretty much every time I run integration tests on the
gradle/gradle
build for more than one project. When one test suite is fast and the other is slow, the slow test suite can end up using just one worker even after the fast test suite has completed.Expected Behavior
The test task should use up to the specific maximum number of workers, as they become idle at any time during the test task execution.
Current Behavior
The test task can get stuck using a small number of workers while other workers are idle.
Steps to Reproduce (for bugs)
./gradlew langNat:check ideNat:check
in thegradle/gradle
build often leads to this behaviour.The text was updated successfully, but these errors were encountered: