
Task stuck in 100% CPU loop within FinalizerGroup/DefaultExecutionPlan.finalizePlan on Gradle 7.6+ #25361

Closed · chadlwilson opened this issue Jun 12, 2023 · 6 comments · Fixed by #25468
Labels: a:regression (This used to work) · has:reproducer (Indicates the issue has a confirmed reproducer) · in:scheduler (execution plan, task graph, work lease, project lock)
Milestone: 8.3 RC1


chadlwilson commented Jun 12, 2023

Expected Behavior

./gradlew server:fastUnitTest of https://github.com/gocd/gocd should execute fine on Gradle 7.6+, as it does on Gradle 7.5.1.

This task is defined here, and I cannot identify anything special about it that would suggest a root cause or a workaround.

Current Behavior

On Gradle 7.6, 7.6.1, and 8.1.1 (and possibly other versions), running this task just gets stuck CONFIGURING and never terminates.

Debug logs end with

2023-06-12T21:23:16.180+0800 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationRunner] Completing Build operation 'Calculate task graph'
2023-06-12T21:23:16.180+0800 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationRunner] Build operation 'Calculate task graph' completed
2023-06-12T21:23:24.474+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire shared lock on daemon addresses registry.
2023-06-12T21:23:24.476+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
2023-06-12T21:23:24.476+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
2023-06-12T21:23:24.477+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire shared lock on daemon addresses registry.
2023-06-12T21:23:24.477+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
2023-06-12T21:23:24.477+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
2023-06-12T21:23:34.473+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire shared lock on daemon addresses registry.
2023-06-12T21:23:34.474+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
2023-06-12T21:23:34.474+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
2023-06-12T21:23:34.475+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire shared lock on daemon addresses registry.
2023-06-12T21:23:34.475+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
2023-06-12T21:23:34.475+0800 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
... and so on, every 10 seconds

Profiler samples always look something like this:

[Profiler screenshots: virtually all CPU time spent inside DefaultExecutionPlan.finalizePlan / FinalizerGroup]

Context (optional)

The intent here is just to run a custom test task against an alternate Java source set defined earlier in a custom configuration. I have not found a workaround for this.
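
The shape of the setup, roughly (a minimal sketch only; the directory layout and wiring here are illustrative, not the actual gocd build):

```groovy
// Sketch: a custom Test task wired to an alternate Java source set.
// Paths and wiring are illustrative, not copied from the gocd build.
sourceSets {
    fastUnitTest {
        java.srcDir 'src/test-fast/java'
        compileClasspath += sourceSets.main.output
        runtimeClasspath += sourceSets.main.output
    }
}

tasks.register('fastUnitTest', Test) {
    testClassesDirs = sourceSets.fastUnitTest.output.classesDirs
    classpath = sourceSets.fastUnitTest.runtimeClasspath
}
```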

Steps to Reproduce

Requires JDK 17 on the PATH or locatable via a Gradle toolchain.

git clone https://github.com/gocd/gocd
git checkout 23d5610cee3682e8e6e3375b58a6f240416954ec
./gradlew server:fastUnitTest --tests GoConfigDaoBasicTest
# See that it is stuck

Downgrade and see that it gets past the configuration phase:

./gradlew wrapper --gradle-version=7.5.1 --gradle-distribution-sha256-sum=db9c8211ed63f61f60292c69e80d89196f9eb36665e369e7f00ac4cc841c2219 --distribution-type=ALL
./gradlew server:fastUnitTest --tests GoConfigDaoBasicTest
# See that configuration completes and things start compiling/executing

Gradle version

8.1.1

Gradle version that used to work

7.5.1

Build scan URL (optional)

No response

Your Environment (optional)

  • Apple MacBook M1 Pro
  • macOS Ventura 13.4
  • Java 17.0.7
@chadlwilson added the a:regression and to-triage labels on Jun 12, 2023
chadlwilson (Author) commented:

I note various work in this broad area, e.g. in #21116, #21997, #21542, #21975.

It has broadly the same observable behaviour (livelock) as #20455, but I understand that ticket describes a generic issue and is not really resolvable as such.

chadlwilson added a commit to chadlwilson/gocd that referenced this issue Jun 12, 2023
Can't find a workaround to gradle/gradle#25361 right now.
@eskatos added the in:scheduler, 👋 team-triage, and has:reproducer labels and removed to-triage on Jun 14, 2023
eskatos (Member) commented Jun 14, 2023

Thank you for your interest in Gradle!

This issue needs a decision from the team responsible for that area. They have been informed. Response time may vary.

ghale (Member) commented Jun 15, 2023

I've taken a look at this and, in fact, it is not a hang, just some really inefficient code. It will actually finish calculating the task graph after about an hour and a half (depending on the horsepower of your machine). About 1400 tasks get executed with fastUnitTest, and it appears that we have some logic that runs in at least O(n^2) time (maybe even O(n^3), I'm not sure) relative to the number of tasks.

In any event, we need to optimize this. In the meantime, the following workaround will get you around the problem (the issue stems from the fact that the test tasks are both dependencies of and finalized by the report task):

@@ -679,7 +679,7 @@ task allTestsJunitHtmlReport(type: TestReport) { TestReport thisTask ->
   description = "Run all api tests"

   forAllTask { Test eachTestTask ->
-    testResults.from(eachTestTask)
+    testResults.from { eachTestTask.binaryResultsDirectory.get().asFile }
     eachTestTask.finalizedBy(thisTask)
   }
   destinationDirectory = file("${project.buildDir}/reports/tests/allTests")

ghale (Member) commented Jun 15, 2023

Actually, a simpler way to remove the task dependency is to use:

testResults.from(eachTestTask.binaryResultsDirectory.locationOnly)
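
In context, reusing the forAllTask block from the diff above, that becomes:

```groovy
forAllTask { Test eachTestTask ->
    // locationOnly yields a Provider of the directory's location without
    // carrying the producing task, so no dependency edge is created from
    // the report task to each test task; finalizedBy alone orders them.
    testResults.from(eachTestTask.binaryResultsDirectory.locationOnly)
    eachTestTask.finalizedBy(thisTask)
}
```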

chadlwilson (Author) commented Jun 16, 2023

Thank you very much! Will give that a go. (edit: seems to work fine, thank you!)

Do you think this might be something to do with the changes to finalizedBy and mustRunAfter in Gradle 7.6 or so? #10549 (we have some mustRunAfter relationships I don't quite understand)

I don't quite understand why the same behavior isn't observed when running ./gradlew allTests or other individual test tasks, i.e. what is special about requesting this one single task that triggers the O(n^2+) processing.

Perhaps I could make allTestsJunitHtmlReport apply only when running allTests rather than when running an individual test task; I think it was changed to always finalize all tasks for some reason. Perhaps the issue is triggered by the weird afterEvaluate stuff?

I do note, however, that we have:

  • a whole lot of warnings about needing to use tasks.register and tasks.withType(..).configureEach to defer configuration, which may relate to why this is painful (see the sketch after this list)
  • some hacks still lingering that seem to stem from long-removed use of gradle-testsets-plugin, including the forAllTask afterEvaluate
  • other problems with some of the custom stuff being done here, like filtering test tasks with --tests not working properly and running tests for dependent tasks (!!), so I probably need to review/simplify everything in this "test task" area
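
For the first point, the deferred style would look like this (a minimal sketch; the task name is hypothetical):

```groovy
// Deferred creation: the configuration block runs only if/when the task
// is actually required, instead of eagerly at configuration time.
tasks.register('someTest', Test) {
    useJUnitPlatform()
}

// Deferred bulk configuration: applied lazily to each Test task as it is
// realized, unlike tasks.withType(Test) { ... }, which realizes them all.
tasks.withType(Test).configureEach {
    maxParallelForks = Runtime.runtime.availableProcessors().intdiv(2) ?: 1
}
```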

chadlwilson added a commit to gocd/gocd that referenced this issue Jun 16, 2023
Work around dependencies created by the multi-module test reports causing issues with Gradle 7.6+ task configuration/finalization, based on the suggestion at gradle/gradle#25361 (comment)

Seems to cause no ill side effects. To review these dependencies later.
ghale (Member) commented Jun 20, 2023

There was definitely a change in 7.6 that attempted to handle finalizer tasks more properly and introduced the problem you've found. In the task graph, mustRunAfter relationships and dependsOn relationships are extremely similar, so it's likely the same problem as discussed in the issue you linked.

The reason it's not an issue with allTests is that allTests pulls all of the finalized Test tasks into the graph. The problematic code comes into play when we are looking at the finalizer task and searching for unscheduled dependencies that might also need to be pulled into the graph. Since allTests causes all of the finalizer's dependencies to be scheduled, we never actually descend into the problematic logic; it's only when you request some subset of those tasks that we trigger the problem. By removing the task dependency between the finalizer and the test tasks, we also avoid it, as we just rely on the finalization relationship to ensure that the test tasks run first. It does mean that running allTestsJunitHtmlReport by itself will not cause any of the test tasks to execute, but I don't get the impression that's something you would need.
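
For illustration, the triggering pattern is roughly this (a sketch, not the exact gocd build logic):

```groovy
// One report task that both depends on and finalizes every Test task.
def reportTask = tasks.named('allTestsJunitHtmlReport', TestReport)
tasks.withType(Test).configureEach { testTask ->
    reportTask.configure {
        testResults.from(testTask)   // dependency edge: report depends on test
    }
    testTask.finalizedBy(reportTask) // finalizer edge: test finalized by report
}
// Requesting allTests schedules every dependency of the finalizer up front;
// requesting a subset (e.g. server:fastUnitTest) leaves unscheduled
// dependencies to search for, which is where the O(n^2+) logic kicks in.
```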

One option to think about with respect to test report aggregation might be the test report aggregation plugin introduced in Gradle 7.4. It's a little bit different from your current configuration in that it uses test suites and aggregates by test type (which can be any arbitrary string). You'd have to convert your tests to use test suites, but your use case is pretty much what test suites were designed to address. It would likely simplify the configuration a good bit, I think, though not without some effort during conversion.
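
As a sketch, based on that plugin's documented usage (the project path here is hypothetical):

```groovy
// In an aggregating project's build.gradle:
plugins {
    id 'test-report-aggregation'
}

dependencies {
    // Projects whose test results should be rolled up.
    testReportAggregation project(':server')
}

reporting {
    reports {
        // Aggregates results from all test suites of the given type.
        testAggregateTestReport(AggregateTestReport) {
            testType = TestSuiteType.UNIT_TEST
        }
    }
}
```

Running ./gradlew testAggregateTestReport would then produce the combined HTML report.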

ghale added a commit that referenced this issue Jun 21, 2023
When a finalizer finalizes multiple tasks, and also has dependencies, performance
for calculating the task graph can be exceedingly poor for large task graphs.  This
addresses the problem in two ways:

- Only schedule the members of a finalizer group once.
- Instead of iterating over every member and deciding if it is a dependency of a finalized node, iterate over the finalized nodes and calculate their dependencies that are also members.

This dramatically reduces the time to calculate the task graph in large builds that exhibit this behavior.

Fixes #25361
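
Illustratively (pseudocode only, not the actual Gradle internals; the helper names are hypothetical), the change in shape described by those bullets is:

```groovy
// Before: for every group member, search the finalized nodes' dependencies
// to decide whether that member must be scheduled.
members.each { member ->
    if (finalizedNodes.any { node -> dependenciesOf(node).contains(member) }) {
        schedule(member)
    }
}

// After: traverse each finalized node's dependencies once, scheduling those
// that are also group members; one walk instead of one per member.
finalizedNodes.each { node ->
    dependenciesOf(node).findAll { members.contains(it) }.each { schedule(it) }
}
```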
ghale added a commit that referenced this issue Jun 26, 2023
ghale added a commit that referenced this issue Jun 26, 2023
(both with the same commit message as above)
bot-gradle added a commit that referenced this issue Jul 1, 2023
…set of finalized tasks

When a finalizer finalizes multiple tasks, and also has dependencies, performance for calculating the task graph can be exceedingly poor for large task graphs when a subset of the finalized tasks are executed.  For instance, if a task finalizes all of the test tasks in a build, and also sets them as dependencies, and then only one test task is requested on the command line, it will trigger this behavior (oddly, requesting all test tasks avoids the problem).

This addresses the problem in two ways:

- Only schedule the members of a finalizer group once.
- Instead of iterating over every member and deciding if it is a dependency of a finalized node, iterate over the finalized nodes and calculate their dependencies that are also members.

This dramatically reduces the time to calculate the task graph in large builds that exhibit this behavior.  The performance is still not great, but it's at least manageable even for very large builds.  We can improve this more over time.

Fixes #25361

Co-authored-by: Gary Hale <gary@gradle.com>
@bot-gradle added this to the 8.3 RC1 milestone on Jul 1, 2023
@ov7a removed the to-triage and 👋 team-triage labels on Jul 3, 2023
ljacomet pushed a commit that referenced this issue Sep 29, 2023
ljacomet pushed a commit that referenced this issue Sep 29, 2023
(both with the same commit message as the ghale commits above)