Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-72325] Define an executor and scheduler for SandboxResolvingClassLoader #543

Merged
merged 1 commit into from Nov 27, 2023

Conversation

basil
Copy link
Member

@basil basil commented Nov 17, 2023

Environment

Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)
Maven home: /usr/share/apache-maven-3.9.5
Java version: 21.0.1, vendor: Eclipse Adoptium, runtime: /opt/jdk-21
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.10.197-186.748.amzn2.x86_64", arch: "amd64", family: "unix"

Steps to reproduce

See JENKINS-72325. Run a jenkinsci/bom job with the following patch:

diff --git a/Jenkinsfile b/Jenkinsfile
index 2f2ff395..6e8b5e29 100644
--- a/Jenkinsfile
+++ b/Jenkinsfile
@@ -6,10 +6,6 @@ if (BRANCH_NAME == 'master' && currentBuild.buildCauses*._class == ['jenkins.bra
 }
 
 def mavenEnv(Map params = [:], Closure body) {
-  def attempt = 0
-  def attempts = 6
-  retry(count: attempts, conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()]) {
-    echo 'Attempt ' + ++attempt + ' of ' + attempts
   // no Dockerized tests; https://github.com/jenkins-infra/documentation/blob/master/ci.adoc#container-agents
   node(params['nodePool'] ? 'maven-bom': 'maven-' + params['jdk']) {
     timeout(120) {
@@ -27,7 +23,6 @@ def mavenEnv(Map params = [:], Closure body) {
       }
     }
   }
-  }
 }
 
 @NonCPS
@@ -90,6 +85,9 @@ if (BRANCH_NAME == 'master' || fullTestMarkerFile || weeklyTestMarkerFile || env
       return
     }
     pluginsByRepository.each { repository, plugins ->
+      if (line != 'weekly' || repository != 'pipeline-groovy-lib-plugin') {
+        return
+      }
       branches["pct-$repository-$line"] = {
         def jdk = line == 'weekly' ? 21 : 11
         if (jdk == 21) {
@@ -103,11 +101,13 @@ if (BRANCH_NAME == 'master' || fullTestMarkerFile || weeklyTestMarkerFile || env
           withEnv([
             "PLUGINS=${plugins.join(',')}",
             "LINE=$line",
-            'EXTRA_MAVEN_PROPERTIES=maven.test.failure.ignore=true:surefire.rerunFailingTestsCount=1'
+            'EXTRA_MAVEN_PROPERTIES=test=org.jenkinsci.plugins.workflow.libs.LibraryMemoryTest'
           ]) {
             sh '''
             mvn -v
+            while true; do
               bash pct.sh
+            done
             '''
           }
           withCredentials([string(credentialsId: 'launchable-jenkins-bom', variable: 'LAUNCHABLE_TOKEN')]) {

This patch runs org.jenkinsci.plugins.workflow.libs.LibraryMemoryTest in a loop against the latest Jenkins weekly release on Java 21.

Expected results

The test runs in a loop without failure. This is the actual result when running the test on Java 17.

Actual results

The test eventually fails after 1-25 iterations with

java.lang.AssertionError: 
{org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3=static final java.util.concurrent.ForkJoinPool java.util.concurrent.ForkJoinPool.common->
java.util.concurrent.ForkJoinPool@3badfecc-queues->
[Ljava.util.concurrent.ForkJoinPool$WorkQueue;@69955964-[3]->
java.util.concurrent.ForkJoinPool$WorkQueue@85db49a-owner->
java.util.concurrent.ForkJoinWorkerThread@78c1d6a7-inheritedAccessControlContext->
java.security.AccessControlContext@571d981c-context->
[Ljava.security.ProtectionDomain;@6a6b4de5-[14]->
java.security.ProtectionDomain@3b9a0b17-classloader->
org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3}
	at org.junit.Assert.fail(Assert.java:89)
	at org.jvnet.hudson.test.MemoryAssert.assertGC(MemoryAssert.java:211)
	at org.jvnet.hudson.test.MemoryAssert.assertGC(MemoryAssert.java:149)
	at org.jenkinsci.plugins.workflow.libs.LibraryMemoryTest.loaderReleased(LibraryMemoryTest.java:85)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:607)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.lang.Thread.run(Thread.java:1583)

From standard error:

  14.291 [p #1] [Pipeline] Start of Pipeline
registering WorkflowScript@52cd846f
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$TimingLoader@35033c7a
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@124b8d74
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$TimingLoader@25eee7d5
…from org.jenkinsci.plugins.workflow.cps.GroovySourceFileAllowlist$ClassLoaderImpl@24462138
…from org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader@500d3630
  14.493 [p #1] [Pipeline] node
  14.543 [p #1] Running on Jenkins in /home/jenkins/agent/workspace/Tools_bom_PR-2655/target/pct-work/pipeline-groovy-lib-plugin/target/tmp/j h13732391855676351643/workspace/p
  14.594 [p #1] [Pipeline] {
  14.644 [p #1] [Pipeline] }
  14.644 [p #1] [Pipeline] // node
registering leak@e550996
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@124b8d74
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$TimingLoader@25eee7d5
…from org.jenkinsci.plugins.workflow.cps.GroovySourceFileAllowlist$ClassLoaderImpl@24462138
…from org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader@500d3630
registering p.C@702b98c2
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@124b8d74
…from org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$TimingLoader@25eee7d5
…from org.jenkinsci.plugins.workflow.cps.GroovySourceFileAllowlist$ClassLoaderImpl@24462138
…from org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader@500d3630
  14.661 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#optimizeStorage: Migrating CpsFlowExecution[Owner[p/1:p #1]] to BulkFlowNodeStorage
  14.672 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#optimizeStorage: Copied nodes to workflow-completed
  14.672 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#optimizeStorage: Deleted /home/jenkins/agent/workspace/Tools_bom_PR-2655/target/pct-work/pipeline-groovy-lib-plugin/target/tmp/j h13732391855676351643/jobs/p/builds/1/workflow
  14.679 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#cleanUpHeap: cleanUpHeap on Owner[p/1:p #1]
  14.680 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#closeShells: closing main class loader from Owner[p/1:p #1]
  14.680 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#closeShells: closing trusted class loader from Owner[p/1:p #1]
  14.680 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpLoader: ignoring org.jenkinsci.plugins.workflow.cps.GroovySourceFileAllowlist$ClassLoaderImpl@24462138
  14.680 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpLoader: found org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@124b8d74
  14.682 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#cleanUpGlobalClassValue: cleaning up [class p.C, class leak] associated with org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@124b8d74
  14.683 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpLoader: found leak
  14.683 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpClassHelperCache: cleaning up leak from ClassHelperCache? false
  14.683 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpLoader: found p.C
  14.683 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpClassHelperCache: cleaning up p.C from ClassHelperCache? true
  14.683 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpLoader: found org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3
  14.684 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#cleanUpGlobalClassValue: cleaning up [class WorkflowScript] associated with org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3
  14.684 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpLoader: found WorkflowScript
  14.684 [id=155]	FINER	o.j.p.w.cps.CpsFlowExecution#cleanUpClassHelperCache: cleaning up WorkflowScript from ClassHelperCache? false
  14.695 [p #1] [Pipeline] End of Pipeline
  14.721 [p #1] Finished: SUCCESS
  14.722 [id=155]	FINE	o.j.p.w.cps.CpsFlowExecution#logTimings: timings for Owner[p/1:p #1]: {classLoad=1775ms, flowNode=34ms, parse=1413ms, run=412ms, saveProgram=99ms}
Trying to collect org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3…
GC after allocation of size 1331
GC after allocation of size 1730
GC after allocation of size 2249
GC after allocation of size 2923
GC after allocation of size 3799
GC after allocation of size 4938
GC after allocation of size 6419
GC after allocation of size 8344
GC after allocation of size 10847
GC after allocation of size 14101
GC after allocation of size 18331
GC after allocation of size 23830
GC after allocation of size 30979
GC after allocation of size 40272
GC after allocation of size 52353
GC after allocation of size 68058
GC after allocation of size 88475
GC after allocation of size 115017
GC after allocation of size 149522
GC after allocation of size 194378
GC after allocation of size 252691
GC after allocation of size 328498
GC after allocation of size 427047
GC after allocation of size 555161
GC after allocation of size 721709
GC after allocation of size 938221
GC after allocation of size 1219687
GC after allocation of size 1585593
GC after allocation of size 2061270
GC after allocation of size 2679651
GC after allocation of size 3483546
GC after allocation of size 4528609
GC after allocation of size 5887191
GC after allocation of size 7653348
GC after allocation of size 9949352
GC after allocation of size 12934157
GC after allocation of size 16814404
GC after allocation of size 21858725
GC after allocation of size 28416342
GC after allocation of size 36941244
GC after allocation of size 48023617
Failed to collect org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@769578f3, looking for strong references…
  27.516 [id=30]	INFO	hudson.lifecycle.Lifecycle#onStatusUpdate: Stopping Jenkins

The failure cannot be easily reproduced locally; however, it failed in CI (with the above patch) after about 1-25 iterations against commit 2d062012 (Nov 13), commit 7ae60b1 (Nov 5), commit 430707e (October 29), commit c300386 (October 16), commit 155e2e7 (October 9), and commit 2798690 (October 2). So this is not a very recent flake.

Evaluation

Care must be taken to avoid leaking instances of GroovyClassLoader when computing cached values in SandboxResolvingClassLoader. In its default configuration, Caffeine uses ForkJoinPool#commonPool as its Executor. As of recent Java versions, ForkJoinPool can capture a reference to GroovyClassLoader by creating a ForkJoinWorkerThread whose Thread#inheritedAccessControlContext refers to an AccessControlContext whose ProtectionDomain refers to GroovyClassLoader.

This is arguably a Java Platform regression, likely caused by JEP-425. However, as of JEP-411, Thread#inheritedAccessControlContext is deprecated for removal. Therefore, we explicitly decline to report this issue upstream, although we must contend with it in the short to medium term in Jenkins.

Rejected Solution

We initially tried configuring Caffeine with the Executor returned by Executors#newCachedThreadPool. However, this merely traded one set of problems for another. This Executor was seen to also capture a reference to GroovyClassLoader by creating a Thread whose Thread#inheritedAccessControlContext refers to an AccessControlContext whose ProtectionDomain refers to GroovyClassLoader:

{org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@fe818a4=java.lang.Thread@2ff2e747-group->
java.lang.ThreadGroup@74179f7a-threads->
[Ljava.lang.Thread;@327d48a0-[46]->
java.lang.Thread@7c8cb179-inheritedAccessControlContext->
java.security.AccessControlContext@151d1d2a-context->
[Ljava.security.ProtectionDomain;@62028308-[14]->
java.security.ProtectionDomain@427cac0-classloader->
org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@fe818a4}

Additionally, when the thread pool's ThreadFactory is not wrapped by ClassLoaderSanityThreadFactory (see JENKINS-49206), the Executor can sometimes create Thread instances whose Thread#contextClassLoader refers to GroovyClassLoader:

{org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@7349b2b9=java.lang.Thread@543d4cbe-group->
java.lang.ThreadGroup@cad4371-threads->
[Ljava.lang.Thread;@6a215fea-[50]->
java.lang.Thread@44b6b924-contextClassLoader->
org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader@7349b2b9}

Given the problems with this approach, we rejected a solution of Executors#newCachedThreadPool.

Solution

We create a dedicated Executors#newSingleThreadExecutor, which is safe for use with Caffeine from a memory perspective because:

  • In contrast to ForkJoinPool#commonPool, the thread is eagerly created and avoids references to GroovyClassLoader in Thread#inheritedAccessControlContext.
  • In contrast to Executors#newCachedThreadPool, the thread is eagerly created and avoids references to GroovyClassLoader in Thread#inheritedAccessControlContext.
  • In contrast to Executors#newCachedThreadPool, the thread is eagerly created and avoids references to GroovyClassLoader in Thread#contextClassLoader, thereby avoiding the need for ClassLoaderSanityThreadFactory.

A single-threaded Executor is safe for use with Caffeine from a CPU perspective because the cache's work is implemented with cheap O(1) algorithms.

Future work

In the medium term, once Thread#inheritedAccessControlContext is removed upstream, we could possibly switch to a combination of Executors#newCachedThreadPool and ClassLoaderSanityThreadFactory.

The comments state:

The inner cache […] expires entries 15 minutes after they are added to the cache, so that classes defined by dynamically installed plugins become available even if there were negative cache hits prior to the installation (ideally this would be done with a listener).

If this were done, then this two-level Caffeine cache could possibly be replaced by something based on ClassValue. See, for example, org.kohsuke.stapler.ClassLoaderValue. We explicitly decline to perform this cleanup, but we have updated the existing wish list comment to add this cleanup to the list.

Implementation notes

See the code comments.

Testing done

We tested by running the test in a loop in CI (the only environment where the problem can be reliably reproduced in under 50 iterations) in four configurations:

  1. Java 17, without the changes in this PR
  2. Java 17, with the changes in this PR
  3. Java 21, without the changes in this PR
  4. Java 21, with the changes in this PR

In configuration 3, the test always failed within 50 iterations with the error shown above. In configuration 1, the test passed reliably after two hours of iterations. After this PR, configurations 2 and 4 are passing reliably after two hours of iterations on both Java 17 and Java 21.

In addition to the above tests, I also ran a successful jenkinsci/bom test with the weekly-test label to test PCT for all plugins against the weekly release line.

@basil basil force-pushed the cacheExecutor2 branch 3 times, most recently from 4683a5a to c275367 Compare November 17, 2023 21:22
@basil basil changed the title Define an executor for SandboxResolvingClassLoader Define an executor and scheduler for SandboxResolvingClassLoader Nov 17, 2023
basil added a commit to basil/bom that referenced this pull request Nov 17, 2023
@basil basil changed the title Define an executor and scheduler for SandboxResolvingClassLoader [JENKINS-72325] Define an executor and scheduler for SandboxResolvingClassLoader Nov 17, 2023
@basil basil marked this pull request as ready for review November 17, 2023 22:36
@basil basil requested a review from a team as a code owner November 17, 2023 22:36
Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the investigation. Sounds like the appropriate short-term fix.

(For reference, some important historical changes to this class: #160, #253, #265)

Comment on lines +73 to +78
* <p>In the medium term, once {@link Thread#inheritedAccessControlContext} is removed upstream, we could possibly
* switch to a combination of {@link Executors#newCachedThreadPool} and {@link ClassLoaderSanityThreadFactory}.
*
* <p>In the long term, a listener should be added to inform this class when dynamically installed plugins become
* available, as described in the comments to {@link #makeParentCache(boolean)}, in which case the use of Caffeine
* could possibly be removed entirely.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the listener approach could probably be implemented by us at any time, whereas the removal of Thread.inheritedAccessControlContext may take years I guess.

@jglick jglick merged commit 0431701 into jenkinsci:master Nov 27, 2023
14 checks passed
@basil basil deleted the cacheExecutor2 branch November 27, 2023 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants