[JENKINS-55287] Improve error messages for nonresumable Pipelines #363

dwnusbaum · 2020-06-18T21:23:32Z

See JENKINS-55287. Subsumes #354 (and will subsume #234 eventually).

Based on the cases I've seen, I think the root cause of JENKINS-55287 is that Jenkins is crashing, so it is expected that PERFORMANCE_OPTIMIZED Pipelines are unable to resume. From that perspective, the issue is more that the error message is misleading and causes users to think they have hit a bug in Pipeline, when really this case is expected and the fix is to investigate and understand why Jenkins is crashing.

The field CpsFlowExecution.persistedClean was supposed to help detect these cases and print a more relevant error message, but there were two issues:

It was always set to true right after a build started because of this call to CpsFlowExecution.saveOwner that occurs right after the build starts. In the cases of JENKINS-55287 I have seen, the only persisted flow node is always the FlowStartNode, meaning that this is the only time the build was ever saved, and so the build is always considered resumable unless Jenkins shuts down normally and there is an error while saving the build in CpsFlowExecution.suspendAll. To fix this we only modify persistedClean if Jenkins is shutting down.
The logic that checks if a Pipeline can be resumed used to only happen after the flow graph was initialized, but in most (all?) cases the flow graph will be incomplete and so that will fail. That means that we also need to check if the Pipeline is resumable either before trying to initializing flow nodes (pessimistic) and fail immediately, or after we receive an error loading flow nodes (optimistic) and then change the reported error as necessary. EDIT: Part of the problem here is that the flow graph is persisted after every step for PERFORMANCE_OPTIMIZED pipelines, see [JENKINS-53358] Remove bogus persistence calls due to notifyListeners #234. I don't think it makes sense to try to update the logic here without fixing that first, so I will look into fixing that separately.

Also fixes tracking of whether PERFORMANCE_OPTIMIZED Pipelines are able to resume or not. Previously, they were always marked as resumable because right after adding the FlowStartNode, the build was saved, cuasing the persistedClean flag to be set to true. Now we only modify persistedClean when the build has already completed or when Jenkins is shutting down.

dwnusbaum · 2020-06-18T21:25:14Z

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

-                initializeStorage();  // Throws exception and bombs out if we can't load FlowNodes
-            } catch (Exception ex) {
-                LOGGER.log(Level.WARNING, "Error initializing storage and loading nodes, will try to create placeholders for: "+this, ex);
+                if (!canResume()) {


Alternatively, we could check this even before we attempt to initialize storage and just bail out right away. I don't know if that is too pessimistic, maybe there is a case where the Pipeline is not resumable but flow nodes are all up to date and so the optimistic behavior here is useful because you will still be able to see the existing flow graph in those cases?

dwnusbaum · 2020-06-18T21:28:23Z

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

                createPlaceholderNodes(ex);
-                return;
+            } catch (Exception ex2) {


This is all mostly refactoring. The old code had this structure:

try { try { // code that can throw } catch (first) { // code that can throw return } } catch (second) { throw second // first exception, if any, is totally lost }

Now it has this structure:

try { // code that can throw } catch (first) { try { // code that can throw } catch (second) { second.addSuppressed(first) throw second } return }

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

dwnusbaum · 2020-06-18T21:29:38Z

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

@@ -1526,10 +1540,6 @@ public boolean isPaused() {
        return false;
    }

-    private void setPersistedClean(boolean persistedClean) {  // Workaround for some issues with anonymous classes.


src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

dwnusbaum · 2020-06-18T21:31:34Z

src/test/java/org/jenkinsci/plugins/workflow/cps/PersistenceProblemsTest.java

@@ -244,7 +243,6 @@ public void inProgressNormal() throws Exception {
    }

    @Test
-    @Ignore


These tests have been ignored since they were added. I am not sure why, but I went ahead and unignored them.

You probably already knew this, but I just noticed that these test cases seem to be duplicated in workflow-job-plugin's CpsPersistenceTest. Once the dust settles, it might be worth doing a sweep of the two copies to ensure that they are in sync with each other.

Yeah, I think I will probably delete the copies in workflow-job after this. Ideally we would run the PCT in the CI builds here with just the core Pipeline plugins so that changes are tested against the newest versions of everything.

dwnusbaum · 2020-06-18T21:32:02Z

src/test/java/org/jenkinsci/plugins/workflow/cps/PersistenceProblemsTest.java

    @Test
-    @Ignore
    public void inProgressMaxPerfDirtyShutdown() throws Exception {


This is the exact situation reported in JENKINS-55287 as far as I can tell.

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

… being paused

…out after resuming or unpausing a build

dwnusbaum · 2020-06-19T18:30:33Z

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

+                    // Before we resume, we need to unset persistedClean in case Jenkins restarts again.
+                    persistedClean = null;
+                    saveOwner();


Should add a test case for this. The test would use a PERFORMANCE_OPTIMIZED Pipeline, restart Jenkins normally, then restart Jenkins again with a hard shutdown and make sure we get the error message that the Pipeline can't be resumed rather than a raw exception.

I came up with a test case, but it doesn't fail without this line because of a random save that happens when CpsFlowExecution.notifyListeners is called. IIUC, #234 was trying to stop that save from occurring for PERFORMANCE_OPTIMIZED Pipelines, so I will look into making related changes either in this PR or before this PR is merged.

dwnusbaum · 2020-06-19T18:32:09Z

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

+                    // Pausing the build sets persistedClean to true so the build can resume, so if we unpause the build
+                    // we need to unset persistedClean again.
+                    persistedClean = null;
+                    saveOwner();


We should add a test case for this. The Pipeline would need to be paused, then unpaused, and then we do a hard shutdown and make sure we get the error message that the Pipeline can't be resumed rather than a raw exception.

dwnusbaum · 2020-06-19T18:33:25Z

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java

        if (isComplete() || this.getDurabilityHint().isPersistWithEveryStep()) {
            // Nothing to persist OR we've already persisted it along the way.
            return;
        }
+        LOGGER.log(Level.INFO, "Attempting to checkpoint all data for {0}{1}", new Object[] {


The logging-related changes were originally part of #354 (there are some slight modifications here), so it might be easier to review that PR first.

…flaky

dwnusbaum · 2020-06-29T19:26:13Z

Authorization failed for https://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/plugin/4.1/plugin-4.1.pom 403 Forbidden

Underlying issue was fixed by jenkinsci#368.

dwnusbaum · 2021-07-12T16:40:44Z

I don't remember exactly what was wrong with this approach, but back in February I spent a bit more time looking into this and decided it would be better to just immediately throw an exception from CpsFlowExecution.onLoad if CpsFlowExecution.canResume was false to avoid issues with a partially-initialized CpsFlowExecution, especially since in these scenarios the data that was persisted is unlikely to be useful anyway since it will be very incomplete.

See my JENKINS-55287-2 branch for what that approach looked like. I can't remember why I didn't file a PR for that branch at the time. I think that I might have thought we needed a simultaneous patch to WorkflowRun.getExecution in workflow-job to clean up the error handling so that we don't have multiple exceptions printed and to make it possible for the error message from the exception thrown in workflow-cps to be printed to the build logs rather than just showing up in the Jenkins system logs. I also think I wanted to do some more testing in relation to #234, since both of my fixes to improve the log messages pick up that fix as well, and I recall that we saw some test failures when that patch was filed originally. I think I also thought a few more tests were needed to check some of the edge cases that I noticed while investigating this issue.

dwnusbaum commented Jun 18, 2020

View reviewed changes

dwnusbaum commented Jun 19, 2020

View reviewed changes

src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java Outdated Show resolved Hide resolved

dwnusbaum added 5 commits June 19, 2020 12:35

Jenkins.isTerminating() cannot be used from terminators

b360ae0

Merge branch 'JENKINS-55287-logging' into JENKINS-55287

f631a44

Clarify log messages in CpsFlowExecution.checkpoint when the build is…

686a6e4

… being paused

Don't set persistedClean to true in CpsFlowExecution.saveOwner at all

eb08b5b

Always modify persistedClean in CpsFlowExecution.checkpoint, null it …

2214c42

…out after resuming or unpausing a build

dwnusbaum commented Jun 19, 2020

View reviewed changes

Ignore RestartingLoadStepTest.updatedBindingsOnRestart because it is …

b3a6847

…flaky

dwnusbaum closed this Jun 29, 2020

dwnusbaum reopened this Jun 29, 2020

dwnusbaum mentioned this pull request Jun 30, 2020

Expose assertGenerateSnippet to pipeline-model-definition #362

Merged

dwnusbaum closed this Jun 30, 2020

dwnusbaum reopened this Jun 30, 2020

bitwiseman approved these changes Jul 1, 2020

View reviewed changes

wip

2ad7e70

dwnusbaum mentioned this pull request Jul 17, 2020

Try to diagnose why RestartingLoadStepTest.updatedBindingsOnRestart is flaky #366

Closed

dwnusbaum mentioned this pull request Aug 10, 2020

[JENKINS-55287] Improve shutdown-related logging #354

Merged

dwnusbaum added 2 commits August 11, 2020 16:54

Merge branch 'master' into JENKINS-55287

d0c638c

Unignore RestartingLoadStepTest.updatedBindingsOnRestart

22e7086

Underlying issue was fixed by jenkinsci#368.

dwnusbaum closed this Jul 12, 2021

dwnusbaum mentioned this pull request Sep 28, 2022

[JENKINS-55287] update exception warning for failure to load flow node #589

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JENKINS-55287] Improve error messages for nonresumable Pipelines #363

[JENKINS-55287] Improve error messages for nonresumable Pipelines #363

dwnusbaum commented Jun 18, 2020 •

edited

dwnusbaum Jun 18, 2020

dwnusbaum Jun 18, 2020 •

edited

dwnusbaum Jun 18, 2020

dwnusbaum Jun 18, 2020

basil Jun 19, 2020

dwnusbaum Jun 19, 2020

dwnusbaum Jun 18, 2020

dwnusbaum Jun 19, 2020

dwnusbaum Jul 1, 2020 •

edited

dwnusbaum Jun 19, 2020

dwnusbaum Jun 19, 2020

dwnusbaum commented Jun 29, 2020

dwnusbaum commented Jul 12, 2021 •

edited

@@ @@ -244,7 +243,6 @@ public void inProgressNormal() throws Exception { @@
                   }
                   @Test
-                  @Ignore

[JENKINS-55287] Improve error messages for nonresumable Pipelines #363

[JENKINS-55287] Improve error messages for nonresumable Pipelines #363

Conversation

dwnusbaum commented Jun 18, 2020 • edited

Choose a reason for hiding this comment

dwnusbaum Jun 18, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dwnusbaum Jul 1, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dwnusbaum commented Jun 29, 2020

dwnusbaum commented Jul 12, 2021 • edited

dwnusbaum commented Jun 18, 2020 •

edited

dwnusbaum Jun 18, 2020 •

edited

dwnusbaum Jul 1, 2020 •

edited

dwnusbaum commented Jul 12, 2021 •

edited