Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-55287] update exception warning for failure to load flow node #589

Merged
merged 3 commits into from
Sep 29, 2022

Conversation

car-roll
Copy link
Contributor

@car-roll car-roll commented Sep 27, 2022

See https://issues.jenkins.io/browse/JENKINS-55287

Update the exception message when FlowNode fails to load because Jenkins did not shut down cleanly. Suggest pipeline durability MAX_SURVIVABILITY if not already set.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@car-roll car-roll requested a review from a team September 27, 2022 15:58
Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fact that this message shows up in this case at all is a bug, and it might be possible to hit this exception in other cases, but this seems like the most prudent fix at this point.

The intention seems to have been for

throw new IOException("Cannot resume build -- was not cleanly saved when Jenkins shut down.");
to be thrown in this case, but once you start trying to change the code to make that work as intended you run into other problems; see #363 and the branch mentioned in the final comment of that PR, both of which I abandoned.

@@ -676,7 +676,7 @@ protected synchronized void initializeStorage() throws IOException {
h.setForDeserialize(storage.getNode(entry.getValue()));
heads.put(h.getId(), h);
} else {
throw new IOException("Tried to load head FlowNodes for execution "+this.owner+" but FlowNode was not found in storage for head id:FlowNodeId "+entry.getKey()+":"+entry.getValue());
throw new IOException("Flow node could not be loaded because Jenkins did not shut down cleanly and the Pipeline Durability was not set to MAX_SURVIVABILITY");
Copy link
Member

@dwnusbaum dwnusbaum Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might tweak the message a bit to mirror the terminology on https://www.jenkins.io/doc/book/pipeline/scaling-pipeline/ and log the FlowHead:FlowNode info if the logger has Level.FINE enabled in case someone is debugging things at some point (untested):

Suggested change
throw new IOException("Flow node could not be loaded because Jenkins did not shut down cleanly and the Pipeline Durability was not set to MAX_SURVIVABILITY");
throw new IOException("Cannot resume build because Jenkins did not shut down cleanly and the Pipeline durability setting was not set to MAX_SURVIVABILITY" + (LOGGER.isLoggable(Level.FINE) ? " (Missing FlowNode " + entry.getValue() + " for FlowHead " + entry.getKey() + ")" : ""));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this might happen even if you are using MAX_SURVIVABILITY in obscure cases, but I'm not sure. Maybe check the actual durability setting and only add the "and the Pipeline durability setting was not set to MAX_SURVIVABILITY" part of the message if that is in fact true. Probably only matters if anyone in the Jira issue mentioned seeing this exception when using MAX_SURVIVABILITY, otherwise it could be improved later as needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also hesitate to suggest a remediation without actually knowing that it suffices.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷

if (getDurabilityHint() != FlowDurabilityHint.MAX_SURVIVABILITY) {
suggestDurability = "Try setting pipeline durability to MAX_SURVIVABILITY. ";
}
throw new IOException("Cannot resume build because Jenkins did not shut down cleanly. " + suggestDurability + (LOGGER.isLoggable(Level.FINE) ? "(Missing FlowNode " + entry.getValue() + " for FlowHead " + entry.getKey() + ")" : ""));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. Jenkins may have shut down correctly, we can not know.

All we know is a file we expect to be there is not. This could be disk corruption outside Jenkins or a restore from a non atomic backup....

We should say we can not find a file (and which file, log it add a chained FileNotFoundExcption?). If you are not using a durable mode we should say to use it, otherwise this is expected for a non clean shutdown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the specific XML file that is missing is completely irrelevant to users (and probably even to plugin developers). I would guess that 95% or more of the time users only see this exception when Jenkins crashed while using the PERFORMANCE_OPTIMIZED durability setting, and in that scenario the new message is strictly an improvement over the old one since the behavior is typical and expected.

If you really want to improve the logic to diagnose the scenario more precisely, then I would recommend looking into #363 or master...dwnusbaum:JENKINS-55287-2 like I mentioned in #589 (review), but it gets complex fast because this code is undertested and there are many subtle issues with the existing logic and its interaction with WorkflowRun.onLoad, and I am not confident that either of those changes would fix the existing bugs without introducing new issues, which is why I abandoned them.

If you want, you could keep the existing logic more or less but make the message a lot more detailed and hedge all recommendations, but in practice IDK if it really makes a difference:

if (durabilitySetting != MAX_SURVIVABILITY) {
    throw new AbortException("Cannot resume build because flow node X could not be loaded. " +
            "This is expected to happen when using the " + durabilitySetting + " durability setting and Jenkins is not shut down cleanly. " +
            "Consider investigating to understand if Jenkins was not shut down cleanly or switching to the MAX_SURVIVABILITY durability setting which should prevent this issue in most cases." );
} else {
    throw new AbortException("Cannot resume build because flow node X could not be loaded.");
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you could keep the existing logic more or less but make the message a lot more detailed and hedge all recommendations, but in practice IDK if it really makes a difference:

that was what I was trying to say - thanks for reading my mind :)

@car-roll car-roll merged commit 5ea6281 into jenkinsci:master Sep 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants