New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix more subtle durability issues #223
Fix more subtle durability issues #223
Conversation
…rsist at shutdown after threads idle to avoid concurrentModification
…lus remove redundant writes
@@ -1901,8 +1928,7 @@ private void checkAndAbortNonresumableBuild() { | |||
/** Ensures that even if we're limiting persistence of data for performance, we still write out data for shutdown. */ | |||
@Override | |||
protected void notifyShutdown() { | |||
checkAndAbortNonresumableBuild(); | |||
checkpoint(); | |||
// No-op, handled in the suspendAll terminator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoids a ConcurrentModificationException I saw since we ensure threads are done and there's only a single Terminator.
The NPE from |
@jglick Yeah, I knew about that one, just including it for completeness here -- the more alarming issue is that the other error blocks the shutdown-time save of the build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a serious review since I only groggily understand what is being changed here.
@@ -0,0 +1,39 @@ | |||
# The Pipeline Persistence Model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation, WUT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know @jglick, pure absurdity, what can I say?
doc/persistence.md
Outdated
1. The FlowNodes - stored by a FlowNodeStorage - this holds the FlowNodes created to map to Steps, and for block scoped Steps, start and end of blocks | ||
2. The CpsFlowExecution - this is currently stored in the WorkflowRun, and the primary pieces of interest are: | ||
* heads - the current "tips" of the Flow Graph, i.e. the FlowNodes that represent running steps and are appended to | ||
- A head maps to a CpsThread in the Pipeline program, within the CPSThreadGroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW if you put this kind of thing into package-info.java
, well the HTML is harder to type than Markdown admittedly, but the @link
tags will let you know when you misspell a class name like you did here. FWIW.
doc/persistence.md
Outdated
2. The CpsFlowExecution - this is currently stored in the WorkflowRun, and the primary pieces of interest are: | ||
* heads - the current "tips" of the Flow Graph, i.e. the FlowNodes that represent running steps and are appended to | ||
- A head maps to a CpsThread in the Pipeline program, within the CPSThreadGroup | ||
* starts - the BlockStartNodes marking the start(s) of the currently executing blocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprised to find that this has been in the serial form since the beginning of Workflow. Not sure why—this information should be reconstructible from other metadata AFAIK. Maybe I am missing some subtlety.
doc/persistence.md
Outdated
* | ||
* various other boolean flags & settings for the execution (durability setting, user that started the build, is it sandboxed, etc) | ||
3. The Program -- this is the current execution state of the Pipeline | ||
* This holds the Groovy state (transformed by CPS) plus the CPSThreadGroup for the running branches of the Pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CpsThreadGroup
is the Groovy state. There is nothing else.
} | ||
return myHead; | ||
} else { | ||
LOGGER.log(Level.WARNING, "Tried to load a FlowHead from program with no Execution in PROGRAM_STATE_SERIALIZTION"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
} | ||
} | ||
}); | ||
story.addStep(new Statement() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or then
+ lambda
|
||
/** Test interrupting build by randomly restarting *cleanly* at unpredictable times and verify we stick pick up and resume. */ | ||
@Test | ||
@Ignore //Too long to run as part of main suite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we need a different annotation (categories?) to use to tell Surefire not to run this from mvn test
but do run it if explicitly selected via -Dtest=FlowDurabilityTest#fuzzTimingNonDurableWithCleanRestart
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you've proposed something special with the Maven build here, I have a pseudo-sample in one of my browser tabs (somewhere between 10 and 10^3 at the moment). One of those rainy-day projects -- you can always explicitly run via the IDE anyway, or just remove the Ignore annotations
…zing the execution
…istence is not correct
…node data is corrupted rather than continuing
…let it be nulled + fix tests that delete all flownodes
…seems for persistence verification
…Node present and save + ensure we don't block a Jenkins restart if the programPromise errors rather than loading
…rebuild flow graph if possible
…d placeholder nodes
…le trying to persist state at end of execution
LOGGER.log(Level.WARNING, "Error waiting for Pipeline to suspend: "+exec, ex); | ||
} | ||
} | ||
cpsExec.checkpoint(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jglick Any idea why I see ConcurrentModificationExceptions on the FlowGraph when initially trying to flush the FlowNodeStorage at the start of taking the state checkpoint? I feel like at this point our threads should be paused from running the Program but they clearly are not because CpsThreadGroup#run()
is calling CpsExecution#onProgramEnd()
and finishing the execution. Checking for jenkins.isTerminating()
in CpsThreadGroup#scheduleRun
before running a chunk doesn't seem to help either.
Is there something I'm missing in how this shutdown works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wallpapering over it by using locking in the TimingFlowNodeStorage
to prevent actual concurrent modification but that's a filthy, dirty hack.
@svanoort Before I do a detailed review, just thought I'd request a squash before this actually gets merged, so that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me, but I'm not gonna claim I followed all of it perfectly. =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few miscellaneous comments, but basically I have little grasp of what is going on here, so…good luck!
pom.xml
Outdated
@@ -141,7 +141,8 @@ | |||
<dependency> | |||
<groupId>org.jenkins-ci.plugins.workflow</groupId> | |||
<artifactId>workflow-job</artifactId> | |||
<version>2.20</version> | |||
<!-- Snapshot for better handling of onLoad errors --> | |||
<version>2.21-20180427.223321-5</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jglick Since you only added that a few weeks ago and thus it's not "obvious" maybe you meant something besides 'pro tip'?
@@ -367,7 +365,6 @@ public Next receive(Object o) { | |||
for (BodyExecutionCallback c : callbacks) { | |||
c.onSuccess(sc, o); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be reverted
@@ -609,95 +610,89 @@ private synchronized String getHeadsAsString() { | |||
} else if (myHeads.size() == 0) { | |||
return "empty-heads"; | |||
} else { | |||
return myHeads.entrySet().stream().map(h->h.getKey()+"::"+h.getValue()).collect(Collectors.joining(",")); | |||
return myHeads.entrySet().stream().map(h -> h.getKey() + "::" + h.getValue()).collect(Collectors.joining(",")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated
Queue.Executable ex = owner.getExecutable(); | ||
if (ex instanceof Run) { | ||
Result res = ((Run)ex).getResult(); | ||
setResult(res != null ? res : Result.FAILURE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems weird. Normally null
is a synonym for SUCCESS
. Why would this not be unconditionally FAILURE
?
|
||
FlowNode n = storage.getNode(entry.getValue()); | ||
if (n != null) { | ||
h.setForDeserialize(storage.getNode(entry.getValue())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of reindentation…not reading.
saveOwner(); | ||
} | ||
} | ||
|
||
/** Save the owner that holds this execution. */ | ||
void saveOwner() { | ||
try { | ||
if (this.owner != null && this.owner.getExecutable() instanceof Saveable) { // Null-check covers some anomalous cases we've seen | ||
Saveable saveable = (Saveable)(this.owner.getExecutable()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could introduce variable
@@ -577,6 +577,9 @@ private void invokeBody(CpsThread cur) { | |||
// the first one can reuse the current thread, but other ones need to create new heads | |||
// we want to do this first before starting body so that the order of heads preserve | |||
// natural ordering. | |||
|
|||
// FIXME give this javadocs worth a darn, because this is how we create parallel branches and the docs are crypic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO
preferably
if (run.isBuilding()) { | ||
try { | ||
story.j.waitUntilNoActivityUpTo(timeOutMillis); | ||
} catch (AssertionError ase) { // Allows attaching a debugger here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete now. Debuggers allow you to set exception breakpoints anyway.
@@ -865,19 +877,17 @@ public void evaluate() throws Throwable { | |||
@Override | |||
public void evaluate() throws Throwable { | |||
WorkflowRun run = story.j.jenkins.getItemByFullName(jobName, WorkflowJob.class).getLastBuild(); | |||
if (run == null) { | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happened in testing -- if the fuzzer kills things too early the Run may not even have had time to be created yet (rare but happens).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably deserves a comment to that effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(resolved)
FlowExecution exec = listener.get(); | ||
while(exec.getCurrentHeads().isEmpty() || (exec.getCurrentHeads().get(0) instanceof FlowStartNode)) { // Wait until input step starts | ||
System.out.println("Waiting for input step to begin"); | ||
Thread.sleep(50); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. SemaphoreStep
is generally preferable for these kinds of tests. Is there a good reason to use input
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jglick Yes, Semaphore steps block the CPS thread - input steps follow a more normal pattern. The durability tests use sleep + semaphore for similar reasons (though sleep should be replaced with input to reduce test times in the near future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, SemaphoreStep
does not block the CPS thread. Semantically it is very similar to input
, but much easier to manipulate from test code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jglick It definitely blocks something and the GraphListeners don't evaluate as they ought to.
@abayer I can understand wanting the squash but for this on I actually do want to show the history, because the commit messages were written to help explain why particular things are being done. |
CI failure is environmental from the looks of it |
@@ -729,10 +730,10 @@ public void onLoad(FlowExecutionOwner owner) throws IOException { | |||
try { | |||
if (isComplete()) { | |||
if (done == Boolean.TRUE && !super.isComplete()) { | |||
LOGGER.log(Level.WARNING, "Completed flow without FlowEndNode: "+this+" heads:"+getHeadsAsString()); | |||
LOGGER.log(Level.INFO, "Completed flow without FlowEndNode: "+this+" heads:"+getHeadsAsString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly sounds like it should be a warning to me, but maybe I am missing something?
@@ -865,19 +877,17 @@ public void evaluate() throws Throwable { | |||
@Override | |||
public void evaluate() throws Throwable { | |||
WorkflowRun run = story.j.jenkins.getItemByFullName(jobName, WorkflowJob.class).getLastBuild(); | |||
if (run == null) { | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(resolved)
I was affected by this. Installing Pipeline: Groovy 2.52 did not fix my symptoms:
|
Addresses the following root causes of problems in Pipeline persistence/durability:
CPSThread.head
FlowHead
and itsFlowNode
without ensuring the CpsFlowExecution is saved (viasaveOwner()
that saves theWorkflowRun
). This causes an NPE loading the program, because the FlowHead may map to a head in the CpsFlowExecution that will not exist at time of deserialization.FlowHead#setNewHead(FlowNode)
calls becausesetNewHead
now forces flownodes to be saved before the head is mutated, guaranteeing an always-usable state.CpsFlowExecution#suspendAll()
terminator - may help avoid the occasional ConcurrentModificationException we were seeing at serialize, and is more "correct" and safer generallyTARGETTED AT:
TODO: