Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix more subtle durability issues #223

Merged
merged 35 commits into from May 2, 2018

Conversation

svanoort
Copy link
Member

@svanoort svanoort commented Apr 20, 2018

Addresses the following root causes of problems in Pipeline persistence/durability:

  • Pipelines which modify the heads in an execution without ensuring the corresponding FlowNodes are also saved to disk - this will cause issues when trying to load the execution -- applies only to the fully durable modes (since non-durable just won't persist period)
  • Pipelines which modify the CPSThread.head FlowHead and its FlowNode without ensuring the CpsFlowExecution is saved (via saveOwner() that saves the WorkflowRun). This causes an NPE loading the program, because the FlowHead may map to a head in the CpsFlowExecution that will not exist at time of deserialization.
  • Provides some documentation on the persistence model so others do not endure my pain
  • Removes explicit persistence calls after FlowHead#setNewHead(FlowNode) calls because setNewHead now forces flownodes to be saved before the head is mutated, guaranteeing an always-usable state.
  • Expand the fuzzTesting cases
  • Consolidate all shutdown-time persistence into the main CpsFlowExecution#suspendAll() terminator - may help avoid the occasional ConcurrentModificationException we were seeing at serialize, and is more "correct" and safer generally

TARGETTED AT:

  • JENKINS-49686 specifically, and I no longer see NPEs thanks to this

TODO:

  • Fix this error seen at shutdown-timer persistence in live testing:

Apr 18, 2018 1:01:08 AM hudson.model.Executor finish1
SEVERE: Executor threw an exception
java.lang.IllegalStateException: Jenkins has not been started, or was already shut down
at jenkins.model.Jenkins.getActiveInstance(Jenkins.java:762)
at com.cloudbees.hudson.plugins.folder.AbstractFolder.getDescriptor(AbstractFolder.java:605)
at jenkins.branch.MultiBranchProject.getDescriptor(MultiBranchProject.java:863)
at jenkins.branch.MultiBranchProject.getDescriptor(MultiBranchProject.java:124)
at com.cloudbees.hudson.plugins.folder.AbstractFolder.childNameGenerator(AbstractFolder.java:597)
at com.cloudbees.hudson.plugins.folder.AbstractFolder.getRootDirFor(AbstractFolder.java:640)
at jenkins.branch.MultiBranchProject.getRootDirFor(MultiBranchProject.java:843)
at jenkins.branch.MultiBranchProject.getRootDirFor(MultiBranchProject.java:124)
at hudson.model.AbstractItem.getRootDir(AbstractItem.java:192)
at hudson.model.Job.getBuildDir(Job.java:845)
at hudson.model.Run.getRootDir(Run.java:1028)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.setResult(WorkflowRun.java:738)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:747)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:325)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:421)

Apr 18, 2018 1:01:08 AM jenkins.plugins.git.AbstractGitSCMSource getCacheDir
SEVERE: Jenkins instance is null in AbstractGitSCMSource.getCacheDir
Apr 18, 2018 1:01:08 AM jenkins.util.AtmostOneTaskExecutor$1 call
WARNING: null
java.lang.NullPointerException
at hudson.model.Queue.maintain(Queue.java:1426)
at hudson.model.Queue$1.call(Queue.java:320)
at hudson.model.Queue$1.call(Queue.java:317)
at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108)
at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
at java.lang.Thread.run(Thread.java:748)

@@ -1901,8 +1928,7 @@ private void checkAndAbortNonresumableBuild() {
/** Ensures that even if we're limiting persistence of data for performance, we still write out data for shutdown. */
@Override
protected void notifyShutdown() {
checkAndAbortNonresumableBuild();
checkpoint();
// No-op, handled in the suspendAll terminator
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoids a ConcurrentModificationException I saw since we ensure threads are done and there's only a single Terminator.

@jglick jglick self-requested a review April 20, 2018 15:31
@jglick
Copy link
Member

jglick commented Apr 20, 2018

The NPE from Queue.maintain should have been addressed by jenkinsci/jenkins#3365 I think. Just set your core dep higher.

@svanoort
Copy link
Member Author

@jglick Yeah, I knew about that one, just including it for completeness here -- the more alarming issue is that the other error blocks the shutdown-time save of the build.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a serious review since I only groggily understand what is being changed here.

@@ -0,0 +1,39 @@
# The Pipeline Persistence Model
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation, WUT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know @jglick, pure absurdity, what can I say?

1. The FlowNodes - stored by a FlowNodeStorage - this holds the FlowNodes created to map to Steps, and for block scoped Steps, start and end of blocks
2. The CpsFlowExecution - this is currently stored in the WorkflowRun, and the primary pieces of interest are:
* heads - the current "tips" of the Flow Graph, i.e. the FlowNodes that represent running steps and are appended to
- A head maps to a CpsThread in the Pipeline program, within the CPSThreadGroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if you put this kind of thing into package-info.java, well the HTML is harder to type than Markdown admittedly, but the @link tags will let you know when you misspell a class name like you did here. FWIW.

2. The CpsFlowExecution - this is currently stored in the WorkflowRun, and the primary pieces of interest are:
* heads - the current "tips" of the Flow Graph, i.e. the FlowNodes that represent running steps and are appended to
- A head maps to a CpsThread in the Pipeline program, within the CPSThreadGroup
* starts - the BlockStartNodes marking the start(s) of the currently executing blocks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprised to find that this has been in the serial form since the beginning of Workflow. Not sure why—this information should be reconstructible from other metadata AFAIK. Maybe I am missing some subtlety.

*
* various other boolean flags & settings for the execution (durability setting, user that started the build, is it sandboxed, etc)
3. The Program -- this is the current execution state of the Pipeline
* This holds the Groovy state (transformed by CPS) plus the CPSThreadGroup for the running branches of the Pipeline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CpsThreadGroup is the Groovy state. There is nothing else.

}
return myHead;
} else {
LOGGER.log(Level.WARNING, "Tried to load a FlowHead from program with no Execution in PROGRAM_STATE_SERIALIZTION");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

}
}
});
story.addStep(new Statement() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or then + lambda


/** Test interrupting build by randomly restarting *cleanly* at unpredictable times and verify we stick pick up and resume. */
@Test
@Ignore //Too long to run as part of main suite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we need a different annotation (categories?) to use to tell Surefire not to run this from mvn test but do run it if explicitly selected via -Dtest=FlowDurabilityTest#fuzzTimingNonDurableWithCleanRestart.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you've proposed something special with the Maven build here, I have a pseudo-sample in one of my browser tabs (somewhere between 10 and 10^3 at the moment). One of those rainy-day projects -- you can always explicitly run via the IDE anyway, or just remove the Ignore annotations

…node data is corrupted rather than continuing
…let it be nulled + fix tests that delete all flownodes
…Node present and save + ensure we don't block a Jenkins restart if the programPromise errors rather than loading
LOGGER.log(Level.WARNING, "Error waiting for Pipeline to suspend: "+exec, ex);
}
}
cpsExec.checkpoint();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglick Any idea why I see ConcurrentModificationExceptions on the FlowGraph when initially trying to flush the FlowNodeStorage at the start of taking the state checkpoint? I feel like at this point our threads should be paused from running the Program but they clearly are not because CpsThreadGroup#run() is calling CpsExecution#onProgramEnd() and finishing the execution. Checking for jenkins.isTerminating() in CpsThreadGroup#scheduleRun before running a chunk doesn't seem to help either.

Is there something I'm missing in how this shutdown works?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wallpapering over it by using locking in the TimingFlowNodeStorage to prevent actual concurrent modification but that's a filthy, dirty hack.

@svanoort svanoort requested review from abayer and jglick May 1, 2018 15:05
@svanoort svanoort changed the title [WIP] Fix more subtle durability issues Fix more subtle durability issues May 1, 2018
@abayer
Copy link
Member

abayer commented May 1, 2018

@svanoort Before I do a detailed review, just thought I'd request a squash before this actually gets merged, so that git blame isn't too painful. =)

Copy link
Member

@abayer abayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me, but I'm not gonna claim I followed all of it perfectly. =)

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few miscellaneous comments, but basically I have little grasp of what is going on here, so…good luck!

pom.xml Outdated
@@ -141,7 +141,8 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-job</artifactId>
<version>2.20</version>
<!-- Snapshot for better handling of onLoad errors -->
<version>2.21-20180427.223321-5</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglick Since you only added that a few weeks ago and thus it's not "obvious" maybe you meant something besides 'pro tip'?

@@ -367,7 +365,6 @@ public Next receive(Object o) {
for (BodyExecutionCallback c : callbacks) {
c.onSuccess(sc, o);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be reverted

@@ -609,95 +610,89 @@ private synchronized String getHeadsAsString() {
} else if (myHeads.size() == 0) {
return "empty-heads";
} else {
return myHeads.entrySet().stream().map(h->h.getKey()+"::"+h.getValue()).collect(Collectors.joining(","));
return myHeads.entrySet().stream().map(h -> h.getKey() + "::" + h.getValue()).collect(Collectors.joining(","));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated

Queue.Executable ex = owner.getExecutable();
if (ex instanceof Run) {
Result res = ((Run)ex).getResult();
setResult(res != null ? res : Result.FAILURE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems weird. Normally null is a synonym for SUCCESS. Why would this not be unconditionally FAILURE?


FlowNode n = storage.getNode(entry.getValue());
if (n != null) {
h.setForDeserialize(storage.getNode(entry.getValue()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of reindentation…not reading.

saveOwner();
}
}

/** Save the owner that holds this execution. */
void saveOwner() {
try {
if (this.owner != null && this.owner.getExecutable() instanceof Saveable) { // Null-check covers some anomalous cases we've seen
Saveable saveable = (Saveable)(this.owner.getExecutable());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could introduce variable

@@ -577,6 +577,9 @@ private void invokeBody(CpsThread cur) {
// the first one can reuse the current thread, but other ones need to create new heads
// we want to do this first before starting body so that the order of heads preserve
// natural ordering.

// FIXME give this javadocs worth a darn, because this is how we create parallel branches and the docs are crypic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO preferably

if (run.isBuilding()) {
try {
story.j.waitUntilNoActivityUpTo(timeOutMillis);
} catch (AssertionError ase) { // Allows attaching a debugger here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete now. Debuggers allow you to set exception breakpoints anyway.

@@ -865,19 +877,17 @@ public void evaluate() throws Throwable {
@Override
public void evaluate() throws Throwable {
WorkflowRun run = story.j.jenkins.getItemByFullName(jobName, WorkflowJob.class).getLastBuild();
if (run == null) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happened in testing -- if the fuzzer kills things too early the Run may not even have had time to be created yet (rare but happens).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably deserves a comment to that effect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(resolved)

FlowExecution exec = listener.get();
while(exec.getCurrentHeads().isEmpty() || (exec.getCurrentHeads().get(0) instanceof FlowStartNode)) { // Wait until input step starts
System.out.println("Waiting for input step to begin");
Thread.sleep(50);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. SemaphoreStep is generally preferable for these kinds of tests. Is there a good reason to use input instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglick Yes, Semaphore steps block the CPS thread - input steps follow a more normal pattern. The durability tests use sleep + semaphore for similar reasons (though sleep should be replaced with input to reduce test times in the near future).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, SemaphoreStep does not block the CPS thread. Semantically it is very similar to input, but much easier to manipulate from test code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglick It definitely blocks something and the GraphListeners don't evaluate as they ought to.

@svanoort
Copy link
Member Author

svanoort commented May 1, 2018

@abayer I can understand wanting the squash but for this on I actually do want to show the history, because the commit messages were written to help explain why particular things are being done.

@svanoort svanoort requested a review from kshultzCB May 1, 2018 22:30
@svanoort svanoort closed this May 2, 2018
@svanoort svanoort reopened this May 2, 2018
@svanoort
Copy link
Member Author

svanoort commented May 2, 2018

CI failure is environmental from the looks of it

@jglick jglick self-requested a review May 2, 2018 13:25
@@ -729,10 +730,10 @@ public void onLoad(FlowExecutionOwner owner) throws IOException {
try {
if (isComplete()) {
if (done == Boolean.TRUE && !super.isComplete()) {
LOGGER.log(Level.WARNING, "Completed flow without FlowEndNode: "+this+" heads:"+getHeadsAsString());
LOGGER.log(Level.INFO, "Completed flow without FlowEndNode: "+this+" heads:"+getHeadsAsString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly sounds like it should be a warning to me, but maybe I am missing something?

@@ -865,19 +877,17 @@ public void evaluate() throws Throwable {
@Override
public void evaluate() throws Throwable {
WorkflowRun run = story.j.jenkins.getItemByFullName(jobName, WorkflowJob.class).getLastBuild();
if (run == null) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(resolved)

@svanoort svanoort merged commit ca4f539 into jenkinsci:master May 2, 2018
@svanoort svanoort deleted the fix-more-durability-issues-2 branch May 2, 2018 21:20
@florent-vial
Copy link

I was affected by this. Installing Pipeline: Groovy 2.52 did not fix my symptoms:

May 05, 2018 1:57:17 PM jenkins.metrics.api.Metrics$HealthChecker execute WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [Running CpsFlowExecution[Owner[MY_PIPELINE_JOB/7521:MY_PIPELINE_JOB #7521]] locked on org.jenkinsci.plugins.workflow.job.WorkflowRun@232fe592 (owned by WorkflowRun.copyLogs [#4] (MY_PIPELINE_JOB #7521)): at org.jenkinsci.plugins.workflow.job.WorkflowRun.save(WorkflowRun.java:1245) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.saveOwner(CpsFlowExecution.java:1925) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.saveExecutionIfDurable(CpsFlowExecution.java:1906) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.removeHead(CpsFlowExecution.java:1082) at org.jenkinsci.plugins.workflow.cps.FlowHead.remove(FlowHead.java:177) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.subsumeHead(CpsFlowExecution.java:1097) at org.jenkinsci.plugins.workflow.cps.CpsStepContext$2.onSuccess(CpsStepContext.java:436) at org.jenkinsci.plugins.workflow.cps.CpsStepContext$2.onSuccess(CpsStepContext.java:395) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:909) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) , WorkflowRun.copyLogs [#4] (MY_PIPELINE_JOB #7521) locked on org.jenkinsci.plugins.workflow.cps.CpsFlowExecution@757d1262 (owned by Running CpsFlowExecution[Owner[MY_PIPELINE_JOB/7521:MY_PIPELINE_JOB #7521]]): at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$ConverterImpl.marshal(CpsFlowExecution.java:1603) at hudson.util.XStream2$AssociatedConverterImpl.marshal(XStream2.java:461) at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:69) at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:58) at com.thoughtworks.xstream.core.AbstractReferenceMarshaller$1.convertAnother(AbstractReferenceMarshaller.java:84) at hudson.util.RobustReflectionConverter.marshallField(RobustReflectionConverter.java:265) at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:252) at hudson.util.RobustReflectionConverter$2.visit(RobustReflectionConverter.java:224) at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:138) at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:209) at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:150) at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:69) at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:58) at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:43) at com.thoughtworks.xstream.core.TreeMarshaller.start(TreeMarshaller.java:82) at com.thoughtworks.xstream.core.AbstractTreeMarshallingStrategy.marshal(AbstractTreeMarshallingStrategy.java:37) at com.thoughtworks.xstream.XStream.marshal(XStream.java:1026) at com.thoughtworks.xstream.XStream.marshal(XStream.java:1015) at com.thoughtworks.xstream.XStream.toXML(XStream.java:988) at hudson.XmlFile.write(XmlFile.java:193) at org.jenkinsci.plugins.workflow.support.PipelineIOUtils.writeByXStream(PipelineIOUtils.java:30) at org.jenkinsci.plugins.workflow.job.WorkflowRun.save(WorkflowRun.java:1256) at org.jenkinsci.plugins.workflow.job.WorkflowRun.saveWithoutFailing(WorkflowRun.java:1236) at org.jenkinsci.plugins.workflow.job.WorkflowRun.copyLogs(WorkflowRun.java:612) at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$500(WorkflowRun.java:144) at org.jenkinsci.plugins.workflow.job.WorkflowRun$3.run(WorkflowRun.java:410) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants