Fix more subtle durability issues #223

Merged
merged 35 commits into from May 2, 2018
35 commits
2338f88
Ensure FlowHead always has persisted FlowNodes before mutating it, pe…
svanoort Apr 17, 2018
ee259e9
Extended fuzzing test coverage
svanoort Apr 18, 2018
7c4faa9
Code comments to explain
svanoort Apr 18, 2018
cc61b6f
Save notes on persistence
svanoort Apr 20, 2018
40dd2c4
Force persistence of the Pipeline build when the heads are mutated, p…
svanoort Apr 20, 2018
e173bca
Refactor and harden the fuzz tests
svanoort Apr 20, 2018
6069196
Merge branch 'master' into fix-more-durability-issues-2
svanoort Apr 20, 2018
dde404b
Ensure we always initialize heads for FlowNodes and handle additional…
svanoort Apr 20, 2018
7b9ebe0
Add logging for some odd circumstances with deserializing and intiali…
svanoort Apr 20, 2018
e0f0025
Fix one cause of builds showing up as incomplete when can't be loaded
svanoort Apr 20, 2018
646b1d2
Add a bunch of persistence test cases that will fail because our pers…
svanoort Apr 24, 2018
06e56f2
Additional persistence problems testcases
svanoort Apr 24, 2018
8d1b7f5
Refine testcases a bit more
svanoort Apr 24, 2018
3cbc60c
Obligatory save when ending the execution, error out when stored flow…
svanoort Apr 25, 2018
3fa7b8e
Simply error out loading an execution with unpersisted flownodes and …
svanoort Apr 25, 2018
6ffcf33
Pull in wf-job fix for not setting build result
svanoort Apr 25, 2018
a1083fc
Fix test wait-for-input step and remove test that doesn't do what it …
svanoort Apr 25, 2018
2d3a219
Fix test for bogus done status: set done flag on execution if FlowEnd…
svanoort Apr 25, 2018
12d6c8e
Simplify the flownodeStorage initialization and note that we need to …
svanoort Apr 25, 2018
85e99f1
Mostly-working way to build placeholder flownodes for inconsistently …
svanoort Apr 27, 2018
2ec83cc
Pull in wf-job fixes to notifications and ensure done is set if we ad…
svanoort Apr 27, 2018
d4eb956
Fix result handling for build
svanoort Apr 27, 2018
9c9e958
Comment out unusable test
svanoort Apr 27, 2018
a87587a
Handle fuzzers for durability killing run before it can be saved
svanoort Apr 27, 2018
adc109f
Try to save other FlowExecutions even if one fails
svanoort Apr 30, 2018
fc93844
Make findbugs hush
svanoort Apr 30, 2018
c7f44a7
Update docs
svanoort Apr 30, 2018
7567a0a
Use locking to avoid ConcurrentModificationException on FlowGraph whi…
svanoort May 1, 2018
297d820
Fix typo
svanoort May 1, 2018
b51d1ba
Cleanup
svanoort May 1, 2018
2516111
Review changes
svanoort May 1, 2018
474f325
Update SNAPSHOT
svanoort May 1, 2018
46cd988
Fix findbugs kvetching about synchronization
svanoort May 2, 2018
5e16362
Reduce logging verbosity a bit
svanoort May 2, 2018
d824047
Pick up released workflow-job version
svanoort May 2, 2018
40 changes: 40 additions & 0 deletions doc/persistence.md
@@ -0,0 +1,40 @@
# The Pipeline Persistence Model
Member

Documentation, WUT?

Member Author

I know @jglick, pure absurdity, what can I say?


# Data Model
Running Pipelines are persisted in three pieces:

1. The `FlowNode`s, stored by a `FlowNodeStorage`. These are the FlowNodes created to map to `Step`s and, for block-scoped Steps, to the start and end of each block.
2. The `CpsFlowExecution`, which is currently stored in the `WorkflowRun`. The primary pieces of interest are:
    * heads - the current "tips" of the Flow Graph, i.e. the FlowNodes that represent running steps and are appended to
        - A head maps to a `CpsThread` in the Pipeline program, within the `CpsThreadGroup`
    * starts - the `BlockStartNode`s marking the start(s) of the currently executing blocks
    * scripts - the loaded Pipeline script files (text)
    * persistedClean
        - If true, Pipeline saved its execution cleanly to disk and we *might* be able to resume it
        - If false, something went wrong saving the execution, so we cannot resume even if we'd otherwise be able to
        - If null, the build probably dates back to before this field was added, so we check whether it is running in a highly persistent durability mode (generally `MAX_SURVIVABILITY`)
    * done - if true, this execution completed; if false or unset, the Pipeline is a candidate to resume unless its only head is a `FlowEndNode`
        - The handling of false is for legacy reasons, since the field was only recently made persistent.
    * various other boolean flags and settings for the execution (durability setting, user that started the build, whether it is sandboxed, etc.)
3. The Program - this is the current execution state of the Pipeline
    * This holds the Groovy state - the `CpsThreadGroup` - with runtime calls transformed by CPS so they can persist
    * The `CpsThread`s map to the running branches of the Pipeline
    * The program depends on the FlowNodes from the `FlowNodeStorage`, since it reads them by ID rather than storing them in the program
    * It also depends on the heads in the `CpsFlowExecution`, because its `FlowHead`s are loaded from those heads
    * It also holds the `CpsStepContext`, i.e. variables such as the EnvVars and the Executor and Workspace in use (the latter stored as Pickles)
        - The pickles are specially restored when the Pipeline resumes, since they don't serialize/deserialize normally
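
As an illustration of how the `persistedClean` and `done` flags described above combine, here is a minimal sketch in plain Java. It is not the plugin's code: `SketchDurability`, `SketchExecutionState` and `SketchResumeCheck` are hypothetical stand-ins, heads are reduced to type names, and the real resume logic may check more than this.

```java
import java.util.List;

/** Hypothetical stand-in for the durability setting; not the plugin's real type. */
enum SketchDurability { MAX_SURVIVABILITY, LOWER_DURABILITY }

/** Hypothetical, simplified snapshot of the persisted execution flags. */
final class SketchExecutionState {
    final Boolean persistedClean;      // true, false, or null (field absent in older builds)
    final boolean done;                // false also stands in for "unset"
    final List<String> headNodeTypes;  // simplified: the type name of each head FlowNode
    final SketchDurability durability;

    SketchExecutionState(Boolean persistedClean, boolean done,
                         List<String> headNodeTypes, SketchDurability durability) {
        this.persistedClean = persistedClean;
        this.done = done;
        this.headNodeTypes = headNodeTypes;
        this.durability = durability;
    }
}

final class SketchResumeCheck {
    /** Returns true if the execution is a candidate to resume under the rules listed above. */
    static boolean isResumeCandidate(SketchExecutionState s) {
        if (s.done) {
            return false; // execution completed
        }
        boolean onlyHeadIsEnd = s.headNodeTypes.size() == 1
                && "FlowEndNode".equals(s.headNodeTypes.get(0));
        if (onlyHeadIsEnd) {
            return false; // effectively finished, even though done was never saved as true
        }
        if (s.persistedClean == null) {
            // Build predates the field: fall back to the durability mode.
            return s.durability == SketchDurability.MAX_SURVIVABILITY;
        }
        return s.persistedClean; // false means the save went wrong, so never resume
    }
}
```

Note that a true `persistedClean` only makes the build a candidate; loading the FlowNodes or the program can still fail afterwards.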

## Persistence Issues And Logic

Some basic rules:

1. If the FlowNodeStorage is corrupt, incomplete, or un-persisted, all manner of heck will break loose
    - In terms of Pipeline execution, the impact is like the Resonance Cascade from the Half-Life games
    - The Pipeline can never be resumed (the key piece is missing)
    - Usually we fake up some placeholder FlowNodes to cover this situation and save them
2. Whenever persisting data, the Pipeline *must* have the FlowNodes persisted on disk (generally via `storage.flush()`) in order to be able to load the heads and restore the program.
3. Once we've set persistedClean to false and saved the FlowExecution, it doesn't matter what we do: when trying to resume, the Pipeline will assume it has incomplete persistence data (as with 1). This is how we handle the low-durability modes, to avoid resuming a stale state of the Pipeline simply because we have old data persisted and are not updating it.
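
Rules 2 and 3 can be sketched as an ordering constraint. The classes below are hypothetical in-memory stand-ins (`SketchNodeStorage`, `SketchExecution`), not the plugin's `FlowNodeStorage` or `CpsFlowExecution`; only the `flush()` name is taken from the rule above, and node IDs are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a FlowNodeStorage that buffers nodes until flush(). */
final class SketchNodeStorage {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> onDisk = new ArrayList<>();

    void storeNode(String id) { buffered.add(id); }

    /** Moves every buffered node to the simulated disk. */
    void flush() {
        onDisk.addAll(buffered);
        buffered.clear();
    }

    boolean isPersisted(String id) { return onDisk.contains(id); }
}

/** Hypothetical stand-in for the persisted part of an execution. */
final class SketchExecution {
    final SketchNodeStorage storage = new SketchNodeStorage();
    final List<String> heads = new ArrayList<>();
    Boolean persistedClean = null;  // tri-state, as in the data model
    List<String> savedHeads = null; // what the last simulated save wrote

    /** Rule 2: flush the nodes first, then save heads that refer to them. */
    void saveDurably() {
        storage.flush();
        for (String head : heads) {
            if (!storage.isPersisted(head)) {
                throw new IllegalStateException("Head " + head + " has no persisted FlowNode");
            }
        }
        persistedClean = true;
        savedHeads = new ArrayList<>(heads);
    }

    /** Rule 3: record that whatever is on disk must not be used to resume. */
    void markNotResumable() {
        persistedClean = false;
        savedHeads = new ArrayList<>(heads);
    }
}
```

In this sketch a low-durability run would call `markNotResumable()` once and then skip further saves, so a later load sees `persistedClean == false` rather than trusting stale heads.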
2 changes: 1 addition & 1 deletion pom.xml
@@ -141,7 +141,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-job</artifactId>
-<version>2.20</version>
+<version>2.21</version>
<scope>test</scope>
</dependency>
<dependency>
@@ -120,7 +120,6 @@ class CpsBodyExecution extends BodyExecution {
}

head.setNewHead(sn);
-CpsFlowExecution.maybeAutoPersistNode(sn);

StepContext sc = new CpsBodySubContext(context, sn);
for (BodyExecutionCallback c : callbacks) {
@@ -337,7 +336,6 @@ public Next receive(Object o) {
FlowHead h = CpsThread.current().head;
StepStartNode ssn = addBodyStartFlowNode(h);
h.setNewHead(ssn);
-CpsFlowExecution.maybeAutoPersistNode(ssn);
}

StepEndNode en = addBodyEndFlowNode();
@@ -367,7 +365,6 @@ public Next receive(Object o) {
for (BodyExecutionCallback c : callbacks) {
c.onSuccess(sc, o);
}

Member

could be reverted

return Next.terminate(null);
}
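
The first commit in this PR is titled "Ensure FlowHead always has persisted FlowNodes before mutating it". A minimal sketch of that ordering, assuming the persistence happens inside the head setter so callers cannot forget it; `SketchHeadStorage` and `SketchFlowHead` are hypothetical stand-ins and this is not the plugin's actual `FlowHead.setNewHead` implementation:

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for node storage; persistence is simulated with a set of IDs. */
final class SketchHeadStorage {
    private final Set<String> persisted = new HashSet<>();

    void persist(String nodeId) { persisted.add(nodeId); }

    boolean isPersisted(String nodeId) { return persisted.contains(nodeId); }
}

/** Hypothetical stand-in for a flow head that persists a node before pointing at it. */
final class SketchFlowHead {
    private final SketchHeadStorage storage;
    private String head; // ID of the current head node, null until first set

    SketchFlowHead(SketchHeadStorage storage) { this.storage = storage; }

    void setNewHead(String nodeId) {
        if (nodeId == null) {
            throw new IllegalArgumentException("New head must not be null");
        }
        storage.persist(nodeId); // persist first, so the head never refers to an unsaved node
        this.head = nodeId;      // only then mutate the head
    }

    String get() { return head; }
}
```

With the persistence inside the setter, call sites only need `head.setNewHead(node)`.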
