
Obtain a lockfile before we write pickle data. #190

Merged
1 commit merged on Aug 18, 2021

Conversation

adamgoossens (Contributor)

Without this it's possible for two concurrent PSR runs to overwrite each other's pickle files on disk. This will result in the artifacts from one of those runs being lost further down the pipeline.

This ensures that we:

  1. Obtain an exclusive lock around the pickle file to manage concurrent access. If we hold the lock, we can safely read and write the pickle file.
  2. Once we hold the lock, re-read the pickle file from disk before writing out the new one, merging the on-disk data with our in-memory data.
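The lock-then-merge sequence above can be sketched as follows. This is a minimal illustration using POSIX advisory locks from the standard `fcntl` module; the function name and the `merge` callback are hypothetical, not the actual PSR API:

```python
import fcntl
import os
import pickle

def write_results_locked(pickle_path, in_memory_results, merge):
    """Merge in-memory results with on-disk results under an exclusive lock.

    Locks a separate .lock sidecar file so that acquiring the lock
    never touches (or truncates) the pickle itself.  `merge` is a
    caller-supplied function taking (on_disk, in_memory) and returning
    the merged result.
    """
    lock_path = pickle_path + '.lock'
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # blocks until we own the lock
        try:
            # Re-read the pickle *after* acquiring the lock, so we see
            # anything a concurrent run wrote in the meantime.
            on_disk = {}
            if os.path.exists(pickle_path) and os.path.getsize(pickle_path) > 0:
                with open(pickle_path, 'rb') as f:
                    on_disk = pickle.load(f)
            merged = merge(on_disk, in_memory_results)
            with open(pickle_path, 'wb') as f:
                pickle.dump(merged, f)
            return merged
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because the on-disk state is re-read while the lock is held, two concurrent runs serialize their read-merge-write cycles and neither overwrites the other's results.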

codecov bot commented Aug 11, 2021

Codecov Report

Merging #190 (539020e) into main (e41d9c9) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #190   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           76        76           
  Lines         3109      3140   +31     
=========================================
+ Hits          3109      3140   +31     
Flag Coverage Δ
pytests 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ploigos_step_runner/results/step_result.py 100.00% <100.00%> (ø)
src/ploigos_step_runner/results/workflow_result.py 100.00% <100.00%> (ø)
src/ploigos_step_runner/step_runner.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e41d9c9...539020e.

itewk (Contributor) commented Aug 11, 2021

@adamgoossens unless I am reading this wrong, we are only going to be locking the file just before we write. But if we want to prevent one step overwriting another step, we have to:

  1. lock
  2. read
  3. write
  4. unlock

currently this can happen:

  1. psr step 1 - read pickle
  2. psr step 2 - read pickle
  3. psr step 1 - update step results in memory
  4. psr step 2 - update step results in memory
  5. psr step 1 - get lock
  6. psr step 2 - block on lock
  7. psr step 1 - write pickle with psr step 1 updates but not psr step 2 updates since it was read before the lock
  8. psr step 1 - release lock
  9. psr step 2 - get lock
  10. psr step 2 - write pickle with psr step 2 updates, but not psr step 1 updates since it was read before the lock and the psr 1 updates
  11. psr step 2 - release lock
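The lost-update scenario above can be demonstrated without real files: it is an ordinary read-modify-write race. Below is an illustrative stand-in (not PSR code) where a lock-guarded dict plays the role of the on-disk pickle; `update_racy` writes a snapshot taken before the lock (steps 7 and 10 above), while `update_safe` follows the lock-read-write-unlock ordering:

```python
import threading

class ResultStore:
    """Stand-in for the on-disk pickle: a dict guarded by a lock."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def update_racy(self, key, value, snapshot):
        # BROKEN: writes back a snapshot read *before* the lock was taken,
        # clobbering whatever another step wrote in the meantime.
        with self._lock:
            snapshot[key] = value
            self._data = dict(snapshot)

    def update_safe(self, key, value):
        # CORRECT: lock, (re-)read, write, unlock -- the merge happens
        # on data read while holding the lock.
        with self._lock:
            merged = dict(self._data)
            merged[key] = value
            self._data = merged

    def snapshot(self):
        with self._lock:
            return dict(self._data)
```

With `update_racy`, two steps that each snapshot the empty store and then write will leave only the last writer's key behind; with `update_safe`, both keys survive regardless of interleaving.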

@adamgoossens adamgoossens force-pushed the support-concurrent-pickling branch 2 times, most recently from eb7211e to ead0c87 on August 11, 2021 23:47
adamgoossens (Contributor, Author) commented Aug 11, 2021

The sequence of events is:

  1. StepRunner acquires an exclusive lock on the pickle file.
  2. WorkflowResult.write_to_pickle_file will read the on-disk pickle, add any in-memory StepResult objects that are missing, then pickle to disk.
  3. The YAML file is also written to disk, whilst continuing to hold the pickle lock.
  4. StepRunner releases the exclusive lock.
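The hold-the-lock-across-both-writes sequence can be sketched as a context manager. The names below (`pickle_file_lock` and the YAML-writing call in the commented flow) are illustrative assumptions, not the actual StepRunner API:

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def pickle_file_lock(lock_path):
    """Exclusive advisory lock held for the duration of the with-block."""
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # 1. acquire exclusive lock
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)  # 4. release on exit

# Illustrative flow matching the four steps above:
#
# with pickle_file_lock('workflow-result.pkl.lock'):
#     workflow_result.write_to_pickle_file(pickle_path)  # 2. merge + pickle
#     write_yaml_file(yaml_path)                         # 3. YAML while locked
# # lock released when the with-block exits
```

Writing the YAML inside the same `with` block guarantees the YAML snapshot matches the pickle it was derived from.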

@adamgoossens adamgoossens marked this pull request as draft August 13, 2021 00:09
@adamgoossens adamgoossens force-pushed the support-concurrent-pickling branch 2 times, most recently from 075d5ef to a742e84 on August 15, 2021 04:06
@adamgoossens adamgoossens marked this pull request as ready for review August 15, 2021 04:15
itewk (Contributor) commented Aug 16, 2021

@adamgoossens it's looking really good. Just a couple of nitpicks.

Without this it's possible for two concurrent PSR runs to overwrite
each other's pickle files on disk. This will result in the artifacts
from one of those runs being lost further down the pipeline.

This ensures that we:
1) re-read the pickle file from disk before writing out the new one,
   merging the on-disk data with our in-memory data.
2) add an exclusive lock around the pickle file to manage concurrent
   access.

We also include a new StepResult.merge method that handles
merging two StepResults together if they have the same step name,
sub-step name and environment. The StepResult passed to merge takes
priority for any duplicate artifact or evidence keys.
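A sketch of how such a merge method might look. The class below is a simplified stand-in based only on the description in this commit message; field names are assumptions, not the actual `StepResult` implementation:

```python
class StepResult:
    """Simplified stand-in for PSR's StepResult, enough to show the merge."""

    def __init__(self, step_name, sub_step_name, environment=None):
        self.step_name = step_name
        self.sub_step_name = sub_step_name
        self.environment = environment
        self.artifacts = {}
        self.evidence = {}

    def merge(self, other):
        """Merge `other` into self; `other` wins on duplicate keys.

        Only StepResults with the same step name, sub-step name and
        environment may be merged.
        """
        if (other.step_name, other.sub_step_name, other.environment) != \
           (self.step_name, self.sub_step_name, self.environment):
            raise ValueError(
                'can only merge StepResults for the same '
                'step, sub-step and environment'
            )
        # dict.update gives the passed-in result priority on duplicates
        self.artifacts.update(other.artifacts)
        self.evidence.update(other.evidence)
```

Letting the passed-in result win on duplicates matches the write path: the freshly produced in-memory result overrides stale on-disk entries for the same key.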

adamgoossens (Contributor, Author)
@itewk the last of the nits was resolved. I also re-added the use of the .lock suffix after discovering a bug: the open() call used to acquire the lock was opening the pickle file itself in write mode, which truncated it on disk.

Otherwise I think the rest is done. Let me know :)
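The truncation bug described above is easy to reproduce: opening a file in `'w'` mode truncates it the moment `open()` returns, before any lock is taken, which is why the lock belongs on a separate `.lock` sidecar file. A minimal sketch (function names are hypothetical, for illustration only):

```python
import fcntl

def acquire_lock_wrong(pickle_path):
    # BUG: mode 'w' truncates pickle_path as soon as open() returns,
    # destroying the on-disk data before the lock is even acquired.
    f = open(pickle_path, 'w')
    fcntl.flock(f, fcntl.LOCK_EX)
    return f

def acquire_lock_right(pickle_path):
    # FIX: lock a separate sidecar file; the pickle itself is untouched
    # until we deliberately rewrite it while holding the lock.
    f = open(pickle_path + '.lock', 'w')
    fcntl.flock(f, fcntl.LOCK_EX)
    return f
```

Closing the returned file object releases the advisory lock.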

@itewk itewk requested a review from dwinchell August 17, 2021 12:12
itewk (Contributor) commented Aug 17, 2021

@adamgoossens thanks so much. Since this is a more involved/core change I would like @dwinchell to give it a look over too.

dwinchell (Contributor) left a comment:

lgtm

@itewk itewk merged commit 0a3c13f into ploigos:main Aug 18, 2021
@adamgoossens adamgoossens deleted the support-concurrent-pickling branch August 18, 2021 21:53
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

Parallel step processing results in missing step artifacts/evidence later in the pipeline

3 participants