Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 6683: LTS - Fix thread visibility issues #6687

Conversation

sachin-j-joshi
Copy link
Contributor

@sachin-j-joshi sachin-j-joshi commented Apr 8, 2022

Signed-off-by: Sachin Joshi sachin.joshi@emc.com

Change log description
Issue 6683: LTS - Fix possible thread visibility issue with SystemJournal serialization.

Purpose of the change
Fixes #6683

What the code does

  1. The default RevisionDataInput::readCollection returns ArrayList which is not thread safe. Return Vector instead.
  2. Replace problematic Collections.synchronizedList(new ArrayList<ExtendedChunkInfo>()); with new Vector() in all SLTS code
  3. Make use of iterator with Futures.loop more thread-safe

Note -

  • It is possible in rare case that the data updated by one thread is not visible to another thread. This can lead to validation failure for snapshot.
  • On my machine I can repro about 1 in 10000 times. Fix is to use synchronized list (eg. Vector).
  • This kind of error does not mean there is a race to read/update same variable at the same time from multiple threads, instead it means that when different lamdas are executed serially one by one they can be executed on different thread and potentially different processor cores. That means data cached in first threads processor is not yet flushed to main memory, so second processor reads outdated value from main memory. By using volatile fields (via Vector) data guaranteed to be flushed correctly.

How to verify it
All tests should pass.

@sachin-j-joshi sachin-j-joshi changed the title Issue 6683: LTS - Fix thread visibility issue Issue 6683: LTS - Fix thread visibility issue in SystemJournal serialization Apr 8, 2022
@sachin-j-joshi sachin-j-joshi changed the title Issue 6683: LTS - Fix thread visibility issue in SystemJournal serialization Issue 6683: LTS - Fix thread visibility issue with SystemJournal serialization Apr 8, 2022
@codecov
Copy link

codecov bot commented Apr 9, 2022

Codecov Report

Merging #6687 (f465f9b) into master (0205da8) will increase coverage by 0.00%.
The diff coverage is 97.29%.

@@            Coverage Diff            @@
##             master    #6687   +/-   ##
=========================================
  Coverage     86.35%   86.36%           
- Complexity    15805    15808    +3     
=========================================
  Files          1024     1024           
  Lines         59036    59035    -1     
  Branches       5968     5967    -1     
=========================================
+ Hits          50979    50984    +5     
+ Misses         4933     4925    -8     
- Partials       3124     3126    +2     
Impacted Files Coverage Δ
...segmentstore/storage/chunklayer/SystemJournal.java 87.70% <96.29%> (+0.11%) ⬆️
...gmentstore/storage/chunklayer/ConcatOperation.java 97.22% <100.00%> (ø)
...tstore/storage/chunklayer/DefragmentOperation.java 92.55% <100.00%> (ø)
...segmentstore/storage/chunklayer/ReadOperation.java 88.78% <100.00%> (ø)
...entstore/storage/chunklayer/TruncateOperation.java 94.71% <100.00%> (ø)
.../segmentstore/storage/chunklayer/UtilsWrapper.java 97.01% <100.00%> (ø)
...egmentstore/storage/chunklayer/WriteOperation.java 94.19% <100.00%> (ø)
...hared/security/crypto/StrongPasswordProcessor.java 87.50% <0.00%> (-6.25%) ⬇️
...gmentstore/server/tables/WriterTableProcessor.java 86.45% <0.00%> (-1.05%) ⬇️
...segmentstore/server/logs/DataFrameInputStream.java 78.57% <0.00%> (-0.80%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0205da8...f465f9b. Read the comment docs.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
@sachin-j-joshi sachin-j-joshi force-pushed the issue-6683-LTS-fix-system-jounal-S3-flaky-test branch from e2f3635 to 366b900 Compare April 9, 2022 00:12
@sachin-j-joshi
Copy link
Contributor Author

image

Copy link
Contributor

@RaulGracia RaulGracia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that there could be other points in SLTS that may suffer from the same issue? Also, please run system tests on this change.

@sachin-j-joshi
Copy link
Contributor Author

We did fix several of these during beta testing - #5665
Looks like I still missed a couple places back then. I'll double check if there is anything else I missed.

@RaulGracia
Copy link
Contributor

@sachin-j-joshi apart from running system tests, is there anything missing from this PR? I assume you have synced up with @tkaitchuck on this change, right?

@sachin-j-joshi
Copy link
Contributor Author

sachin-j-joshi commented Apr 20, 2022

Putting it in draft mode -
@tkaitchuck suggested to change it to use Vector instead of Collections.synchronizedList(new ArrayList<>())
But I was able to still repro the failures after I changed it to Vector. So there is still something else that is missing.

@sachin-j-joshi sachin-j-joshi marked this pull request as draft April 20, 2022 17:26
Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
…f Collections.synchronizedList(new ArrayList<>())

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
…e-6683-LTS-fix-system-jounal-S3-flaky-test
@sachin-j-joshi sachin-j-joshi changed the title Issue 6683: LTS - Fix thread visibility issue with SystemJournal serialization Issue 6683: LTS - Fix thread visibility issues Apr 25, 2022
@sachin-j-joshi sachin-j-joshi marked this pull request as ready for review April 25, 2022 23:12
…ronized blocks.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
…nge.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
…e-6683-LTS-fix-system-jounal-S3-flaky-test
Copy link
Member

@tkaitchuck tkaitchuck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with comments

…e-6683-LTS-fix-system-jounal-S3-flaky-test
@sachin-j-joshi
Copy link
Contributor Author

System tests test-pravega-continuous-1/2136/ passed

Copy link
Contributor

@RaulGracia RaulGracia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor questions before merging.

@RaulGracia RaulGracia merged commit e72018d into pravega:master May 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test S3ChunkStorageSystemJournalTests.testSimpleOperationSequence
4 participants