
Issue 5154: (SegmentStore) Fixed a stats concurrency bug in the Read Index #5170

Merged


andreipaduroiu

Change log description
Fixed a concurrency bug in StreamSegmentReadIndex which could allow index summaries to get out of sync, thus causing the Cache Manager to evict more than needed.

Purpose of the change
Fixes #5154.

What the code does
The Cache Manager relies on accurate data from its clients to decide whether to advance its "Oldest Generation" (i.e., tell clients to evict data). If the Oldest Generation reported by the clients does not change as a result of this process, the Cache Manager keeps advancing it in subsequent iterations until it no longer can (i.e., it equals the Current Generation). Once Oldest Generation == Current Generation, cache entries do not live long in the cache, which renders it mostly useless.
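The effect described above can be illustrated with a minimal sketch. It is not the actual `CacheManager` code: the class `GenerationAdvanceSketch`, the method `advanceOldest` and the `evictAndReport` callback are hypothetical names, and the "subsequent iterations" are collapsed into a single loop for brevity. It only shows that a client whose reported oldest generation never moves drives the Oldest Generation all the way up to the Current Generation.

```java
import java.util.function.IntUnaryOperator;

public class GenerationAdvanceSketch {
    /**
     * Simplified, hypothetical model of the manager's behavior (not Pravega's implementation).
     *
     * @param oldest         the manager's Oldest Generation before this run
     * @param current        the manager's Current Generation
     * @param reportedOldest the oldest generation the client reported before eviction
     * @param evictAndReport asks the client to evict below the given cutoff and
     *                       returns the oldest generation the client reports afterwards
     * @return the manager's Oldest Generation after this run
     */
    public static int advanceOldest(int oldest, int current, int reportedOldest,
                                    IntUnaryOperator evictAndReport) {
        while (oldest < current) {
            oldest++;
            int reportedAfter = evictAndReport.applyAsInt(oldest);
            if (reportedAfter != reportedOldest) {
                break; // the client's view changed, so eviction had an effect
            }
            // Otherwise nothing appeared to change: keep advancing.
        }
        return oldest;
    }

    public static void main(String[] args) {
        // Client with accurate stats: its oldest generation follows the cutoff.
        System.out.println(advanceOldest(2, 10, 2, cutoff -> cutoff)); // prints 3
        // Client with out-of-sync stats: it keeps reporting generation 2.
        System.out.println(advanceOldest(2, 10, 2, cutoff -> 2));      // prints 10
    }
}
```

In the second call the Oldest Generation ends up equal to the Current Generation, which is the over-eviction scenario this PR addresses.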

The StreamSegmentReadIndex keeps a parallel statistic of how many entries it has per generation. It does so because it can hold tens, if not hundreds, of thousands of entries, and querying them every time the Cache Manager asks would be too time-consuming. This statistic must be updated in lockstep with the rest of the index (upon inserts, removals, appends and retrievals). For consistency, all of these updates (except evictions) should be made while holding the read index lock, since they may change the Generation of an entry (i.e., the entry was touched). It seems that one code path was not doing so under the lock, which meant that a "touch" could alter the stats of the wrong generation if a concurrent touch or update executed at the same time.

This change moves the problematic stats update under the read index lock, which ensures that no other update on that entry may produce adverse side effects.
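As a rough illustration of the idea (not the actual `StreamSegmentReadIndex` or `ReadIndexSummary` code), the sketch below keeps a per-generation count next to a map of entries and performs the "touch" bookkeeping entirely inside the same lock as every other update. All names here (`ReadIndexSketch`, `Entry`, `touch`, `isConsistent`) are hypothetical and chosen for the example only.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadIndexSketch {
    /** Hypothetical index entry; only tracks the generation it was last used in. */
    private static final class Entry {
        int generation;
        Entry(int generation) { this.generation = generation; }
    }

    private final Object lock = new Object();
    private final Map<Long, Entry> entries = new HashMap<>();      // offset -> entry
    private final Map<Integer, Integer> summary = new HashMap<>(); // generation -> entry count
    private int currentGeneration = 0;

    public void setCurrentGeneration(int generation) {
        synchronized (lock) { currentGeneration = generation; }
    }

    public void insert(long offset) {
        synchronized (lock) {
            Entry previous = entries.put(offset, new Entry(currentGeneration));
            if (previous != null) {
                decrement(previous.generation);
            }
            summary.merge(currentGeneration, 1, Integer::sum);
        }
    }

    /** Moves the entry to the current generation; entry and summary change under one lock. */
    public void touch(long offset) {
        synchronized (lock) {
            Entry e = entries.get(offset);
            if (e == null) {
                return;
            }
            // Reading the old generation and updating both counters happen atomically,
            // so a concurrent touch cannot make us decrement the wrong generation.
            decrement(e.generation);
            e.generation = currentGeneration;
            summary.merge(e.generation, 1, Integer::sum);
        }
    }

    public int count(int generation) {
        synchronized (lock) { return summary.getOrDefault(generation, 0); }
    }

    /** Recomputes the counts from the entries and compares them with the summary. */
    public boolean isConsistent() {
        synchronized (lock) {
            Map<Integer, Integer> recomputed = new HashMap<>();
            for (Entry e : entries.values()) {
                recomputed.merge(e.generation, 1, Integer::sum);
            }
            return recomputed.equals(summary);
        }
    }

    // Caller must hold the lock. Removes the key once its count reaches zero.
    private void decrement(int generation) {
        summary.computeIfPresent(generation, (g, n) -> n > 1 ? n - 1 : null);
    }

    public static void main(String[] args) throws InterruptedException {
        ReadIndexSketch index = new ReadIndexSketch();
        for (long offset = 0; offset < 100; offset++) {
            index.insert(offset);
        }
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    index.setCurrentGeneration(i % 8);
                    index.touch(i % 100);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("summary consistent: " + index.isConsistent());
    }
}
```

If the counter updates in `touch` were performed after leaving the `synchronized` block, two threads touching the same entry could both read the same old generation and decrement it twice, which is the kind of drift described above.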

NOTE: there may be more efficient ways of operating on the Read Index, probably with finer-grained locks. However, this has not posed a problem so far and hasn't gotten in our way. The goal of this PR is to fix a bug, not to improve theoretical performance.

How to verify it
All tests must pass. Unfortunately, the nature of this change means it cannot be tested via unit tests.


Signed-off-by: Andrei Paduroiu <andrei.paduroiu@emc.com>

codecov bot commented Sep 11, 2020

Codecov Report

Merging #5170 into master will decrease coverage by 0.02%.
The diff coverage is 85.71%.


@@             Coverage Diff              @@
##             master    #5170      +/-   ##
============================================
- Coverage     84.78%   84.75%   -0.03%     
- Complexity    12229    12233       +4     
============================================
  Files           795      795              
  Lines         45067    45066       -1     
  Branches       4693     4693              
============================================
- Hits          38210    38196      -14     
- Misses         4364     4376      +12     
- Partials       2493     2494       +1     
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...a/io/pravega/segmentstore/server/CacheManager.java | 83.33% <0.00%> (ø) | 49.00 <0.00> (ø) |
| .../segmentstore/server/reading/ReadIndexSummary.java | 91.66% <100.00%> (+0.36%) | 9.00 <2.00> (+1.00) |
| ...ntstore/server/reading/StreamSegmentReadIndex.java | 81.48% <100.00%> (-0.08%) | 116.00 <0.00> (ø) |
| ...ravega/client/control/impl/CancellableRequest.java | 82.14% <0.00%> (-7.15%) | 12.00% <0.00%> (-1.00%) |
| ...o/pravega/client/stream/impl/ReaderGroupState.java | 91.32% <0.00%> (-2.32%) | 48.00% <0.00%> (ø%) |
| ...ga/client/connection/impl/TcpClientConnection.java | 88.32% <0.00%> (-2.19%) | 19.00% <0.00%> (ø%) |
| ...a/segmentstore/server/logs/OperationProcessor.java | 88.04% <0.00%> (-2.00%) | 35.00% <0.00%> (-2.00%) |
| ...main/java/io/pravega/storage/hdfs/HDFSStorage.java | 73.33% <0.00%> (+0.63%) | 54.00% <0.00%> (ø%) |
| ...mentstore/server/host/stat/AutoScaleProcessor.java | 84.89% <0.00%> (+0.71%) | 46.00% <0.00%> (+1.00%) |
| .../server/logs/SegmentMetadataUpdateTransaction.java | 92.73% <0.00%> (+0.85%) | 85.00% <0.00%> (+1.00%) |

... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8adf643...06f7492.

@sachin-j-joshi sachin-j-joshi merged commit b1df1ae into pravega:master Sep 14, 2020
@andreipaduroiu andreipaduroiu deleted the issue-5154-cache-eviction branch September 14, 2020 23:25
andreipaduroiu added a commit to andreipaduroiu/pravega that referenced this pull request Sep 14, 2020
…Index (pravega#5170)

Fixed a concurrency bug in StreamSegmentReadIndex which could allow index summaries to get out of sync, thus causing the Cache Manager to evict more than needed.

Signed-off-by: Andrei Paduroiu <andrei.paduroiu@emc.com>

Co-authored-by: Tom Kaitchuck <tkaitchuck@users.noreply.github.com>
ravisharda pushed a commit that referenced this pull request Sep 15, 2020
…Index (#5170) (#5187)

CherryPick #5170. 
Issue 5154: (SegmentStore) Fixed a stats concurrency bug in the Read Index (#5170)

Signed-off-by: Andrei Paduroiu <andrei.paduroiu@emc.com>
andreipaduroiu added a commit to andreipaduroiu/pravega that referenced this pull request Sep 24, 2020
…Index (pravega#5170)

Fixed a concurrency bug in StreamSegmentReadIndex which could allow index summaries to get out of sync, thus causing the Cache Manager to evict more than needed.

Signed-off-by: Andrei Paduroiu <andrei.paduroiu@emc.com>

Co-authored-by: Tom Kaitchuck <tkaitchuck@users.noreply.github.com>
andreipaduroiu added a commit that referenced this pull request Sep 28, 2020
…ranch r0.7 (#5217)

Cherry-pick #5207, #5154, #5155 and #5119 into branch r0.7:
* Issue 5119: (SegmentStore) Copy-on-Read for Table Segment Compaction
* Issue 5155: (SegmentStore) Enabling copy-on-read for all Segment Reads
* Issue 5154: (SegmentStore) Fixed a stats concurrency bug in the Read Index (#5170)
* Issue 5207: (SegmentStore) Read Index Bug Fixes

Signed-off-by: Andrei Paduroiu <andrei.paduroiu@emc.com>
tkaitchuck added a commit to tkaitchuck/pravega-1 that referenced this pull request Feb 15, 2021
…Index (pravega#5170)

Fixed a concurrency bug in StreamSegmentReadIndex which could allow index summaries to get out of sync, thus causing the Cache Manager to evict more than needed.

Signed-off-by: Andrei Paduroiu <andrei.paduroiu@emc.com>

Co-authored-by: Tom Kaitchuck <tkaitchuck@users.noreply.github.com>
Signed-off-by: Tom Kaitchuck <tom.kaitchuck@emc.com>
Successfully merging this pull request may close these issues.

Possible bug in cache entry management process