
Issue 5673: (SLTS) Fix wrong GC config #5674

Merged

Conversation

sachin-j-joshi
Contributor

Signed-off-by: Sachin Joshi sachin.joshi@emc.com

Change log description
Change unit from seconds to milliseconds.
Also change log.info to log.debug to reduce noise.

Purpose of the change
Fixes #5673

What the code does

  • Change default setting from 60 seconds to 10 milliseconds. (Note that every 10 milliseconds a new iteration starts on its dedicated thread, waits for items to become available for GC, and deletes at most 10 chunks per iteration, so a small wait time between iterations is needed.)
  • Fix spelling mistake.
  • Also change log.info to log.debug to reduce noise.
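The configuration change can be illustrated with a minimal sketch. Pravega's real `Property` class is more elaborate than this stand-in, and the renamed key `garbage.collection.sleep.millis` is an assumption for illustration (the PR text only shows the old `garbage.collection.sleep.seconds` key):

```java
import java.util.Objects;

// Simplified stand-in for Pravega's Property<T> (an assumption, not the real class).
final class Property<T> {
    private final String name;
    private final T defaultValue;

    private Property(String name, T defaultValue) {
        this.name = Objects.requireNonNull(name);
        this.defaultValue = defaultValue;
    }

    static <T> Property<T> named(String name, T defaultValue) {
        return new Property<>(name, defaultValue);
    }

    String getName() { return name; }
    T getDefaultValue() { return defaultValue; }
}

public class GcConfigSketch {
    // Before this PR: the unit was seconds, with a 60-second default.
    static final Property<Integer> OLD_SLEEP =
            Property.named("garbage.collection.sleep.seconds", 60);
    // After (as described above): milliseconds, with a 10 ms default.
    // The exact renamed key is a hypothetical name for this sketch.
    static final Property<Integer> NEW_SLEEP =
            Property.named("garbage.collection.sleep.millis", 10);

    public static void main(String[] args) {
        System.out.println(OLD_SLEEP.getName() + " = " + OLD_SLEEP.getDefaultValue());
        System.out.println(NEW_SLEEP.getName() + " = " + NEW_SLEEP.getDefaultValue());
    }
}
```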

How to verify it
No regression

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
@codecov

codecov bot commented Feb 8, 2021

Codecov Report

Merging #5674 (61ce70b) into master (96be552) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #5674      +/-   ##
============================================
+ Coverage     84.73%   84.74%   +0.01%     
- Complexity    13884    13889       +5     
============================================
  Files           917      917              
  Lines         51445    51441       -4     
  Branches       5305     5305              
============================================
+ Hits          43592    43595       +3     
+ Misses         4962     4959       -3     
+ Partials       2891     2887       -4     
Impacted Files Coverage Δ Complexity Δ
...torage/chunklayer/ChunkedSegmentStorageConfig.java 98.14% <100.00%> (ø) 3.00 <0.00> (ø)
...mentstore/storage/chunklayer/GarbageCollector.java 93.93% <100.00%> (-0.18%) 46.00 <7.00> (-1.00)
...a/segmentstore/server/logs/OperationProcessor.java 87.83% <0.00%> (-1.91%) 41.00% <0.00%> (-2.00%)
...o/pravega/client/stream/impl/ReaderGroupState.java 92.04% <0.00%> (-1.27%) 49.00% <0.00%> (ø%)
...ver/containers/StreamSegmentContainerMetadata.java 97.45% <0.00%> (+0.63%) 54.00% <0.00%> (+1.00%)
...egmentstore/server/reading/ContainerReadIndex.java 86.62% <0.00%> (+1.16%) 41.00% <0.00%> (+1.00%)
...gmentstore/storage/metadata/BaseMetadataStore.java 88.43% <0.00%> (+1.25%) 96.00% <0.00%> (+1.00%)
...pravega/client/connection/impl/CommandEncoder.java 89.47% <0.00%> (+1.75%) 33.00% <0.00%> (ø%)
...segmentstore/storage/chunklayer/ReadOperation.java 90.36% <0.00%> (+1.80%) 39.00% <0.00%> (+1.00%)
...a/segmentstore/server/logs/MemoryStateUpdater.java 77.02% <0.00%> (+2.70%) 14.00% <0.00%> (ø%)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96be552...61ce70b. Read the comment docs.

@@ -41,8 +41,8 @@
public static final Property<Integer> GARBAGE_COLLECTION_DELAY = Property.named("garbage.collection.delay.seconds", 60);
public static final Property<Integer> GARBAGE_COLLECTION_MAX_CONCURRENCY = Property.named("garbage.collection.concurrency.max", 10);
public static final Property<Integer> GARBAGE_COLLECTION_MAX_QUEUE_SIZE = Property.named("garbage.collection.queue.size.max", 16 * 1024);
public static final Property<Integer> GARBAGE_COLLECTION_SLEEP = Property.named("garbage.collection.sleep.seconds", 60);
Contributor


This is going to flood our logs if we ever want to use DEBUG level for a period of time to get more detail about the activity of the system. In our system tests, for instance, this will make the logs much larger without need. It also looks like a very low interval for something that is supposed to periodically delete garbage, which will induce a significant amount of CPU usage without a clear justification. I could agree on lowering this value, but anything below 1 second deserves a justification that I don't see in the PR.

Contributor Author


  1. GC polling runs on its own dedicated thread.
  2. During each iteration the code blocks until at least one metadata item is eligible for deletion. (An iteration does not finish until there is something to delete, so there are no empty iterations creating lots of useless logs.)
  3. Each iteration deletes at most GARBAGE_COLLECTION_MAX_CONCURRENCY items.
  4. However, when there are many chunks to delete, the queue fills up quickly if we don't drain it fast enough.
  5. By reducing the interval between iterations we respond to a growing delete queue faster; at the same time, when the queue is empty we stay blocked, so the delete speed adjusts itself to the queue length.
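The five points above can be sketched as a single iteration over a blocking queue. This is an illustrative simplification, not the actual GarbageCollector code, and all names in it (`runOneIteration`, `gcQueue`, `maxConcurrency`) are assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the GC iteration behavior described above (not Pravega code).
public class GcLoopSketch {
    // Returns the number of chunks deleted in this iteration.
    public static int runOneIteration(BlockingQueue<String> gcQueue,
                                      int maxConcurrency) throws InterruptedException {
        // Block until at least one item is eligible (point 2): no empty iterations.
        String first = gcQueue.take();
        delete(first);
        int deleted = 1;
        // Drain up to maxConcurrency items in total (point 3), without blocking again.
        String next;
        while (deleted < maxConcurrency && (next = gcQueue.poll()) != null) {
            delete(next);
            deleted++;
        }
        return deleted;
    }

    private static void delete(String chunk) {
        // The real code would delete the chunk from storage; here it is a no-op.
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 25; i++) {
            queue.add("chunk-" + i);
        }
        int deleted = runOneIteration(queue, 10);
        System.out.println("deleted " + deleted + " chunks, " + queue.size() + " remaining");
        // prints: deleted 10 chunks, 15 remaining
    }
}
```

A scheduler would re-run `runOneIteration` after the configured sleep (10 ms by default in this PR), so a deep queue drains quickly while an empty queue simply blocks.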

Contributor


Ok, so my next question is: if the behavior of the GC thread is blocking as you mention (this could be mentioned in the parameter javadoc, by the way), what is the point of having a sleep at all? That is, it will delete chunks as long as there are chunks to delete, or it will remain blocked otherwise. This parameter seems a candidate for removal, so we could simplify the configuration of SLTS.

Contributor Author


This parameter allows us to throttle the rate at which background delete tasks run. However, I don't know what a good value is; in such cases it's better to have a config value that you can tweak and experiment with.
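The trade-off being debated here can be made concrete: with a per-iteration cap and a sleep between iterations, the steady-state delete rate is bounded by maxConcurrency divided by the sleep interval. A back-of-the-envelope sketch, using the defaults discussed in this PR (the method name and the fact that delete time is ignored are simplifying assumptions):

```java
// Back-of-the-envelope bound on GC delete throughput (illustrative, not Pravega code).
public class GcThroughputBound {
    // Upper bound when the queue is never empty. This ignores the time the
    // deletes themselves take, so real throughput is lower.
    static double boundChunksPerSecond(int maxConcurrency, int sleepMillis) {
        return maxConcurrency * (1000.0 / sleepMillis);
    }

    public static void main(String[] args) {
        // New default discussed in this PR: 10 chunks per iteration, 10 ms sleep.
        System.out.println("new bound: " + boundChunksPerSecond(10, 10) + " chunks/sec");
        // Old default: 10 chunks per iteration, 60-second sleep.
        System.out.println("old bound: " + boundChunksPerSecond(10, 60_000) + " chunks/sec");
    }
}
```

With the old 60-second sleep the drain rate was capped at roughly 0.17 chunks/sec, which explains why the queue could fill up under heavy delete load; the 10 ms sleep raises the cap to about 1000 chunks/sec while the blocking take still idles the thread when there is nothing to delete.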

@RaulGracia RaulGracia changed the title Issue5673: (SLTS) Fix wrong GC config. Issue 5673: (SLTS) Fix wrong GC config. Feb 10, 2021
@RaulGracia RaulGracia changed the title Issue 5673: (SLTS) Fix wrong GC config. Issue 5673: (SLTS) Fix wrong GC config Feb 10, 2021
@RaulGracia RaulGracia merged commit 21ce6f3 into pravega:master Feb 11, 2021
tkaitchuck pushed a commit to tkaitchuck/pravega-1 that referenced this pull request Feb 15, 2021
Changed SLTS GC time unit from seconds to milliseconds. Also changed the level of some log messages to reduce noise.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
tkaitchuck pushed a commit to tkaitchuck/pravega-1 that referenced this pull request Feb 15, 2021
Changed SLTS GC time unit from seconds to milliseconds. Also changed the level of some log messages to reduce noise.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
Signed-off-by: Tom Kaitchuck <tom.kaitchuck@emc.com>
co-jo pushed a commit to co-jo/pravega that referenced this pull request Feb 16, 2021
Changed SLTS GC time unit from seconds to milliseconds. Also changed the level of some log messages to reduce noise.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
sachin-j-joshi added a commit to sachin-j-joshi/pravega that referenced this pull request Feb 23, 2021
Changed SLTS GC time unit from seconds to milliseconds. Also changed the level of some log messages to reduce noise.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
co-jo pushed a commit to co-jo/pravega that referenced this pull request Mar 1, 2021
Changed SLTS GC time unit from seconds to milliseconds. Also changed the level of some log messages to reduce noise.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
co-jo pushed a commit to co-jo/pravega that referenced this pull request Mar 9, 2021
Changed SLTS GC time unit from seconds to milliseconds. Also changed the level of some log messages to reduce noise.

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
sachin-j-joshi added a commit that referenced this pull request Mar 19, 2021
Cherry-picks the following changes into 0.9

Issue 5518: (SLTS) Fix Integer overflow. (#5520)
Issue 5606: BoundedInputStream::markSupported() should always return false.. (#5615)
Issue 5597 : ExtendedS3ChunkStorage and ExtendedS3Storage should close S3Client. (#5598)
Issue 5535: (SegmentStore) Refactoring ExtendedS3 Test Mocks (#5537)
Issue 5456: (SLTS) - Handle possible failure during deletion of metadata keys can cause version mismatch in next attempt (#5457)
Issue 5458: (SLTS) For read operation, read all chunks in parallel. (#5459)
Issue 4967: (SLTS) Add createWithContent overload that creates object and writes to it in a single call. (#5455)
Issue 5664: SLTS - fix possible thread visibility issues.. (#5665)
Issue 5475: Eliminate unnecessary pessimistic checks. (#5662)
Issue 5570: (SLTS) AsyncBaseChunkStorage latency should not include time spent in waiting to be executed. (#5661)
Issue 4967: (SLTS) No need to check last chunk length for claimOwnership for no append mode. (#5663)
Issue 5673: (SLTS) Fix wrong GC config (#5674)
Issue 5737: (SLTS) Improve metrics (#5746)
Issue 5460: (SLTS) Add read index virtual block entries to metadata. (#5461)
Issue 5772: (SLTS) - BaseMetadataStore does not evict entries from buffer. (#5773)
Issue 5798: (SLTS) Fix close in GarbageCollector, ExtendedS3ChunkStorage and HDFSChunkStorage. (#5800)
Issue 5788: (SLTS) Remove ACL operations from ExtendedS3ChunkStorage (#5790)
Issue 5808: SLTS - Fix Preconditions checks to include enough information in message. (#5809)
Issue 5853: SLTS - BaseMetadataStore.get does not return deep copy when loading from store. #5858
Issue 5866: (SLTS) ReadIndexCache stats not reported. (#5867)


Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
@sachin-j-joshi sachin-j-joshi deleted the issues-5673-SLTS-fix-gc-config branch August 23, 2021 20:49

Successfully merging this pull request may close these issues.

SLTS - Wrong config for garbage collector sleep.
3 participants