[server] Do not insert records into transient record cache while the drainer queue is full #14

ZacAttack wants to merge 5 commits into linkedin:master
Conversation
This is a tactical fix. This change prevents the transient record cache from growing if the memory budget for the drainer queue is close to full or exhausted.
```java
public void blockOnDrainerCapacity() {
  while (storeBufferService.getTotalRemainingMemory() < 1000) {
```
Can we replace it with wait-notify?
I agree... would be great to avoid sleeping, if possible. I've been meaning to get rid of sleeps (e.g., did it in 6db7c05) and I hope we continue getting rid of them, rather than add more.
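The wait-notify suggestion could look something like this minimal sketch. The class and member names (`DrainerCapacityGate`, `remainingMemory`, `capacityMonitor`, `onMemoryFreed`) are hypothetical stand-ins for the StoreBufferService's real bookkeeping, not the actual Venice API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: replace the sleep-poll loop with wait/notify.
// All names are illustrative, not the real StoreBufferService API.
class DrainerCapacityGate {
  private static final long MIN_REMAINING_BYTES = 1000;
  private final Object capacityMonitor = new Object();
  private final AtomicLong remainingMemory;

  DrainerCapacityGate(long initialRemainingBytes) {
    this.remainingMemory = new AtomicLong(initialRemainingBytes);
  }

  // Consumer threads park here instead of sleep-polling.
  void blockOnDrainerCapacity() {
    synchronized (capacityMonitor) {
      while (remainingMemory.get() < MIN_REMAINING_BYTES) {
        try {
          capacityMonitor.wait(); // released when the drainer frees memory
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }

  // Called by drainer threads after they process a record and free memory.
  void onMemoryFreed(long bytes) {
    remainingMemory.addAndGet(bytes);
    synchronized (capacityMonitor) {
      capacityMonitor.notifyAll();
    }
  }

  long remaining() {
    return remainingMemory.get();
  }
}
```

The drainer thread would call `onMemoryFreed` after draining a record, waking any consumer thread parked in `blockOnDrainerCapacity`, instead of having consumers wake up on a fixed sleep interval.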
```java
  // either. So, there is no need to tell the follower replica to do anything.
  }
} else {
  blockOnDrainerCapacity();
```
Calling `blockOnDrainerCapacity` on almost every record will negatively impact performance, as it calls `AtomicLong::get` and these atomic variables are being accessed/updated by threads on different cores. It will be interesting to see the ingestion perf after this change.
Yeah. We're already examining and updating the capacity for every submission to the drainer, though admittedly, this method calls every single drainer's atomic variable. We could make the argument for augmenting it to only refer to the specific drainer variable that's related to the ingestion task. That would perhaps free it up.
Nevermind, it's an array not a map, so getting the aggregate is probably the best way.
Maybe feasible to add a function to the StoreBufferService along the lines of `boolean isDrainerNearCapacity(ConsumerRecord<KafkaKey, KafkaMessageEnvelope> consumerRecord, int subPartition)`, which would then be able to call `SBS::getDrainerIndexForConsumerRecord` and look up that specific drainer's capacity. Could also be achieved by passing just a `String consumedTopic` rather than a full record, given a minor refactoring within SBS.
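A rough sketch of the suggested per-drainer check. The array of per-drainer counters and the topic-hash routing below are assumptions standing in for StoreBufferService's real `getDrainerIndexForConsumerRecord` logic, not the actual implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: check only the one drainer a record would route to,
// instead of summing every drainer's counter. Names are illustrative.
class StoreBufferServiceSketch {
  private static final long NEAR_CAPACITY_THRESHOLD_BYTES = 1000;
  private final AtomicLong[] remainingMemoryPerDrainer;

  StoreBufferServiceSketch(int drainerCount, long perDrainerBudgetBytes) {
    remainingMemoryPerDrainer = new AtomicLong[drainerCount];
    for (int i = 0; i < drainerCount; i++) {
      remainingMemoryPerDrainer[i] = new AtomicLong(perDrainerBudgetBytes);
    }
  }

  // Stand-in for getDrainerIndexForConsumerRecord: route by topic + partition
  // so each record deterministically maps to one drainer.
  int getDrainerIndex(String consumedTopic, int subPartition) {
    int hash = (consumedTopic.hashCode() * 31 + subPartition) & Integer.MAX_VALUE;
    return hash % remainingMemoryPerDrainer.length;
  }

  // Touches a single atomic variable rather than all of them.
  boolean isDrainerNearCapacity(String consumedTopic, int subPartition) {
    int index = getDrainerIndex(consumedTopic, subPartition);
    return remainingMemoryPerDrainer[index].get() < NEAR_CAPACITY_THRESHOLD_BYTES;
  }
}
```

The point of the design is that the hot-path check reads one atomic instead of N, which reduces cross-core cache-line traffic when many consumer threads poll capacity.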
> We're already examining and updating the capacity for every submission to the drainer
IIUC, this happens in the drainer threads which are limited in number; however, we have far more shared consumer threads than the drainer threads. Hence the impact of updating atomic variables millions of times per second might be even more pronounced.
Maybe I'm being paranoid here. Let's keep this code change as one of the reference points if we see a substantial impact on ingestion performance.
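As a hedged aside on the contention concern: if a single counter updated millions of times per second by many shared-consumer threads does become a hot spot, `java.util.concurrent.atomic.LongAdder` is the standard JDK alternative. It spreads writes across internal cells, trading a slower, weakly consistent `sum()` for much cheaper concurrent updates. This is not what the PR does; it is just the usual mitigation for this class of problem:

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative counter: LongAdder absorbs write contention better than a
// single AtomicLong when many threads update it concurrently.
class ContendedCounter {
  private final LongAdder usedBytes = new LongAdder();

  void add(long bytes) {
    usedBytes.add(bytes); // cheap under contention; no CAS retry storms
  }

  long used() {
    return usedBytes.sum(); // weakly consistent snapshot across cells
  }
}
```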
FelixGV left a comment:
Comment about the unmodified code (which unfortunately GH does not allow me to comment on directly): I think we should consider renaming the other overload of `PCS::setTransientRecord` (the one where you do not call the new blocking function first), so that it's clearer that it sets the transient record to null. It would also make the intent clearer as to why we're not blocking before that one (presumably because a delete may relieve memory pressure? Although that is actually uncertain... a delete-heavy workload may in fact still cause a lot of garbage, even if the value part is null).
```java
maybeBlockOnDrainerCapacity();
partitionConsumptionState.setTransientRecord(
```
Is it ideal to precede every call to `PartitionConsumptionState::setTransientRecord` with this new function? Would it make sense to include this functionality inside the `setTransientRecord` function itself? That may require passing a closure into the PCS to evaluate the buffer service's remaining capacity, which is maybe not ideal from an encapsulation standpoint. On the other hand, the convention that calls to `setTransientRecord` should always be preceded by `maybeBlockOnDrainerCapacity` seems a bit fragile as well. Maybe there should be a function in LFSIT to wrap both of these functions?
Yeah, I think we should do the closure. Will add.
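The closure approach being agreed to could be sketched roughly as follows. The simplified class, the `BooleanSupplier` gate, and the `Object` record type are illustrative assumptions, not the real Venice signatures:

```java
import java.util.function.BooleanSupplier;

// Hedged sketch of the closure idea: PartitionConsumptionState receives a
// capacity gate at construction, so setTransientRecord itself blocks until
// the drainer has room. Names are simplified stand-ins for the Venice classes.
class PartitionConsumptionStateSketch {
  private final BooleanSupplier drainerHasCapacity;
  private Object transientRecord;

  PartitionConsumptionStateSketch(BooleanSupplier drainerHasCapacity) {
    this.drainerHasCapacity = drainerHasCapacity;
  }

  void setTransientRecord(Object record) {
    while (!drainerHasCapacity.getAsBoolean()) {
      Thread.onSpinWait(); // placeholder back-off; wait/notify preferred per review
    }
    this.transientRecord = record;
  }

  Object getTransientRecord() {
    return transientRecord;
  }
}
```

With this shape, callers cannot forget the capacity check, at the cost of the PCS holding a reference (via the closure) to the buffer service's state, which is the encapsulation trade-off raised above.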
```java
public void maybeBlockOnDrainerCapacity() {
  while (storeBufferService.getTotalRemainingMemory() < 1000) {
```
Should this `1000` be configurable? Or even if left un-configurable, should it at least be made into a constant?
I was torn on adding another tunable parameter. But maybe there's a happy middle: instead of using a strict measure, we could wait if it's at, say, 90-95% of capacity? That way it's at least tunable relative to the size of the configured drainer, but doesn't add extra noise.
Why block at some margin below the actual limit? Can we instead block if `remaining_capacity - incoming_payload_size < 0`, or something like that? This is not fully precise, since the memory overhead of the transient record is more than just the payload size, but probably close enough to not matter?
If the issue is due to drainer records being uncompressed while producer buffers are not, just checking drainer capacity could still lead to very high memory usage. We should probably make this tunable and add a metric to track transient cache usage.
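The payload-size-aware check suggested above could be reduced to something like this. The method and parameter names are illustrative, not the real StoreBufferService API:

```java
// Hedged sketch: block only when admitting this record would overdraw the
// drainer budget, instead of blocking at a fixed margin like 1000 bytes.
class CapacityCheck {
  static boolean shouldBlock(long remainingCapacityBytes, long incomingPayloadBytes) {
    // Not fully precise: the transient record's memory overhead exceeds the
    // raw payload size, but this tracks actual pressure better than a
    // hard-coded threshold.
    return remainingCapacityBytes - incomingPayloadBytes < 0;
  }
}
```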
This method won't actually solve our problem. We'll need to do a more in-depth fix.