Decouple Replication lag from logic to mark replicas as stale #8453
For Segrep, we rely on replication lag to determine whether writes need to be throttled or a replica needs to be marked as stale. With Segrep's integration with remote store, this lag also includes the time the primary takes to upload segments to the remote store. That is reasonable, and it should be accounted for in the replication lag. However, it can create problems for replicas when segment upload times are high (e.g. during merges): we should not be kicking replicas out because of slow segment uploads to the remote store. In a large-scale event (say, one involving the remote store), this could aggravate the situation by kicking every replica out of the cluster.

We need to decouple the logic for marking replicas as stale from replication lag.
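For context, a rough sketch of the failure check this issue refers to, with hypothetical class, method, and constant names standing in for the actual implementation: a replica is failed once its end-to-end replication lag (which, with remote store, now includes the primary's upload time) reaches twice a configured limit.

```java
// Illustrative sketch only; all names here are hypothetical.
import org.opensearch.common.unit.TimeValue;

public class StaleReplicaCheck {

    // Stand-in for the configured replication-time limit (default 5 minutes).
    private static final TimeValue MAX_REPLICATION_TIME = TimeValue.timeValueMinutes(5);

    /**
     * A replica is marked for failure once its replication lag reaches 2x the
     * configured limit (~10 minutes with the 5-minute default). With remote
     * store, the lag fed into this check includes segment upload time.
     */
    static boolean shouldFailReplica(TimeValue currentReplicationLag) {
        return currentReplicationLag.millis() >= 2 * MAX_REPLICATION_TIME.millis();
    }
}
```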
Comments

I think we should provide a new setting here to toggle the lag tolerance leading to failure. We currently use a multiple of the configured limit.
Agree, we should not account for the upload time in the failure computation. Now that upload time is included in "replication lag", we need to decouple from replication lag entirely when deciding when to apply segrep backpressure and when to fail the shard. Today we fail replicas once replication lag reaches 2x a configured limit; these limits are generous time intervals (the default is 5 minutes, so the shard would be marked for failure at 10 minutes). We should also factor in remote store pressure here and consolidate the two pressure mechanisms, since having multiple pressure mechanisms in the same flow is confusing. We could then subtract the duration for which remote pressure was applied from the replication lag computation. At a minimum, I think we now need to track a separate metric for the time between a checkpoint publish and a replica sync, and use that instead of the existing lag computation; a sketch of that idea follows.
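A hedged sketch of that separate metric, with all names hypothetical: the lag window opens when a checkpoint is published (i.e. after the primary has already uploaded its segments to the remote store) and closes when the replica syncs, so slow uploads no longer count against the replica.

```java
// Illustrative sketch only; all names are hypothetical.
import org.opensearch.common.unit.TimeValue;

public class CheckpointToSyncLagTracker {

    private volatile long lastCheckpointPublishedMillis = -1;
    private volatile long lastReplicaSyncMillis = -1;

    /** Called when the primary publishes a checkpoint, i.e. after any remote store upload completed. */
    void onCheckpointPublished(long nowMillis) {
        lastCheckpointPublishedMillis = nowMillis;
    }

    /** Called when the replica finishes syncing to the latest published checkpoint. */
    void onReplicaSynced(long nowMillis) {
        lastReplicaSyncMillis = nowMillis;
    }

    /**
     * Lag used to decide staleness: time elapsed since the last published
     * checkpoint that the replica has not yet synced to. Because the window
     * opens at checkpoint publish, the primary's upload time (e.g. during
     * large merges) is excluded.
     */
    TimeValue checkpointToSyncLag(long nowMillis) {
        if (lastCheckpointPublishedMillis < 0 || lastReplicaSyncMillis >= lastCheckpointPublishedMillis) {
            return TimeValue.ZERO; // nothing published yet, or the replica is caught up
        }
        return TimeValue.timeValueMillis(nowMillis - lastCheckpointPublishedMillis);
    }
}
```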
Completely agree on consolidating the two pressure mechanisms. Like we discussed, we can start with an additional setting for lag acceptance: if the lag is higher, we start failing the replicas (a sketch of what such a setting could look like follows). Another thing we were discussing was whether we should move the "fail the replica" logic from the primary shard to the replica shard. One benefit of doing so is that it would not include the time taken to upload the segments to the remote store. Thinking out loud, it also makes sense from a reader/writer separation point of view.
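What the proposed lag-acceptance setting could look like, as a sketch only; the setting key, default value, and class name are assumptions rather than anything merged:

```java
// Illustrative sketch only; the setting key and default are assumptions.
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.common.unit.TimeValue;

public class SegmentReplicationStaleSettings {

    /**
     * Hypothetical dynamic setting: how much checkpoint-publish-to-sync lag a
     * replica may accrue before it is failed, independent of the existing
     * backpressure limit.
     */
    public static final Setting<TimeValue> REPLICA_STALE_LAG_LIMIT = Setting.timeSetting(
        "segrep.pressure.replica.stale.limit",
        TimeValue.timeValueMinutes(10),
        Property.Dynamic,
        Property.NodeScope
    );
}
```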
Resolved with #9507