
Decouple Replication lag from logic to mark replicas as stale #8453

Closed
ankitkala opened this issue Jul 5, 2023 · 4 comments

ankitkala (Member)

For Segrep, we rely on replication lag to determine whether writes need to be throttled or whether a replica needs to be marked as stale. With Segrep's integration with remote store, this lag will also include the time the primary takes to upload segments to the remote store. That makes sense and should be accounted for in the replication lag. However, it can create problems for replicas when segment upload times are high (e.g. during merges): we shouldn't be kicking out replicas because of slow uploads to the remote store. This can further aggravate a large-scale event (say, involving the remote store) by kicking out all the replicas in a cluster.

We need to decouple the logic for marking replicas as stale from replication lag.

ankitkala added the bug and untriaged labels on Jul 5, 2023

mch2 commented Jul 5, 2023

I think we should provide a new setting here to toggle whether lag beyond the tolerance leads to failure. We currently use a multiple of MAX_REPLICATION_TIME_SETTING to determine failure. We could provide an additional setting that, when set to -1, means replicas will never be failed.
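A minimal sketch of what such a check could look like, in plain Java. The class and method names here are illustrative only and not the actual OpenSearch settings API; it just assumes a configurable time limit where a negative value (-1) disables failing replicas:

```java
import java.time.Duration;

/**
 * Hypothetical sketch: replicas are failed only when the configured limit is
 * non-negative and the observed replication lag exceeds a multiple of it.
 */
public class ReplicaFailurePolicy {

    // Stand-in for MAX_REPLICATION_TIME_SETTING; a negative value (-1) disables failing replicas.
    private final Duration maxReplicationTimeLimit;

    public ReplicaFailurePolicy(Duration maxReplicationTimeLimit) {
        this.maxReplicationTimeLimit = maxReplicationTimeLimit;
    }

    /** Returns true if the replica should be failed for the observed lag. */
    public boolean shouldFailReplica(Duration replicationLag) {
        if (maxReplicationTimeLimit.isNegative()) {
            return false; // -1 sentinel: never fail replicas on lag
        }
        // Current behaviour described in this thread: fail at 2x the configured limit.
        return replicationLag.compareTo(maxReplicationTimeLimit.multipliedBy(2)) > 0;
    }

    public static void main(String[] args) {
        ReplicaFailurePolicy disabled = new ReplicaFailurePolicy(Duration.ofSeconds(-1));
        ReplicaFailurePolicy fiveMinutes = new ReplicaFailurePolicy(Duration.ofMinutes(5));

        System.out.println(disabled.shouldFailReplica(Duration.ofHours(1)));      // false: failure disabled
        System.out.println(fiveMinutes.shouldFailReplica(Duration.ofMinutes(8)));  // false: under the 10m threshold
        System.out.println(fiveMinutes.shouldFailReplica(Duration.ofMinutes(11))); // true: over the 10m threshold
    }
}
```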

dreamer-89 added the discuss and v2.10.0 labels on Jul 20, 2023

mch2 commented Jul 20, 2023

"We shouldn't be kicking out the replicas due to high segment upload times to remote store."

Agree, we should not account for the upload time in the failure computation. Now that we are including that time in "replication lag", we need to decouple entirely from replication lag when deciding when to apply segrep backpressure and when to fail the shard.

Today we fail replicas if replication lag reaches 2x a configured limit. These limits are set to generous time intervals; the default is 5 minutes, so at 10 minutes the shard would be marked for failure.

We should also factor in remote store pressure here and consolidate the two pressure mechanisms; it is confusing to have multiple pressure mechanisms in the same flow. Then we can subtract the duration for which remote pressure was applied from the replication lag computation.

At a minimum, I think we now need to track a separate metric for the time between a checkpoint publish and a replica sync, and use that instead of the existing lag computation.
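A rough sketch of what tracking that interval could look like, in plain Java. The class, method, and checkpoint-id names are illustrative assumptions rather than the actual OpenSearch API; the key point is that measuring from checkpoint publish time excludes the primary's remote-store upload time by construction:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical tracker for the interval from a checkpoint being published
 * to a replica completing its sync to that checkpoint.
 */
public class CheckpointSyncTracker {

    // checkpoint id -> publish time (illustrative; not the actual OpenSearch data structures)
    private final Map<Long, Instant> publishTimes = new ConcurrentHashMap<>();

    /** Record when the primary publishes a checkpoint to replicas. */
    public void onCheckpointPublished(long checkpointId, Instant publishedAt) {
        publishTimes.putIfAbsent(checkpointId, publishedAt);
    }

    /** Record when a replica finishes syncing to the checkpoint; returns the publish-to-sync lag. */
    public Duration onReplicaSynced(long checkpointId, Instant syncedAt) {
        Instant publishedAt = publishTimes.remove(checkpointId);
        if (publishedAt == null) {
            return Duration.ZERO; // unknown checkpoint, nothing to measure
        }
        return Duration.between(publishedAt, syncedAt);
    }
}
```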

Bukhtawar added the Storage label on Jul 27, 2023
ankitkala (Member, Author)

Completely agree on consolidating the 2 pressure mechanisms.

Like we discussed, we can start with an additional setting for lag tolerance: if the lag exceeds that threshold, we start failing the replicas.

Another thing we were discussing was whether we should move the "fail the replica" logic from the primary shard to the replica shard. One benefit of doing so is that the computation would no longer include the time taken to upload segments to the remote store.

Thinking out loud, it also makes sense from a reader/writer separation point of view.
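A minimal sketch of what a replica-side staleness check might look like, in plain Java. All names and the single-threaded bookkeeping are illustrative assumptions, not the actual OpenSearch implementation; the point is that the replica only measures how long it has been behind a checkpoint it has already received, so primary-side upload time never enters the computation:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical replica-side staleness check: the replica tracks how long it has
 * been behind the oldest checkpoint it has received but not yet synced to.
 */
public class ReplicaSideStalenessCheck {

    private final Duration staleLagLimit;                        // lag tolerance threshold
    private volatile Instant oldestPendingCheckpointReceivedAt;  // null when fully caught up

    public ReplicaSideStalenessCheck(Duration staleLagLimit) {
        this.staleLagLimit = staleLagLimit;
    }

    /** Called when the replica receives a checkpoint it has not yet synced to. */
    public void onCheckpointReceived(Instant receivedAt) {
        if (oldestPendingCheckpointReceivedAt == null) {
            oldestPendingCheckpointReceivedAt = receivedAt;
        }
    }

    /** Called once the replica has caught up to all received checkpoints. */
    public void onCaughtUp() {
        oldestPendingCheckpointReceivedAt = null;
    }

    /** The replica reports itself stale only if it has been behind for longer than the limit. */
    public boolean isStale(Instant now) {
        Instant pendingSince = oldestPendingCheckpointReceivedAt;
        return pendingSince != null && Duration.between(pendingSince, now).compareTo(staleLagLimit) > 0;
    }
}
```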


mch2 commented Aug 31, 2023

Resolved with #9507
