Decouple Replication lag from logic to mark replicas as stale #8453
For Segrep, we rely on replication lag to determine whether writes need to be throttled or a replica needs to be marked as stale. With Segrep's integration with remote store, this lag also includes the time the primary takes to upload segments to the remote store. That is reasonable, and it should be accounted for in the replication lag. However, it can create problems for replicas when segment upload times are high (e.g. during merges): we should not be kicking replicas out because of slow segment uploads to the remote store. In a large-scale event (say, one involving the remote store), this could aggravate the situation by kicking every replica out of the cluster.

We need to decouple the logic for marking replicas as stale from replication lag.
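For context, a rough sketch of the failure check this issue refers to, with hypothetical class, method, and constant names standing in for the actual implementation: a replica is failed once its end-to-end replication lag (which, with remote store, now includes the primary's upload time) reaches twice a configured limit.

```java
// Illustrative sketch only; all names here are hypothetical.
import org.opensearch.common.unit.TimeValue;

public class StaleReplicaCheck {

    // Stand-in for the configured replication-time limit (default 5 minutes).
    private static final TimeValue MAX_REPLICATION_TIME = TimeValue.timeValueMinutes(5);

    /**
     * A replica is marked for failure once its replication lag reaches 2x the
     * configured limit (~10 minutes with the 5-minute default). With remote
     * store, the lag fed into this check includes segment upload time.
     */
    static boolean shouldFailReplica(TimeValue currentReplicationLag) {
        return currentReplicationLag.millis() >= 2 * MAX_REPLICATION_TIME.millis();
    }
}
```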
Comments

I think we should provide a new setting here to toggle the lag tolerance leading to failure. We currently use a multiple of the configured limit.
Agree, we should not account for the upload time in the failure computation. Now that upload time is included in "replication lag", we need to decouple from replication lag entirely when deciding when to apply segrep backpressure and when to fail the shard. Today we fail replicas once replication lag reaches 2x a configured limit; these limits are generous time intervals (the default is 5 minutes, so the shard would be marked for failure at 10 minutes). We should also factor in remote store pressure here and consolidate the two pressure mechanisms, since having multiple pressure mechanisms in the same flow is confusing. We could then subtract the duration for which remote pressure was applied from the replication lag computation. At a minimum, I think we now need to track a separate metric for the time between a checkpoint publish and a replica sync, and use that instead of the existing lag computation; a sketch of that idea follows.
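A hedged sketch of that separate metric, with all names hypothetical: the lag window opens when a checkpoint is published (i.e. after the primary has already uploaded its segments to the remote store) and closes when the replica syncs, so slow uploads no longer count against the replica.

```java
// Illustrative sketch only; all names are hypothetical.
import org.opensearch.common.unit.TimeValue;

public class CheckpointToSyncLagTracker {

    private volatile long lastCheckpointPublishedMillis = -1;
    private volatile long lastReplicaSyncMillis = -1;

    /** Called when the primary publishes a checkpoint, i.e. after any remote store upload completed. */
    void onCheckpointPublished(long nowMillis) {
        lastCheckpointPublishedMillis = nowMillis;
    }

    /** Called when the replica finishes syncing to the latest published checkpoint. */
    void onReplicaSynced(long nowMillis) {
        lastReplicaSyncMillis = nowMillis;
    }

    /**
     * Lag used to decide staleness: time elapsed since the last published
     * checkpoint that the replica has not yet synced to. Because the window
     * opens at checkpoint publish, the primary's upload time (e.g. during
     * large merges) is excluded.
     */
    TimeValue checkpointToSyncLag(long nowMillis) {
        if (lastCheckpointPublishedMillis < 0 || lastReplicaSyncMillis >= lastCheckpointPublishedMillis) {
            return TimeValue.ZERO; // nothing published yet, or the replica is caught up
        }
        return TimeValue.timeValueMillis(nowMillis - lastCheckpointPublishedMillis);
    }
}
```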
Completely agree on consolidating the two pressure mechanisms. Like we discussed, we can start with an additional setting for lag acceptance: if the lag is higher, we start failing the replicas (a sketch of what such a setting could look like follows). Another thing we were discussing was whether we should move the "fail the replica" logic from the primary shard to the replica shard. One benefit of doing so is that it would not include the time taken to upload the segments to the remote store. Thinking out loud, it also makes sense from a reader/writer separation point of view.
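What the proposed lag-acceptance setting could look like, as a sketch only; the setting key, default value, and class name are assumptions rather than anything merged:

```java
// Illustrative sketch only; the setting key and default are assumptions.
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.common.unit.TimeValue;

public class SegmentReplicationStaleSettings {

    /**
     * Hypothetical dynamic setting: how much checkpoint-publish-to-sync lag a
     * replica may accrue before it is failed, independent of the existing
     * backpressure limit.
     */
    public static final Setting<TimeValue> REPLICA_STALE_LAG_LIMIT = Setting.timeSetting(
        "segrep.pressure.replica.stale.limit",
        TimeValue.timeValueMinutes(10),
        Property.Dynamic,
        Property.NodeScope
    );
}
```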
Resolved with #9507