
[Segment Replication] Update RefreshPolicy.WAIT_UNTIL to wait for replica refresh. #6045

Closed · Tracked by #5147 · Fixed by #6464
mch2 opened this issue Jan 27, 2023 · 2 comments

Labels: distributed framework, enhancement

Comments


mch2 commented Jan 27, 2023

The current behavior of RefreshPolicy.WAIT_UNTIL is to wait to ack an index request until the operation is visible/searchable to the client. It does this by registering listeners that complete when the last written translog location after a refresh is at least the furthest translog location from the writes executed in the request. This logic breaks with segment replication because replicas only write operations to the translog without indexing them, so the last written translog location is not guaranteed to be visible.

Rather than comparing xlog locations, we should compare the latest max seqNo that the replica has refreshed on. This is made possible with this change, where replicas now receive an accurate seqNo.
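To make the idea concrete, here is a minimal, standalone sketch in plain Java. This is not the actual RefreshListeners implementation; the names SeqNoRefreshListeners, maxRefreshedSeqNo, addOrNotify and afterRefresh are made up for illustration. It shows listeners keyed by seqNo that fire once the shard has refreshed on a checkpoint covering at least that seqNo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

/**
 * Illustrative only: listeners are registered with the max seqNo of a request
 * instead of a translog location, and are fired once the shard has refreshed
 * on a replication checkpoint that covers at least that seqNo.
 */
class SeqNoRefreshListeners {

    /** A pending WAIT_UNTIL request: fire the callback once requiredSeqNo is searchable. */
    private record PendingListener(long requiredSeqNo, LongConsumer onVisible) {}

    private final List<PendingListener> pending = new ArrayList<>();
    private long maxRefreshedSeqNo = -1; // seqNo covered by the last refreshed checkpoint

    /** Called on the write path (analogous to registering a refresh listener on the shard). */
    synchronized void addOrNotify(long requiredSeqNo, LongConsumer onVisible) {
        if (requiredSeqNo <= maxRefreshedSeqNo) {
            onVisible.accept(maxRefreshedSeqNo); // already visible, ack immediately
        } else {
            pending.add(new PendingListener(requiredSeqNo, onVisible));
        }
    }

    /** Called after a segment replication round plus refresh completes on the replica. */
    synchronized void afterRefresh(long checkpointSeqNo) {
        maxRefreshedSeqNo = Math.max(maxRefreshedSeqNo, checkpointSeqNo);
        pending.removeIf(l -> {
            if (l.requiredSeqNo() <= maxRefreshedSeqNo) {
                l.onVisible().accept(maxRefreshedSeqNo);
                return true;
            }
            return false;
        });
    }
}
```

Under this scheme the replica only needs to track a single monotonically increasing value, the highest seqNo it has refreshed on, to decide which pending wait_until requests can be acked.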


mch2 commented Jan 27, 2023

An idea of how to accomplish this (see the sketch after this list):

  • Update TransportShardBulkAction#performOnReplica to also return the max seqNo from a BulkShardRequest in a Tuple or new class so that both segrep & non-segrep behavior can be supported by this method. The max seqNo from a set of operations is not guaranteed to be the latest xlog write location, so we must compute it from the primary's BulkItemResponse.
  • Update WriteReplicaResult to include the maxSeqNo.
  • Update RefreshListeners to register refresh listeners with the seqNo instead of the xlog location, and fire the listeners after a refresh iff the latest ReplicationCheckpoint's seqNo is at least that number.
  • Overload IndexShard.addRefreshListener to register listeners with the updated method in RefreshListeners.
  • Update AsyncAfterWriteAction to optionally include the seqNo. In its run method, pass the seqNo (new tuple/class) to indexShard.addRefreshListener.
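A rough sketch of how the pieces above could fit together, assuming the primary's per-item responses expose their assigned seqNos. The types below (ItemResponse, ReplicaWriteResult, SeqNoRefreshListenerRegistry) are hypothetical stand-ins for BulkItemResponse, WriteReplicaResult and the new seqNo-based listener registration, not the real OpenSearch classes:

```java
import java.util.List;

/** Hypothetical stand-ins for the classes named above; not the real OpenSearch types. */
class WaitUntilWiringSketch {

    record ItemResponse(long seqNo) {}          // stands in for BulkItemResponse
    record ReplicaWriteResult(long maxSeqNo) {} // stands in for WriteReplicaResult carrying maxSeqNo

    /** Stand-in for the seqNo-based overload of IndexShard.addRefreshListener / RefreshListeners. */
    interface SeqNoRefreshListenerRegistry {
        void addOrNotify(long requiredSeqNo, Runnable onVisible);
    }

    /**
     * Compute the request's max seqNo from the primary's per-item responses,
     * since it cannot be derived from the translog write location.
     */
    static long maxSeqNo(List<ItemResponse> primaryItemResponses) {
        return primaryItemResponses.stream().mapToLong(ItemResponse::seqNo).max().orElse(-1L);
    }

    /** Analogue of performOnReplica returning the maxSeqNo alongside the write result. */
    static ReplicaWriteResult performOnReplica(List<ItemResponse> primaryItemResponses) {
        return new ReplicaWriteResult(maxSeqNo(primaryItemResponses));
    }

    /** Analogue of AsyncAfterWriteAction#run for WAIT_UNTIL: register on the seqNo, not a location. */
    static void runAfterWrite(ReplicaWriteResult result,
                              SeqNoRefreshListenerRegistry listeners,
                              Runnable ackToClient) {
        listeners.addOrNotify(result.maxSeqNo(), ackToClient);
    }
}
```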


Rishikesh1159 commented Mar 16, 2023

For the OpenSearch 2.7 release, we are not supporting wait_until requests with segment replication. Users are not guaranteed that docs will be visible on replicas after a refresh. So, we only provide read-after-write consistency (on the primary only) for wait_until with segment replication.

More details:

Wait_until Requests with Segment Replication:

  • With segment replication, segments are copied from the primary to the replicas, so newly indexed docs become visible to search only after a round of replication completes on the replica. Depending on the size of the segments copied, the time taken by replication can vary a lot.
  • The existing wait_until behaviour breaks with segment replication because replicas only write operations to the translog without indexing them, so the last written translog location is not guaranteed to be visible.
  • To support wait_until with segment replication, we decided that rather than comparing translog locations, we should compare the latest max sequence number that the replica has refreshed on. This is made possible with this change, where replicas now receive an accurate seqNo. [But later the idea of comparing the max sequence number was discarded; an explanation is provided under Problem 3 below.]
  • Issue created: [Segment Replication] Update RefreshPolicy.WAIT_UNTIL to wait for replica refresh. #6045

Problems encountered while trying to support wait_until with segment replication:

**Problem 1 (Failover Scenario):**

→ Consider a scenario where we have 1 primary shard with 2 replicas in its replication group.
→ Now suppose we index a few docs with wait_until; the replica shards will register listeners and wait for segment replication to complete so that they can catch up.
→ Now if the primary shard goes down after the replicas register listeners and before the segment replication event starts, the cluster will try to elect a new primary from one of the two replicas.
→ All replicas in the replication group receive a primary term bump, which will only be processed after the existing listeners on the replicas have caught up or been released. As the primary went down before segment replication started, the replicas would never receive the latest segments and would not be able to catch up and release the registered listeners.
→ This is a deadlock situation. To avoid it, if we forcefully release all registered listeners on the replicas so that the primary term bump can be processed, then we are compromising the behaviour of wait_until.
→ Although after some time, when the new primary is elected and sends checkpoints, all replicas will catch up, until then we would be lying to users by forcefully releasing listeners (saying docs are ready to be searched when in fact those docs are not indexed on the replicas until the next replication takes place).

**Problem 2 (Relocation Scenario):**

→ Consider a scenario where we have 1 primary shard with 2 replicas in its replication group.
→ Now suppose we index a few docs with wait_until; the replica shards will register listeners and wait for segment replication to complete so that they can catch up.
→ Now suppose the existing primary relocates to another node after the replicas register listeners and before the segment replication event starts. The existing primary shard then goes through a relocation handoff. In this relocation handoff we again do a primary term bump, which will only be processed once all existing listeners on the replicas have caught up or been released.
→ Once a replica reaches the primary term bump stage, it is blocked completely and cannot even receive new checkpoints from the primary shard. So the replica is blocked, the primary relocation handoff is blocked, and unless we forcefully release the registered listeners, everything stays in a blocked state.
→ Again, as in the failover scenario above, if we forcefully release all registered listeners on the replicas so that the primary term bump can be processed, then we are compromising the behaviour of wait_until.
→ Although after some time, when the primary is relocated and the new primary sends checkpoints, all replicas will catch up, until then we would be lying to users by forcefully releasing listeners (saying docs are ready to be searched when in fact those docs are not indexed on the replicas until the next replication takes place).

**Problem 3 (Max Sequence Number):**

→ Suppose we index a few docs with wait_until; the replica shards will register listeners and wait for segment replication to complete so that they can catch up.
→ We need some kind of variable/state on the replica shard that updates after a segment replication event completes, so that we can release the registered listeners. So, we decided to use the max sequence number as that state/variable, as described in this issue: #6045
→ But later we realized there is a bug: the max sequence number is not a reliable state to compare when releasing listeners on replicas, as delete operations can lower the max sequence number.
→ More details about the bug: #6588
→ To overcome this problem we need to find a new state/variable that is updated after a segment replication event and lets us know whether the replica has caught up and the registered listeners can be released.

Possible Solution to Support Wait_until:

→ A possible solution to support wait_until with segment replication is to make use of the Streaming Index API to index wait_until requests.
→ We need to design the Streaming Index API in such a way that we acknowledge back to users only after all indexed docs are searchable on both the replicas and the primary. But we might again face the problems mentioned above, so we need more discussion on how the Streaming Index API should be designed.

Current Situation of Wait_Until with Segment Replication:

  • Due to the problems discussed above, wait_until requests can give an incorrect response because of the replica shards. So we currently don't support wait_until requests with segment replication; only primary shards will give a correct response to user searches.
  • So for users to use wait_until requests with segment replication, they should query only the primary shards. Basically, we provide read-after-write consistency (only on primary shards) with segment replication.
