[DISCUSS - Segment Replication] SegRep consistency limitations #8700

Closed
mch2 opened this issue Jul 14, 2023 · 2 comments
Labels: discuss, enhancement, feedback needed, Storage, v2.10.0


mch2 commented Jul 14, 2023

Background:

Segment Replication (SegRep) currently has a known limitation in that it does not support the strong read-after-write mechanisms available to Document Replication (DocRep) indices. These mechanisms are: a read request with get/multi-get by ID, and a write with RefreshPolicy.WAIT_UNTIL followed by a search. Currently, the only way to achieve a strong read with Segment Replication is to use Preference.Prefer_Primary to route requests to primary shards.
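
For concreteness, here is a rough sketch of the two DocRep read-after-write patterns referenced above, using the Java high-level REST client. This is an illustration only: the index/doc names are made up, and exact package names may differ between client versions.

```java
import java.io.IOException;
import java.util.Map;

import org.opensearch.action.get.GetRequest;
import org.opensearch.action.get.GetResponse;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.action.search.SearchRequest;
import org.opensearch.action.support.WriteRequest;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestHighLevelClient;

void readAfterWriteExamples(RestHighLevelClient client) throws IOException {
    // Mechanism 1: realtime get by ID. With DocRep this can return a doc that is
    // not yet searchable by reading it back from the translog.
    GetResponse doc = client.get(
        new GetRequest("my-index", "doc-1").realtime(true), RequestOptions.DEFAULT);

    // Mechanism 2: write with WAIT_UNTIL, then search. The index call does not
    // return until a refresh has made the document visible to searches.
    client.index(
        new IndexRequest("my-index")
            .id("doc-2")
            .source(Map.of("field", "value"))
            .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL),
        RequestOptions.DEFAULT);
    client.search(new SearchRequest("my-index"), RequestOptions.DEFAULT);
}
```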

The Problem:

The issue with these mechanisms under SegRep is that they hold resources in memory. GET requires a “version map” to be maintained in the engine that maps doc IDs to their translog locations, while writes with WAIT_UNTIL hold open refresh listeners. With DocRep these resources can be cleared locally with a refresh when a limit is reached. With SegRep, only primaries can issue a local refresh to clear these resources, because replicas only refresh after receiving copied segments. This means these resources will continue to grow without bound on replicas until segment copy completes.
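
As a highly simplified illustration (not the actual engine code) of why this grows on SegRep replicas: the version map only shrinks on refresh, and a SegRep replica cannot refresh on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical stand-in for the engine's version map described above.
class SimplifiedVersionMap {
    // doc ID -> translog location of the most recent write since the last refresh
    private final Map<String, Long> docIdToTranslogLocation = new ConcurrentHashMap<>();

    void onIndex(String docId, long translogLocation) {
        docIdToTranslogLocation.put(docId, translogLocation);
    }

    // DocRep shards can call this locally whenever the map grows too large.
    // A SegRep replica only reaches this point after copied segments arrive,
    // so the map keeps growing for as long as segment copy is in flight.
    void onRefresh() {
        docIdToTranslogLocation.clear();
    }
}
```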

Issues where we explored supporting these existing mechanisms for context:
WAIT_UNTIL - #6045
GET/MGET - #8536

A streaming Index API could resolve this limitation by acknowledging a write request only once a certain consistency level has been reached. However, until that exists, and for get requests in particular, I’d like to list some ideas and start the discussion on how to deal with this. Please comment if there is any option I’ve missed. I think Option 1 provides the best shorter-term solution until we have the streaming API for search.
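
No such API exists yet; purely to illustrate the idea of acknowledging a write only once a consistency level is reached, a hypothetical request option (all option names below are invented) might look like:

```java
import java.util.Map;

import org.opensearch.action.index.IndexRequest;

// Hypothetical sketch only -- no such consistency option exists in OpenSearch today.
void streamingIndexSketch() {
    IndexRequest request = new IndexRequest("my-index")
        .id("doc-1")
        .source(Map.of("field", "value"));
    // The idea: acknowledge the write only once the resulting segments are
    // visible (searchable) on enough shard copies, e.g. primary plus one replica:
    // request.waitForVisibility(VisibilityLevel.PRIMARY_AND_ONE_REPLICA); // invented name
    // client.index(request, RequestOptions.DEFAULT);
}
```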

1. Primary shard based reads

Internally route all get/mget requests to primary shards only if SegRep is enabled and the realtime param is true (the default). More on this option is provided in #8536. Clearly document that this could hurt performance in read-heavy cases or if the primary is overloaded. Require users to update to prefer _primary for any search that currently follows a WAIT_UNTIL write.
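
A minimal sketch of the routing decision this option implies (illustrative type and method names only, not the actual transport-action code; see #8536 for the real proposal):

```java
import org.opensearch.cluster.routing.IndexShardRoutingTable;
import org.opensearch.cluster.routing.ShardIterator;

// Illustrative only: send realtime get/mget to the primary when the index uses
// segment replication, otherwise keep the existing shard selection.
ShardIterator selectShardForGet(IndexShardRoutingTable shardTable,
                                boolean realtime,
                                boolean segRepEnabled) {
    if (segRepEnabled && realtime) {
        // Only the primary is guaranteed to see docs that have been indexed
        // but not yet copied to replicas.
        return shardTable.primaryShardIt();
    }
    // Default behavior: any active copy, honoring the _preference param.
    return shardTable.activeInitializingShardsRandomIt();
}
```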

Pros:

  • A strong read is achieved for both search & get/mget.

Cons:

  • Potential for higher latency if the primary is overloaded or in read-heavy use cases.
  • Adding the _preference param requires conditional logic based on replication strategy in plugin/client code to avoid any impact to DocRep performance. This will require another campaign with plugins that currently use wait_until to make these changes prior to 2.10. From a quick search, this would impact Geo and ISM among the bundled plugins.

2. Do nothing

All requests requiring strong reads would require a client update to prefer primary shards when segment replication is enabled.
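
The client-side change this option asks for would look roughly like the following sketch; detecting whether the target index uses SegRep (e.g. from its settings) is left to the caller, and the helper name here is illustrative:

```java
import java.io.IOException;

import org.opensearch.action.search.SearchRequest;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestHighLevelClient;

// Callers needing a strong read after a wait_until write add the _primary
// preference themselves, ideally only for SegRep-enabled indices.
SearchResponse searchAfterWaitUntilWrite(RestHighLevelClient client,
                                         String index,
                                         boolean indexUsesSegRep) throws IOException {
    SearchRequest request = new SearchRequest(index);
    if (indexUsesSegRep) {
        request.preference("_primary"); // route the search to primary shards only
    }
    return client.search(request, RequestOptions.DEFAULT);
}
```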

Pros:

  • Already implemented in core.

Cons:

  • Confusing to provide this param on get/mget requests that already specify realtime=true by default.
  • Requires conditional logic in clients coupled to the replication strategy.
  • Still a risk for read-heavy cases, or a latency hit when the primary is already overloaded.

3. Constrained GET and WAIT_UNTIL requests

In this approach we would update GET and WAIT_UNTIL by enforcing hard caps to safeguard against memory issues.

Get - For get/mget this means we would need to throttle writes until replicas are caught up, so that the memory footprint of the doc-ID-to-translog-location map does not grow unbounded. This is similar to our SegRep backpressure mechanism today, which applies pressure when a replica falls too far behind based on replication lag and checkpoint count. However, it would also include a primary-computed memory threshold.
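
A rough sketch of the extra check this would add on the primary (illustrative only, not actual OpenSearch code; a real implementation would presumably plug into the existing SegRep backpressure path):

```java
// Illustrative only: reject new writes on the primary when the estimated
// version-map memory on a lagging replica exceeds a primary-computed threshold.
void checkVersionMapPressure(long estimatedReplicaVersionMapBytes,
                             long maxVersionMapBytes) {
    if (estimatedReplicaVersionMapBytes > maxVersionMapBytes) {
        // Mirrors how SegRep backpressure rejects writes today when a replica
        // falls too far behind on replication lag / checkpoint count.
        throw new IllegalStateException("rejecting write: replica version map size "
            + estimatedReplicaVersionMapBytes + "b exceeds limit " + maxVersionMapBytes + "b");
    }
}
```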

WAIT_UNTIL - We would put a hard cap on the number of open wait_until writes per shard. In this case we would still support the refresh policy, but rather than solely triggering a local refresh to clear requests, we would track the number of wait_until requests open on replicas and reject writes if this count exceeds a limit.
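
A sketch of what such a per-shard cap could look like (illustrative only; a real change would live where refresh listeners are registered today):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: hard cap on open wait_until refresh listeners per shard.
class WaitUntilListenerCap {
    private final AtomicInteger openListeners = new AtomicInteger();
    private final int maxOpenListeners;

    WaitUntilListenerCap(int maxOpenListeners) {
        this.maxOpenListeners = maxOpenListeners;
    }

    // Called when a wait_until write registers a listener on the shard.
    void onListenerRegistered() {
        if (openListeners.incrementAndGet() > maxOpenListeners) {
            openListeners.decrementAndGet();
            throw new IllegalStateException("too many open wait_until requests for this shard");
        }
    }

    // Called once replicas have caught up (segments copied and refreshed)
    // and the listener fires.
    void onListenerCompleted() {
        openListeners.decrementAndGet();
    }
}
```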

Pros:

  • Solves the problem for get/mget when remote translog is not used.
  • Solves the problem for search without any client changes.

Cons:

  • The solution for get/mget does not work when remote translog is enabled. With remote translog, only the primary has a local copy of the translog written to disk, so replicas would not be able to read without the added latency of a network call to fetch the translog.
  • Significant latency is added to wait_until writes, which now have to wait for segment copy to complete.
  • Introduces stringent checks that would degrade write throughput.
  • The existing failover flow with wait requests also breaks down with SegRep and would require a logic change (more here). Today, if a primary goes down with open wait requests, the newly promoted primary will refresh locally to clear any outstanding requests before promotion, while all replicas must clear any outstanding listeners before the primary term bump. The solution here would require force-releasing the open listeners on the new primary and having each replica sync with the new primary so that listeners can be freed and the term bumped.
@mch2 mch2 added enhancement Enhancement or improvement to existing feature or request discuss Issues intended to help drive brainstorming and decision making feedback needed Issue or PR needs feedback v2.10.0 labels Jul 14, 2023
@mch2 mch2 removed the untriaged label Jul 14, 2023
@andrross (Member) commented:

I know why you're discussing remote store and system indexes here, but really these are just specific cases where the subtly different behavior of SegRep compared to DocRep causes problems. I believe our long-term goal should be the removal of DocRep and using SegRep everywhere instead, and there will be users that run into the same problems as the plugins if they have use cases that rely on the read-after-write behavior of get/mget and WAIT_UNTIL.

  • get/mget: I agree with your recommendation to internally route all these requests to the primary if the realtime parameter is true. These operations are cheap, and looking a document up by its ID is not generally why one uses a search engine like OpenSearch. I really doubt there are use cases that will be adversely impacted by routing these requests to primary. I would love to hear others' opinions about this point though.
  • WAIT_UNTIL: Requiring a change in user behavior to prefer the primary for searches after a write that specifies WAIT_UNTIL may be more problematic. It might be tractable to chase down the instances of this for our bundled plugins, but this will be a pain point for any user as well. I might lean towards the "option 4" functionality for this one, because it might make adoption of segrep easier for all users if they don't have to change their workload to get the same functionality. There are obviously scaling implications, but I don't think WAIT_UNTIL is a good strategy for high scale workloads even with docrep.

@Bukhtawar Bukhtawar added the Storage Issues and PRs relating to data and metadata storage label Jul 27, 2023
@mch2 mch2 changed the title [DISCUSS - Segment Replication] SegRep consistency limitation and impact for remote store + system indices [DISCUSS - Segment Replication] SegRep consistency limitations Jul 31, 2023
mch2 commented Jul 31, 2023

Thanks for the feedback @andrross. You are right, this is really an overall limitation of SegRep and not specific to remote store / system indices. I've trimmed this a bit to reflect that.

For the immediate future I think primary based routing is the simplest solution until we have a streaming API. Will close this issue.
