Decouple replication lag from logic to fail stale replicas #9507

Merged
merged 12 commits into from Aug 31, 2023

@ankitkala ankitkala commented Aug 23, 2023

Description

The current implementation relies on the replication timer tracked by the primary's checkpoint tracker to evaluate a replica's staleness. While this correctly provides the replication_lag, it is not ideal to rely on replication lag, because it also includes the time taken by the primary shard to upload the data to the remote store. We shouldn't penalize and fail replicas if the primary is slow in uploading the segments. To get around this, we need to track two times separately:

  • replication lag: total time taken by the replica to catch up after the primary refreshes.
  • replication time: total time taken by the replica to catch up after the primary publishes the checkpoint (excludes the time to upload to the remote store).
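
To make the distinction concrete, here is a minimal Java sketch of the two measurements. It is not the code from this PR: the class name `LagTimerSketch`, its methods, and the injected clock are all hypothetical, chosen only to show how a single timer object can report both values.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch; not the timer class added by this PR. */
class LagTimerSketch {
    private final LongSupplier clockMillis; // injectable clock so the example is deterministic
    private final long createdAt;           // primary refreshed: replication lag starts counting
    private long startedAt = -1;            // checkpoint published: replication time starts counting

    LagTimerSketch(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        this.createdAt = clockMillis.getAsLong();
    }

    /** Called once the checkpoint has been published to the replicas. */
    void start() {
        if (startedAt < 0) {
            startedAt = clockMillis.getAsLong();
        }
    }

    /** Time since the primary refreshed; includes any remote store upload time. */
    long replicationLagMillis() {
        return clockMillis.getAsLong() - createdAt;
    }

    /** Time since the checkpoint was published; zero until start() is called. */
    long replicationTimeMillis() {
        return startedAt < 0 ? 0 : clockMillis.getAsLong() - startedAt;
    }

    public static void main(String[] args) {
        long[] now = {0};
        LagTimerSketch timer = new LagTimerSketch(() -> now[0]);
        now[0] = 4_000;  // primary spends 4s uploading segments
        timer.start();   // checkpoint is published
        now[0] = 5_000;  // replica is still catching up 1s later
        System.out.println(timer.replicationLagMillis());   // 5000
        System.out.println(timer.replicationTimeMillis());  // 1000
    }
}
```

In this scenario the replica has only been working for 1s, even though the lag visible to users is 5s; the other 4s belong to the primary's upload.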

Changes done:

  • Added a wrapper timer object that tracks two separate times (when the timer was created vs. when the timer was started).
  • Modified the logic for indexShard.updateReplicationCheckpoint to only create the timers (and not start them).
  • Added a hook to start the timers after the checkpoint is published.
  • Used the replication time for failing stale replicas in the SegRep backpressure service.
  • Introduced a setting to configure the staleness time limit before the primary starts failing stale replica shards.
  • Updated the replication stats tracker to use the newer replication lag metric.
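
A rough sketch of how the stale-replica decision could look once it is driven by replication time. Everything here is hypothetical (the class name, the method name, the replica IDs, the 5 minute limit, and treating a non-positive limit as "disabled"); it does not reproduce the actual SegRep backpressure service or its setting name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of the stale-replica decision; names are illustrative only. */
class StaleReplicaCheckSketch {

    /**
     * Returns the replicas whose replication time exceeds the configured limit.
     * Assumption of this sketch: a limit of zero or less disables the check.
     */
    static List<String> staleReplicas(Map<String, Long> replicationTimeMillisByReplica, long limitMillis) {
        if (limitMillis <= 0) {
            return List.of();
        }
        return replicationTimeMillisByReplica.entrySet().stream()
            .filter(e -> e.getValue() > limitMillis) // compare replication time, not replication lag
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> times = new LinkedHashMap<>();
        times.put("replica-1", 30_000L);   // caught up quickly after the checkpoint was published
        times.put("replica-2", 400_000L);  // still behind well after publication
        System.out.println(staleReplicas(times, 300_000L)); // [replica-2]
    }
}
```

Because the input is the time since publication, a slow remote store upload on the primary does not push any replica over the limit.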

Related Issues

#8453

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ale replicas

Signed-off-by: Ankit Kala <ankikala@amazon.com>
Signed-off-by: Ankit Kala <ankikala@amazon.com>
@opensearch-trigger-bot
Contributor

Compatibility status:

Checks if related components are compatible with change 5d3633c

Incompatible components

Incompatible components: [https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/security-analytics.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseAllShards

@codecov

codecov bot commented Aug 30, 2023

Codecov Report

Merging #9507 (4fe6846) into main (ff65403) will decrease coverage by 0.02%.
Report is 1 commits behind head on main.
The diff coverage is 66.03%.

@@             Coverage Diff              @@
##               main    #9507      +/-   ##
============================================
- Coverage     71.14%   71.12%   -0.02%     
- Complexity    57552    57584      +32     
============================================
  Files          4782     4783       +1     
  Lines        271408   271444      +36     
  Branches      39633    39639       +6     
============================================
- Hits         193084   193063      -21     
- Misses        62123    62176      +53     
- Partials      16201    16205       +4     
Files Changed Coverage Δ
...rg/opensearch/common/settings/ClusterSettings.java 93.18% <ø> (ø)
...st/action/cat/RestCatSegmentReplicationAction.java 25.74% <0.00%> (-19.81%) ⬇️
...opensearch/index/SegmentReplicationShardStats.java 30.43% <40.00%> (-8.59%) ⬇️
...replication/common/SegmentReplicationLagTimer.java 40.00% <40.00%> (ø)
...earch/index/SegmentReplicationPressureService.java 76.69% <69.23%> (-0.63%) ⬇️
...org/opensearch/index/seqno/ReplicationTracker.java 68.91% <80.95%> (+0.25%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 69.41% <100.00%> (-0.19%) ⬇️
...ckpoint/SegmentReplicationCheckpointPublisher.java 100.00% <100.00%> (ø)

... and 464 files with indirect coverage changes

Signed-off-by: Ankit Kala <ankikala@amazon.com>
Signed-off-by: Ankit Kala <ankikala@amazon.com>
@github-actions
Contributor

Compatibility status:

Checks if related components are compatible with change 75ad56c

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/reporting.git]

@github-actions
Contributor

Compatibility status:

Checks if related components are compatible with change b921169

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

Signed-off-by: Ankit Kala <ankikala@amazon.com>
@github-actions
Contributor

Compatibility status:

Checks if related components are compatible with change 4fe6846

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.snapshots.CloneSnapshotIT.testCloneAfterRepoShallowSettingDisabled

@mch2 mch2 merged commit d66df10 into opensearch-project:main Aug 31, 2023
10 of 12 checks passed
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Aug 31, 2023
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-9507-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d66df10b248457d3d9778131d6939dd1a2185e39
# Push it to GitHub
git push --set-upstream origin backport/backport-9507-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-9507-to-2.x.

ankitkala added a commit to ankitkala/OpenSearch that referenced this pull request Aug 31, 2023
…h-project#9507)

* Decouple replication lag from replication timer logic used to fail stale replicas

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Added changelog entry

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Addressed comments

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Addressed comments 2

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Addressed comments

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Retry gradle

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* fix UT

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Addressed comments

Signed-off-by: Ankit Kala <ankikala@amazon.com>

* Retry Gradle

Signed-off-by: Ankit Kala <ankikala@amazon.com>

---------

Signed-off-by: Ankit Kala <ankikala@amazon.com>
(cherry picked from commit d66df10)
mch2 pushed a commit to mch2/OpenSearch that referenced this pull request Sep 1, 2023
…h-project#9507)

mch2 added a commit that referenced this pull request Sep 2, 2023
…9705)

Signed-off-by: Ankit Kala <ankikala@amazon.com>
Co-authored-by: Ankit Kala <ankikala@amazon.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…h-project#9507)

Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…h-project#9507)

Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…h-project#9507)

Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x Backport to 2.x branch backport-failed