[Remote Store] Handoff refreshes, translog uploads during relocation from old to new primary #11330

Merged 11 commits on Nov 28, 2023

@ashking94 ashking94 commented Nov 24, 2023

Description

This PR fixes an edge case during primary relocation in which both the existing primary and the new primary can upload segment or translog metadata with the same combination of primary term, segment info generation, and translog generation. This can happen in the following situations.

Situations -

  1. An internal flush is triggered (by shard inactivity or a snapshot) just before primary mode is set to false on the existing primary. This is a race condition: the new primary starts with generation n, but the existing primary uploads generation n+1 to the remote store.
  2. The new primary calls syncTranslog before it actually becomes the primary from the cluster manager’s perspective. If the relocation handoff fails, the old primary can continue to be the primary.
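
The race in situation 1 can be sketched with a toy model. The snippet below is only an illustration under stated assumptions, not code from this PR: the class `MetadataCollisionSketch`, the key layout in `key(...)`, and the "last writer wins" map standing in for the remote store are all hypothetical. It also assumes the primary term stays the same across the relocation and, for brevity, uses the same number for the segment and translog generations.

```java
import java.util.HashMap;
import java.util.Map;

final class MetadataCollisionSketch {

    /** Hypothetical remote store: one metadata file per key, last writer wins. */
    private final Map<String, String> metadataFiles = new HashMap<>();

    /** Hypothetical key layout; the real metadata file naming may differ. */
    static String key(long primaryTerm, long segmentGeneration, long translogGeneration) {
        return primaryTerm + "__" + segmentGeneration + "__" + translogGeneration;
    }

    /** Uploads a metadata file and returns the previous uploader for this key, or null. */
    String upload(String key, String uploader) {
        return metadataFiles.put(key, uploader);
    }

    /** Replays situation 1 and returns whose file the new primary overwrote (null if none). */
    static String simulateSituationOne() {
        MetadataCollisionSketch store = new MetadataCollisionSketch();
        long primaryTerm = 3; // assumption: unchanged across the relocation
        long n = 7;

        // The new primary starts itself with generation n.
        long newPrimaryGeneration = n;

        // An internal flush on the old primary, just before it leaves primary mode,
        // produces generation n+1 and uploads its metadata.
        store.upload(key(primaryTerm, n + 1, n + 1), "old-primary");

        // The new primary's first flush also produces generation n+1,
        // so its metadata lands on the same key.
        long next = newPrimaryGeneration + 1;
        return store.upload(key(primaryTerm, next, next), "new-primary");
    }

    public static void main(String[] args) {
        System.out.println("overwritten upload from: " + simulateSituationOne());
    }
}
```

In this model the second upload silently replaces the first because both writers derive the same key, which is the collision the description refers to.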

This can have the following impact:

Situation 1 -

  1. Red or yellow index if the primary or a replica shard becomes unassigned for any reason (such as a node leaving the cluster).
  2. Primary relocation failure if no indexing is happening.

If the index is not red, this can be mitigated by flushing multiple times. It can still be fixed by deleting the metadata file uploaded by the old primary.

Situation 2 -

  1. If the translog upload from the new primary succeeds but the metadata file upload fails, and a node then drops, a remote store restore will restore a translog that has been overwritten by the new primary.

This issue can go away if indexing continues for some more time.
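
Both situations involve a shard copy uploading to the remote store while ownership is changing hands. One general way to guard against that is to drain in-flight uploads and reject new ones until the handoff is settled. The sketch below shows that idea with a permit-based gate. It is a minimal illustration only: `UploadGateSketch` and its methods are hypothetical names, and nothing here should be read as the implementation merged in this PR.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class UploadGateSketch {

    private static final int TOTAL_PERMITS = Integer.MAX_VALUE;

    private final Semaphore permits = new Semaphore(TOTAL_PERMITS);
    private final AtomicBoolean drained = new AtomicBoolean(false);

    /** Runs the upload only if the gate is open; returns false if it was skipped. */
    boolean tryUpload(Runnable upload) {
        if (!permits.tryAcquire()) {
            return false; // gate is drained: skip the upload instead of racing the handoff
        }
        try {
            upload.run();
            return true;
        } finally {
            permits.release();
        }
    }

    /** Waits for in-flight uploads to finish, then blocks new ones. Returns false on timeout. */
    boolean drain(long timeout, TimeUnit unit) {
        try {
            boolean acquired = permits.tryAcquire(TOTAL_PERMITS, timeout, unit);
            if (acquired) {
                drained.set(true);
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Reopens the gate, for example if the handoff fails and this copy stays primary. */
    void reopen() {
        if (drained.compareAndSet(true, false)) {
            permits.release(TOTAL_PERMITS);
        }
    }
}
```

A caller would invoke `drain(...)` before handing over primary mode and `reopen()` only if the handoff fails, so that the old primary resumes uploads in that case.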

Related Issues

Resolves #11320, #11322, #11323

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


❌ Gradle check result for a26708b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions github-actions bot added the bug, Storage:Durability, Storage:Remote, and v2.12.0 labels Nov 24, 2023

github-actions bot commented Nov 24, 2023

Compatibility status:

Checks if related components are compatible with change ea629f4

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

@ashking94 ashking94 changed the title from "[Remote Store] Drain ongoing refreshes, translog uploads during primary relocation" to "[Remote Store] Handoff refreshes, translog uploads during relocation from old to new primary" Nov 24, 2023

✅ Gradle check result for 1fca1a7: SUCCESS


codecov bot commented Nov 24, 2023

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (ec5a0f9) 71.37% compared to head (ea629f4) 71.29%.
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...rg/opensearch/index/translog/RemoteFsTranslog.java | 72.00% | 3 Missing and 4 partials ⚠️ |
| ...search/index/shard/RemoteStoreRefreshListener.java | 25.00% | 0 Missing and 3 partials ⚠️ |
| ...ndex/shard/ReleasableRetryableRefreshListener.java | 88.23% | 0 Missing and 2 partials ⚠️ |
| ...in/java/org/opensearch/index/shard/IndexShard.java | 90.90% | 0 Missing and 1 partial ⚠️ |
| ...opensearch/index/translog/NoOpTranslogManager.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11330      +/-   ##
============================================
- Coverage     71.37%   71.29%   -0.08%     
- Complexity    58964    58983      +19     
============================================
  Files          4890     4890              
  Lines        277468   277501      +33     
  Branches      40313    40323      +10     
============================================
- Hits         198029   197853     -176     
- Misses        62945    63239     +294     
+ Partials      16494    16409      -85     


@ashking94
Member Author

Need to validate that the issue mentioned in #10839 can no longer be reproduced.

❌ Gradle check result for 827d0f0: FAILURE

❌ Gradle check result for c2c8e4d: FAILURE

❌ Gradle check result for d9e39bc: FAILURE

❌ Gradle check result for f7375fb: FAILURE

❌ Gradle check result for 25bc81b: FAILURE

❌ Gradle check result for bdb92d4: FAILURE

Signed-off-by: Ashish Singh <ssashish@amazon.com>
…n step

Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94
Member Author

Did a rebase with main as there were conflicts preventing merge.


✅ Gradle check result for be750ea: SUCCESS


✅ Gradle check result for a27e030: SUCCESS

Signed-off-by: Ashish Singh <ssashish@amazon.com>

❕ Gradle check result for ea629f4: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@sachinpkale sachinpkale merged commit ed1c8b7 into opensearch-project:main Nov 28, 2023
29 checks passed
@ashking94
Member Author

> ❕ Gradle check result for ea629f4: UNSTABLE
>
>   • TEST FAILURES:
>       1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing
>       1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards
>
> Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Flaky test - #10558

@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-11330-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ed1c8b7f53176d1669b8c9189c9c0835a9f17fab
# Push it to GitHub
git push --set-upstream origin backport/backport-11330-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-11330-to-2.x.

@ashking94
Member Author

Raising manual backport since the automated one failed.

ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Nov 28, 2023
…from old to new primary (opensearch-project#11330)

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
sachinpkale pushed a commit that referenced this pull request Nov 28, 2023
…from old to new primary (#11330) (#11361)

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Labels: backport 2.x, backport-failed, bug, skip-changelog, Storage:Durability, Storage:Remote, v2.12.0
Development

Successfully merging this pull request may close these issues.

[BUG] Close refresh listeners during primary relocation of remote enabled indexes
4 participants