Fix SegmentReplicationUsingRemoteStoreIT#testDropPrimaryDuringReplication. #9471

Merged
merged 5 commits into from
Aug 30, 2023

Conversation

@mch2 mch2 commented Aug 22, 2023

Description

This test is failing yet again, but only with remote store enabled. This time, a concurrent flush can delete an old commit file while we are inside the remote store refresh listener. The listener fetches the latest SegmentInfos from the reader, which can still reference a segments_n file that an incoming flush has already deleted.

To fix this, InternalEngine conditionally acquires the previous commit point and preserves it until a new commit is loaded onto the reader. This guarantees that a commit point is not deleted while a refresh is in progress.
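
For readers unfamiliar with the pattern, here is a minimal sketch of the idea: pin the commit the reader currently references, and only release it once the reader has moved to a newer commit. Everything in it (`CommitRetentionSketch`, `CommitStore`, `Reader`, and their methods) is hypothetical and written for illustration only. It is not the code in this PR and does not use the OpenSearch or Lucene APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; none of these types are OpenSearch or Lucene classes. */
final class CommitRetentionSketch {

    /** Tracks commit files "on disk" and how many holders pin each one. */
    static final class CommitStore {
        private final Map<String, Integer> refCounts = new HashMap<>();
        private final Set<String> files = new TreeSet<>();
        private String latest;

        /** A flush writes a new commit, then deletes older commits nobody pins. */
        synchronized void flush(String newCommit) {
            files.add(newCommit);
            latest = newCommit;
            deleteUnreferenced();
        }

        /** Pins the latest commit so a later flush cannot delete it. */
        synchronized String acquireLatest() {
            if (latest == null) {
                throw new IllegalStateException("no commit has been written yet");
            }
            refCounts.merge(latest, 1, Integer::sum);
            return latest;
        }

        /** Drops one pin; the commit becomes deletable once no pins remain. */
        synchronized void release(String commit) {
            refCounts.computeIfPresent(commit, (k, v) -> v > 1 ? v - 1 : null);
            deleteUnreferenced();
        }

        synchronized boolean exists(String commit) {
            return files.contains(commit);
        }

        private void deleteUnreferenced() {
            files.removeIf(f -> !f.equals(latest) && refCounts.getOrDefault(f, 0) == 0);
        }
    }

    /** A reader that keeps its current commit pinned until it refreshes onto a newer one. */
    static final class Reader {
        private final CommitStore store;
        private String pinned;

        Reader(CommitStore store) {
            this.store = store;
            this.pinned = store.acquireLatest();
        }

        String currentCommit() {
            return pinned;
        }

        /** Acquire the new commit first, then release the previous one. */
        void refresh() {
            String next = store.acquireLatest();
            String previous = pinned;
            pinned = next;
            store.release(previous);
        }
    }

    public static void main(String[] args) {
        CommitStore store = new CommitStore();
        store.flush("segments_1");
        Reader reader = new Reader(store);

        // A flush arrives while the reader still references segments_1.
        store.flush("segments_2");
        System.out.println(store.exists(reader.currentCommit())); // prints true

        // Once the reader has loaded the new commit, the old one can go.
        reader.refresh();
        System.out.println(store.exists("segments_1")); // prints false
    }
}
```

The ordering inside `refresh()` is the point of the sketch: the new commit is acquired before the previous one is released, so there is no window in which the reader references a commit that a flush is free to delete.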

Related Issues

Resolves #8059

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Aug 22, 2023

Codecov Report

Merging #9471 (67d0701) into main (012c4fa) will increase coverage by 0.63%.
Report is 2 commits behind head on main.
The diff coverage is 83.33%.

@@             Coverage Diff              @@
##               main    #9471      +/-   ##
============================================
+ Coverage     70.40%   71.03%   +0.63%     
- Complexity    56861    57490     +629     
============================================
  Files          4781     4781              
  Lines        271231   271230       -1     
  Branches      39599    39601       +2     
============================================
+ Hits         190947   192667    +1720     
+ Misses        63974    62364    -1610     
+ Partials      16310    16199     -111     
Files Changed Coverage Δ
...ensearch/indices/replication/common/CopyState.java 91.66% <ø> (+2.77%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 69.91% <50.00%> (+1.13%) ⬆️
...va/org/opensearch/index/engine/InternalEngine.java 74.68% <85.71%> (+8.85%) ⬆️
.../org/opensearch/action/get/TransportGetAction.java 62.22% <100.00%> (-3.78%) ⬇️
...main/java/org/opensearch/cluster/ClusterState.java 97.87% <100.00%> (+0.02%) ⬆️

... and 620 files with indirect coverage changes

mch2 commented Aug 23, 2023

pushed a rebase here - tagging @sachinpkale @ashking94 for review on this.

@mch2 mch2 requested a review from sachinpkale August 29, 2023 05:30
@mch2 mch2 requested a review from ashking94 August 29, 2023 05:30
@github-actions

Compatibility status:

Checks if related components are compatible with change 9cd58d3

Incompatible components

Incompatible components: [https://github.com/opensearch-project/asynchronous-search.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

…tion.

This test is failing because a concurrent flush can wipe out an old commit file
while we are in the remote store refresh listener. The listener will fetch the latest infos from the reader which will reference a segments_n that has been deleted by an incoming flush.

To fix this, InternalEngine will preserve the latest commit until a new commit is loaded onto the readerManager.

Signed-off-by: Marc Handalian <handalm@amazon.com>
…efreshed on.

Signed-off-by: Marc Handalian <handalm@amazon.com>
…ement getSegmentInfosSnapshot.

This ensures access to this function is not permitted on the ReadOnlyEngine and is delegated to the new IE once opened.

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

mch2 commented Aug 29, 2023

The recent Gradle Check failures are all due to #9407.

mch2 commented Aug 29, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

#5030 and #5329

@sachinpkale sachinpkale merged commit 6cd576f into opensearch-project:main Aug 30, 2023
11 of 12 checks passed
@sachinpkale sachinpkale added the backport 2.x Backport to 2.x branch label Aug 30, 2023
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-9471-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6cd576f2d69b9c7d05d22aecff3fd9a6e6d335c9
# Push it to GitHub
git push --set-upstream origin backport/backport-9471-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-9471-to-2.x.

mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 30, 2023
…tion. (opensearch-project#9471)

* Fix SegmentReplicationUsingRemoteStoreIT#testDropPrimaryDuringReplication.

This test is failing because a concurrent flush can wipe out an old commit file
while we are in the remote store refresh listener. The listener will fetch the latest infos from the reader which will reference a segments_n that has been deleted by an incoming flush.

To fix this, InternalEngine will preserve the latest commit until a new commit is loaded onto the readerManager.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* update InternalEngine to preserve commit file until a new commit is refreshed on.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update ReadOnlyEngine inside of resetEngineToGlobalCheckpoint to implement getSegmentInfosSnapshot.
This ensures access to this function is not permitted on the ReadOnlyEngine and is delegated to the new IE once opened.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update javadoc.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* spotless.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit 6cd576f)
@mch2 mch2 deleted the dropagain branch August 30, 2023 23:32
tlfeng pushed a commit that referenced this pull request Aug 31, 2023
…tion. (#9471) (#9648)

* Fix SegmentReplicationUsingRemoteStoreIT#testDropPrimaryDuringReplication.

This test is failing because a concurrent flush can wipe out an old commit file
while we are in the remote store refresh listener. The listener will fetch the latest infos from the reader which will reference a segments_n that has been deleted by an incoming flush.

To fix this, InternalEngine will preserve the latest commit until a new commit is loaded onto the readerManager.

* update InternalEngine to preserve commit file until a new commit is refreshed on.

* Update ReadOnlyEngine inside of resetEngineToGlobalCheckpoint to implement getSegmentInfosSnapshot.
This ensures access to this function is not permitted on the ReadOnlyEngine and is delegated to the new IE once opened.

* Update javadoc.
---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit 6cd576f)
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…tion. (opensearch-project#9471)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…tion. (opensearch-project#9471)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…tion. (opensearch-project#9471)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>