Fix flaky test SegmentReplicationTargetServiceTests#testShardAlreadyReplicating #13248

Merged Apr 17, 2024

mch2
Member

@mch2 mch2 commented Apr 16, 2024

Description

This test is flaky because it incorrectly passes a checkpoint with a higher primary term on the second invocation, which cancels the first replication and starts another. The test sometimes passes because it asserts only on processLatestReceivedCheckpoint. If the cancellation completes before the second replication event is attempted, the test fails; otherwise it passes.

Fixed this test by ensuring the primary term is the same while the checkpoint is ahead (higher segment infos version). Also added an assertion that replication is not started with the exact ahead checkpoint, instead of asserting only on processLatestReceivedCheckpoint. Tests already exist for a higher primary term: "testShardAlreadyReplicating_HigherPrimaryTermReceived".
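
To illustrate the distinction described above, here is a minimal, self-contained sketch of the decision the test depends on. It is a simplified model, not the OpenSearch implementation: the class `CheckpointDecisionSketch`, the nested `Checkpoint` type, the `Action` enum, and the method `onNewCheckpoint` are all hypothetical names chosen for this example. It assumes only what the description states: a received checkpoint with a higher primary term cancels the ongoing replication and starts another, while a checkpoint that is ahead on the same primary term does not start a new replication while one is in flight.

```java
public class CheckpointDecisionSketch {

    /** Hypothetical stand-in for a replication checkpoint. */
    public static final class Checkpoint {
        final long primaryTerm;
        final long segmentInfosVersion;

        public Checkpoint(long primaryTerm, long segmentInfosVersion) {
            this.primaryTerm = primaryTerm;
            this.segmentInfosVersion = segmentInfosVersion;
        }
    }

    /** Hypothetical outcomes when a checkpoint arrives during an ongoing replication. */
    public enum Action {
        CANCEL_ONGOING_AND_RESTART, // higher primary term: the ongoing replication is cancelled
        RECORD_AND_WAIT             // same term: remember the checkpoint, do not start a new replication
    }

    /** Decide what to do with a received checkpoint while a replication is already running. */
    public static Action onNewCheckpoint(Checkpoint ongoing, Checkpoint received) {
        if (received.primaryTerm > ongoing.primaryTerm) {
            return Action.CANCEL_ONGOING_AND_RESTART;
        }
        return Action.RECORD_AND_WAIT;
    }

    public static void main(String[] args) {
        Checkpoint ongoing = new Checkpoint(1, 10);

        // The flaky setup: the second checkpoint carried a higher primary term.
        Checkpoint higherTerm = new Checkpoint(2, 11);
        System.out.println(onNewCheckpoint(ongoing, higherTerm));    // prints CANCEL_ONGOING_AND_RESTART

        // The fixed setup: same primary term, higher segment infos version.
        Checkpoint aheadSameTerm = new Checkpoint(1, 11);
        System.out.println(onNewCheckpoint(ongoing, aheadSameTerm)); // prints RECORD_AND_WAIT
    }
}
```

Under this model, the original test exercised the first branch by accident, so its outcome depended on how quickly the cancellation finished. The fixed test exercises the second branch, where asserting that no replication starts for the ahead checkpoint is deterministic.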

Related Issues

Resolves #8928

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…eplicating

This test is flaky because it is incorrectly passing a checkpoint with a higher primary term on the second invocation.
This will cancel the first replication and start another.  The test sometimes passes because it is only asserting on processLatestReceivedCheckpoint.
If the cancellation quickly completes before attempting second replication event the test will fail, otherwise it will pass.

Fixed this test by ensuring the pterm is the same, but the checkpoint is ahead. Also added assertion that replication is not started with the exact ahead checkpoint
instead of only processLatestReceivedCheckpoint. Tests already exist for ahead primary term "testShardAlreadyReplicating_HigherPrimaryTermReceived".

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. bug Something isn't working flaky-test Random test failure that succeeds on second run Search:Remote Search labels Apr 16, 2024
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Apr 16, 2024
Contributor

❌ Gradle check result for b8877bf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mch2
Member Author

mch2 commented Apr 16, 2024

❌ Gradle check result for b8877bf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

#12651
#13249

Contributor

❌ Gradle check result for b8877bf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mch2
Member Author

mch2 commented Apr 17, 2024

❌ Gradle check result for b8877bf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

#13240

Contributor

✅ Gradle check result for b8877bf: SUCCESS

codecov bot commented Apr 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.51%. Comparing base (b15cb0c) to head (b8877bf).
Report is 182 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13248      +/-   ##
============================================
+ Coverage     71.42%   71.51%   +0.09%     
- Complexity    59978    60707     +729     
============================================
  Files          4985     5040      +55     
  Lines        282275   285432    +3157     
  Branches      40946    41335     +389     
============================================
+ Hits         201603   204119    +2516     
- Misses        63999    64438     +439     
- Partials      16673    16875     +202     


@mch2 mch2 merged commit 1fcb79d into opensearch-project:main Apr 17, 2024
75 checks passed
@mch2 mch2 deleted the alreadyreplicating branch April 17, 2024 05:37
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-13248-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1fcb79de07498005fea9a9e6148ecdf44f484e7b
# Push it to GitHub
git push --set-upstream origin backport/backport-13248-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-13248-to-2.x.

mch2 added a commit to mch2/OpenSearch that referenced this pull request Apr 17, 2024
…eplicating (opensearch-project#13248)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit 1fcb79d)
dblock pushed a commit that referenced this pull request Apr 17, 2024
…eplicating (#13248) (#13265)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit 1fcb79d)
@rishabhmaurya
Contributor

rishabhmaurya commented Apr 19, 2024

.

Labels
backport 2.x Backport to 2.x branch backport-failed bug Something isn't working flaky-test Random test failure that succeeds on second run Search:Remote Search skip-changelog >test-failure Test failure from CI, local build, etc.
Projects
None yet
3 participants