Skip to content

Conversation

@FGasper
Copy link
Collaborator

@FGasper FGasper commented Nov 19, 2024

Previously to finalize its change stream the verifier would just TryNext() until no responses arrived. This works fine for unsharded clusters, but for sharded clusters it can potentially miss events.

This changeset adopts more “tried-and-true” logic: grab a timestamp via appendOplogNote, then iterate the change stream until the stream’s resume token’s timestamp meets or exceeds that timestamp. As verification of that change, this changeset removes the test skips that REP-5306 (PR #48) imposed for sharded clusters.

The timestamp-grabbing logic is unconditionally retryable, which requires an update to the retryer module to implement. This changeset refactors that module a bit to prevent further duplication.

Additional changes:

  • go.sum is now part of the repo.
  • Go 1.22 is now a minimum requirement.
  • Tests no longer spew output when the worker delay is 0. The log message for that delay is also tweaked to show the delay.
  • Mention in code of the defunct “refetch” collection is removed.
  • The change stream no longer logs whenever a batch of events arrives. For busy migrations it would be excessive.
  • Verifier instances in tests now wait 10ms rather than 15s between generations, which makes a few tests much quicker.
  • A new CheckRunner improves the ergonomics of running CheckDriver in tests.
  • TestGenerationalRechecking used to verify that an insert after WritesOff() gets handled as a recheck. This doesn’t make a lot of sense since WritesOff() means, by definition, that no more events should happen on the server. Thus, this changeset removes that test, which was proving rather race-prone with the rest of these changes.
  • The tests now use majority read & write concern.
  • CI now parallelizes a bit more of its setup.
  • A new CheckRunner struct smooths over running Check from e2e tests.
  • Tests now use a 10ms verificationStatusCheckInterval rather than 15-minute.

@FGasper
Copy link
Collaborator Author

FGasper commented Nov 22, 2024

@tdq45gj and @mmcclimon: I’ve made several updates to this PR that ideally should get a re-review before merge, but to ensure MV’s availability for MF testing on Friday (in EMEA) I’m going to proceed with the merge.

For posterity: I spent a good while trying to make TestGenerationalRechecking work as it is. I kept getting inconsistent results from the very last bit where it calls WritesOff then alters the server. I eventually concluded that this test is invalid because altering the server right after WritesOff is nonsense, which means expectations of a consistent response to it are unfounded. Thus, I’ve removed that part of the test.

I think commit 79d5675 is where you both last approved this. Since then:

  • I parallelized a bit of the CI workflow.
  • I made some trivial-ish variable-name changes.
  • I wrote a test, TestEventBeforeWritesOff, against the bug that this PR fixes.
  • I made tests use a 10ms verificationStatusCheckInterval rather than 15-minute.
  • I moved the sending of the final/writesOff timestamp into Verifier.WritesOff. i initially did this to try to placate TestGenerationalRechecking, which ended up being unnecessary, but I think this is a more sensible workflow anyhow. (It matches up a bit better with how mongosync’s Commit works.)
  • I wrote a CheckRunner that smooths over running Check from e2e tests.
  • I switched GetNewClusterTime to logic like mongosync’s, with two consecutive appendOplogNote calls.
  • I set majority read & write concern in tests.

@FGasper FGasper merged commit 8497935 into mongodb-labs:main Nov 22, 2024
33 checks passed
Comment on lines +258 to +261
// This has to happen under the lock because the change stream
// might be inserting docs into the recheck queue, which happens
// under the lock.
verifier.changeStreamWritesOffTsChan <- finalTs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does sending finalTs need to be under the lock? We unlocked on L256 but this comment says this has to happen under the lock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, a mis-worded comment. It should read “This has to happen outside the lock …”

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants