Skip to content

Conversation

@rcownie
Copy link
Contributor

@rcownie rcownie commented Mar 12, 2025

REP-5475 - Changes required for the mongodump passthrough testing

New mongodump option --internalOnlySourceWritesDoneBarrier to synchronize with resmoke tests

mongodump/barrier.go implements the wait-until-named-file-exists barrier

@rcownie
Copy link
Contributor Author

rcownie commented Mar 13, 2025

@rcownie rcownie marked this pull request as ready for review March 13, 2025 15:04
Copy link
Collaborator

@ajayvijayakumar123 ajayvijayakumar123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions!

// test infrastructure. The tests will create the barrier file when they have
// finshed writes to the source cluster.
func waitForSourceWritesDoneBarrier(barrierName string) {
log.Logvf(log.Always, "waitForSourceWritesDoneBarrier: %s", barrierName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Logvf(log.Always, "waitForSourceWritesDoneBarrier: %s", barrierName)
log.Logvf(log.DebugHigh, "waitForSourceWritesDoneBarrier: %s", barrierName)

I'm not too familiar with how we do logging in the tools. But, should this be at the debug level since we won't use this flag in production?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the log message so that it's clear that this log means we are entering the blocking state and the later one means we are still in the blocking state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to disagree with @ajayvijayakumar123: if this should never run in production this it's actually critical that it is as noisy as possible when running in production. Basically, if we hide the logs when this is running in production then we'll not be able to detect that something which should never run in production is running in production until it breaks.

If possible, we should not have this code compile into production binaries so that it's actually impossible to run in production :).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yup good point. I agree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undone and undone, and I added a comment to explain the choice of log.Always.

@@ -0,0 +1,44 @@
// Copyright (C) MongoDB, Inc. 2014-present.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this barrier system only used to provide a signal between mongodump and the integration testing framework?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The testing framework has to signal when the source cluster is in its final state. Otherwise there will be a race and mongodump may capture the state too early, when it's still being changed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this may already be addressed with discussions you had with Dave)
My opinion here is that we should have avoid having code that exists just for testing get compiled into the binaries that are released. So, is it possible to have this mechanism only be a part of the integration testing build?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go build tags would facilitate this, FYI.

I don’t feel terribly strongly about it, though, because it’s clearly marked as “internal-only”, and even if a user did use it, the worst that happens is that their backup stalls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was discussed in the review of the technical design and the consensus was to use a command-line option that is always present, but has an obscure name and isn't visible in the command-line help documentation.

return
}
if os.IsNotExist(err) {
if time.Since(prevLogTime) >= logInterval {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to make sure I understand, this setup will not print a log statement until 1 minute after it has actually started waiting on the barrier file. And then this works in tandem with the log statement at the top of the function to indicate we have entered and are still in the blocking state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We get a log message after 1 minute, and (roughly) once a minute from then on, to say that we're waiting. Enough to see what's happening, but without flooding the logfile with the same message on every poll.

// test infrastructure. The tests will create the barrier file when they have
// finshed writes to the source cluster.
func waitForSourceWritesDoneBarrier(barrierName string) {
log.Logvf(log.Always, "waitForSourceWritesDoneBarrier: %s", barrierName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the log message so that it's clear that this log means we are entering the blocking state and the later one means we are still in the blocking state.

@rcownie rcownie requested a review from tdq45gj March 21, 2025 20:37
Copy link
Collaborator

@FGasper FGasper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good. I’ve made some notes; see what you think.

Thanks!

// Poll for existence of the barrier file every 500msec
time.Sleep(500 * time.Millisecond)
} else {
panic(errors.Wrapf(err, "failed to open barrier file %s", barrierName))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panicking seems a bit extreme here. Can we return an error instead?

(And use %#q in the error string to get a nicely-quoted name?)

Copy link
Contributor Author

@rcownie rcownie Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think panic() is appropriate here, since this code is only used in the resmoke test environment and anything strange happening to that file would indicate some level of chaos in the test. I added a comment to make that clear.

Changed %s to %#q throughout.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General Go practice is to avoid panic() except for indicating purely-internal errors, usually programmer mistakes.

Maybe ping another reviewer for a third opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change.

return nil
}

if !dump.OutputOptions.Oplog && (dump.InputOptions.SourceWritesDoneBarrier != "") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're checking collection exists before we're sure that writes are done, it seems that if a collection created after verify collection exists but before the barrier will be ignored. Should the barrier be before verifyCollectionExists?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quibble: verifyCollectionExists is for backups of a specific collection. It’s a no-op otherwise. So either the backup will fail, or that collection will be included.

That said, I agree that the existence check should probably follow the barrier.

Copy link
Contributor Author

@rcownie rcownie Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the non-oplog case, moved the waitForBarrier before the check.

I added a comment about the oplog case.

// A test with the combination of 
//    1. --oplog and
//    2. --internalSourceWritesOnly and
//    3. a specified collection 
//    4. the collection didn't exist at the time mongodump was started but is
//       created later and possibly captured in the oplog
// would do this check too early and thus fail.
//
// That's out of scope for mongodump passthrough testing so we don't try to handle it.
exists, err := dump.verifyCollectionExists()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we note this as a non-goal on the scope? Can we create a ticket to support this case eventually so we don't lose track of it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The medium/long-term goal is to get rid of mongodump and mongorestore, by supporting dump-to-file(s) and restore-from-file(s) in mongosync. So I don't think there's a need to plan for extending the mongodump testing.

Copy link
Collaborator

@ajayvijayakumar123 ajayvijayakumar123 Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I still think we note this as a non-goal on the scope to call out something that is supported but not tested

Copy link
Contributor Author

@rcownie rcownie Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added "mongodump --collection" as a non-goal on the scope.

Copy link
Collaborator

@FGasper FGasper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d still like to see the panic converted to a returned error. @ajayvijayakumar123, @tdq45gj? (Erich, you don’t pop up as easily in @ completion …)

// Poll for existence of the barrier file every 500msec
time.Sleep(500 * time.Millisecond)
} else {
panic(errors.Wrapf(err, "failed to open barrier file %s", barrierName))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General Go practice is to avoid panic() except for indicating purely-internal errors, usually programmer mistakes.

Maybe ping another reviewer for a third opinion?

@rcownie rcownie requested a review from FGasper March 26, 2025 23:01
@ege-mgdb
Copy link
Collaborator

ege-mgdb commented Mar 27, 2025

I’d still like to see the panic converted to a returned error. @ajayvijayakumar123, @tdq45gj? (Erich, you don’t pop up as easily in @ completion …)

I've kind of intentionally backed off these PRs bc there were getting to be too many people involved and I didn't to create distractions. I'm reading the comments to learn but not asking questions or making comments so that the discussion can stay focused on getting things across the line.

My general opinion on panics or faults is that they should be used when you have entered a state that is unrecoverable (that is, the program has entered a state that should be impossible and where it can no longer trust any decision it makes after that).

Copy link
Collaborator

@ajayvijayakumar123 ajayvijayakumar123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % a suggestion !

Copy link
Contributor

@tdq45gj tdq45gj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

time.Sleep(500 * time.Millisecond)
} else {
// Any other error implies that the resmoke test environment is
// irretrievably confused, so it is appropriate to panic here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment might need to be updated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@FGasper FGasper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % @tdq45gj’s request to update the comment about panicking.

@rcownie rcownie merged commit 505d575 into master Mar 31, 2025
36 of 38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants