
Conversation

smklein
Collaborator

@smklein smklein commented Mar 20, 2024

Fixes #5292 (Save DB schema upgrade progress incrementally)

"observed_target_version" => %found_target_version,
"attempted_target_version" => %target_version,
);
continue;
Contributor

Do we need to set prior_step_version before continuing? If so, could we rework the tests to fail without doing so? (That might be easier after #5274, when we can pass in the schema versions instead of a directory to read.) This sequence seems like it won't work:

  • Update to N.0.0-step.1 is successful
  • Update to N.0.0-step.2 fails; we bail out
  • Next attempt, we land here and note that N.0.0-step.2 > N.0.0-step.1, so we continue; prior_step_version is still None
  • We move on to N.0.0-step.2, and call prepare_schema_update. Because prior_step_version is None, its valid_prior_targets list only includes N.0.0, not N.0.0-step.1, so it fails.

Contributor

It may also be worth factoring this out into another function so that prior_step_version only needs to be updated once and would still be correct. A continue with mutable state like this always makes me a little worried.

Collaborator Author

Okay, keep me honest here - I'm on board with a way to make this simpler, but:

I think it's okay to keep the prior_step_version as "the latest version we think our Nexus has applied".

@jgallagher, in the case you mention, once we get to that fourth step (calling prepare_schema_update), you're right that our prior_step_version is None, but the to_version parameter would still be N.0.0-step.2 (we're trying to incrementally work on that step), so it would be included in valid_prior_targets. That would also be the target version stored in the DB, since that's the last thing we updated before failing on the prior attempt to update the schema.

(I think this was true before my merge as well, but it was less obvious because I was mutating target_version. Now I'm using a separate variable named target_step_version to make this a little clearer.)
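To make the shape of that argument concrete, here is a deliberately simplified, hypothetical sketch -- the names, signature, and types are illustrative only and are not the actual prepare_schema_update code:

use semver::Version;

// Hypothetical helper, not the real omicron signature: the set of target
// versions a step is willing to build on includes the step version currently
// being attempted, so a retrying Nexus whose in-memory prior_step_version is
// None can still proceed when the DB's stored target is that same step.
fn valid_prior_targets(
    base_version: &Version,               // e.g. N.0.0
    prior_step_version: Option<&Version>, // last step this Nexus applied, if any
    target_step_version: &Version,        // e.g. N.0.0-step.2, the step being attempted
) -> Vec<Version> {
    let mut valid = vec![base_version.clone(), target_step_version.clone()];
    if let Some(prior) = prior_step_version {
        valid.push(prior.clone());
    }
    valid
}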

Collaborator Author

Also, the test I added below really should be catching cases like this (and has, locally, when I've gotten it wrong) because it does trigger this "Fail a step and force retry" behavior.

Contributor

> to_version parameter would still be N.0.0-step.2

Is this guaranteed? What if we succeed in updating to N.0.0-step.1, and then prepare_schema_update fails to set the new target version of N.0.0-step.2?

Collaborator Author

If that's the case, we'd re-apply "step 1", redoing the work of N.0.0-step.1, which should be fine since individual steps must be idempotent anyway, right?

Also, regardless: I'm taking this feedback and have refactored a bit of this to try to make it easier to follow -- though I think the functionality is the same.

Contributor

Yes, you're right. Thanks for sticking with me on this one, and for the refactor - I find it more clearly correct for sure.

@davepacheco
Collaborator

Just dropping some notes here from chat -- thanks @jgallagher.

Here's a somewhat more concrete example of the problem we're trying to solve: say we're at 46.0.0 and we want to move to 47.0.0. Say that transition has 8 steps and that step 6 invalidates the preconditions for step 2. (This is the key. It does not seem common, but it might be easy to do accidentally.) Today, if this happens and then Nexus crashes, when it resumes it'll try to re-execute all the steps, so it'll re-execute step 2, but that will fail (by construction, because step 6 has broken step 2's preconditions). Then it'll be stuck, unable to move forward or backward.

Prior to this PR, we'd set target_version = 47.0.0 at the start of the migration from 46.0.0 to 47.0.0. Then we'd go apply a bunch of SQL transactions, each conditional on the target being 47.0.0. This PR addresses the problem above by including the step number in the target version (so that we store target version 47.0.0-step.6) and, when resuming, skipping any steps that we know have been applied (because the target version is newer).

As I understand it, we're not preventing the system from applying the same update step twice. That can still happen if two Nexus instances are doing it concurrently. The new guarantee being provided is that if you're updating from, say, 46.0.0 to 47.0.0 and you're talking about step 6, this step will only ever be applied to a database in one of two schema states:

  • 46.0.0 plus steps 1-5 of the 46.0.0 -> 47.0.0 migration
  • 46.0.0 plus steps 1-6 of the 46.0.0 -> 47.0.0 migration

(whereas previously it could have been applied to a database at 46.0.0 plus steps 1-7 or steps 1-8 of the 46.0.0 -> 47.0.0 migration).
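
To illustrate the ordering this scheme relies on, here is a minimal sketch using the semver crate's pre-release rules (assuming, as the version strings suggest, that these target versions are compared as semver versions): the step number lives in the pre-release field, so step versions sort numerically among themselves and below the final release.

use semver::Version;

fn main() {
    // The step number is encoded in the pre-release field, e.g. "47.0.0-step.6".
    let step2 = Version::parse("47.0.0-step.2").unwrap();
    let step6 = Version::parse("47.0.0-step.6").unwrap();
    let final_version = Version::parse("47.0.0").unwrap();

    // Numeric pre-release identifiers compare numerically, and any pre-release
    // version sorts below the corresponding release.
    assert!(step2 < step6);
    assert!(step6 < final_version);

    // On resume, a step at or below the stored target version has already been
    // applied and can be skipped.
    let stored_target = step6.clone();
    assert!(step2 <= stored_target);
}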

@smklein
Collaborator Author

smklein commented Mar 20, 2024

> (Quoting @davepacheco's notes above in full.)

Totally agreed with this -- fortunately, we do have tests for "single steps being idempotent" already, just not "going backwards from step 6 to step 2".

Contributor

@sunshowers sunshowers left a comment

Thanks for doing this! So much easier to reason about.

"observed_target_version" => %found_target_version,
"attempted_target_version" => %target_version,
);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May also be worth factoring out into another function to make it so that prior_step_version only needs to be updated once, and it would still be correct. continue with mutable state like this always makes me a little worried.

}

#[tokio::test]
async fn schema_version_subcomponents_save_progress() {
Contributor

Does the versions_have_idempotent_up test need to be updated to test by file rather than by directory?

Collaborator Author

I think it's still okay as-is -- that particular test, like some of the other schema tests, is "testing the schema changes" more than it is "testing the mechanism to apply schema changes".

I say this referencing the following block of code:

for _ in 0..times_to_apply {
    for step in version.upgrade_steps() {
        info!(
            log,
            "Applying sql schema upgrade step";
            "file" => step.label()
        );
        apply_update_as_transaction(&log, &client, step.sql()).await;
    }
}

This appears to already be re-running each individual step, rather than "the whole directory at once".

Contributor

Got it, thanks!

Collaborator Author

Although maybe that loop needs to be inverted? It seems like we want to apply each step times_to_apply times, not the whole set...
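
For reference, the inverted shape would look roughly like this (a sketch reusing the helpers from the snippet quoted above; the actual fix landed in 36fdb61 and may differ in detail):

// Apply each step times_to_apply times before moving to the next step,
// exercising each individual step's idempotency.
for step in version.upgrade_steps() {
    for _ in 0..times_to_apply {
        info!(
            log,
            "Applying sql schema upgrade step";
            "file" => step.label()
        );
        apply_update_as_transaction(&log, &client, step.sql()).await;
    }
}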

Collaborator Author

I updated this in 36fdb61 -- I think you're right: the test was checking the wrong thing before, and it should be fixed now.

Collaborator Author

Okay, this requires more elaboration. When I updated this test, I noticed that it actually started failing.

Specifically, this version threw an error:

ALTER TYPE omicron.public.dataset_kind DROP VALUE 'clickhouse_keeper';

Once the value is dropped, it no longer exists, so re-applying the step fails.

Postgres supports syntax to ALTER TYPE DROP VALUE IF EXISTS: https://www.postgresql.org/docs/current/sql-altertype.html

But CockroachDB does not: https://www.cockroachlabs.com/docs/stable/alter-type

To mitigate:

@smklein smklein enabled auto-merge (squash) March 21, 2024 01:50
@smklein smklein merged commit ac4f2b0 into main Mar 22, 2024
@smklein smklein deleted the schema-party branch March 22, 2024 06:13