make it a little harder to mess up new db schema migrations #5274

davepacheco · 2024-03-15T22:27:46Z

This is something we've talked about for a while. Many ideas have been floated but as far as I know nobody's implemented any of them. Here's a time-boxed attempt to make a tangible improvement. It doesn't solve all the problems.

My goals here were:

When two people both make schema changes in separate branches, then one lands on "main", the other should be forced to resolve this before they can land their PR on main. This basically means there has to be a git conflict when merging the first change into the second change's branch. Recall that right now, there's not necessarily a git conflict. In practice, it's possible for the second PR to land on "main" successfully, resulting in a broken "main". The test suite on "main" will usually fail after that because for this to happen there must be a mix of files like "up1.sql" and "up01.sql" in the same version and we disallow that. But this is only detected when running the test suite after the merge, and as we said there are not necessarily any git conflicts generated by the merge, which means nothing prevents the PR from landing onto "main" and breaking it.
Further, when this happens, it should be easy to resolve this conflict. Right now, if you run into the above case, you have to create a new directory for your files and figure out which of the files in the previous version were yours vs. already present in upstream and move them manually. It's fairly mechanical but tedious and error-prone.
Make sure a test catches the case of forgetting to update SCHEMA_VERSION in Rust or the schema version in db_metadata. This was already true via dbinit_version_matches_version_known_to_nexus. Unfortunately I've found that if you get this wrong, many tests wind up hanging (because Nexus is sitting waiting for the database to be updated), so you'd have to be a little lucky to notice this one explicit failure. We could have a Nexus config option that causes it to explode if the version doesn't match and then use that only in the test suite to try to catch this better but I decided to punt on this in this PR.

The approach I took:

Instead of discovering all schema versions by listing the contents of the directory, we have an explicit list of versions in a Rust array called KNOWN_VERSIONS. This means if two changes attempt to add the same item at the same spot in the array, they should cause a git conflict. (I still want to test this.) It is conceivable that incorrect changes would not generate a conflict (e.g., if somebody added their version in the wrong spot in the array) but I've added a lot of tests to verify that the array looks like it should (i.e., the versions can't be out of order, there can't be any gaps, etc.).
Rather than assuming the directory of SQL files for each version is named for the version, the name is explicit in KNOWN_VERSIONS. The developer's expected to create their own unique string for this (e.g., drop-table-services). This way, if you do wind up conflicting with somebody, all you have to do is fix up the KNOWN_VERSIONS table.
I hope the interfaces here make it hard to get any of this wrong (i.e., if you use KnownVersion::new, you have to provide a directory name and the test suite makes sure it does not contain the version number in it).
I did not change any of the existing migrations. (There's a separate KnownVersion::legacy() constructor for those and you should get a test failure if you try to use that for any new migrations.)

I'm really interested to know if people feel like this would be a net improvement. I know there were a lot of other ideas floated (and I'd be just as happy if someone implemented one of those instead!) but I'm trying to avoid letting the perfect be the enemy of the good.

Edit: there's an example migration in 89416f2.

davepacheco · 2024-03-15T22:31:01Z

For the record, I've confirmed that if you make only this change to "main" (where the current version is 43):

dap@ivanova omicron-play $ git diff
diff --git a/nexus/db-model/src/schema.rs b/nexus/db-model/src/schema.rs
index c31ce1677..08aab8ef7 100644
--- a/nexus/db-model/src/schema.rs
+++ b/nexus/db-model/src/schema.rs
@@ -13,7 +13,7 @@ use omicron_common::api::external::SemverVersion;
 ///
 /// This should be updated whenever the schema is changed. For more details,
 /// refer to: schema/crdb/README.adoc
-pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(43, 0, 0);
+pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(44, 0, 0);
 
 table! {
     disk (id) {

and run the test suite, after 25 minutes, 16 tests (the max that nextest runs concurrently on this machine?) appear hung: they've been running for 24+ minutes. I think when I dug into this previously, they were waiting for the db version to catch up to Nexus's (which would never happen).

Similarly, from the same commit, I undid that diff and applied this:

dap@ivanova omicron-play $ git diff
diff --git a/schema/crdb/dbinit.sql b/schema/crdb/dbinit.sql
index 255cdc813..a1b8887e2 100644
--- a/schema/crdb/dbinit.sql
+++ b/schema/crdb/dbinit.sql
@@ -3670,7 +3670,7 @@ INSERT INTO omicron.public.db_metadata (
     version,
     target_version
 ) VALUES
-    ( TRUE, NOW(), NOW(), '43.0.0', NULL)
+    ( TRUE, NOW(), NOW(), '44.0.0', NULL)
 ON CONFLICT DO NOTHING;
 
 COMMIT;

with the same result: after 17 minutes, I had 16 tests that had been hung for 16 minutes. This makes sense -- again, Nexus would be waiting for a catch-up that's never going to happen.

I don't expect this to be changed by this PR.

davepacheco · 2024-03-15T23:07:40Z

To test the merge/conflict behavior, I did this:

From the tip of this branch, create branch dap/schema-migration-better-demo1 with a trivial migration from 45 to 46. This passes the schema_versions tests.
From the same commit, create branch dap/schema-migration-better-demo2 with a different trivial migration from 45 to 46. This passes the schema_versions tests.
From the same commit, create branch dap/schema-migration-better-demo-merged:
- This is our simulated "main" branch.
- First, fast-forward merge dap/schema-migration-better-demo1. This simulates the first branch landing on "main".
- Next, merge dap/schema-migration-better-demo2. This simulates the second branch trying to land on "main".
- As expected, this generates a merge conflict that's pretty easy to resolve (more below).

Here's what the merge looks like. Starting from this PR's branch:

dap@ivanova omicron-iter $ git status -uno
On branch dap/schema-migration-better
Your branch is up to date with 'origin/dap/schema-migration-better'.

nothing to commit (use -u to show untracked files)

dap@ivanova omicron-iter $ git log -1
commit 89416f23fd02f24ebdb5e1f2890debda27034248 (HEAD -> dap/schema-migration-better, origin/dap/schema-migration-better)
Author: David Pacheco <dap@oxidecomputer.com>
Date:   Fri Mar 15 15:42:15 2024 -0700

    add example migration

Create the simulated "main" branch:

dap@ivanova omicron-iter $ git switch -c dap/schema-migration-better-merged
Switched to a new branch 'dap/schema-migration-better-merged'

Merge in one branch:

dap@ivanova omicron-iter $ git merge --ff-only dap/schema-migration-better-demo1
Updating 89416f23f..d98746135
Fast-forward
 nexus/db-model/src/schema_versions.rs        | 3 ++-
 schema/crdb/dbinit.sql                       | 2 +-
 schema/crdb/migration-conflict-demo-1/up.sql | 2 ++
 3 files changed, 5 insertions(+), 2 deletions(-)
 create mode 100644 schema/crdb/migration-conflict-demo-1/up.sql

Merge in the other:

dap@ivanova omicron-iter $ git merge dap/schema-migration-better-demo2
Auto-merging nexus/db-model/src/schema_versions.rs
CONFLICT (content): Merge conflict in nexus/db-model/src/schema_versions.rs
Automatic merge failed; fix conflicts and then commit the result.

Good! Here's what the conflict looks like (I have merge.conflictstyle=diff3 so it looks a little different than the default):

 22 /// List of all past database schema versions, in *reverse* order                 
 23 ///                                                                               
 24 /// If you want to change the Omicron database schema, you must update this.      
 25 static KNOWN_VERSIONS: Lazy<Vec<KnownVersion>> = Lazy::new(|| {                   
 26     vec![                                                                         
 27         // +- The next version goes here!  Duplicate this line, uncomment         
 28         // |  the *second* copy, then update that copy for your version,          
 29         // |  leaving the first copy as an example for the next person.           
 30         // v                                                                      
 31         // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),      
 32 <<<<<<< HEAD                                                                      
 33         KnownVersion::new(46, "migration-conflict-demo-1"),                       
 34 ||||||| 89416f23f                                                                 
 35 =======                                                                           
 36         KnownVersion::new(46, "migration-conflict-demo-2"),                       
 37 >>>>>>> dap/schema-migration-better-demo2                                         
 38         KnownVersion::new(45, "migration-demo"),                                  
 39         // The first many schema versions only vary by major or patch number and  
 40         // their path is predictable based on the version number.  (This was      
 41         // historically a problem because two pull requests both adding a new     
 42         // schema version might merge cleanly but produce an invalid result.)     
 43         KnownVersion::legacy(44, 0),                                              
 44         KnownVersion::legacy(43, 0),

The hand-merged result looks like this:

 27         // +- The next version goes here!  Duplicate this line, uncomment
 28         // |  the *second* copy, then update that copy for your version,
 29         // |  leaving the first copy as an example for the next person.
 30         // v
 31         // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
 32         KnownVersion::new(47, "migration-conflict-demo-2"),
 33         KnownVersion::new(46, "migration-conflict-demo-1"),
 34         KnownVersion::new(45, "migration-demo"),
 35         // The first many schema versions only vary by major or patch number and
 36         // their path is predictable based on the version number.  (This was
 37         // historically a problem because two pull requests both adding a new
 38         // schema version might merge cleanly but produce an invalid result.)

You also need to remember to update SCHEMA_VERSION above and dbinit.sql (which is annoying, but tests will at least not pass if you forget either or both of these).

All three of these branches are on GitHub if folks want to look or try this out.

smklein

Thanks for this, I hope it makes the process a little less conflict prone!

smklein · 2024-03-15T23:28:10Z

schema/crdb/README.adoc

+    these `schema/crdb/NAME/upN.sql` for as many `N` as you need, staring with
+    `N=1`.


Suggested change

these `schema/crdb/NAME/upN.sql` for as many `N` as you need, staring with

`N=1`.

these `schema/crdb/NAME/upN.sql` for as many `N` as you need, starting with

`N=1`.

smklein · 2024-03-15T23:29:28Z

schema/crdb/README.adoc

+* Update `schema/crdb/dbinit.sql`:
+** Update the SQL statements to match what the database should look like
+   after your up*.sql files are applied.


Practically, I do this first, and save my schema changes for later. I wonder if we should recommend this first? "All tests except the schema tests" should pass if you only change dbinit.sql, and don't change the schema version.

I think this is a good idea and I looked at it but it makes the instructions more complicated because up front you update the schema parts of dbinit.sql and later you go back to it. Reading it straight through it reads like "why did you tell me to go back to the same file twice"? I'd welcome edits here but otherwise I'm just going to leave it.

I also think this will be less necessary now. I did the same as you because I was worried about dealing with merges, but now that's pretty easy so I will probably just go ahead and do the migrations earlier than I used to.

smklein · 2024-03-15T23:32:17Z

nexus/db-model/src/schema_versions.rs

+        // |  leaving the first copy as an example for the next person.
+        // v
+        // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
+        KnownVersion::new(45, "migration-demo"),


Thanks for doing this in a way that encourages "only bumping major version" for now!

As a super nitpicky nitpick, can we change the name from migration-demo to like first-named-migration? At first glance, I thought it was test data, but I realize it actually is a durable long-lived migration.

I'll second the super nitpicky nitpick; the name threw me off also. (I was wondering if it was a demo that should be removed before this lands.)

smklein · 2024-03-15T23:35:15Z

nexus/db-model/src/schema_versions.rs

+    /// All versions have an associated SemVer.  We only use the major number.
+    semver: SemverVersion,


"We only use the major number"

This isn't strictly true, we do keep them ordered by minor/patch versions too, we just can't ensure backwards compatibility and recommend only bumping the major number.

I only bring this up because there already are legacy versions that only differ by minor versions.

Sorry, what I meant was not "only the major number is ever non-zero" but rather "we only use the major number to determine compatibility". I'll clarified this comment.

jgallagher · 2024-03-18T16:16:45Z

nexus/db-model/src/schema_versions.rs

+        // |  leaving the first copy as an example for the next person.
+        // v
+        // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
+        KnownVersion::new(45, "migration-demo"),


I'll second the super nitpicky nitpick; the name threw me off also. (I was wondering if it was a demo that should be removed before this lands.)

jgallagher · 2024-03-18T16:18:24Z

nexus/db-model/src/schema_versions.rs

+    /// from the previous version to this version.
+    fn new(major: u64, relative_path: &str) -> KnownVersion {
+        let semver = SemverVersion::new(major, 0, 0);
+        KnownVersion { semver, relative_path: relative_path.to_owned() }


I'm tempted to propose that we make the relative path for these {semver}-{name} so that the directory also remains sorted. But I don't feel super strongly about this.

I thought about this but it makes the instructions for merging more complicated (you have to git mv your directory as part of resolving a conflict, and that wouldn't necessarily be obvious) and I'm not sure it helps that much because they don't appear in sorted order in ls(1) output and on GitHub anyway (because it's a lexicographic sort, not a semver-aware sort).

jgallagher · 2024-03-18T16:30:06Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+        let target_versions: Vec<&SchemaVersion> = all_versions
+            .versions_range((
+                Bound::Excluded(&found_version),
+                Bound::Included(&desired_version),


Tiny nitpick and I realize this didn't change, but do you think this would be clearer as .versions_range(&found_version..=&desired_version)? I don't usually see explicit Bound tuples.

davepacheco · 2024-03-19T15:34:58Z

Here's the heads-up mail I plan to send about this.

If you do not actively work on Omicron, you can ignore this mail.

This pull request:

https://github.com/oxidecomputer/omicron/pull/5274 make it a little harder to mess up new db schema migrations

changes the way developers make changes to the Omicron (CockroachDB) database schema. The official instructions have been updated:

https://github.com/oxidecomputer/omicron/blob/dap/schema-migration-better/schema/crdb/README.adoc#2-how-to-change-the-schema

The main change to how you change the database schema is:

You previously would create a directory named for the new schema version. Now, you choose any descriptive name you want for the directory.
You previously would update the SCHEMA_VERSION constant. Now you update that and an array of KNOWN_VERSIONS.

The goals are that:

If two people try to add the same schema version, one of these PRs will generate a Git conflict with the other. (Previously, they would both get combined and then break "main" when the second one landed.)
If you have to deal with this kind of conflict, it's pretty straightforward to resolve: mostly, you just need to fix up the KNOWN_VERSIONS array to include the upstream version as well as yours.

This is all in the new instructions. Please reach out if you have any trouble!

sunshowers · 2024-03-19T20:27:53Z

nexus/db-model/src/schema_versions.rs

+        // |  leaving the first copy as an example for the next person.
+        // v
+        // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
+        KnownVersion::new(46, "first-named-migration"),


Came up in #oxide-update today, but some schema changes will require multiple versions to be correct. (Came up for #5032 earlier, and also for #5287 which requires three migrations.) It would be nice to be able to express an ordered list of versions this way, maybe within a single directory have a bunch of subdirectories.

(Though we think we can simplify this constraint and not require that, by changing the schema version after each file rather than each directory.)

(Though we think we can simplify this constraint and not require that, by changing the schema version after each file rather than each directory.)

@jgallagher is going to be doing this.

Sorry I didn't quite follow the problem. @sunshowers Is this PR blocked on that work? Does it make this situation worse?

Came up in #oxide-update today, but some schema changes will require multiple versions to be correct. (Came up for #5032 earlier, and also for #5287 which requires three migrations.) It would be nice to be able to express an ordered list of versions this way, maybe within a single directory have a bunch of subdirectories.

For a PR that needs to land more than one separate schema update, is there any reason not to just define two new KnownVersion values with separate major numbers? i.e., why does this use case not fit into the infrastructure we already have / what's added in this PR?

smklein · 2024-03-20T19:24:49Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+        let target_versions: Vec<&SchemaVersion> = all_versions
+            .versions_range(&found_version..=&desired_version)
+            .collect();


I know that I'm really late on this, but after rebasing on top of this locally -- this snippet changes the behavior of updates, by changing the bounds.

Before, the "current version" was excluded from the set of target versions (Bound::Excluded(&current_version)).

Now, it's included (in the &found_version.. syntax).

This means that after this PR has merged, the "current" version is always re-applied during schema migrations, even if it has been fully applied before.

Oops! That's definitely my bad; should've been .. instead of ..=.

I don't think that's exactly right, I think .. is "start inclusive, end exclusive". I believe we actually want "start exclusive, end inclusive", which doesn't have a short-hand syntax. That's why I was using std::ops::Bound::Excluded explicitly before

Ooof. Doubly sorry then. :( Do you want to fix this on #5293 or should I open a fix just for this?

I'm patching it in 5293, so no need for something else unless the urgency seems like it should be higher

davepacheco added 6 commits March 14, 2024 20:32

abstract existing discovery stuff

64a5032

first cut at KNOWN_VERSIONS

b64d341

WIP: update schema upgrade code (not tests yet)

6fce092

fix guts of schema update

417d077

docs, improvements, etc

b5f8377

Merge branch 'main' into dap/schema-migration-better

b01f821

davepacheco requested review from jgallagher and smklein March 15, 2024 22:27

davepacheco added 2 commits March 15, 2024 15:42

README updates

3b36e71

add example migration

89416f2

davepacheco marked this pull request as ready for review March 15, 2024 23:17

smklein approved these changes Mar 15, 2024

View reviewed changes

jgallagher approved these changes Mar 18, 2024

View reviewed changes

davepacheco added 3 commits March 18, 2024 20:03

review feedback

aff7a1e

Merge branch 'main' into dap/schema-migration-better

68be6eb

Merge branch 'main' into dap/schema-migration-better

7009958

mismerge

b7892d2

sunshowers reviewed Mar 19, 2024

View reviewed changes

smklein mentioned this pull request Mar 20, 2024

Save DB schema upgrade progress incrementally #5292

Closed

jgallagher mentioned this pull request Mar 20, 2024

[nexus] During DB schema update, avoid re-applying sub-updates #5293

Merged

davepacheco merged commit 58adb45 into main Mar 20, 2024

davepacheco deleted the dap/schema-migration-better branch March 20, 2024 16:58

smklein reviewed Mar 20, 2024

View reviewed changes

		these `schema/crdb/NAME/upN.sql` for as many `N` as you need, staring with
		`N=1`.

		/// All versions have an associated SemVer. We only use the major number.
		semver: SemverVersion,

make it a little harder to mess up new db schema migrations #5274

make it a little harder to mess up new db schema migrations #5274

Uh oh!

Conversation

davepacheco commented Mar 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davepacheco commented Mar 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davepacheco commented Mar 15, 2024

Uh oh!

smklein left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgallagher Mar 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgallagher Mar 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davepacheco commented Mar 19, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

davepacheco commented Mar 15, 2024 •

edited

Loading

davepacheco commented Mar 15, 2024 •

edited

Loading

jgallagher Mar 18, 2024 •

edited

Loading

jgallagher Mar 18, 2024 •

edited

Loading