Skip to content

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Mar 15, 2024

This is something we've talked about for a while. Many ideas have been floated but as far as I know nobody's implemented any of them. Here's a time-boxed attempt to make a tangible improvement. It doesn't solve all the problems.

My goals here were:

  • When two people both make schema changes in separate branches, then one lands on "main", the other should be forced to resolve this before they can land their PR on main. This basically means there has to be a git conflict when merging the first change into the second change's branch. Recall that right now, there's not necessarily a git conflict. In practice, it's possible for the second PR to land on "main" successfully, resulting in a broken "main". The test suite on "main" will usually fail after that because for this to happen there must be a mix of files like "up1.sql" and "up01.sql" in the same version and we disallow that. But this is only detected when running the test suite after the merge, and as we said there are not necessarily any git conflicts generated by the merge, which means nothing prevents the PR from landing onto "main" and breaking it.
  • Further, when this happens, it should be easy to resolve this conflict. Right now, if you run into the above case, you have to create a new directory for your files and figure out which of the files in the previous version were yours vs. already present in upstream and move them manually. It's fairly mechanical but tedious and error-prone.
  • Make sure a test catches the case of forgetting to update SCHEMA_VERSION in Rust or the schema version in db_metadata. This was already true via dbinit_version_matches_version_known_to_nexus. Unfortunately I've found that if you get this wrong, many tests wind up hanging (because Nexus is sitting waiting for the database to be updated), so you'd have to be a little lucky to notice this one explicit failure. We could have a Nexus config option that causes it to explode if the version doesn't match and then use that only in the test suite to try to catch this better but I decided to punt on this in this PR.

The approach I took:

  • Instead of discovering all schema versions by listing the contents of the directory, we have an explicit list of versions in a Rust array called KNOWN_VERSIONS. This means if two changes attempt to add the same item at the same spot in the array, they should cause a git conflict. (I still want to test this.) It is conceivable that incorrect changes would not generate a conflict (e.g., if somebody added their version in the wrong spot in the array) but I've added a lot of tests to verify that the array looks like it should (i.e., the versions can't be out of order, there can't be any gaps, etc.).
  • Rather than assuming the directory of SQL files for each version is named for the version, the name is explicit in KNOWN_VERSIONS. The developer's expected to create their own unique string for this (e.g., drop-table-services). This way, if you do wind up conflicting with somebody, all you have to do is fix up the KNOWN_VERSIONS table.
  • I hope the interfaces here make it hard to get any of this wrong (i.e., if you use KnownVersion::new, you have to provide a directory name and the test suite makes sure it does not contain the version number in it).
  • I did not change any of the existing migrations. (There's a separate KnownVersion::legacy() constructor for those and you should get a test failure if you try to use that for any new migrations.)

I'm really interested to know if people feel like this would be a net improvement. I know there were a lot of other ideas floated (and I'd be just as happy if someone implemented one of those instead!) but I'm trying to avoid letting the perfect be the enemy of the good.

Edit: there's an example migration in 89416f2.

@davepacheco
Copy link
Collaborator Author

davepacheco commented Mar 15, 2024

For the record, I've confirmed that if you make only this change to "main" (where the current version is 43):

dap@ivanova omicron-play $ git diff
diff --git a/nexus/db-model/src/schema.rs b/nexus/db-model/src/schema.rs
index c31ce1677..08aab8ef7 100644
--- a/nexus/db-model/src/schema.rs
+++ b/nexus/db-model/src/schema.rs
@@ -13,7 +13,7 @@ use omicron_common::api::external::SemverVersion;
 ///
 /// This should be updated whenever the schema is changed. For more details,
 /// refer to: schema/crdb/README.adoc
-pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(43, 0, 0);
+pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(44, 0, 0);
 
 table! {
     disk (id) {

and run the test suite, after 25 minutes, 16 tests (the max that nextest runs concurrently on this machine?) appear hung: they've been running for 24+ minutes. I think when I dug into this previously, they were waiting for the db version to catch up to Nexus's (which would never happen).

Similarly, from the same commit, I undid that diff and applied this:

dap@ivanova omicron-play $ git diff
diff --git a/schema/crdb/dbinit.sql b/schema/crdb/dbinit.sql
index 255cdc813..a1b8887e2 100644
--- a/schema/crdb/dbinit.sql
+++ b/schema/crdb/dbinit.sql
@@ -3670,7 +3670,7 @@ INSERT INTO omicron.public.db_metadata (
     version,
     target_version
 ) VALUES
-    ( TRUE, NOW(), NOW(), '43.0.0', NULL)
+    ( TRUE, NOW(), NOW(), '44.0.0', NULL)
 ON CONFLICT DO NOTHING;
 
 COMMIT;

with the same result: after 17 minutes, I had 16 tests that had been hung for 16 minutes. This makes sense -- again, Nexus would be waiting for a catch-up that's never going to happen.

I don't expect this to be changed by this PR.

@davepacheco
Copy link
Collaborator Author

To test the merge/conflict behavior, I did this:

  • From the tip of this branch, create branch dap/schema-migration-better-demo1 with a trivial migration from 45 to 46. This passes the schema_versions tests.
  • From the same commit, create branch dap/schema-migration-better-demo2 with a different trivial migration from 45 to 46. This passes the schema_versions tests.
  • From the same commit, create branch dap/schema-migration-better-demo-merged:
    • This is our simulated "main" branch.
    • First, fast-forward merge dap/schema-migration-better-demo1. This simulates the first branch landing on "main".
    • Next, merge dap/schema-migration-better-demo2. This simulates the second branch trying to land on "main".
    • As expected, this generates a merge conflict that's pretty easy to resolve (more below).

Here's what the merge looks like. Starting from this PR's branch:

dap@ivanova omicron-iter $ git status -uno
On branch dap/schema-migration-better
Your branch is up to date with 'origin/dap/schema-migration-better'.

nothing to commit (use -u to show untracked files)

dap@ivanova omicron-iter $ git log -1
commit 89416f23fd02f24ebdb5e1f2890debda27034248 (HEAD -> dap/schema-migration-better, origin/dap/schema-migration-better)
Author: David Pacheco <dap@oxidecomputer.com>
Date:   Fri Mar 15 15:42:15 2024 -0700

    add example migration

Create the simulated "main" branch:

dap@ivanova omicron-iter $ git switch -c dap/schema-migration-better-merged
Switched to a new branch 'dap/schema-migration-better-merged'

Merge in one branch:

dap@ivanova omicron-iter $ git merge --ff-only dap/schema-migration-better-demo1
Updating 89416f23f..d98746135
Fast-forward
 nexus/db-model/src/schema_versions.rs        | 3 ++-
 schema/crdb/dbinit.sql                       | 2 +-
 schema/crdb/migration-conflict-demo-1/up.sql | 2 ++
 3 files changed, 5 insertions(+), 2 deletions(-)
 create mode 100644 schema/crdb/migration-conflict-demo-1/up.sql

Merge in the other:

dap@ivanova omicron-iter $ git merge dap/schema-migration-better-demo2
Auto-merging nexus/db-model/src/schema_versions.rs
CONFLICT (content): Merge conflict in nexus/db-model/src/schema_versions.rs
Automatic merge failed; fix conflicts and then commit the result.

Good! Here's what the conflict looks like (I have merge.conflictstyle=diff3 so it looks a little different than the default):

 22 /// List of all past database schema versions, in *reverse* order                 
 23 ///                                                                               
 24 /// If you want to change the Omicron database schema, you must update this.      
 25 static KNOWN_VERSIONS: Lazy<Vec<KnownVersion>> = Lazy::new(|| {                   
 26     vec![                                                                         
 27         // +- The next version goes here!  Duplicate this line, uncomment         
 28         // |  the *second* copy, then update that copy for your version,          
 29         // |  leaving the first copy as an example for the next person.           
 30         // v                                                                      
 31         // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),      
 32 <<<<<<< HEAD                                                                      
 33         KnownVersion::new(46, "migration-conflict-demo-1"),                       
 34 ||||||| 89416f23f                                                                 
 35 =======                                                                           
 36         KnownVersion::new(46, "migration-conflict-demo-2"),                       
 37 >>>>>>> dap/schema-migration-better-demo2                                         
 38         KnownVersion::new(45, "migration-demo"),                                  
 39         // The first many schema versions only vary by major or patch number and  
 40         // their path is predictable based on the version number.  (This was      
 41         // historically a problem because two pull requests both adding a new     
 42         // schema version might merge cleanly but produce an invalid result.)     
 43         KnownVersion::legacy(44, 0),                                              
 44         KnownVersion::legacy(43, 0),                                              

The hand-merged result looks like this:

 27         // +- The next version goes here!  Duplicate this line, uncomment
 28         // |  the *second* copy, then update that copy for your version,
 29         // |  leaving the first copy as an example for the next person.
 30         // v
 31         // KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
 32         KnownVersion::new(47, "migration-conflict-demo-2"),
 33         KnownVersion::new(46, "migration-conflict-demo-1"),
 34         KnownVersion::new(45, "migration-demo"),
 35         // The first many schema versions only vary by major or patch number and
 36         // their path is predictable based on the version number.  (This was
 37         // historically a problem because two pull requests both adding a new
 38         // schema version might merge cleanly but produce an invalid result.)

You also need to remember to update SCHEMA_VERSION above and dbinit.sql (which is annoying, but tests will at least not pass if you forget either or both of these).

All three of these branches are on GitHub if folks want to look or try this out.

@davepacheco davepacheco marked this pull request as ready for review March 15, 2024 23:17
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, I hope it makes the process a little less conflict prone!

Comment on lines 78 to 79
these `schema/crdb/NAME/upN.sql` for as many `N` as you need, staring with
`N=1`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
these `schema/crdb/NAME/upN.sql` for as many `N` as you need, staring with
`N=1`.
these `schema/crdb/NAME/upN.sql` for as many `N` as you need, starting with
`N=1`.

Comment on lines +89 to +91
* Update `schema/crdb/dbinit.sql`:
** Update the SQL statements to match what the database should look like
after your up*.sql files are applied.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically, I do this first, and save my schema changes for later. I wonder if we should recommend this first? "All tests except the schema tests" should pass if you only change dbinit.sql, and don't change the schema version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea and I looked at it but it makes the instructions more complicated because up front you update the schema parts of dbinit.sql and later you go back to it. Reading it straight through it reads like "why did you tell me to go back to the same file twice"? I'd welcome edits here but otherwise I'm just going to leave it.

I also think this will be less necessary now. I did the same as you because I was worried about dealing with merges, but now that's pretty easy so I will probably just go ahead and do the migrations earlier than I used to.

// | leaving the first copy as an example for the next person.
// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(45, "migration-demo"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this in a way that encourages "only bumping major version" for now!

As a super nitpicky nitpick, can we change the name from migration-demo to like first-named-migration? At first glance, I thought it was test data, but I realize it actually is a durable long-lived migration.

Copy link
Contributor

@jgallagher jgallagher Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll second the super nitpicky nitpick; the name threw me off also. (I was wondering if it was a demo that should be removed before this lands.)

Comment on lines 97 to 98
/// All versions have an associated SemVer. We only use the major number.
semver: SemverVersion,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"We only use the major number"

This isn't strictly true, we do keep them ordered by minor/patch versions too, we just can't ensure backwards compatibility and recommend only bumping the major number.

I only bring this up because there already are legacy versions that only differ by minor versions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what I meant was not "only the major number is ever non-zero" but rather "we only use the major number to determine compatibility". I'll clarified this comment.

// | leaving the first copy as an example for the next person.
// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(45, "migration-demo"),
Copy link
Contributor

@jgallagher jgallagher Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll second the super nitpicky nitpick; the name threw me off also. (I was wondering if it was a demo that should be removed before this lands.)

/// from the previous version to this version.
fn new(major: u64, relative_path: &str) -> KnownVersion {
let semver = SemverVersion::new(major, 0, 0);
KnownVersion { semver, relative_path: relative_path.to_owned() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to propose that we make the relative path for these {semver}-{name} so that the directory also remains sorted. But I don't feel super strongly about this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this but it makes the instructions for merging more complicated (you have to git mv your directory as part of resolving a conflict, and that wouldn't necessarily be obvious) and I'm not sure it helps that much because they don't appear in sorted order in ls(1) output and on GitHub anyway (because it's a lexicographic sort, not a semver-aware sort).

let target_versions: Vec<&SchemaVersion> = all_versions
.versions_range((
Bound::Excluded(&found_version),
Bound::Included(&desired_version),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nitpick and I realize this didn't change, but do you think this would be clearer as .versions_range(&found_version..=&desired_version)? I don't usually see explicit Bound tuples.

@davepacheco
Copy link
Collaborator Author

Here's the heads-up mail I plan to send about this.


If you do not actively work on Omicron, you can ignore this mail.

This pull request:

https://github.com/oxidecomputer/omicron/pull/5274 make it a little harder to mess up new db schema migrations

changes the way developers make changes to the Omicron (CockroachDB) database schema. The official instructions have been updated:

https://github.com/oxidecomputer/omicron/blob/dap/schema-migration-better/schema/crdb/README.adoc#2-how-to-change-the-schema

The main change to how you change the database schema is:

  • You previously would create a directory named for the new schema version. Now, you choose any descriptive name you want for the directory.
  • You previously would update the SCHEMA_VERSION constant. Now you update that and an array of KNOWN_VERSIONS.

The goals are that:

  • If two people try to add the same schema version, one of these PRs will generate a Git conflict with the other. (Previously, they would both get combined and then break "main" when the second one landed.)
  • If you have to deal with this kind of conflict, it's pretty straightforward to resolve: mostly, you just need to fix up the KNOWN_VERSIONS array to include the upstream version as well as yours.

This is all in the new instructions. Please reach out if you have any trouble!


// | leaving the first copy as an example for the next person.
// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(46, "first-named-migration"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came up in #oxide-update today, but some schema changes will require multiple versions to be correct. (Came up for #5032 earlier, and also for #5287 which requires three migrations.) It would be nice to be able to express an ordered list of versions this way, maybe within a single directory have a bunch of subdirectories.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Though we think we can simplify this constraint and not require that, by changing the schema version after each file rather than each directory.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Though we think we can simplify this constraint and not require that, by changing the schema version after each file rather than each directory.)

@jgallagher is going to be doing this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't quite follow the problem. @sunshowers Is this PR blocked on that work? Does it make this situation worse?

Came up in #oxide-update today, but some schema changes will require multiple versions to be correct. (Came up for #5032 earlier, and also for #5287 which requires three migrations.) It would be nice to be able to express an ordered list of versions this way, maybe within a single directory have a bunch of subdirectories.

For a PR that needs to land more than one separate schema update, is there any reason not to just define two new KnownVersion values with separate major numbers? i.e., why does this use case not fit into the infrastructure we already have / what's added in this PR?

Comment on lines +106 to +108
let target_versions: Vec<&SchemaVersion> = all_versions
.versions_range(&found_version..=&desired_version)
.collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that I'm really late on this, but after rebasing on top of this locally -- this snippet changes the behavior of updates, by changing the bounds.

  • Before, the "current version" was excluded from the set of target versions (Bound::Excluded(&current_version)).
  • Now, it's included (in the &found_version.. syntax).

This means that after this PR has merged, the "current" version is always re-applied during schema migrations, even if it has been fully applied before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! That's definitely my bad; should've been .. instead of ..=.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's exactly right, I think .. is "start inclusive, end exclusive". I believe we actually want "start exclusive, end inclusive", which doesn't have a short-hand syntax. That's why I was using std::ops::Bound::Excluded explicitly before

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooof. Doubly sorry then. :( Do you want to fix this on #5293 or should I open a fix just for this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm patching it in 5293, so no need for something else unless the urgency seems like it should be higher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants