apparent cockroachdb data corruption due to cockroachdb issue 74475 #980

Closed
davepacheco opened this issue Apr 27, 2022 · 10 comments · Fixed by #988

@davepacheco (Collaborator)

tl;dr: CockroachDB v21.1.10 has a bug where queries can return "incorrect results" due to index corruption "when two or more partial indexes in the same table had identical WHERE clauses". Ben appeared to observe this under #955. I'm still doing a bit more confirmation, but strong evidence suggests we hit that CockroachDB bug. Fortunately, it's reported fixed in v21.1.13. We'll want to update to at least that in order to land #955.
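For reference, the schema shape that triggers cockroachdb#74475 looks like this (a minimal sketch, not the exact Omicron schema -- compare the vpc table in dbinit.sql below):

```sql
CREATE TABLE t (
    id UUID PRIMARY KEY,
    name STRING,
    vni INT8,
    time_deleted TIMESTAMPTZ
);

/* Two partial indexes whose WHERE clauses are identical -- the
 * combination that v21.1.10 mishandles, leaving stale index entries. */
CREATE UNIQUE INDEX ON t (name) WHERE time_deleted IS NULL;
CREATE UNIQUE INDEX ON t (vni) WHERE time_deleted IS NULL;
```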

@davepacheco (Collaborator, Author)

Here I'll try to describe as much of the debugging process and data as I recorded and remember. It's not quite as complete and in-order as I'd like because I didn't quite record everything, but I think it's a good summary. @bnaecker, let me know what I've missed!

Initial observations

Ben reported that at the tip of #955, commit 97ceda5, if you run cargo test -- vpcs::test_vpcs, it fails with this error:

running 1 test
test integration_tests::vpcs::test_vpcs ... FAILED

failures:

---- integration_tests::vpcs::test_vpcs stdout ----
log file: "/tmp/test_all-3e48cef0e98c659b-test_vpcs.24246.0.log"
note: configured to log to "/tmp/test_all-3e48cef0e98c659b-test_vpcs.24246.0.log"
thread 'integration_tests::vpcs::test_vpcs' panicked at 'called `Result::unwrap()` on an `Err` value: expected status code 404 Not Found, found 200 OK', nexus/tests/integration_tests/vpcs.rs:152:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /tmp/.tmpXXyEjL
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)


failures:
    integration_tests::vpcs::test_vpcs

Ben summarized:

The test tries to rename a VPC from "just-rainsticks" to "new-name", and then asserts that getting a VPC named "just-rainsticks" now returns a 404. However, in that branch, that assertion fails: the GET request actually returns 200.

Ben also reported that if you apply this patch, the test succeeds:

diff --git a/common/src/sql/dbinit.sql b/common/src/sql/dbinit.sql
index d5d449ce..106f9920 100644
--- a/common/src/sql/dbinit.sql
+++ b/common/src/sql/dbinit.sql
@@ -606,10 +606,12 @@ CREATE UNIQUE INDEX ON omicron.public.vpc (
 ) WHERE
     time_deleted IS NULL;

+/*
 CREATE UNIQUE INDEX ON omicron.public.vpc (
     vni
 ) WHERE
     time_deleted IS NULL;
+*/

 CREATE TABLE omicron.public.vpc_subnet (
     /* Identity metadata (resource) */

While debugging this, Ben introduced a "sleep" in the test so he could look at the database state. Here's some output:

root@127.0.0.1:34582/omicron> select project_id, id, name from vpc where time_deleted is NULL;
              project_id              |                  id                  |      name
---------------------------------------+--------------------------------------+------------------
 32bd312d-5d6d-47b5-94c2-808144c6cfc8 | 2cad54d9-800e-4efe-8acd-d93d97985657 | default
 32bd312d-5d6d-47b5-94c2-808144c6cfc8 | bbfd3544-ed39-4782-8be6-8f6159524ace | just-rainsticks
 4a3b4e82-a2fc-4e77-9a03-47693a6966ca | cecb0fc4-f2c1-4568-a610-a04d330cb897 | default
 4a3b4e82-a2fc-4e77-9a03-47693a6966ca | e5930e0a-de63-4319-ba53-e663879da1a8 | just-rainsticks
(4 rows)

Time: 1ms total (execution 1ms / network 0ms)

... (some time later)

root@127.0.0.1:34582/omicron> select project_id, id, name from vpc;
              project_id              |                  id                  |      name
---------------------------------------+--------------------------------------+------------------
 32bd312d-5d6d-47b5-94c2-808144c6cfc8 | 2cad54d9-800e-4efe-8acd-d93d97985657 | default
 32bd312d-5d6d-47b5-94c2-808144c6cfc8 | bbfd3544-ed39-4782-8be6-8f6159524ace | just-rainsticks
 4a3b4e82-a2fc-4e77-9a03-47693a6966ca | cecb0fc4-f2c1-4568-a610-a04d330cb897 | default
 4a3b4e82-a2fc-4e77-9a03-47693a6966ca | e5930e0a-de63-4319-ba53-e663879da1a8 | new-name
(4 rows)

Time: 1ms total (execution 1ms / network 0ms)

root@127.0.0.1:34582/omicron>

That's weird -- it looks as though the rename happened while he was looking at it, despite the test being stopped.

Side note on reproducing

The problem is wholly reproducible from commit 97ceda5 by running cargo test -- vpcs::test_vpcs. But in order to build that commit, you need to successfully clone the private GitHub repo that's used by that PR. With Ben's help, I did this by:

  1. Creating a new GitHub Personal Access Token with the "repo" scope.
  2. Running git config credential.helper "cache" (see git-config and git-credentials man pages)
  3. Setting export CARGO_NET_GIT_FETCH_WITH_CLI=true in the environment
  4. Building as usual. I was prompted for Username for 'https://github.com' (which is your GitHub username) and Password for 'https://davepacheco@github.com' (for which I supplied the personal access token).

Neat -- I didn't know you could use a personal access token to clone over https.

First observations from the log file

All the integration tests record a detailed log file from Nexus. Take the first failure I saw:

$ cargo test -p omicron-nexus -- vpcs::test_vpcs
...
     Running unittests (target/debug/deps/omicron_nexus-50676ab4881f19e0)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 77 filtered out; finished in 0.00s

     Running unittests (target/debug/deps/nexus-1efadc5cd49ea032)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/test_all.rs (target/debug/deps/test_all-79504f3d97c965fe)

running 1 test
test integration_tests::vpcs::test_vpcs ... FAILED

failures:

---- integration_tests::vpcs::test_vpcs stdout ----
log file: "/dangerzone/omicron_tmp/test_all-79504f3d97c965fe-test_vpcs.19837.0.log"
note: configured to log to "/dangerzone/omicron_tmp/test_all-79504f3d97c965fe-test_vpcs.19837.0.log"
thread 'integration_tests::vpcs::test_vpcs' panicked at 'called `Result::unwrap()` on an `Err` value: expected status code 404 Not Found, found 200 OK', nexus/tests/integration_tests/vpcs.rs:152:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/.tmpTGIqPX
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)


failures:
    integration_tests::vpcs::test_vpcs

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 69 filtered out; finished in 4.07s

error: test failed, to rerun pass '-p omicron-nexus --test test_all'

The log file starts out like this:

dap@ivanova omicron $ bunyan "/dangerzone/omicron_tmp/test_all-79504f3d97c965fe-test_vpcs.19837.0.log"
[2022-04-27T20:43:16.758207675Z]  INFO: test_vpcs/19837 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/.tmpTGIqPX
[2022-04-27T20:43:16.758517002Z]  INFO: test_vpcs/19837 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-409ced5eb99456d0/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/.tmpTGIqPX/data)
[2022-04-27T20:43:16.773075947Z]  INFO: test_vpcs/19837 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store /dangerzone/omicron_tmp/.tmpTGIqPX/data --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/.tmpTGIqPX/listen-url
[2022-04-27T20:43:17.114342697Z]  INFO: test_vpcs/19837 on ivanova: cockroach pid: 19842
[2022-04-27T20:43:17.115073023Z]  INFO: test_vpcs/19837 on ivanova: cockroach listen URL: postgresql://root@127.0.0.1:57400/omicron?sslmode=disable

That last line gives you the URL that CockroachDB is listening on. We used that a lot here.

Anyway, if we filter the output a bit, we see what happened:

$ grep -A2 -B1 'request completed' "/dangerzone/omicron_tmp/test_all-79504f3d97c965fe-test_vpcs.19837.0.log" | bunyan
...
[2022-04-27T20:43:20.57842754Z]  INFO: test_vpcs/external client test context/19837 on ivanova: client request (method=PUT)
    body: Body(Full(b"{\"name\":\"new-name\",\"description\":\"another description\",\"dns_name\":\"def\"}"))
    --
    uri: http://127.0.0.1:55493/organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
--
[2022-04-27T20:43:20.711343153Z] DEBUG: 879771b2-6096-4bde-b268-9f39e2fb196a/dropshot_external/19837 on ivanova: authorize result (req_id=cc3dfbe8-69c5-4b93-a4e3-8424aebf6f97, authenticated=true, method=PUT, remote_addr=127.0.0.1:55539, local_addr=127.0.0.1:55493, result=Ok(()), resource=Database, action=Query)
    actor: Some(Actor { id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000 })
    --
    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
[2022-04-27T20:43:20.714463975Z]  INFO: 879771b2-6096-4bde-b268-9f39e2fb196a/dropshot_external/19837 on ivanova: request completed (req_id=cc3dfbe8-69c5-4b93-a4e3-8424aebf6f97, method=PUT, remote_addr=127.0.0.1:55539, local_addr=127.0.0.1:55493, response_code=200)
    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
[2022-04-27T20:43:20.714894822Z]  INFO: test_vpcs/external client test context/19837 on ivanova: client received response (status=200)
[2022-04-27T20:43:20.715144897Z]  INFO: test_vpcs/external client test context/19837 on ivanova: client request (body=Body(Empty), method=GET)
    uri: http://127.0.0.1:55493/organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
--
[2022-04-27T20:43:20.804554543Z] DEBUG: 879771b2-6096-4bde-b268-9f39e2fb196a/dropshot_external/19837 on ivanova: authorize result (req_id=72bfc838-ffbf-4eb5-b1c9-e415eed74185, authenticated=true, method=GET, remote_addr=127.0.0.1:55539, local_addr=127.0.0.1:55493, result=Ok(()), action=Read)
    actor: Some(Actor { id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000 })
    --
    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
    --
    resource: Vpc { parent: Project { parent: Organization { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: dc01dd8b-49f9-4541-b16a-dbd7f232f2cf, lookup_type: ByName("test-org") }, key: ed07142d-4669-4055-a805-284d57d36980, lookup_type: ByName("springfield-squidport") }, key: ebd5ac54-709a-41b9-9ee1-add51a14dfa0, lookup_type: ByName("just-rainsticks") }
[2022-04-27T20:43:20.806442074Z]  INFO: 879771b2-6096-4bde-b268-9f39e2fb196a/dropshot_external/19837 on ivanova: request completed (req_id=72bfc838-ffbf-4eb5-b1c9-e415eed74185, method=GET, remote_addr=127.0.0.1:55539, local_addr=127.0.0.1:55493, response_code=200)
    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
[2022-04-27T20:43:20.807398559Z]  INFO: test_vpcs/external client test context/19837 on ivanova: client received response (status=200)

This is showing us:

  1. The client makes a PUT request to /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks to rename the VPC to "new-name".
  2. The server successfully completes that request.
  3. The client makes a GET to /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks.
  4. As part of handling this GET request, the server does an authz check on an actual Vpc resource (more below).
  5. The server successfully completes the GET.

The log message for the authz check tells us a bunch:

    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
    --
    resource: Vpc { parent: Project { parent: Organization { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: dc01dd8b-49f9-4541-b16a-dbd7f232f2cf, lookup_type: ByName("test-org") }, key: ed07142d-4669-4055-a805-284d57d36980, lookup_type: ByName("springfield-squidport") }, key: ebd5ac54-709a-41b9-9ee1-add51a14dfa0, lookup_type: ByName("just-rainsticks") }

It's not super obvious from the debug output, but that's telling us that we found Fleet 001de000-5110-4000-8000-000000000000 after looking it up by id (the sole, singleton Fleet in every control plane instance today), then we found Organization dc01dd8b-49f9-4541-b16a-dbd7f232f2cf after looking it up by name (test-org) within the Fleet, then we found Project ed07142d-4669-4055-a805-284d57d36980 after looking it up by name (springfield-squidport) within the Organization, then we found VPC ebd5ac54-709a-41b9-9ee1-add51a14dfa0 after looking it up by name (just-rainsticks) within the Project. The authz check succeeded so I'm not digging into the actor or roles.
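Conceptually, each ByName step in that chain is a scoped query along these lines (illustrative SQL only -- the real queries are generated by Diesel):

```sql
/* Organization by name within the silo: */
SELECT id FROM organization
WHERE silo_id = '001de000-5110-4000-8000-000000000000'
  AND name = 'test-org' AND time_deleted IS NULL;

/* Project by name within that organization: */
SELECT id FROM project
WHERE organization_id = 'dc01dd8b-49f9-4541-b16a-dbd7f232f2cf'
  AND name = 'springfield-squidport' AND time_deleted IS NULL;

/* VPC by name within that project: */
SELECT id FROM vpc
WHERE project_id = 'ed07142d-4669-4055-a805-284d57d36980'
  AND name = 'just-rainsticks' AND time_deleted IS NULL;
```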

From the log file, we can tell:

  • The test does what we think: renames a VPC, then tries to get it by the old name.
  • The rename (PUT) succeeded.
  • The fetch (GET) also succeeded, even though it used the old name.
  • The PUT and GET requests did not overlap in time -- the PUT really completed before the GET started.

Questions at this point:

  • Did the rename actually happen? (Did it fail silently? Is it unexpectedly asynchronous?) A quick review of the PUT code path doesn't raise red flags. If the request succeeded, the rename should have happened.
  • Why does this only affect VPCs? We have tests that do similar things with other resources. (We did not exhaustively check that all those worked, though.)
  • What exactly is the database state between the PUT and GET request? Does it reflect the rename?
  • Assuming the database does reflect the rename at that point, is it possible that the lookup code path found itself with a CockroachDB connection with an older transaction still open, and so was looking at an older snapshot in time? This seemed unlikely because (1) we should be clearing this sort of connection state when we return connections to the pool, and (2) we don't use BEGIN anywhere (I checked), and the Diesel transaction() interface presumably closes things properly.
  • Is it possible that something was inadvertently using AS OF SYSTEM TIME to get an older snapshot of the database? Seems unlikely, but I did a quick grep to rule it out.
  • If the rename is synchronous (as it should be, and as our cursory code audit suggested), why did Ben see the state change when he was poking at the database?

First observations about the database state

The test suite deliberately leaves around the database directory for failed tests. Little-known fact: you can spin up cockroach again on this directory and poke at the state! I ran (using the directory from the log output above):

$ cargo run --bin=omicron-dev -- db-run --store-dir=/dangerzone/omicron_tmp/.tmpTGIqPX/data

This was wrong -- I forgot to use --no-populate.

$ cargo run --bin=omicron-dev -- db-run --no-populate --store-dir=/dangerzone/omicron_tmp/.tmpTGIqPX/data

That successfully started it up. I checked the recent log files from CockroachDB in that directory hoping to find some red flag about a client having done something goofy with transactions...but found none.

I forgot to save the output of the database state at this point...but since the problem is so reproducible, I can show output from similar situations.

Tweak the test to better see what's going on

Skipping ahead a bit (not that much): we eventually added two sleeps to the test: one immediately before the rename, and one immediately after. After a bunch of iteration, I put together this invocation, which I ran during each sleep:

$ cockroach sql --url postgresql://root@127.0.0.1:56789/omicron?sslmode=disable -e 'SELECT time_deleted,id from silo; select time_deleted,silo_id,id,name from organization; SELECT time_deleted,organization_id,id,name from project; SELECT time_deleted,project_id,id,name from vpc;'

This dumps the Silos, Organizations, Projects, and VPCs in the database. Note that the URL there comes from the beginning of the log file emitted by the test, as I showed above. So we have the database state both before and after the rename. It looks like this:

dap@ivanova omicron $ cockroach sql --url postgresql://root@127.0.0.1:62544/omicron?sslmode=disable -e 'SELECT time_deleted,id from silo; select time_deleted,silo_id,id,name from organization; SELECT time_deleted,organization_id,id,name from project; SELECT time_deleted,project_id,id,name from vpc;'
  time_deleted |                  id
---------------+---------------------------------------
  NULL         | 001de000-5110-4000-8000-000000000000
(1 row)

  time_deleted |               silo_id                |                  id                  |   name
---------------+--------------------------------------+--------------------------------------+-----------
  NULL         | 001de000-5110-4000-8000-000000000000 | 536ad0d1-c257-4794-a3cb-525368fe911e | test-org
(1 row)

  time_deleted |           organization_id            |                  id                  |         name
---------------+--------------------------------------+--------------------------------------+------------------------
  NULL         | 536ad0d1-c257-4794-a3cb-525368fe911e | 187c298c-049f-4674-af07-e22adef6ba1a | pokemon
  NULL         | 536ad0d1-c257-4794-a3cb-525368fe911e | 8d2ead0c-ffef-4746-a371-d24ab0188dfa | springfield-squidport
(2 rows)

  time_deleted |              project_id              |                  id                  |      name
---------------+--------------------------------------+--------------------------------------+------------------
  NULL         | 8d2ead0c-ffef-4746-a371-d24ab0188dfa | 0d64a87e-8a8f-44ea-afe5-adb3c403fc4d | default
  NULL         | 8d2ead0c-ffef-4746-a371-d24ab0188dfa | ce7220a9-d14b-4d01-b7c9-c3db9c6f0e19 | just-rainsticks
  NULL         | 187c298c-049f-4674-af07-e22adef6ba1a | f92f235f-a4e0-4393-a88d-08c3f93c487a | default
  NULL         | 187c298c-049f-4674-af07-e22adef6ba1a | ffc4d60f-f69f-4440-953c-c6a55801c116 | just-rainsticks
(4 rows)

dap@ivanova omicron $ 
dap@ivanova omicron $ 
dap@ivanova omicron $ cockroach sql --url postgresql://root@127.0.0.1:62544/omicron?sslmode=disable -e 'SELECT time_deleted,id from silo; select time_deleted,silo_id,id,name from organization; SELECT time_deleted,organization_id,id,name from project; SELECT time_deleted,project_id,id,name from vpc;'
  time_deleted |                  id
---------------+---------------------------------------
  NULL         | 001de000-5110-4000-8000-000000000000
(1 row)

  time_deleted |               silo_id                |                  id                  |   name
---------------+--------------------------------------+--------------------------------------+-----------
  NULL         | 001de000-5110-4000-8000-000000000000 | 536ad0d1-c257-4794-a3cb-525368fe911e | test-org
(1 row)

  time_deleted |           organization_id            |                  id                  |         name
---------------+--------------------------------------+--------------------------------------+------------------------
  NULL         | 536ad0d1-c257-4794-a3cb-525368fe911e | 187c298c-049f-4674-af07-e22adef6ba1a | pokemon
  NULL         | 536ad0d1-c257-4794-a3cb-525368fe911e | 8d2ead0c-ffef-4746-a371-d24ab0188dfa | springfield-squidport
(2 rows)

  time_deleted |              project_id              |                  id                  |      name
---------------+--------------------------------------+--------------------------------------+------------------
  NULL         | 8d2ead0c-ffef-4746-a371-d24ab0188dfa | 0d64a87e-8a8f-44ea-afe5-adb3c403fc4d | default
  NULL         | 8d2ead0c-ffef-4746-a371-d24ab0188dfa | ce7220a9-d14b-4d01-b7c9-c3db9c6f0e19 | new-name
  NULL         | 187c298c-049f-4674-af07-e22adef6ba1a | f92f235f-a4e0-4393-a88d-08c3f93c487a | default
  NULL         | 187c298c-049f-4674-af07-e22adef6ba1a | ffc4d60f-f69f-4440-953c-c6a55801c116 | just-rainsticks
(4 rows)

For this particular invocation, the log message for the failing request was:

[2022-04-27T21:41:02.881876775Z] DEBUG: 63f76cc0-2b8a-4ff1-bdb2-b79006d37330/dropshot_external/23492 on ivanova: authorize result (req_id=3bfee5e4-2e3d-4853-a389-ce13530d96b4, authenticated=true, method=GET, remote_addr=127.0.0.1:35492, local_addr=127.0.0.1:52793, result=Ok(()), action=Read)
    actor: Some(Actor { id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000 })
    --
    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks
    --
    resource: Vpc { parent: Project { parent: Organization { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: 536ad0d1-c257-4794-a3cb-525368fe911e, lookup_type: ByName("test-org") }, key: 8d2ead0c-ffef-4746-a371-d24ab0188dfa, lookup_type: ByName("springfield-squidport") }, key: ce7220a9-d14b-4d01-b7c9-c3db9c6f0e19, lookup_type: ByName("just-rainsticks") }
[2022-04-27T21:41:02.883479055Z]  INFO: 63f76cc0-2b8a-4ff1-bdb2-b79006d37330/dropshot_external/23492 on ivanova: request completed (req_id=3bfee5e4-2e3d-4853-a389-ce13530d96b4, method=GET, remote_addr=127.0.0.1:35492, local_addr=127.0.0.1:52793, response_code=200)
    uri: /organizations/test-org/projects/springfield-squidport/vpcs/just-rainsticks

We see here that VPC ce7220a9-d14b-4d01-b7c9-c3db9c6f0e19 was renamed from "just-rainsticks" to "new-name", and that the GET request successfully found that VPC (that is, the one with the same id) by its old name. So it's not as though we somehow looked it up by the new name or by id, or found a different VPC.

A distracting bit: you might notice the test creates two projects, and a VPC called "just-rainsticks" inside each one. Is it possible we grabbed the wrong one? No, the ids confirm that we incorrectly found the renamed VPC, not the same-named one in a different Project. (Why does the test do this? It's trying to verify that VPC names can be reused across Projects.)

Anyway, this answers some earlier questions:

  • Did the rename actually happen? (Yes.)
  • What exactly is the database state between the PUT and GET request? Does it reflect the rename? (Yes.)

We might also ask:

  • Does Nexus cache the results of the lookup? (No, for better or worse, there's no object caching within Nexus. We went through the fetch() code path (generated by the macro) to look for red flags, but it appears to do what it says: a SELECT for rows from Vpc having the right name and parent project_id and with time_deleted IS NULL.)

We still don't know:

  • Why does this only seem to affect VPCs?
  • Assuming the database does reflect the rename at that point, is it possible that the lookup code path found itself with a CockroachDB connection with an older transaction still open, and so was looking at an older snapshot in time?
  • If the rename is synchronous (as it should be, and as our cursory code audit suggested), why did Ben see the state change when he was poking at the database?

Debugging interactively

At this point, we noticed that we had a 30-second sleep after the rename and before the GET, and even though the database state was clearly correct during that window, the subsequent GET still incorrectly found the VPC by its old name! That rules out any small race window, since we know the GET happened after we confirmed the database state was correct.

(It was around this point CockroachDB became a person of interest in the investigation. We'd obviously considered "data corruption in the database" earlier, but it seemed unlikely to me because (1) we'd never seen anything like this in the year-plus of thousands of test runs, even in the form of flaky tests or other unexplained weirdness, and (2) it was so reproducible now. These aren't really consistent with most forms of database corruption I've experienced.)

If there's no race here, maybe we can reproduce this interactively with just CockroachDB, Nexus, and the CLI -- no test suite at all. We spun up Cockroach and Nexus in the usual way. Then we created an Organization, Project, and VPC with the CLI:

dap@ivanova omicron $ oxide org create o1
organization description:: asdf
✔ Created organization o1

dap@ivanova omicron $ oxide project create --organization=o1 p1
project description:: asdflkj
✔ Created project o1/p1

dap@ivanova omicron $ oxide vpc create --organization=o1 --project=p1 vpc1
VPC description:: a
VPC dns_name:: foo
✔ Created VPC vpc1 in o1/p1

Looks good:

dap@ivanova omicron $ oxide api /organizations/o1/projects/p1/vpcs/vpc1
{
  "description": "a",
  "dns_name": "foo",
  "id": "1baca315-b0e3-4df8-8be5-710456eb1170",
  "ipv6_prefix": "fd81:3af7:17d2::/48",
  "name": "vpc1",
  "project_id": "ba8d746e-a159-409a-a1fc-58db34105fe4",
  "system_router_id": "d5f63635-2445-4185-a5d3-1567d277d854",
  "time_created": "2022-04-27T21:58:44.985752Z",
  "time_modified": "2022-04-27T21:58:44.985752Z"
}

Now, let's rename it. I didn't see a CLI command for this, but we can do it easily enough with the raw API subcommand:

$ echo '{"name": "vpc2"}' | oxide api -X PUT /organizations/o1/projects/p1/vpcs/vpc1 --input -
{
  "description": "a",
  "dns_name": "foo",
  "id": "1baca315-b0e3-4df8-8be5-710456eb1170",
  "ipv6_prefix": "fd81:3af7:17d2::/48",
  "name": "vpc2",
  "project_id": "ba8d746e-a159-409a-a1fc-58db34105fe4",
  "system_router_id": "d5f63635-2445-4185-a5d3-1567d277d854",
  "time_created": "2022-04-27T21:58:44.985752Z",
  "time_modified": "2022-04-27T22:01:12.660347Z"
}

Note that the "name" has correctly been set to "vpc2". Now the moment of truth: can we fetch it with the old name ("vpc1")?

dap@ivanova omicron $ oxide api /organizations/o1/projects/p1/vpcs/vpc1
{
  "description": "a",
  "dns_name": "foo",
  "id": "1baca315-b0e3-4df8-8be5-710456eb1170",
  "ipv6_prefix": "fd81:3af7:17d2::/48",
  "name": "vpc2",
  "project_id": "ba8d746e-a159-409a-a1fc-58db34105fe4",
  "system_router_id": "d5f63635-2445-4185-a5d3-1567d277d854",
  "time_created": "2022-04-27T21:58:44.985752Z",
  "time_modified": "2022-04-27T22:01:12.660347Z"
}

Boom! And whoa! It has "name" = "vpc2". Does the new name work too?

$ oxide api /organizations/o1/projects/p1/vpcs/vpc2
404 Not Found Not Found

Yikes. It's not looking good for CockroachDB. At this point, I wanted to try two more things:

  1. Try shutting down Nexus and starting it up again. Does the problem remain?
  2. Use the DTrace probes in Nexus to get the exact query that we're sending to CockroachDB. Run that in an interactive SQL shell. Will it lie directly to our faces?

The first was easy enough: I killed Nexus, reran the checks to make sure I was pointing the client at the right thing:

dap@ivanova omicron $ oxide api /organizations/o1/projects/p1/vpcs/vpc2
error sending request for url (http://localhost:12220/organizations/o1/projects/p1/vpcs/vpc2): error trying to connect: tcp connect error: Connection refused (os error 146)
dap@ivanova omicron $ oxide api /organizations/o1/projects/p1/vpcs/vpc1
error sending request for url (http://localhost:12220/organizations/o1/projects/p1/vpcs/vpc1): error trying to connect: tcp connect error: Connection refused (os error 146)

Good. Then I restarted Nexus. (It didn't rebuild or anything -- exact same code.) Reran the check:

dap@ivanova omicron $ oxide api /organizations/o1/projects/p1/vpcs/vpc1
{
  "description": "a",
  "dns_name": "foo",
  "id": "1baca315-b0e3-4df8-8be5-710456eb1170",
  "ipv6_prefix": "fd81:3af7:17d2::/48",
  "name": "vpc2",
  "project_id": "ba8d746e-a159-409a-a1fc-58db34105fe4",
  "system_router_id": "d5f63635-2445-4185-a5d3-1567d277d854",
  "time_created": "2022-04-27T21:58:44.985752Z",
  "time_modified": "2022-04-27T22:01:12.660347Z"
}

dap@ivanova omicron $ oxide api /organizations/o1/projects/p1/vpcs/vpc2
404 Not Found Not Found

This rules out a lot of possible bugs in Nexus, particularly this open question we had from earlier:

  • Assuming the database does reflect the rename at that point, is it possible that the lookup code path found itself with a CockroachDB connection with an older transaction still open, and so was looking at an older snapshot in time? (No, because it persists even across a Nexus restart, where all connections are now fresh and clean.)

Isolating CockroachDB

We're starting to suspect a problem with the CockroachDB index.
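One way to interrogate the indexes directly is to list them, then force the same lookup through each one with an index hint: if a partial index holds stale entries, a read forced through it will disagree with a scan of the primary index. (The index name below is hypothetical -- substitute one reported by SHOW INDEXES.)

```sql
/* List the table's indexes: */
SHOW INDEXES FROM vpc;

/* Force the lookup through a specific (hypothetical) partial index: */
SELECT id, name FROM vpc@vpc_name_key
WHERE time_deleted IS NULL AND name = 'vpc1'
  AND project_id = 'ba8d746e-a159-409a-a1fc-58db34105fe4';

/* Compare with a scan through the primary index: */
SELECT id, name FROM vpc@primary
WHERE time_deleted IS NULL AND name = 'vpc1'
  AND project_id = 'ba8d746e-a159-409a-a1fc-58db34105fe4';
```

We didn't end up needing this, since the DTrace-captured query below made the discrepancy visible directly.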

I had previously created a modified copy of the trace-db-queries DTrace script from the Omicron repo that traces all processes:

#!/usr/sbin/dtrace -qs

#pragma strsize=16k

/* Trace all queries to the control plane database with their latency */

dtrace:::BEGIN
{
    printf("Tracing all database queries for all pids, use Ctrl-C to exit\n");
}

diesel-db*:::query-start
{
    this->conn_id = json(copyinstr(arg1), "ok");
    ts[this->conn_id] = timestamp;
    query[this->conn_id] = copyinstr(arg2);
}

diesel-db*:::query-done
{
    this->conn_id = json(copyinstr(arg1), "ok");
}

diesel-db*:::query-done
/ts[this->conn_id]/
{
    this->latency = (timestamp - ts[this->conn_id]) / 1000;
    printf(
        "%Y.%09d pid: %d, conn_id: %s, latency: %lu us, query:\n    '%s'\n",
	walltimestamp,
	walltimestamp % 1000000000,
	pid,
        this->conn_id,
        this->latency,
        query[this->conn_id]
    );
    ts[this->conn_id] = 0;
    query[this->conn_id] = NULL;
}

I started that:

$ pfexec dtrace -xstrsize=16k -Zq -s trace-db-queries.d 
Tracing all database queries for all pids, use Ctrl-C to exit

then reran the CLI command:

$ oxide api /organizations/o1/projects/p1/vpcs/vpc2
404 Not Found Not Found

The last query from the D script:

2022 Apr 27 15:07:12.641000260 pid: 23701, conn_id: 71124849-45c9-40ee-84fc-fab3e384a0fb, latency: 390 us, query:
    'SELECT "vpc"."id", "vpc"."name", "vpc"."description", "vpc"."time_created", "vpc"."time_modified", "vpc"."time_deleted", "vpc"."project_id", "vpc"."system_router_id", "vpc"."vni", "vpc"."ipv6_prefix", "vpc"."dns_name", "vpc"."firewall_gen" FROM "vpc" WHERE ((("vpc"."time_deleted" IS NULL) AND ("vpc"."name" = $1)) AND ("vpc"."project_id" = $2)) -- binds: [Name(Name("vpc2")), ba8d746e-a159-409a-a1fc-58db34105fe4]'

So let's plug that into CockroachDB:

$ cockroach sql --url postgresql://root@127.0.0.1:32221/omicron?sslmode=disable
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
# Server version: CockroachDB OSS v21.1.10 (x86_64-pc-solaris2.11, built 2021/10/12 22:52:33, go1.16.5) (same version as client)
# Cluster ID: c775bce8-3736-463c-9244-8ee1b8adb730
#
# Enter \? for a brief introduction.
#
root@127.0.0.1:32221/omicron> SELECT "vpc"."id", "vpc"."name", "vpc"."description", "vpc"."time_created", "vpc"."time_modified", "vpc"."time_deleted", "vpc"."project_id", "vpc"."system_router_id", "vpc"."vni", "vpc"."ipv6_prefix", "vpc"."dns_name", "vpc"."firewall_gen"
FROM "vpc" WHERE ((("vpc"."time_deleted" IS NULL) AND ("vpc"."name" = 'vpc2')) AND ("vpc"."project_id" = 'ba8d746e-a159-409a-a1fc-58db34105fe4'));
  id | name | description | time_created | time_modified | time_deleted | project_id | system_router_id | vni | ipv6_prefix | dns_name | firewall_gen
-----+------+-------------+--------------+---------------+--------------+------------+------------------+-----+-------------+----------+---------------
(0 rows)

Time: 2ms total (execution 1ms / network 0ms)
root@127.0.0.1:32221/omicron>

Well, that's consistent with what the API reports! What if we set name to vpc1 (ignoring a few iterations where I used the wrong kind of quote):

root@127.0.0.1:32221/omicron> SELECT "vpc"."id", "vpc"."name", "vpc"."description", "vpc"."time_created", "vpc"."time_modified", "vpc"."time_deleted", "vpc"."project_id", "vpc"."system_router_id", "vpc"."vni", "vpc"."ipv6_prefix", "vpc"."dns_name", "vpc"."firewall_gen"
FROM "vpc" WHERE ((("vpc"."time_deleted" IS NULL) AND ("vpc"."name" = 'vpc1')) AND ("vpc"."project_id" = 'ba8d746e-a159-409a-a1fc-58db34105fe4'));
                   id                  | name | description |         time_created          |         time_modified         | time_deleted |              project_id              |           system_router_id           |   vni   |     ipv6_prefix     | dns_name | firewall_gen
---------------------------------------+------+-------------+-------------------------------+-------------------------------+--------------+--------------------------------------+--------------------------------------+---------+---------------------+----------+---------------
  1baca315-b0e3-4df8-8be5-710456eb1170 | vpc2 | a           | 2022-04-27 21:58:44.985752+00 | 2022-04-27 22:01:12.660347+00 | NULL         | ba8d746e-a159-409a-a1fc-58db34105fe4 | d5f63635-2445-4185-a5d3-1567d277d854 | 9176178 | fd81:3af7:17d2::/48 | foo      |            2
(1 row)

Oops. Let's use \x for extended output. That's much easier to read:

root@127.0.0.1:32221/omicron> \x
root@127.0.0.1:32221/omicron> SELECT "vpc"."id", "vpc"."name", "vpc"."description", "vpc"."time_created", "vpc"."time_modified", "vpc"."time_deleted", "vpc"."project_id", "vpc"."system_router_id", "vpc"."vni", "vpc"."ipv6_prefix", "vpc"."dns_name", "vpc"."firewall_gen"
FROM "vpc" WHERE ((("vpc"."time_deleted" IS NULL) AND ("vpc"."name" = 'vpc1')) AND ("vpc"."project_id" = 'ba8d746e-a159-409a-a1fc-58db34105fe4'));
-[ RECORD 1 ]
id               | 1baca315-b0e3-4df8-8be5-710456eb1170
name             | vpc2
description      | a
time_created     | 2022-04-27 21:58:44.985752+00
time_modified    | 2022-04-27 22:01:12.660347+00
time_deleted     | NULL
project_id       | ba8d746e-a159-409a-a1fc-58db34105fe4
system_router_id | d5f63635-2445-4185-a5d3-1567d277d854
vni              | 9176178
ipv6_prefix      | fd81:3af7:17d2::/48
dns_name         | foo
firewall_gen     | 2

Time: 1ms total (execution 1ms / network 0ms)

Yowza! The query has an explicit predicate on vpc.name = 'vpc1', yet it returned a row from the vpc table with name = 'vpc2'! And the same query for name = 'vpc2' doesn't return that row! This smells a lot like a problem with the index used to serve this query.

Now, there's a lot of other stuff in that query. Let's simplify it and try a few variations:

root@127.0.0.1:32221/omicron> SELECT * from vpc WHERE vpc.name = 'vpc1';

Time: 1ms total (execution 1ms / network 0ms)

root@127.0.0.1:32221/omicron> SELECT * from vpc WHERE vpc.time_deleted IS NULL AND vpc.name = 'vpc1';

Time: 1ms total (execution 1ms / network 0ms)

root@127.0.0.1:32221/omicron> SELECT * from vpc WHERE vpc.time_deleted IS NULL AND vpc.name = 'vpc1' AND vpc.project_id = 'ba8d746e-a159-409a-a1fc-58db34105fe4';
-[ RECORD 1 ]
id               | 1baca315-b0e3-4df8-8be5-710456eb1170
name             | vpc2
description      | a
time_created     | 2022-04-27 21:58:44.985752+00
time_modified    | 2022-04-27 22:01:12.660347+00
time_deleted     | NULL
project_id       | ba8d746e-a159-409a-a1fc-58db34105fe4
system_router_id | d5f63635-2445-4185-a5d3-1567d277d854
dns_name         | foo
vni              | 9176178
ipv6_prefix      | fd81:3af7:17d2::/48
firewall_gen     | 2

Time: 1ms total (execution 1ms / network 0ms)

root@127.0.0.1:32221/omicron> SELECT * from vpc WHERE vpc.name = 'vpc1';

Time: 1ms total (execution 1ms / network 0ms)

This is all consistent: we expect the index we're using to serve the Nexus query is this one:

CREATE UNIQUE INDEX ON omicron.public.vpc (
    project_id,
    name 
) WHERE
    time_deleted IS NULL;

The query that specifies a project_id, name, and time_deleted IS NULL returns the wrong result. But the queries that don't specify all those things can't use the index, so they do a table scan and get the correct result. Let's confirm that by asking CockroachDB how it would answer these queries:

root@127.0.0.1:32221/omicron> EXPLAIN SELECT * from vpc WHERE vpc.name = 'vpc1';
-[ RECORD 1 ]
info | distribution: full
-[ RECORD 2 ]
info | vectorized: true
-[ RECORD 3 ]
info | 
-[ RECORD 4 ]
info | • filter
-[ RECORD 5 ]
info | │ estimated row count: 1
-[ RECORD 6 ]
info | │ filter: name = 'vpc1'
-[ RECORD 7 ]
info | │
-[ RECORD 8 ]
info | └── • scan
-[ RECORD 9 ]
info |       estimated row count: 1 (100% of the table; stats collected 16 minutes ago)
-[ RECORD 10 ]
info |       table: vpc@primary
-[ RECORD 11 ]
info |       spans: FULL SCAN

Time: 2ms total (execution 2ms / network 0ms)

Yeah, the FULL SCAN of vpc@primary means that this query did a table scan, and that's the one that produced the correct result.

The simplified version of the one from the API, though:

root@127.0.0.1:32221/omicron> EXPLAIN SELECT * from vpc WHERE vpc.time_deleted IS NULL AND vpc.name = 'vpc1' AND vpc.project_id = 'ba8d746e-a159-409a-a1fc-58db34105fe4';
-[ RECORD 1 ]
info | distribution: local
-[ RECORD 2 ]
info | vectorized: true
-[ RECORD 3 ]
info | 
-[ RECORD 4 ]
info | • index join
-[ RECORD 5 ]
info | │ estimated row count: 1
-[ RECORD 6 ]
info | │ table: vpc@primary
-[ RECORD 7 ]
info | │
-[ RECORD 8 ]
info | └── • scan
-[ RECORD 9 ]
info |       estimated row count: 1 (100% of the table; stats collected 16 minutes ago)
-[ RECORD 10 ]
info |       table: vpc@vpc_project_id_name_key (partial index)
-[ RECORD 11 ]
info |       spans: [/'ba8d746e-a159-409a-a1fc-58db34105fe4'/'vpc1' - /'ba8d746e-a159-409a-a1fc-58db34105fe4'/'vpc1']

Time: 1ms total (execution 1ms / network 0ms)

says it uses vpc@vpc_project_id_name_key (partial index), meaning this query was served from the partial index. This is pretty convincing evidence of a CockroachDB problem, so it's time to check the release notes. What version are we running?

$ cockroach version
Build Tag:        v21.1.10
Build Time:       2021/10/12 22:52:33
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.16.5
C Compiler:       gcc 10.3.0
Build Commit ID:  a6daab16ee8a1c15abc4c4a8a425e46d12033b5c
Build Type:       release

There have been quite a lot of releases since then -- 8 micro updates to 21.1 since ours, a whole 21.2 line, and 22.1 is currently in beta.

We started at the release notes for 21.1.0 and looked for anything like "stop! don't use this release because it has data corruption", but weren't so lucky. We walked forward until we found this under 21.1.13:

Fixed a bug which caused corruption of partial indexes, which could cause incorrect query results. The bug was only present when two or more partial indexes in the same table had identical WHERE clauses. This bug had been present since version 21.1.0. #74475

Wow -- this is remarkable. It's pretty specific, but it explains all of our data points, including this particularly confusing one:

  • If the rename is synchronous (as it should be, and as our cursory code audit suggested), why did Ben see the state change when he was poking at the database?

The state didn't change. Ben's query changed from one with time_deleted IS NULL to one without it. This shouldn't have mattered, since nothing was deleted in this test (which we verified later). But given the index corruption described in the CockroachDB issue, this is exactly the behavior we'd expect.

We can also answer a few more questions:

  • Why does getting rid of the "vni" partial index fix the problem? (Because the corruption bug requires two partial indexes with the same WHERE clause and a change to a field that's in one index but not the other.)
  • Why are we only seeing this now? (Again, it requires you to have two partial indexes with the same WHERE clause and a change to a field that's in one index but not the other and we'd have to notice the corruption. We have very few tables with two partial indexes: "disk" and "network_interface". The ones on "disk" have different WHERE clauses. The ones in "network_interface" probably do have this problem and we just haven't noticed.)
  • Why does this only seem to affect the VPC test, when we have a bunch of others that test similar behavior? (Because it's the only one with the conditions above around its partial indexes. We'd probably have a similar problem if we renamed a "network_interface".)
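For reference, here's the shape of schema that triggers the bug: two partial indexes on the same table sharing the WHERE time_deleted IS NULL clause. The first index is copied from above; the "vni" index definition here is an assumption about what the actual schema looks like, not copied from it:

```sql
-- Partial index from above, used to look up a VPC by (project_id, name):
CREATE UNIQUE INDEX ON omicron.public.vpc (
    project_id,
    name
) WHERE
    time_deleted IS NULL;

-- The "vni" partial index (definition assumed): same WHERE clause, which is
-- the triggering condition for the CockroachDB bug.
CREATE UNIQUE INDEX ON omicron.public.vpc (
    vni
) WHERE
    time_deleted IS NULL;
```

Per the CockroachDB issue, an UPDATE to a column that appears in one of these indexes but not the other (like the rename of "name" here) is what corrupts them.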

Remaining work

We could build 21.1.13 on illumos and retest (show that it's fixed) and maybe even build 21.1.12 on illumos and retest (to show that it's still broken), which would be even more convincing. But at this point, we know for sure that CockroachDB has a bug that's totally consistent with the (very bizarre) observed behavior, so maybe it's enough to update to whatever latest one we want to use (maybe 21.2.9, the latest stable) and verify that we don't see this problem.

Note that this was initially reported under cockroachdb/cockroach#74385, which describes workarounds (which I don't think we have to worry about, since we can just update), the specific conditions required to hit the bug, and the fact that you'd have to drop the index and recreate it to fix the corruption. Longer term, we might consider a first-class mechanism in Omicron to recreate indexes for cases like this, but I wouldn't prioritize it for the MVP given how complex that would be and that this is the first one we've seen.
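For anyone who hits the corruption on an existing cluster: per the CockroachDB issue, existing corruption isn't repaired by the upgrade itself; the affected index has to be dropped and recreated so the database rebuilds it from the primary index data. A sketch (index name taken from the EXPLAIN output above; this is untested):

```sql
-- Drop the corrupted partial index and recreate it so CockroachDB rebuilds
-- its entries from the table's primary data.
DROP INDEX omicron.public.vpc@vpc_project_id_name_key;
CREATE UNIQUE INDEX ON omicron.public.vpc (
    project_id,
    name
) WHERE
    time_deleted IS NULL;
```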

@jclulow
Collaborator

jclulow commented Apr 28, 2022

We could build 21.1.13 on illumos and retest (show that it's fixed) and maybe even build 21.1.12 on illumos and retest (to show that it's still broken), which would be even more convincing. But at this point, we know for sure that CockroachDB has a bug that's totally consistent with the (very bizarre) observed behavior, so maybe it's enough to update to whatever latest one we want to use (maybe 21.2.9, the latest stable) and verify that we don't see this problem.

I am short on time before we leave for a harrowing aeroplane adventure, but I have reworked the patches and built the latest stable for illumos:

$ ./bin/cockroach version
Build Tag:        v21.2.9
Build Time:       2022/04/28 04:02:42
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.16.10
C Compiler:       gcc 10.3.0
Build Commit ID:  11787edfcfc157a0df951abc34684e4e18b3ef20
Build Type:       release

It's uploaded in the usual place; i.e., https://www.illumos.org/downloads/cockroach-v21.2.9.illumos.tar.gz (with SHA256 sum 07122cbe2d3ccd1846edbe4a19e146c96989fa8514549368e412f26c4c66c7d4).

I would appreciate if somebody would give this a spin and then I can integrate the changes to the garbage compactor.

@bnaecker
Collaborator

Kicking the tires now, thanks @jclulow!

@bnaecker
Collaborator

bnaecker commented Apr 28, 2022

So, the good news is that this does indeed make the one failing integration test from our analysis pass! I've also checked that the database returns the same results if we select the VPC via a table scan or using one of the partial indexes. This is the repro using the CLI from our analysis above.

bnaecker@feldspar : ~/cli $ export OXIDE_HOST=http://127.0.0.1:12220
bnaecker@feldspar : ~/cli $ export OXIDE_TOKEN="oxide-spoof-001de000-05e4-4000-8000-000000004007"
bnaecker@feldspar : ~/cli $ ./target/debug/oxide org list
 id | name | description | time_created | time_modified |
----+------+-------------+--------------+---------------

bnaecker@feldspar : ~/cli $ ./target/debug/oxide org create o1
organization description: lkdsf
✔ Created organization o1
bnaecker@feldspar : ~/cli $ ./target/debug/oxide project create -o o1 p1
project description: lkjsdf
✔ Created project o1/p1
bnaecker@feldspar : ~/cli $ ./target/debug/oxide vpc create -o o1 -p p1 vpc1
VPC description: lksjdf
VPC dns_name: foo
✔ Created VPC vpc1 in o1/p1
bnaecker@feldspar : ~/cli $ ./target/debug/oxide api /organizations/o1/projects/p1/vpcs/vpc1
{
  "description": "lksjdf",
  "dns_name": "foo",
  "id": "6ef7323f-5d9f-4c89-b7f5-ed4ba015a4cb",
  "ipv6_prefix": "fd1e:6cf7:8c2c::/48",
  "name": "vpc1",
  "project_id": "5e21a1e1-75f5-44a7-ab11-a529c03d3b92",
  "system_router_id": "3a4a9cf4-fc8d-429b-bb50-c72e076497ef",
  "time_created": "2022-04-28T04:50:36.631563Z",
  "time_modified": "2022-04-28T04:50:36.631563Z"
}

Ok, so we've created the org, project, and VPC, and can get back the VPC with the correct name. Now rename it:

bnaecker@feldspar : ~/cli $ echo '{"name": "new-name"}' | ./target/debug/oxide api -X PUT /organizations/o1/projects/p1/vpcs/vpc1 --input -
{
  "description": "lksjdf",
  "dns_name": "foo",
  "id": "6ef7323f-5d9f-4c89-b7f5-ed4ba015a4cb",
  "ipv6_prefix": "fd1e:6cf7:8c2c::/48",
  "name": "new-name",
  "project_id": "5e21a1e1-75f5-44a7-ab11-a529c03d3b92",
  "system_router_id": "3a4a9cf4-fc8d-429b-bb50-c72e076497ef",
  "time_created": "2022-04-28T04:50:36.631563Z",
  "time_modified": "2022-04-28T04:51:43.624585Z"
}

So far, so good, the VPC has been apparently renamed. Let's check that we can't fetch it with the old name, which we were able to do as a result of the partial index corruption:

bnaecker@feldspar : ~/cli $ ./target/debug/oxide api /organizations/o1/projects/p1/vpcs/vpc1
404 Not Found Not Found
bnaecker@feldspar : ~/cli $ ./target/debug/oxide api /organizations/o1/projects/p1/vpcs/new-name
{
  "description": "lksjdf",
  "dns_name": "foo",
  "id": "6ef7323f-5d9f-4c89-b7f5-ed4ba015a4cb",
  "ipv6_prefix": "fd1e:6cf7:8c2c::/48",
  "name": "new-name",
  "project_id": "5e21a1e1-75f5-44a7-ab11-a529c03d3b92",
  "system_router_id": "3a4a9cf4-fc8d-429b-bb50-c72e076497ef",
  "time_created": "2022-04-28T04:50:36.631563Z",
  "time_modified": "2022-04-28T04:51:43.624585Z"
}
bnaecker@feldspar : ~/cli $

And we are able to fetch it via the new name, so the behavior seems to be fixed by the patch.

There is a bit of bad news, though. CRDB seems to have changed some details about what they emit to their listening URL file that we use in various tests. In particular, while v21.1.10 did not emit a database name to that URL, they now seem to always dump defaultdb in its place, causing this check to be hit. I think that should be pretty innocuous and easy to manage, though, so this is looking good! Thanks again for building this.

@bnaecker
Collaborator

bnaecker commented Apr 28, 2022

At least one more odd change is required to make this version work. This SQLSTATE assertion fails, since the database now seems to spit out UNDEFINED_DATABASE rather than the previous UNDEFINED_TABLE. Not sure what to make of that.

Another possible oddity. I've noticed that the /tmp directory fills up extremely quickly. Like within two runs of the full test suite, whereas I'm not sure I'd ever hit an out-of-space error before. Something about the on-disk format may have changed that warrants further investigation.

@mgartner

I'm barging into the conversation uninvited here, but as the engineer who created the index-corrupting bug, I wanted to apologize for causing this lengthy (yet excellent) investigation. You may already be aware, but we publish technical advisories for severe bugs like this, which you might want to check next time you suspect the database is at fault — maybe it would speed up the investigation.

At least one more odd change is required to make this version work. This SQLSTATE assertion fails, since the database now seems to spit out UNDEFINED_DATABASE rather than the previous UNDEFINED_TABLE. Not sure what to make of that.

This is the result of some improvements to CockroachDB's catalog management. CRDB allows connecting to non-existent databases (something not possible in Postgres). When connected to a non-existent database in 21.1 and prior, attempts to access system catalogs, like SELECT * FROM pg_catalog.pg_attribute, result in an undefined database error (3D000), but accessing non-system catalogs result in an undefined table error (42P01). We made the error codes match in 21.2, so both cases will result in an undefined database error.
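To make that concrete, here's a hypothetical session while connected to a database that doesn't exist (the non-system table name is made up; the SQLSTATE values are from the explanation above):

```sql
-- System catalog access: undefined database in both 21.1 and 21.2.
SELECT * FROM pg_catalog.pg_attribute;   -- SQLSTATE 3D000

-- Non-system table access: 21.1 returned SQLSTATE 42P01 (undefined table);
-- 21.2 returns SQLSTATE 3D000 (undefined database) to match.
SELECT * FROM some_table;
```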

Another possible oddity. I've noticed that the /tmp directory fills up extremely quickly. Like within two runs of the full test suite, whereas I'm not sure I'd ever hit an out-of-space error before. Something about the on-disk format may have changed that warrants further investigation.

I'm not aware of anything that would cause this, but I'll ask around here to see if anyone else has ideas. If you have a consistent reproduction we can take a closer look.

I'm happy to help further if you run into any other hurdles while upgrading.

@davepacheco
Collaborator Author

Thanks, @mgartner! That's all very helpful. Is there a way to subscribe by email for advisory announcements?

The release note, PR notes, and issue comments were all very helpful in identifying this issue and understanding the blast radius, workarounds, and other consequences. I really appreciate that. Once we actually concluded CockroachDB was the source of our issue, it didn't take long to find the specific bug because of the release notes. The technical advisories will be very valuable for us moving forward so that we can upgrade more proactively and get ahead of issues in deployed systems. (We're not there yet.)

@mgartner

I'm glad the paper trail we left helped! You can sign up for email notifications for technical advisories here: https://www.cockroachlabs.com/email-preferences/

@bcantrill
Contributor

@mgartner, I mentioned this on social media, but just to say it here as well: we really appreciate you weighing in here -- your approach here is very reaffirming of our decision to use CRDB; thank you!

@davepacheco
Collaborator Author

I would appreciate if somebody would give this a spin and then I can integrate the changes to the garbage compactor.

@jclulow The update of Omicron to Cockroach 21.2.9 using these binaries on illumos landed in #988, so I'd say you're good to go on those patches.
