apparent cockroachdb data corruption due to cockroachdb issue 74475 #980
Here I'll try to describe as much of the debugging process and data as I recorded and remember. It's not quite as complete and in-order as I'd like because I didn't quite record everything, but I think it's a good summary. @bnaecker, let me know what I've missed!

Initial observations

Ben reported that at the tip of #955, commit 97ceda5, if you run
Ben summarized:
Ben also reported that if you apply this patch, the test succeeds:
While debugging this, Ben introduced a "sleep" in the test so he could look at the database state. Here's some output:
That's weird -- it looks as though the rename happened while he was looking at it, despite the test being stopped.

Side note on reproducing

The problem is wholly reproducible from commit 97ceda5 by running
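The exact invocation isn't preserved here, but running the failing test by name from the workspace is the usual approach. A hypothetical invocation (the package name and test filter below are guesses):

```
# Hypothetical: the package name and test filter are guesses, not the exact
# command used in the original report.
cargo test -p omicron-nexus test_vpcs -- --nocapture 2>&1 | tee /tmp/vpc-test.log
```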
Neat -- I didn't know you could use a personal access token to clone over https.

First observations from the log file

All the integration tests record a detailed log file from Nexus. Take the first failure I saw:
The log file starts out like this:
That last line gives you the URL that CockroachDB is listening on. We used that a lot here. Anyway, if we filter the output a bit, we see what happened:
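The specific filter isn't reproduced here, but the log is bunyan-style newline-delimited JSON, so something along these lines works (the log path and the search strings are guesses):

```
# Hypothetical filter: the log path and search strings are guesses.  Each log
# line is a JSON object with "level" and "msg" fields, so jq can pick them out.
grep -E 'just-rainsticks|authorize|vpc' /tmp/omicron-nexus-test.log \
    | jq -r '[.level, .msg] | @tsv'
```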
This is showing us:
The log message for the authz check tells us a bunch:
It's not super obvious from the debug output, but that's telling us that we found Fleet.

From the log file, we can tell:
Questions at this point:
First observations about the database state

The test suite deliberately leaves around the database directory for failed tests. Little-known fact: you can spin up CockroachDB again on top of that leftover directory.
This was wrong -- I forgot to use
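A hedged sketch of the kind of invocation that ends up working (the store path and listen address are placeholders, not the real ones):

```
# Restart CockroachDB single-node on the leftover test store.  The store path
# and listen address here are placeholders.
cockroach start-single-node \
    --insecure \
    --store=/path/to/leftover/test/db/data \
    --listen-addr=127.0.0.1:32221
```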
That successfully started it up. I checked the recent log files from CockroachDB in that directory hoping to find some red flag about a client having done something goofy with transactions...but found none. I forgot to save the output of the database state at this point...but since the problem is so reproducible, I can show output from similar situations.

Tweak the test to better see what's going on

Skipping ahead a bit (not that much): we eventually added two sleeps to the test: one immediately before the rename, and one immediately after. After a bunch of iteration, I put together this invocation, which I ran during each sleep:
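The original one-liner isn't shown here, but it was shaped like this (the port in the URL is a placeholder -- the real one comes from the Nexus log -- and the table names assume the Omicron schema):

```
# Dump the tables of interest via the URL printed at the top of the Nexus log.
# The port is a placeholder and the table names assume the Omicron schema.
DB_URL='postgresql://root@127.0.0.1:32221?sslmode=disable'
for table in silo organization project vpc; do
    echo "== $table =="
    cockroach sql --url "$DB_URL" \
        -e "SELECT id, name, time_deleted FROM omicron.public.$table"
done
```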
This dumps the Silos, Organizations, Projects, and VPCs in the database. Note that the URL there comes from the beginning of the log file emitted by the test, as I showed above. So we have the database state both before and after the rename. It looks like this:
For this particular invocation, the log message for the failing request was:
We see here that VPC e7220a9-d14b-4d01-b7c9-c3db9c6f0e19 was renamed from "just-rainsticks" to "new-name", and that the GET request successfully found that VPC (that is, having the same id) by its old name. So it's not like we somehow looked it up by the new name or by id, or that we found a different VPC. A distracting bit: you might notice the test creates two projects, and a VPC called "just-rainsticks" inside each one. Is it possible we grabbed the wrong one? No, the ids confirm that we incorrectly found the renamed VPC, not the same-named one in a different Project. (Why does the test do this? It's trying to verify that VPC names can be reused across Projects.) Anyway, this answers some earlier questions:
We might also ask:
We still don't know:
Debugging interactively

At this point, we noticed that we had a 30-second sleep after the rename and before the GET, and even though the database state was clearly correct, the subsequent lookup failed! That rules out some small race window, since we know the GET happened after we confirmed the database state was correct. (It was around this point CockroachDB became a person of interest in the investigation. We'd obviously considered "data corruption in the database" earlier, but it seemed unlikely to me because (1) we'd never seen anything like this in the year-plus of thousands of test runs, even in the form of flaky tests or other unexplained weirdness, and (2) it was so reproducible now. These aren't really consistent with most forms of database corruption I've experienced.) If there's no race here, maybe we can reproduce this interactively with just CockroachDB, Nexus, and the CLI -- no test suite at all. We spun up Cockroach and Nexus in the usual way. Then we created an Organization, Project, and VPC with the CLI:
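The exact CLI commands aren't reproduced here; a rough equivalent using curl against the Nexus external API looks like this (the Nexus address, API paths, and request bodies are assumptions, and authentication headers are omitted):

```
# Hedged: address, paths, and bodies are assumptions; auth headers omitted.
NEXUS=http://127.0.0.1:12220

curl -X POST "$NEXUS/organizations" \
    -H 'Content-Type: application/json' \
    -d '{"name": "org1", "description": "test org"}'

curl -X POST "$NEXUS/organizations/org1/projects" \
    -H 'Content-Type: application/json' \
    -d '{"name": "proj1", "description": "test project"}'

curl -X POST "$NEXUS/organizations/org1/projects/proj1/vpcs" \
    -H 'Content-Type: application/json' \
    -d '{"name": "vpc1", "description": "test vpc", "dns_name": "vpc1"}'
```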
Looks good:
Now, let's rename it. I didn't see a CLI command for this, but we can do it easily enough with the raw API subcommand:
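Sketched again with curl rather than the CLI's raw API subcommand (path and body shape are assumptions):

```
# Hedged curl equivalent of the rename; path and body are assumptions.
NEXUS=http://127.0.0.1:12220
curl -X PUT "$NEXUS/organizations/org1/projects/proj1/vpcs/vpc1" \
    -H 'Content-Type: application/json' \
    -d '{"name": "vpc2"}'
```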
Note that the "name" has correctly been set to "vpc2". Now the moment of truth: can we fetch it with the old name ("vpc1")?
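The two lookups in question, sketched with curl under the same assumed paths:

```
# Fetch by the old name and then by the new name (assumed paths).
NEXUS=http://127.0.0.1:12220
curl "$NEXUS/organizations/org1/projects/proj1/vpcs/vpc1"   # old name
curl "$NEXUS/organizations/org1/projects/proj1/vpcs/vpc2"   # new name
```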
Boom! And whoa! It has "name" = "vpc2". Does the new name work too?
Yikes. It's not looking good for CockroachDB at this point. At this point, I wanted to try two more things:
The first was easy enough: I killed Nexus, reran the checks to make sure I was pointing the client at the right thing:
Good. Then I restarted Nexus. (It didn't rebuild or anything -- exact same code.) Reran the check:
This rules out a lot of possible bugs in Nexus, particularly this open question we had from earlier:
Isolating CockroachDB

We're starting to suspect a problem with the CockroachDB index. I had previously created a copy of the trace-db-queries DTrace script in the Omicron repo that would trace all processes:
I started that:
then reran the CLI command:
The last query from the D script:
So let's plug that into CockroachDB:
Well, that's consistent with what the API reports! What if we set name to
Oops. Let's use
Yowza! The query has an explicit predicate on the name. Now, there's a lot of other stuff in that query. Let's simplify it and try a few variations:
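Roughly the kind of variations we mean (hedged: the table and column names assume the Omicron schema, and the project id is a placeholder):

```
# Hedged sketches of the simplified variations.  Table/column names assume the
# Omicron schema; the project id is a placeholder.
DB_URL='postgresql://root@127.0.0.1:32221?sslmode=disable'

# Lookup by name alone:
cockroach sql --url "$DB_URL" -e \
    "SELECT id, name FROM omicron.public.vpc WHERE name = 'vpc1'"

# Lookup by name, excluding soft-deleted rows, the way Nexus does:
cockroach sql --url "$DB_URL" -e \
    "SELECT id, name FROM omicron.public.vpc
      WHERE name = 'vpc1' AND time_deleted IS NULL"

# Same, additionally scoped to the project (substitute the real project id):
cockroach sql --url "$DB_URL" -e \
    "SELECT id, name FROM omicron.public.vpc
      WHERE project_id = '00000000-0000-0000-0000-000000000000'
        AND name = 'vpc1' AND time_deleted IS NULL"
```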
This is all consistent: we expect the index we're using to serve the Nexus query is this one:
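The definition isn't copied here, but its shape (reconstructed from the Omicron schema, so treat the exact name and column list as assumptions) is a partial unique index -- the WHERE clause is what makes it partial, which matters for the bug discussed below:

```
# Inspect the table's indexes; the expected one is a partial unique index.
cockroach sql --url 'postgresql://root@127.0.0.1:32221?sslmode=disable' \
    -e "SHOW CREATE TABLE omicron.public.vpc"
# Expected to include something shaped like (name/columns are assumptions):
#   CREATE UNIQUE INDEX ON omicron.public.vpc (project_id, name)
#       WHERE time_deleted IS NULL
```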
The query that specifies a project_id, name, and
Yeah, that one checks out. The simplified version of the one from the API, though:
says it uses
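For reference, EXPLAIN is how one can see which index CockroachDB picks for each variation (hedged; same assumed names and placeholder id as above):

```
# Show the plan, which names the index (or full scan) chosen for the query.
DB_URL='postgresql://root@127.0.0.1:32221?sslmode=disable'
cockroach sql --url "$DB_URL" -e \
    "EXPLAIN SELECT id, name FROM omicron.public.vpc
      WHERE project_id = '00000000-0000-0000-0000-000000000000'
        AND name = 'vpc1' AND time_deleted IS NULL"
```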
There have been quite a lot of releases since then -- 8 micro updates to 21.1 since ours, a whole 21.2 line, and 22.1 is currently in beta. We started at the release notes for 21.1.0 and looked for anything like "stop! don't use this release because it has data corruption", but weren't so lucky. We walked forward until we found this under 21.1.13:
Wow -- this is remarkable. It's pretty specific, but it explains all of our data points, including this particularly confusing one:
The state didn't change; Ben's query changed from one form to another, and the two forms were served by different indexes -- one of them corrupted. We can also answer a few more questions:
Remaining work

We could build 21.1.13 on illumos and retest (to show that it's fixed) and maybe even build 21.1.12 on illumos and retest (to show that it's still broken), which would be even more convincing. But at this point, we know for sure that CockroachDB has a bug that's totally consistent with the (very bizarre) observed behavior, so maybe it's enough to update to whatever latest one we want to use (maybe 21.2.9, the latest stable) and verify that we don't see this problem. Note that this was initially reported under cockroachdb/cockroach#74385, which describes workarounds (which I don't think we have to worry about, since we can just update), the specific conditions required to hit the bug, and the fact that you'd have to drop the index and recreate it to fix the corruption. Longer term, we might consider a first-class mechanism in Omicron to recreate indexes for cases like this, but I wouldn't prioritize it for the MVP given how complex that will be and that this is the first one we've seen.
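For reference, the drop-and-recreate repair CockroachDB describes would look roughly like this (the index name and definition are assumptions about the Omicron schema; this is a sketch, not something we've run):

```
# Hedged sketch of recreating a corrupted partial index; the index name and
# definition are assumptions.
DB_URL='postgresql://root@127.0.0.1:32221?sslmode=disable'
cockroach sql --url "$DB_URL" -e "
    DROP INDEX omicron.public.vpc@vpc_project_id_name_idx;
    CREATE UNIQUE INDEX vpc_project_id_name_idx
        ON omicron.public.vpc (project_id, name)
     WHERE time_deleted IS NULL;
"
```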
I am short on time before we leave for a harrowing aeroplane adventure, but I have reworked the patches and built the latest stable for illumos:
It's uploaded in the usual place; i.e., https://www.illumos.org/downloads/cockroach-v21.2.9.illumos.tar.gz (with the accompanying SHA256 sum). I would appreciate it if somebody would give this a spin, and then I can integrate the changes into the garbage compactor.
Kicking the tires now, thanks @jclulow!
So, the good news is that this does indeed make the one failing integration test from our analysis pass! I've also checked that the database returns the same results if we select the VPC via a table scan or using one of the partial indexes (a sketch of that check appears at the end of this comment). This is the repro using the CLI from our analysis above.
Ok, so we've created the org, project, and VPC, and can get back the VPC with the correct name. Now rename it:
So far, so good, the VPC has been apparently renamed. Let's check that we can't fetch it with the old name, which we were able to do as a result of the partial index corruption:
And we are able to fetch it via the new name, so the behavior seems to be fixed by the patch. There is a bit of bad news, though. CRDB seems to have changed some details about what they emit to the listening-URL file that we use in various tests. In particular, while v21.1.10 did not emit a database name in that URL, the newer version now seems to always include one.
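A hedged sketch of the scan-versus-index cross-check mentioned above: force a full scan via the primary index, then force the (assumed) partial index, and compare the rows returned:

```
# Force each access path and compare results; the partial-index name is an
# assumption, and the predicate must imply the index's WHERE clause.
DB_URL='postgresql://root@127.0.0.1:32221?sslmode=disable'
cockroach sql --url "$DB_URL" -e \
    "SELECT id, name FROM omicron.public.vpc@primary WHERE name = 'vpc2'"
cockroach sql --url "$DB_URL" -e \
    "SELECT id, name FROM omicron.public.vpc@{FORCE_INDEX=vpc_project_id_name_idx}
      WHERE name = 'vpc2' AND time_deleted IS NULL"
```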
At least one more odd change is required to make this version work: a SQLSTATE assertion fails, since the database now seems to spit out a different code. There's another possible oddity I've noticed as well.
I'm barging into the conversation uninvited here, but as the engineer who created the index-corrupting bug, I wanted to apologize for causing this lengthy (yet excellent) investigation. You may already be aware, but we publish technical advisories for severe bugs like this, which you might want to check next time you suspect the database is at fault — maybe it would speed up the investigation.
This is the result of some improvements to CockroachDB's catalog management. CRDB allows connecting to non-existent databases (something that's not possible in Postgres). When connected to a non-existent database in 21.1 and prior, attempts to access system catalogs did not behave the way they do in newer versions.
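A minimal sketch of the non-existent-database point (the address is a placeholder): CockroachDB will accept the connection where Postgres would refuse it.

```
# Connect to a database that doesn't exist; CockroachDB accepts this.
cockroach sql --insecure --host 127.0.0.1:32221 --database does_not_exist \
    -e "SELECT current_database()"
```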
I'm not aware of anything that would cause this, but I'll ask around here to see if anyone else has ideas. If you have a consistent reproduction we can take a closer look. I'm happy to help further if you run into any other hurdles while upgrading.
Thanks, @mgartner! That's all very helpful. Is there a way to subscribe by email for advisory announcements? The release note, PR notes, and issue comments were all very helpful in identifying this issue and understanding the blast radius, workarounds, and other consequences. I really appreciate that. Once we actually concluded CockroachDB was the source of our issue, it didn't take long to find the specific bug because of the release notes. The technical advisories will be very valuable for us moving forward so that we can upgrade more proactively and get ahead of issues in deployed systems. (We're not there yet.)
I'm glad the paper trail we left helped! You can sign up for email notifications for technical advisories here: https://www.cockroachlabs.com/email-preferences/
@mgartner, I mentioned this on social media, but just to say it here as well: we really appreciate you weighing in here -- your approach here is very reaffirming of our decision to use CRDB; thank you!
@jclulow The update of Omicron to Cockroach 21.2.9 using these binaries on illumos landed in #988, so I'd say you're good to go on those patches.
tl;dr: CockroachDB v21.1.10 has a bug where queries can return "incorrect results" due to index corruption "when two or more partial indexes in the same table had identical WHERE clauses". Ben appeared to observe this under #955. I'm still doing a bit more confirmation, but strong evidence suggests we hit that CockroachDB bug. Fortunately, it's reported fixed in v21.1.13. We'll want to update to at least that in order to land #955.