
Conversation

@jmpesp (Contributor) commented Sep 7, 2022

Match against DieselDatabaseErrorKind::SerializationFailure and check whether this is a case where a CRDB transaction didn't fail outright but needs to be retried. This can happen when there's too much database contention, which requires clients to retry the request.

Return HTTP 429 in this case to signal to clients that a retry is required.

Closes https://github.com/oxidecomputer/remote-access-preview/issues/23
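
For illustration, a minimal sketch of the mapping described above (not this PR's actual diff), assuming the `http` crate's `StatusCode` on the HTTP side; Omicron's real error plumbing goes through its own error types:

```rust
// A sketch, not the PR's diff: inspect a Diesel error and map CockroachDB
// serialization failures (the "restart transaction" / SQLSTATE 40001 case)
// to a retryable HTTP status instead of a generic 500.
use diesel::result::{DatabaseErrorKind, Error as DieselError};
use http::StatusCode;

fn http_status_for_diesel_error(e: &DieselError) -> StatusCode {
    match e {
        // CRDB is asking the client to retry the transaction.
        DieselError::DatabaseError(DatabaseErrorKind::SerializationFailure, _) => {
            StatusCode::TOO_MANY_REQUESTS // 429 -- the choice debated below
        }
        DieselError::NotFound => StatusCode::NOT_FOUND,
        _ => StatusCode::INTERNAL_SERVER_ERROR,
    }
}
```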

@davepacheco (Collaborator)

429 doesn't seem quite right:

The 429 status code indicates that the user has sent too many requests in a given amount of time ("rate limiting").

@davepacheco (Collaborator)

Admittedly it's hard to know what to do here. This is basically a spurious error, not the client's fault. But it's hard to automatically retry it -- it seems like it might depend on the specific operation we're doing. Maybe a 500 or 503 is the best default?

Looking at oxidecomputer/remote-access-preview#23 I see we hit RETRY_WRITE_TOO_OLD. The docs also say:

Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Transaction Contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.

Do we know which record was experiencing contention? Maybe we can rework the queries to avoid it?

CC @smklein at some point we were talking about using batched transactions -- it sounds like it might address this problem as well.

@jmpesp (Contributor, Author) commented Sep 7, 2022

I believe the big region_allocate transaction is the problem here.

Sending all the statements as a single batch is something to try, but I don't immediately see how we can take that arbitrary transaction function and turn it into one.

@davepacheco (Collaborator)

That makes sense. Are you able to tell which query or record we're hitting this with?

I know we don't have an alternative to interactive transactions in Omicron today. Nevertheless, this looks like another serious consequence of using them. Looking at region_allocate_transaction(), it looks like we're making 3 + 4 * REGION_REDUNDANCY_THRESHOLD = 15 round-trips to the database within this transaction. By that I mean: we start the transaction, issue a query to the database, wait for the result, do some stuff in Rust, issue another query, wait for the result, etc. 15 times. During that entire period, if something else comes in and modifies any of the records we tried to update, I expect we'll get this error. That's a pretty big window. Note too that we always pick the same datasets for a new disk. So any concurrent attempt to create a disk will wind up trying to update the same records.

Retrying doesn't feel like a great long-term solution. It seems to replace a transient error (the retry error) with a latency bubble. As load increases and either Nexus or database latency increases, the window for a conflict gets longer. This increases the probability of a conflict. This results in more retries and more load -- it's a feedback loop. I don't know if this is likely in this scenario, but similar behaviors brought down Joyent's Manta several times.

We could reduce the likelihood of a conflict by reworking the function a bit. For example, it sequentially fetches each dataset and updates it, then does the same with zpools. We could do all the fetches at once and save maybe half of the round-trips? We could also introduce some randomness into the dataset selection so that not every concurrent attempt to create a disk tries to use (and update) the same dataset records. I'm not sure how much either of these will help.
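
As a concrete (hypothetical) sketch of the randomness idea -- not what region_allocate_transaction() does today -- the candidate datasets could be shuffled before picking REGION_REDUNDANCY_THRESHOLD of them, so concurrent disk creations don't all land on the same dataset rows:

```rust
// Illustrative only: shuffle candidate datasets client-side before choosing
// allocation targets. The function name is a placeholder; the same effect
// could come from an ORDER BY random() in the query itself.
use rand::seq::SliceRandom;
use uuid::Uuid;

const REGION_REDUNDANCY_THRESHOLD: usize = 3;

fn pick_target_datasets(mut candidates: Vec<Uuid>) -> Vec<Uuid> {
    candidates.shuffle(&mut rand::thread_rng());
    candidates.truncate(REGION_REDUNDANCY_THRESHOLD);
    candidates
}
```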

Even without solving the interactive transaction problem, I bet we could improve this significantly by moving more of the logic into SQL using a CTE. (A CTE is sort of a side door way to create a batched transaction, in that if you want to do X, then Y, then Z, you can always WITH foo as (X), bar as (Y), baz as (Z) ....)
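
To make that side-door shape concrete, here's a hedged sketch of what such a CTE could look like, using made-up table and column names rather than Omicron's actual schema, written as a raw SQL string the way it might be embedded in Rust:

```rust
// Sketch only (illustrative schema): each WITH clause names an intermediate
// result that later clauses can reference, so dataset selection, region
// insertion, and dataset accounting run as one statement instead of ~15
// interactive round-trips.
const ALLOCATE_REGIONS_SKETCH: &str = r#"
WITH candidate_datasets AS (
    -- pick allocation targets in one shot, randomized to spread contention
    SELECT id FROM dataset ORDER BY random() LIMIT 3
),
inserted_regions AS (
    -- one region per chosen dataset, referencing the CTE above
    INSERT INTO region (id, dataset_id, volume_id, size)
    SELECT gen_random_uuid(), id, $1, $2 FROM candidate_datasets
    RETURNING dataset_id, size
)
-- finally, account for the new usage on each dataset
UPDATE dataset
SET size_used = size_used + inserted_regions.size
FROM inserted_regions
WHERE dataset.id = inserted_regions.dataset_id
RETURNING dataset.id, dataset.size_used
"#;
```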

Or we could bite the bullet and invest in an interface for batched transactions. I think @smklein talked about this at some point but I don't remember if there's an issue for it?

Whether using a CTE or a batched transaction, there could be a sizable improvement in provisioning time. If each database round-trip is even 50ms, doing it in one go could shave off hundreds of milliseconds per disk per instance provisioned.

@davepacheco (Collaborator)

Related: #973

@smklein (Collaborator) commented Sep 7, 2022

I agree with @davepacheco, my inclination is either: (1) make this a CTE, or (2) make it possible for this to be a batched transaction. However, for this to be usable as a batched transaction, all the statements must be issuable without depending on the results of previous statements: https://www.cockroachlabs.com/docs/stable/transactions.html#batched-statements

I'm not sure this is viable here - the operations are roughly:

  • Read all (datasets, regions) used by the volume; exit early if they already exist
  • Read a few datasets which can be used as allocation targets
  • Insert regions into the DB, referencing those target datasets
  • Update the datasets to account for the increased size usage
  • Return an error if we are using too much of the zpool's storage

These operations are highly dependent on each other -- so even if we added batched transactions, I don't know if this would be a viable candidate.
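
For reference, a batched transaction under that constraint looks roughly like the sketch below -- every statement ships in one network round-trip, so none of them can be computed from an earlier statement's result. This uses tokio-postgres's batch_execute purely for illustration (Omicron goes through Diesel, which doesn't expose anything like this today), and the table/column names are placeholders:

```rust
// Illustration of a CRDB batched transaction: the whole BEGIN..COMMIT block
// is sent as a single batch, so CockroachDB can usually retry it server-side
// on contention.
use tokio_postgres::Client;

async fn run_batched(client: &Client) -> Result<(), tokio_postgres::Error> {
    client
        .batch_execute(
            "BEGIN;
             UPDATE counter SET value = value + 1 WHERE name = 'a';
             UPDATE counter SET value = value + 1 WHERE name = 'b';
             COMMIT;",
        )
        .await
}
```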

@davepacheco (Collaborator)

Yeah. I'm not sure it's impossible, but a CTE may be easier because it lets you give names to those intermediate things and then use them in subsequent statements.

@jmpesp (Contributor, Author) commented Sep 8, 2022

Closing this because of #1688

@jmpesp closed this Sep 8, 2022
@jmpesp deleted the too_many_requests branch September 8, 2022 13:59
leftwo pushed a commit that referenced this pull request Mar 26, 2025
Crucible changes are:
Make `ImpactedBlocks` an absolute range (#1685)
update omicron (#1687)
Update Rust crate chrono to v0.4.40 (#1682)
Update Rust crate tempfile to v3.19.0 (#1676)
Log loops during crutest replay (#1674)
Refactor live-repair start and send (#1673)
Update Rust crate libc to v0.2.171 (#1684)
Update Rust crate clap to v4.5.32 (#1683)
Update Rust crate async-trait to 0.1.88 (#1680)
Update Rust crate bytes to v1.10.1 (#1681)
Update Rust crate anyhow to v1.0.97 (#1679)
Update Rust crate http to v1 (#1678)
Update Rust crate tokio-util to v0.7.14 (#1675)
Update Rust crate thiserror to v2 (#1657)
Update Rust crate omicron-zone-package to 0.12.0 (#1647)
Update Rust crate opentelemetry to 0.28.0 (#1543)
Update Rust crate itertools to 0.14.0 (#1646)
Update Rust crate clearscreen to v4 (#1656)
nightly test polish (#1665)
Update Rust crate strum_macros to 0.27 (#1654)
Update Rust crate strum to 0.27 (#1653)
Update Rust crate rusqlite to 0.34 (#1652)
Better return semantics from should_send
Trying to simplify enqueue
Remove unnecessary argument
Remove unused code from `ImpactedBlocks` (#1668)
Log when pantry/volume layers activate things (#1670)
Fix unused import (#1669)
Use `RangeSet` instead of tracking every complete job (#1663)
Update Rust crate dropshot to 0.16.0 (#1645)
Simplify pending work tracking (#1662)
Remove rusqlite dependency from crucible-common (#1659)
Update Rust crate reedline to 0.38.0 (#1651)
Update Rust crate proptest to 1.6.0 (#1648)
Update Rust crate bytes to v1.10.0 (#1644)
Update Rust crate tracing-subscriber to 0.3.19 (#1643)
Update Rust crate tracing to v0.1.41 (#1642)
Update Rust crate toml to v0.8.20 (#1641)
Update Rust crate thiserror to v1.0.69 (#1640)
Update Rust crate serde_json to v1.0.139 (#1639)
Update Rust crate serde to v1.0.218 (#1638)
Update Rust crate reqwest to v0.12.12 (#1637)
Update Rust crate libc to v0.2.170 (#1636)
Update Rust crate indicatif to 0.17.11 (#1635)
Update Rust crate httptest to 0.16.3 (#1634)
Update Rust crate clap to v4.5.31 (#1632)
Update Rust crate chrono to v0.4.39 (#1631)
Update Rust crate byte-unit to 5.1.6 (#1630)
Update Rust crate async-trait to 0.1.86 (#1629)
Update Rust crate csv to 1.3.1 (#1633)
Update Rust crate anyhow to v1.0.96 (#1628)
Fix a test flake (#1626)
Bitpack per-block slot data in memory (#1625)
Use `div_ceil` instead of `(x + 7) / 8` (#1624)
Add repair server dynamometer (#1618)

Propolis changes are:
Paper over minor() differences
Update crucible (#886)
Test oximeter metrics end to end (#855)
server: improve behavior of starting VMs that are waiting for Crucible activation (#873)
server: extend API request queue's memory of prior requests (#880)
Use production propolis-server flags in PHD CI builds (#878)

Added a new field to the Board struct that propolis requires.
leftwo added a commit that referenced this pull request Mar 27, 2025
Crucible and Propolis changes are the same list as in the commit above.

Co-authored-by: Alan Hanson <alan@oxide.computer>