
Conversation

@jmpesp (Contributor) commented Sep 7, 2022

Match against DieselDatabaseErrorKind::SerializationFailure and check whether this is a case where a CRDB transaction didn't fail outright but needs to be retried. This can happen when there's too much database contention, which requires clients to retry the request.

Return HTTP 429 in this case to signal to clients that a retry is required.

Closes https://github.com/oxidecomputer/remote-access-preview/issues/23
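
For illustration, a minimal sketch of the mapping described above (not this PR's actual diff), assuming the `http` crate's `StatusCode` on the HTTP side; Omicron's real error plumbing goes through its own error types:

```rust
// A sketch, not the PR's diff: inspect a Diesel error and map CockroachDB
// serialization failures (the "restart transaction" / SQLSTATE 40001 case)
// to a retryable HTTP status instead of a generic 500.
use diesel::result::{DatabaseErrorKind, Error as DieselError};
use http::StatusCode;

fn http_status_for_diesel_error(e: &DieselError) -> StatusCode {
    match e {
        // CRDB is asking the client to retry the transaction.
        DieselError::DatabaseError(DatabaseErrorKind::SerializationFailure, _) => {
            StatusCode::TOO_MANY_REQUESTS // 429 -- the choice debated below
        }
        DieselError::NotFound => StatusCode::NOT_FOUND,
        _ => StatusCode::INTERNAL_SERVER_ERROR,
    }
}
```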

@davepacheco (Collaborator)

429 doesn't seem quite right:

The 429 status code indicates that the user has sent too many requests in a given amount of time ("rate limiting").

@davepacheco (Collaborator)

Admittedly it's hard to know what to do here. This is basically a spurious error, not the client's fault. But it's hard to automatically retry it -- it seems like it might depend on the specific operation we're doing. Maybe a 500 or 503 is the best default?

Looking at oxidecomputer/remote-access-preview#23 I see we hit RETRY_WRITE_TOO_OLD. The docs also say:

Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Transaction Contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.

Do we know which record was experiencing contention? Maybe we can rework the queries to avoid it?

CC @smklein at some point we were talking about using batched transactions -- it sounds like it might address this problem as well.

@jmpesp (Contributor, Author) commented Sep 7, 2022

I believe the big region_allocate transaction is the problem here.

Sending all the statements as a single batch is something to try, but I don't immediately see how we can take that arbitrary transaction function and turn it into one.

@davepacheco (Collaborator)

That makes sense. Are you able to tell which query or record we're hitting this with?

I know we don't have an alternative to interactive transactions in Omicron today. Nevertheless, this looks like another serious consequence of using them. Looking at region_allocate_transaction(), it looks like we're making 3 + 4 * REGION_REDUNDANCY_THRESHOLD = 15 round-trips to the database within this transaction. By that I mean: we start the transaction, issue a query to the database, wait for the result, do some stuff in Rust, issue another query, wait for the result, etc. 15 times. During that entire period, if something else comes in and modifies any of the records we tried to update, I expect we'll get this error. That's a pretty big window. Note too that we always pick the same datasets for a new disk. So any concurrent attempt to create a disk will wind up trying to update the same records.

Retrying doesn't feel like a great long-term solution. It seems to replace a transient error (the retry error) with a latency bubble. As load increases and either Nexus or database latency increases, the window for a conflict gets longer. This increases the probability of a conflict. This results in more retries and more load -- it's a feedback loop. I don't know if this is likely in this scenario, but similar behaviors brought down Joyent's Manta several times.

We could reduce the likelihood of a conflict by reworking the function a bit. For example, it sequentially fetches each dataset and updates it, then does the same with zpools. We could do all the fetches at once and save maybe half of the round-trips? We could also introduce some randomness into the dataset selection so that not every concurrent attempt to create a disk tries to use (and update) the same dataset records. I'm not sure how much either of these will help.
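
As a concrete (hypothetical) sketch of the randomness idea -- not what region_allocate_transaction() does today -- the candidate datasets could be shuffled before picking REGION_REDUNDANCY_THRESHOLD of them, so concurrent disk creations don't all land on the same dataset rows:

```rust
// Illustrative only: shuffle candidate datasets client-side before choosing
// allocation targets. The function name is a placeholder; the same effect
// could come from an ORDER BY random() in the query itself.
use rand::seq::SliceRandom;
use uuid::Uuid;

const REGION_REDUNDANCY_THRESHOLD: usize = 3;

fn pick_target_datasets(mut candidates: Vec<Uuid>) -> Vec<Uuid> {
    candidates.shuffle(&mut rand::thread_rng());
    candidates.truncate(REGION_REDUNDANCY_THRESHOLD);
    candidates
}
```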

Even without solving the interactive transaction problem, I bet we could improve this significantly by moving more of the logic into SQL using a CTE. (A CTE is sort of a side door way to create a batched transaction, in that if you want to do X, then Y, then Z, you can always WITH foo as (X), bar as (Y), baz as (Z) ....)
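
To make that side-door shape concrete, here's a hedged sketch of what such a CTE could look like, using made-up table and column names rather than Omicron's actual schema, written as a raw SQL string the way it might be embedded in Rust:

```rust
// Sketch only (illustrative schema): each WITH clause names an intermediate
// result that later clauses can reference, so dataset selection, region
// insertion, and dataset accounting run as one statement instead of ~15
// interactive round-trips.
const ALLOCATE_REGIONS_SKETCH: &str = r#"
WITH candidate_datasets AS (
    -- pick allocation targets in one shot, randomized to spread contention
    SELECT id FROM dataset ORDER BY random() LIMIT 3
),
inserted_regions AS (
    -- one region per chosen dataset, referencing the CTE above
    INSERT INTO region (id, dataset_id, volume_id, size)
    SELECT gen_random_uuid(), id, $1, $2 FROM candidate_datasets
    RETURNING dataset_id, size
)
-- finally, account for the new usage on each dataset
UPDATE dataset
SET size_used = size_used + inserted_regions.size
FROM inserted_regions
WHERE dataset.id = inserted_regions.dataset_id
RETURNING dataset.id, dataset.size_used
"#;
```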

Or we could bite the bullet and invest in an interface for batched transactions. I think @smklein talked about this at some point but I don't remember if there's an issue for it?

Whether using a CTE or a batched transaction, there could be a sizable improvement in provisioning time. If each database round-trip is even 50ms, doing it in one go could shave off hundreds of milliseconds per disk per instance provisioned.

@davepacheco (Collaborator)

Related: #973

@smklein (Collaborator) commented Sep 7, 2022

I agree with @davepacheco, my inclination is either: (1) make this a CTE, or (2) make it possible for this to be a batched transaction. However, for this to be usable as a batched transaction, all the statements must be issuable without depending on the results of previous statements: https://www.cockroachlabs.com/docs/stable/transactions.html#batched-statements

I'm not sure this is viable here - the operations are roughly:

  • Read all (datasets, regions) used by the volume; exit early if they already exist
  • Read a few datasets which can be used as allocation targets
  • Insert regions into the DB, referencing those target datasets
  • Update the datasets to account for the increased size usage
  • Return an error if we are using too much of the zpool's storage

These operations are highly dependent on each other -- so even if we added batched transactions, I don't know if this would be a viable candidate.
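
For reference, a batched transaction under that constraint looks roughly like the sketch below -- every statement ships in one network round-trip, so none of them can be computed from an earlier statement's result. This uses tokio-postgres's batch_execute purely for illustration (Omicron goes through Diesel, which doesn't expose anything like this today), and the table/column names are placeholders:

```rust
// Illustration of a CRDB batched transaction: the whole BEGIN..COMMIT block
// is sent as a single batch, so CockroachDB can usually retry it server-side
// on contention.
use tokio_postgres::Client;

async fn run_batched(client: &Client) -> Result<(), tokio_postgres::Error> {
    client
        .batch_execute(
            "BEGIN;
             UPDATE counter SET value = value + 1 WHERE name = 'a';
             UPDATE counter SET value = value + 1 WHERE name = 'b';
             COMMIT;",
        )
        .await
}
```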

@davepacheco (Collaborator)

Yeah. I'm not sure it's impossible, but a CTE may be easier because it lets you give names to those intermediate things and then use them in subsequent statements.

@jmpesp (Contributor, Author) commented Sep 8, 2022

Closing this because of #1688

@jmpesp closed this Sep 8, 2022
@jmpesp deleted the too_many_requests branch September 8, 2022 13:59
leftwo pushed a commit that referenced this pull request Mar 26, 2025
Crucible changes are:
Make `ImpactedBlocks` an absolute range (#1685)
update omicron (#1687)
Update Rust crate chrono to v0.4.40 (#1682)
Update Rust crate tempfile to v3.19.0 (#1676)
Log loops during crutest replay (#1674)
Refactor live-repair start and send (#1673)
Update Rust crate libc to v0.2.171 (#1684)
Update Rust crate clap to v4.5.32 (#1683)
Update Rust crate async-trait to 0.1.88 (#1680)
Update Rust crate bytes to v1.10.1 (#1681)
Update Rust crate anyhow to v1.0.97 (#1679)
Update Rust crate http to v1 (#1678)
Update Rust crate tokio-util to v0.7.14 (#1675)
Update Rust crate thiserror to v2 (#1657)
Update Rust crate omicron-zone-package to 0.12.0 (#1647)
Update Rust crate opentelemetry to 0.28.0 (#1543)
Update Rust crate itertools to 0.14.0 (#1646)
Update Rust crate clearscreen to v4 (#1656)
nightly test polish (#1665)
Update Rust crate strum_macros to 0.27 (#1654)
Update Rust crate strum to 0.27 (#1653)
Update Rust crate rusqlite to 0.34 (#1652)
Better return semantics from should_send
Trying to simplify enqueue
Remove unnecessary argument
Remove unused code from `ImpactedBlocks` (#1668)
Log when pantry/volume layers activate things (#1670)
Fix unused import (#1669)
Use `RangeSet` instead of tracking every complete job (#1663)
Update Rust crate dropshot to 0.16.0 (#1645)
Simplify pending work tracking (#1662)
Remove rusqlite dependency from crucible-common (#1659)
Update Rust crate reedline to 0.38.0 (#1651)
Update Rust crate proptest to 1.6.0 (#1648)
Update Rust crate bytes to v1.10.0 (#1644)
Update Rust crate tracing-subscriber to 0.3.19 (#1643)
Update Rust crate tracing to v0.1.41 (#1642)
Update Rust crate toml to v0.8.20 (#1641)
Update Rust crate thiserror to v1.0.69 (#1640)
Update Rust crate serde_json to v1.0.139 (#1639)
Update Rust crate serde to v1.0.218 (#1638)
Update Rust crate reqwest to v0.12.12 (#1637)
Update Rust crate libc to v0.2.170 (#1636)
Update Rust crate indicatif to 0.17.11 (#1635)
Update Rust crate httptest to 0.16.3 (#1634)
Update Rust crate clap to v4.5.31 (#1632)
Update Rust crate chrono to v0.4.39 (#1631)
Update Rust crate byte-unit to 5.1.6 (#1630)
Update Rust crate async-trait to 0.1.86 (#1629)
Update Rust crate csv to 1.3.1 (#1633)
Update Rust crate anyhow to v1.0.96 (#1628)
Fix a test flake (#1626)
Bitpack per-block slot data in memory (#1625)
Use `div_ceil` instead of `(x + 7) / 8` (#1624)
Add repair server dynamometer (#1618)

Propolis changes are:
Paper over minor() differences
Update crucible (#886)
Test oximeter metrics end to end (#855)
server: improve behavior of starting VMs that are waiting for Crucible activation (#873)
server: extend API request queue's memory of prior requests (#880)
Use production propolis-server flags in PHD CI builds (#878)

Added a new field to the Board struct that propolis requires.
leftwo added a commit that referenced this pull request Mar 27, 2025
Crucible and Propolis changes are the same list as in the commit above.

Co-authored-by: Alan Hanson <alan@oxide.computer>