Only kick a Downstairs on timeout #1252

@mkeeter

Right now, there are several ways a Downstairs can be kicked out of the Active state:

  • If there are more than 57K jobs or 1 GiB in flight, it's moved to Faulted, which means it goes through live-repair when reconnecting
  • If it doesn't reply for 45 seconds, it's moved to Offline, which means it goes through replay when reconnecting
    • Note that it can later be moved from Offline to Faulted if it hits either the 57K-job or the 1 GiB bytes-in-flight criterion

(there are other paths, e.g. returning an IO error or having a connection close, but those aren't relevant to the issue)
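As a rough sketch of the two paths listed above (not the actual Crucible code), the current behavior could be modeled as a small state function. The enum, constant names, and function signature are all hypothetical; only the thresholds come from this issue, and "57K" is approximated as 57,000.

```rust
/// Hypothetical, simplified view of the Downstairs states named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    Offline,
    Faulted,
}

// Thresholds quoted in the issue; the constant names are made up for this sketch.
const MAX_JOBS_IN_FLIGHT: u64 = 57_000; // "57K", approximated
const MAX_BYTES_IN_FLIGHT: u64 = 1024 * 1024 * 1024; // 1 GiB
const REPLY_TIMEOUT_SECS: u64 = 45;

/// Current rules: exceeding either in-flight limit faults the Downstairs
/// from Active or Offline; a reply timeout moves Active to Offline.
fn next_state_current(
    state: DsState,
    jobs_in_flight: u64,
    bytes_in_flight: u64,
    secs_since_reply: u64,
) -> DsState {
    let over_limit =
        jobs_in_flight > MAX_JOBS_IN_FLIGHT || bytes_in_flight > MAX_BYTES_IN_FLIGHT;
    match state {
        // The guard applies to both alternatives of the or-pattern.
        DsState::Active | DsState::Offline if over_limit => DsState::Faulted,
        DsState::Active if secs_since_reply >= REPLY_TIMEOUT_SECS => DsState::Offline,
        other => other,
    }
}
```

Note that in this model a responsive Downstairs can still be faulted purely because work has piled up, which is the coupling discussed next.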

The jobs / bytes-in-flight criteria are uncomfortably coupled to backpressure: if the maximum backpressure delay is less than the true job completion time, then jobs will inevitably accumulate and will eventually cause the Downstairs to be faulted. In other words, we have to guess a value for maximum backpressure such that the system will never exceed it during normal operation.
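To make the accumulation argument concrete, here is a deliberately crude toy model, not a description of how Crucible schedules work. It assumes the guest submits one job per maximum backpressure delay and the Downstairs completes one job per fixed completion time. The function name and parameters are hypothetical, and both times must be positive.

```rust
/// Toy model: time (in seconds) until the in-flight job count reaches
/// `job_limit`, assuming one job is submitted every `max_delay_us`
/// microseconds and one job completes every `completion_us` microseconds.
/// Returns `None` when the backlog does not grow in this model.
fn secs_until_job_limit(max_delay_us: f64, completion_us: f64, job_limit: f64) -> Option<f64> {
    if max_delay_us >= completion_us {
        // Submissions are no faster than completions: no accumulation.
        return None;
    }
    let submit_rate = 1e6 / max_delay_us; // jobs per second
    let complete_rate = 1e6 / completion_us; // jobs per second
    Some(job_limit / (submit_rate - complete_rate))
}
```

For example, with a 1 ms maximum delay and a 2 ms completion time, the backlog grows by 500 jobs per second, so `secs_until_job_limit(1000.0, 2000.0, 57_000.0)` yields 114 seconds before the job limit is hit in this model.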

We've gone through several rounds of tuning maximum byte-based backpressure (#1240, #1243), and are now seeing (https://github.com/oxidecomputer/customer-support/issues/121) that job-based backpressure may also be insufficient.

After talking about this in the 2024-04-01 Crucible Huddle, our new plan is to remove the jobs / bytes-in-flight criteria for the Active → Faulted transition. In other words, as long as a Downstairs is replying (however slowly), we won't mark it as faulted. The Offline → Faulted transition will remain in place, so that we don't accumulate jobs forever.
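Under the same caveats (hypothetical names, only the thresholds taken from this issue, "57K" approximated as 57,000), the proposed rules might look like the sketch below. The only way out of Active in this model is the reply timeout; the in-flight limits apply solely to an Offline Downstairs.

```rust
/// Hypothetical, simplified view of the Downstairs states named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    Offline,
    Faulted,
}

// Thresholds quoted in the issue; the constant names are made up for this sketch.
const MAX_JOBS_IN_FLIGHT: u64 = 57_000; // "57K", approximated
const MAX_BYTES_IN_FLIGHT: u64 = 1024 * 1024 * 1024; // 1 GiB
const REPLY_TIMEOUT_SECS: u64 = 45;

/// Proposed rules: an Active Downstairs is only kicked on timeout;
/// the in-flight limits fault a Downstairs only once it is Offline.
fn next_state_proposed(
    state: DsState,
    jobs_in_flight: u64,
    bytes_in_flight: u64,
    secs_since_reply: u64,
) -> DsState {
    let over_limit =
        jobs_in_flight > MAX_JOBS_IN_FLIGHT || bytes_in_flight > MAX_BYTES_IN_FLIGHT;
    match state {
        DsState::Active if secs_since_reply >= REPLY_TIMEOUT_SECS => DsState::Offline,
        DsState::Offline if over_limit => DsState::Faulted,
        other => other,
    }
}
```

With these rules, a slow but responsive Downstairs with a very large backlog stays Active, and bounding that backlog becomes the job of backpressure alone.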

This theoretically means that the Upstairs can buffer arbitrary amounts of data on behalf of a pathologically slow Downstairs; however, with quadratic backpressure, we can do the math and satisfy ourselves that it won't ever hit worrying values.
