Description
Right now, there are several ways a Downstairs can be kicked out of the Active state:
- If there are more than 57K jobs or 1 GiB in flight, it's moved to Faulted, which means it goes through live-repair when reconnecting
- If it doesn't reply for 45 seconds, it's moved to Offline, which means it goes through replay when reconnecting
  - Note that it can later be moved from Offline to Faulted if it hits either 57K jobs / 1 GiB bytes-in-flight criteria
(there are other paths, e.g. returning an IO error or having a connection close, but those aren't relevant to the issue)
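The current rules can be summarized as a small transition function. This is only a sketch of the behavior described above, not Crucible's actual code: the names `DsState`, `Snapshot`, and `next_state_current` are hypothetical, "57K" is assumed to mean 57,000, and the other exit paths (IO errors, closed connections) are left out.

```rust
/// Only the states named in this issue; the real state machine has more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    Offline,
    Faulted,
}

// Thresholds quoted in the issue text (57K assumed to be 57,000).
const MAX_JOBS: usize = 57_000;
const MAX_BYTES: u64 = 1 << 30; // 1 GiB
const REPLY_TIMEOUT_SECS: u64 = 45;

/// Hypothetical view of one Downstairs as seen by the Upstairs.
struct Snapshot {
    jobs_in_flight: usize,
    bytes_in_flight: u64,
    secs_since_reply: u64,
}

/// Current behavior: the in-flight limits apply to both Active and Offline.
fn next_state_current(state: DsState, s: &Snapshot) -> DsState {
    let over_limit = s.jobs_in_flight > MAX_JOBS || s.bytes_in_flight > MAX_BYTES;
    let timed_out = s.secs_since_reply >= REPLY_TIMEOUT_SECS;
    match state {
        // Too much work in flight: fault it (live-repair on reconnect).
        DsState::Active | DsState::Offline if over_limit => DsState::Faulted,
        // No reply for 45 seconds: mark it offline (replay on reconnect).
        DsState::Active if timed_out => DsState::Offline,
        other => other,
    }
}
```

Note that in this sketch a Downstairs that is still replying, but slowly enough to exceed either limit, goes straight from Active to Faulted.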
The jobs / bytes-in-flight criteria are uncomfortably coupled to backpressure: if the maximum backpressure is less than the true job completion time, then jobs will inevitably accumulate and will eventually cause the Downstairs to be faulted. In other words, we have to guess a value for maximum backpressure such that the system will never exceed it during normal operation.
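To make the accumulation argument concrete, here is a toy model. It assumes a submitter that is always stalled at the maximum backpressure delay (so it submits one job per delay interval) and a Downstairs that completes one job per completion interval. The function name and the example numbers are invented for illustration and are not measurements from a real system.

```rust
/// Toy model: seconds until `job_limit` jobs are in flight, or `None` if the
/// Downstairs keeps up and the queue never grows.
///
/// Assumes one job is submitted every `max_delay_ms` (the submitter is pinned
/// at maximum backpressure) and one job completes every `completion_ms`.
fn secs_until_job_limit(max_delay_ms: f64, completion_ms: f64, job_limit: f64) -> Option<f64> {
    let submit_rate = 1000.0 / max_delay_ms; // jobs submitted per second
    let complete_rate = 1000.0 / completion_ms; // jobs completed per second
    let growth = submit_rate - complete_rate; // net jobs accumulated per second
    if growth > 0.0 {
        Some(job_limit / growth)
    } else {
        None
    }
}
```

With an illustrative maximum delay of 10 ms and a completion time of 20 ms, `secs_until_job_limit(10.0, 20.0, 57_000.0)` gives 1140 seconds: the queue grows by 50 jobs per second and reaches 57,000 jobs in 19 minutes, even though the Downstairs never stopped replying. Swapping the two values gives `None`, since completions then outpace submissions.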
We've gone through several rounds of tuning maximum byte-based backpressure (#1240, #1243), and are now seeing (https://github.com/oxidecomputer/customer-support/issues/121) that job-based backpressure may also be insufficient.
After talking about this in the 2024-04-01 Crucible Huddle, our new plan is to remove the jobs / bytes-in-flight criteria for the Active → Faulted transition. In other words, as long as a Downstairs is replying (however slowly), we won't mark it as faulted. The Offline → Faulted transition will remain in place, so that we don't accumulate jobs forever.
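A sketch of the proposed rules, under the same caveats as the rest of this issue's description: the names `DsState`, `Snapshot`, and `next_state_proposed` are hypothetical, "57K" is assumed to mean 57,000, and only the transitions discussed here are modeled.

```rust
/// Only the states named in this issue; the real state machine has more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    Offline,
    Faulted,
}

// Thresholds quoted in the issue text (57K assumed to be 57,000).
const MAX_JOBS: usize = 57_000;
const MAX_BYTES: u64 = 1 << 30; // 1 GiB
const REPLY_TIMEOUT_SECS: u64 = 45;

/// Hypothetical view of one Downstairs as seen by the Upstairs.
struct Snapshot {
    jobs_in_flight: usize,
    bytes_in_flight: u64,
    secs_since_reply: u64,
}

/// Proposed behavior: the in-flight limits no longer apply while Active.
fn next_state_proposed(state: DsState, s: &Snapshot) -> DsState {
    let over_limit = s.jobs_in_flight > MAX_JOBS || s.bytes_in_flight > MAX_BYTES;
    let timed_out = s.secs_since_reply >= REPLY_TIMEOUT_SECS;
    match state {
        // A replying Downstairs stays Active however much is in flight;
        // only silence moves it out of Active.
        DsState::Active if timed_out => DsState::Offline,
        // Offline -> Faulted remains, so jobs don't accumulate forever.
        DsState::Offline if over_limit => DsState::Faulted,
        other => other,
    }
}
```

In this sketch, an Active Downstairs with a million jobs in flight stays Active as long as it replied within the last 45 seconds; bounding that backlog becomes the job of backpressure alone.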
This theoretically means that the Upstairs can buffer arbitrary amounts of data on behalf of a pathologically-slow Downstairs; however, with quadratic backpressure, we can do the math and satisfy ourselves that it won't ever hit worrying values.