Make backoff policy explicit, with short / long options #1996
```rust
/// for a relatively long amount of time.
pub fn retry_policy_long() -> ::backoff::ExponentialBackoff {
    const INITIAL_INTERVAL: Duration = Duration::from_millis(250);
    const MAX_INTERVAL: Duration = Duration::from_secs(60 * 60);
```
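The rest of the function isn't visible in this hunk. A guess at the complete body, assuming the `backoff` crate's `ExponentialBackoff` exposes public `initial_interval` / `max_interval` / `max_elapsed_time` fields with a `Default` impl (true of recent versions of the crate); the actual multiplier and randomization settings may differ:

```rust
use std::time::Duration;

// A guess at the complete function, for context; the real body may differ
// (e.g., in multiplier or randomization settings).
pub fn retry_policy_long() -> ::backoff::ExponentialBackoff {
    const INITIAL_INTERVAL: Duration = Duration::from_millis(250);
    const MAX_INTERVAL: Duration = Duration::from_secs(60 * 60);
    ::backoff::ExponentialBackoff {
        initial_interval: INITIAL_INTERVAL,
        max_interval: MAX_INTERVAL,
        // `None` means the policy never gives up on its own; callers decide
        // when (or whether) to stop retrying.
        max_elapsed_time: None,
        ..Default::default()
    }
}
```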
At risk of invoking Clulow's lament - while we're fiddling with this, is maxing out at an hour reasonable? That seems quite long to me, but it's entirely possible my intuition here is wrong. Also seems fine to leave this alone for now and tune later when we have more experience to decide. (It looks like the majority of call sites are using _short() though, so maybe this is not super relevant either way, if the handful of _long() calls really are of the "as long as we get to it eventually" case.)
I remember asking @davepacheco the same question. If we get to that point, it means that the service has already failed to respond within the cumulative time we've waited thus far. That's on the order of an hour anyway, since the policy is exponential. At that point, it seems like waiting another hour is reasonable.
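As a rough sanity check on the "cumulative time is on the order of an hour" point (assuming a 2x multiplier and no jitter, which are illustrative values rather than the crate's actual defaults), the cumulative wait crosses an hour around attempt 14, about when the individual interval approaches the one-hour cap:

```rust
use std::time::Duration;

// Illustrative only: assumes a 2x multiplier and no jitter; the real policy's
// multiplier and randomization may differ.
fn main() {
    let initial = Duration::from_millis(250);
    let max_interval = Duration::from_secs(60 * 60);
    let mut interval = initial;
    let mut total = Duration::ZERO;
    for attempt in 1.. {
        total += interval;
        if interval >= max_interval || total >= Duration::from_secs(60 * 60) {
            println!(
                "attempt {attempt}: interval = {:?}, cumulative wait = {:?}",
                interval, total
            );
            break;
        }
        interval = (interval * 2).min(max_interval);
    }
}
```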
Thanks for fixing this! I have some specific suggestions and then I'll say more about why.
Since you're changing up the core primitives here (such as they are), I think it's worth digging in a bit. To elaborate on what I wrote in #1270 (maybe this needs an RFD?), I start from these principles: these policies are critical for maintaining availability when the system starts going south. In any given code path, it's super easy to make bad choices about the policy that can cause the system to keel over after otherwise-recoverable events. So we should think about how we can make it easy to do the right thing and harder to do the wrong thing. To start, I feel like we could use better guidance about (1) when to retry and (2) what kind of policy to use.

My take (and I'm not saying this is the only valid take) is that retries are primarily for cases where you've attempted to make a request to an external, loosely-coupled component and either you failed to make the request at all (i.e., a network error or timeout) or the request failed explicitly with a transient error indicating it's offline or overloaded (e.g., 503). This might happen because we're in the middle of initialization and it hasn't been started yet, or it's being upgraded (although hopefully in most cases upgrades won't involve downtime), or because it's having an unexpected outage, or because it's overloaded, or because the network is having an outage, or because the network is overloaded, etc. In most of these cases, it's possible that the caller itself is responsible for an overload that is the reason these requests are failing. This is what exponential backoff with a long maximum interval is for. For these reasons, I'd expect to use that kind of policy (the "long" one here) for most requests to internal services like Nexus and CockroachDB.
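A hand-rolled sketch of that distinction, using only std and made-up names (`make_request`, `RequestError`) rather than anything from omicron: retry on network errors and 503s with exponential backoff up to a long cap, and fail immediately on everything else.

```rust
use std::time::Duration;

// Hypothetical error type for illustration.
enum RequestError {
    /// Network error, timeout, or an explicit 503: the service may recover.
    Transient(String),
    /// 400/403-style failures: retrying won't change the outcome.
    Permanent(String),
}

// Hypothetical request; stands in for a call to Nexus, CockroachDB, etc.
fn make_request() -> Result<(), RequestError> {
    Err(RequestError::Transient("503 Service Unavailable".into()))
}

/// Retry only transient failures, backing off exponentially up to a long cap.
fn call_with_retry() -> Result<(), String> {
    let mut delay = Duration::from_millis(250);
    let max_delay = Duration::from_secs(60 * 60);
    loop {
        match make_request() {
            Ok(v) => return Ok(v),
            // Overload or outage: back off so we don't make the problem worse.
            Err(RequestError::Transient(msg)) => {
                eprintln!("transient failure ({msg}); retrying in {delay:?}");
                std::thread::sleep(delay);
                delay = (delay * 2).min(max_delay);
            }
            // Our request is simply wrong; hammering the service won't help.
            Err(RequestError::Permanent(msg)) => return Err(msg),
        }
    }
}
```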
Another common case where we do retry today is when we're waiting for some external, loosely-coupled service to do something, we don't know how long it will take, and we have no way of just directly waiting for that thing to be done. As I've said before, I don't think this is a great use case for a standard "check state, backoff, retry" loop. With the overload case above, exponential backoff makes sense because by definition we're colliding with other callers for a limited resource and we're trying to get out of that situation by spreading the attempts over a longer period. That's not the case if we're just waiting for another service to do some work for us. A long poll wrapped in a retry loop (with no backoff at all) would provide much better latency (which matters a lot in the case of operations like instance provisioning). A proactive call from the other service works too. I can understand for urgency reasons why we might choose to use an exponential retry here, but I think we should be intentional and clear that we're doing this because it's expedient, not because exponential backoff makes sense, and probably keep track of these cases so that we can fix them later.

There are cases in the Sled Agent that look similar to both of the above, but where the component we're making requests to is local to the box. You could argue that in this case, there's only so overloaded it could possibly be, and there's no need to back off quite so long. There are potentially still two different cases here: one where requests are failing because of a transient condition that could be an overload (I don't know if there's anything in this bucket) and one where you're just waiting for some other system to do something and don't have a great way to wait directly for it (e.g., svc.startd to create the "state" property on a newly-created SMF instance). These are the kinds of things I've seen happen in code paths that choose the wrong policy.
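To make the long-poll point above concrete, a sketch with hypothetical `check_instance_state` / `await_instance_state` functions (not real omicron APIs): the backoff-polling version can miss a state change by up to its latest interval, while the long-poll version returns as soon as the server observes the change and can be retried immediately without hammering anyone.

```rust
use std::time::Duration;

// Hypothetical API: returns the current state immediately.
fn check_instance_state() -> Result<String, String> { Ok("starting".into()) }

// Hypothetical long-poll API: the *server* holds the request open until the
// state changes or `timeout` elapses, so the client learns about the change
// almost immediately instead of on its next poll.
fn await_instance_state(timeout: Duration) -> Result<Option<String>, String> {
    let _ = timeout;
    Ok(Some("running".into()))
}

/// Polling with exponential backoff: simple, but if the state changes right
/// after a check, we won't notice until the next (ever longer) interval.
fn wait_by_polling() -> Result<String, String> {
    let mut delay = Duration::from_millis(250);
    loop {
        let state = check_instance_state()?;
        if state == "running" {
            return Ok(state);
        }
        std::thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_secs(60));
    }
}

/// Long poll in a retry loop with no backoff: each call blocks server-side,
/// so retrying immediately doesn't hammer anything, and latency stays low.
fn wait_by_long_poll() -> Result<String, String> {
    loop {
        if let Some(state) = await_instance_state(Duration::from_secs(30))? {
            return Ok(state);
        }
        // Timed out with no change; just ask again.
    }
}
```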
If we accept the above, what can we do to make it easy for people to choose the right policy and hard to choose the wrong one? I think these failure modes are pretty subtle. My experience is that people don't know about these problems until they've either experienced them or heard from someone who has. So if you ask people to pick "short" or "long", they're often going to pick the wrong one. My original thought was to provide a small number of canned policies named for specific use cases (e.g., requests to internal services vs. polling something local to the box) rather than for how aggressively they back off.
If there's general agreement on the above principles, then I'd suggest naming and documenting the policies around those use cases (which is what I said at the top). I also kind of wish that every caller of the retry helpers had to be explicit about which of those cases they're in.
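One way to read the "canned policies named for use cases" idea, using the names the follow-up PR eventually settled on but with a simplified type and made-up numbers for the local case (the 250 ms / 1 h values come from the diff above):

```rust
use std::time::Duration;

/// Illustrative shape of a policy; the real code returns
/// `::backoff::ExponentialBackoff`, and the local-policy numbers are made up.
pub struct RetryPolicy {
    pub initial_interval: Duration,
    pub max_interval: Duration,
}

/// For requests to internal services (Nexus, CockroachDB, ...) that may be
/// offline or overloaded: back off a long way so we don't pile on.
pub fn retry_policy_internal_service() -> RetryPolicy {
    RetryPolicy {
        initial_interval: Duration::from_millis(250),
        max_interval: Duration::from_secs(60 * 60),
    }
}

/// For waiting on something local to the box (e.g., svc.startd creating an
/// SMF property): overload is bounded, so keep the cap short.
pub fn retry_policy_local() -> RetryPolicy {
    RetryPolicy {
        initial_interval: Duration::from_millis(50),
        max_interval: Duration::from_secs(1),
    }
}
```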
I agree with you that we should settle on names more descriptive than "short" and "long", and I'm happy to take on that follow-up. More precise naming may indicate we need better-defined categories here. However, I don't totally know if I agree with the take that a long exponential backoff is the right default for all requests to internal services.
Here are some examples where we're making requests from Sled Agent to Nexus, and where the cost of continuing to back off, on the order of minutes to hours, may actually be much more detrimental than the cost of "sending a query per second until we get a response".

In this case, we are notifying Nexus that a new sled exists (omicron/sled-agent/src/sled_agent.rs, lines 524 to 530 at 2d7373b).

Similarly, in this case, we are notifying Nexus that a new disk (well, zpool) exists (omicron/sled-agent/src/storage_manager.rs, lines 660 to 667 at 2d7373b).

If we allowed the backoff to get far -- up to minutes, or hours -- then the rack could be severely underutilized, increasing the burden of any existing load on the system. Nexus simply wouldn't be aware of disks or sleds that it could be using, and arguably, this would make any ongoing load problems worse. Arguably, this delay was already manifesting in #1991. This is part of why I settled on the "short" vs "long" names: these cases made me skeptical that we could apply the same retry policy universally based on the target service. Does that make sense? I know that's not exactly a concrete proposal, but I still think it's a use case we need to consider.
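To put a rough number on the under-utilization concern (illustrative arithmetic only: 2x multiplier, no jitter, and an assumed 10-minute Nexus outage): once the outage ends, the sled agent's next retry may still be several minutes away, and the new sled or zpool sits unused for that entire window.

```rust
use std::time::Duration;

// Illustrative arithmetic: if Nexus is down for `outage` and the sled agent is
// backing off exponentially (2x multiplier, no jitter, 1h cap -- all assumed
// values), how long after Nexus recovers before the next retry lands?
fn main() {
    let outage = Duration::from_secs(10 * 60);
    let mut interval = Duration::from_millis(250);
    let max_interval = Duration::from_secs(60 * 60);
    let mut elapsed = Duration::ZERO;

    while elapsed < outage {
        elapsed += interval;
        interval = (interval * 2).min(max_interval);
    }
    // `elapsed` is when the first retry after recovery fires (measured from the
    // first failure); the sled or zpool has gone unused for the extra time
    // beyond the outage itself.
    println!(
        "first retry after recovery at {:?}; extra idle time: {:?}",
        elapsed,
        elapsed - outage
    );
}
```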
That's a good point, but I'm not sure these cases are really different? I think you're saying the value of these requests is higher than most, which is fair, but the reason to use a longer backoff is that the probability of success (for everybody) is low when the system is overloaded. The assumption here is that we're backing off because CockroachDB, Nexus, or the network is too overloaded to successfully complete requests. It doesn't seem like hitting them harder can help. What might help is prioritizing these requests over other requests, and we could totally do that, but even if we did, if we still got a 503 overload response, I think the best action would be to back off (potentially a lot) and retry.

To make it concrete: suppose a customer is adding 100 racks to their deployment, causing thousands of sled agents to slam Nexus/CockroachDB with these requests. They may well be responsible for an overload that compromises the whole control plane. If they start getting 503s, I think they should back off, potentially a lot, just like other Nexus/CockroachDB consumers. If we imagine it's not that they're adding 100 racks but that they're just power cycling them, and the same internal requests are triggered in that case, then we have the same result: yes, we may need those systems back online to restore control plane capacity, but we need to throttle the load on the surviving sleds in order to make sure we're completing any requests usefully. (And prioritization of these requests may help if we think that's the best way to get the system back online.) I think we're a long way, in terms of scale, from running into these issues, but I don't think it's any more correct in the short term to use a more aggressive backoff?
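Rough numbers for the thundering-herd scenario (all figures are illustrative assumptions: 100 racks of 32 sleds, one registration request each, 2x multiplier, no jitter):

```rust
// Rough load comparison between fixed one-second retries and exponential
// backoff, for many sleds registering at once. All numbers are illustrative.
fn main() {
    let sleds: u32 = 100 * 32;

    // Fixed 1-second retries: sustained load until the overload clears.
    println!("fixed 1s retry: ~{} requests/sec, indefinitely", sleds);

    // Exponential backoff (250ms initial, 2x multiplier, no jitter): count how
    // many attempts each sled makes before its cumulative delay passes an hour.
    let mut attempts = 0u32;
    let mut delay_ms = 250u64;
    let mut elapsed_ms = 0u64;
    while elapsed_ms < 60 * 60 * 1000 {
        attempts += 1;
        elapsed_ms += delay_ms;
        delay_ms = (delay_ms * 2).min(60 * 60 * 1000);
    }
    println!(
        "exponential backoff: ~{} requests per sled in the first hour (~{} total, front-loaded)",
        attempts,
        attempts * sleds
    );
}
```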
I added #1999 as a follow-up. I'm using the policy called `retry_policy_internal_service_aggressive` for the notification cases discussed above.
Follow-up to #1996 (#1999):
- Adds some docs
- `retry_policy_long` -> `retry_policy_internal_service`
- `retry_policy_short` -> `retry_policy_local`
- Adds a new policy, `retry_policy_internal_service_aggressive`. This backoff policy eventually uses timeout intervals similar to `retry_policy_internal_service` (to avoid overloading the target service), but it backs off more slowly, which avoids stalls for critical service updates (e.g., Sled Agent notifying Nexus about new hardware or faults).
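An illustrative way to see how an "aggressive" variant can share the same eventual cap while staying at short intervals much longer, by growing more slowly (the multipliers here are made up; see #1999 for the actual parameters):

```rust
use std::time::Duration;

/// Illustrative only: same 1-hour cap for both policies, but a smaller
/// multiplier means many more quick retries before the intervals get long, so
/// urgent notifications (new sleds, disks, faults) aren't stalled.
fn print_schedule(name: &str, initial: Duration, multiplier: f64) {
    let cap = Duration::from_secs(60 * 60);
    let mut interval = initial;
    let mut elapsed = Duration::ZERO;
    let mut attempts = 0u32;
    while interval < cap {
        attempts += 1;
        elapsed += interval;
        interval = interval.mul_f64(multiplier).min(cap);
    }
    println!(
        "{name}: {attempts} attempts before hitting the {:?} cap (cumulative {:?})",
        cap, elapsed
    );
}

fn main() {
    print_schedule("internal_service (2.0x)", Duration::from_millis(250), 2.0);
    print_schedule("internal_service_aggressive (1.2x)", Duration::from_millis(250), 1.2);
}
```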
This PR (#1996):
- Changes `sled-agent` to use the "short" variant
- Changes `nexus` to use the "long" variant
- Renames `internal_service_policy` to `retry_policy_{short,long}`, as these policies are not exclusively used when accessing internal services.

Fixes #1270
Fixes #1991 (hopefully)