Conversation

@smklein (Collaborator) commented Nov 30, 2022

  • Provides options of a "short" and "long" backoff retry policy. These are arbitrary, but so are all timeouts, and they can help us start to consider "what kind of timeout does this call need".
  • Updated most locally-querying calls inside sled-agent to use the "short" variant
  • Updated most internal-service-querying calls inside nexus to use the "long" variant
  • Renamed internal_service_policy to retry_policy_{short,long}, as these policies are not exclusively used when accessing internal services.

Fixes #1270
Fixes #1991 (hopefully)
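
For reference, here is a minimal sketch (not the PR's exact implementation) of how the two policies might be built with the backoff crate's ExponentialBackoffBuilder. The short policy's bounds below are assumptions; the long policy's constants match the diff hunk quoted next:

use std::time::Duration;

// Sketch only: quick retries with a small ceiling, for sled-local queries.
pub fn retry_policy_short() -> ::backoff::ExponentialBackoff {
    ::backoff::ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(50)) // assumed
        .with_max_interval(Duration::from_secs(1)) // assumed
        .with_max_elapsed_time(None) // keep retrying until the caller gives up
        .build()
}

// Sketch only: slow ramp to a one-hour ceiling, for internal-service calls.
pub fn retry_policy_long() -> ::backoff::ExponentialBackoff {
    ::backoff::ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(250))
        .with_max_interval(Duration::from_secs(60 * 60))
        .with_max_elapsed_time(None)
        .build()
}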

/// for a relatively long amount of time.
pub fn retry_policy_long() -> ::backoff::ExponentialBackoff {
    const INITIAL_INTERVAL: Duration = Duration::from_millis(250);
    const MAX_INTERVAL: Duration = Duration::from_secs(60 * 60);
A contributor commented on this hunk:

At risk of invoking Clulow's lament - while we're fiddling with this, is maxing out at an hour reasonable? That seems quite long to me, but it's entirely possible my intuition here is wrong. Also seems fine to leave this alone for now and tune later when we have more experience to decide. (It looks like the majority of call sites are using _short() though, so maybe this is not super relevant either way, if the handful of _long() calls really are of the "as long as we get to it eventually" case.)

A collaborator replied:

I remember asking @davepacheco the same question. If we get to that point, it means that the service has already failed to respond within the cumulative time we've waited thus far. That's on the order of an hour anyway, since the policy is exponential. At that point, it seems like waiting another hour is reasonable.
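
As a rough illustration of that cumulative wait (assuming the 250ms initial interval and one-hour cap from the hunk above, the backoff crate's default 1.5 multiplier, and no jitter):

fn main() {
    // Assumed parameters, for illustration only.
    let mut interval_secs = 0.25_f64; // 250ms initial interval
    let cap_secs = 60.0 * 60.0; // one-hour max interval
    let mut cumulative_secs = 0.0_f64;
    while interval_secs < cap_secs {
        cumulative_secs += interval_secs;
        interval_secs *= 1.5; // the backoff crate's default multiplier
    }
    // Prints roughly 140 minutes: by the time the interval has grown to the
    // one-hour cap, the caller has already spent a couple of hours retrying.
    println!("~{:.0} minutes elapsed before reaching the cap", cumulative_secs / 60.0);
}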

@smklein merged commit eebdd20 into main on Nov 30, 2022.
@smklein deleted the backoff branch on November 30, 2022 at 22:23.

@davepacheco (Collaborator) commented:
Thanks for fixing this! I have some specific suggestions and then I'll say more about why.

  • I'd expect us to use the "long" policy for most of the Sled Agent calls that were changed here, especially stuff that reaches off-box, and especially stuff that reaches out to Nexus [and so potentially CockroachDB]. Exceptions: I'm not sure about RSS or DDM. I get why we might want to use the "short" one for that SMF case where we're waiting for the restarter/state property.
  • I would suggest renaming these:
    • internal_service_policy() vs. local_service_policy() or sled_local_service_policy()?
    • rack_service_policy() vs. local_service_policy() or sled_local_service_policy()? (it's not really "rack", though, it's the scope of the whole control plane)
    • retry_policy_avoid_overload() vs. retry_policy_aggressive()?
    • if we don't like naming these semantically, but want names that describe what they do, then I'd suggest renaming retry_policy_short() to retry_policy_aggressive() and I'm not sure what to call the long one.
  • probably in a follow-on PR: more guidance about backoffs and retries

Since you're changing up the core primitives here (such as they are) I think it's worth digging in a bit. To elaborate on what I wrote in #1270 (maybe this needs an RFD?): I start from these principles: these policies are critical for maintaining availability when the system starts going south. In any given code path, it's super easy to make bad choices about the policy that can cause the system to keel over after otherwise-recoverable events. So we should think about how we can make it easy to do the right thing and harder to do the wrong thing.

To start, I feel like we could use better guidance about (1) when to retry and (2) what kind of policy to use. My take (and I'm not saying this is the only valid take) is that retries are primarily for cases where you've attempted to make a request to an external, loosely-coupled component and either you failed to make the request at all (i.e., a network error or timeout) or the request failed explicitly with a transient error indicating it's offline or overloaded (e.g., 503). This might happen because we're in the middle of initialization and it hasn't been started yet, or it's being upgraded (although hopefully in most cases upgrades won't involve downtime), or because it's having an unexpected outage, or because it's overloaded, or because the network is having an outage, or because the network is overloaded, etc. In most of these cases, it's possible that the caller itself is responsible for the overload that's causing these requests to fail.

This is what internal_service_policy() was intended for. It's got a pretty long max timeout, but as @bnaecker mentioned, we only get there when we've been failing with a transient error for quite some time. In an overload condition, by definition the system needs the value to be that large in order to recover from the overload (or else an earlier request would have succeeded). While the specific values are somewhat arbitrary, I think I originally suggested an hour based on my experience from past outages of production databases, where I remember a 10-minute cap being insufficient in some cases. (Other experience reports welcome!)
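
A minimal sketch of that classification (the error type and the notify-Nexus call here are hypothetical stand-ins, and the policy is just the crate default; Omicron's own retry wrappers may differ):

#[derive(Debug)]
enum CallError {
    Connection(String), // failed to make the request at all
    Status(u16),        // request completed with an HTTP status code
}

// Hypothetical stand-in for a single client call to an internal service.
fn notify_nexus_once() -> Result<(), CallError> {
    Err(CallError::Status(503))
}

fn notify_nexus_with_retry() {
    // Stand-in for the "long"/internal-service policy.
    let policy = ::backoff::ExponentialBackoff::default();
    let result = ::backoff::retry(policy, || match notify_nexus_once() {
        Ok(v) => Ok(v),
        // Couldn't reach the service, or it said it's offline/overloaded:
        // these are the transient cases worth backing off and retrying.
        Err(e @ CallError::Connection(_)) => Err(::backoff::Error::transient(e)),
        Err(e @ CallError::Status(503)) => Err(::backoff::Error::transient(e)),
        // Anything else won't get better on its own; stop retrying.
        Err(e) => Err(::backoff::Error::permanent(e)),
    });
    match result {
        Ok(()) => println!("notified Nexus"),
        Err(_) => eprintln!("gave up: permanent error or retries exhausted"),
    }
}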

For these reasons, I'd expect to use internal_service_policy() (or whatever we rename it to) for:

  • any request to Nexus, CockroachDB, MGS, Dendrite, Clickhouse, or any other non-sled-local service
  • any request from any of these services (which is probably just Nexus) to any other service (including Sled Agent)
  • maybe: to keep things simple, if there's not a compelling reason to do differently, then also for any request from any service to any other service within Omicron

Another common case where we do retry today is when we're waiting for some external, loosely-coupled service to do something, we don't know how long it will take, and we have no way of just directly waiting for that thing to be done. As I've said before, I don't think this is a great use case for a standard "check state, backoff, retry" loop. With the overload case above, exponential backoff makes sense because by definition we're colliding with other callers for a limited resource and we're trying to get out of that situation by spreading the attempts over a longer period. That's not the case if we're just waiting for another service to do some work for us. A long poll wrapped in a retry loop (with no backoff at all) would provide much better latency (which matters a lot in the case of operations like instance provisioning). A proactive call from the other service works too. I can understand for urgency reasons why we might choose to use an exponential retry here, but I think we should be intentional and clear that we're doing this because it's expedient, not because exponential backoff makes sense, and probably keep track of these cases so that we can fix them later.
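
To make the contrast concrete, here's a sketch of the "just waiting for the work to finish" case (the helper names are hypothetical): a plain short-interval poll, or better a long poll, bounds the added latency at one poll period instead of an exponentially grown backoff window.

use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for "ask the other service whether it's done yet".
fn instance_is_running() -> bool {
    true
}

fn wait_for_instance_running() {
    // No exponential backoff: we're not contending for an overloaded resource,
    // so the worst case latency we add is one 100ms poll interval, not a
    // backoff window that has already grown to many seconds.
    while !instance_is_running() {
        sleep(Duration::from_millis(100));
    }
}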

There are cases in the Sled Agent that look similar to both of the above, but where the component we're making requests to is local to the box. You could argue that in this case, there's only so overloaded it could possibly be, and there's no need to back off quite so long. There are potentially still two different cases here: one where requests are failing because of a transient condition that could be an overload (I don't know if there's anything in this bucket) and one where you're just waiting for some other system to do something and don't have a great way to wait directly for it (e.g., waiting for svc.startd to create the "state" property on a newly-created SMF instance).

These are things I've seen happen when code paths choose the wrong policy:

  • if you don't back off when a service is overloaded, or don't do so exponentially, the system may not be able to recover automatically. It'll be offline indefinitely (or until an operator can come in and manually throttle attempts)
  • if you back off too far, a similar thing can happen (e.g., the backend service comes back online, but the client is waiting days to try again, so it's effectively offline until an operator comes in to kick everything)
  • if you back off when a service is not overloaded, but just because you're waiting for it to do some work for you, then latency can suffer quite a lot. After a couple of layers doing this, a user operation that takes 300ms can take multiple seconds

If we accept the above, what can we do to make it easy for people to choose the right policy and hard to choose the wrong one? I think these failure modes are pretty subtle. My experience is that people don't know about these problems until they've either experienced them or heard from someone who has. So if you ask people to pick "short" or "long", they're often going to pick the wrong one.

My original thought was to provide a small number of canned policies for specific use cases: e.g., internal_service_policy() was intended for requests to internal control plane services (e.g., Nexus, Sled Agent, CockroachDB, Clickhouse, etc.). Then we could add external_service_policy() (for requests to services on the customer's network or over the internet) or sled_local_service_policy() for things on the same sled where we might reasonably expect tighter bounds. Another advantage to this approach is that if we decide to update these parameters based on experience or research, we can adjust the implementations accordingly and we'll know that's safe because we know what each caller was intended to do (i.e., contact another service in the control plane). If the functions are just called "short" and "long", we'd have to inspect each one to decide if our changes apply to it. That said, this approach doesn't seem to have been very effective -- maybe because the name internal_service_policy() sounds so generic and because we hadn't implemented any of the others?
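
A sketch of those canned, use-case-named policies (the specific intervals, particularly for the external and sled-local variants, are assumptions):

use std::time::Duration;

fn policy(initial: Duration, max: Duration) -> ::backoff::ExponentialBackoff {
    ::backoff::ExponentialBackoffBuilder::new()
        .with_initial_interval(initial)
        .with_max_interval(max)
        .with_max_elapsed_time(None) // retry until the caller decides to stop
        .build()
}

/// Requests to other control plane services (Nexus, CockroachDB, MGS, ...).
pub fn internal_service_policy() -> ::backoff::ExponentialBackoff {
    policy(Duration::from_millis(250), Duration::from_secs(60 * 60))
}

/// Requests to services on the customer's network or over the internet.
pub fn external_service_policy() -> ::backoff::ExponentialBackoff {
    policy(Duration::from_millis(250), Duration::from_secs(60 * 60)) // assumed bounds
}

/// Requests to something on the same sled, where tighter bounds seem reasonable.
pub fn sled_local_service_policy() -> ::backoff::ExponentialBackoff {
    policy(Duration::from_millis(50), Duration::from_secs(5)) // assumed bounds
}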

If there's general agreement on the above principles, then I'd suggest:

  • Let's add a block comment in backoff.rs summarizing the principles and more specific guidance. I'd be happy to draft this in a follow-on PR. Maybe we could even have a couple of flow charts around: "what retry policy should I use?" or "should I retry this request?"
  • I still think it's a good idea to name these functions by use case rather than by whether they're "short" or "long", for the reason I said above.
  • Reviewing the specific cases I mentioned above

(which is what I said at the top)

I also kind of wish that every caller of retry_notify() had a comment explaining what transient conditions are expected and maybe why the chosen policy makes sense. For a lot of the existing callers, it seems hard for a reader to know if the code is right. Is there some better way to communicate this, maybe one that makes it possible for Rust to tell when the wrong thing's being done?

@smklein (Collaborator, Author) commented Nov 30, 2022

I agree with you that we should settle on names more descriptive than "short" and "long", and I'm happy to take on that follow-up. More precise naming may indicate we need better-defined categories here.

However, I don't totally know if I agree with this take:

For these reasons, I'd expect to use internal_service_policy() (or whatever we rename it to) for:

  • any request to Nexus, CockroachDB, MGS, Dendrite, Clickhouse, or any other non-sled-local service
  • any request from any of these services (which is probably just Nexus) to any other service (including Sled Agent)
  • maybe: to keep things simple, if there's not a compelling reason to do differently, then also for any request from any service to any other service within Omicron

Here are some examples where we're making requests from Sled Agent to Nexus, where the cost of continuing to back off, on the order of minutes to hours, may actually be much more detrimental than the cost of "sending a query per second until we get a response":

In this case, we are notifying Nexus that a new sled exists:

retry_notify(
    retry_policy_short(),
    notify_nexus,
    log_notification_failure,
)
.await
.expect("Expected an infinite retry loop contacting Nexus");

Similarly, in this case, we are notifying Nexus that a new disk (well, zpool) exists:

nexus_notifications.push_back(
    backoff::retry_notify(
        backoff::retry_policy_short(),
        notify_nexus,
        log_post_failure,
    )
    .boxed(),
);

If we allowed the backoff to get that far -- up to minutes, or hours -- then the rack could be severely underutilized, increasing the burden of any existing load on the system. Nexus simply wouldn't be aware of disks or sleds that it could be using, and arguably this would make any ongoing load problems worse. This delay was arguably already manifesting in #1991.

This is part of why I settled on the "short" vs "long" names - these cases made me skeptical that it was right to apply the same retry policy universally based on the target service.

Does that make sense? I know that's not exactly a concrete proposal, but I still think it's a use case we need to consider.

@davepacheco (Collaborator) replied:
That's a good point but I'm not sure these cases are really different? I think you're saying the value of these requests is higher than most, which is fair, but the reason to use a longer backoff is because the probability of success (for everybody) is low when the system is overloaded. The assumption here is that we're backing off because CockroachDB, Nexus, or the network is too overloaded to successfully complete requests. It doesn't seem like hitting them harder can help. What might help is prioritizing these requests over other requests, and we could totally do that, but even if we did, if we still got a 503 overload response, I think the best action would be to backoff (potentially a lot) and retry.

To make it concrete: suppose a customer is adding 100 racks to their deployment, causing thousands of sled agents to slam Nexus/CockroachDB with these requests. They may well be responsible for an overload that compromises the whole control plane. If they start getting 503s, I think they should back off, potentially a lot, just like other Nexus/CockroachDB consumers. If we imagine it's not that they're adding 100 racks, but they're just power cycling them, and if the same internal requests are triggered in that case, then we have the same result: yes, we may need those systems back online to restore control plane capacity, but we need to throttle the load on the surviving sleds in order to make sure we're completing any requests usefully. (And prioritization of these requests may help if we think that's the best way to get the system back online.) I think it's a long way in scale before we run into these issues, but I don't think it's any more correct in the short term to use a more aggressive backoff?

@smklein (Collaborator, Author) commented Dec 1, 2022

I added #1999 as a follow-up.

I'm using the policy called retry_policy_internal_service_aggressive in that PR, which is able to back off as much as you suggested, but uses a smaller multiplier to get there. This seems like a quick, slightly hacky compromise to get what we're both reaching for: we won't stall out as long on the initialization pathway, but we will back off if Nexus is truly overloaded.
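
A sketch of that shape of policy (the initial interval and multiplier here are assumptions; the real values are in #1999): the same one-hour ceiling as the internal-service policy, but reached via a smaller multiplier, so early retries stay frequent:

use std::time::Duration;

pub fn retry_policy_internal_service_aggressive() -> ::backoff::ExponentialBackoff {
    ::backoff::ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(100)) // assumed
        .with_multiplier(1.2) // assumed: smaller than the crate default of 1.5
        .with_max_interval(Duration::from_secs(60 * 60)) // same ceiling as the long policy
        .with_max_elapsed_time(None)
        .build()
}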

smklein added a commit referencing this pull request on Dec 5, 2022 (#1999):

Follow-up to #1996

- Adds some docs
- `retry_policy_long` -> `retry_policy_internal_service`
- `retry_policy_short` -> `retry_policy_local`
- Adds a new policy, `retry_policy_internal_service_aggressive`. This
backoff policy eventually uses timeout intervals similar to
`retry_policy_internal_service` (to avoid overloading the target
service), but it backs off more slowly, which avoids stalls for critical
service updates (e.g., Sled Agent notifying Nexus about new hardware or
faults).
Successfully merging this pull request may close these issues:

  • Flaky issue with helios deploy CI job during "ensuring datasets are ready"
  • Less-exponential alternative to internal_service_policy()?
