Sled agent has a relatively rich set of internal error types, but at the Dropshot HTTP error boundary, it converts almost all of them to 500 errors:
omicron/sled-agent/src/sled_agent.rs
Lines 102 to 137 in cab0925

    // Provide a more specific HTTP error for some sled agent errors.
    impl From<Error> for dropshot::HttpError {
        fn from(err: Error) -> Self {
            match err {
                crate::sled_agent::Error::Instance(instance_manager_error) => {
                    match instance_manager_error {
                        crate::instance_manager::Error::Instance(
                            instance_error,
                        ) => match instance_error {
                            crate::instance::Error::Propolis(propolis_error) => {
                                match propolis_error.status() {
                                    None => HttpError::for_internal_error(
                                        propolis_error.to_string(),
                                    ),
                                    Some(status_code) => {
                                        HttpError::for_status(None, status_code)
                                    }
                                }
                            }
                            crate::instance::Error::Transition(omicron_error) => {
                                // Preserve the status associated with the wrapped
                                // Omicron error so that Nexus will see it in the
                                // Progenitor client error it gets back.
                                HttpError::from(omicron_error)
                            }
                            e => HttpError::for_internal_error(e.to_string()),
                        },
                        e => HttpError::for_internal_error(e.to_string()),
                    }
                }
                e => HttpError::for_internal_error(e.to_string()),
            }
        }
    }

This makes it hard for callers to reason about the cause or permanence of any of these errors. This was a pebble in the shoe of PR #2892 and is coming up again in #3230 (sled agent is making a call to Nexus whose response depends in part on calls back down into sled agent, and there's no way to be sure about the permanence of any errors returned from Nexus in that path because all errors are getting flattened into a single error code).
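To make the caller's problem concrete, here is a minimal sketch of the kind of retry decision a client might want to make from a status code alone. The `Retry` type, the `classify` function, and the specific status-to-decision mapping are all hypothetical and chosen for illustration; none of them come from Omicron.

```rust
/// How a caller might treat a failed request.
/// Hypothetical type for illustration, not part of Omicron.
#[derive(Debug, PartialEq, Eq)]
pub enum Retry {
    /// Retrying the same request may succeed later.
    Worthwhile,
    /// Retrying the same request is not expected to help.
    Pointless,
    /// The status code alone does not say.
    Unknown,
}

/// Classify a raw HTTP status code. The mapping is an assumption made
/// for this sketch; a real client would pick its own policy.
pub fn classify(status: u16) -> Retry {
    match status {
        // Timeouts, rate limiting, and unavailability usually signal a
        // condition that can clear up on its own.
        408 | 429 | 503 | 504 => Retry::Worthwhile,
        // Remaining client errors: the request itself is the problem.
        400..=499 => Retry::Pointless,
        // A generic 500 (or anything else) could be a bug, a dropped
        // connection further down the stack, or something else entirely.
        _ => Retry::Unknown,
    }
}
```

When nearly every failure arrives as a 500, a function like this can only ever answer `Retry::Unknown`, which is the situation described above.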
There are similar paths through Nexus that flatten errors this way that the fix for #3230 will have to take into account:
- Updating instance state can update the instance's Dendrite configuration:
  omicron/nexus/src/app/instance.rs
  Lines 1138 to 1144 in cab0925

      self.instance_ensure_dpd_config(
          opctx,
          db_instance.id(),
          &sled.address(),
          None,
      )
      .await?;

- `instance_ensure_dpd_config` calls `dpd_client.ensure_nat_entry` and maps all errors to 500s:
  omicron/nexus/src/app/instance.rs
  Lines 1243 to 1259 in cab0925

          dpd_client
              .ensure_nat_entry(
                  &log,
                  target_ip.ip,
                  dpd_client::types::MacAddr { a: mac_address.into_array() },
                  *target_ip.first_port,
                  *target_ip.last_port,
                  vni,
                  sled_ip_address.ip(),
              )
              .await
              .map_err(|e| {
                  Error::internal_error(&format!(
                      "failed to ensure dpd entry: {e}"
                  ))
              })?;
      }

- `ensure_nat_entry` calls Dendrite daemon routines that can (I presume) produce transient failures (e.g. transient communication failures), e.g.:
  Lines 111 to 127 in cab0925

              self.nat_ipv4_create(
                  &network.ip(),
                  target_first_port,
                  target_last_port,
                  &nat_target,
              )
              .await
          }
          ipnetwork::IpNetwork::V6(network) => {
              self.nat_ipv6_create(
                  &network.ip(),
                  target_first_port,
                  target_last_port,
                  &nat_target,
              )
              .await
          }

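One possible alternative to the blanket `map_err` in this path is to look at what kind of failure the client reported before choosing an error. The sketch below uses stand-in types: `ClientError`, `ApiError`, and `map_dpd_error` are hypothetical names, and the real generated client error and Omicron error types have different shapes and more variants.

```rust
/// Stand-in for a generated client's error type.
/// Hypothetical: the real client error has more variants.
#[derive(Debug)]
pub enum ClientError {
    /// The request never got a response (connection refused, timeout, ...).
    Communication(String),
    /// The server answered with an error status.
    ErrorResponse { status: u16, message: String },
}

/// Stand-in for the error type handed back to the caller (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum ApiError {
    /// The dependency was unreachable or busy; the caller may retry.
    ServiceUnavailable(String),
    /// Something unexpected happened; retrying is not known to help.
    InternalError(String),
}

/// Map a client failure to an error that keeps the transient/permanent
/// distinction instead of flattening everything to an internal error.
pub fn map_dpd_error(e: ClientError) -> ApiError {
    match e {
        // No response at all: most likely a transient communication failure.
        ClientError::Communication(msg) => {
            ApiError::ServiceUnavailable(format!("failed to reach dpd: {msg}"))
        }
        // The daemon itself said it was unavailable: pass that along.
        ClientError::ErrorResponse { status: 503, message } => {
            ApiError::ServiceUnavailable(format!("dpd unavailable: {message}"))
        }
        // Any other error response is treated as unexpected in this sketch.
        ClientError::ErrorResponse { status, message } => {
            ApiError::InternalError(format!("dpd returned {status}: {message}"))
        }
    }
}
```

Whether other statuses from the daemon deserve their own treatment is a policy question this sketch does not try to settle.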
Revisiting absolutely every single error conversion in sled agent and Nexus all at once is probably a non-starter, but we can likely start by improving sled agent's internal-error-to-Dropshot-error conversion and look for opportunities to remove other error-flattening cases as they arise.
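As a rough illustration of what an improved conversion could look like, the sketch below maps each variant of an error enum to its own status code and avoids a catch-all arm. The `InstanceError` variants, the `status_for` function, and the chosen status codes are hypothetical; sled agent's real error types are different, and a real implementation would build a `dropshot::HttpError` rather than return a bare number.

```rust
/// Hypothetical, simplified error type standing in for sled agent's
/// internal errors.
#[derive(Debug)]
pub enum InstanceError {
    /// No such instance on this sled.
    NotFound,
    /// The hypervisor backend could not be reached.
    BackendUnreachable,
    /// The requested state change is not valid from the current state.
    InvalidTransition,
    /// Anything that indicates a bug or an unexpected condition.
    Internal(String),
}

/// Pick an HTTP status for each error variant.
pub fn status_for(err: &InstanceError) -> u16 {
    // There is deliberately no wildcard arm: adding a new variant fails
    // to compile until someone decides which status it deserves.
    match err {
        InstanceError::NotFound => 404,
        InstanceError::BackendUnreachable => 503,
        InstanceError::InvalidTransition => 409,
        InstanceError::Internal(_) => 500,
    }
}
```

The point of the exhaustive match is that the `e => HttpError::for_internal_error(...)` fallbacks are where flattening happens silently; removing them turns each new error case into an explicit decision.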