sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238

@gjcolombo

Sled agent has a relatively rich set of internal error types, but at the Dropshot HTTP error boundary, it converts almost all of them to 500 errors:

// Provide a more specific HTTP error for some sled agent errors.
impl From<Error> for dropshot::HttpError {
    fn from(err: Error) -> Self {
        match err {
            crate::sled_agent::Error::Instance(instance_manager_error) => {
                match instance_manager_error {
                    crate::instance_manager::Error::Instance(
                        instance_error,
                    ) => match instance_error {
                        crate::instance::Error::Propolis(propolis_error) => {
                            match propolis_error.status() {
                                None => HttpError::for_internal_error(
                                    propolis_error.to_string(),
                                ),
                                Some(status_code) => {
                                    HttpError::for_status(None, status_code)
                                }
                            }
                        }
                        crate::instance::Error::Transition(omicron_error) => {
                            // Preserve the status associated with the wrapped
                            // Omicron error so that Nexus will see it in the
                            // Progenitor client error it gets back.
                            HttpError::from(omicron_error)
                        }
                        e => HttpError::for_internal_error(e.to_string()),
                    },
                    e => HttpError::for_internal_error(e.to_string()),
                }
            }
            e => HttpError::for_internal_error(e.to_string()),
        }
    }
}

This makes it hard for callers to reason about the cause or permanence of any of these errors. This was a pebble in the shoe of PR #2892 and is coming up again in #3230: sled agent is making a call to Nexus whose response depends in part on calls back down into sled agent, and because every error is flattened into a single error code, there's no way to be sure whether an error returned from Nexus in that path is permanent.
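
To make the caller's problem concrete, here is a minimal sketch of the kind of decision a caller would like to make from a status code. The `Permanence` enum, the `classify` function, and the particular status-code policy are all hypothetical, invented for illustration; none of them exist in Omicron.

```rust
/// Hypothetical classification a caller might want; not an Omicron type.
#[derive(Debug, PartialEq, Eq)]
enum Permanence {
    /// Retrying the same request will not help.
    Permanent,
    /// Retrying later might succeed.
    Transient,
    /// The status code carries no usable signal.
    Unknown,
}

/// Illustrative policy only: which codes count as transient is an assumption.
fn classify(status: u16) -> Permanence {
    match status {
        // Throttling and "try again later" style responses.
        429 | 503 | 504 => Permanence::Transient,
        // Remaining client errors: the request itself is the problem.
        400..=499 => Permanence::Permanent,
        // Everything else, including a flattened 500, tells the caller nothing.
        _ => Permanence::Unknown,
    }
}
```

Under this policy, every error that has been flattened to 500 lands in `Unknown`, which is exactly the situation described above.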

Similar paths through Nexus flatten errors this way, and the fix for #3230 will have to take them into account:

  • Updating instance state can update the instance's Dendrite configuration:
    self.instance_ensure_dpd_config(
        opctx,
        db_instance.id(),
        &sled.address(),
        None,
    )
    .await?;
  • instance_ensure_dpd_config calls dpd_client.ensure_nat_entry and maps all errors to 500s:
    dpd_client
        .ensure_nat_entry(
            &log,
            target_ip.ip,
            dpd_client::types::MacAddr { a: mac_address.into_array() },
            *target_ip.first_port,
            *target_ip.last_port,
            vni,
            sled_ip_address.ip(),
        )
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
                "failed to ensure dpd entry: {e}"
            ))
        })?;
    }
  • ensure_nat_entry calls Dendrite daemon routines that can (I presume) fail transiently, for instance on a communication failure:
        self.nat_ipv4_create(
            &network.ip(),
            target_first_port,
            target_last_port,
            &nat_target,
        )
        .await
    }
    ipnetwork::IpNetwork::V6(network) => {
        self.nat_ipv6_create(
            &network.ip(),
            target_first_port,
            target_last_port,
            &nat_target,
        )
        .await
    }
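
As a rough sketch of what a less lossy `map_err` in a path like this could look like, the following keeps the upstream status when there is one and reports a communication failure as 503 instead of 500. `ClientError`, `ApiError`, and `map_dpd_error` are hypothetical stand-ins written for this example; they are not the real Dendrite client or Omicron error types, and the choice of 503 is an assumption.

```rust
/// Hypothetical stand-in for a generated client's error type.
#[derive(Debug)]
enum ClientError {
    /// The request never got a response (connection reset, timeout, ...).
    Communication(String),
    /// The server answered with an error status.
    ErrorResponse { status: u16, message: String },
}

/// Hypothetical stand-in for the error type returned to Nexus's own callers.
#[derive(Debug, PartialEq, Eq)]
struct ApiError {
    status: u16,
    message: String,
}

/// Keep whatever signal the client error carries instead of forcing 500.
fn map_dpd_error(e: ClientError) -> ApiError {
    match e {
        // Assumption: a communication failure is worth retrying, so use 503.
        ClientError::Communication(msg) => ApiError {
            status: 503,
            message: format!("failed to ensure dpd entry: {msg}"),
        },
        // Pass the upstream status through unchanged.
        ClientError::ErrorResponse { status, message } => ApiError {
            status,
            message: format!("failed to ensure dpd entry: {message}"),
        },
    }
}

/// Usage: the same shape as the `.map_err(...)` call in the list above.
fn ensure(result: Result<(), ClientError>) -> Result<(), ApiError> {
    result.map_err(map_dpd_error)
}
```

Whether passing an upstream status straight through is always appropriate is a separate question; the point of the sketch is only that the caller gets more than one code to work with.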

Revisiting absolutely every single error conversion in sled agent and Nexus all at once is probably a non-starter, but we can likely start by improving sled agent's internal-error-to-Dropshot-error conversion and look for opportunities to remove other error-flattening cases as they arise.
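
One possible shape for that incremental approach, sketched with made-up types: give specific statuses to the variants that have been audited and leave a catch-all 500 for the rest. `AgentError`, its variants, and `HttpErr` are hypothetical and do not correspond to sled agent's actual error enums or to `dropshot::HttpError`.

```rust
/// Hypothetical stand-in for an HTTP error; not dropshot's `HttpError`.
#[derive(Debug, PartialEq, Eq)]
struct HttpErr {
    status: u16,
    message: String,
}

/// Hypothetical internal error set, invented for this sketch.
#[derive(Debug)]
enum AgentError {
    /// The requested instance is not present on this sled.
    NoSuchInstance(String),
    /// A dependency is temporarily unreachable.
    DependencyUnavailable(String),
    /// A variant nobody has audited yet.
    Other(String),
}

impl From<AgentError> for HttpErr {
    fn from(err: AgentError) -> Self {
        match err {
            // Audited variants get a status the caller can act on.
            AgentError::NoSuchInstance(id) => HttpErr {
                status: 404,
                message: format!("no such instance: {id}"),
            },
            AgentError::DependencyUnavailable(what) => HttpErr {
                status: 503,
                message: format!("{what} is unavailable"),
            },
            // Everything else still flattens to 500 until it is revisited.
            other => HttpErr {
                status: 500,
                message: format!("{other:?}"),
            },
        }
    }
}
```

New arms can be added above the catch-all one at a time, so each conversion can be reviewed on its own.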

Labels

  • Debugging: For when you want better data in debugging an issue (log messages, post mortem debugging, and more)
  • Sled Agent: Related to the Per-Sled Configuration and Management
  • nexus: Related to nexus