sled agent and Nexus frequently flatten errors into 500 Internal Server Error

Sled agent has a relatively rich set of internal error types, but at the Dropshot HTTP error boundary, it converts almost all of them to 500 errors:

https://github.com/oxidecomputer/omicron/blob/cab09253e6e6e27761d1ef99f8087f86ccad2e8b/sled-agent/src/sled_agent.rs#L102-L137

This makes it hard for callers to reason about the cause or permanence of any of these errors. This was a pebble in the shoe of PR #2892 and is coming up again in #3230: sled agent makes a call to Nexus whose response depends in part on calls back down into sled agent, and because every error is flattened into a single status code, there is no way to be sure about the permanence of any error Nexus returns on that path.
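To make the caller's problem concrete, here is a minimal sketch of how a client might decide whether to retry based on the status code alone. The `Permanence` type, the `classify` function, and the policy itself are assumptions for illustration; none of them exist in Omicron.

```rust
/// How a caller might interpret a failed request (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Permanence {
    /// Retrying the same request may succeed.
    Transient,
    /// Retrying the same request will not help.
    Permanent,
    /// The status code does not say either way.
    Unknown,
}

/// Classify an HTTP status code. The policy here is an assumption for
/// illustration, not the policy Nexus or sled agent actually uses.
fn classify(status: u16) -> Permanence {
    match status {
        // Timeouts, throttling, and unavailability usually clear up.
        408 | 429 | 503 | 504 => Permanence::Transient,
        // A flattened 500 could hide either kind of failure.
        500 => Permanence::Unknown,
        // Other client errors describe the request itself.
        400..=499 => Permanence::Permanent,
        // Anything else is left undecided in this sketch.
        _ => Permanence::Unknown,
    }
}
```

Under this policy `classify(404)` is `Permanent` and `classify(503)` is `Transient`, but once everything arrives as a 500 the caller only ever sees `Unknown`.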

Similar paths through Nexus flatten errors the same way, and the fix for #3230 will have to take them into account:

- Updating instance state can update the instance's Dendrite configuration: https://github.com/oxidecomputer/omicron/blob/cab09253e6e6e27761d1ef99f8087f86ccad2e8b/nexus/src/app/instance.rs#L1138-L1144
- `instance_ensure_dpd_config` calls `dpd_client.ensure_nat_entry` and maps all errors to 500s: https://github.com/oxidecomputer/omicron/blob/cab09253e6e6e27761d1ef99f8087f86ccad2e8b/nexus/src/app/instance.rs#L1243-L1259
- `ensure_nat_entry` calls Dendrite daemon routines that can (I presume) fail transiently, e.g. on a communication failure: https://github.com/oxidecomputer/omicron/blob/cab09253e6e6e27761d1ef99f8087f86ccad2e8b/dpd-client/src/lib.rs#L111-L127
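One possible alternative to the blanket `map_err` in this path is a conversion that keeps whatever the downstream error says about its cause. The sketch below uses stand-in types: `ClientError`, `ApiError`, and `to_api_error` are hypothetical, and the real dpd-client and Omicron error types may be shaped differently.

```rust
/// Stand-in for a generated client's error type (hypothetical; the real
/// dpd-client error type may look different).
#[derive(Debug)]
enum ClientError {
    /// The request never got a response (connection refused, timeout, ...).
    Communication(String),
    /// The server answered with an error status.
    ErrorResponse { status: u16, message: String },
}

/// Stand-in for an HTTP-facing error that carries its own status code.
#[derive(Debug, PartialEq, Eq)]
struct ApiError {
    status: u16,
    message: String,
}

/// Convert a client error without discarding what it says about the failure.
fn to_api_error(err: ClientError) -> ApiError {
    match err {
        // No response at all: report the dependency as unavailable so that
        // callers can tell this apart from a bug.
        ClientError::Communication(msg) => ApiError {
            status: 503,
            message: format!("failed to reach dpd: {msg}"),
        },
        // The server chose a status: pass it through unchanged.
        ClientError::ErrorResponse { status, message } => ApiError {
            status,
            message: format!("failed to ensure dpd entry: {message}"),
        },
    }
}
```

Passing a downstream status through unchanged is itself a judgment call, since a 404 from a dependency does not necessarily mean 404 to the original caller. A real conversion would probably need to decide this per error rather than wholesale.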

Revisiting absolutely every single error conversion in sled agent and Nexus all at once is probably a non-starter, but we can likely start by improving sled agent's internal-error-to-Dropshot-error conversion and look for opportunities to remove other error-flattening cases as they arise.

	// Provide a more specific HTTP error for some sled agent errors.
	impl From<Error> for dropshot::HttpError {
	    fn from(err: Error) -> Self {
	        match err {
	            crate::sled_agent::Error::Instance(instance_manager_error) => {
	                match instance_manager_error {
	                    crate::instance_manager::Error::Instance(
	                        instance_error,
	                    ) => match instance_error {
	                        crate::instance::Error::Propolis(propolis_error) => {
	                            match propolis_error.status() {
	                                None => HttpError::for_internal_error(
	                                    propolis_error.to_string(),
	                                ),

	                                Some(status_code) => {
	                                    HttpError::for_status(None, status_code)
	                                }
	                            }
	                        }
	                        crate::instance::Error::Transition(omicron_error) => {
	                            // Preserve the status associated with the wrapped
	                            // Omicron error so that Nexus will see it in the
	                            // Progenitor client error it gets back.
	                            HttpError::from(omicron_error)
	                        }
	                        e => HttpError::for_internal_error(e.to_string()),
	                    },
	                    e => HttpError::for_internal_error(e.to_string()),
	                }
	            }

	            e => HttpError::for_internal_error(e.to_string()),
	        }
	    }
	}

	self.instance_ensure_dpd_config(
	    opctx,
	    db_instance.id(),
	    &sled.address(),
	    None,
	)
	.await?;

	    dpd_client
	        .ensure_nat_entry(
	            &log,
	            target_ip.ip,
	            dpd_client::types::MacAddr { a: mac_address.into_array() },
	            *target_ip.first_port,
	            *target_ip.last_port,
	            vni,
	            sled_ip_address.ip(),
	        )
	        .await
	        .map_err(|e| {
	            Error::internal_error(&format!(
	                "failed to ensure dpd entry: {e}"
	            ))
	        })?;
	}

	    self.nat_ipv4_create(
	        &network.ip(),
	        target_first_port,
	        target_last_port,
	        &nat_target,
	    )
	    .await
	}
	ipnetwork::IpNetwork::V6(network) => {
	    self.nat_ipv6_create(
	        &network.ip(),
	        target_first_port,
	        target_last_port,
	        &nat_target,
	    )
	    .await
	}

sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238
