Instance start does not independently ensure v2p/dpd state exists #3813

@gjcolombo

Repro steps:

  1. start rack
  2. create and start instance
  3. stop instance
  4. power rack down
  5. cold boot the rack
  6. restart the instance

Expected: instances have appropriate internal and external connectivity
Observed: they don't; closer examination shows the relevant Dendrite NAT mappings are missing (V2P mappings might be missing as well, though I'm not sure)

Workaround: start the instance, stop it, and start it again; connectivity should be restored on the second boot.


The reason this happens is that Nexus's instance_start function doesn't actually try to establish any of these mappings:

```rust
/// Make sure the given Instance is running.
pub async fn instance_start(
    &self,
    opctx: &OpContext,
    instance_lookup: &lookup::Instance<'_>,
) -> UpdateResult<db::model::Instance> {
    // TODO(#2824): This needs to be a saga for crash resiliency
    // purposes (otherwise the instance can be leaked if Nexus crashes
    // between registration and instance start).
    let (.., authz_instance, mut db_instance) =
        instance_lookup.fetch().await?;

    // The instance is not really being "created" (it already exists from
    // the caller's perspective), but if it does not exist on its sled, the
    // target sled agent will populate its instance manager with the
    // contents of this modified record, and that record needs to allow a
    // transition to the Starting state.
    //
    // If the instance does exist on this sled, this initial runtime state
    // is ignored.
    let initial_runtime = nexus_db_model::InstanceRuntimeState {
        state: nexus_db_model::InstanceState(InstanceState::Creating),
        ..db_instance.runtime_state
    };
    db_instance.runtime_state = initial_runtime;

    self.instance_ensure_registered(opctx, &authz_instance, &db_instance)
        .await?;

    self.instance_request_state(
        opctx,
        &authz_instance,
        &db_instance,
        InstanceStateRequested::Running,
    )
    .await?;

    self.db_datastore.instance_refetch(opctx, &authz_instance).await
}
```

The mappings also aren't removed when the instance is stopped. This means that, before introducing cold boot into the picture, we've been able to rely on having programmed the correct networking state for any non-deleted instance--the state is created once when the instance is created and deleted only when the instance goes away.

Stopping and restarting the instance works around the problem because instance destruction triggers a Propolis generation change in the sled agent via a call to clear the instance's current migration IDs:

```rust
/// Unconditionally clears the instance's migration IDs and advances its
/// Propolis generation. Not public; used internally to conclude migrations.
fn clear_migration_ids(&mut self) {
    self.current.migration_id = None;
    self.current.dst_propolis_id = None;
    self.current.propolis_gen = self.current.propolis_gen.next();
}
```

Nexus assumes that any Propolis generation change is a sign that an instance might have moved between sleds, and uses it as an opportunity to propagate all the correct networking state:

```rust
async fn handle_instance_propolis_gen_change(
    &self,
    opctx: &OpContext,
    new_runtime: &nexus::InstanceRuntimeState,
    db_instance: &nexus_db_model::Instance,
) -> Result<(), Error> {
    let log = &self.log;
    let instance_id = db_instance.id();

    info!(log,
          "updating configuration after Propolis generation change";
          "instance_id" => %instance_id,
          "new_sled_id" => %new_runtime.sled_id,
          "old_sled_id" => %db_instance.runtime().sled_id);

    // Push updated V2P mappings to all interested sleds. This needs to be
    // done irrespective of whether the sled ID actually changed, because
    // merely creating the target Propolis on the target sled will create
    // XDE devices for its NICs, and creating an XDE device for a virtual IP
    // creates a V2P mapping that maps that IP to that sled. This is fine if
    // migration succeeded, but if it failed, the instance is running on the
    // source sled, and the incorrect mapping needs to be replaced.
    //
    // TODO(#3107): When XDE no longer creates mappings implicitly, this
    // can be restricted to cases where an instance's sled has actually
    // changed.
    self.create_instance_v2p_mappings(
        opctx,
        instance_id,
        new_runtime.sled_id,
    )
    .await?;

    let (.., sled) = LookupPath::new(opctx, &self.db_datastore)
        .sled_id(new_runtime.sled_id)
        .fetch()
        .await?;

    let boundary_switches =
        self.boundary_switches(&self.opctx_alloc).await?;

    for switch in &boundary_switches {
        let dpd_client = self.dpd_clients.get(switch).ok_or_else(|| {
            Error::internal_error(&format!(
                "could not find dpd client for {switch}"
            ))
        })?;

        self.instance_ensure_dpd_config(
            opctx,
            db_instance.id(),
            &sled.address(),
            None,
            dpd_client,
        )
        .await?;
    }

    Ok(())
}
```


This should be fixed by having instance_start ensure that all the correct mappings are created on the appropriate sleds--i.e. it needs to call create_instance_v2p_mappings and instance_ensure_dpd_config on instance start instead of hoping the right things happen when the instance stops. Related (once again) are #2315 and #2824--this behavior will become especially important when an instance can go from having a sled assignment to not having a sled assignment to having a new, different sled assignment after a start-stop-start sequence. When networking state updates are propagated through RPWs, the start sequence can just trigger the relevant RPWs instead of doing the work itself.
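For concreteness, here is a minimal sketch of what the naive version might look like, dropped into instance_start ahead of the instance_request_state call and reusing the call signatures from handle_instance_propolis_gen_change above (the exact placement and argument plumbing may well need to differ):

```rust
// Sketch: reuse the V2P/DPD ensure calls shown above so that starting an
// instance re-establishes its networking state instead of assuming it
// still exists from instance creation.
let sled_id = db_instance.runtime().sled_id;

// Ensure V2P mappings exist for this instance's current sled assignment.
self.create_instance_v2p_mappings(opctx, db_instance.id(), sled_id)
    .await?;

// Ensure the NAT entries exist on every boundary switch.
let (.., sled) = LookupPath::new(opctx, &self.db_datastore)
    .sled_id(sled_id)
    .fetch()
    .await?;
let boundary_switches =
    self.boundary_switches(&self.opctx_alloc).await?;
for switch in &boundary_switches {
    let dpd_client = self.dpd_clients.get(switch).ok_or_else(|| {
        Error::internal_error(&format!(
            "could not find dpd client for {switch}"
        ))
    })?;
    self.instance_ensure_dpd_config(
        opctx,
        db_instance.id(),
        &sled.address(),
        None,
        dpd_client,
    )
    .await?;
}
```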

Naively dropping the configuration calls into instance_start almost works, but there's a lurking race condition because of the way the "Starting" state gets applied:

  1. T1 begins to start the instance
  2. T2 begins to delete the instance & starts the deletion saga
  3. T2's saga deletes the instance's mappings
  4. T2's saga marks the instance as deleted
  5. T1 recreates the mappings
  6. T1 actually tries to start the instance and fails because it has been deleted

To make this safe, we need to ensure the instance has transitioned from Stopped to Starting in instance_start to block any subsequent deletion attempts before we actually program any networking state. That's more complicated than it sounds because Nexus currently relies on sled agent to push the instance to the Starting state--we'll need some code up in Nexus to help manage that state transition.
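To sketch the ordering (not the mechanism), something like the following is needed; instance_set_starting is a hypothetical datastore helper, not an existing Nexus method, standing in for whatever Nexus-side state management gets added to perform a generation-guarded Stopped-to-Starting transition:

```rust
// Hypothetical ordering sketch; instance_set_starting does not exist today.
let (.., authz_instance, db_instance) = instance_lookup.fetch().await?;

// 1. Move the instance to Starting in CRDB first. Once this commits, a
//    concurrent delete sees a Starting instance and is refused, so the
//    mappings programmed below can't be orphaned by the race above.
let db_instance = self
    .db_datastore
    .instance_set_starting(opctx, &authz_instance, &db_instance)
    .await?;

// 2. Only now program V2P mappings and Dendrite NAT entries for the
//    instance's sled (as in the sketch above).

// 3. Finally, register the instance with its sled agent and request the
//    Running state, as instance_start does today.
```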

Labels: bug, known issue, nexus