Instances' sled assignments won't change if an instance is stopped and restarted #2315

Closed
gjcolombo opened this issue Feb 2, 2023 · 1 comment · Fixed by #4194
Labels
nexus Related to nexus

Comments

@gjcolombo
Contributor

The Nexus external API's "instance start" command calls `Nexus::instance_start`, which passes through to `Nexus::instance_set_runtime`:

```rust
let instance = nexus.instance_start(&opctx, &instance_lookup).await?;
```

```rust
/// Make sure the given Instance is running.
pub async fn instance_start(
    &self,
    opctx: &OpContext,
    instance_lookup: &lookup::Instance<'_>,
) -> UpdateResult<db::model::Instance> {
    let (.., authz_instance, db_instance) = instance_lookup.fetch().await?;
    let requested = InstanceRuntimeStateRequested {
        run_state: InstanceStateRequested::Running,
        migration_params: None,
    };
    self.instance_set_runtime(
        opctx,
        &authz_instance,
        &db_instance,
        requested,
    )
    .await?;
    self.db_datastore.instance_refetch(opctx, &authz_instance).await
}
```

If I'm reading things right, `instance_set_runtime` selects the sled that receives the runtime state update by looking at the instance record in CRDB, without regard for the instance's current state:

```rust
let sa = self.instance_sled(&db_instance).await?;
let instance_put_result = sa
    .instance_put(
        &db_instance.id(),
        &sled_agent_client::types::InstanceEnsureBody {
            initial: instance_hardware,
            target: requested.clone(),
            migrate: None,
        },
    )
    .await;
```

This seems like the right thing to do if the instance is already incarnated on a sled somewhere. But if the instance is stopped and doesn't exist on any sled, this will try to create the instance on the sled on which it most recently ran, which might not have capacity for it (even though some other sled might). This function should distinguish the "instance already incarnated" and "instance stopped" cases and select a new sled in the latter case.
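
To make the proposed distinction concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical and simplified for illustration: `InstanceState`, `Sled`, `Instance`, and `select_sled_for_start` are not Nexus types or functions, and "capacity" is reduced to a free-CPU count. The only point is that a stopped instance should go through sled selection instead of reusing the sled recorded in its database row.

```rust
// Hypothetical, simplified types for illustration only; these are not the
// real Nexus or sled-agent types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Sled {
    id: u32,
    free_cpus: u32,
}

#[derive(Debug)]
struct Instance {
    state: InstanceState,
    // Sled recorded in the instance's database record (where it last ran).
    last_sled_id: u32,
    cpus: u32,
}

/// Pick the sled that should receive the "start" request.
/// Returns `None` if the instance is stopped and no sled has room for it.
fn select_sled_for_start(instance: &Instance, sleds: &[Sled]) -> Option<u32> {
    match instance.state {
        // Already incarnated: keep talking to the sled that hosts it.
        InstanceState::Running => Some(instance.last_sled_id),
        // Not on any sled: pick the first sled with enough capacity,
        // ignoring where the instance ran previously.
        InstanceState::Stopped => sleds
            .iter()
            .find(|sled| sled.free_cpus >= instance.cpus)
            .map(|sled| sled.id),
    }
}

fn main() {
    let sleds = vec![
        Sled { id: 1, free_cpus: 0 },
        Sled { id: 2, free_cpus: 8 },
    ];
    let stopped = Instance {
        state: InstanceState::Stopped,
        last_sled_id: 1,
        cpus: 4,
    };
    println!("{:?}", select_sled_for_start(&stopped, &sleds));
}
```

In this sketch the stopped instance last ran on sled 1, which is full, so selection falls through to sled 2. A real implementation would presumably also need to reserve the chosen capacity atomically, which this example does not attempt.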

@gjcolombo
Contributor Author

I suspect fixing this problem will more or less require instance start to become a saga, because once it's done, starting an instance will require a lot of attendant work (reserving space on a sled, setting up V2P mappings) that we need to be able to retry if interrupted and that may need to be undone if the entire attempt to start the instance fails.
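
As a rough illustration of the undo half of that idea, the sketch below runs a list of steps in order and, if one fails, undoes the completed steps in reverse. It is a toy under stated assumptions: `Step`, `run_saga`, and the step functions are hypothetical names, this is not the saga framework Nexus uses, and it does not model persistence or retrying after an interruption.

```rust
// Hypothetical sketch of saga-style forward/undo steps. Not a real saga API.
struct Step {
    name: &'static str,
    action: fn(&mut Vec<String>) -> Result<(), String>,
    undo: fn(&mut Vec<String>),
}

/// Run each step in order. On the first failure, undo the steps that
/// already completed (most recent first) and return the error.
fn run_saga(steps: &[Step], log: &mut Vec<String>) -> Result<(), String> {
    for (i, step) in steps.iter().enumerate() {
        if let Err(e) = (step.action)(log) {
            for done in steps[..i].iter().rev() {
                (done.undo)(log);
            }
            return Err(format!("{} failed: {}", step.name, e));
        }
    }
    Ok(())
}

fn reserve_sled(log: &mut Vec<String>) -> Result<(), String> {
    log.push("reserve sled".to_string());
    Ok(())
}

fn release_sled(log: &mut Vec<String>) {
    log.push("release sled".to_string());
}

fn create_v2p(log: &mut Vec<String>) -> Result<(), String> {
    log.push("create V2P mappings".to_string());
    Ok(())
}

fn delete_v2p(log: &mut Vec<String>) {
    log.push("delete V2P mappings".to_string());
}

// Simulated failure of the final step, to exercise the unwind path.
fn ensure_instance(_log: &mut Vec<String>) -> Result<(), String> {
    Err("sled agent unreachable".to_string())
}

fn nothing_to_undo(_log: &mut Vec<String>) {}

/// Steps for a hypothetical "instance start" saga whose last step fails.
fn demo_start_steps() -> Vec<Step> {
    vec![
        Step { name: "reserve_sled", action: reserve_sled, undo: release_sled },
        Step { name: "create_v2p", action: create_v2p, undo: delete_v2p },
        Step { name: "ensure_instance", action: ensure_instance, undo: nothing_to_undo },
    ]
}

fn main() {
    let mut log = Vec::new();
    let result = run_saga(&demo_start_steps(), &mut log);
    println!("{:?}", result);
    println!("{:?}", log);
}
```

Because the last step fails, the log records the two forward actions followed by their undo actions in reverse order, so the sled reservation and V2P mappings do not leak when the start attempt fails.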
