Sled Agent service initialization cleanup #3712
Conversation
Thanks for this cleanup, I appreciate this a ton. Hopefully this eliminates most developer-local issues, and makes future ones more obvious.
```rust
_ => {
    // Mismatch between SA's view and reality; let's try to
    // clean up any remnants and initialize it again.
    warn!(
        log,
        "expected to find existing zone in running state";
        "zone" => &name,
    );
    if let Err(e) =
        existing_zones.remove(&name).unwrap().stop().await
    {
        error!(
            log,
            "Failed to stop zone";
            "zone" => &name,
            "error" => %e,
        );
    }
}
```
Good call to do this cleanup here
I think so. I'm not sure it ever makes sense to have a request timeout when using TCP keep-alive. We removed a similar one in #3503.
We're inevitably going to time out this request with the current 60s timeout, and subsequent requests won't make progress anyway until the task spawned by earlier ones finishes. Per @davepacheco's [comment](#3712 (comment)), let's just remove the timeout here.
Roughly 3 changes:

1. Cross-reference sled-agent's view of running zones with `zoneadm`'s before doing any zone-initialization work.
2. Propagate the errors that `service_put` would return on the server side.
3. Fix a disk-space issue with single-machine deployments.

Lack of space was causing service initialization to fail on single-machine deployments using the `create_virtual_hardware` script; (3) should fix this.

Because of `ENOSPC` errors, we were running into some future-cancellation issues, which #3707 addressed. But actually determining the underlying cause was a bit difficult, as `RSS` was timing out (*) on the `services_put` call and so never got back the error. (2) should hopefully make this easier to catch in the future, e.g.:

(*) Speaking of which, do we want to increase that timeout from just 60s? It's not like any subsequent requests will work, considering they'll be blocked on a lock held by the initial request. And in failure cases like this, the subsequent requests will time out as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. It might be worth switching to the single-processing-task model that rack/sled initialization uses.
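For illustration, here's a minimal std-only sketch (not sled-agent's actual code; `Request`, `spawn_worker`, and the service-name bookkeeping are all made up) of that single-processing-task model: requests are queued to one worker that handles them in order, so there's no shared lock for tasks to pile up on, and each caller gets its result back on a reply channel.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type: add services and reply with how many are running.
enum Request {
    PutServices {
        names: Vec<String>,
        reply: mpsc::Sender<Result<usize, String>>,
    },
}

// Spawn the single worker; callers interact with it only through the channel.
fn spawn_worker() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // Only this task touches the (hypothetical) service state, so no
        // lock is needed and requests are processed strictly in order.
        let mut running: Vec<String> = Vec::new();
        for req in rx {
            match req {
                Request::PutServices { names, reply } => {
                    for n in names {
                        if !running.contains(&n) {
                            running.push(n);
                        }
                    }
                    let _ = reply.send(Ok(running.len()));
                }
            }
        }
    });
    tx
}

fn main() {
    let worker = spawn_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    worker
        .send(Request::PutServices {
            names: vec!["oximeter".into()],
            reply: reply_tx,
        })
        .unwrap();
    let count = reply_rx.recv().unwrap().unwrap();
    println!("{count} service(s) running");
}
```

With this shape, a slow or failing request delays later ones in the queue but never leaves a crowd of tasks blocked on the same lock, which is the pathology described above.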
Finally, (1) addresses the case where we try to initialize a zone that's already running. That's sometimes fine, as we'll eventually realize the zone already exists. But in the case of a zone with an OPTE port, we'll run into errors like so:

This happens because the check for "is the zone running" comes after we do all the work necessary to create the zone (e.g. creating an OPTE port). (1) updates `initialize_services_locked` to cross-reference sled-agent's and the system's views of which zones are running, so that we don't do the unnecessary work.
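The cross-referencing idea can be sketched as a pure decision function (hypothetical `plan`/`Action` names and `&str` sets standing in for the real sled-agent and `zoneadm` views): a zone is skipped only when both views agree it's already running; any mismatch takes the full initialization path.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum Action {
    Skip,       // both views agree the zone is up: no OPTE port work, etc.
    Initialize, // missing or mismatched: run the full initialization path
}

// Decide, per requested zone, whether expensive setup work can be skipped.
fn plan(
    requested: &[&str],
    sled_agent_view: &HashSet<&str>,
    zoneadm_view: &HashSet<&str>,
) -> HashMap<String, Action> {
    requested
        .iter()
        .map(|z| {
            let action = if sled_agent_view.contains(z) && zoneadm_view.contains(z) {
                Action::Skip
            } else {
                Action::Initialize
            };
            (z.to_string(), action)
        })
        .collect()
}

fn main() {
    let sa: HashSet<&str> = ["oxz_ntp", "oxz_oximeter"].into_iter().collect();
    let sys: HashSet<&str> = ["oxz_ntp"].into_iter().collect();
    let actions = plan(&["oxz_ntp", "oxz_oximeter"], &sa, &sys);
    // sled-agent thinks oximeter is up, but the system disagrees, so it
    // gets reinitialized rather than skipped.
    assert_eq!(actions["oxz_ntp"], Action::Skip);
    assert_eq!(actions["oxz_oximeter"], Action::Initialize);
}
```

Checking both views *before* creating resources like OPTE ports is what avoids the "port already exists" class of error described above.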