
Sled Agent service initialization cleanup #3712

Merged 9 commits into main on Jul 19, 2023

Conversation

@luqmana (Contributor) commented on Jul 19, 2023

Roughly 3 changes:

  1. Skip initializing any zones we already know are running (both from sled-agent's perspective and as reflected via zoneadm)
  2. Log, on the server side, any errors services_put would return.
  3. Bump file-based zpools to 15G

Lack of space was causing service initialization to fail on single-machine deployments that use the create_virtual_hardware script; (3) should fix this.

Because of the ENOSPC errors, we were running into the future-cancellation issues that #3707 addressed. But determining the underlying cause was difficult, because RSS was timing out (*) on the services_put call and so never saw the error. (2) should make this easier to catch in the future, e.g.:

04:12:31.312Z ERRO SledAgent: failed to init services: Failed to install zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127' from '/opt/oxide/clickhouse.tar.gz': Failed to execute zoneadm command 'Install' for zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127': Failed to parse command output: exit code 1
    stdout:
    A ZFS file system has been created for this zone.
    INFO: omicron: installing zone oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127 @ "/pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127"...
    INFO: omicron: replicating /usr tree...
    INFO: omicron: replicating /lib tree...
    INFO: omicron: replicating /sbin tree...
    INFO: omicron: pruning SMF manifests...
    INFO: omicron: pruning global-only files...
    INFO: omicron: unpacking baseline archive...
    INFO: omicron: unpacking image "/opt/oxide/clickhouse.tar.gz"...
    stderr:
    Error: No space left on device (os error 28)
    sled_id = 8fd66238-6bb3-4f08-89bf-f88fc2320d83

(*) Speaking of which, do we want to increase that timeout from just 60s? It's not as if any subsequent requests will make progress, since they'll be blocked on a lock held by the initial request. And in failure cases like this, the subsequent requests will time out as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. It might be worth switching to the single-processing-task model that rack/sled initialization uses.

Finally, (1) addresses the case where we try to initialize a zone that's already running. That's sometimes fine, since we'll eventually realize the zone already exists. But for a zone with an OPTE port, we'll run into errors like:

    error_message_internal = Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte12", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:EC:09 } }

This happens because the check for "is this zone running" comes after we do all the work needed to create the zone (e.g. creating an OPTE port). (1) updates initialize_services_locked to cross-reference sled-agent's and the system's views of which zones are running so that we skip that unnecessary work.

@luqmana luqmana requested a review from smklein July 19, 2023 06:05
@smklein (Collaborator) left a comment:

Thanks for this cleanup, I appreciate this a ton. Hopefully this eliminates most developer-local issues, and makes future ones more obvious.

Comment on lines +2533 to +2550
_ => {
    // Mismatch between SA's view and reality, let's try to
    // clean up any remnants and try to initialize it again
    warn!(
        log,
        "expected to find existing zone in running state";
        "zone" => &name,
    );
    if let Err(e) =
        existing_zones.remove(&name).unwrap().stop().await
    {
        error!(
            log,
            "Failed to stop zone";
            "zone" => &name,
            "error" => %e,
        );
    }

Good call to do this cleanup here

@luqmana luqmana merged commit 1eec1a3 into main Jul 19, 2023
20 checks passed
@luqmana luqmana deleted the luqmana/running-zones branch July 19, 2023 08:14
@davepacheco (Collaborator) commented:

> (*) speaking of which, do we want to increase that timeout from just 60s? it's not like any subsequent requests will work considering they'll be blocked on a lock from the initial request. and in failure cases like this the subsequent requests will timeout as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. might be worth switching to the single processing task model rack/sled initialization use.

I think so. I'm not sure it ever makes sense to have a request timeout when using TCP keep-alive. We removed a similar one in #3503.

luqmana added a commit that referenced this pull request Jul 19, 2023
We're inevitably going to time out this request with the current 60s
timeout, and subsequent requests won't make progress anyway until the
task spawned from earlier ones finishes. Per @davepacheco's
[comment](#3712 (comment)),
let's just remove the timeout here.