
Sled Agent service initialization cleanup #3712

Merged 9 commits into main on Jul 19, 2023

Conversation

@luqmana (Contributor) commented on Jul 19, 2023

Roughly 3 changes:

  1. Skip initializing any zones we already know are running (both from sled-agent's perspective and as reflected via zoneadm)
  2. Log, on the server side, any errors services_put would return.
  3. Bump file-based zpools to 15G

Lack of space was causing service initialization to fail on single-machine deployments that use the create_virtual_hardware script; (3) should fix this.

Because of the ENOSPC errors, we were running into the future-cancellation issues that #3707 addressed. But determining the underlying cause was difficult, because RSS was timing out (*) on the services_put call and so never saw the error. (2) should make this easier to catch in the future, e.g.:

04:12:31.312Z ERRO SledAgent: failed to init services: Failed to install zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127' from '/opt/oxide/clickhouse.tar.gz': Failed to execute zoneadm command 'Install' for zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127': Failed to parse command output: exit code 1
    stdout:
    A ZFS file system has been created for this zone.
    INFO: omicron: installing zone oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127 @ "/pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127"...
    INFO: omicron: replicating /usr tree...
    INFO: omicron: replicating /lib tree...
    INFO: omicron: replicating /sbin tree...
    INFO: omicron: pruning SMF manifests...
    INFO: omicron: pruning global-only files...
    INFO: omicron: unpacking baseline archive...
    INFO: omicron: unpacking image "/opt/oxide/clickhouse.tar.gz"...
    stderr:
    Error: No space left on device (os error 28)
    sled_id = 8fd66238-6bb3-4f08-89bf-f88fc2320d83

(*) Speaking of which, do we want to increase that timeout from just 60s? It's not as if any subsequent requests will make progress, since they'll be blocked on a lock held by the initial request. And in failure cases like this, the subsequent requests will time out as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. It might be worth switching to the single-processing-task model that rack/sled initialization uses.

Finally, (1) addresses the case where we try to initialize a zone that's already running. That's sometimes fine, since we'll eventually realize the zone already exists. But for a zone with an OPTE port, we'll run into errors like:

    error_message_internal = Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte12", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:EC:09 } }

This happens because the check for "is this zone running" comes after we do all the work needed to create the zone (e.g. creating an OPTE port). (1) updates initialize_services_locked to cross-reference sled-agent's and the system's views of which zones are running so that we skip that unnecessary work.

@luqmana luqmana requested a review from smklein July 19, 2023 06:05
@smklein (Collaborator) left a comment:

Thanks for this cleanup, I appreciate this a ton. Hopefully this eliminates most developer-local issues, and makes future ones more obvious.

Comment on lines +2533 to +2550
_ => {
    // Mismatch between SA's view and reality, let's try to
    // clean up any remnants and try to initialize it again
    warn!(
        log,
        "expected to find existing zone in running state";
        "zone" => &name,
    );
    if let Err(e) =
        existing_zones.remove(&name).unwrap().stop().await
    {
        error!(
            log,
            "Failed to stop zone";
            "zone" => &name,
            "error" => %e,
        );
    }

Good call to do this cleanup here

@luqmana luqmana merged commit 1eec1a3 into main Jul 19, 2023
20 checks passed
@luqmana luqmana deleted the luqmana/running-zones branch July 19, 2023 08:14
@davepacheco (Collaborator) commented:

> (*) speaking of which, do we want to increase that timeout from just 60s? it's not like any subsequent requests will work considering they'll be blocked on a lock from the initial request. and in failure cases like this the subsequent requests will timeout as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. might be worth switching to the single processing task model rack/sled initialization use.

I think so. I'm not sure it ever makes sense to have a request timeout when using TCP keep-alive. We removed a similar one in #3503.

luqmana added a commit that referenced this pull request Jul 19, 2023
We're inevitably going to time out this request with the current 60s
timeout, and subsequent requests won't make progress anyway until the
task spawned from earlier ones finishes. Per @davepacheco's
[comment](#3712 (comment)),
let's just remove the timeout here.