Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid task cancellation on service init failure #3707

Merged
merged 1 commit into from Jul 19, 2023
Merged

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jul 18, 2023

We have seen some behavior where, after failing to initialize services, the sled agent enters "split-brain" behavior, and on subsequent requests to initialize services, the sled agent is not aware of zones which it created itself.

I believe this to be a possible cause: While integrating #3676 , I used try_for_each_concurrent to parallelize zone bringup. This is an operation which can potentially return early on cancellation, dropping all other ongoing zone initialization tasks.

This PR mitigates this issue by preventing concurrent drop within service initialization.

Copy link
Contributor

@luqmana luqmana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

let local_existing_zones = Arc::new(Mutex::new(existing_zones));
let first_err = Arc::new(Mutex::new(None));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but as written shouldn't this be last_err?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #3709 , pulled it out to not block the build for folks MUPdating

@sunshowers
Copy link
Contributor

The try_ methods seem pretty difficult to use correctly 😬

@smklein smklein merged commit ebd3db2 into main Jul 19, 2023
20 checks passed
@smklein smklein deleted the not-that-concurrent branch July 19, 2023 03:24
luqmana added a commit that referenced this pull request Jul 19, 2023
Roughly 3 changes:
1. Skip initializing any zones we already know are running (both from
sled-agent's perspective and as reflected via `zoneadm`)
2. Log any errors `service_put` would return on the server-side.
3. Bump file-based zpools to 15G

Lack of space was causing service initialization to fail on
single-machine deployments using the `create_virtual_hardware` script.
(3) should fix this.

Because of `ENOSPC` errors, we were running into some future
cancellation issues which #3707 addressed. But actually determining the
underlying cause was a bit difficult as `RSS` was timing out (*) on the
`services_put` call and so never got back the error. (2) should
hopefully make this easier to catch in the future, e.g.:
```
04:12:31.312Z ERRO SledAgent: failed to init services: Failed to install zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127' from '/opt/oxide/clickhouse.tar.gz': Failed to execute zoneadm command '
Install' for zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127': Failed to parse command output: exit code 1
    stdout:
    A ZFS file system has been created for this zone.
    INFO: omicron: installing zone oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127 @ "/pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127"...
    INFO: omicron: replicating /usr tree...
    INFO: omicron: replicating /lib tree...
    INFO: omicron: replicating /sbin tree...
    INFO: omicron: pruning SMF manifests...
    INFO: omicron: pruning global-only files...
    INFO: omicron: unpacking baseline archive...
    INFO: omicron: unpacking image "/opt/oxide/clickhouse.tar.gz"...
    stderr:
    Error: No space left on device (os error 28)
    sled_id = 8fd66238-6bb3-4f08-89bf-f88fc2320d83

```

(*) speaking of which, do we want to increase that timeout from just
60s? it's not like any subsequent requests will work considering they'll
be blocked on a lock from the initial request. and in failure cases like
this the subsequent requests will timeout as well, leaving a bunch of
tasks in sled-agent all waiting on the same lock to try initializing.
might be worth switching to the single processing task model rack/sled
initialization use.

Finally, (1) is to address the case where we try to initialize a zone
that's already running. That's sometimes fine as we'll eventually
realize the zone already exists. But in the case of a zone with an OPTE
port, we'll run into errors like so:
```
    error_message_internal = Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte12", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:EC:09 } }
```
This happens because the
[check](https://github.com/oxidecomputer/omicron/blob/ebd3db27f6bbd08af3194cae84b0efc6b6e7248d/illumos-utils/src/zone.rs#L282C27-L282C27)
for "is zone running" comes after we do all the work necessary to create
the zone (e.g. creating an opte port). (1) updates
`initialize_services_locked` to cross-reference sled-agent and the
system's view of what zones are running so that we don't try to do the
unnecessary work.

---------

Co-authored-by: Sean Klein <sean@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants