CLI commands log an error but exit 0 on failure

## Summary

Several `osism` subcommands log an error on a failure path and then return
control without setting a non-zero exit code. The command therefore exits `0`
(success) even though the operation failed — a timeout, a failed inventory
query, a missing resource, or an unmet precondition.

This makes the affected commands unsafe in `set -e` scripts and `&&` chains: a
failed lookup silently looks like success.

Found while reviewing #2313, which fixed two instances of this pattern in
`osism get`. A follow-up audit of `osism/commands/` turned up many more.

## Root cause

cliff turns the return value of `take_action` into the process exit code:

```python
# cliff/command.py
return_code = self.take_action(parsed_args) or 0
```

So any of the following on a failure path yields exit code `0`:

- a bare `return`
- `return None`
- `return 0`
- falling off the end of `take_action` (implicit `None`)
- a helper method that returns `None` on error, whose value is then returned by
  `take_action`

The fix in each case is to `return 1` (or another non-zero value) on the failure
path.
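The coercion can be reproduced without cliff. The sketch below is a minimal illustration, assuming only the `or 0` expression quoted above; `exit_code_from` and the two `failing_take_action_*` functions are hypothetical names used for the demonstration, not part of cliff or `osism`.

```python
def exit_code_from(take_action_result):
    """Mimic the quoted cliff line: any falsy return value becomes exit code 0."""
    return take_action_result or 0


def failing_take_action_bare_return():
    # Stand-in for a take_action that logs an error and then does a bare return.
    return


def failing_take_action_fixed():
    # Same failure path, but the failure is signalled to the caller.
    return 1


print(exit_code_from(failing_take_action_bare_return()))  # 0
print(exit_code_from(failing_take_action_fixed()))  # 1
```

Because `None` and `0` are both falsy, a bare `return`, `return None`, and `return 0` are indistinguishable once they pass through `or 0`.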

### What PR #2313 already fixed

`osism get hostvars` and `osism get hosts` logged an error and `return`ed `None`
when the underlying `ansible-inventory` subprocess raised
`subprocess.CalledProcessError`. Both now `return 1` on that path.

PR #2313 also established the intended convention, which the fixes below should
follow:

- **Operational failure** (the query/operation could not run: timeout,
  subprocess error, resource not found, precondition unmet) → **non-zero**.
- **Empty result** (the query ran fine but returned nothing: a host with no
  vars, an absent variable, an empty inventory) → **zero**. These are valid
  answers to a query that did run, and should be left as exit 0.

## Findings

Grouped by priority. Line numbers are as of commit `a386d29` and may drift.

### Category A — Timeout handler does not produce a non-zero exit code (highest priority)

The command waits for task output, hits `TimeoutError`, logs the timeout, and
then fails to return a non-zero exit code. In most cases it falls through to the
end of `take_action` → implicit `None` → exit 0. Same class as the bug PR #2313
fixed, but for the timeout path.

| File | Command | Lines | Failure mode |
|---|---|---|---|
| `osism/commands/reconciler.py` | `Sync` | 63-66 | falls through → exit 0 |
| `osism/commands/validate.py` | `_handle_task` (value returned by `take_action`) | 59-62 | returns `None` → exit 0 |
| `osism/commands/netbox.py` | `Ironic` | 96-104 | falls through → exit 0 |
| `osism/commands/netbox.py` | `Sync` | 370-378 | falls through → exit 0 |
| `osism/commands/wait.py` | `Wait` | 125-128 | see note below |

`wait.py` is the messiest of the group; its timeout behaviour and exit code
depend on `--live` and the number of tasks. After `task_id = task_ids.pop()`
(line 88), `len(task_ids)` is the number of *remaining* tasks, and the only place
an exit code is returned is line 131, `return rc`, guarded by
`if len(task_ids) == 1` (exactly one task left after the pop). `rc` is only ever
assigned inside the `try` at line 124.

- **`--live`, single task**: after the pop `len(task_ids) == 0`, so line 131 is
  never reached. Whether the fetch succeeds or times out, control falls off the
  end of `take_action` → exit 0. The timeout is logged but ignored.
- **`--live`, two tasks**: the first iteration pops one, leaving `len == 1`, so
  it hits `return rc` after processing only the first task — the second
  (remaining) task is never processed. On a timeout in that iteration `rc` is
  unassigned → `UnboundLocalError`.
- **`--live`, three or more tasks**: `return rc` fires on the iteration that
  leaves exactly one task remaining (the second-to-last task), again skipping the
  last task. A timeout there returns whatever `rc` an earlier successful fetch
  left behind — a **stale** value — or `UnboundLocalError` if no earlier
  iteration assigned it.
- **non-`--live` (polling), any task count**: STARTED/PENDING tasks are re-queued
  and the loop drains until everything reaches SUCCESS/unavailable, then falls
  off the end (~line 156) → exit 0. The SUCCESS branch never captures a task's
  `rc`, so this path always exits 0 regardless of task outcome.

So fixing `wait.py` is more than a one-liner: initialize `rc` before the loop,
return non-zero on `TimeoutError`, and reconsider the `len(task_ids) == 1` guard,
which returns before the final task is processed.
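One possible shape for the loop, as a rough sketch rather than a patch against `wait.py`: `wait_for_tasks` and `fetch_rc` are hypothetical names, and the rule for combining several task return codes (keep the first non-zero one) is an assumption that the issue does not settle.

```python
def wait_for_tasks(task_ids, fetch_rc):
    """Return 0 only if every task was fetched and reported a return code of 0.

    fetch_rc is a hypothetical callable standing in for the real output
    fetching: it returns a task's integer return code or raises TimeoutError.
    """
    rc = 0  # initialized before the loop, so it can never be unbound
    pending = list(task_ids)
    while pending:
        task_id = pending.pop()
        try:
            task_rc = fetch_rc(task_id)
        except TimeoutError:
            print(f"Timeout while waiting for task {task_id}")
            task_rc = 1  # a timeout is an operational failure
        # Assumed policy: keep the first non-zero code and never let a later
        # success overwrite an earlier failure.
        if rc == 0 and task_rc != 0:
            rc = task_rc
    # Single exit point after the loop: every task has been processed.
    return rc
```

Returning only after the loop has drained removes the need for a guard on the number of remaining tasks, so the last task is no longer skipped.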

### Category B — Inventory query failure / timeout → exit 0 (`report.py`)

The direct twins of the PR #2313 bug. Four commands (`Memory`, `Lldp`, `Bgp`,
`Status`) each run `ansible-inventory` via `subprocess.run(...)` and bare-`return`
both on a non-zero return code (`Error loading inventory.`) and on
`subprocess.TimeoutExpired` (`Timeout loading inventory.`). The neighbouring
`No hosts found in inventory.` branch is the empty-result case and should be
**left at exit 0** per the convention above.

| Command | `Error loading inventory.` | `Timeout loading inventory.` |
|---|---|---|
| `Memory` | 53 | 56 |
| `Lldp` | 191 | 194 |
| `Bgp` | 360 | 363 |
| `Status` | 537 | 540 |
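The three branches could be handled along these lines. This is a sketch under stated assumptions, not the code in `report.py`: `load_hosts` is a hypothetical helper, the command is passed in as a parameter, and the output is assumed to be a plain JSON list of host names, which is a simplification of what `ansible-inventory` really prints.

```python
import json
import subprocess
import sys


def load_hosts(cmd, timeout=30):
    """Return (exit_code, hosts); exit_code is 1 on operational failure, else 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        print("Timeout loading inventory.")
        return 1, []  # the query could not complete
    if result.returncode != 0:
        print("Error loading inventory.")
        return 1, []  # the query could not run
    try:
        hosts = json.loads(result.stdout)  # simplified output format (assumption)
    except ValueError:
        print("Error loading inventory.")
        return 1, []  # unreadable output is treated as a failure here
    if not hosts:
        print("No hosts found in inventory.")
        return 0, []  # empty result: the query ran fine, so exit 0
    return 0, hosts


# Usage with a stand-in command that prints an empty JSON list.
code, hosts = load_hosts([sys.executable, "-c", "print('[]')"])
```

A command's `take_action` would then return the first element of the tuple on the failure branches instead of a bare `return`.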

### Category C — Lookup / config / validation failure → exit 0

Real operational or input failures that bare-`return` (or fall through) in
`take_action`, silently reporting success:

- **`osism/commands/baremetal.py`** (19 sites) — missing-argument validation,
  unconfirmed `--yes-i-really-really-mean-it`, "not found" errors, and one
  exception handler. Lines: 165, 169, 453, 572, 587, 716, 720, 867, 874,
  **960 (exception handler — ping operation failed)**, 1025, 1033, 1184, 1188,
  1316, 1490, 1549, 1618, 1622.
- **`osism/commands/status.py:71`** — `Run`: unknown resource type, then falls
  through to exit 0.
- **`osism/commands/compute.py`** — `ComputeEnable.take_action` line 70
  (`BadRequestException` caught at line 64: cannot force host up while `done`
  evacuation records remain — an operational failure inside a `try`/`finally`,
  bare-`return` in the `try`); and `ComputeMigrationList.take_action` lines 666,
  692, 699, 711, 720, 734, 737 (no domain/user/project/server found, multiple
  servers found, bad `changes-since`).
- **`osism/commands/server.py`** — 184, 191, 197, 236, 243 (no domain / no user
  / project not found).
- **`osism/commands/volume.py`** — 69, 108, 116 (domain / project domain /
  project not found).
- **`osism/commands/netbox.py`** — 538 (`Console`: NetBox not configured), 587
  (`Dump`: NetBox not configured), 633 (`Dump`: device not found).

Within Category C, the not-found / operational failures (e.g.
`baremetal.py:960`, and the `compute`/`server`/`volume`/`netbox` lookups) are
clear bugs: a failed lookup must not look like success. The pure
argument-validation and unconfirmed-`--yes` cases are also arguably wrong
(returning non-zero on bad input is conventional CLI behaviour) but are a
behaviour change worth deciding on separately.

### Category D — Precondition / not-found logged at info/warning → exit 0

The audit above keyed on `logger.error`/`critical`/`fatal`. A follow-up pass for
bare-`return` after `logger.info`/`logger.warning` found the **same exit-code
problem** logged at a lower level. By the convention in this issue
("operational failure / precondition unmet → non-zero"), these are the same bug;
they were just easy to miss because the log level is not `error`.

Hard failures (should be non-zero):

- **`osism/commands/server.py:82`** — `ServerMigrate.take_action`: server not in
  `ACTIVE`/`PAUSED` state, so it "cannot be live migrated" (`logger.info`), bare-`return`. A
  requested migration that cannot run is a precondition failure, not success.
- **`osism/commands/baremetal.py`** — "Could not find node {name}" for a single
  named node (`logger.warning`), bare-`return`: lines 194, 745, 1056, 1213,
  1337, 1407, 1455, 1507, 1566, 1647. Same not-found class as the `logger.error`
  entries in Category C, just logged at `warning`.

Needs case-by-case triage (may legitimately be empty-result / status → exit 0):

- `osism/commands/baremetal.py:911` — "No devices found with primary IPv4
  addresses": looks like a genuine empty-result (nothing to ping) → exit 0 is
  probably correct.
- `osism/commands/netbox.py:328,342` — "No NetBox instances configured"; `:730` —
  "No fields matching '...' found".
- `osism/commands/lock.py:36` (existing lock reason) and `:58` ("Tasks are not
  currently locked"): these read like status/no-op reporting where exit 0 is
  intended.

## Not affected (verified during the audit)

These files were checked and already return non-zero (via `return 1` or a
failure counter) on their error paths — no changes needed:
`migrate.py`, `loadbalancer.py`, `vault.py`, `check.py`, `sonic.py` (helpers
return `None` on error but callers check and return 1), and the `Database`
helpers in `status.py`.

Scope note: "exit code 0 on error" only meaningfully applies to cliff
`take_action` methods, so the audit was limited to `osism/commands/`. Celery
tasks (`osism/tasks/`), the FastAPI app (`osism/api.py`), and `osism/services/`
log errors with different semantics and were not in scope. The initial pass
keyed on `logger.error`/`critical`/`fatal`; Category D was added after a second
pass for `logger.info`/`logger.warning` followed by a bare `return`, so the same
bug at a lower log level is covered. Line numbers may still drift, and the
Category D "needs triage" entries should be confirmed before changing.
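A pass like the second one could be partly automated with Python's `ast` module. The sketch below is an assumption about how such a check might look, not a description of how the audit was done; `find_suspects` and its helpers are hypothetical names, and the logger is assumed to be bound to the name `logger`. The check is purely syntactic: it flags a bare `return` (or `return None`) directly after a `logger.<level>(...)` call, so it misses fall-through cases such as those in Category A and helpers that return `None`.

```python
import ast

LOG_LEVELS = {"info", "warning", "error", "critical", "fatal"}


def _is_logger_call(stmt):
    """True for an expression statement of the form logger.<level>(...)."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return False
    func = stmt.value.func
    return (
        isinstance(func, ast.Attribute)
        and func.attr in LOG_LEVELS
        and isinstance(func.value, ast.Name)
        and func.value.id == "logger"
    )


def _is_bare_return(stmt):
    """True for `return` and `return None`."""
    if not isinstance(stmt, ast.Return):
        return False
    value = stmt.value
    return value is None or (isinstance(value, ast.Constant) and value.value is None)


def find_suspects(source):
    """Return sorted (line, level) pairs: a bare return right after a logger call."""
    suspects = []
    for node in ast.walk(ast.parse(source)):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if not isinstance(block, list):
                continue  # e.g. a lambda body is an expression, not a list
            for prev, cur in zip(block, block[1:]):
                if _is_logger_call(prev) and _is_bare_return(cur):
                    suspects.append((cur.lineno, prev.value.func.attr))
    return sorted(suspects)


SAMPLE = """\
def take_action(self, parsed_args):
    try:
        run()
    except TimeoutError:
        logger.error("timeout")
        return
    if not found:
        logger.warning("missing")
        return None
    logger.info("done")
    return 0
"""

print(find_suspects(SAMPLE))  # [(6, 'error'), (9, 'warning')]
```

Every hit still needs manual triage against the operational-failure versus empty-result convention, since the log level alone does not decide the correct exit code.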

## Suggested approach

1. **PR 1 — Categories A + B.** The clean PR-#2313 analogues (timeouts and
   failed inventory queries). Low risk, highest value for `set -e` scripts.
   Mirror PR #2313's tests: a failure path yields exit 1, an empty-result path
   yields exit 0.
2. **PR 2 — Category C + Category D operational/not-found failures.**
3. **Decide separately** whether argument-validation / unconfirmed-`--yes`
   paths should also become non-zero, and confirm the Category D "needs triage"
   entries.
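The test pairing from step 1 could look roughly like this. It is a sketch built on assumptions: `list_hosts` is a hypothetical stand-in for a command's `take_action`, the `_meta`/`hostvars` lookup is a simplified reading of the inventory JSON, and the test names are illustrative rather than copied from PR #2313.

```python
import json
import subprocess
from unittest import mock


def list_hosts():
    """Hypothetical stand-in for a take_action; not the real osism code."""
    try:
        result = subprocess.run(
            ["ansible-inventory", "--list"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        print("Error loading inventory.")
        return 1  # operational failure
    hostvars = json.loads(result.stdout).get("_meta", {}).get("hostvars", {})
    if not hostvars:
        print("No hosts found in inventory.")
    for host in sorted(hostvars):
        print(host)
    return 0  # an empty result is still a successful query


def test_failed_query_exits_1():
    err = subprocess.CalledProcessError(returncode=2, cmd="ansible-inventory")
    with mock.patch("subprocess.run", side_effect=err):
        assert list_hosts() == 1


def test_empty_result_exits_0():
    done = subprocess.CompletedProcess(
        args=[], returncode=0, stdout='{"_meta": {"hostvars": {}}}', stderr=""
    )
    with mock.patch("subprocess.run", return_value=done):
        assert list_hosts() == 0
```

Patching `subprocess.run` keeps the tests independent of a real inventory, and the second test pins the exit-0 behaviour for empty results so that later fixes do not turn it into a failure.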

