Skip to content

CLI commands log an error but exit 0 on failure #2314

@ideaship

Description

@ideaship

Summary

Several osism subcommands log an error on a failure path and then return
control without setting a non-zero exit code. The command therefore exits 0
(success) even though the operation failed — a timeout, a failed inventory
query, a missing resource, or an unmet precondition.

This makes the affected commands unsafe in set -e scripts and && chains: a
failed lookup silently looks like success.

Found while reviewing #2313, which fixed two instances of this pattern in
osism get. A follow-up audit of osism/commands/ turned up many more.

Root cause

cliff turns the return value of take_action into the process exit code:

# cliff/command.py
return_code = self.take_action(parsed_args) or 0

So any of the following on a failure path yields exit code 0:

  • a bare return
  • return None
  • return 0
  • falling off the end of take_action (implicit None)
  • a helper method that returns None on error, whose value is then returned by
    take_action

The fix in each case is to return 1 (or another non-zero value) on the failure
path.

What PR #2313 already fixed

osism get hostvars and osism get hosts logged an error and returned None
when the underlying ansible-inventory subprocess raised
subprocess.CalledProcessError. Both now return 1 on that path.

PR #2313 also established the intended convention, which the fixes below should
follow:

  • Operational failure (the query/operation could not run: timeout,
    subprocess error, resource not found, precondition unmet) → non-zero.
  • Empty result (the query ran fine but returned nothing: a host with no
    vars, an absent variable, an empty inventory) → zero. These are valid
    answers to a query that did run, and should be left as exit 0.

Findings

Grouped by priority. Line numbers are as of commit a386d29 and may drift.

Category A — Timeout handler does not produce a non-zero exit code (highest priority)

The command waits for task output, hits TimeoutError, logs the timeout, and
then fails to return a non-zero exit code. In most cases it falls through to the
end of take_action → implicit None → exit 0. Same class as the bug PR #2313
fixed, but for the timeout path.

File Command Lines Failure mode
osism/commands/reconciler.py Sync 63-66 falls through → exit 0
osism/commands/validate.py _handle_task (value returned by take_action) 59-62 returns None → exit 0
osism/commands/netbox.py Ironic 96-104 falls through → exit 0
osism/commands/netbox.py Sync 370-378 falls through → exit 0
osism/commands/wait.py Wait 125-128 see note below

wait.py is the messiest of the group; its timeout behaviour and exit code
depend on --live and the number of tasks. After task_id = task_ids.pop()
(line 88), len(task_ids) is the number of remaining tasks, and the only place
an exit code is returned is line 131, return rc, guarded by
if len(task_ids) == 1 (exactly one task left after the pop). rc is only ever
assigned inside the try at line 124.

  • --live, single task: after the pop len(task_ids) == 0, so line 131 is
    never reached. Whether the fetch succeeds or times out, control falls off the
    end of take_action → exit 0. The timeout is logged but ignored.
  • --live, two tasks: the first iteration pops one, leaving len == 1, so
    it hits return rc after processing only the first task — the second
    (remaining) task is never processed. On a timeout in that iteration rc is
    unassigned → UnboundLocalError.
  • --live, three or more tasks: return rc fires on the iteration that
    leaves exactly one task remaining (the second-to-last task), again skipping the
    last task. A timeout there returns whatever rc an earlier successful fetch
    left behind — a stale value — or UnboundLocalError if no earlier
    iteration assigned it.
  • non---live (polling), any task count: STARTED/PENDING tasks are re-queued
    and the loop drains until everything reaches SUCCESS/unavailable, then falls
    off the end (~line 156) → exit 0. The SUCCESS branch never captures a task's
    rc, so this path always exits 0 regardless of task outcome.

So fixing wait.py is more than a one-liner: initialize rc before the loop,
return non-zero on TimeoutError, and reconsider the len(task_ids) == 1 guard,
which returns before the final task is processed.

Category B — Inventory query failure / timeout → exit 0 (report.py)

The direct twins of the PR #2313 bug. Four commands (Memory, Lldp, Bgp,
Status) each run ansible-inventory via subprocess.run(...) and bare-return
on both a non-zero return code (Error loading inventory.) and on
subprocess.TimeoutExpired (Timeout loading inventory.). The neighbouring
No hosts found in inventory. branch is the empty-result case and should be
left at exit 0 per the convention above.

Command Error loading inventory. Timeout loading inventory.
Memory 53 56
Lldp 191 194
Bgp 360 363
Status 537 540

Category C — Lookup / config / validation failure → exit 0

Real operational or input failures that bare-return (or fall through) in
take_action, silently reporting success:

  • osism/commands/baremetal.py (19 sites) — missing-argument validation,
    unconfirmed --yes-i-really-really-mean-it, "not found" errors, and one
    exception handler. Lines: 165, 169, 453, 572, 587, 716, 720, 867, 874,
    960 (exception handler — ping operation failed), 1025, 1033, 1184, 1188,
    1316, 1490, 1549, 1618, 1622.
  • osism/commands/status.py:71Run: unknown resource type, then falls
    through to exit 0.
  • osism/commands/compute.pyComputeEnable.take_action line 70
    (BadRequestException caught at line 64: cannot force host up while done
    evacuation records remain — an operational failure inside a try/finally,
    bare-return in the try); and ComputeMigrationList.take_action lines 666,
    692, 699, 711, 720, 734, 737 (no domain/user/project/server found, multiple
    servers found, bad changes-since).
  • osism/commands/server.py — 184, 191, 197, 236, 243 (no domain / no user
    / project not found).
  • osism/commands/volume.py — 69, 108, 116 (domain / project domain /
    project not found).
  • osism/commands/netbox.py — 538 (Console: NetBox not configured), 587
    (Dump: NetBox not configured), 633 (Dump: device not found).

Within Category C, the not-found / operational failures (e.g.
baremetal.py:960, and the compute/server/volume/netbox lookups) are
clear bugs: a failed lookup must not look like success. The pure
argument-validation and unconfirmed---yes cases are also arguably wrong
(returning non-zero on bad input is conventional CLI behaviour) but are a
behaviour change worth deciding on separately.

Category D — Precondition / not-found logged at info/warning → exit 0

The audit above keyed on logger.error/critical/fatal. A follow-up pass for
bare-return after logger.info/logger.warning found the same exit-code
problem
logged at a lower level. By the convention in this issue
("operational failure / precondition unmet → non-zero"), these are the same bug;
they were just easy to miss because the log level is not error.

Hard failures (should be non-zero):

  • osism/commands/server.py:82ServerMigrate.take_action: server not in
    ACTIVE/PAUSED "cannot be live migrated" (logger.info), bare-return. A
    requested migration that cannot run is a precondition failure, not success.
  • osism/commands/baremetal.py — "Could not find node {name}" for a single
    named node (logger.warning), bare-return: lines 194, 745, 1056, 1213,
    1337, 1407, 1455, 1507, 1566, 1647. Same not-found class as the logger.error
    entries in Category C, just logged at warning.

Needs case-by-case triage (may legitimately be empty-result / status → exit 0):

  • osism/commands/baremetal.py:911 — "No devices found with primary IPv4
    addresses": looks like a genuine empty-result (nothing to ping) → exit 0 is
    probably correct.
  • osism/commands/netbox.py:328,342 — "No NetBox instances configured"; :730
    "No fields matching '...' found".
  • osism/commands/lock.py:36 (existing lock reason) and :58 ("Tasks are not
    currently locked"): these read like status/no-op reporting where exit 0 is
    intended.

Not affected (verified during the audit)

These files were checked and already return non-zero (via return 1 or a
failure counter) on their error paths — no changes needed:
migrate.py, loadbalancer.py, vault.py, check.py, sonic.py (helpers
return None on error but callers check and return 1), and the Database
helpers in status.py.

Scope note: "exit code 0 on error" only meaningfully applies to cliff
take_action methods, so the audit was limited to osism/commands/. Celery
tasks (osism/tasks/), the FastAPI app (osism/api.py), and osism/services/
log errors with different semantics and were not in scope. The initial pass
keyed on logger.error/critical/fatal; Category D was added after a second
pass for logger.info/logger.warning followed by a bare return, so the same
bug at a lower log level is covered. Line numbers may still drift, and the
Category D "needs triage" entries should be confirmed before changing.

Suggested approach

  1. PR 1 — Categories A + B. The clean PR-Return non-zero exit from get on query failure #2313 analogues (timeouts and
    failed inventory queries). Low risk, highest value for set -e scripts.
    Mirror PR Return non-zero exit from get on query failure #2313's tests: a failure path yields exit 1, an empty-result path
    yields exit 0.
  2. PR 2 — Category C + Category D operational/not-found failures.
  3. Decide separately whether argument-validation / unconfirmed---yes
    paths should also become non-zero, and confirm the Category D "needs triage"
    entries.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Ready

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions