You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Several osism subcommands log an error on a failure path and then return
control without setting a non-zero exit code. The command therefore exits 0
(success) even though the operation failed — a timeout, a failed inventory
query, a missing resource, or an unmet precondition.
This makes the affected commands unsafe in set -e scripts and && chains: a
failed lookup silently looks like success.
Found while reviewing #2313, which fixed two instances of this pattern in osism get. A follow-up audit of osism/commands/ turned up many more.
Root cause
cliff turns the return value of take_action into the process exit code:
osism get hostvars and osism get hosts logged an error and returned None
when the underlying ansible-inventory subprocess raised subprocess.CalledProcessError. Both now return 1 on that path.
PR #2313 also established the intended convention, which the fixes below should
follow:
Operational failure (the query/operation could not run: timeout,
subprocess error, resource not found, precondition unmet) → non-zero.
Empty result (the query ran fine but returned nothing: a host with no
vars, an absent variable, an empty inventory) → zero. These are valid
answers to a query that did run, and should be left as exit 0.
Findings
Grouped by priority. Line numbers are as of commit a386d29 and may drift.
Category A — Timeout handler does not produce a non-zero exit code (highest priority)
The command waits for task output, hits TimeoutError, logs the timeout, and
then fails to return a non-zero exit code. In most cases it falls through to the
end of take_action → implicit None → exit 0. Same class as the bug PR #2313
fixed, but for the timeout path.
File
Command
Lines
Failure mode
osism/commands/reconciler.py
Sync
63-66
falls through → exit 0
osism/commands/validate.py
_handle_task (value returned by take_action)
59-62
returns None → exit 0
osism/commands/netbox.py
Ironic
96-104
falls through → exit 0
osism/commands/netbox.py
Sync
370-378
falls through → exit 0
osism/commands/wait.py
Wait
125-128
see note below
wait.py is the messiest of the group; its timeout behaviour and exit code
depend on --live and the number of tasks. After task_id = task_ids.pop()
(line 88), len(task_ids) is the number of remaining tasks, and the only place
an exit code is returned is line 131, return rc, guarded by if len(task_ids) == 1 (exactly one task left after the pop). rc is only ever
assigned inside the try at line 124.
--live, single task: after the pop len(task_ids) == 0, so line 131 is
never reached. Whether the fetch succeeds or times out, control falls off the
end of take_action → exit 0. The timeout is logged but ignored.
--live, two tasks: the first iteration pops one, leaving len == 1, so
it hits return rc after processing only the first task — the second
(remaining) task is never processed. On a timeout in that iteration rc is
unassigned → UnboundLocalError.
--live, three or more tasks: return rc fires on the iteration that
leaves exactly one task remaining (the second-to-last task), again skipping the
last task. A timeout there returns whatever rc an earlier successful fetch
left behind — a stale value — or UnboundLocalError if no earlier
iteration assigned it.
non---live (polling), any task count: STARTED/PENDING tasks are re-queued
and the loop drains until everything reaches SUCCESS/unavailable, then falls
off the end (~line 156) → exit 0. The SUCCESS branch never captures a task's rc, so this path always exits 0 regardless of task outcome.
So fixing wait.py is more than a one-liner: initialize rc before the loop,
return non-zero on TimeoutError, and reconsider the len(task_ids) == 1 guard,
which returns before the final task is processed.
The direct twins of the PR #2313 bug. Four commands (Memory, Lldp, Bgp, Status) each run ansible-inventory via subprocess.run(...) and bare-return
on both a non-zero return code (Error loading inventory.) and on subprocess.TimeoutExpired (Timeout loading inventory.). The neighbouring No hosts found in inventory. branch is the empty-result case and should be left at exit 0 per the convention above.
osism/commands/status.py:71 — Run: unknown resource type, then falls
through to exit 0.
osism/commands/compute.py — ComputeEnable.take_action line 70
(BadRequestException caught at line 64: cannot force host up while done
evacuation records remain — an operational failure inside a try/finally,
bare-return in the try); and ComputeMigrationList.take_action lines 666,
692, 699, 711, 720, 734, 737 (no domain/user/project/server found, multiple
servers found, bad changes-since).
osism/commands/server.py — 184, 191, 197, 236, 243 (no domain / no user
/ project not found).
osism/commands/netbox.py — 538 (Console: NetBox not configured), 587
(Dump: NetBox not configured), 633 (Dump: device not found).
Within Category C, the not-found / operational failures (e.g. baremetal.py:960, and the compute/server/volume/netbox lookups) are
clear bugs: a failed lookup must not look like success. The pure
argument-validation and unconfirmed---yes cases are also arguably wrong
(returning non-zero on bad input is conventional CLI behaviour) but are a
behaviour change worth deciding on separately.
Category D — Precondition / not-found logged at info/warning → exit 0
The audit above keyed on logger.error/critical/fatal. A follow-up pass for
bare-return after logger.info/logger.warning found the same exit-code
problem logged at a lower level. By the convention in this issue
("operational failure / precondition unmet → non-zero"), these are the same bug;
they were just easy to miss because the log level is not error.
Hard failures (should be non-zero):
osism/commands/server.py:82 — ServerMigrate.take_action: server not in ACTIVE/PAUSED "cannot be live migrated" (logger.info), bare-return. A
requested migration that cannot run is a precondition failure, not success.
osism/commands/baremetal.py — "Could not find node {name}" for a single
named node (logger.warning), bare-return: lines 194, 745, 1056, 1213,
1337, 1407, 1455, 1507, 1566, 1647. Same not-found class as the logger.error
entries in Category C, just logged at warning.
Needs case-by-case triage (may legitimately be empty-result / status → exit 0):
osism/commands/baremetal.py:911 — "No devices found with primary IPv4
addresses": looks like a genuine empty-result (nothing to ping) → exit 0 is
probably correct.
osism/commands/netbox.py:328,342 — "No NetBox instances configured"; :730 —
"No fields matching '...' found".
osism/commands/lock.py:36 (existing lock reason) and :58 ("Tasks are not
currently locked"): these read like status/no-op reporting where exit 0 is
intended.
Not affected (verified during the audit)
These files were checked and already return non-zero (via return 1 or a
failure counter) on their error paths — no changes needed: migrate.py, loadbalancer.py, vault.py, check.py, sonic.py (helpers
return None on error but callers check and return 1), and the Database
helpers in status.py.
Scope note: "exit code 0 on error" only meaningfully applies to cliff take_action methods, so the audit was limited to osism/commands/. Celery
tasks (osism/tasks/), the FastAPI app (osism/api.py), and osism/services/
log errors with different semantics and were not in scope. The initial pass
keyed on logger.error/critical/fatal; Category D was added after a second
pass for logger.info/logger.warning followed by a bare return, so the same
bug at a lower log level is covered. Line numbers may still drift, and the
Category D "needs triage" entries should be confirmed before changing.
PR 2 — Category C + Category D operational/not-found failures.
Decide separately whether argument-validation / unconfirmed---yes
paths should also become non-zero, and confirm the Category D "needs triage"
entries.
Summary
Several
osismsubcommands log an error on a failure path and then returncontrol without setting a non-zero exit code. The command therefore exits
0(success) even though the operation failed — a timeout, a failed inventory
query, a missing resource, or an unmet precondition.
This makes the affected commands unsafe in
set -escripts and&&chains: afailed lookup silently looks like success.
Found while reviewing #2313, which fixed two instances of this pattern in
osism get. A follow-up audit ofosism/commands/turned up many more.Root cause
cliff turns the return value of
take_actioninto the process exit code:So any of the following on a failure path yields exit code
0:returnreturn Nonereturn 0take_action(implicitNone)Noneon error, whose value is then returned bytake_actionThe fix in each case is to
return 1(or another non-zero value) on the failurepath.
What PR #2313 already fixed
osism get hostvarsandosism get hostslogged an error andreturnedNonewhen the underlying
ansible-inventorysubprocess raisedsubprocess.CalledProcessError. Both nowreturn 1on that path.PR #2313 also established the intended convention, which the fixes below should
follow:
subprocess error, resource not found, precondition unmet) → non-zero.
vars, an absent variable, an empty inventory) → zero. These are valid
answers to a query that did run, and should be left as exit 0.
Findings
Grouped by priority. Line numbers are as of commit
a386d29and may drift.Category A — Timeout handler does not produce a non-zero exit code (highest priority)
The command waits for task output, hits
TimeoutError, logs the timeout, andthen fails to return a non-zero exit code. In most cases it falls through to the
end of
take_action→ implicitNone→ exit 0. Same class as the bug PR #2313fixed, but for the timeout path.
osism/commands/reconciler.pySyncosism/commands/validate.py_handle_task(value returned bytake_action)None→ exit 0osism/commands/netbox.pyIronicosism/commands/netbox.pySyncosism/commands/wait.pyWaitwait.pyis the messiest of the group; its timeout behaviour and exit codedepend on
--liveand the number of tasks. Aftertask_id = task_ids.pop()(line 88),
len(task_ids)is the number of remaining tasks, and the only placean exit code is returned is line 131,
return rc, guarded byif len(task_ids) == 1(exactly one task left after the pop).rcis only everassigned inside the
tryat line 124.--live, single task: after the poplen(task_ids) == 0, so line 131 isnever reached. Whether the fetch succeeds or times out, control falls off the
end of
take_action→ exit 0. The timeout is logged but ignored.--live, two tasks: the first iteration pops one, leavinglen == 1, soit hits
return rcafter processing only the first task — the second(remaining) task is never processed. On a timeout in that iteration
rcisunassigned →
UnboundLocalError.--live, three or more tasks:return rcfires on the iteration thatleaves exactly one task remaining (the second-to-last task), again skipping the
last task. A timeout there returns whatever
rcan earlier successful fetchleft behind — a stale value — or
UnboundLocalErrorif no earlieriteration assigned it.
--live(polling), any task count: STARTED/PENDING tasks are re-queuedand the loop drains until everything reaches SUCCESS/unavailable, then falls
off the end (~line 156) → exit 0. The SUCCESS branch never captures a task's
rc, so this path always exits 0 regardless of task outcome.So fixing
wait.pyis more than a one-liner: initializercbefore the loop,return non-zero on
TimeoutError, and reconsider thelen(task_ids) == 1guard,which returns before the final task is processed.
Category B — Inventory query failure / timeout → exit 0 (
report.py)The direct twins of the PR #2313 bug. Four commands (
Memory,Lldp,Bgp,Status) each runansible-inventoryviasubprocess.run(...)and bare-returnon both a non-zero return code (
Error loading inventory.) and onsubprocess.TimeoutExpired(Timeout loading inventory.). The neighbouringNo hosts found in inventory.branch is the empty-result case and should beleft at exit 0 per the convention above.
Error loading inventory.Timeout loading inventory.MemoryLldpBgpStatusCategory C — Lookup / config / validation failure → exit 0
Real operational or input failures that bare-
return(or fall through) intake_action, silently reporting success:osism/commands/baremetal.py(19 sites) — missing-argument validation,unconfirmed
--yes-i-really-really-mean-it, "not found" errors, and oneexception handler. Lines: 165, 169, 453, 572, 587, 716, 720, 867, 874,
960 (exception handler — ping operation failed), 1025, 1033, 1184, 1188,
1316, 1490, 1549, 1618, 1622.
osism/commands/status.py:71—Run: unknown resource type, then fallsthrough to exit 0.
osism/commands/compute.py—ComputeEnable.take_actionline 70(
BadRequestExceptioncaught at line 64: cannot force host up whiledoneevacuation records remain — an operational failure inside a
try/finally,bare-
returnin thetry); andComputeMigrationList.take_actionlines 666,692, 699, 711, 720, 734, 737 (no domain/user/project/server found, multiple
servers found, bad
changes-since).osism/commands/server.py— 184, 191, 197, 236, 243 (no domain / no user/ project not found).
osism/commands/volume.py— 69, 108, 116 (domain / project domain /project not found).
osism/commands/netbox.py— 538 (Console: NetBox not configured), 587(
Dump: NetBox not configured), 633 (Dump: device not found).Within Category C, the not-found / operational failures (e.g.
baremetal.py:960, and thecompute/server/volume/netboxlookups) areclear bugs: a failed lookup must not look like success. The pure
argument-validation and unconfirmed-
--yescases are also arguably wrong(returning non-zero on bad input is conventional CLI behaviour) but are a
behaviour change worth deciding on separately.
Category D — Precondition / not-found logged at info/warning → exit 0
The audit above keyed on
logger.error/critical/fatal. A follow-up pass forbare-
returnafterlogger.info/logger.warningfound the same exit-codeproblem logged at a lower level. By the convention in this issue
("operational failure / precondition unmet → non-zero"), these are the same bug;
they were just easy to miss because the log level is not
error.Hard failures (should be non-zero):
osism/commands/server.py:82—ServerMigrate.take_action: server not inACTIVE/PAUSED"cannot be live migrated" (logger.info), bare-return. Arequested migration that cannot run is a precondition failure, not success.
osism/commands/baremetal.py— "Could not find node {name}" for a singlenamed node (
logger.warning), bare-return: lines 194, 745, 1056, 1213,1337, 1407, 1455, 1507, 1566, 1647. Same not-found class as the
logger.errorentries in Category C, just logged at
warning.Needs case-by-case triage (may legitimately be empty-result / status → exit 0):
osism/commands/baremetal.py:911— "No devices found with primary IPv4addresses": looks like a genuine empty-result (nothing to ping) → exit 0 is
probably correct.
osism/commands/netbox.py:328,342— "No NetBox instances configured";:730—"No fields matching '...' found".
osism/commands/lock.py:36(existing lock reason) and:58("Tasks are notcurrently locked"): these read like status/no-op reporting where exit 0 is
intended.
Not affected (verified during the audit)
These files were checked and already return non-zero (via
return 1or afailure counter) on their error paths — no changes needed:
migrate.py,loadbalancer.py,vault.py,check.py,sonic.py(helpersreturn
Noneon error but callers check and return 1), and theDatabasehelpers in
status.py.Scope note: "exit code 0 on error" only meaningfully applies to cliff
take_actionmethods, so the audit was limited toosism/commands/. Celerytasks (
osism/tasks/), the FastAPI app (osism/api.py), andosism/services/log errors with different semantics and were not in scope. The initial pass
keyed on
logger.error/critical/fatal; Category D was added after a secondpass for
logger.info/logger.warningfollowed by a barereturn, so the samebug at a lower log level is covered. Line numbers may still drift, and the
Category D "needs triage" entries should be confirmed before changing.
Suggested approach
failed inventory queries). Low risk, highest value for
set -escripts.Mirror PR Return non-zero exit from get on query failure #2313's tests: a failure path yields exit 1, an empty-result path
yields exit 0.
--yespaths should also become non-zero, and confirm the Category D "needs triage"
entries.