
USHIFT-6637: Ansible: Improve performance metrics capture#6271

Merged
openshift-merge-bot[bot] merged 4 commits into openshift:main from sjug:ansible_improvements on Feb 26, 2026

Conversation

@sjug
Contributor

@sjug sjug commented Feb 24, 2026

  • Add elapsed timestamps and not ready pod tracking to boot metrics
  • Uninstall existing MicroShift before initial disk capture for clean baseline measurements
  • Fix node exporter package and service names for RHEL
  • Replace vnstat network measurement with Prometheus increase() queries
  • Add container image size capture via crio-df.py
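The increase()-based measurement mentioned in the bullets above can be sketched as follows (the helper names are illustrative and the metric/label choices are assumptions; the actual prom-network-query.py may differ):

```python
import json
import urllib.parse
import urllib.request

def build_increase_query(metric, device, window_seconds):
    """Build a PromQL increase() query over the measurement window.

    increase() extrapolates over the window and tolerates counter
    resets, unlike subtracting raw first/last counter samples.
    """
    selector = f'device="{device}"'
    return f"increase({metric}{{{selector}}}[{window_seconds}s])"

def query_instant(prometheus_url, query, ts, timeout=10):
    """Run an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query, "time": ts})
    url = f"{prometheus_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# Example: RX bytes over a 300-second window ending at end_epoch.
start_epoch, end_epoch = 1_700_000_000, 1_700_000_300
q = build_increase_query("node_network_receive_bytes_total", "eth0",
                         end_epoch - start_epoch)
print(q)  # → increase(node_network_receive_bytes_total{device="eth0"}[300s])
```

Evaluating this query at end_epoch yields the total bytes received over the window, which is what replaces the vnstat snapshot deltas.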

Summary by CodeRabbit

  • New Features

    • Detailed container image storage reporting and a Prometheus-backed network query tool.
  • Improvements

    • Replaced vnStat-based network collection with Prometheus; records measurement start/end timestamps.
    • Readiness checks now include elapsed time and list not-ready pods.
    • Detects and cleans prior MicroShift installations; standardizes node_exporter service/package name.
  • Chores

    • Removed default vnstat_db setting.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 24, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 24, 2026

@sjug: This pull request references USHIFT-6637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds two new CLI utilities for CRI-O disk reporting and Prometheus network queries, replaces vnstat collection with these tools in microshift-start, updates node-exporter naming and package, adds microshift cleanup/uninstall steps, and enhances ready.sh diagnostics and timing output.

Changes

  • CRI-O storage utility: ansible/roles/common/tasks/files/crio-df.py
    New Python script that parses CRI‑O metadata to build layer graphs, compute per-image and total sizes (unique vs shared), reclaimable space, and compressed download sizes; supports verbose and concise outputs.
  • Prometheus network query utility: ansible/roles/common/tasks/files/prom-network-query.py
    New Python script that queries Prometheus over a configurable window, handles HTTP/JSON errors and counter resets, computes RX/TX increases, prints readable output, and emits a JSON summary.
  • microshift-start role changes: ansible/roles/microshift-start/defaults/main.yml, ansible/roles/microshift-start/tasks/main.yml
    Removed the vnstat_db default and vnstat tasks; added tasks to copy and run crio-df.py, capture image sizes and start/end epochs, and conditionally run prom-network-query.py when prometheus_logging is enabled.
  • Microshift uninstall / cleanup: ansible/roles/install-microshift/tasks/main.yml
    Added an rpm presence check and a cleanup/uninstall block: stop the microshift service, discover and run cleanup scripts with --all, and uninstall packages via dnf autoremove.
  • Node-exporter naming & package update: ansible/roles/install-logging-exporters/defaults/main.yml, ansible/roles/install-logging-exporters/tasks/main.yml
    Replaced prometheus-node-exporter with node_exporter in defaults and changed the installed package from golang-github-prometheus-node-exporter to node-exporter.
  • Readiness script diagnostics: ansible/roles/common/tasks/files/ready.sh
    Enhanced progress output to include elapsed time and list non-ready pods; the final JSON now includes start_epoch and end_epoch.
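The unique-vs-shared layer accounting described for crio-df.py can be sketched like this (the data layout and function names here are illustrative, not the script's actual structures):

```python
from collections import Counter

def layer_usage(images):
    """Count how many images reference each layer digest."""
    counts = Counter()
    for img in images:
        counts.update(img["layers"])
    return counts

def image_sizes(images, layer_sizes):
    """Per-image total vs unique size.

    'total' counts every layer the image uses; 'unique' counts only
    layers no other image shares, i.e. the space reclaimed if this
    image alone were removed.
    """
    usage = layer_usage(images)
    report = []
    for img in images:
        total = sum(layer_sizes[lid] for lid in img["layers"])
        unique = sum(layer_sizes[lid] for lid in img["layers"]
                     if usage[lid] == 1)
        report.append({"name": img["name"], "total": total, "unique": unique})
    return report

# Two images sharing a 100-byte base layer.
images = [
    {"name": "a", "layers": ["base", "a1"]},
    {"name": "b", "layers": ["base", "b1"]},
]
layer_sizes = {"base": 100, "a1": 10, "b1": 20}
print(image_sizes(images, layer_sizes))
```

The deduplicated total here is 130 bytes (each layer counted once), while the naive sum of per-image totals is 230, which is the distinction the script's "unique vs shared" report captures.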

Sequence Diagram(s)

sequenceDiagram
participant Controller as Ansible Controller
participant Host as Target Host (CRI‑O)
participant Prom as Prometheus

Controller->>Host: copy `crio-df.py`
Controller->>Host: execute `crio-df.py` (collect image sizes)
Host-->>Controller: `results/images.txt` (image sizes, reclaimable)
Controller->>Controller: record network start_epoch
alt prometheus_logging enabled
    Controller->>Prom: run `prom-network-query.py` / API queries
    Prom-->>Controller: RX/TX totals, timeseries, JSON summary
    Controller->>Controller: save `results/network.txt` and measurement end_epoch
end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly summarizes the main objective of the changeset: improving performance metrics capture through new tools (crio-df.py, prom-network-query.py) and modifications to existing metrics collection.


@openshift-ci openshift-ci bot requested review from eslutsky and pacevedom on February 24, 2026 14:41

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 7

🧹 Nitpick comments (9)
ansible/roles/common/tasks/files/ready.sh (1)

126-126: local elapsed declared twice in the same function.

Line 110 already declares local elapsed in wait_for_ready. This second local on line 126 is harmless in bash but redundant — you can just reassign elapsed=... here.

Proposed fix
-      local elapsed=$(( $(date +%s) - START_TIME ))
+      elapsed=$(( $(date +%s) - START_TIME ))
ansible/roles/common/tasks/files/crio-df.py (5)

340-343: Consider using argparse instead of manual sys.argv inspection.

Currently -v anywhere in argv triggers verbose mode, which could conflict if future args are added. Low risk for now, but argparse is already used in the sibling script.
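A minimal argparse version of that entrypoint might look like the following (a sketch of the suggestion, not the script's actual code):

```python
import argparse

def parse_args(argv=None):
    """Parse CLI flags; -v/--verbose only matches as a real flag,
    not as a substring anywhere in argv."""
    parser = argparse.ArgumentParser(
        description="Report CRI-O image disk usage")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print per-layer detail")
    return parser.parse_args(argv)

args = parse_args(["-v"])
print(args.verbose)  # → True
```

Unknown flags then produce a usage error instead of being silently ignored, which keeps future argument additions safe.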


333-333: Ambiguous variable name l.

Rename to something clearer like lid or layer_id per E741.

Proposed fix
-        print(f"  Shared layers (used by >1 image): {len([l for l, c in layer_count.items() if c > 1])}")
+        print(f"  Shared layers (used by >1 image): {len([lid for lid, c in layer_count.items() if c > 1])}")

281-284: Unused variables: comp_naive, comp_shared_size, comp_layer_usage.

These are computed but never referenced. Either use them in output or remove to avoid confusion.

Proposed fix
-    comp_layers, comp_layer_usage, comp_shared, comp_per_image = get_compressed_sizes(images)
+    comp_layers, _, comp_shared, comp_per_image = get_compressed_sizes(images)
     comp_deduplicated = sum(comp_layers.values())
-    comp_naive = sum(img["compressed"] for img in comp_per_image)
-    comp_shared_size = sum(comp_layers[d] for d in comp_shared)

296-296: f-strings without placeholders on lines 296, 306, 311.

These are plain strings with no interpolation — drop the f prefix.

Proposed fix
-        print(f"\nContainers space usage:\n")
+        print("\nContainers space usage:\n")
-        print(f"\nLocal Volumes space usage:\n")
+        print("\nLocal Volumes space usage:\n")
-        print(f"\nCompressed download sizes:\n")
+        print("\nCompressed download sizes:\n")

Also applies to: 306-306, 311-311


300-300: Silently truncates container listing to 10 entries.

volatile_layers[:10] caps the output with no indication to the user that entries were omitted. If the limit is intentional, print a note when truncated.

Proposed fix
         for container in volatile_layers[:10]:  # Limit to first 10
             ...
+        if len(volatile_layers) > 10:
+            print(f"  ... and {len(volatile_layers) - 10} more containers")
ansible/roles/common/tasks/files/prom-network-query.py (2)

128-136: No error handling for malformed Prometheus responses.

Line 134 accesses data["data"]["result"] — a non-200 or unexpected response shape will raise KeyError. Consider wrapping in a try/except or using .get().

Proposed fix
 def get_increase(metric, start, end, prometheus, device, instance=None):
     """Use increase() to get accurate total accounting for counter resets."""
     window = end - start
     sel = label_selector(device, instance)
     query = f"increase({metric}{{{sel}}}[{window}s])"
     data = query_instant(prometheus, query, end)
-    if data["status"] != "success" or not data["data"]["result"]:
-        return None
-    return float(data["data"]["result"][0]["value"][1])
+    try:
+        if data.get("status") != "success" or not data.get("data", {}).get("result"):
+            return None
+        return float(data["data"]["result"][0]["value"][1])
+    except (KeyError, IndexError, TypeError, ValueError):
+        return None

217-226: JSON summary reports 0 bytes when there's no data — indistinguishable from actual zero traffic.

When rx_increase or tx_increase is None, the summary emits 0. Downstream consumers can't tell "no data" from "zero transfer." Consider using null instead.

Proposed fix
     summary = {
         ...
-        "rx_bytes": int(rx_increase) if rx_increase is not None else 0,
-        "tx_bytes": int(tx_increase) if tx_increase is not None else 0,
+        "rx_bytes": int(rx_increase) if rx_increase is not None else None,
+        "tx_bytes": int(tx_increase) if tx_increase is not None else None,
         ...
     }
ansible/roles/microshift-start/tasks/main.yml (1)

125-143: Image capture block runs unconditionally — no guard or error handling.

If CRI-O isn't running or has no images, crio-df.py will fail and halt the playbook. Consider adding ignore_errors: true or gating on a condition if this is a best-effort metric.


ℹ️ Review info

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 1381b42 and d618e67.

📒 Files selected for processing (8)
  • ansible/roles/common/tasks/files/crio-df.py
  • ansible/roles/common/tasks/files/prom-network-query.py
  • ansible/roles/common/tasks/files/ready.sh
  • ansible/roles/install-logging-exporters/defaults/main.yml
  • ansible/roles/install-logging-exporters/tasks/main.yml
  • ansible/roles/install-microshift/tasks/main.yml
  • ansible/roles/microshift-start/defaults/main.yml
  • ansible/roles/microshift-start/tasks/main.yml
💤 Files with no reviewable changes (1)
  • ansible/roles/microshift-start/defaults/main.yml

@sjug sjug force-pushed the ansible_improvements branch from d618e67 to 6318af1 on February 24, 2026 16:40

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (6)
ansible/roles/microshift-start/tasks/main.yml (3)

149-166: ⚠️ Potential issue | 🟡 Minor

Hardcoded zen3 fallback hides misconfiguration.

Please fail fast or warn when no logging host/endpoint is configured.


173-181: ⚠️ Potential issue | 🟡 Minor

Use a role/playbook‑relative path for prom-network-query.py.

The current relative path is fragile.

Suggested fix
-        python3 roles/common/tasks/files/prom-network-query.py
+        python3 {{ role_path }}/../common/tasks/files/prom-network-query.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 173 - 181, The
task "run prometheus network query" uses a fragile relative path to
prom-network-query.py; update the command to use a role/playbook-relative
absolute path (e.g. reference {{ role_path }} when the script lives in the
current role or {{ playbook_dir }}/roles/common/files/prom-network-query.py if
it's in the common role) so the command invokes the correct file regardless of
cwd, keeping the rest of the arguments ({{ prom_network_endpoint }}, {{
prom_network_instance }}, {{ prom_network_device }}, {{
network_start_epoch.stdout }}, {{ network_end_epoch.stdout }}) unchanged.

125-133: ⚠️ Potential issue | 🟡 Minor

Use role_path to reference files in other roles instead of CWD-relative paths.

The path roles/common/tasks/files/crio-df.py depends on the playbook being run from the ansible/ directory. Running playbooks from elsewhere will break this reference.

Suggested fix
-        src: roles/common/tasks/files/crio-df.py
+        src: "{{ role_path }}/../common/tasks/files/crio-df.py"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 125 - 133, The
src path in the "copy crio-df script to remote host" task is CWD-relative and
can break; update the ansible.builtin.copy src to use role_path so it references
the common role reliably, e.g. replace the literal
"roles/common/tasks/files/crio-df.py" in the src field with a path built from
role_path (for this role) like "{{ role_path
}}/../common/tasks/files/crio-df.py" so the file is found regardless of working
directory.
ansible/roles/common/tasks/files/prom-network-query.py (1)

143-144: ⚠️ Potential issue | 🟡 Minor

Use UTC for timestamps to match the rest of the script.

fmt_ts currently uses local time; the window logic is UTC.

Suggested fix
-def fmt_ts(epoch):
-    return datetime.fromtimestamp(epoch).strftime("%Y-%m-%d %H:%M:%S")
+def fmt_ts(epoch):
+    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 143 -
144, The fmt_ts function currently formats epoch times in local time; change it
to produce UTC timestamps to match the rest of the script by using a UTC-based
conversion (e.g., datetime.utcfromtimestamp or datetime.fromtimestamp with
timezone.utc) so fmt_ts(epoch) returns a UTC-formatted string like "%Y-%m-%d
%H:%M:%S"; update the implementation of fmt_ts accordingly so all downstream
window logic remains consistent.
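The suggested change is easy to sanity-check in isolation: pinning the conversion to UTC makes the output independent of the controller's timezone, matching the script's UTC-based window math. A standalone sketch of the fixed helper:

```python
from datetime import datetime, timezone


def fmt_ts(epoch):
    # Pin to UTC so the rendered timestamp matches the UTC window math,
    # regardless of the local timezone of the machine running the script.
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(fmt_ts(0))  # → 1970-01-01 00:00:00
```

With the original local-time version, the same epoch would render differently on controllers in different timezones.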
ansible/roles/install-microshift/tasks/main.yml (2)

33-39: ⚠️ Potential issue | 🔴 Critical

dnf won’t expand microshift-* for state: absent.

This matches nothing, so packages remain installed. Use list/filter → remove.

Suggested fix
+    - name: list installed microshift packages
+      ansible.builtin.dnf:
+        list: installed
+      register: installed_packages
+
     - name: uninstall microshift packages and dependencies
       become: yes
       ansible.builtin.dnf:
-        name:
-          - microshift-*
+        name: "{{ installed_packages.results | selectattr('name', 'search', '^microshift') | map(attribute='name') | list }}"
         autoremove: yes
         state: absent
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/install-microshift/tasks/main.yml` around lines 33 - 39, The
current task "uninstall microshift packages and dependencies" uses
ansible.builtin.dnf with name: microshift-* which won't expand when state:
absent; instead first gather the installed package names that match the pattern
(e.g., via package_facts or a shell/command that lists installed packages
matching "microshift-*"), store them in a variable (e.g., microshift_packages),
then call ansible.builtin.dnf with name: "{{ microshift_packages }}" autoremove:
yes state: absent so you pass an explicit list to the dnf module and only remove
packages that are actually installed.
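The selection step that the suggested Jinja expression performs is just a regex match over installed package names. The same filtering in plain Python, with a made-up package list for illustration:

```python
import re

# Hypothetical installed-package facts, shaped like dnf's list output.
installed = [
    {"name": "microshift", "version": "4.17.0"},
    {"name": "microshift-networking", "version": "4.17.0"},
    {"name": "microshift-selinux", "version": "4.17.0"},
    {"name": "openvswitch", "version": "3.1"},
]

# Mirrors: installed_packages.results | selectattr('name', 'search', '^microshift')
to_remove = [p["name"] for p in installed if re.search(r"^microshift", p["name"])]
print(to_remove)  # → ['microshift', 'microshift-networking', 'microshift-selinux']
```

Passing this explicit list to the dnf module avoids relying on glob expansion, which `state: absent` does not perform.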

4-8: ⚠️ Potential issue | 🟡 Minor

Detect sub‑packages before skipping cleanup.

rpm -q microshift only matches the base package; if only microshift-* sub‑packages are installed, the cleanup block never runs.

Suggested fix
-  ansible.builtin.command: rpm -q microshift
+  ansible.builtin.command: rpm -q --whatprovides microshift
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/install-microshift/tasks/main.yml` around lines 4 - 8, The
current check uses "rpm -q microshift" which only matches the base package;
update the task that registers microshift_installed to detect any installed
package whose name begins with "microshift" (e.g., replace the command with
something like `rpm -qa | grep -E '^microshift'` or `rpm -q "microshift*"`),
keep register: microshift_installed, ignore_errors: true and changed_when:
false, and ensure any later cleanup conditional continues to reference
microshift_installed (e.g., microshift_installed.rc or
microshift_installed.stdout) so the cleanup block runs when any microshift-*
subpackage is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ansible/roles/common/tasks/files/crio-df.py`:
- Around line 169-175: The variables flagged as unused should be removed or
renamed to underscore-prefixed names to satisfy linters: in the block using
get_image_display_name(image) remove the unused tag (use repo =
get_image_display_name(image) or assign repo, _ = get_image_display_name(image))
and in the section that computes compression metrics remove or underscore the
unused comp_layer_usage, comp_naive, and comp_shared_size variables; update the
per_image construction (per_image.append({...})) to only reference actual fields
(e.g., "id", "name", "compressed", "layers") so no unused local names remain and
lint warnings RUF059/F841 are resolved.
- Around line 296-297: Several print statements use f-strings with no
placeholders (e.g. print(f"\nContainers space usage:\n"), print(f"{'CONTAINER
ID':<12} {'IMAGE':<35} ... {'NAMES'}")) which triggers Ruff/Flake8 F541; remove
the unnecessary f prefix on those string literals (also for the similar prints
at the other occurrences around the block shown — the lines printing the header
and the formatted columns at 296-297 and the ones at 306-307 and 311-312) so
they become plain string arguments to print without changing the string content
or formatting expressions.
- Line 333: Rename the ambiguous loop variable 'l' in the list comprehension to
a clearer name (e.g., 'layer') to satisfy Flake8 E741; specifically update the
expression that builds the count of shared layers (it references layer_count and
currently uses "for l, c in layer_count.items()") so it reads "for layer, c in
layer_count.items()" (and update any matching variable references in that
expression).
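The comprehension in question counts layers referenced by more than one image. With the loop variable renamed as suggested, the logic looks roughly like this; the image-to-layer mapping and digests are invented for illustration:

```python
from collections import Counter

# Hypothetical mapping of image ID to the set of layer digests it references.
images = {
    "img-a": {"sha256:aaa", "sha256:bbb"},
    "img-b": {"sha256:bbb", "sha256:ccc"},
    "img-c": {"sha256:bbb"},
}

layer_count = Counter()
for layers in images.values():
    layer_count.update(layers)

# E741 fix: 'layer' instead of the ambiguous single-letter 'l'.
shared = len([layer for layer, c in layer_count.items() if c > 1])
print(shared)  # → 1  (only sha256:bbb is referenced by more than one image)
```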

---

Duplicate comments:
In `@ansible/roles/common/tasks/files/prom-network-query.py`:
- Around line 143-144: The fmt_ts function currently formats epoch times in
local time; change it to produce UTC timestamps to match the rest of the script
by using a UTC-based conversion (e.g., datetime.utcfromtimestamp or
datetime.fromtimestamp with timezone.utc) so fmt_ts(epoch) returns a
UTC-formatted string like "%Y-%m-%d %H:%M:%S"; update the implementation of
fmt_ts accordingly so all downstream window logic remains consistent.

In `@ansible/roles/install-microshift/tasks/main.yml`:
- Around line 33-39: The current task "uninstall microshift packages and
dependencies" uses ansible.builtin.dnf with name: microshift-* which won't
expand when state: absent; instead first gather the installed package names that
match the pattern (e.g., via package_facts or a shell/command that lists
installed packages matching "microshift-*"), store them in a variable (e.g.,
microshift_packages), then call ansible.builtin.dnf with name: "{{
microshift_packages }}" autoremove: yes state: absent so you pass an explicit
list to the dnf module and only remove packages that are actually installed.
- Around line 4-8: The current check uses "rpm -q microshift" which only matches
the base package; update the task that registers microshift_installed to detect
any installed package whose name begins with "microshift" (e.g., replace the
command with something like `rpm -qa | grep -E '^microshift'` or `rpm -q
"microshift*"`), keep register: microshift_installed, ignore_errors: true and
changed_when: false, and ensure any later cleanup conditional continues to
reference microshift_installed (e.g., microshift_installed.rc or
microshift_installed.stdout) so the cleanup block runs when any microshift-*
subpackage is present.

In `@ansible/roles/microshift-start/tasks/main.yml`:
- Around line 149-166: The prom_network_endpoint resolution currently falls back
to a hardcoded 'zen3' which hides misconfiguration; update the logic in the
prom_network_endpoint set_fact (referencing prometheus_query_endpoint,
prometheus_port, groups['logging'] and hostvars) to detect when
groups['logging'] is empty and either (a) fail the playbook with a descriptive
error via ansible.builtin.fail explaining no logging host/endpoint is
configured, or (b) emit a warning via ansible.builtin.debug/ansible.builtin.warn
and require an explicit prometheus_query_endpoint; ensure the fallback to 'zen3'
is removed so missing configuration causes a visible failure or warning.
- Around line 173-181: The task "run prometheus network query" uses a fragile
relative path to prom-network-query.py; update the command to use a
role/playbook-relative absolute path (e.g. reference {{ role_path }} when the
script lives in the current role or {{ playbook_dir
}}/roles/common/files/prom-network-query.py if it's in the common role) so the
command invokes the correct file regardless of cwd, keeping the rest of the
arguments ({{ prom_network_endpoint }}, {{ prom_network_instance }}, {{
prom_network_device }}, {{ network_start_epoch.stdout }}, {{
network_end_epoch.stdout }}) unchanged.
- Around line 125-133: The src path in the "copy crio-df script to remote host"
task is CWD-relative and can break; update the ansible.builtin.copy src to use
role_path so it references the common role reliably, e.g. replace the literal
"roles/common/tasks/files/crio-df.py" in the src field with a path built from
role_path (for this role) like "{{ role_path
}}/../common/tasks/files/crio-df.py" so the file is found regardless of working
directory.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between d618e67 and 6318af1.

📒 Files selected for processing (8)
  • ansible/roles/common/tasks/files/crio-df.py
  • ansible/roles/common/tasks/files/prom-network-query.py
  • ansible/roles/common/tasks/files/ready.sh
  • ansible/roles/install-logging-exporters/defaults/main.yml
  • ansible/roles/install-logging-exporters/tasks/main.yml
  • ansible/roles/install-microshift/tasks/main.yml
  • ansible/roles/microshift-start/defaults/main.yml
  • ansible/roles/microshift-start/tasks/main.yml
💤 Files with no reviewable changes (1)
  • ansible/roles/microshift-start/defaults/main.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • ansible/roles/common/tasks/files/ready.sh

@sjug sjug force-pushed the ansible_improvements branch from 6318af1 to 836c547 Compare February 24, 2026 16:56
@openshift-ci-robot

openshift-ci-robot commented Feb 24, 2026

@sjug: This pull request references USHIFT-6637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Add elapsed timestamps and not ready pod tracking to boot metrics
  • Uninstall existing MicroShift before initial disk capture for clean baseline measurements
  • Fix node exporter package and service names for RHEL
  • Replace vnstat network measurement with Prometheus increase() queries
  • Add container image size capture via crio-df.py

Summary by CodeRabbit

  • New Features

  • Added detailed container image storage reporting and a Prometheus-backed network query tool.

  • Improvements

  • Replaced vnStat-based network collection with Prometheus; preserves measurement start/end timestamps.

  • Readiness checks now include elapsed time and diagnostics for not-ready pods.

  • Detects and cleans prior MicroShift installations; renamed node exporter service/package entries for consistency.

  • Chores

  • Removed default vnstat_db setting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
ansible/roles/microshift-start/tasks/main.yml (3)

171-182: ⚠️ Potential issue | 🟡 Minor

Relative script path for prom-network-query.py is CWD-dependent.

python3 roles/common/tasks/files/prom-network-query.py relies on the ansible-playbook working directory. This was flagged in a previous review and remains unaddressed. Using {{ playbook_dir }} to anchor the path removes the fragility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 171 - 182, The
task "run prometheus network query" is invoking the script with a relative path
which makes execution CWD-dependent; change the command to use an anchored path
by prefacing the script with {{ playbook_dir }} (e.g., {{ playbook_dir
}}/roles/common/tasks/files/prom-network-query.py) so the
ansible.builtin.command uses an absolute path, keeping the existing
flags/variables ({{ prom_network_endpoint }}, {{ prom_network_instance }}, {{
prom_network_device }}, {{ network_start_epoch.stdout }}, {{
network_end_epoch.stdout }}) and keeping the same register/ delegate_to
behavior.

149-169: ⚠️ Potential issue | 🟡 Minor

Hardcoded zen3 fallback will silently fail in other environments.

When neither prometheus_query_endpoint is set nor the logging group exists, the endpoint resolves to http://zen3:9091 — a lab-specific hostname. This was raised in a previous review and remains unaddressed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 149 - 169, The
prom_network_endpoint fact currently falls back to the hardcoded hostname "zen3"
when neither prometheus_query_endpoint nor a logging group exists; update the
resolution in the prom_network_endpoint set_fact to use a safe, portable default
(for example inventory_hostname or ansible_default_ipv4.address or "localhost")
instead of "zen3" so it won't silently point at a lab-specific host—change the
fallback logic around prometheus_query_endpoint and the groups['logging'] branch
in the prom_network_endpoint expression to use that portable default and ensure
prom_network_device and prom_network_instance remain unchanged.

127-132: ⚠️ Potential issue | 🟡 Minor

Relative src path for crio-df.py copy is CWD-dependent.

src: roles/common/tasks/files/crio-df.py breaks if ansible-playbook is invoked from anywhere other than ansible/. Use {{ role_path }} or a role-relative reference instead, consistent with how prom-network-query.py should be referenced.

🛠️ Proposed fix
-        src: roles/common/tasks/files/crio-df.py
+        src: "{{ playbook_dir }}/../roles/common/tasks/files/crio-df.py"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 127 - 132, The
copy task "copy crio-df script to remote host" uses a CWD-dependent src path;
change src to a role-relative reference (e.g. src: "{{ role_path
}}/files/crio-df.py") or otherwise use the same pattern as prom-network-query.py
so the file is resolved regardless of where ansible-playbook is invoked; ensure
the crio-df.py file lives in that role's files directory (or update the
role_path to point to the common role's files if it intentionally lives in
roles/common/files).
ansible/roles/common/tasks/files/prom-network-query.py (1)

143-144: ⚠️ Potential issue | 🟡 Minor

fmt_ts uses local time while the rest of the script uses UTC.

datetime.fromtimestamp(epoch) returns local time; parse_window_args uses datetime.now(timezone.utc). Displayed timestamps will silently shift based on the controller's timezone.

🛠️ Proposed fix
 def fmt_ts(epoch):
-    return datetime.fromtimestamp(epoch).strftime("%Y-%m-%d %H:%M:%S")
+    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 143 -
144, The helper fmt_ts currently uses datetime.fromtimestamp(epoch) which yields
local time and conflicts with the UTC-based logic in parse_window_args (which
uses datetime.now(timezone.utc)); update fmt_ts to produce UTC-aware timestamps
(e.g., use datetime.fromtimestamp(epoch, timezone.utc) or equivalent) and keep
the same strftime format so all displayed timestamps are in UTC and consistent
with parse_window_args.
🧹 Nitpick comments (1)
ansible/roles/common/tasks/files/crio-df.py (1)

29-29: Optional: Ruff BLE001 — broad Exception catches still flagged.

Both catches are intentional (line 29 prints a warning, line 94 returns "Unknown"), so no behavior change is needed. If you want to silence the linter cleanly, you can narrow line 29 to the expected OS/IO errors and leave line 94 as-is with a # noqa: BLE001 comment.

Also applies to: 94-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/crio-df.py` at line 29, The broad except at
the first catch (the "except Exception as e:" that logs a warning) should be
narrowed to the specific expected errors (e.g., except (OSError, IOError) as e:)
so the linter BLE001 is satisfied while preserving behavior; leave the second
broad catch that returns "Unknown" as-is but add a "# noqa: BLE001" comment to
that line to explicitly silence Ruff. Ensure you update the warning log
statement using the same exception variable name (e) so existing logging remains
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ansible/roles/common/tasks/files/prom-network-query.py`:
- Around line 264-266: The loop converting Prometheus sample strings uses
int(val) which raises ValueError for scientific-notation strings; change the
conversion to parse as float first and then to int (e.g. int(float(val))) inside
the loop that iterates over values before calling fmt_ts(ts) so
scientific-notation and large float64 strings are handled correctly; update the
conversion in the block where values are unpacked (the for ts, val in values
loop) so downstream code (fmt_ts and any consumers of val) receives an integer.

---

Duplicate comments:
In `@ansible/roles/common/tasks/files/prom-network-query.py`:
- Around line 143-144: The helper fmt_ts currently uses
datetime.fromtimestamp(epoch) which yields local time and conflicts with the
UTC-based logic in parse_window_args (which uses datetime.now(timezone.utc));
update fmt_ts to produce UTC-aware timestamps (e.g., use
datetime.fromtimestamp(epoch, timezone.utc) or equivalent) and keep the same
strftime format so all displayed timestamps are in UTC and consistent with
parse_window_args.

In `@ansible/roles/microshift-start/tasks/main.yml`:
- Around line 171-182: The task "run prometheus network query" is invoking the
script with a relative path which makes execution CWD-dependent; change the
command to use an anchored path by prefacing the script with {{ playbook_dir }}
(e.g., {{ playbook_dir }}/roles/common/tasks/files/prom-network-query.py) so the
ansible.builtin.command uses an absolute path, keeping the existing
flags/variables ({{ prom_network_endpoint }}, {{ prom_network_instance }}, {{
prom_network_device }}, {{ network_start_epoch.stdout }}, {{
network_end_epoch.stdout }}) and keeping the same register/ delegate_to
behavior.
- Around line 149-169: The prom_network_endpoint fact currently falls back to
the hardcoded hostname "zen3" when neither prometheus_query_endpoint nor a
logging group exists; update the resolution in the prom_network_endpoint
set_fact to use a safe, portable default (for example inventory_hostname or
ansible_default_ipv4.address or "localhost") instead of "zen3" so it won't
silently point at a lab-specific host—change the fallback logic around
prometheus_query_endpoint and the groups['logging'] branch in the
prom_network_endpoint expression to use that portable default and ensure
prom_network_device and prom_network_instance remain unchanged.
- Around line 127-132: The copy task "copy crio-df script to remote host" uses a
CWD-dependent src path; change src to a role-relative reference (e.g. src: "{{
role_path }}/files/crio-df.py") or otherwise use the same pattern as
prom-network-query.py so the file is resolved regardless of where
ansible-playbook is invoked; ensure the crio-df.py file lives in that role's
files directory (or update the role_path to point to the common role's files if
it intentionally lives in roles/common/files).

---

Nitpick comments:
In `@ansible/roles/common/tasks/files/crio-df.py`:
- Line 29: The broad except at the first catch (the "except Exception as e:"
that logs a warning) should be narrowed to the specific expected errors (e.g.,
except (OSError, IOError) as e:) so the linter BLE001 is satisfied while
preserving behavior; leave the second broad catch that returns "Unknown" as-is
but add a "# noqa: BLE001" comment to that line to explicitly silence Ruff.
Ensure you update the warning log statement using the same exception variable
name (e) so existing logging remains unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6318af1 and 836c547.

📒 Files selected for processing (4)
  • ansible/roles/common/tasks/files/crio-df.py
  • ansible/roles/common/tasks/files/prom-network-query.py
  • ansible/roles/microshift-start/defaults/main.yml
  • ansible/roles/microshift-start/tasks/main.yml
💤 Files with no reviewable changes (1)
  • ansible/roles/microshift-start/defaults/main.yml

@sjug sjug force-pushed the ansible_improvements branch from 836c547 to ec77ae3 Compare February 24, 2026 17:17
@openshift-ci-robot

openshift-ci-robot commented Feb 24, 2026

@sjug: This pull request references USHIFT-6637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Add elapsed timestamps and not ready pod tracking to boot metrics
  • Uninstall existing MicroShift before initial disk capture for clean baseline measurements
  • Fix node exporter package and service names for RHEL
  • Replace vnstat network measurement with Prometheus increase() queries
  • Add container image size capture via crio-df.py

Summary by CodeRabbit

  • New Features

  • Added detailed container image storage reporting and a Prometheus-backed network query tool.

  • Improvements

  • Replaced vnStat-based network collection with Prometheus; preserves measurement start/end timestamps.

  • Readiness checks now show elapsed time and list not-ready pods.

  • Detects and cleans prior MicroShift installations; renamed node exporter service/package entries for consistency.

  • Chores

  • Removed default vnstat_db setting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ansible/roles/microshift-start/tasks/main.yml (1)

9-29: ⚠️ Potential issue | 🟠 Major

Fail fast if the cleanup script isn’t found.
If find_cleanup.files is empty, Line 25 can render an empty command and fail opaquely.

✅ Suggested fix
   - name: set microshift_cleanup_bin variable
     set_fact:
       microshift_cleanup_bin: "{{ find_cleanup.files[0].path if find_cleanup.files }}"
+
+  - name: fail if cleanup script not found
+    ansible.builtin.fail:
+      msg: "MicroShift cleanup script not found under /usr/bin."
+    when:
+      - cleanup_microshift | default('true') | bool
+      - microshift_active.stdout | int != 0
+      - find_cleanup.files | length == 0
 
   - name: cleanup microshift data
     become: yes
     ansible.builtin.shell: echo 1 | {{ microshift_cleanup_bin }} --all
   when:
     - cleanup_microshift | default('true') | bool
     - microshift_active.stdout | int != 0
+    - microshift_cleanup_bin is defined
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 9 - 29, The task
assigns microshift_cleanup_bin from find_cleanup.files but doesn’t fail when no
script is found, leading to an empty/opaque shell call; update the logic around
the registered result (find_cleanup) and the set_fact/microshift cleanup task so
that if find_cleanup.files is empty you fail fast with a clear message (e.g.,
using a fail task or conditional) before running the "cleanup microshift data"
shell command; reference the registered variable find_cleanup, the set_fact that
sets microshift_cleanup_bin, and the "cleanup microshift data" task to implement
the check and explicit failure.
♻️ Duplicate comments (4)
ansible/roles/common/tasks/files/prom-network-query.py (2)

266-279: ⚠️ Potential issue | 🟡 Minor

Guard against scientific-notation values from Prometheus.
Line 269 uses int(val), which throws for strings like "1.23e+10"; this can break large counters.

✅ Suggested fix
-        val = int(val)
+        val = int(float(val))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 266 -
279, The loop currently does val = int(val) which fails on scientific-notation
strings like "1.23e+10"; change the conversion to parse as float first and then
to int (e.g. val = int(float(val))) and optionally wrap that conversion in a
try/except to log or skip malformed values; update uses in the loop (prev_val,
resets, delta calculation and print) to rely on the new val numeric type and
ensure fmt_ts(ts) remains unchanged.

147-148: ⚠️ Potential issue | 🟡 Minor

Use UTC in fmt_ts for consistency.
Line 148 currently uses local time, while window math is UTC-based; this silently shifts outputs by timezone.

✅ Suggested fix
-def fmt_ts(epoch):
-    return datetime.fromtimestamp(epoch).strftime("%Y-%m-%d %H:%M:%S")
+def fmt_ts(epoch):
+    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 147 -
148, The fmt_ts function currently formats epoch using local time; change it to
use UTC so outputs align with UTC-based window math—update fmt_ts to convert the
epoch to a UTC datetime (e.g., via datetime.utcfromtimestamp(...) or
datetime.fromtimestamp(..., tz=timezone.utc)) and then strftime the same
"%Y-%m-%d %H:%M:%S" format (or include a trailing Z if you want to mark UTC),
keeping the function name fmt_ts unchanged.
ansible/roles/microshift-start/tasks/main.yml (2)

149-168: ⚠️ Potential issue | 🟡 Minor

Avoid hardcoded zen3 fallback for Prometheus.
Line 161 silently defaults to a lab-specific host, which is brittle in other environments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 149 - 168, The
prom_network_endpoint set_fact uses a hardcoded 'zen3' fallback which should be
removed; update the prom_network_endpoint expression to not reference 'zen3' and
instead fall back to a sensible, environment-agnostic value (for example
inventory_hostname or '127.0.0.1') or fail loudly if no logging host is
found—modify the templated expression that computes prom_network_endpoint (and
optionally prom_network_device/prom_network_instance if related) to prefer
groups['logging'] first, then inventory_hostname or localhost, and avoid any
lab-specific literal.

171-181: ⚠️ Potential issue | 🟠 Major

Use a role/playbook-relative path for prom-network-query.py.
Line 175 depends on the working directory; this can break when playbooks are invoked from repo root.

✅ Example using playbook_dir
-        python3 roles/common/tasks/files/prom-network-query.py
+        python3 {{ playbook_dir }}/roles/common/tasks/files/prom-network-query.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 171 - 181, The
task "run prometheus network query" currently invokes the script
"prom-network-query.py" with a repository-relative path which breaks depending
on the current working directory; change the command to build an absolute path
using Ansible variables (e.g. use playbook_dir or role_path) to point to
prom-network-query.py in the role/playbook files directory, leaving the rest of
the arguments (prom_network_endpoint, prom_network_instance,
prom_network_device, network_start_epoch.stdout, network_end_epoch.stdout)
unchanged so the task reliably finds the script regardless of where the playbook
is invoked.
🧹 Nitpick comments (1)
ansible/roles/microshift-start/tasks/main.yml (1)

125-133: Prefer a stable path for the local script source.
Line 130 relies on the current working directory; using playbook_dir or role_path is more robust.

✅ Example using playbook_dir
-        src: roles/common/tasks/files/crio-df.py
+        src: "{{ playbook_dir }}/roles/common/tasks/files/crio-df.py"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 125 - 133, The
copy task "copy crio-df script to remote host" currently uses a relative src
path which depends on the CWD; change the src to a stable variable-based path
(e.g. use role_path or playbook_dir) so the file is resolved reliably regardless
of where the playbook is executed, update the ansible.builtin.copy task that
references src: roles/common/tasks/files/crio-df.py to use "{{ role_path }}/..."
or "{{ playbook_dir }}/..." accordingly and leave dest and mode unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@ansible/roles/microshift-start/tasks/main.yml`:
- Around line 9-29: The task assigns microshift_cleanup_bin from
find_cleanup.files but doesn’t fail when no script is found, leading to an
empty/opaque shell call; update the logic around the registered result
(find_cleanup) and the set_fact/microshift cleanup task so that if
find_cleanup.files is empty you fail fast with a clear message (e.g., using a
fail task or conditional) before running the "cleanup microshift data" shell
command; reference the registered variable find_cleanup, the set_fact that sets
microshift_cleanup_bin, and the "cleanup microshift data" task to implement the
check and explicit failure.

---

Duplicate comments:
In `@ansible/roles/common/tasks/files/prom-network-query.py`:
- Around line 266-279: The loop currently does val = int(val) which fails on
scientific-notation strings like "1.23e+10"; change the conversion to parse as
float first and then to int (e.g. val = int(float(val))) and optionally wrap
that conversion in a try/except to log or skip malformed values; update uses in
the loop (prev_val, resets, delta calculation and print) to rely on the new val
numeric type and ensure fmt_ts(ts) remains unchanged.
- Around line 147-148: The fmt_ts function currently formats epoch using local
time; change it to use UTC so outputs align with UTC-based window math—update
fmt_ts to convert the epoch to a UTC datetime (e.g., via
datetime.utcfromtimestamp(...) or datetime.fromtimestamp(..., tz=timezone.utc))
and then strftime the same "%Y-%m-%d %H:%M:%S" format (or include a trailing Z
if you want to mark UTC), keeping the function name fmt_ts unchanged.

In `@ansible/roles/microshift-start/tasks/main.yml`:
- Around line 149-168: The prom_network_endpoint set_fact uses a hardcoded
'zen3' fallback which should be removed; update the prom_network_endpoint
expression to not reference 'zen3' and instead fall back to a sensible,
environment-agnostic value (for example inventory_hostname or '127.0.0.1') or
fail loudly if no logging host is found—modify the templated expression that
computes prom_network_endpoint (and optionally
prom_network_device/prom_network_instance if related) to prefer
groups['logging'] first, then inventory_hostname or localhost, and avoid any
lab-specific literal.
- Around line 171-181: The task "run prometheus network query" currently invokes
the script "prom-network-query.py" with a repository-relative path which breaks
depending on the current working directory; change the command to build an
absolute path using Ansible variables (e.g. use playbook_dir or role_path) to
point to prom-network-query.py in the role/playbook files directory, leaving the
rest of the arguments (prom_network_endpoint, prom_network_instance,
prom_network_device, network_start_epoch.stdout, network_end_epoch.stdout)
unchanged so the task reliably finds the script regardless of where the playbook
is invoked.

---

Nitpick comments:
In `@ansible/roles/microshift-start/tasks/main.yml`:
- Around line 125-133: The copy task "copy crio-df script to remote host"
currently uses a relative src path which depends on the CWD; change the src to a
stable variable-based path (e.g. use role_path or playbook_dir) so the file is
resolved reliably regardless of where the playbook is executed, update the
ansible.builtin.copy task that references src:
roles/common/tasks/files/crio-df.py to use "{{ role_path }}/..." or "{{
playbook_dir }}/..." accordingly and leave dest and mode unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 836c547 and ec77ae3.

📒 Files selected for processing (4)
  • ansible/roles/common/tasks/files/crio-df.py
  • ansible/roles/common/tasks/files/prom-network-query.py
  • ansible/roles/microshift-start/defaults/main.yml
  • ansible/roles/microshift-start/tasks/main.yml
💤 Files with no reviewable changes (1)
  • ansible/roles/microshift-start/defaults/main.yml

@sjug sjug force-pushed the ansible_improvements branch from ec77ae3 to f77d12d Compare February 24, 2026 17:28
@openshift-ci-robot

openshift-ci-robot commented Feb 24, 2026

@sjug: This pull request references USHIFT-6637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Add elapsed timestamps and not ready pod tracking to boot metrics
  • Uninstall existing MicroShift before initial disk capture for clean baseline measurements
  • Fix node exporter package and service names for RHEL
  • Replace vnstat network measurement with Prometheus increase() queries
  • Add container image size capture via crio-df.py

Summary by CodeRabbit

  • New Features

  • Detailed container image storage reporting and a Prometheus-backed network query tool.

  • Improvements

  • Replaced vnStat-based network collection with Prometheus; records measurement start/end timestamps.

  • Readiness checks now include elapsed time and list not-ready pods.

  • Detects and cleans prior MicroShift installations; standardizes node_exporter service/package name.

  • Chores

  • Removed default vnstat_db setting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (6)
ansible/roles/common/tasks/files/prom-network-query.py (2)

154-155: ⚠️ Potential issue | 🟡 Minor

fmt_ts still uses local time while the rest of the script uses UTC.

datetime.fromtimestamp(epoch) returns local time; epochs are computed in UTC. Displayed timestamps will shift depending on the host's timezone.

Proposed fix
 def fmt_ts(epoch):
-    return datetime.fromtimestamp(epoch).strftime("%Y-%m-%d %H:%M:%S")
+    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 154 -
155, The helper fmt_ts uses datetime.fromtimestamp(epoch) which yields local
time; change it to produce UTC timestamps instead by using a UTC-aware
conversion (e.g., datetime.utcfromtimestamp or datetime.fromtimestamp with
timezone.utc) so that epoch values are formatted consistently with the rest of
the script; update the fmt_ts function to convert the epoch to UTC before
calling strftime("%Y-%m-%d %H:%M:%S").
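The one-line diff above can be exercised standalone; a minimal sketch of the UTC-aware variant (suffix `UTC` added here for clarity, as in the proposed fix):

```python
from datetime import datetime, timezone

def fmt_ts(epoch):
    # Format in UTC so displayed timestamps match the UTC-based window math,
    # regardless of the host's local timezone.
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

print(fmt_ts(0))  # 1970-01-01 00:00:00 UTC on any host
```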

275-276: ⚠️ Potential issue | 🟠 Major

int(val) crashes on scientific-notation strings from Prometheus.

Prometheus counter values can arrive as e.g. "1.23456789e+10". int("1.23456789e+10") raises ValueError. Still unfixed from prior review.

Proposed fix
     for ts, val in values:
-        val = int(val)
+        val = int(float(val))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 275 -
276, The code crashes when converting Prometheus scientific-notation strings
because it does val = int(val) in the loop "for ts, val in values"; change the
conversion to first parse as a float (e.g. int(float(val))) and wrap it in a
small try/except to catch ValueError/TypeError so malformed values are
logged/skipped instead of crashing; update the conversion at the "for ts, val in
values" site and ensure any downstream logic expects an int after this safe
conversion.
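For reference, the float-then-int conversion with the suggested try/except guard can be sketched standalone (the helper name `to_int` is illustrative, not from the script):

```python
def to_int(val):
    """Parse a Prometheus sample value; counters may arrive as "1.23e+10"."""
    try:
        return int(float(val))
    except (ValueError, TypeError):
        # Malformed sample: let the caller log and skip it rather than crash.
        return None

print(to_int("1.23456789e+10"))  # 12345678900
```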
ansible/roles/microshift-start/tasks/main.yml (2)

171-182: ⚠️ Potential issue | 🟡 Minor

Relative script path still fragile — depends on CWD being ansible/.

Previously flagged. roles/common/tasks/files/prom-network-query.py resolves relative to the playbook directory, not the role. Consider copying the script to the controller temp dir or using an absolute path via {{ playbook_dir }}.

Proposed fix
     - name: run prometheus network query
       ansible.builtin.command: >
-        python3 roles/common/tasks/files/prom-network-query.py
+        python3 {{ playbook_dir }}/roles/common/tasks/files/prom-network-query.py
         --prometheus {{ prom_network_endpoint }}
         --instance {{ prom_network_instance }}
         --device {{ prom_network_device }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 171 - 182, The
task "run prometheus network query" uses a relative script path
"roles/common/tasks/files/prom-network-query.py" which is fragile; update the
task so the controller runs an explicit absolute path or a copied temp file:
either change the command to use "{{ playbook_dir
}}/roles/common/tasks/files/prom-network-query.py" (keep delegate_to: localhost
and register: prom_network) or add a prior step that uses ansible.builtin.copy
(or fetch) to copy that script to a controller temp location (e.g.,
/tmp/prom-network-query.py) and run the copied file from that absolute path in
the ansible.builtin.command invocation.

149-169: ⚠️ Potential issue | 🟡 Minor

Hardcoded 'zen3' fallback is still present.

Previously flagged: if no logging group and no prometheus_query_endpoint override, the endpoint silently resolves to http://zen3:9091 which will fail opaquely in non-lab environments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/microshift-start/tasks/main.yml` around lines 149 - 169, The
prom_network_endpoint resolution currently falls back to a hardcoded 'zen3' when
there is no logging group and no prometheus_query_endpoint; replace that
hardcoded 'zen3' fallback in the prom_network_endpoint expression with a safe
alternative (for example inventory_hostname or an explicit failure) so the
playbook does not silently produce an unusable URL — update the logic around
prom_network_endpoint (the block referencing prometheus_query_endpoint,
groups['logging'], and 'zen3') to either use inventory_hostname (or
ansible_default_ipv4.address) as the default host or call ansible.builtin.fail
with a clear message requiring prometheus_query_endpoint when logging group is
absent.
ansible/roles/common/tasks/files/crio-df.py (2)

59-95: Past bare-except issue addressed; except Exception is fine here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/crio-df.py` around lines 59 - 95, The
function format_time_ago currently catches all exceptions with "except
Exception" and the reviewer approves this approach, so leave format_time_ago
unchanged; do not revert to a bare except or add more granular exception
handling here.

284-335: Past lint issues (f-string placeholders, ambiguous l) are resolved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/crio-df.py` around lines 284 - 335, The
updated verbose and summary printing blocks no longer contain the previous
f-string placeholder/l ambiguity and look correct; no code changes required—keep
the current print blocks (the verbose branch that iterates over image_data,
volatile_layers and comp_per_image and the summary branch that uses total_size,
total_reclaimable and reclaimable_pct) and ensure format_size and the variables
layer_count/comp_deduplicated remain defined upstream so these print lines
(including the compression ratio expression total_size / comp_deduplicated) do
not raise runtime errors.
🧹 Nitpick comments (1)
ansible/roles/common/tasks/files/prom-network-query.py (1)

296-309: Consider using null instead of 0 when increase data is absent.

Lines 303–304: reporting 0 when rx_increase is None makes "no data" indistinguishable from "zero transfer" for downstream consumers.

Proposed fix
-        "rx_bytes": int(rx_increase) if rx_increase is not None else 0,
-        "tx_bytes": int(tx_increase) if tx_increase is not None else 0,
+        "rx_bytes": int(rx_increase) if rx_increase is not None else None,
+        "tx_bytes": int(tx_increase) if tx_increase is not None else None,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/common/tasks/files/prom-network-query.py` around lines 296 -
309, The summary currently sets "rx_bytes" and "tx_bytes" to 0 when rx_increase
or tx_increase is None, which hides "no data" vs "zero transfer"; update the
summary construction (the dict assigned to summary) to set "rx_bytes":
int(rx_increase) if rx_increase is not None else None and "tx_bytes":
int(tx_increase) if tx_increase is not None else None so that json.dumps will
emit JSON null for missing data; keep all other fields and behavior the same.
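A standalone sketch of the None-sentinel behavior: json.dumps renders Python None as JSON null, which downstream consumers can tell apart from a genuine zero-byte transfer.

```python
import json

# rx_increase is None when the increase() query returned no samples.
rx_increase = None
tx_increase = 2048.0

summary = {
    "rx_bytes": int(rx_increase) if rx_increase is not None else None,
    "tx_bytes": int(tx_increase) if tx_increase is not None else None,
}
print(json.dumps(summary))  # {"rx_bytes": null, "tx_bytes": 2048}
```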
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ansible/roles/common/tasks/files/crio-df.py`:
- Around line 130-138: The current parsing uses name.rsplit(":", 1) which
misinterprets registry ports as tags; after removing any "@sha256:" suffix, when
you split on the last ":" check the right-side segment (parts[1]) for a "/" — if
it contains "/" then it is a port (no tag) so return (name, "latest"); otherwise
treat it as a tag and return (parts[0], parts[1]). Update the logic around the
rsplit usage (the block handling ":" in name) to perform this extra "/" check
before deciding to return the split as repository and tag.
- Around line 215-218: The current size selection uses "or" chaining which
treats 0 as falsy and wrongfully falls through; in the loop over layer_count
(using layers_by_id and total_size) replace the or-chained lookup for
"diff-size" with a None-sentinel check: fetch diff = layer.get("diff-size",
None) and if diff is None use layer.get("uncompress_size", 0), assigning the
resulting integer to size before adding to total_size; apply the same
None-sentinel change to the similar lookup at the other occurrence (line 246) so
explicit zero sizes are preserved.
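Both crio-df.py findings above reduce to small parsing rules; a standalone sketch under the review's assumptions (helper names `split_repo_tag` and `layer_size` are illustrative, not taken from the script):

```python
def split_repo_tag(name):
    """Split an image reference into (repository, tag) without
    mistaking a registry port for a tag."""
    name = name.split("@sha256:")[0]  # drop any digest suffix first
    if ":" in name:
        repo, candidate = name.rsplit(":", 1)
        if "/" in candidate:
            # e.g. "quay.io:5000/app": the colon belongs to a registry port.
            return name, "latest"
        return repo, candidate
    return name, "latest"

def layer_size(layer):
    """Prefer "diff-size" but keep an explicit 0; "or" chaining would not."""
    diff = layer.get("diff-size", None)
    if diff is None:
        return layer.get("uncompress_size", 0)
    return diff
```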

---

Duplicate comments:
In `@ansible/roles/common/tasks/files/crio-df.py`:
- Around line 59-95: The function format_time_ago currently catches all
exceptions with "except Exception" and the reviewer approves this approach, so
leave format_time_ago unchanged; do not revert to a bare except or add more
granular exception handling here.
- Around line 284-335: The updated verbose and summary printing blocks no longer
contain the previous f-string placeholder/l ambiguity and look correct; no code
changes required—keep the current print blocks (the verbose branch that iterates
over image_data, volatile_layers and comp_per_image and the summary branch that
uses total_size, total_reclaimable and reclaimable_pct) and ensure format_size
and the variables layer_count/comp_deduplicated remain defined upstream so these
print lines (including the compression ratio expression total_size /
comp_deduplicated) do not raise runtime errors.

In `@ansible/roles/common/tasks/files/prom-network-query.py`:
- Around line 154-155: The helper fmt_ts uses datetime.fromtimestamp(epoch)
which yields local time; change it to produce UTC timestamps instead by using a
UTC-aware conversion (e.g., datetime.utcfromtimestamp or datetime.fromtimestamp
with timezone.utc) so that epoch values are formatted consistently with the rest
of the script; update the fmt_ts function to convert the epoch to UTC before
calling strftime("%Y-%m-%d %H:%M:%S").
- Around line 275-276: The code crashes when converting Prometheus
scientific-notation strings because it does val = int(val) in the loop "for ts,
val in values"; change the conversion to first parse as a float (e.g.
int(float(val))) and wrap it in a small try/except to catch ValueError/TypeError
so malformed values are logged/skipped instead of crashing; update the
conversion at the "for ts, val in values" site and ensure any downstream logic
expects an int after this safe conversion.

In `@ansible/roles/microshift-start/tasks/main.yml`:
- Around line 171-182: The task "run prometheus network query" uses a relative
script path "roles/common/tasks/files/prom-network-query.py" which is fragile;
update the task so the controller runs an explicit absolute path or a copied
temp file: either change the command to use "{{ playbook_dir
}}/roles/common/tasks/files/prom-network-query.py" (keep delegate_to: localhost
and register: prom_network) or add a prior step that uses ansible.builtin.copy
(or fetch) to copy that script to a controller temp location (e.g.,
/tmp/prom-network-query.py) and run the copied file from that absolute path in
the ansible.builtin.command invocation.
- Around line 149-169: The prom_network_endpoint resolution currently falls back
to a hardcoded 'zen3' when there is no logging group and no
prometheus_query_endpoint; replace that hardcoded 'zen3' fallback in the
prom_network_endpoint expression with a safe alternative (for example
inventory_hostname or an explicit failure) so the playbook does not silently
produce an unusable URL — update the logic around prom_network_endpoint (the
block referencing prometheus_query_endpoint, groups['logging'], and 'zen3') to
either use inventory_hostname (or ansible_default_ipv4.address) as the default
host or call ansible.builtin.fail with a clear message requiring
prometheus_query_endpoint when logging group is absent.

---

Nitpick comments:
In `@ansible/roles/common/tasks/files/prom-network-query.py`:
- Around line 296-309: The summary currently sets "rx_bytes" and "tx_bytes" to 0
when rx_increase or tx_increase is None, which hides "no data" vs "zero
transfer"; update the summary construction (the dict assigned to summary) to set
"rx_bytes": int(rx_increase) if rx_increase is not None else None and
"tx_bytes": int(tx_increase) if tx_increase is not None else None so that
json.dumps will emit JSON null for missing data; keep all other fields and
behavior the same.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between ec77ae3 and f77d12d.

📒 Files selected for processing (4)
  • ansible/roles/common/tasks/files/crio-df.py
  • ansible/roles/common/tasks/files/prom-network-query.py
  • ansible/roles/microshift-start/defaults/main.yml
  • ansible/roles/microshift-start/tasks/main.yml
💤 Files with no reviewable changes (1)
  • ansible/roles/microshift-start/defaults/main.yml

@sjug sjug force-pushed the ansible_improvements branch 2 times, most recently from 907298e to 6989217 Compare February 24, 2026 18:06
@sjug sjug force-pushed the ansible_improvements branch from 6989217 to 47823b5 Compare February 24, 2026 18:34
@ggiguash
Contributor

/lgtm
/verified by @sjug

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 26, 2026
@openshift-ci-robot

@ggiguash: This PR has been marked as verified by @sjug.

Details

In response to this:

/lgtm
/verified by @sjug


@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggiguash, sjug

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 26, 2026
@ggiguash
Contributor

/test verify-deps

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@sjug: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 380d437 into openshift:main Feb 26, 2026
7 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


3 participants