NO-JIRA: add Loki readiness check and restart logic to observability tests#6589
Conversation
|
@agullon: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Loki readiness gate: a shell Changes
Sequence DiagramsequenceDiagram
participant Suite as Test Suite
participant Keyword as Robot Keyword
participant Check as Python: check_loki_ready
participant HTTP as Loki HTTP /ready
participant Manage as ./bin/manage_loki.sh
Suite->>Keyword: Ensure Loki Is Ready
loop Retry (up to 5 attempts)
Keyword->>Check: Check Loki Ready (host, port)
Check->>HTTP: GET /ready
alt HTTP 200
HTTP-->>Check: 200 OK
Check-->>Keyword: Ready
Keyword-->>Suite: Ready (exit)
else Error / non-200
HTTP-->>Check: non-200 / error
Check-->>Keyword: Not ready
Keyword->>Manage: ./bin/manage_loki.sh restart <port>
Manage-->>Keyword: stopped & started
Keyword->>Keyword: Sleep 2s
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/bin/manage_loki.sh`:
- Around line 57-61: The restart flow in action_restart should handle a missing
"loki" container gracefully: before calling podman restart, check whether the
container exists (e.g., with podman container exists "${container_name}" or an
equivalent podman ps -a filter) and if it does not exist, log a message and
return success; if it does exist, run podman restart "${container_name}" and
capture its exit status, logging a non-fatal warning on failure instead of
letting a non-zero exit break the suite. Ensure the changes are made inside the
action_restart function and reference the local variable container_name and the
podman restart invocation.
In `@test/suites/optional/observability.robot`:
- Around line 96-104: The retry loop using FOR IN RANGE 5 restarts Loki on every
failed check but on the final iteration it restarts and immediately fails
without verifying post-restart readiness; update the loop around the Check Loki
Ready keyword to perform a final Check Loki Ready after any restart (including
the last attempt) and replace the fixed Sleep 2s with a short backoff strategy
(e.g., incremental sleep based on ${attempt} or a slightly larger fixed delay)
before the post-restart check so that Local Command Should Work
./bin/manage_loki.sh restart is followed by a verified readiness via Check Loki
Ready ${LOKI_HOST} ${LOKI_PORT} instead of failing immediately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 805e3ac9-3cba-41da-a8ac-efd08738d3c3
📒 Files selected for processing (3)
test/bin/manage_loki.shtest/resources/loki.pytest/suites/optional/observability.robot
8d7118a to
827448f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/suites/optional/observability.robot`:
- Line 75: The readiness test "Ensure Loki Is Ready" can start probes before
`${LOKI_PORT}` is set; update the "Check Required Observability Variables"
keyword to validate `${LOKI_PORT}` (in addition to `${LOKI_HOST}`) and fail fast
when it's missing or empty so the readiness probe doesn't run on a
misconfiguration; locate the "Check Required Observability Variables" keyword
and add a presence/assertion for `${LOKI_PORT}` (or call Robot's Fail with a
clear message) before any readiness checks are invoked.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 00e08fca-3e37-469e-898f-f78f58f4b345
📒 Files selected for processing (1)
test/suites/optional/observability.robot
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/suites/optional/observability.robot (1)
96-100: Consider 1-based attempt numbers in logs.Current logs print attempts as
0..4. Using1..5makes failure output easier to read.Suggested tweak
- FOR ${attempt} IN RANGE 5 + FOR ${attempt} IN RANGE 1 6 @@ - Log Loki is not ready (attempt ${attempt}), restarting container console=True + Log Loki is not ready (attempt ${attempt}/5), restarting container console=True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/suites/optional/observability.robot` around lines 96 - 100, The loop currently logs attempts as 0..4; change it to 1..5 by updating the FOR header to "FOR ${attempt} IN RANGE 1 6" (so ${attempt} is 1-based) or alternatively compute a 1-based display value in the Log call via "Log Loki is not ready (attempt ${${attempt + 1}}), restarting container console=True"; prefer the FOR change and update the FOR line that wraps the Check Loki Ready / Log calls to use IN RANGE 1 6 so logs show 1..5.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/resources/loki.py`:
- Around line 71-78: The readiness URL construction in check_loki_ready(host:
str, port: int) fails for IPv6 literals; modify the function to detect IPv6
addresses (e.g., using the ipaddress module or by checking for ':' in host and
not already bracketed) and wrap the host in square brackets when building the
URL (i.e., use "[host]" for IPv6) so the f"http://{host}:{port}/ready" becomes
f"http://{bracketed_host}:{port}/ready"; keep logging and error behavior
unchanged.
---
Nitpick comments:
In `@test/suites/optional/observability.robot`:
- Around line 96-100: The loop currently logs attempts as 0..4; change it to
1..5 by updating the FOR header to "FOR ${attempt} IN RANGE 1 6" (so
${attempt} is 1-based) or alternatively compute a 1-based display value in the
Log call via "Log Loki is not ready (attempt ${${attempt + 1}}), restarting
container console=True"; prefer the FOR change and update the FOR line that
wraps the Check Loki Ready / Log calls to use IN RANGE 1 6 so logs show 1..5.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bf3f88b7-38ab-4a37-a7b0-45730f8ce89a
📒 Files selected for processing (3)
test/bin/manage_loki.shtest/resources/loki.pytest/suites/optional/observability.robot
🚧 Files skipped from review as they are similar to previous changes (1)
- test/bin/manage_loki.sh
827448f to
12d2766
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/suites/optional/observability.robot (1)
98-103: Use 1-based attempt numbers in logs for readability.Current logs print attempt
0..4, which is a bit confusing during triage.Proposed tweak
FOR ${attempt} IN RANGE 5 + ${attempt_no} Evaluate ${attempt} + 1 ${status} Run Keyword And Return Status ... Check Loki Ready ${LOKI_HOST} ${LOKI_PORT} IF ${status} RETURN - Log Loki is not ready (attempt ${attempt}), restarting container console=True + Log Loki is not ready (attempt ${attempt_no}/5), restarting container console=True Local Command Should Work ./bin/manage_loki.sh restart Sleep 2s END🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/suites/optional/observability.robot` around lines 98 - 103, The loop logs zero-based ${attempt} values; change the log to use a 1-based display by computing a display variable (e.g. ${attempt_display} = Evaluate ${attempt} + 1) inside the FOR and then use ${attempt_display} in the Log message and any user-facing output; update the block around the FOR / Check Loki Ready / Log / Local Command Should Work sequence to create and use ${attempt_display} instead of raw ${attempt} so logs read 1..5 instead of 0..4.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/suites/optional/observability.robot`:
- Around line 98-103: The loop logs zero-based ${attempt} values; change the log
to use a 1-based display by computing a display variable (e.g.
${attempt_display} = Evaluate ${attempt} + 1) inside the FOR and then use
${attempt_display} in the Log message and any user-facing output; update the
block around the FOR / Check Loki Ready / Log / Local Command Should Work
sequence to create and use ${attempt_display} instead of raw ${attempt} so logs
read 1..5 instead of 0..4.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 43fec769-bf11-4b5d-b8bd-a88745404856
📒 Files selected for processing (3)
test/bin/manage_loki.shtest/resources/loki.pytest/suites/optional/observability.robot
|
/retest |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/suites/optional/observability.robot (1)
98-103: Use 1-based attempt numbers in the log message.
IN RANGE 5logs attempts as0..4while the failure text says “5 attempts”. Switching toIN RANGE 1 6makes diagnostics clearer.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/suites/optional/observability.robot` around lines 98 - 103, The loop using "FOR ${attempt} IN RANGE 5" produces 0-based attempt numbers in the Log; change the FOR loop range to "IN RANGE 1 6" so ${attempt} is 1-based (1..5) and keep the existing Log line ("Loki is not ready (attempt ${attempt}), restarting container") and the existing keywords (Check Loki Ready, Local Command Should Work) unchanged so diagnostics match the "5 attempts" message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/resources/loki.py`:
- Around line 96-110: There are two issues: an unclosed try block around the
requests.get call and an IPv6 regression using raw host in the query_loki call.
Close the first try with an appropriate except/except Exception as e:/finally
block that handles request errors (the try that wraps requests.get), then move
the Robot Framework hostname-detection try/except out of that first try (so it
is a separate try block) to fix the SyntaxError; finally, update the query_loki
invocation to use the IPv6-normalized address variable
(query_loki(f"http://{address}:{port}", ...)) instead of raw host so IPv6
literal URLs remain correct.
---
Nitpick comments:
In `@test/suites/optional/observability.robot`:
- Around line 98-103: The loop using "FOR ${attempt} IN RANGE 5" produces
0-based attempt numbers in the Log; change the FOR loop range to "IN RANGE 1
6" so ${attempt} is 1-based (1..5) and keep the existing Log line ("Loki is not
ready (attempt ${attempt}), restarting container") and the existing keywords
(Check Loki Ready, Local Command Should Work) unchanged so diagnostics match the
"5 attempts" message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e8547614-5e3e-480c-b003-74745a427192
📒 Files selected for processing (3)
test/bin/manage_loki.shtest/resources/loki.pytest/suites/optional/observability.robot
255fa1a to
e12059e
Compare
…tests Add a Loki readiness check (/ready endpoint) to the observability test suite setup to detect when Loki's ingester has entered a shutdown state. Automatically restart the Loki container if unhealthy, polling readiness every 5s for up to 30s per attempt (3 attempts max) to cover the ingester's ~15s warmup window. - Add check_loki_ready() to loki.py with proper error handling - Add restart action to manage_loki.sh script - Fix IPv6 address handling in check_loki_query - Split broad except Exception in check_loki_query into specific handlers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> pre-commit.check-secrets: ENABLED
e12059e to
8ab7329
Compare
|
/retest |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: agullon, ggiguash The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by CI |
|
@agullon: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherrypick release-4.22 |
|
@agullon: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@agullon: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@agullon: new pull request created: #6598 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
/readyendpoint) to the observability test suite setup to detect when Loki's ingester has entered a shutdown staterestartaction tomanage_loki.shscriptcheck_loki_queryto matchcheck_loki_ready(wrap with brackets)except Exceptionincheck_loki_queryintoImportError(expected outside Robot Framework) vs unexpected errors with proper loggingLOKI_PORTvalidation to required observability variablesFollow-up to #6578 which addressed the retry logic but didn't handle the root cause of Loki's ingester becoming unresponsive over time (HTTP 503 on writes).
Test plan
el98-lrel@optionalobservability Loki tests pass consistently🤖 Generated with Claude Code