LCORE-1497: Fix disruption flag not reset when Prow lightspeed restart restores llama-stack #1628
Conversation
Walkthrough: This PR raises vLLM ServingRuntime limits (max-model-len) and adjusts E2E testing and vLLM infrastructure.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 5 passed.
/retest
35564e8 to 2fe6b0c (compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/features/environment.py`:
- Around line 236-237: The current failure check compares scenario.status to the
string "failed" which never matches the behave Status enum; import the Status
enum (e.g., from behave.model import Status) and change the condition to compare
against Status.failed (use scenario.status != Status.failed or scenario.status
== Status.failed depending on the intended early return) so the log-dump path
runs when Scenario.status is a failed Status; update the import and replace the
string literal comparison in the function/block that contains scenario.status.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0b196e9c-77a4-49c2-882d-7e731f5b48ad
📒 Files selected for processing (4)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/scripts/e2e-ops.sh
- tests/e2e/features/environment.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests
Files:
tests/e2e/features/environment.py
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing with Gherkin feature files
Files:
tests/e2e/features/environment.py
🧠 Learnings (3)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.
Applied to files:
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
tests/e2e/features/environment.py
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
tests/e2e/features/environment.py
🔇 Additional comments (4)
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
27-27: Looks good — runtime limit bump is consistent. This isolated --max-model-len update is clean and aligns with the corresponding GPU runtime change.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)
27-27: LGTM — GPU manifest matches the CPU runtime adjustment. No issues with this parameter update.
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)
195-198: Good fix for transient disruption windows. Treating 503 as a valid liveness signal for /readiness is appropriate here because /v1/models is still used to validate full app readiness.
tests/e2e/features/environment.py (1)
283-286: Prow-only failure diagnostics hook placement looks good. Calling pod-log dumping at the start of after_scenario is a good choice for preserving failure context.
if scenario.status != "failed":
    return
🧩 Analysis chain
🌐 Web query:
In behave (>=1.2.6), is Scenario.status a behave.model_core.Status enum, and should failure checks use Status.failed instead of the string "failed"?
💡 Result:
Yes, in behave (>=1.2.6), Scenario.status is a behave.model_core.Status enum value [1][2][3]. Prior to 1.2.6, it was a string, but it changed to use the Status enum class [2]. Failure checks should use Status.failed (after importing from behave.model_core import Status) instead of the string "failed" for proper type-safe comparison [2]. Official docs confirm Status.failed represents assert-mismatch failures on scenarios [1][4], and source code uses Status.failed directly [5][6].
Citations:
- 1: https://behave.readthedocs.io/en/latest/appendix.status/
- 2: https://stackoverflow.com/questions/35284828/how-to-get-the-status-of-the-scenario-ran-in-behave
- 3: https://behave.readthedocs.io/en/latest/new_and_noteworthy_v1.2.6/
- 4: https://behave.readthedocs.io/en/stable/appendix.status/
- 5: https://github.com/behave/behave/blob/main/behave/formatter/progress.py
- 6: feature.status is Status.failed in before_scenario() hook while all my scenarios passed behave/behave#1068
🏁 Script executed:
# First, let's locate and read the environment.py file around lines 236-237
fd -type f -name "environment.py" | grep "e2e"
Repository: lightspeed-core/lightspeed-stack
Length of output: 247
🏁 Script executed:
# Check the content around lines 234-240
head -250 tests/e2e/features/environment.py | tail -20
Repository: lightspeed-core/lightspeed-stack
Length of output: 862
🏁 Script executed:
# Search for behave version in dependencies
fd -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" \) | xargs grep -l "behave" 2>/dev/null
Repository: lightspeed-core/lightspeed-stack
Length of output: 249
🏁 Script executed:
# Check if Status is already imported in the file
grep -n "Status\|from behave" tests/e2e/features/environment.py
Repository: lightspeed-core/lightspeed-stack
Length of output: 238
🏁 Script executed:
# Search for behave in various dependency files
find . -maxdepth 3 \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" -o -name "Pipfile" \) -exec grep -l "behave" {} \;
Repository: lightspeed-core/lightspeed-stack
Length of output: 93
🏁 Script executed:
# Also check if there's any version constraint for behave
rg "behave" -A 1 -B 1 --type py | head -30
Repository: lightspeed-core/lightspeed-stack
Length of output: 1565
🏁 Script executed:
# Check the behave version requirement in pyproject.toml
cat pyproject.toml | grep -A 5 -B 5 "behave"
Repository: lightspeed-core/lightspeed-stack
Length of output: 522
Import and use Status enum for failure check in Scenario.status comparison.
The repository requires behave>=1.3.0, which uses Status enum values instead of string literals for Scenario.status. The string comparison scenario.status != "failed" will always evaluate to True (since Status.failed enum will never equal the string "failed"), causing the function to return early and never dump logs on failure.
Fix
 from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
+from behave.model_core import Status
 from behave.model import Feature, Scenario
 from behave.runner import Context

 def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
     """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
-    if scenario.status != "failed":
+    if scenario.status != Status.failed:
         return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/features/environment.py` around lines 236 - 237, The current
failure check compares scenario.status to the string "failed" which never
matches the behave Status enum; import the Status enum (e.g., from behave.model
import Status) and change the condition to compare against Status.failed (use
scenario.status != Status.failed or scenario.status == Status.failed depending
on the intended early return) so the log-dump path runs when Scenario.status is
a failed Status; update the import and replace the string literal comparison in
the function/block that contains scenario.status.
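For reference, a minimal self-contained sketch of the guard the review is asking for — the function name, docstring, and Status import are taken from the proposed fix above, while the log-collection body is elided and only hinted at here:

```python
from behave.model_core import Status
from behave.model import Scenario


def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
    """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
    # scenario.status is a Status enum member on the behave version this repo
    # pins, so compare against Status.failed rather than the string "failed".
    if scenario.status != Status.failed:
        return
    # ... collect and print pod logs for `namespace` here ...
```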
…lama-stack In Prow, e2e-ops restart-lightspeed ensures llama-stack is running first. This implicitly restores a previously disrupted llama-stack, but the disrupt-once tracking flag was only reset for explicit llama-stack restarts. Subsequent disruption steps were skipped, causing tests to see 200 instead of 503. Reset the tracking flag for lightspeed-stack restarts in Prow as well. test_list.txt temporarily scoped to disruption feature only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…step The _llama_stack_disrupt_once flag assumed llama-stack stays disrupted across scenarios, but restart-lightspeed in Prow restores llama-stack as a prerequisite, making the flag stale. Each disruption step now checks actual container state, which is idempotent and correct. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
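As a rough illustration of "check actual container state" (a sketch only, not the repository's step code — the selector, namespace handling, and helper name are hypothetical):

```python
import subprocess


def llama_stack_container_running(namespace: str, selector: str = "app=llama-stack") -> bool:
    """Illustrative: ask the cluster whether a llama-stack container is currently running."""
    # Hypothetical oc invocation; the real step may use a different selector
    # or go through the e2e-ops.sh wrapper instead.
    result = subprocess.run(
        [
            "oc", "get", "pods", "-n", namespace, "-l", selector,
            "-o", "jsonpath={.items[*].status.containerStatuses[*].state.running}",
        ],
        capture_output=True, text=True, check=False,
    )
    # Non-empty output means at least one matching container reports a running
    # state; empty output means it is already disrupted.
    return bool(result.stdout.strip())
```

Deciding from live state like this is idempotent: re-running the disruption step after a Prow-side restart simply sees a running container and disrupts it again, with no module-level flag to go stale.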
…zation - Dump last 200 lines of llama-stack and lightspeed-stack pod logs plus pod status when Prow e2e tests fail, to aid CI debugging. - Remove fragile _llama_stack_disrupt_once flag from disruption step; each scenario now checks actual container state (idempotent). - Revert restart_container Prow guard to llama-stack only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
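A sketch of what such a diagnostic dump could look like, assuming `oc` is available to the test run and that app labels match the deployment names — both assumptions, not taken from the repository:

```python
import subprocess


def _dump_logs(namespace: str, selector: str, tail: int = 200) -> None:
    """Illustrative: print pod status and the last `tail` log lines for a label selector."""
    for cmd in (
        ["oc", "get", "pods", "-n", namespace, "-l", selector, "-o", "wide"],
        ["oc", "logs", "-n", namespace, "-l", selector, f"--tail={tail}"],
    ):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(proc.stdout or proc.stderr)


# e.g. _dump_logs(ns, "app=llama-stack") and _dump_logs(ns, "app=lightspeed-stack")
```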
A10G GPU has 23GB; after loading Llama-3.1-8B weights (~15.7GB), only 4.39GB remains for KV cache — max supported context is 35936. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
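A quick sanity check of that arithmetic (the KV-cache shape values below are the published Llama-3.1-8B configuration — 32 layers, 8 KV heads, head dim 128 — assumed here, with fp16 cache entries):

```python
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, 2 bytes each (fp16)
kv_budget = 4.39 * 1024**3                                # ~4.39 GiB left after weights

print(bytes_per_token)                    # 131072 bytes (128 KiB) per cached token
print(int(kv_budget / bytes_per_token))   # ~35963 tokens, in line with the 35936 cap
```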
A 503 from /readiness means the app responded (port-forward works) but Llama Stack handshake hasn't completed yet. Previously this was treated as a dead port-forward, exhausting all retries without ever checking /v1/models. Now the retry loop proceeds to the /v1/models gate, giving the app time to fully initialize. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
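The described flow, sketched in Python for clarity (the real logic lives in e2e-ops.sh; the port, retry counts, and helper name here are illustrative):

```python
import time
import requests

BASE = "http://127.0.0.1:8080"  # hypothetical port-forward address


def wait_until_ready(retries: int = 30, delay: float = 5.0) -> bool:
    """Illustrative retry loop: any HTTP answer from /readiness proves the tunnel is alive."""
    for _ in range(retries):
        try:
            code = requests.get(f"{BASE}/readiness", timeout=5).status_code
        except requests.RequestException:
            time.sleep(delay)           # no answer at all: the port-forward really is dead
            continue
        if code in (200, 401, 503):     # 503 = app up, Llama Stack handshake still pending
            if requests.get(f"{BASE}/v1/models", timeout=5).ok:
                return True             # full readiness gate passed
        time.sleep(delay)
    return False
```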
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4f4da8c to 5f6bbe8 (compare)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)
397-399: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Stale diagnostic message after accepting 503.
The readiness status is logged for CI debugging, but "(expect 200 or 401)" no longer reflects the full set of acceptable codes — 503 is now valid too. When a disruption test scenario establishes its port-forward, CI logs will show a misleading annotation.
📝 Proposed fix
-echo "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or 401)"
+echo "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200, 401, or 503)"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e-prow/rhoai/scripts/e2e-ops.sh` around lines 397 - 399, The log message after checking readiness_code is outdated: update the echo that prints "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or 401)" so it reflects the new acceptable status codes (include 503). Locate the readiness check that sets readiness_code and the subsequent echo (look for the readiness_code variable and the echo that mentions "(expect 200 or 401)") and change the annotation to list the accepted codes, e.g. "(expect 200, 401 or 503)". Ensure the rest of the logging (PID and pf_resource echo) remains unchanged.
♻️ Duplicate comments (1)
tests/e2e/features/environment.py (1)
242-243: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
scenario.status is a Status enum — string comparison always returns early, logs are never dumped.
behave >= 1.2.6 (this repo requires >= 1.3.0) changed Scenario.status from a plain string to a behave.model_core.Status enum. The expression scenario.status != "failed" compares an enum instance to a string, which is always True, so _dump_pod_logs_on_failure immediately returns on every invocation and no logs are ever captured.
🐛 Proposed fix
+from behave.model_core import Status
 from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
 from behave.model import Feature, Scenario

 def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
     """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
-    if scenario.status != "failed":
+    if scenario.status != Status.failed:
         return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/features/environment.py` around lines 242 - 243, The check in _dump_pod_logs_on_failure uses string comparison against scenario.status (an enum), so it always returns early; change the guard to compare against the Status enum by importing behave.model_core.Status and replacing the condition with "if scenario.status != Status.failed: return" (or alternatively use str(scenario.status) == "failed"); update the imports to include Status and ensure the function references scenario.status and _dump_pod_logs_on_failure so the new enum comparison is applied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/e2e-prow/rhoai/scripts/e2e-ops.sh`:
- Around line 397-399: The log message after checking readiness_code is
outdated: update the echo that prints "[e2e-ops] LCS through port-forward: GET
http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or
401)" so it reflects the new acceptable status codes (include 503). Locate the
readiness check that sets readiness_code and the subsequent echo (look for the
readiness_code variable and the echo that mentions "(expect 200 or 401)") and
change the annotation to list the accepted codes, e.g. "(expect 200, 401 or
503)". Ensure the rest of the logging (PID and pf_resource echo) remains
unchanged.
---
Duplicate comments:
In `@tests/e2e/features/environment.py`:
- Around line 242-243: The check in _dump_pod_logs_on_failure uses string
comparison against scenario.status (an enum), so it always returns early; change
the guard to compare against the Status enum by importing
behave.model_core.Status and replacing the condition with "if scenario.status !=
Status.failed: return" (or alternatively use str(scenario.status) == "failed");
update the imports to include Status and ensure the function references
scenario.status and _dump_pod_logs_on_failure so the new enum comparison is
applied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2f75ccae-c676-4635-a1bb-621c7ae80b06
📒 Files selected for processing (4)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/scripts/e2e-ops.sh
- tests/e2e/features/environment.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: spectral
- GitHub Check: Pyright
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/e2e/features/environment.py
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing with Gherkin feature files
Files:
tests/e2e/features/environment.py
🧠 Learnings (3)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.
Applied to files:
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
tests/e2e/features/environment.py
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
tests/e2e/features/environment.py
🔇 Additional comments (3)
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)
27-27: LGTM — --max-model-len bump aligns with CPU manifest.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
27-27: LGTM — consistent with GPU manifest update.
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)
195-208: 503 acceptance in verify_connectivity looks correct. Treating 503 as "tunnel alive" prevents unnecessary restart-lightspeed calls when llama is intentionally disrupted — the core behavioural fix for this PR.
…t restores llama-stack (lightspeed-core#1628) * Add diagnostic pod logs on e2e failure and remove disrupt-once optimization * Increase vLLM max-model-len to 35936 (GPU memory limit) * Accept 503 as valid port-forward proof in e2e connectivity check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- e2e-ops restart-lightspeed ensures llama-stack is running first, implicitly restoring a previously disrupted llama-stack. The disrupt-once tracking flag was only reset for explicit llama-stack restarts, so subsequent disruption steps in the same feature were skipped — causing tests to see 200 instead of 503.
- Reset the tracking flag for lightspeed-stack restarts in Prow as well, since they also restore llama-stack.
- test_list.txt temporarily scoped to the disruption feature only for targeted CI validation.
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
llama_stack_disrupted.feature e2e tests in Prow.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests