LCORE-1497: Fix disruption flag not reset when Prow lightspeed restart restores llama-stack #1628
Conversation
Walkthrough: This PR raises vLLM ServingRuntime limits (max-model-len) and adjusts E2E testing and vLLM infrastructure.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 5 passed.
/retest
35564e8 to 2fe6b0c (compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/features/environment.py`:
- Around line 236-237: The current failure check compares scenario.status to the
string "failed" which never matches the behave Status enum; import the Status
enum (e.g., from behave.model import Status) and change the condition to compare
against Status.failed (use scenario.status != Status.failed or scenario.status
== Status.failed depending on the intended early return) so the log-dump path
runs when Scenario.status is a failed Status; update the import and replace the
string literal comparison in the function/block that contains scenario.status.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0b196e9c-77a4-49c2-882d-7e731f5b48ad
📒 Files selected for processing (4)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/scripts/e2e-ops.sh
- tests/e2e/features/environment.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests
Files:
tests/e2e/features/environment.py
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing with Gherkin feature files
Files:
tests/e2e/features/environment.py
🧠 Learnings (3)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.
Applied to files:
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
tests/e2e/features/environment.py
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
tests/e2e/features/environment.py
🔇 Additional comments (4)
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
27-27: Looks good — runtime limit bump is consistent. This isolated --max-model-len update is clean and aligns with the corresponding GPU runtime change.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)
27-27: LGTM — GPU manifest matches the CPU runtime adjustment. No issues with this parameter update.
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)
195-198: Good fix for transient disruption windows. Treating 503 as a valid liveness signal for /readiness is appropriate here because /v1/models is still used to validate full app readiness.
tests/e2e/features/environment.py (1)
283-286: Prow-only failure diagnostics hook placement looks good. Calling pod-log dumping at the start of after_scenario is a good choice for preserving failure context.
if scenario.status != "failed":
    return
🧩 Analysis chain
🌐 Web query:
In behave (>=1.2.6), is Scenario.status a behave.model_core.Status enum, and should failure checks use Status.failed instead of the string "failed"?
💡 Result:
Yes, in behave (>=1.2.6), Scenario.status is a behave.model_core.Status enum value [1][2][3]. Prior to 1.2.6, it was a string, but it changed to use the Status enum class [2]. Failure checks should use Status.failed (after importing from behave.model_core import Status) instead of the string "failed" for proper type-safe comparison [2]. Official docs confirm Status.failed represents assert-mismatch failures on scenarios [1][4], and source code uses Status.failed directly [5][6].
Citations:
- 1: https://behave.readthedocs.io/en/latest/appendix.status/
- 2: https://stackoverflow.com/questions/35284828/how-to-get-the-status-of-the-scenario-ran-in-behave
- 3: https://behave.readthedocs.io/en/latest/new_and_noteworthy_v1.2.6/
- 4: https://behave.readthedocs.io/en/stable/appendix.status/
- 5: https://github.com/behave/behave/blob/main/behave/formatter/progress.py
- 6: feature.status is Status.failed in before_scenario() hook while all my scenarios passed behave/behave#1068
🏁 Script executed:
# First, let's locate and read the environment.py file around lines 236-237
fd -type f -name "environment.py" | grep "e2e"
Repository: lightspeed-core/lightspeed-stack
Length of output: 247
🏁 Script executed:
# Check the content around lines 234-240
head -250 tests/e2e/features/environment.py | tail -20
Repository: lightspeed-core/lightspeed-stack
Length of output: 862
🏁 Script executed:
# Search for behave version in dependencies
fd -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" \) | xargs grep -l "behave" 2>/dev/null
Repository: lightspeed-core/lightspeed-stack
Length of output: 249
🏁 Script executed:
# Check if Status is already imported in the file
grep -n "Status\|from behave" tests/e2e/features/environment.py
Repository: lightspeed-core/lightspeed-stack
Length of output: 238
🏁 Script executed:
# Search for behave in various dependency files
find . -maxdepth 3 \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" -o -name "Pipfile" \) -exec grep -l "behave" {} \;
Repository: lightspeed-core/lightspeed-stack
Length of output: 93
🏁 Script executed:
# Also check if there's any version constraint for behave
rg "behave" -A 1 -B 1 --type py | head -30
Repository: lightspeed-core/lightspeed-stack
Length of output: 1565
🏁 Script executed:
# Check the behave version requirement in pyproject.toml
cat pyproject.toml | grep -A 5 -B 5 "behave"
Repository: lightspeed-core/lightspeed-stack
Length of output: 522
Import and use Status enum for failure check in Scenario.status comparison.
The repository requires behave>=1.3.0, which uses Status enum values instead of string literals for Scenario.status. The string comparison scenario.status != "failed" will always evaluate to True (since Status.failed enum will never equal the string "failed"), causing the function to return early and never dump logs on failure.
Fix
 from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
+from behave.model_core import Status
 from behave.model import Feature, Scenario
 from behave.runner import Context

 def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
     """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
-    if scenario.status != "failed":
+    if scenario.status != Status.failed:
         return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/features/environment.py` around lines 236 - 237, The current
failure check compares scenario.status to the string "failed" which never
matches the behave Status enum; import the Status enum (e.g., from behave.model
import Status) and change the condition to compare against Status.failed (use
scenario.status != Status.failed or scenario.status == Status.failed depending
on the intended early return) so the log-dump path runs when Scenario.status is
a failed Status; update the import and replace the string literal comparison in
the function/block that contains scenario.status.
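For reference, a minimal self-contained sketch of the guard the review is asking for — the function name, docstring, and Status import are taken from the proposed fix above, while the log-collection body is elided and only hinted at here:

```python
from behave.model_core import Status
from behave.model import Scenario


def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
    """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
    # scenario.status is a Status enum member on the behave version this repo
    # pins, so compare against Status.failed rather than the string "failed".
    if scenario.status != Status.failed:
        return
    # ... collect and print pod logs for `namespace` here ...
```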
…lama-stack In Prow, e2e-ops restart-lightspeed ensures llama-stack is running first. This implicitly restores a previously disrupted llama-stack, but the disrupt-once tracking flag was only reset for explicit llama-stack restarts. Subsequent disruption steps were skipped, causing tests to see 200 instead of 503. Reset the tracking flag for lightspeed-stack restarts in Prow as well. test_list.txt temporarily scoped to disruption feature only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…step The _llama_stack_disrupt_once flag assumed llama-stack stays disrupted across scenarios, but restart-lightspeed in Prow restores llama-stack as a prerequisite, making the flag stale. Each disruption step now checks actual container state, which is idempotent and correct. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
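As a rough illustration of "check actual container state" (a sketch only, not the repository's step code — the selector, namespace handling, and helper name are hypothetical):

```python
import subprocess


def llama_stack_container_running(namespace: str, selector: str = "app=llama-stack") -> bool:
    """Illustrative: ask the cluster whether a llama-stack container is currently running."""
    # Hypothetical oc invocation; the real step may use a different selector
    # or go through the e2e-ops.sh wrapper instead.
    result = subprocess.run(
        [
            "oc", "get", "pods", "-n", namespace, "-l", selector,
            "-o", "jsonpath={.items[*].status.containerStatuses[*].state.running}",
        ],
        capture_output=True, text=True, check=False,
    )
    # Non-empty output means at least one matching container reports a running
    # state; empty output means it is already disrupted.
    return bool(result.stdout.strip())
```

Deciding from live state like this is idempotent: re-running the disruption step after a Prow-side restart simply sees a running container and disrupts it again, with no module-level flag to go stale.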
…zation - Dump last 200 lines of llama-stack and lightspeed-stack pod logs plus pod status when Prow e2e tests fail, to aid CI debugging. - Remove fragile _llama_stack_disrupt_once flag from disruption step; each scenario now checks actual container state (idempotent). - Revert restart_container Prow guard to llama-stack only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
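A sketch of what such a diagnostic dump could look like, assuming `oc` is available to the test run and that app labels match the deployment names — both assumptions, not taken from the repository:

```python
import subprocess


def _dump_logs(namespace: str, selector: str, tail: int = 200) -> None:
    """Illustrative: print pod status and the last `tail` log lines for a label selector."""
    for cmd in (
        ["oc", "get", "pods", "-n", namespace, "-l", selector, "-o", "wide"],
        ["oc", "logs", "-n", namespace, "-l", selector, f"--tail={tail}"],
    ):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(proc.stdout or proc.stderr)


# e.g. _dump_logs(ns, "app=llama-stack") and _dump_logs(ns, "app=lightspeed-stack")
```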
A10G GPU has 23GB; after loading Llama-3.1-8B weights (~15.7GB), only 4.39GB remains for KV cache — max supported context is 35936. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
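A quick sanity check of that arithmetic (the KV-cache shape values below are the published Llama-3.1-8B configuration — 32 layers, 8 KV heads, head dim 128 — assumed here, with fp16 cache entries):

```python
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, 2 bytes each (fp16)
kv_budget = 4.39 * 1024**3                                # ~4.39 GiB left after weights

print(bytes_per_token)                    # 131072 bytes (128 KiB) per cached token
print(int(kv_budget / bytes_per_token))   # ~35963 tokens, in line with the 35936 cap
```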
A 503 from /readiness means the app responded (port-forward works) but Llama Stack handshake hasn't completed yet. Previously this was treated as a dead port-forward, exhausting all retries without ever checking /v1/models. Now the retry loop proceeds to the /v1/models gate, giving the app time to fully initialize. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
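The described flow, sketched in Python for clarity (the real logic lives in e2e-ops.sh; the port, retry counts, and helper name here are illustrative):

```python
import time
import requests

BASE = "http://127.0.0.1:8080"  # hypothetical port-forward address


def wait_until_ready(retries: int = 30, delay: float = 5.0) -> bool:
    """Illustrative retry loop: any HTTP answer from /readiness proves the tunnel is alive."""
    for _ in range(retries):
        try:
            code = requests.get(f"{BASE}/readiness", timeout=5).status_code
        except requests.RequestException:
            time.sleep(delay)           # no answer at all: the port-forward really is dead
            continue
        if code in (200, 401, 503):     # 503 = app up, Llama Stack handshake still pending
            if requests.get(f"{BASE}/v1/models", timeout=5).ok:
                return True             # full readiness gate passed
        time.sleep(delay)
    return False
```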
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4f4da8c to 5f6bbe8 (compare)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)
397-399: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Stale diagnostic message after accepting 503.
The readiness status is logged for CI debugging, but "(expect 200 or 401)" no longer reflects the full set of acceptable codes — 503 is now valid too. When a disruption test scenario establishes its port-forward, CI logs will show a misleading annotation.
📝 Proposed fix
-echo "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or 401)"
+echo "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200, 401, or 503)"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e-prow/rhoai/scripts/e2e-ops.sh` around lines 397 - 399, The log message after checking readiness_code is outdated: update the echo that prints "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or 401)" so it reflects the new acceptable status codes (include 503). Locate the readiness check that sets readiness_code and the subsequent echo (look for the readiness_code variable and the echo that mentions "(expect 200 or 401)") and change the annotation to list the accepted codes, e.g. "(expect 200, 401 or 503)". Ensure the rest of the logging (PID and pf_resource echo) remains unchanged.
♻️ Duplicate comments (1)
tests/e2e/features/environment.py (1)
242-243: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
scenario.status is a Status enum — string comparison always returns early, logs are never dumped.
behave >= 1.2.6 (this repo requires >= 1.3.0) changed Scenario.status from a plain string to a behave.model_core.Status enum. The expression scenario.status != "failed" compares an enum instance to a string, which is always True, so _dump_pod_logs_on_failure immediately returns on every invocation and no logs are ever captured.
🐛 Proposed fix
+from behave.model_core import Status
 from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
 from behave.model import Feature, Scenario

 def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
     """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
-    if scenario.status != "failed":
+    if scenario.status != Status.failed:
         return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/features/environment.py` around lines 242 - 243, The check in _dump_pod_logs_on_failure uses string comparison against scenario.status (an enum), so it always returns early; change the guard to compare against the Status enum by importing behave.model_core.Status and replacing the condition with "if scenario.status != Status.failed: return" (or alternatively use str(scenario.status) == "failed"); update the imports to include Status and ensure the function references scenario.status and _dump_pod_logs_on_failure so the new enum comparison is applied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/e2e-prow/rhoai/scripts/e2e-ops.sh`:
- Around line 397-399: The log message after checking readiness_code is
outdated: update the echo that prints "[e2e-ops] LCS through port-forward: GET
http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or
401)" so it reflects the new acceptable status codes (include 503). Locate the
readiness check that sets readiness_code and the subsequent echo (look for the
readiness_code variable and the echo that mentions "(expect 200 or 401)") and
change the annotation to list the accepted codes, e.g. "(expect 200, 401 or
503)". Ensure the rest of the logging (PID and pf_resource echo) remains
unchanged.
---
Duplicate comments:
In `@tests/e2e/features/environment.py`:
- Around line 242-243: The check in _dump_pod_logs_on_failure uses string
comparison against scenario.status (an enum), so it always returns early; change
the guard to compare against the Status enum by importing
behave.model_core.Status and replacing the condition with "if scenario.status !=
Status.failed: return" (or alternatively use str(scenario.status) == "failed");
update the imports to include Status and ensure the function references
scenario.status and _dump_pod_logs_on_failure so the new enum comparison is
applied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2f75ccae-c676-4635-a1bb-621c7ae80b06
📒 Files selected for processing (4)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/scripts/e2e-ops.sh
- tests/e2e/features/environment.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: spectral
- GitHub Check: Pyright
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/e2e/features/environment.py
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing with Gherkin feature files
Files:
tests/e2e/features/environment.py
🧠 Learnings (3)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.
Applied to files:
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
tests/e2e/features/environment.py
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
tests/e2e/features/environment.py
🔇 Additional comments (3)
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)
27-27: LGTM — --max-model-len bump aligns with CPU manifest.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
27-27: LGTM — consistent with GPU manifest update.
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)
195-208: 503 acceptance in verify_connectivity looks correct. Treating 503 as "tunnel alive" prevents unnecessary restart-lightspeed calls when llama is intentionally disrupted — the core behavioural fix for this PR.
…t restores llama-stack (lightspeed-core#1628) * Add diagnostic pod logs on e2e failure and remove disrupt-once optimization * Increase vLLM max-model-len to 35936 (GPU memory limit) * Accept 503 as valid port-forward proof in e2e connectivity check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- e2e-ops restart-lightspeed ensures llama-stack is running first, implicitly restoring a previously disrupted llama-stack. The disrupt-once tracking flag was only reset for explicit llama-stack restarts, so subsequent disruption steps in the same feature were skipped — causing tests to see 200 instead of 503.
- Reset the tracking flag for lightspeed-stack restarts in Prow as well, since they also restore llama-stack.
- test_list.txt temporarily scoped to the disruption feature only for targeted CI validation.
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
llama_stack_disrupted.feature e2e tests in Prow.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests