
LCORE-1497: Fix disruption flag not reset when Prow lightspeed restart restores llama-stack#1628

Merged
are-ces merged 6 commits into lightspeed-core:main from are-ces:LCORE-1496-rhoai-version-bump
May 8, 2026

Conversation


@are-ces are-ces commented Apr 29, 2026

Summary

  • In Prow, e2e-ops restart-lightspeed ensures llama-stack is running first, implicitly restoring a previously disrupted llama-stack. The disrupt-once tracking flag was only reset for explicit llama-stack restarts, so subsequent disruption steps in the same feature were skipped — causing tests to see 200 instead of 503.
  • Fix: reset the disruption tracking flag for lightspeed-stack restarts in Prow as well, since they also restore llama-stack.
  • test_list.txt temporarily scoped to the disruption feature only for targeted CI validation.
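The flag-reset logic described above can be sketched as follows. This is an illustrative stand-in, not the repository's actual code; the class and attribute names are assumptions:

```python
class DisruptionTracker:
    """Illustrative stand-in for the disrupt-once tracking flag."""

    def __init__(self) -> None:
        self.llama_stack_disrupted = False

    def on_restart(self, service: str, in_prow: bool) -> None:
        """Reset the flag whenever llama-stack is restored.

        Before the fix, only the explicit "llama-stack" branch existed. The
        elif branch captures the fix: in Prow, restart-lightspeed ensures
        llama-stack is running first, implicitly restoring it.
        """
        if service == "llama-stack":
            self.llama_stack_disrupted = False
        elif service == "lightspeed-stack" and in_prow:
            # lightspeed-stack restarts in Prow also restore llama-stack
            self.llama_stack_disrupted = False
```

With the reset in place, a later disruption step in the same feature sees the flag cleared and re-applies the disruption, so the readiness probe returns 503 as the scenario expects.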

Type of change

  • Bug fix
  • End to end tests improvement

Tools used to create PR

  • Assisted-by: Claude Code (Claude Opus 4.6)

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  1. Run the llama_stack_disrupted.feature e2e tests in Prow.
  2. Verify that Scenario 2 ("readiness state when llama stack is not available") now correctly returns 503 instead of 200.
  3. Confirm that other features are unaffected — the flag reset is a no-op when disruption was never applied.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Increased supported model length for vLLM serving runtimes (CPU & GPU).
  • Tests

    • Test connectivity now accepts additional readiness status codes.
    • On test failure, automatically capture recent service logs for easier troubleshooting.


coderabbitai Bot commented Apr 29, 2026


Walkthrough

This PR raises vLLM ServingRuntime --max-model-len from 32768 to 35936, treats HTTP 503 as an acceptable readiness response in the e2e port-forward check, and adds pod log capture on failing Behave scenarios in Prow/OpenShift.

Changes

E2E Testing and vLLM Infrastructure

  • vLLM Model Context Configuration (tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml, tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml): Increase --max-model-len from 32768 to 35936 in both CPU and GPU ServingRuntime specs.
  • Port Forward Verification (tests/e2e-prow/rhoai/scripts/e2e-ops.sh): Expand the verify_connectivity readiness check to accept HTTP 503 (Service Unavailable) alongside 200 and 401 before validating /v1/models.
  • Failure Diagnostics Logging (tests/e2e/features/environment.py): Add a _dump_pod_logs_on_failure helper to capture the last 100 lines of logs for llama-stack-service and lightspeed-stack-service, invoked from after_scenario when running in Prow and a scenario fails.
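The log-capture helper can be sketched roughly as below. The service names and tail length mirror the walkthrough, but the exact oc invocation is an assumption about the real implementation:

```python
from typing import List

def build_log_command(service: str, namespace: str, tail: int = 100) -> List[str]:
    """Build the assumed `oc logs` command used to capture recent pod logs."""
    return [
        "oc", "logs",
        f"deployment/{service}",
        "-n", namespace,
        f"--tail={tail}",
    ]

def dump_pod_logs(namespace: str) -> List[List[str]]:
    """Commands for both services inspected when a scenario fails in Prow.

    The real helper would run these with subprocess and print the output;
    here we only construct the commands for illustration.
    """
    return [
        build_log_command(svc, namespace)
        for svc in ("llama-stack-service", "lightspeed-stack-service")
    ]
```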

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly addresses the main bugfix: resetting the disruption flag when Prow lightspeed restarts restore llama-stack, which is the core problem identified in the PR objectives.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required threshold of 80.00%.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.





@tisnik tisnik left a comment


LGTM

@are-ces are-ces marked this pull request as draft April 30, 2026 06:41

are-ces commented Apr 30, 2026

/retest

@are-ces are-ces force-pushed the LCORE-1496-rhoai-version-bump branch 5 times, most recently from 35564e8 to 2fe6b0c on May 5, 2026 12:40
@are-ces are-ces changed the title Fix disruption flag not reset when Prow lightspeed restart restores llama-stack LCORE-1497: Fix disruption flag not reset when Prow lightspeed restart restores llama-stack May 5, 2026
@are-ces are-ces marked this pull request as ready for review May 7, 2026 09:12
@are-ces are-ces requested a review from radofuchs May 7, 2026 09:12

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/features/environment.py`:
- Around line 236-237: The current failure check compares scenario.status to the
string "failed" which never matches the behave Status enum; import the Status
enum (e.g., from behave.model import Status) and change the condition to compare
against Status.failed (use scenario.status != Status.failed or scenario.status
== Status.failed depending on the intended early return) so the log-dump path
runs when Scenario.status is a failed Status; update the import and replace the
string literal comparison in the function/block that contains scenario.status.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0b196e9c-77a4-49c2-882d-7e731f5b48ad

📥 Commits

Reviewing files that changed from the base of the PR and between d7e51cf and 4f4da8c.

📒 Files selected for processing (4)
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
  • tests/e2e-prow/rhoai/scripts/e2e-ops.sh
  • tests/e2e/features/environment.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/e2e/features/environment.py
tests/e2e/**/*.{py,feature}

📄 CodeRabbit inference engine (AGENTS.md)

Use behave (BDD) framework for end-to-end testing with Gherkin feature files

Files:

  • tests/e2e/features/environment.py
🧠 Learnings (3)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.

Applied to files:

  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.

Applied to files:

  • tests/e2e/features/environment.py
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.

Applied to files:

  • tests/e2e/features/environment.py
🔇 Additional comments (4)
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)

27-27: Looks good — runtime limit bump is consistent.

This isolated --max-model-len update is clean and aligns with the corresponding GPU runtime change.

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)

27-27: LGTM — GPU manifest matches the CPU runtime adjustment.

No issues with this parameter update.

tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)

195-198: Good fix for transient disruption windows.

Treating 503 as a valid liveness signal for /readiness is appropriate here because /v1/models is still used to validate full app readiness.

tests/e2e/features/environment.py (1)

283-286: Prow-only failure diagnostics hook placement looks good.

Calling pod-log dumping at the start of after_scenario is a good choice for preserving failure context.

Comment on lines +236 to +237:

    if scenario.status != "failed":
        return

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In behave (>=1.2.6), is Scenario.status a behave.model_core.Status enum, and should failure checks use Status.failed instead of the string "failed"?

💡 Result:

Yes, in behave (>=1.2.6), Scenario.status is a behave.model_core.Status enum value [1][2][3]. Prior to 1.2.6, it was a string, but it changed to use the Status enum class [2]. Failure checks should use Status.failed (after importing from behave.model_core import Status) instead of the string "failed" for proper type-safe comparison [2]. Official docs confirm Status.failed represents assert-mismatch failures on scenarios [1][4], and source code uses Status.failed directly [5][6].



🏁 Script executed:

# First, let's locate and read the environment.py file around lines 236-237
fd -type f -name "environment.py" | grep "e2e"

Repository: lightspeed-core/lightspeed-stack

Length of output: 247


🏁 Script executed:

# Check the content around lines 234-240
head -250 tests/e2e/features/environment.py | tail -20

Repository: lightspeed-core/lightspeed-stack

Length of output: 862


🏁 Script executed:

# Search for behave version in dependencies
fd -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" \) | xargs grep -l "behave" 2>/dev/null

Repository: lightspeed-core/lightspeed-stack

Length of output: 249


🏁 Script executed:

# Check if Status is already imported in the file
grep -n "Status\|from behave" tests/e2e/features/environment.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 238


🏁 Script executed:

# Search for behave in various dependency files
find . -maxdepth 3 \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" -o -name "Pipfile" \) -exec grep -l "behave" {} \;

Repository: lightspeed-core/lightspeed-stack

Length of output: 93


🏁 Script executed:

# Also check if there's any version constraint for behave
rg "behave" -A 1 -B 1 --type py | head -30

Repository: lightspeed-core/lightspeed-stack

Length of output: 1565


🏁 Script executed:

# Check the behave version requirement in pyproject.toml
cat pyproject.toml | grep -A 5 -B 5 "behave"

Repository: lightspeed-core/lightspeed-stack

Length of output: 522


Import and use Status enum for failure check in Scenario.status comparison.

The repository requires behave>=1.3.0, which uses Status enum values instead of string literals for Scenario.status. The string comparison scenario.status != "failed" will always evaluate to True (since Status.failed enum will never equal the string "failed"), causing the function to return early and never dump logs on failure.

Fix
-from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
+from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
+from behave.model_core import Status
 from behave.model import Feature, Scenario
 from behave.runner import Context
 
 def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
     """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
-    if scenario.status != "failed":
+    if scenario.status != Status.failed:
         return
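The pitfall can be reproduced without behave itself, using a plain stdlib Enum as a stand-in for behave.model_core.Status. Note that some behave versions add string-compatible equality to Status, so the exact behaviour depends on the installed version; this sketch shows the plain-Enum case the review assumes:

```python
from enum import Enum

class Status(Enum):
    """Stand-in for behave.model_core.Status; behave is not imported here."""
    passed = 2
    failed = 3

scenario_status = Status.failed   # what behave stores on a failed scenario

# A plain Enum member never compares equal to a bare string ...
assert (scenario_status == "failed") is False
# ... so a guard like `if scenario.status != "failed": return` always
# returns early and the log-dump path is never reached.
assert scenario_status != "failed"

# The enum comparison proposed in the review works as intended:
assert scenario_status == Status.failed
```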

are-ces and others added 6 commits May 7, 2026 14:10
…lama-stack

In Prow, e2e-ops restart-lightspeed ensures llama-stack is running first.
This implicitly restores a previously disrupted llama-stack, but the
disrupt-once tracking flag was only reset for explicit llama-stack
restarts. Subsequent disruption steps were skipped, causing tests to see
200 instead of 503.

Reset the tracking flag for lightspeed-stack restarts in Prow as well.

test_list.txt temporarily scoped to disruption feature only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…step

The _llama_stack_disrupt_once flag assumed llama-stack stays disrupted
across scenarios, but restart-lightspeed in Prow restores llama-stack
as a prerequisite, making the flag stale. Each disruption step now
checks actual container state, which is idempotent and correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
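The idempotent replacement this commit describes can be sketched as below; the probe and disrupt callables are stand-ins for the real container-state checks, and the function name is an assumption:

```python
from typing import Callable

def ensure_disrupted(is_container_running: Callable[[], bool],
                     disrupt: Callable[[], None]) -> bool:
    """Disrupt llama-stack only if it is currently running.

    Returns True if a disruption was applied, False if the container was
    already down (a no-op). Consulting actual state each time makes the
    step safe to repeat across scenarios, with no stale module-level flag.
    """
    if not is_container_running():
        return False
    disrupt()
    return True
```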
…zation

- Dump last 200 lines of llama-stack and lightspeed-stack pod logs plus
  pod status when Prow e2e tests fail, to aid CI debugging.
- Remove fragile _llama_stack_disrupt_once flag from disruption step;
  each scenario now checks actual container state (idempotent).
- Revert restart_container Prow guard to llama-stack only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A10G GPU has 23GB; after loading Llama-3.1-8B weights (~15.7GB),
only 4.39GB remains for KV cache — max supported context is 35936.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
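The 35936 figure can be sanity-checked with back-of-the-envelope KV-cache arithmetic. The architecture numbers (32 layers, 8 KV heads via GQA, head dimension 128, fp16 cache) are assumptions taken from Llama-3.1-8B's published configuration, not from this repository:

```python
# Approximate per-token KV cache cost and resulting context ceiling.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_ELEM = 2          # fp16
KV_FACTOR = 2               # one K and one V tensor per layer

bytes_per_token = KV_FACTOR * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
free_bytes = 4.39 * 1024**3  # memory left after weights, per the commit

max_tokens = free_bytes / bytes_per_token
print(bytes_per_token)       # 131072, i.e. 128 KiB of KV cache per token
print(round(max_tokens))     # ~35963, in line with the reported 35936
```

The small gap to the exact 35936 is plausibly scheduler or allocator overhead reserved by vLLM; the order of magnitude is what matters here.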
A 503 from /readiness means the app responded (port-forward works) but
Llama Stack handshake hasn't completed yet. Previously this was treated
as a dead port-forward, exhausting all retries without ever checking
/v1/models. Now the retry loop proceeds to the /v1/models gate, giving
the app time to fully initialize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
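The retry behaviour described above is implemented as a shell loop in e2e-ops.sh; a Python sketch of the same logic follows, with probe callables standing in for curl against /readiness and /v1/models:

```python
from typing import Callable

ALIVE = {200, 401, 503}  # codes proving the port-forward tunnel is alive

def wait_for_ready(probe_readiness: Callable[[], int],
                   probe_models: Callable[[], bool],
                   max_retries: int = 5) -> bool:
    """Retry loop: a 503 from /readiness keeps polling instead of being
    treated as a dead tunnel, giving the app time to finish the Llama
    Stack handshake before the /v1/models gate is checked."""
    for _ in range(max_retries):
        code = probe_readiness()
        if code not in ALIVE:
            return False          # tunnel really is dead; caller restarts
        if probe_models():        # full-readiness gate
            return True
    return False
```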
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@are-ces are-ces force-pushed the LCORE-1496-rhoai-version-bump branch from 4f4da8c to 5f6bbe8 on May 7, 2026 12:11

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)

397-399: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale diagnostic message after accepting 503.

The readiness status is logged for CI debugging, but "(expect 200 or 401)" no longer reflects the full set of acceptable codes — 503 is now valid too. When a disruption test scenario establishes its port-forward, CI logs will show a misleading annotation.

📝 Proposed fix
-echo "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200 or 401)"
+echo "[e2e-ops] LCS through port-forward: GET http://127.0.0.1:$local_port/readiness -> HTTP $readiness_code (expect 200, 401, or 503)"
♻️ Duplicate comments (1)
tests/e2e/features/environment.py (1)

242-243: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

scenario.status is a Status enum — string comparison always returns early, logs are never dumped.

behave >= 1.2.6 (this repo requires >= 1.3.0) changed Scenario.status from a plain string to a behave.model_core.Status enum. The expression scenario.status != "failed" compares an enum instance to a string, which is always True, so _dump_pod_logs_on_failure immediately returns on every invocation and no logs are ever captured.

🐛 Proposed fix
+from behave.model_core import Status
 from behave.contrib.scenario_autoretry import patch_scenario_with_autoretry
 from behave.model import Feature, Scenario

 def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
     """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
-    if scenario.status != "failed":
+    if scenario.status != Status.failed:
         return

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2f75ccae-c676-4635-a1bb-621c7ae80b06

📥 Commits

Reviewing files that changed from the base of the PR and between 4f4da8c and 5f6bbe8.

📒 Files selected for processing (4)
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
  • tests/e2e-prow/rhoai/scripts/e2e-ops.sh
  • tests/e2e/features/environment.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: spectral
  • GitHub Check: Pyright
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🔇 Additional comments (3)
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)

27-27: LGTM — --max-model-len bump aligns with CPU manifest.

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)

27-27: LGTM — consistent with GPU manifest update.

tests/e2e-prow/rhoai/scripts/e2e-ops.sh (1)

195-208: 503 acceptance in verify_connectivity looks correct.

Treating 503 as "tunnel alive" prevents unnecessary restart-lightspeed calls when llama is intentionally disrupted — the core behavioural fix for this PR.


@radofuchs radofuchs left a comment


LGTM

@are-ces are-ces merged commit d15d37d into lightspeed-core:main May 8, 2026
31 checks passed
major pushed a commit to major/lightspeed-stack that referenced this pull request May 8, 2026
…t restores llama-stack (lightspeed-core#1628)

* Add diagnostic pod logs on e2e failure and remove disrupt-once optimization

* Increase vLLM max-model-len to 35936 (GPU memory limit)

* Accept 503 as valid port-forward proof in e2e connectivity check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
