
RSPEED-2943: add Responses API inference metrics #1641

Merged
tisnik merged 1 commit into lightspeed-core:main from major:split/responses-inference-metrics
May 7, 2026

Conversation

@major
Contributor

@major major commented Apr 29, 2026

Description

Add bounded inference latency and failure metrics around the Responses API backend LLM calls, including streaming terminal events and stream iteration failures.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library/tool for development
  • CI config
  • Konflux
  • Unit tests
  • Integration tests
  • E2E tests
  • Benchmarks

Tools used to create PR

N/A

Related Tickets & Documents

Related Issue: RSPEED-2943

Checklist before requesting a review

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Testing

  • uv run pytest tests/unit/metrics/test_recording.py tests/unit/app/endpoints/test_responses.py
  • uv run radon cc -s src/app/endpoints/responses.py src/metrics/recording.py
  • uv run make verify

Summary by CodeRabbit

  • New Features

    • Added LLM inference-duration metrics for API responses (streaming and non-streaming), with result labels bounded to "success"/"failure".
  • Bug Fixes

    • Ensured inference metrics are reliably recorded on streaming errors and prevented duplicate metric emission.
  • Tests

    • Added unit tests for streaming failure metric recording and for normalization of unexpected result labels.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 29, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a0712542-2014-43b1-888d-98a75c4d4c80

📥 Commits

Reviewing files that changed from the base of the PR and between 31849b3 and 80c64db.

📒 Files selected for processing (4)
  • src/app/endpoints/responses.py
  • src/metrics/recording.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py

Walkthrough

This PR adds monotonic inference timing capture and metric recording to the Responses API endpoints for both streaming and non-streaming calls, with a new metric label normalization layer to bound result labels to allowed values ("success" or "failure").
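
For orientation, the normalization layer described here might look roughly like this. This is a minimal sketch reconstructed from the summary; the constant and function names below come from this PR's walkthrough, but everything else is an assumption, not the merged code:

```python
from typing import Final

# Hypothetical reconstruction of the bounded-label helper described in this PR.
LLM_INFERENCE_RESULT_SUCCESS: Final[str] = "success"
LLM_INFERENCE_RESULT_FAILURE: Final[str] = "failure"
ALLOWED_LLM_INFERENCE_RESULTS: Final[frozenset[str]] = frozenset(
    {LLM_INFERENCE_RESULT_SUCCESS, LLM_INFERENCE_RESULT_FAILURE}
)


def normalize_llm_inference_result(result: str) -> str:
    """Clamp an arbitrary result label to the allowed set.

    Any unrecognized label is coerced to "failure" so Prometheus label
    cardinality stays bounded at exactly two values.
    """
    if result in ALLOWED_LLM_INFERENCE_RESULTS:
        return result
    return LLM_INFERENCE_RESULT_FAILURE
```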

Changes

LLM Inference Timing & Metric Recording

  • Metrics Foundation (src/metrics/recording.py): Exported LLM_INFERENCE_RESULT_SUCCESS and LLM_INFERENCE_RESULT_FAILURE constants and the ALLOWED_LLM_INFERENCE_RESULTS set; added a normalize_llm_inference_result() helper to clamp unrecognized labels to "failure".
  • Metric Recording Integration (src/metrics/recording.py): Updated record_llm_inference_duration() to normalize the provided result label before emitting the Prometheus histogram observation.
  • Module Imports (src/app/endpoints/responses.py): Added time and metrics.recording imports.
  • Response Metric Helper (src/app/endpoints/responses.py): Introduced _record_response_inference_result() to extract provider and model from the composite model ID and emit the duration metric plus an optional failure counter. A sketch of this helper follows the list.
  • Streaming Response Timing (src/app/endpoints/responses.py): Added an inference_start_time = time.monotonic() capture in handle_streaming_response() before the backend call; the exception handler records a failure metric; the response_generator() signature gained an inference_start_time parameter.
  • Streaming Metric Emission (src/app/endpoints/responses.py): Refactored response_generator() to track whether the inference metric was emitted; records a success-labeled duration metric on terminal chunks (response.completed, response.incomplete), a failure-labeled metric on response.failed, and a failure metric in the exception handler if no terminal event was recorded.
  • Non-Streaming Response Timing (src/app/endpoints/responses.py): Captures inference_start_time before the backend responses.create() call; records a success-labeled duration metric on success, and a failure-labeled metric in the exception path only if the success metric was not already recorded.
  • Tests (tests/unit/app/endpoints/test_responses.py, tests/unit/metrics/test_recording.py): Added a parametrized streaming-generator failure test verifying metric emission on exception; updated non-streaming tests to assert helper calls; added a test for label normalization mapping unrecognized result values to "failure".
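
As a rough illustration of the helper described above, a sketch only: the real record_llm_inference_duration() signature, the "provider/model" composite-ID format, and the failure-counter recorder name are assumptions based on this summary, not the merged code:

```python
from metrics import recording  # the project favors absolute imports from src/


def _record_response_inference_result(
    model_id: str,
    endpoint_path: str,
    result: str,
    duration: float,
    record_failure: bool = False,
) -> None:
    """Record duration (and, optionally, a failure count) for one LLM call."""
    # Assumption: composite model IDs look like "provider/model".
    provider, _, model = model_id.partition("/")
    recording.record_llm_inference_duration(
        provider=provider,
        model=model,
        endpoint=endpoint_path,
        result=result,  # normalized to "success"/"failure" inside the recorder
        duration=duration,
    )
    if record_failure:
        # Hypothetical counter recorder; the PR only says a failure counter exists.
        recording.record_llm_inference_failure(provider=provider, model=model)
```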

Sequence Diagram

sequenceDiagram
    participant Handler as Request Handler
    participant Client as Backend Client
    participant Generator as Stream Generator
    participant Metrics as Metrics Recorder
    
    Handler->>Handler: capture inference_start_time
    Handler->>Client: create(...)
    Client-->>Generator: return async stream
    Handler->>Generator: start consuming chunks
    
    loop Process Chunks
        Generator->>Generator: yield chunk
        alt Terminal Chunk
            Generator->>Metrics: record_llm_inference_duration(result=success/failure, duration)
            Metrics->>Metrics: normalize_llm_inference_result()
            Metrics->>Metrics: emit histogram
        end
    end
    
    alt Stream Error Before Terminal Event
        Generator->>Generator: exception raised
        Generator->>Metrics: record_llm_inference_duration(result=failure, duration)
        Metrics->>Metrics: normalize & emit
        Generator->>Handler: re-raise exception
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: 5 of 5 passed
  • Description Check: passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: passed. The title clearly and concisely summarizes the main change: adding inference metrics to the Responses API backend, which is the core purpose of the PR.
  • Docstring Coverage: passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Linked Issues Check: passed (check skipped because no linked issues were found for this pull request).
  • Out of Scope Changes Check: passed (check skipped because no linked issues were found for this pull request).


@major major mentioned this pull request Apr 29, 2026
@major major force-pushed the split/responses-inference-metrics branch from 15beb51 to c00a020 on April 30, 2026 at 00:44
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/endpoints/responses.py (1)

1216-1318: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid emitting inference failure metrics after a successful backend call.

At Line 1223, success is already recorded right after client.responses.create(...). Because the same try block continues through local post-processing, any later exception (for example, in the append/persistence path) lands in these except blocks and records a failure for the same inference call.

💡 Suggested fix (gate failure recording once success is emitted)
-    else:
-        inference_start_time = time.monotonic()
+    else:
+        inference_start_time = time.monotonic()
+        inference_metric_recorded = False
         try:
             api_response = cast(
                 OpenAIResponseObject,
                 await client.responses.create(
                     **api_params.model_dump(exclude_none=True)
                 ),
             )
             _record_response_inference_result(
                 api_params.model,
                 endpoint_path,
                 "success",
                 time.monotonic() - inference_start_time,
             )
+            inference_metric_recorded = True
             token_usage = extract_token_usage(
                 api_response.usage, api_params.model, endpoint_path
             )
@@
         except RuntimeError as e:
-            _record_response_inference_result(
-                api_params.model,
-                endpoint_path,
-                "failure",
-                time.monotonic() - inference_start_time,
-                record_failure=True,
-            )
+            if not inference_metric_recorded:
+                _record_response_inference_result(
+                    api_params.model,
+                    endpoint_path,
+                    "failure",
+                    time.monotonic() - inference_start_time,
+                    record_failure=True,
+                )
@@
         except APIConnectionError as e:
-            _record_response_inference_result(
-                api_params.model,
-                endpoint_path,
-                "failure",
-                time.monotonic() - inference_start_time,
-                record_failure=True,
-            )
+            if not inference_metric_recorded:
+                _record_response_inference_result(
+                    api_params.model,
+                    endpoint_path,
+                    "failure",
+                    time.monotonic() - inference_start_time,
+                    record_failure=True,
+                )
@@
         except (LLSApiStatusError, OpenAIAPIStatusError) as e:
-            _record_response_inference_result(
-                api_params.model,
-                endpoint_path,
-                "failure",
-                time.monotonic() - inference_start_time,
-                record_failure=True,
-            )
+            if not inference_metric_recorded:
+                _record_response_inference_result(
+                    api_params.model,
+                    endpoint_path,
+                    "failure",
+                    time.monotonic() - inference_start_time,
+                    record_failure=True,
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 1216 - 1318, The code records
success immediately after client.responses.create(...) but later post-processing
(e.g., extract_text_from_response_items, consume_query_tokens,
append_turn_items_to_conversation) can raise and trigger the except blocks which
then emit a failure metric; add a guard so failure is only recorded if success
wasn't already emitted—set a local flag like success_recorded = False before the
try, set it True right after the _record_response_inference_result(...) success
call (the call that uses api_params.model and endpoint_path), and in every
except handler check that flag before calling
_record_response_inference_result(..., "failure", ...). This preserves the
current success metric when post-response processing fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e26c242-f7f0-432b-81d3-31e5059dd352

📥 Commits

Reviewing files that changed from the base of the PR and between ca125c4 and c00a020.

📒 Files selected for processing (5)
  • src/app/endpoints/responses.py
  • src/metrics/__init__.py
  • src/metrics/recording.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: bandit
  • GitHub Check: integration_tests (3.13)
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: build-pr
  • GitHub Check: Pylinter
  • GitHub Check: unit_tests (3.13)
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (6)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types
Use Union types with modern syntax: str | int
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns: return new data structures instead of modifying parameters
Use async def for I/O operations and external API calls
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes use ABC with @abstractmethod decorators
Complete type annotations for all class attributes, use specific types, not Any
Follow Google Python docstring conventions for all modules, classes, and functions
Docstring Parameters section documents function parameters
Docstring Returns section documents function return values
Docstring Raises section documents exceptions that may be raised
Use black for code formatting
Use pylint for static analysis with source-roots configuration set to "src"
Use pyright for type checking
Use ruff for fast linting
Use pydocstyle for docstring style validation
Use mypy for additional type checking
Use bandit for security issue detection

Files:

  • src/metrics/__init__.py
  • src/metrics/recording.py
  • src/app/endpoints/responses.py
src/**/__init__.py

📄 CodeRabbit inference engine (AGENTS.md)

Package __init__.py files contain brief package descriptions

Files:

  • src/metrics/__init__.py
tests/unit/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/unit/**/*.py: Use pytest for all unit and integration tests
Use pytest-mock for AsyncMock objects in unit tests
Use marker pytest.mark.asyncio for async tests

Files:

  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Do not use unittest - pytest is the standard testing framework for this project

Files:

  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends

Files:

  • src/app/endpoints/responses.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/responses.py
🧠 Learnings (1)
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/responses.py
🔇 Additional comments (4)
tests/unit/app/endpoints/test_responses.py (1)

1270-1319: Strong coverage for streaming iterator failure metrics.

Nice addition: this asserts both exception propagation and the single failure metric emission with record_failure=True.

tests/unit/metrics/test_recording.py (1)

186-237: Good shared test pattern for histogram recorders.

The table-driven structure keeps success/failure-path validation concise while still checking warning behavior.

src/metrics/__init__.py (1)

75-83: Histogram definition looks consistent with the new inference recorder.

Label set and bucket wiring are aligned with the recording helper contract.
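
For context, a Prometheus histogram wired for these labels would typically be declared along the following lines. This is an illustrative prometheus_client sketch: the actual metric name, label set, and buckets in src/metrics/__init__.py may differ.

```python
from prometheus_client import Histogram

# Hypothetical shape of the inference-duration histogram; label names and
# buckets here are assumptions based on the recording helper contract.
llm_inference_duration_seconds = Histogram(
    "llm_inference_duration_seconds",
    "Duration of backend LLM inference calls.",
    labelnames=["provider", "model", "endpoint", "result"],
    buckets=(0.1, 0.5, 1, 2, 5, 10, 30, 60),
)

# Recording one observation:
llm_inference_duration_seconds.labels(
    provider="openai", model="gpt-4o", endpoint="/v1/responses", result="success"
).observe(1.42)
```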

src/metrics/recording.py (1)

135-154: Bounded result normalization and metric recording are implemented cleanly.

This keeps label cardinality controlled while preserving non-fatal metric error handling.

@major major force-pushed the split/responses-inference-metrics branch 2 times, most recently from b7af86f to e53dfb8 on April 30, 2026 at 21:13
@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

Let me verify the fix in the current branch.


🏁 Script executed:

#!/bin/bash
# Read the non-streaming path in responses.py to verify the guard flag implementation
rg -n "inference_metric_recorded|_record_response_inference_result|record_llm_inference_duration|inference_start_time" src/app/endpoints/responses.py -A2 -B2

Length of output: 5709


@major: The fix looks great — verified in the current branch. The implementation matches exactly what you described:

  • Non-streaming path (around lines 1060–1156):

    • inference_metric_recorded = False is initialized before the try block (line 1061).
    • Set to True immediately after the success metric call (line 1075).
    • All three except handlers — RuntimeError, APIConnectionError, and LLSApiStatusError/OpenAIAPIStatusError — correctly gate the failure metric behind if not inference_metric_recorded (lines 1097, 1123, 1149).
  • Streaming path (around lines 836–930):

    • The same pattern is already in place: inference_metric_recorded guards the except Exception fallback (line 923), preventing double-recording when a terminal event was already processed.

The guard flag cleanly isolates inference failures from post-processing failures (extract_token_usage, consume_query_tokens, etc.) across both code paths. Well done!
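
A condensed sketch of the streaming pattern being verified here: the chunk types and the guard flag follow the walkthrough, while the surrounding signature is simplified and assumes the _record_response_inference_result helper shown earlier.

```python
import time
from collections.abc import AsyncIterator
from typing import Any

_TERMINAL_CHUNK_TYPES = frozenset(
    {"response.completed", "response.incomplete", "response.failed"}
)


async def response_generator(
    stream: AsyncIterator[Any],
    api_params: Any,
    context: Any,
    inference_start_time: float,
) -> AsyncIterator[Any]:
    """Yield stream chunks while recording the inference metric exactly once."""
    inference_metric_recorded = False
    try:
        async for chunk in stream:
            if not inference_metric_recorded and chunk.type in _TERMINAL_CHUNK_TYPES:
                failed = chunk.type == "response.failed"
                _record_response_inference_result(
                    api_params.model,
                    context.endpoint_path,
                    "failure" if failed else "success",
                    time.monotonic() - inference_start_time,
                    record_failure=failed,
                )
                inference_metric_recorded = True
            yield chunk
    except Exception:
        # The stream died before any terminal event arrived: record a single
        # failure sample, then re-raise so the endpoint's error handling runs.
        if not inference_metric_recorded:
            _record_response_inference_result(
                api_params.model,
                context.endpoint_path,
                "failure",
                time.monotonic() - inference_start_time,
                record_failure=True,
            )
        raise
```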



Contributor

@tisnik tisnik left a comment


Please ignore the docs/demos directory completely; the sources there are incorrect on purpose.

@major major force-pushed the split/responses-inference-metrics branch 2 times, most recently from 2b79abf to 1b033fa on May 5, 2026 at 13:18
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/app/endpoints/responses.py`:
- Around line 1034-1074: The streaming path records the inference metric only
after post-processing which can throw and cause a spurious failure metric; move
the success/failure metric recording to immediately after casting
latest_response_object (after latest_response_object = cast(...)) so the LLM
terminal event is recorded before
extract_token_usage/consume_query_tokens/get_available_quotas/extract_text_from_response_items
run, calling _record_response_inference_result(api_params.model,
context.endpoint_path, "failure" if chunk.type == "response.failed" else
"success", time.monotonic() - inference_start_time, record_failure=(chunk.type
== "response.failed")) and set inference_metric_recorded = True there, and
ensure the later metric-call is removed or guarded to avoid double-recording.

In `@tests/unit/app/endpoints/test_responses.py`:
- Around line 2702-2761: Add assertions that _record_response_inference_result
is invoked for non-streaming paths: patch the module-level
_record_response_inference_result in the TestHandleNonStreamingResponse tests
(e.g., test_handle_non_streaming_success_returns_response and a non-streaming
error test) using mocker.patch(f"{MODULE}._record_response_inference_result")
and assert_called_once_with (or assert_called_once) verifying the expected
arguments (result="success" for success path, "failure" and record_failure=True
for error path); update those tests to build api_params/context the same way and
then check the patched _record_response_inference_result was called with the
correct positional and keyword args.
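
A hedged sketch of the kind of test the prompt asks for; the handler invocation is omitted, and the result-label argument position is an assumption:

```python
import pytest


@pytest.mark.asyncio
async def test_handle_non_streaming_success_records_metric(mocker):
    """Success path should emit exactly one success-labeled helper call."""
    # MODULE is assumed to be the test file's existing dotted-path constant
    # for the endpoint module, as referenced in the prompt above.
    record = mocker.patch(f"{MODULE}._record_response_inference_result")

    # ... build api_params/context and drive the non-streaming handler here,
    # mirroring the existing TestHandleNonStreamingResponse setup ...

    record.assert_called_once()
    assert record.call_args.args[2] == "success"  # result label position assumed
    assert "record_failure" not in record.call_args.kwargs
```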

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0e11706d-2f5f-4529-962d-feba65eec42b

📥 Commits

Reviewing files that changed from the base of the PR and between c00a020 and 1b033fa.

📒 Files selected for processing (4)
  • src/app/endpoints/responses.py
  • src/metrics/recording.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: list_outdated_dependencies
  • GitHub Check: black
  • GitHub Check: Pylinter
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: Pyright
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 3
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types
Use Union types with modern syntax: str | int
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns: return new data structures instead of modifying parameters
Use async def for I/O operations and external API calls
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes use ABC with @abstractmethod decorators
Complete type annotations for all class attributes, use specific types, not Any
Follow Google Python docstring conventions for all modules, classes, and functions
Docstring Parameters section documents function parameters
Docstring Returns section documents function return values
Docstring Raises section documents exceptions that may be raised
Use black for code formatting
Use pylint for static analysis with source-roots configuration set to "src"
Use pyright for type checking
Use ruff for fast linting
Use pydocstyle for docstring style validation
Use mypy for additional type checking
Use bandit for security issue detection

Files:

  • src/metrics/recording.py
  • src/app/endpoints/responses.py
tests/unit/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/unit/**/*.py: Use pytest for all unit and integration tests
Use pytest-mock for AsyncMock objects in unit tests
Use marker pytest.mark.asyncio for async tests

Files:

  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Do not use unittest - pytest is the standard testing framework for this project

Files:

  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends

Files:

  • src/app/endpoints/responses.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/responses.py
🧠 Learnings (1)
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/responses.py
🔇 Additional comments (4)
src/metrics/recording.py (1)

115-160: LGTM — normalization layer and bounded metric recording are well-structured.

Constants use Final[frozenset[str]] for O(1) lookup and immutability, normalize_llm_inference_result is correctly documented with an explicit Returns: section, and bounded_result is computed outside the try block (no risk of suppression). Good defensive design.

tests/unit/metrics/test_recording.py (1)

221-236: LGTM — precisely tests the label-bounding behavior.

The test exercises the only missing dimension (out-of-set result → "failure") while the parametrized test above covers the passthrough of valid "success". The duration is verified to remain unchanged, confirming the normalization only touches the label.
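
For illustration, such a normalization test could look roughly like this; the patch target and the recorder's keyword arguments are assumptions, and the real test in tests/unit/metrics/test_recording.py may differ:

```python
from metrics import recording


def test_unexpected_result_label_is_coerced_to_failure(mocker):
    """Out-of-set result labels are clamped to "failure"; duration is untouched."""
    # Assumed metric object name and location; the patch target may differ.
    histogram = mocker.patch("metrics.recording.llm_inference_duration_seconds")

    recording.record_llm_inference_duration(
        provider="openai",
        model="gpt-4o",
        endpoint="/v1/responses",
        result="exploded",  # not in ALLOWED_LLM_INFERENCE_RESULTS
        duration=1.5,
    )

    histogram.labels.assert_called_once_with(
        provider="openai",
        model="gpt-4o",
        endpoint="/v1/responses",
        result="failure",  # normalized before observation
    )
    histogram.labels.return_value.observe.assert_called_once_with(1.5)
```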

src/app/endpoints/responses.py (2)

635-660: LGTM — _record_response_inference_result is well-factored.

Clean helper that centralises provider/model extraction, failure counter, and duration histogram recording in one place. The record_failure=False default means callers on the success path don't need to pass it explicitly.


1177-1224: LGTM — non-streaming metric recording is correctly placed.

Success is recorded immediately after create() returns and before post-processing (extract_token_usage, consume_query_tokens). The inference_metric_recorded guard prevents double-recording in all three exception types, and the record_failure=True is correctly passed only in the failure branch.

@major major force-pushed the split/responses-inference-metrics branch from 1b033fa to 23ea1f1 on May 5, 2026 at 13:32
@major major requested a review from tisnik May 5, 2026 19:08
Contributor

@tisnik tisnik left a comment


LGTM

@major major force-pushed the split/responses-inference-metrics branch from 23ea1f1 to 31849b3 on May 7, 2026 at 11:55
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/app/endpoints/responses.py`:
- Around line 1045-1052: Replace the raw "success"/"failure" string literals
passed to _record_response_inference_result with the shared constants
recording.LLM_INFERENCE_RESULT_SUCCESS and
recording.LLM_INFERENCE_RESULT_FAILURE: determine the result from chunk.type as
before but map it to recording.LLM_INFERENCE_RESULT_FAILURE when chunk.type ==
"response.failed" else recording.LLM_INFERENCE_RESULT_SUCCESS, and pass that
constant into _record_response_inference_result along with the same
api_params.model, context.endpoint_path, elapsed time, and record_failure flag;
apply the same replacement in the other occurrences noted around lines 1188-1193
and 1222-1225 where the raw literals are used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 71b30a61-501e-4ce1-a82b-0b553edac82b

📥 Commits

Reviewing files that changed from the base of the PR and between 1b033fa and 31849b3.

📒 Files selected for processing (4)
  • src/app/endpoints/responses.py
  • src/metrics/recording.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-pr
  • GitHub Check: bandit
  • GitHub Check: Pylinter
  • GitHub Check: Pyright
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/metrics/test_recording.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/metrics/recording.py
  • src/app/endpoints/responses.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/endpoints/responses.py
🧠 Learnings (1)
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/responses.py
🔇 Additional comments (4)
src/metrics/recording.py (1)

115-138: Bounded inference-result labels are implemented correctly.

Great change: normalizing unknown results to "failure" before histogram observation keeps metric cardinality bounded without losing latency samples.

Also applies to: 153-157

tests/unit/metrics/test_recording.py (1)

221-237: Targeted regression coverage looks solid.

This test cleanly proves unexpected inference result labels are coerced to "failure" while preserving the recorded duration.

tests/unit/app/endpoints/test_responses.py (1)

856-879: Instrumentation test coverage is now explicit and well-targeted.

The new assertions and the stream-iteration failure test provide strong confidence that inference metrics are recorded exactly once in both non-streaming and streaming failure paths.

Also applies to: 1024-1045, 2711-2770

src/app/endpoints/responses.py (1)

691-717: Inference metric boundary + double-recording guards look correct.

Recording terminal streaming metrics before post-processing and guarding fallback failure recording with inference_metric_recorded is the right behavior and matches non-streaming semantics.

Also applies to: 1033-1087, 1179-1226

Record inference duration and outcome (success/failure) for both
streaming and non-streaming Responses API paths. Metric is recorded
at the terminal-event boundary before post-processing to prevent
spurious failure metrics when post-processing raises after a
successful LLM inference. Guarded with inference_metric_recorded
flag to avoid double-recording.

Signed-off-by: Major Hayden <major@redhat.com>
@major major force-pushed the split/responses-inference-metrics branch from fa10cf8 to 80c64db on May 7, 2026 at 12:11
@tisnik tisnik merged commit 6280606 into lightspeed-core:main on May 7, 2026
27 of 28 checks passed