
Conversation

@raptorsun
Contributor

@raptorsun raptorsun commented Sep 2, 2025

Description

Record metrics for the number of tokens sent to the LLM provider and the number of tokens received from it.
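
For reference, a minimal sketch of how two such Prometheus counters are typically declared with prometheus_client; the metric names and the (provider, model) label order come from this PR, while the help strings and example values below are purely illustrative.

# Sketch only: counter declarations and label usage as described in this PR.
from prometheus_client import Counter

llm_token_sent_total = Counter(
    "llm_token_sent_total",
    "Tokens sent to the LLM provider",
    ["provider", "model"],
)
llm_token_received_total = Counter(
    "llm_token_received_total",
    "Tokens received from the LLM provider",
    ["provider", "model"],
)

# Per completed turn, the counters are incremented with provider/model labels:
llm_token_sent_total.labels("some_provider", "some_model").inc(128)
llm_token_received_total.labels("some_provider", "some_model").inc(256)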

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.

Call the query or streaming_query endpoint and check the metrics llm_token_received_total and llm_token_sent_total (an illustrative check is sketched after this list).

  • How were the fix/results from this change verified? Please provide relevant screenshots or results.
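
An illustrative manual check of those counters; the host, port, /metrics path, and request payload fields are assumptions, not taken from this repository:

# Sketch of the manual verification described above.
import requests

BASE = "http://localhost:8080"  # assumed host and port

# Trigger one LLM call; the payload shape is illustrative.
requests.post(f"{BASE}/query", json={"query": "Hello"}, timeout=60)

# Scrape the Prometheus endpoint and print the two token counters.
metrics_text = requests.get(f"{BASE}/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if line.startswith(("llm_token_sent_total", "llm_token_received_total")):
        print(line)  # e.g. llm_token_sent_total{provider="...",model="..."} 42.0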

Summary by CodeRabbit

  • Chores

    • Improved token usage metrics with provider attribution across both standard and streaming queries for more accurate reporting.
    • Integrated per-turn token counting without altering user-facing behavior.
  • Tests

    • Added unit tests validating token counting logic.
    • Updated query and streaming endpoint tests to mock metrics updates, ensuring stable and focused test runs.
  • Note

    • No user-facing changes.

@coderabbitai
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Propagates provider_id through query handling, extends retrieve_response signature, and adds per-turn token metrics updates using a new utility. Streaming handler now updates token metrics on turn completion. Tests mock the new metrics calls and add a unit test for the utility.

Changes

Cohort / File(s): Summary

  • Endpoints – Query (src/app/endpoints/query.py): Threads provider_id through model selection and retrieve_response; imports and calls update_llm_token_count_from_turn after the LLM response; updates the retrieve_response signature and docstring.
  • Endpoints – Streaming (src/app/endpoints/streaming_query.py): On turn completion, computes system_prompt and calls update_llm_token_count_from_turn within try/except; logs on failure (see the sketch after this list).
  • Metrics Utility (src/metrics/utils.py): Adds update_llm_token_count_from_turn to count input/output tokens via Tokenizer/ChatFormat and increment metrics with provider/model labels.
  • Tests – Endpoints (tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_streaming_query.py): Introduces a mock_metrics helper to patch metrics updates; applies it across relevant tests.
  • Tests – Metrics (tests/unit/metrics/test_utis.py): Adds a unit test for update_llm_token_count_from_turn with mocked tokenizer/format; asserts sent/received counters with labels.
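
Since the streaming diff itself is not quoted in this thread, here is a minimal sketch of the guarded call described for streaming_query.py; only update_llm_token_count_from_turn and its (turn, model, provider, system_prompt) arguments come from this PR, while the helper name, variable names, and logger setup are assumptions.

# Illustrative sketch, not the actual diff: best-effort token accounting on turn completion.
from log import get_logger  # first-party logger helper referenced elsewhere in this PR
from metrics.utils import update_llm_token_count_from_turn

logger = get_logger(__name__)

def record_turn_tokens(turn, model_id: str, provider_id: str, system_prompt: str) -> None:
    """Update token metrics for a completed turn without ever breaking the stream."""
    try:
        update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
    except Exception:  # pylint: disable=broad-exception-caught
        logger.exception("Failed to update token usage metrics for the turn")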

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant QueryHandler as query_endpoint_handler
  participant Retriever as retrieve_response
  participant LLM as LLM Service
  participant Metrics as Metrics Utils

  Client->>QueryHandler: POST /query (model hint)
  QueryHandler->>QueryHandler: select_model_and_provider_id()
  QueryHandler->>Retriever: retrieve_response(model_id, provider_id, request, token, ...)
  Retriever->>LLM: Invoke model
  LLM-->>Retriever: LLM response
  Retriever->>Metrics: update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
  Retriever-->>QueryHandler: (summary, transcript)
  QueryHandler-->>Client: 200 OK (response)
sequenceDiagram
  autonumber
  participant Client
  participant StreamHandler as streaming_query_endpoint_handler
  participant Stream as LLM Stream
  participant Metrics as Metrics Utils

  Client->>StreamHandler: POST /streaming_query
  StreamHandler->>Stream: start streaming
  loop token/turns
    Stream-->>StreamHandler: partials
    alt turn_complete
      StreamHandler->>Metrics: update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
      note right of Metrics: Guarded with try/except and logging
    end
  end
  StreamHandler-->>Client: stream closed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • tisnik
  • manstis
  • umago

Poem

I thump my paw—new metrics bloom,
Provider threads now weave their loom.
Tokens counted, turn by turn,
Streams that end, and gauges churn.
In carrot clocks the numbers tick—
A rabbit’s audit, swift and slick. 🥕


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/endpoints/query.py (2)

217-224: Pass raw model_id to retrieve_response; construct provider/model inside the callee.
Currently you pass llama_stack_model_id (provider/model) as model_id. This later becomes the model label in metrics, causing label inconsistency vs llm_calls_total (which uses raw model_id). Pass model_id instead and let retrieve_response build provider/model for agent calls.

Apply:

-        summary, conversation_id = await retrieve_response(
-            client,
-            llama_stack_model_id,
-            query_request,
-            token,
-            mcp_headers=mcp_headers,
-            provider_id=provider_id,
-        )
+        summary, conversation_id = await retrieve_response(
+            client,
+            model_id,  # raw model id for metrics labels
+            query_request,
+            token,
+            mcp_headers=mcp_headers,
+            provider_id=provider_id,
+        )

Follow-up in retrieve_response below to build the provider/model identifier for get_agent.


453-461: Always send provider/model to get_agent by constructing it locally.
After passing raw model_id into retrieve_response, build llama_stack_model_id here.

Apply:

-    agent, conversation_id, session_id = await get_agent(
-        client,
-        model_id,
+    llama_stack_model_id = f"{provider_id}/{model_id}" if provider_id else model_id
+    agent, conversation_id, session_id = await get_agent(
+        client,
+        llama_stack_model_id,
         system_prompt,
         available_input_shields,
         available_output_shields,
         query_request.conversation_id,
         query_request.no_tools or False,
     )
🧹 Nitpick comments (9)
src/metrics/utils.py (1)

62-64: Minor: avoid repeatedly constructing ChatFormat.

Cache a single formatter instance.

+from functools import lru_cache
+
+@lru_cache(maxsize=1)
+def _get_formatter() -> ChatFormat:
+    return ChatFormat(Tokenizer.get_instance())
@@
-    tokenizer = Tokenizer.get_instance()
-    formatter = ChatFormat(tokenizer)
+    formatter = _get_formatter()
src/metrics/__init__.py (1)

45-55: Clean up stale TODOs now that token metrics exist.

Remove or update the LCORE-411 TODO comments to prevent confusion.

tests/unit/app/endpoints/test_query.py (1)

454-455: Deduplicate metrics patching with an autouse fixture.

Replace repeated mock_metrics(...) calls with one autouse fixture to reduce noise.

Example (place at top of this file or in conftest.py):

import pytest

@pytest.fixture(autouse=True)
def _mock_llm_token_metrics(mocker):
    mocker.patch(
        "app.endpoints.query.metrics.update_llm_token_count_from_turn",
        return_value=None,
    )

Also applies to: 486-487, 519-520, 559-559, 609-609, 661-661, 773-773, 829-829, 884-884, 955-955, 1017-1017, 1113-1113, 1351-1351, 1402-1402

tests/unit/metrics/test_utis.py (4)

1-1: Typo in filename: rename to test_utils.py for consistency and discoverability.
Tests targeting metrics/utils should live under a correctly spelled filename to avoid future confusion.

Apply this git mv:

- tests/unit/metrics/test_utis.py
+ tests/unit/metrics/test_utils.py

6-17: Simplify mocking of Async client construction.
You patch get_client twice and rebind mock_client. Create a single AsyncMock and patch once to reduce brittleness.

Apply:

-    mock_client = mocker.patch(
-        "client.AsyncLlamaStackClientHolder.get_client"
-    ).return_value
-    # Make sure the client is an AsyncMock for async methods
-    mock_client = mocker.AsyncMock()
-    mocker.patch(
-        "client.AsyncLlamaStackClientHolder.get_client", return_value=mock_client
-    )
+    mock_client = mocker.AsyncMock()
+    mocker.patch("client.AsyncLlamaStackClientHolder.get_client", return_value=mock_client)

65-76: Make call assertions resilient by checking chained calls on child mocks explicitly.
While assert_has_calls on the parent may pass via mock_calls, asserting on children is clearer and avoids false positives if internal chaining changes.

Example:

-    mock_metric.assert_has_calls(
-        [
-            mocker.call.labels("test_provider-0", "test_model-0"),
-            mocker.call.labels().set(0),
-            mocker.call.labels("default_provider", "default_model"),
-            mocker.call.labels().set(1),
-            mocker.call.labels("test_provider-1", "test_model-1"),
-            mocker.call.labels().set(0),
-        ],
-        any_order=False,  # Order matters here
-    )
+    mock_metric.labels.assert_has_calls(
+        [
+            mocker.call("test_provider-0", "test_model-0"),
+            mocker.call("default_provider", "default_model"),
+            mocker.call("test_provider-1", "test_model-1"),
+        ]
+    )
+    mock_metric.labels.return_value.set.assert_has_calls(
+        [mocker.call(0), mocker.call(1), mocker.call(0)]
+    )

79-124: Great targeted test for token metrics; add minimal assertions on formatter usage.
Validates both sent/received counters. Add light checks to ensure encode_dialog_prompt is called twice with expected shapes.

Optional add:

+    assert mock_formatter.encode_dialog_prompt.call_count == 2
+    out_call, in_call = mock_formatter.encode_dialog_prompt.call_args_list
+    assert isinstance(out_call.args[0], list) and len(out_call.args[0]) == 1
+    assert isinstance(in_call.args[0], list) and len(in_call.args[0]) >= 1
src/app/endpoints/query.py (2)

331-349: Validation may mismatch on identifier shape; compare against both forms.
ModelListResponse.identifier might be either "model" or "provider/model" depending on upstream. Current check requires identifier == provider/model, which can falsely reject. Consider accepting either.

Suggested change:

-    llama_stack_model_id = f"{provider_id}/{model_id}"
+    llama_stack_model_id = f"{provider_id}/{model_id}"
     # Validate that the model_id and provider_id are in the available models
     logger.debug("Searching for model: %s, provider: %s", model_id, provider_id)
-    if not any(
-        m.identifier == llama_stack_model_id and m.provider_id == provider_id
-        for m in models
-    ):
+    if not any(
+        (m.identifier == llama_stack_model_id)
+        or (m.identifier == model_id and m.provider_id == provider_id)
+        for m in models
+    ):

Also, in the “first available LLM” branch, return a consistent first element:

-            model_id = model.identifier
-            provider_id = model.provider_id
-            logger.info("Selected model: %s", model)
-            return model_id, model_id, provider_id
+            model_id = model.identifier
+            provider_id = model.provider_id
+            llama_stack_model_id = f"{provider_id}/{model_id}"
+            logger.info("Selected model: %s", llama_stack_model_id)
+            return llama_stack_model_id, model_id, provider_id

225-236: Consider incrementing llm_calls_total only after a successful agent turn.
You count success before create_turn; failures will also increment failures_total, producing double counting. Increment success after a successful response.

Proposed move:

-        # Update metrics for the LLM call
-        metrics.llm_calls_total.labels(provider_id, model_id).inc()
...
-    response = await agent.create_turn(
+    response = await agent.create_turn(
         ...
     )
+    # Update metrics for the successful LLM call
+    metrics.llm_calls_total.labels(provider_id, model_id).inc()

Please confirm your intended semantics (attempt vs. success) and adjust dashboards accordingly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c51e77 and 9037b3b.

📒 Files selected for processing (7)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/streaming_query.py (1 hunks)
  • src/metrics/__init__.py (1 hunks)
  • src/metrics/utils.py (2 hunks)
  • tests/unit/app/endpoints/test_query.py (16 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
  • tests/unit/metrics/test_utis.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/app/endpoints/streaming_query.py (2)
src/utils/endpoints.py (1)
  • get_system_prompt (53-84)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (58-75)
tests/unit/app/endpoints/test_streaming_query.py (1)
tests/unit/app/endpoints/test_query.py (1)
  • mock_metrics (49-54)
tests/unit/app/endpoints/test_query.py (1)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • mock_metrics (61-66)
src/metrics/__init__.py (1)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (58-75)
tests/unit/metrics/test_utis.py (1)
src/metrics/utils.py (2)
  • setup_model_metrics (18-55)
  • update_llm_token_count_from_turn (58-75)
src/app/endpoints/query.py (1)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (58-75)
🪛 GitHub Actions: Ruff
src/metrics/__init__.py

[error] 9-9: Ruff: F401 'update_llm_token_count_from_turn' imported but unused. Remove the import or re-export the symbol.

🪛 GitHub Actions: Python linter
src/metrics/utils.py

[error] 9-9: Pylint: C0411 - wrong-import-order: third-party import llama_stack_client.types.agents.turn.Turn should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]


[error] 10-10: Pylint: C0411 - wrong-import-order: third-party import llama_stack.models.llama.llama3.tokenizer.Tokenizer should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]


[error] 11-11: Pylint: C0411 - wrong-import-order: third-party import llama_stack.models.llama.llama3.chat_format.ChatFormat should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]


[error] 12-12: Pylint: C0411 - wrong-import-order: third-party import llama_stack.models.llama.datatypes.RawMessage should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]

src/app/endpoints/query.py

[error] 391-391: Pylint: R0913: Too many arguments (6/5) (too-many-arguments). [Command: uv run pylint src tests]


[error] 391-391: Pylint: R0917: Too many positional arguments (6/5) (too-many-positional-arguments). [Command: uv run pylint src tests]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (3)
src/metrics/utils.py (1)

62-64: Plumb model-specific tokenizers or defer to provider/Llama Stack token-count API
Current use of Tokenizer.get_instance() likely defaults to a Llama3 tokenizer and may miscount tokens for other providers. Plumb a tokenizer per (provider, model) or defer to the provider’s token-count API to ensure accurate billing metrics.
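
A minimal sketch of the first option, assuming a hypothetical registry keyed by (provider, model) with a Llama3 fallback; the registry and function names below do not exist in the repository, and the fallback assumes the Llama3 tokenizer's encode(s, bos=..., eos=...) signature.

# Hypothetical per-(provider, model) token-counting registry with a Llama3 fallback.
from typing import Callable, Dict, Tuple

from llama_stack.models.llama.llama3.tokenizer import Tokenizer

_COUNTERS: Dict[Tuple[str, str], Callable[[str], int]] = {}

def register_token_counter(provider: str, model: str, counter: Callable[[str], int]) -> None:
    """Register a provider/model-specific token-counting function."""
    _COUNTERS[(provider, model)] = counter

def count_tokens(provider: str, model: str, text: str) -> int:
    """Use a registered counter when available, otherwise fall back to the Llama3 tokenizer."""
    counter = _COUNTERS.get((provider, model))
    if counter is not None:
        return counter(text)
    return len(Tokenizer.get_instance().encode(text, bos=False, eos=False))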

tests/unit/app/endpoints/test_streaming_query.py (1)

61-67: LGTM: metrics side effects are properly patched for streaming tests.

Keeps tests focused on endpoint behavior.

tests/unit/app/endpoints/test_query.py (1)

49-55: LGTM: central helper to patch metrics for query tests.

Consistent with streaming tests.

Comment on src/metrics/__init__.py (context around the new import):

    Histogram,
)

from .utils import update_llm_token_count_from_turn

⚠️ Potential issue

Fix Ruff F401: mark the re-export via __all__.

Declare __all__ so the import is considered an intentional re-export.

-from .utils import update_llm_token_count_from_turn
+from .utils import update_llm_token_count_from_turn
+
+# Explicit re-exports
+__all__ = (
+    "update_llm_token_count_from_turn",
+    "rest_api_calls_total",
+    "response_duration_seconds",
+    "provider_model_configuration",
+    "llm_calls_total",
+    "llm_calls_failures_total",
+    "llm_calls_validation_errors_total",
+    "llm_token_sent_total",
+    "llm_token_received_total",
+)
🧰 Tools
🪛 GitHub Actions: Ruff

[error] 9-9: Ruff: F401 'update_llm_token_count_from_turn' imported but unused. Remove the import or re-export the symbol.

🤖 Prompt for AI Agents
In src/metrics/__init__.py around line 9, the module imports
update_llm_token_count_from_turn but Ruff flags F401 because it looks like an
unused import; declare an explicit __all__ list that includes
"update_llm_token_count_from_turn" so the import is considered an intentional
public re-export. Add or update the __all__ variable to contain that symbol (and
other public names if needed) so linters recognize the re-export.

Comment on src/metrics/utils.py, lines +58 to +77
def update_llm_token_count_from_turn(
    turn: Turn, model: str, provider: str, system_prompt: str = ""
) -> None:
    """Update the LLM calls metrics from a turn."""
    tokenizer = Tokenizer.get_instance()
    formatter = ChatFormat(tokenizer)

    raw_message = cast(RawMessage, turn.output_message)
    encoded_output = formatter.encode_dialog_prompt([raw_message])
    token_count = len(encoded_output.tokens) if encoded_output.tokens else 0
    metrics.llm_token_received_total.labels(provider, model).inc(token_count)

    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
        list[RawMessage], turn.input_messages
    )
    encoded_input = formatter.encode_dialog_prompt(input_messages)
    token_count = len(encoded_input.tokens) if encoded_input.tokens else 0
    metrics.llm_token_sent_total.labels(provider, model).inc(token_count)

🛠️ Refactor suggestion

⚠️ Potential issue

Runtime type mismatch and incorrect system-prompt role/content in token counting.

  • Casting client messages to RawMessage does not convert shapes; ChatFormat.encode_dialog_prompt will likely raise at runtime.
  • The system prompt should use role="system" and content must be content items, not a bare string.

Apply this refactor to safely convert messages and avoid crashes.

-def update_llm_token_count_from_turn(
-    turn: Turn, model: str, provider: str, system_prompt: str = ""
-) -> None:
-    """Update the LLM calls metrics from a turn."""
-    tokenizer = Tokenizer.get_instance()
-    formatter = ChatFormat(tokenizer)
-
-    raw_message = cast(RawMessage, turn.output_message)
-    encoded_output = formatter.encode_dialog_prompt([raw_message])
-    token_count = len(encoded_output.tokens) if encoded_output.tokens else 0
-    metrics.llm_token_received_total.labels(provider, model).inc(token_count)
-
-    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
-        list[RawMessage], turn.input_messages
-    )
-    encoded_input = formatter.encode_dialog_prompt(input_messages)
-    token_count = len(encoded_input.tokens) if encoded_input.tokens else 0
-    metrics.llm_token_sent_total.labels(provider, model).inc(token_count)
+def update_llm_token_count_from_turn(
+    turn: Turn, model: str, provider: str, system_prompt: str = ""
+) -> None:
+    """Update token metrics for a completed turn."""
+    tokenizer = Tokenizer.get_instance()
+    formatter = ChatFormat(tokenizer)
+
+    # ----- Output tokens (assistant) -----
+    output_tokens = 0
+    if getattr(turn, "output_message", None) is not None:
+        om = turn.output_message  # type: ignore[attr-defined]
+        if isinstance(om.content, str):
+            om_content = [TextContentItem(text=om.content, type="text")]
+        else:
+            om_content = om.content
+        encoded_output = formatter.encode_dialog_prompt(
+            [RawMessage(role=getattr(om, "role", "assistant"), content=om_content)]
+        )
+        output_tokens = len(encoded_output.tokens) if encoded_output.tokens else 0
+    metrics.llm_token_received_total.labels(provider, model).inc(output_tokens)
+
+    # ----- Input tokens (system + user/tool messages) -----
+    input_raw: list[RawMessage] = []
+    if system_prompt:
+        input_raw.append(
+            RawMessage(
+                role="system",
+                content=[TextContentItem(text=system_prompt, type="text")],
+            )
+        )
+    for m in cast(list, getattr(turn, "input_messages", [])):
+        msg_content = m.content
+        if isinstance(msg_content, str):
+            content_items = [TextContentItem(text=msg_content, type="text")]
+        else:
+            content_items = msg_content
+        input_raw.append(RawMessage(role=m.role, content=content_items))
+
+    input_tokens = 0
+    if input_raw:
+        encoded_input = formatter.encode_dialog_prompt(input_raw)
+        input_tokens = len(encoded_input.tokens) if encoded_input.tokens else 0
+    metrics.llm_token_sent_total.labels(provider, model).inc(input_tokens)
🤖 Prompt for AI Agents
In src/metrics/utils.py around lines 58-75, the current code unsafely casts
messages to RawMessage and passes a bare string as the system prompt, causing
runtime shape mismatches for ChatFormat.encode_dialog_prompt; instead,
explicitly build properly-shaped RawMessage objects for encoding: map
turn.input_messages and turn.output_message into RawMessage instances with the
correct role and a content list (do not use direct cast), and construct the
system prompt as a RawMessage with role="system" and content expressed as the
required content-item list (wrap the string into the proper content-item
structure expected by ChatFormat) before calling formatter.encode_dialog_prompt
so token counting won’t raise at runtime.

Signed-off-by: Haoyu Sun <hasun@redhat.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/app/endpoints/query.py (1)

392-400: Resolved: trailing params now keyword-only.
This addresses the earlier R0913/R0917 concern.

🧹 Nitpick comments (2)
src/app/endpoints/query.py (2)

416-421: Docstring clarity: specify model_id format.
Model here is the combined provider/model used by Llama Stack; make that explicit to avoid confusion with the bare model label used in metrics.

@@
-    Parameters:
-        model_id (str): The identifier of the LLM model to use.
-        provider_id (str): The identifier of the LLM provider to use.
+    Parameters:
+        model_id (str): Llama Stack model identifier in "provider/model" form (e.g., "openai/gpt-4.1").
+        provider_id (str): Provider identifier (e.g., "openai"). Keyword-only.

518-521: Keep label normalization; guard metrics to be non-fatal.
Nice job normalizing the model label. Wrap the metrics call so tokenizer/formatter issues never fail the endpoint.

-    # Update token count metrics for the LLM call
-    model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
-    update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+    # Update token count metrics for the LLM call (best-effort)
+    try:
+        model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
+        update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+    except Exception as e:  # noqa: BLE001
+        logger.warning("Token metrics update failed: %s", e, exc_info=True)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9037b3b and 0c5e610.

📒 Files selected for processing (6)
  • src/app/endpoints/query.py (5 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/metrics/utils.py (2 hunks)
  • tests/unit/app/endpoints/test_query.py (16 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
  • tests/unit/metrics/test_utis.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/metrics/test_utis.py
  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_query.py
  • src/metrics/utils.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/app/endpoints/query.py (2)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/models/requests.py (1)
  • QueryRequest (72-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (2)
src/app/endpoints/query.py (2)

27-27: LGTM: metrics utility import is correct.
Direct import keeps call sites explicit and avoids circular deps.


216-223: Good: provider_id plumbed and passed keyword-only.
Call site matches the updated retrieve_response signature.

@raptorsun
Contributor Author

raptorsun commented Sep 3, 2025

The E2E test failure is due to a 10-second timeout when querying the readiness endpoint.

@raptorsun
Contributor Author

E2E test passed again.

Contributor

@tisnik tisnik left a comment


LGTM on my side, thank you

@tisnik tisnik requested review from manstis and umago September 3, 2025 16:14
Contributor

@umago umago left a comment

That looks good! Thank u

@tisnik tisnik merged commit 2c7b365 into lightspeed-core:main Sep 4, 2025
35 of 36 checks passed