
Conversation

@raptorsun
Contributor

@raptorsun raptorsun commented Sep 2, 2025

Description

Record metrics for the number of tokens sent to the LLM provider and the number of tokens received from it.
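
For reference, a minimal sketch of how two such Prometheus counters are typically declared with prometheus_client; the metric names and the (provider, model) label order come from this PR, while the help strings and example values below are purely illustrative.

# Sketch only: counter declarations and label usage as described in this PR.
from prometheus_client import Counter

llm_token_sent_total = Counter(
    "llm_token_sent_total",
    "Tokens sent to the LLM provider",
    ["provider", "model"],
)
llm_token_received_total = Counter(
    "llm_token_received_total",
    "Tokens received from the LLM provider",
    ["provider", "model"],
)

# Per completed turn, the counters are incremented with provider/model labels:
llm_token_sent_total.labels("some_provider", "some_model").inc(128)
llm_token_received_total.labels("some_provider", "some_model").inc(256)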

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.

Call the query or streaming_query endpoint and check the metrics llm_token_received_total and llm_token_sent_total (an illustrative check is sketched after this list).

  • How were the fix/results from this change verified? Please provide relevant screenshots or results.
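
An illustrative manual check of those counters; the host, port, /metrics path, and request payload fields are assumptions, not taken from this repository:

# Sketch of the manual verification described above.
import requests

BASE = "http://localhost:8080"  # assumed host and port

# Trigger one LLM call; the payload shape is illustrative.
requests.post(f"{BASE}/query", json={"query": "Hello"}, timeout=60)

# Scrape the Prometheus endpoint and print the two token counters.
metrics_text = requests.get(f"{BASE}/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if line.startswith(("llm_token_sent_total", "llm_token_received_total")):
        print(line)  # e.g. llm_token_sent_total{provider="...",model="..."} 42.0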

Summary by CodeRabbit

  • Chores

    • Improved token usage metrics with provider attribution across both standard and streaming queries for more accurate reporting.
    • Integrated per-turn token counting without altering user-facing behavior.
  • Tests

    • Added unit tests validating token counting logic.
    • Updated query and streaming endpoint tests to mock metrics updates, ensuring stable and focused test runs.
  • Note

    • No user-facing changes.

@coderabbitai
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Propagates provider_id through query handling, extends retrieve_response signature, and adds per-turn token metrics updates using a new utility. Streaming handler now updates token metrics on turn completion. Tests mock the new metrics calls and add a unit test for the utility.

Changes

Cohort / File(s): Summary

  • Endpoints – Query (src/app/endpoints/query.py): Threads provider_id through model selection and retrieve_response; imports and calls update_llm_token_count_from_turn after the LLM response; updates the retrieve_response signature and docstring.
  • Endpoints – Streaming (src/app/endpoints/streaming_query.py): On turn completion, computes system_prompt and calls update_llm_token_count_from_turn within try/except; logs on failure (see the sketch after this list).
  • Metrics Utility (src/metrics/utils.py): Adds update_llm_token_count_from_turn to count input/output tokens via Tokenizer/ChatFormat and increment metrics with provider/model labels.
  • Tests – Endpoints (tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_streaming_query.py): Introduces a mock_metrics helper to patch metrics updates; applies it across relevant tests.
  • Tests – Metrics (tests/unit/metrics/test_utis.py): Adds a unit test for update_llm_token_count_from_turn with mocked tokenizer/format; asserts sent/received counters with labels.
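
Since the streaming diff itself is not quoted in this thread, here is a minimal sketch of the guarded call described for streaming_query.py; only update_llm_token_count_from_turn and its (turn, model, provider, system_prompt) arguments come from this PR, while the helper name, variable names, and logger setup are assumptions.

# Illustrative sketch, not the actual diff: best-effort token accounting on turn completion.
from log import get_logger  # first-party logger helper referenced elsewhere in this PR
from metrics.utils import update_llm_token_count_from_turn

logger = get_logger(__name__)

def record_turn_tokens(turn, model_id: str, provider_id: str, system_prompt: str) -> None:
    """Update token metrics for a completed turn without ever breaking the stream."""
    try:
        update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
    except Exception:  # pylint: disable=broad-exception-caught
        logger.exception("Failed to update token usage metrics for the turn")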

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant QueryHandler as query_endpoint_handler
  participant Retriever as retrieve_response
  participant LLM as LLM Service
  participant Metrics as Metrics Utils

  Client->>QueryHandler: POST /query (model hint)
  QueryHandler->>QueryHandler: select_model_and_provider_id()
  QueryHandler->>Retriever: retrieve_response(model_id, provider_id, request, token, ...)
  Retriever->>LLM: Invoke model
  LLM-->>Retriever: LLM response
  Retriever->>Metrics: update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
  Retriever-->>QueryHandler: (summary, transcript)
  QueryHandler-->>Client: 200 OK (response)
sequenceDiagram
  autonumber
  participant Client
  participant StreamHandler as streaming_query_endpoint_handler
  participant Stream as LLM Stream
  participant Metrics as Metrics Utils

  Client->>StreamHandler: POST /streaming_query
  StreamHandler->>Stream: start streaming
  loop token/turns
    Stream-->>StreamHandler: partials
    alt turn_complete
      StreamHandler->>Metrics: update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
      note right of Metrics: Guarded with try/except and logging
    end
  end
  StreamHandler-->>Client: stream closed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • tisnik
  • manstis
  • umago

Poem

I thump my paw—new metrics bloom,
Provider threads now weave their loom.
Tokens counted, turn by turn,
Streams that end, and gauges churn.
In carrot clocks the numbers tick—
A rabbit’s audit, swift and slick. 🥕


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/endpoints/query.py (2)

217-224: Pass raw model_id to retrieve_response; construct provider/model inside the callee.
Currently you pass llama_stack_model_id (provider/model) as model_id. This later becomes the model label in metrics, causing label inconsistency vs llm_calls_total (which uses raw model_id). Pass model_id instead and let retrieve_response build provider/model for agent calls.

Apply:

-        summary, conversation_id = await retrieve_response(
-            client,
-            llama_stack_model_id,
-            query_request,
-            token,
-            mcp_headers=mcp_headers,
-            provider_id=provider_id,
-        )
+        summary, conversation_id = await retrieve_response(
+            client,
+            model_id,  # raw model id for metrics labels
+            query_request,
+            token,
+            mcp_headers=mcp_headers,
+            provider_id=provider_id,
+        )

Follow-up in retrieve_response below to build the provider/model identifier for get_agent.


453-461: Always send provider/model to get_agent by constructing it locally.
After passing raw model_id into retrieve_response, build llama_stack_model_id here.

Apply:

-    agent, conversation_id, session_id = await get_agent(
-        client,
-        model_id,
+    llama_stack_model_id = f"{provider_id}/{model_id}" if provider_id else model_id
+    agent, conversation_id, session_id = await get_agent(
+        client,
+        llama_stack_model_id,
         system_prompt,
         available_input_shields,
         available_output_shields,
         query_request.conversation_id,
         query_request.no_tools or False,
     )
🧹 Nitpick comments (9)
src/metrics/utils.py (1)

62-64: Minor: avoid repeatedly constructing ChatFormat.

Cache a single formatter instance.

+from functools import lru_cache
+
+@lru_cache(maxsize=1)
+def _get_formatter() -> ChatFormat:
+    return ChatFormat(Tokenizer.get_instance())
@@
-    tokenizer = Tokenizer.get_instance()
-    formatter = ChatFormat(tokenizer)
+    formatter = _get_formatter()
src/metrics/__init__.py (1)

45-55: Clean up stale TODOs now that token metrics exist.

Remove or update the LCORE-411 TODO comments to prevent confusion.

tests/unit/app/endpoints/test_query.py (1)

454-455: Deduplicate metrics patching with an autouse fixture.

Replace repeated mock_metrics(...) calls with one autouse fixture to reduce noise.

Example (place at top of this file or in conftest.py):

import pytest

@pytest.fixture(autouse=True)
def _mock_llm_token_metrics(mocker):
    mocker.patch(
        "app.endpoints.query.metrics.update_llm_token_count_from_turn",
        return_value=None,
    )

Also applies to: 486-487, 519-520, 559-559, 609-609, 661-661, 773-773, 829-829, 884-884, 955-955, 1017-1017, 1113-1113, 1351-1351, 1402-1402

tests/unit/metrics/test_utis.py (4)

1-1: Typo in filename: rename to test_utils.py for consistency and discoverability.
Tests targeting metrics/utils should live under a correctly spelled filename to avoid future confusion.

Apply this git mv:

- tests/unit/metrics/test_utis.py
+ tests/unit/metrics/test_utils.py

6-17: Simplify mocking of Async client construction.
You patch get_client twice and rebind mock_client. Create a single AsyncMock and patch once to reduce brittleness.

Apply:

-    mock_client = mocker.patch(
-        "client.AsyncLlamaStackClientHolder.get_client"
-    ).return_value
-    # Make sure the client is an AsyncMock for async methods
-    mock_client = mocker.AsyncMock()
-    mocker.patch(
-        "client.AsyncLlamaStackClientHolder.get_client", return_value=mock_client
-    )
+    mock_client = mocker.AsyncMock()
+    mocker.patch("client.AsyncLlamaStackClientHolder.get_client", return_value=mock_client)

65-76: Make call assertions resilient by checking chained calls on child mocks explicitly.
While assert_has_calls on the parent may pass via mock_calls, asserting on children is clearer and avoids false positives if internal chaining changes.

Example:

-    mock_metric.assert_has_calls(
-        [
-            mocker.call.labels("test_provider-0", "test_model-0"),
-            mocker.call.labels().set(0),
-            mocker.call.labels("default_provider", "default_model"),
-            mocker.call.labels().set(1),
-            mocker.call.labels("test_provider-1", "test_model-1"),
-            mocker.call.labels().set(0),
-        ],
-        any_order=False,  # Order matters here
-    )
+    mock_metric.labels.assert_has_calls(
+        [
+            mocker.call("test_provider-0", "test_model-0"),
+            mocker.call("default_provider", "default_model"),
+            mocker.call("test_provider-1", "test_model-1"),
+        ]
+    )
+    mock_metric.labels.return_value.set.assert_has_calls(
+        [mocker.call(0), mocker.call(1), mocker.call(0)]
+    )

79-124: Great targeted test for token metrics; add minimal assertions on formatter usage.
Validates both sent/received counters. Add light checks to ensure encode_dialog_prompt is called twice with expected shapes.

Optional add:

+    assert mock_formatter.encode_dialog_prompt.call_count == 2
+    out_call, in_call = mock_formatter.encode_dialog_prompt.call_args_list
+    assert isinstance(out_call.args[0], list) and len(out_call.args[0]) == 1
+    assert isinstance(in_call.args[0], list) and len(in_call.args[0]) >= 1
src/app/endpoints/query.py (2)

331-349: Validation may mismatch on identifier shape; compare against both forms.
ModelListResponse.identifier might be either "model" or "provider/model" depending on upstream. Current check requires identifier == provider/model, which can falsely reject. Consider accepting either.

Suggested change:

-    llama_stack_model_id = f"{provider_id}/{model_id}"
+    llama_stack_model_id = f"{provider_id}/{model_id}"
     # Validate that the model_id and provider_id are in the available models
     logger.debug("Searching for model: %s, provider: %s", model_id, provider_id)
-    if not any(
-        m.identifier == llama_stack_model_id and m.provider_id == provider_id
-        for m in models
-    ):
+    if not any(
+        (m.identifier == llama_stack_model_id)
+        or (m.identifier == model_id and m.provider_id == provider_id)
+        for m in models
+    ):

Also, in the “first available LLM” branch, return a consistent first element:

-            model_id = model.identifier
-            provider_id = model.provider_id
-            logger.info("Selected model: %s", model)
-            return model_id, model_id, provider_id
+            model_id = model.identifier
+            provider_id = model.provider_id
+            llama_stack_model_id = f"{provider_id}/{model_id}"
+            logger.info("Selected model: %s", llama_stack_model_id)
+            return llama_stack_model_id, model_id, provider_id

225-236: Consider incrementing llm_calls_total only after a successful agent turn.
You count success before create_turn; failures will also increment failures_total, producing double counting. Increment success after a successful response.

Proposed move:

-        # Update metrics for the LLM call
-        metrics.llm_calls_total.labels(provider_id, model_id).inc()
...
-    response = await agent.create_turn(
+    response = await agent.create_turn(
         ...
     )
+    # Update metrics for the successful LLM call
+    metrics.llm_calls_total.labels(provider_id, model_id).inc()

Please confirm your intended semantics (attempt vs. success) and adjust dashboards accordingly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c51e77 and 9037b3b.

📒 Files selected for processing (7)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/streaming_query.py (1 hunks)
  • src/metrics/__init__.py (1 hunks)
  • src/metrics/utils.py (2 hunks)
  • tests/unit/app/endpoints/test_query.py (16 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
  • tests/unit/metrics/test_utis.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/app/endpoints/streaming_query.py (2)
src/utils/endpoints.py (1)
  • get_system_prompt (53-84)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (58-75)
tests/unit/app/endpoints/test_streaming_query.py (1)
tests/unit/app/endpoints/test_query.py (1)
  • mock_metrics (49-54)
tests/unit/app/endpoints/test_query.py (1)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • mock_metrics (61-66)
src/metrics/__init__.py (1)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (58-75)
tests/unit/metrics/test_utis.py (1)
src/metrics/utils.py (2)
  • setup_model_metrics (18-55)
  • update_llm_token_count_from_turn (58-75)
src/app/endpoints/query.py (1)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (58-75)
🪛 GitHub Actions: Ruff
src/metrics/__init__.py

[error] 9-9: Ruff: F401 'update_llm_token_count_from_turn' imported but unused. Remove the import or re-export the symbol.

🪛 GitHub Actions: Python linter
src/metrics/utils.py

[error] 9-9: Pylint: C0411 - wrong-import-order: third-party import llama_stack_client.types.agents.turn.Turn should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]


[error] 10-10: Pylint: C0411 - wrong-import-order: third-party import llama_stack.models.llama.llama3.tokenizer.Tokenizer should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]


[error] 11-11: Pylint: C0411 - wrong-import-order: third-party import llama_stack.models.llama.llama3.chat_format.ChatFormat should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]


[error] 12-12: Pylint: C0411 - wrong-import-order: third-party import llama_stack.models.llama.datatypes.RawMessage should be placed before first-party imports (configuration.configuration, client.AsyncLlamaStackClientHolder, log.get_logger, metrics, utils.common.run_once_async). [Command: uv run pylint src tests]

src/app/endpoints/query.py

[error] 391-391: Pylint: R0913: Too many arguments (6/5) (too-many-arguments). [Command: uv run pylint src tests]


[error] 391-391: Pylint: R0917: Too many positional arguments (6/5) (too-many-positional-arguments). [Command: uv run pylint src tests]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (3)
src/metrics/utils.py (1)

62-64: Plumb model-specific tokenizers or defer to provider/Llama Stack token-count API
Current use of Tokenizer.get_instance() likely defaults to a Llama3 tokenizer and may miscount tokens for other providers. Plumb a tokenizer per (provider, model) or defer to the provider’s token-count API to ensure accurate billing metrics.
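
A minimal sketch of the first option, assuming a hypothetical registry keyed by (provider, model) with a Llama3 fallback; the registry and function names below do not exist in the repository, and the fallback assumes the Llama3 tokenizer's encode(s, bos=..., eos=...) signature.

# Hypothetical per-(provider, model) token-counting registry with a Llama3 fallback.
from typing import Callable, Dict, Tuple

from llama_stack.models.llama.llama3.tokenizer import Tokenizer

_COUNTERS: Dict[Tuple[str, str], Callable[[str], int]] = {}

def register_token_counter(provider: str, model: str, counter: Callable[[str], int]) -> None:
    """Register a provider/model-specific token-counting function."""
    _COUNTERS[(provider, model)] = counter

def count_tokens(provider: str, model: str, text: str) -> int:
    """Use a registered counter when available, otherwise fall back to the Llama3 tokenizer."""
    counter = _COUNTERS.get((provider, model))
    if counter is not None:
        return counter(text)
    return len(Tokenizer.get_instance().encode(text, bos=False, eos=False))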

tests/unit/app/endpoints/test_streaming_query.py (1)

61-67: LGTM: metrics side effects are properly patched for streaming tests.

Keeps tests focused on endpoint behavior.

tests/unit/app/endpoints/test_query.py (1)

49-55: LGTM: central helper to patch metrics for query tests.

Consistent with streaming tests.

Comment on src/metrics/__init__.py (context around the new import):

    Histogram,
)

from .utils import update_llm_token_count_from_turn

⚠️ Potential issue

Fix Ruff F401: mark the re-export via __all__.

Declare __all__ so the import is considered an intentional re-export.

-from .utils import update_llm_token_count_from_turn
+from .utils import update_llm_token_count_from_turn
+
+# Explicit re-exports
+__all__ = (
+    "update_llm_token_count_from_turn",
+    "rest_api_calls_total",
+    "response_duration_seconds",
+    "provider_model_configuration",
+    "llm_calls_total",
+    "llm_calls_failures_total",
+    "llm_calls_validation_errors_total",
+    "llm_token_sent_total",
+    "llm_token_received_total",
+)
🧰 Tools
🪛 GitHub Actions: Ruff

[error] 9-9: Ruff: F401 'update_llm_token_count_from_turn' imported but unused. Remove the import or re-export the symbol.

🤖 Prompt for AI Agents
In src/metrics/__init__.py around line 9, the module imports
update_llm_token_count_from_turn but Ruff flags F401 because it looks like an
unused import; declare an explicit __all__ list that includes
"update_llm_token_count_from_turn" so the import is considered an intentional
public re-export. Add or update the __all__ variable to contain that symbol (and
other public names if needed) so linters recognize the re-export.

Comment on src/metrics/utils.py, lines +58 to +77
def update_llm_token_count_from_turn(
    turn: Turn, model: str, provider: str, system_prompt: str = ""
) -> None:
    """Update the LLM calls metrics from a turn."""
    tokenizer = Tokenizer.get_instance()
    formatter = ChatFormat(tokenizer)

    raw_message = cast(RawMessage, turn.output_message)
    encoded_output = formatter.encode_dialog_prompt([raw_message])
    token_count = len(encoded_output.tokens) if encoded_output.tokens else 0
    metrics.llm_token_received_total.labels(provider, model).inc(token_count)

    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
        list[RawMessage], turn.input_messages
    )
    encoded_input = formatter.encode_dialog_prompt(input_messages)
    token_count = len(encoded_input.tokens) if encoded_input.tokens else 0
    metrics.llm_token_sent_total.labels(provider, model).inc(token_count)

🛠️ Refactor suggestion

⚠️ Potential issue

Runtime type mismatch and incorrect system-prompt role/content in token counting.

  • Casting client messages to RawMessage does not convert shapes; ChatFormat.encode_dialog_prompt will likely raise at runtime.
  • The system prompt should use role="system" and content must be content items, not a bare string.

Apply this refactor to safely convert messages and avoid crashes.

-def update_llm_token_count_from_turn(
-    turn: Turn, model: str, provider: str, system_prompt: str = ""
-) -> None:
-    """Update the LLM calls metrics from a turn."""
-    tokenizer = Tokenizer.get_instance()
-    formatter = ChatFormat(tokenizer)
-
-    raw_message = cast(RawMessage, turn.output_message)
-    encoded_output = formatter.encode_dialog_prompt([raw_message])
-    token_count = len(encoded_output.tokens) if encoded_output.tokens else 0
-    metrics.llm_token_received_total.labels(provider, model).inc(token_count)
-
-    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
-        list[RawMessage], turn.input_messages
-    )
-    encoded_input = formatter.encode_dialog_prompt(input_messages)
-    token_count = len(encoded_input.tokens) if encoded_input.tokens else 0
-    metrics.llm_token_sent_total.labels(provider, model).inc(token_count)
+def update_llm_token_count_from_turn(
+    turn: Turn, model: str, provider: str, system_prompt: str = ""
+) -> None:
+    """Update token metrics for a completed turn."""
+    tokenizer = Tokenizer.get_instance()
+    formatter = ChatFormat(tokenizer)
+
+    # ----- Output tokens (assistant) -----
+    output_tokens = 0
+    if getattr(turn, "output_message", None) is not None:
+        om = turn.output_message  # type: ignore[attr-defined]
+        if isinstance(om.content, str):
+            om_content = [TextContentItem(text=om.content, type="text")]
+        else:
+            om_content = om.content
+        encoded_output = formatter.encode_dialog_prompt(
+            [RawMessage(role=getattr(om, "role", "assistant"), content=om_content)]
+        )
+        output_tokens = len(encoded_output.tokens) if encoded_output.tokens else 0
+    metrics.llm_token_received_total.labels(provider, model).inc(output_tokens)
+
+    # ----- Input tokens (system + user/tool messages) -----
+    input_raw: list[RawMessage] = []
+    if system_prompt:
+        input_raw.append(
+            RawMessage(
+                role="system",
+                content=[TextContentItem(text=system_prompt, type="text")],
+            )
+        )
+    for m in cast(list, getattr(turn, "input_messages", [])):
+        msg_content = m.content
+        if isinstance(msg_content, str):
+            content_items = [TextContentItem(text=msg_content, type="text")]
+        else:
+            content_items = msg_content
+        input_raw.append(RawMessage(role=m.role, content=content_items))
+
+    input_tokens = 0
+    if input_raw:
+        encoded_input = formatter.encode_dialog_prompt(input_raw)
+        input_tokens = len(encoded_input.tokens) if encoded_input.tokens else 0
+    metrics.llm_token_sent_total.labels(provider, model).inc(input_tokens)
🤖 Prompt for AI Agents
In src/metrics/utils.py around lines 58-75, the current code unsafely casts
messages to RawMessage and passes a bare string as the system prompt, causing
runtime shape mismatches for ChatFormat.encode_dialog_prompt; instead,
explicitly build properly-shaped RawMessage objects for encoding: map
turn.input_messages and turn.output_message into RawMessage instances with the
correct role and a content list (do not use direct cast), and construct the
system prompt as a RawMessage with role="system" and content expressed as the
required content-item list (wrap the string into the proper content-item
structure expected by ChatFormat) before calling formatter.encode_dialog_prompt
so token counting won’t raise at runtime.

Signed-off-by: Haoyu Sun <hasun@redhat.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/app/endpoints/query.py (1)

392-400: Resolved: trailing params now keyword-only.
This addresses the earlier R0913/R0917 concern.

🧹 Nitpick comments (2)
src/app/endpoints/query.py (2)

416-421: Docstring clarity: specify model_id format.
Model here is the combined provider/model used by Llama Stack; make that explicit to avoid confusion with the bare model label used in metrics.

@@
-    Parameters:
-        model_id (str): The identifier of the LLM model to use.
-        provider_id (str): The identifier of the LLM provider to use.
+    Parameters:
+        model_id (str): Llama Stack model identifier in "provider/model" form (e.g., "openai/gpt-4.1").
+        provider_id (str): Provider identifier (e.g., "openai"). Keyword-only.

518-521: Keep label normalization; guard metrics to be non-fatal.
Nice job normalizing the model label. Wrap the metrics call so tokenizer/formatter issues never fail the endpoint.

-    # Update token count metrics for the LLM call
-    model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
-    update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+    # Update token count metrics for the LLM call (best-effort)
+    try:
+        model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
+        update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+    except Exception as e:  # noqa: BLE001
+        logger.warning("Token metrics update failed: %s", e, exc_info=True)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9037b3b and 0c5e610.

📒 Files selected for processing (6)
  • src/app/endpoints/query.py (5 hunks)
  • src/app/endpoints/streaming_query.py (2 hunks)
  • src/metrics/utils.py (2 hunks)
  • tests/unit/app/endpoints/test_query.py (16 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
  • tests/unit/metrics/test_utis.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/metrics/test_utis.py
  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_query.py
  • src/metrics/utils.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/app/endpoints/query.py (2)
src/metrics/utils.py (1)
  • update_llm_token_count_from_turn (60-77)
src/models/requests.py (1)
  • QueryRequest (72-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (2)
src/app/endpoints/query.py (2)

27-27: LGTM: metrics utility import is correct.
Direct import keeps call sites explicit and avoids circular deps.


216-223: Good: provider_id plumbed and passed keyword-only.
Call site matches the updated retrieve_response signature.

@raptorsun
Contributor Author

raptorsun commented Sep 3, 2025

The E2E test failure is due to a 10-second timeout when querying the readiness endpoint.

@raptorsun
Contributor Author

E2E test passed again.

Contributor

@tisnik tisnik left a comment


LGTM on my side, thank you

@tisnik tisnik requested review from manstis and umago September 3, 2025 16:14
Contributor

@umago umago left a comment

That looks good! Thank u

@tisnik tisnik merged commit 2c7b365 into lightspeed-core:main Sep 4, 2025
35 of 36 checks passed