record service tier for anthropic and openai #218

dinmukhamedm · 2025-11-18T16:47:59Z

Note

Record service_tier for OpenAI and Anthropic requests/responses (including streaming), update utils, and refresh tests/cassettes accordingly.

Instrumentation – Anthropic:
- Record anthropic.request.service_tier and anthropic.response.service_tier in spans for sync/async and streaming paths (span_utils.py, streaming.py).
- Aggregate service_tier from stream events and attach to spans; continue setting token usage and completion content.
Instrumentation – OpenAI:
- Capture openai.request.service_tier and openai.response.service_tier in chat and responses APIs (request/response and streaming) via shared handlers and chat wrappers.
- Extend TracedData with request_service_tier/response_service_tier; set_data_attributes writes them to span attributes.
Utils:
- Replace asyncio.iscoroutinefunction with inspect.iscoroutinefunction in decorators/helpers to detect coroutines reliably.
Tests & Cassettes:
- Update chat/responses tests to assert service_tier attributes and switch models where needed; add async responses test.
- Refresh VCR cassettes to include service_tier in requests/responses and new SDK versions/IDs.

^{Written by Cursor Bugbot for commit 6412f54. This will update automatically on new commits. Configure here.}

Important

This PR records service_tier for Anthropic and OpenAI requests and responses, updating span attributes and tests accordingly.

Instrumentation:
- Anthropic:
  - Set anthropic.request.service_tier from request kwargs in span_utils.py.
  - Capture anthropic.response.service_tier from usage/stream events in span_utils.py and streaming.py.
  - Streaming builders aggregate service_tier and response id into spans in streaming.py.
- OpenAI:
  - Set openai.request.service_tier on requests in chat_wrappers.py.
  - Set openai.response.service_tier from responses and stream chunks in chat_wrappers.py and responses_wrappers.py.
  - TracedData stores request/response service_tier in responses_wrappers.py.
Utilities:
- Replace asyncio.iscoroutinefunction with inspect.iscoroutinefunction in utils for robustness.
Tests & Cassettes:
- Update expectations and VCRs for service_tier fields across Anthropic/OpenAI tests.
- Add async responses test in test_responses.py.

^{This description was created by}^{for 6412f54. You can customize this summary. It will automatically update as commits are pushed.}

cursor · 2025-11-18T16:49:28Z

src/lmnr/opentelemetry_lib/opentelemetry/instrumentation/anthropic/streaming.py

        _process_response_item(item, complete_response)

    set_span_attribute(span, GEN_AI_RESPONSE_ID, complete_response.get("id"))
+    set_span_attribute(


Bug: Async/Sync Divergence: Service Tier

The abuild_from_streaming_response function initializes complete_response dictionary without the "service_tier" key (line 257), while the synchronous build_from_streaming_response function properly includes it (line 168-175). This inconsistency breaks parity between sync and async implementations and could cause the service tier information to not be properly initialized when accessed in line 268-270.

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to 27bb1ba in 3 minutes and 10 seconds. Click for details.

Reviewed 3279 lines of code in 21 files
Skipped 0 files when reviewing.
Skipped posting 10 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:1

Draft comment:
The tests now include verification for the new 'service_tier' attribute. All assertions for both sync and async responses correctly check that 'openai.request.service_tier' and 'openai.response.service_tier' are set to 'default'. Good integration of added functionality.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

2. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:124

Draft comment:
Test for tool calls now validates both legacy (llm.request.functions) and new (gen_ai.prompt) attributes. Ensure consistency between these keys in instrumentation.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

3. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:206

Draft comment:
The reasoning tests correctly check for serialization of reasoning dicts. The assertion that the reasoning attribute is a valid JSON string (and then parsed) is a good practice.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

4. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:1300

Draft comment:
Overall, tests maintain comprehensive coverage of different response scenarios, including history, tool calls, and reasoning. No typos or logical bugs detected.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

5. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_anthropic_message_streaming_legacy.yaml:58

Draft comment:
Typo detected: "helpe" should likely be "helped" in the phrase "Because it helpe d them".
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The comment is flagging "helpe" as a typo, but this is a cassette file recording streaming API responses. In streaming responses, words are often split across multiple chunks/deltas. Looking at lines 57-58, I can see "it helpe" on line 57 ends with "}" and then line 58 starts with "d them", which means the full word is "helped" - it's just split across two streaming chunks. This is not a typo at all, it's the expected behavior of streaming responses. The comment is incorrect because it doesn't understand the context of streaming API responses. Could there be a legitimate issue where the API actually returned malformed text? However, this seems unlikely given that this is a test cassette file that presumably passed tests, and the streaming format clearly shows the word is split across chunks intentionally. While it's theoretically possible the API returned malformed text, the structure of the streaming response clearly shows this is intentional chunking. The text "helpe" at the end of one chunk followed by "d" at the start of the next chunk is exactly how streaming works. This is not a code issue that needs fixing. This comment should be deleted. It's a false positive that misunderstands how streaming API responses work. The word "helped" is correctly split across two streaming chunks, which is normal and expected behavior. There is no typo to fix.

6. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_anthropic_message_streaming_legacy.yaml:87

Draft comment:
Possible error in word splitting: the word appears as "special" on the previous delta and "izes in!)" on this line. Should it be corrected to "specializes in!"?
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 5% vs. threshold = 50% This is a cassette file used for testing - it records actual API responses for replay in tests. The word splitting across deltas is not an error; it's how the Anthropic streaming API actually returned the data. The test is meant to verify that the code correctly handles streaming responses where words can be split across chunks. The comment fundamentally misunderstands what this file is - it's not code to be corrected, it's a recording of actual API behavior. There's no action to take here because you can't "fix" how an external API chunks its streaming responses. This comment should definitely be deleted. Could this be a generated test fixture that someone manually created and made a typo in? Maybe the test framework allows editing these cassettes and the comment is suggesting a legitimate fix to make the test data cleaner? Even if cassettes can be manually edited, the purpose of a cassette is to faithfully record actual API responses for deterministic test replay. Editing the response to make words not split across chunks would make it less representative of real API behavior. The test should handle word splitting since that's what happens in production. This comment should be deleted. It misunderstands that this is a test cassette recording actual API streaming behavior, where word splitting across chunks is expected and correct. There's no error to fix.

7. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_async_anthropic_message_stream_manager.yaml:57

Draft comment:
It looks like the text fragments in the sentence starting with "Why di" (lines 56–58) are mis-split. In particular, the parts "di" and "d the distribute" seem intended to form "did the distributed". Please verify the intended wording.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This is a test cassette file that records actual API responses from Anthropic's streaming API. The "mis-splitting" the comment refers to is actually the expected behavior of a streaming API - text is sent in chunks as it's generated. The developer didn't write this text or control how it's split; this is recorded data from an actual API call. The comment is asking the developer to "verify the intended wording" but there's nothing to verify or fix here - this is just how the streaming API works. The comment is not about a code change that needs to be made, it's misunderstanding the nature of this test fixture file. Could this be a legitimate concern about test data quality? Maybe the cassette was corrupted during recording and the developer should re-record it? However, the text does form coherent sentences when combined, and this is exactly how streaming APIs work. Even if there were concerns about test data quality, the comment asks to "verify the intended wording" which is not actionable - the developer can't change how the API streams responses. The text is coherent when assembled ("Why did the distributed system..."), so there's no actual issue. This comment doesn't suggest any concrete action. This comment should be deleted. It misunderstands the nature of streaming API responses in test cassettes. The text splitting is expected behavior, not a bug. The comment asks for verification but provides no actionable feedback.

8. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_async_anthropic_message_stream_manager.yaml:69

Draft comment:
The phrase "bit distribute" appears to be a typographical error. It likely should be "bit distraught".
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a cassette file that records actual API responses for testing. The text "bit distribute" + "d!" = "bit distributed!" which is the intended joke punchline about distributed systems. The comment incorrectly assumes this is a typo when it's actually the correct joke. Additionally, this is recorded API response data from Anthropic's API - it's not code that the PR author wrote or can change. The PR author cannot fix typos in API responses. This comment is fundamentally misunderstanding what this file is and is incorrect about the content being a typo. Could the API have actually made a typo and the comment is trying to be helpful? However, even if it were a typo, this is recorded response data from an external API, not something the PR author can or should change in the cassette file. Even if this were a typo (which it isn't - "distributed" is the correct punchline), cassette files record actual API responses and should not be manually edited to "fix" the content of those responses. The comment is both incorrect about the typo and inappropriate for this type of file. This comment should be deleted. It incorrectly identifies "bit distributed" as a typo when it's actually the intended joke punchline. More importantly, this is a test cassette file recording actual API responses, which should not be manually edited to change response content.

9. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_async_anthropic_message_stream_manager.yaml:89

Draft comment:
The split text fragments "diagn" and "ose performance" should be combined into the word "diagnose".
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a cassette file used for recording/replaying HTTP interactions in tests (likely using VCR or similar). The file shows a streaming response where text is delivered in chunks via server-sent events. The word "diagnose" being split across "diagn" and "ose" is not a bug - it's the actual behavior of the streaming API, where words can be split across multiple delta events. The cassette file is meant to faithfully record what the API returned, not to present clean, combined text. Modifying this would make the test inaccurate. The comment fundamentally misunderstands the purpose of this file. Could this be a generated file that should have some post-processing applied? Maybe there's a normalization step that's expected? However, cassette files are typically meant to be exact recordings of API responses. Even if there were post-processing, cassette files are specifically designed to record the exact API response for replay in tests. Modifying the response would defeat the purpose of having accurate test fixtures. The split text is the actual API behavior being tested. This comment should be deleted. It misunderstands the purpose of cassette files, which are meant to record exact API responses. The split text is the actual streaming behavior from the Anthropic API, not a bug to fix.

10. tests/test_instrumentations/test_openai/traces/cassettes/test_responses/test_responses_async.yaml:26

Draft comment:
Typo: The operating system should be written as 'macOS' (with a lowercase 'm') instead of 'MacOS'.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a cassette file - a recorded HTTP interaction used for testing. The "MacOS" value comes from the actual HTTP headers sent by the OpenAI Python client library (AsyncOpenAI/Python 2.7.2), not from code written by the PR author. This is recorded data from a real API interaction. The PR author didn't write "MacOS" - they just recorded what the library sent. Changing this would make the cassette file inaccurate and potentially break tests that rely on exact matching. This comment violates the rule about not commenting on things that are clearly not actionable code changes. Could this be a manually created test fixture where the author did write "MacOS" and should correct it? Maybe the testing framework allows editing these files and expects them to be accurate? Even if cassette files can be manually edited, they are meant to record actual API interactions faithfully. The value "MacOS" came from the OpenAI client library's headers, not from the PR author's code. Correcting a "typo" in recorded data would be incorrect - it would make the test fixture inaccurate. The comment is not actionable. This comment should be deleted. It's commenting on recorded HTTP header data from an external library, not on code written by the PR author. The "MacOS" value is what was actually sent by the OpenAI client, and changing it would make the test fixture inaccurate.

Workflow ID: wflow_syMBUyJHLGk71gX2

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

ellipsis-dev

Important

Looks good to me! 👍

Reviewed 6412f54 in 1 minute and 9 seconds. Click for details.

Reviewed 19 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. src/lmnr/opentelemetry_lib/opentelemetry/instrumentation/anthropic/streaming.py:257

Draft comment:
Added explicit 'service_tier': None in complete_response initialization for the async builder. This aligns with the sync version and ensures consistent span attribute handling. Verify that downstream code correctly handles a None value.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is informative and asks the PR author to verify that downstream code handles a None value correctly. This violates the rule against asking the author to ensure behavior is intended or tested. The comment does not provide a specific code suggestion or ask for a specific test to be written.

Workflow ID: wflow_BvfctqRvc2VBx4zj

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

record service tier for anthropic and openai

27bb1ba

cursor bot reviewed Nov 18, 2025

View reviewed changes

ellipsis-dev bot reviewed Nov 18, 2025

View reviewed changes

minor fix: init complete response

6412f54

ellipsis-dev bot reviewed Nov 18, 2025

View reviewed changes

dinmukhamedm merged commit d35c357 into main Nov 18, 2025
8 checks passed

dinmukhamedm deleted the instr/service-tier branch November 18, 2025 18:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

record service tier for anthropic and openai #218

record service tier for anthropic and openai #218

Uh oh!

dinmukhamedm commented Nov 18, 2025 •

edited by ellipsis-dev bot

Loading

Uh oh!

cursor bot Nov 18, 2025

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

record service tier for anthropic and openai #218

record service tier for anthropic and openai #218

Uh oh!

Conversation

dinmukhamedm commented Nov 18, 2025 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor bot Nov 18, 2025

Choose a reason for hiding this comment

Bug: Async/Sync Divergence: Service Tier

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dinmukhamedm commented Nov 18, 2025 •

edited by ellipsis-dev bot

Loading