-
Notifications
You must be signed in to change notification settings - Fork 9
record service tier for anthropic and openai #218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| _process_response_item(item, complete_response) | ||
|
|
||
| set_span_attribute(span, GEN_AI_RESPONSE_ID, complete_response.get("id")) | ||
| set_span_attribute( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Async/Sync Divergence: Service Tier
The abuild_from_streaming_response function initializes complete_response dictionary without the "service_tier" key (line 257), while the synchronous build_from_streaming_response function properly includes it (line 168-175). This inconsistency breaks parity between sync and async implementations and could cause the service tier information to not be properly initialized when accessed in line 268-270.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important
Looks good to me! 👍
Reviewed everything up to 27bb1ba in 3 minutes and 10 seconds. Click for details.
- Reviewed
3279lines of code in21files - Skipped
0files when reviewing. - Skipped posting
10draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:1
- Draft comment:
The tests now include verification for the new 'service_tier' attribute. All assertions for both sync and async responses correctly check that 'openai.request.service_tier' and 'openai.response.service_tier' are set to 'default'. Good integration of added functionality. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:124
- Draft comment:
Test for tool calls now validates both legacy (llm.request.functions) and new (gen_ai.prompt) attributes. Ensure consistency between these keys in instrumentation. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:206
- Draft comment:
The reasoning tests correctly check for serialization of reasoning dicts. The assertion that the reasoning attribute is a valid JSON string (and then parsed) is a good practice. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. tests/test_instrumentations/test_openai/traces/test_responses/test_responses.py:1300
- Draft comment:
Overall, tests maintain comprehensive coverage of different response scenarios, including history, tool calls, and reasoning. No typos or logical bugs detected. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_anthropic_message_streaming_legacy.yaml:58
- Draft comment:
Typo detected: "helpe" should likely be "helped" in the phrase "Because it helpe d them". - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The comment is flagging "helpe" as a typo, but this is a cassette file recording streaming API responses. In streaming responses, words are often split across multiple chunks/deltas. Looking at lines 57-58, I can see "it helpe" on line 57 ends with "}" and then line 58 starts with "d them", which means the full word is "helped" - it's just split across two streaming chunks. This is not a typo at all, it's the expected behavior of streaming responses. The comment is incorrect because it doesn't understand the context of streaming API responses. Could there be a legitimate issue where the API actually returned malformed text? However, this seems unlikely given that this is a test cassette file that presumably passed tests, and the streaming format clearly shows the word is split across chunks intentionally. While it's theoretically possible the API returned malformed text, the structure of the streaming response clearly shows this is intentional chunking. The text "helpe" at the end of one chunk followed by "d" at the start of the next chunk is exactly how streaming works. This is not a code issue that needs fixing. This comment should be deleted. It's a false positive that misunderstands how streaming API responses work. The word "helped" is correctly split across two streaming chunks, which is normal and expected behavior. There is no typo to fix.
6. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_anthropic_message_streaming_legacy.yaml:87
- Draft comment:
Possible error in word splitting: the word appears as "special" on the previous delta and "izes in!)" on this line. Should it be corrected to "specializes in!"? - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 5% vs. threshold = 50% This is a cassette file used for testing - it records actual API responses for replay in tests. The word splitting across deltas is not an error; it's how the Anthropic streaming API actually returned the data. The test is meant to verify that the code correctly handles streaming responses where words can be split across chunks. The comment fundamentally misunderstands what this file is - it's not code to be corrected, it's a recording of actual API behavior. There's no action to take here because you can't "fix" how an external API chunks its streaming responses. This comment should definitely be deleted. Could this be a generated test fixture that someone manually created and made a typo in? Maybe the test framework allows editing these cassettes and the comment is suggesting a legitimate fix to make the test data cleaner? Even if cassettes can be manually edited, the purpose of a cassette is to faithfully record actual API responses for deterministic test replay. Editing the response to make words not split across chunks would make it less representative of real API behavior. The test should handle word splitting since that's what happens in production. This comment should be deleted. It misunderstands that this is a test cassette recording actual API streaming behavior, where word splitting across chunks is expected and correct. There's no error to fix.
7. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_async_anthropic_message_stream_manager.yaml:57
- Draft comment:
It looks like the text fragments in the sentence starting with "Why di" (lines 56–58) are mis-split. In particular, the parts "di" and "d the distribute" seem intended to form "did the distributed". Please verify the intended wording. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This is a test cassette file that records actual API responses from Anthropic's streaming API. The "mis-splitting" the comment refers to is actually the expected behavior of a streaming API - text is sent in chunks as it's generated. The developer didn't write this text or control how it's split; this is recorded data from an actual API call. The comment is asking the developer to "verify the intended wording" but there's nothing to verify or fix here - this is just how the streaming API works. The comment is not about a code change that needs to be made, it's misunderstanding the nature of this test fixture file. Could this be a legitimate concern about test data quality? Maybe the cassette was corrupted during recording and the developer should re-record it? However, the text does form coherent sentences when combined, and this is exactly how streaming APIs work. Even if there were concerns about test data quality, the comment asks to "verify the intended wording" which is not actionable - the developer can't change how the API streams responses. The text is coherent when assembled ("Why did the distributed system..."), so there's no actual issue. This comment doesn't suggest any concrete action. This comment should be deleted. It misunderstands the nature of streaming API responses in test cassettes. The text splitting is expected behavior, not a bug. The comment asks for verification but provides no actionable feedback.
8. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_async_anthropic_message_stream_manager.yaml:69
- Draft comment:
The phrase "bit distribute" appears to be a typographical error. It likely should be "bit distraught". - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a cassette file that records actual API responses for testing. The text "bit distribute" + "d!" = "bit distributed!" which is the intended joke punchline about distributed systems. The comment incorrectly assumes this is a typo when it's actually the correct joke. Additionally, this is recorded API response data from Anthropic's API - it's not code that the PR author wrote or can change. The PR author cannot fix typos in API responses. This comment is fundamentally misunderstanding what this file is and is incorrect about the content being a typo. Could the API have actually made a typo and the comment is trying to be helpful? However, even if it were a typo, this is recorded response data from an external API, not something the PR author can or should change in the cassette file. Even if this were a typo (which it isn't - "distributed" is the correct punchline), cassette files record actual API responses and should not be manually edited to "fix" the content of those responses. The comment is both incorrect about the typo and inappropriate for this type of file. This comment should be deleted. It incorrectly identifies "bit distributed" as a typo when it's actually the intended joke punchline. More importantly, this is a test cassette file recording actual API responses, which should not be manually edited to change response content.
9. tests/test_instrumentations/test_anthropic/cassettes/test_messages/test_async_anthropic_message_stream_manager.yaml:89
- Draft comment:
The split text fragments "diagn" and "ose performance" should be combined into the word "diagnose". - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a cassette file used for recording/replaying HTTP interactions in tests (likely using VCR or similar). The file shows a streaming response where text is delivered in chunks via server-sent events. The word "diagnose" being split across "diagn" and "ose" is not a bug - it's the actual behavior of the streaming API, where words can be split across multiple delta events. The cassette file is meant to faithfully record what the API returned, not to present clean, combined text. Modifying this would make the test inaccurate. The comment fundamentally misunderstands the purpose of this file. Could this be a generated file that should have some post-processing applied? Maybe there's a normalization step that's expected? However, cassette files are typically meant to be exact recordings of API responses. Even if there were post-processing, cassette files are specifically designed to record the exact API response for replay in tests. Modifying the response would defeat the purpose of having accurate test fixtures. The split text is the actual API behavior being tested. This comment should be deleted. It misunderstands the purpose of cassette files, which are meant to record exact API responses. The split text is the actual streaming behavior from the Anthropic API, not a bug to fix.
10. tests/test_instrumentations/test_openai/traces/cassettes/test_responses/test_responses_async.yaml:26
- Draft comment:
Typo: The operating system should be written as 'macOS' (with a lowercase 'm') instead of 'MacOS'. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% This is a cassette file - a recorded HTTP interaction used for testing. The "MacOS" value comes from the actual HTTP headers sent by the OpenAI Python client library (AsyncOpenAI/Python 2.7.2), not from code written by the PR author. This is recorded data from a real API interaction. The PR author didn't write "MacOS" - they just recorded what the library sent. Changing this would make the cassette file inaccurate and potentially break tests that rely on exact matching. This comment violates the rule about not commenting on things that are clearly not actionable code changes. Could this be a manually created test fixture where the author did write "MacOS" and should correct it? Maybe the testing framework allows editing these files and expects them to be accurate? Even if cassette files can be manually edited, they are meant to record actual API interactions faithfully. The value "MacOS" came from the OpenAI client library's headers, not from the PR author's code. Correcting a "typo" in recorded data would be incorrect - it would make the test fixture inaccurate. The comment is not actionable. This comment should be deleted. It's commenting on recorded HTTP header data from an external library, not on code written by the PR author. The "MacOS" value is what was actually sent by the OpenAI client, and changing it would make the test fixture inaccurate.
Workflow ID: wflow_syMBUyJHLGk71gX2
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important
Looks good to me! 👍
Reviewed 6412f54 in 1 minute and 9 seconds. Click for details.
- Reviewed
19lines of code in1files - Skipped
0files when reviewing. - Skipped posting
1draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. src/lmnr/opentelemetry_lib/opentelemetry/instrumentation/anthropic/streaming.py:257
- Draft comment:
Added explicit 'service_tier': None in complete_response initialization for the async builder. This aligns with the sync version and ensures consistent span attribute handling. Verify that downstream code correctly handles a None value. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful =0%<= threshold50%The comment is informative and asks the PR author to verify that downstream code handles aNonevalue correctly. This violates the rule against asking the author to ensure behavior is intended or tested. The comment does not provide a specific code suggestion or ask for a specific test to be written.
Workflow ID: wflow_BvfctqRvc2VBx4zj
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Note
Record
service_tierfor OpenAI and Anthropic requests/responses (including streaming), update utils, and refresh tests/cassettes accordingly.anthropic.request.service_tierandanthropic.response.service_tierin spans for sync/async and streaming paths (span_utils.py,streaming.py).service_tierfrom stream events and attach to spans; continue setting token usage and completion content.openai.request.service_tierandopenai.response.service_tierin chat and responses APIs (request/response and streaming) via shared handlers and chat wrappers.TracedDatawithrequest_service_tier/response_service_tier;set_data_attributeswrites them to span attributes.asyncio.iscoroutinefunctionwithinspect.iscoroutinefunctionin decorators/helpers to detect coroutines reliably.service_tierattributes and switch models where needed; add async responses test.service_tierin requests/responses and new SDK versions/IDs.Written by Cursor Bugbot for commit 6412f54. This will update automatically on new commits. Configure here.
Important
This PR records
service_tierfor Anthropic and OpenAI requests and responses, updating span attributes and tests accordingly.anthropic.request.service_tierfrom request kwargs inspan_utils.py.anthropic.response.service_tierfromusage/stream events inspan_utils.pyandstreaming.py.service_tierand responseidinto spans instreaming.py.openai.request.service_tieron requests inchat_wrappers.py.openai.response.service_tierfrom responses and stream chunks inchat_wrappers.pyandresponses_wrappers.py.TracedDatastores request/responseservice_tierinresponses_wrappers.py.asyncio.iscoroutinefunctionwithinspect.iscoroutinefunctionin utils for robustness.service_tierfields across Anthropic/OpenAI tests.responsestest intest_responses.py.This description was created by
for 6412f54. You can customize this summary. It will automatically update as commits are pushed.