[LEADS-348, LEADS-364] API Latency and Token Calculation#230
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (24)
WalkthroughAdds end-to-end agent/API latency measurement and percentile stats (p50/p95/p99): collects per-call latency, aggregates percentiles, persists values, exposes agent token usage with percentile statistics, updates quality/summary outputs, and adds tests and docs. ChangesAPI Latency Tracking Feature
🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
Suggested Reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a00a8f1 to
861d59e
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
119-122:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSet
agent_latencyin the error path for accurate timing.When the API call fails,
turn_data.agent_latencyis never set. This means it will use the default value (likely0.0), which misrepresents the actual time spent on the failed API call. Consider settingagent_latencybased on the elapsed time even when an error occurs, to accurately track API call overhead.🕒 Proposed fix to track latency in error path
except APIError as e: + api_latency = time.perf_counter() - api_start_time + turn_data.agent_latency = api_latency error_msg = f"API Error for turn {turn_data.turn_id}: {e}" logger.error(error_msg) return error_msg, conversation_id🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 119 - 122, In the APIError except block inside amender.py (the except APIError as e: handler for the call that sets turn_data), record the elapsed time and assign it to turn_data.agent_latency before logging and returning: compute the latency using the same start/end timing mechanism used in the successful path (or capture time at exception entry), set turn_data.agent_latency = elapsed, then include that updated timing prior to calling logger.error and returning error_msg, conversation_id so failed calls reflect actual agent_latency.
🧹 Nitpick comments (4)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
71-76: 💤 Low valueConsider tracking API call time separately from agent processing time.
The current logic sets
agent_latencyto0.0for cached responses (when tokens are zero), which discards network overhead and authentication time. While this appears intentional (tracking only agent processing time), it means you lose visibility into the total API call latency for cached responses.Consider one of these alternatives:
- Track both
api_call_time(total) andagent_compute_time(processing only)- Document that
agent_latencyexcludes cached response overhead- Verify that zero tokens reliably indicates a cached response in all scenarios
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 71 - 76, The current assignment to turn_data.agent_latency overwrites total API latency for cached responses (when turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change this by adding separate fields (e.g., turn_data.api_call_time = api_latency and turn_data.agent_compute_time = ... ) and set agent_compute_time to api_latency only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in turn_data.api_call_time while keeping agent processing time in turn_data.agent_compute_time; update any downstream usage of turn_data.agent_latency to use the new fields or document the semantics clearly in the turn_data class/definition.src/lightspeed_evaluation/core/output/statistics.py (1)
311-314: ⚡ Quick winUse Google-style docstring for the new public helper.
calculate_field_numeric_stats_from_evaluation_datais a public API insrc/and should include explicitArgsandReturnssections for consistency with repository standards.As per coding guidelines
src/**/*.py: "Use Google-style docstrings for all public APIs in Python".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 311 - 314, The docstring for the public function calculate_field_numeric_stats_from_evaluation_data lacks the required Google-style format; update its docstring to include an explicit short description followed by Google-style Args and Returns sections (Args: evaluation_data: list[EvaluationData] — list of evaluation records to inspect; field_name: str — name of the numeric field to compute stats for) and Returns: dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median, min, max, std) and any notes about filtering zeros), ensuring the docstring matches repository standards for public APIs.tests/unit/core/models/test_summary.py (2)
223-241: 💤 Low valueConsider adding a test for mixed zero and non-zero latencies.
The current tests cover "all non-zero" and "all zero" scenarios. Consider adding a test where some turns have
agent_latency=0(e.g., cached responses) and others have positive values (actual API calls). This would verify the behavior when API calls are only partially measured.📝 Example test case
def test_with_mixed_api_latency(self) -> None: """Test from_results with mixed zero and non-zero latencies.""" results = [_make_result(turn_id="t1")] eval_data = [ EvaluationData( conversation_group_id="conv1", turns=[ TurnData(turn_id="t1", query="Query 1", agent_latency=1.5), TurnData(turn_id="t2", query="Query 2", agent_latency=0), # cached TurnData(turn_id="t3", query="Query 3", agent_latency=2.0), ], ) ] summary = EvaluationSummary.from_results(results, evaluation_data=eval_data) # Should compute stats only from non-zero values (1.5, 2.0) assert summary.agent_latency_stats is not None assert summary.agent_latency_stats.count == 2 assert summary.agent_latency_stats.min_value == 1.5 assert summary.agent_latency_stats.max_value == 2.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_summary.py` around lines 223 - 241, Add a new unit test that calls EvaluationSummary.from_results with EvaluationData containing a mix of TurnData entries where some have agent_latency == 0 and others have positive values; verify that EvaluationSummary.agent_latency_stats is not None and that its count, min_value, and max_value are computed only from the non-zero latencies (e.g., ensure count==number of positive latencies and min/max reflect those values). Use the existing test pattern and helper _make_result to create results and assert the expected stats for the non-zero latencies.
199-222: 💤 Low valueConsider more precise assertion for mean latency.
Line 219 asserts
mean > 0which is very loose. Given the input values (1.5, 2.0, 1.8), the expected mean is approximately 1.7667. A more precise assertion would validate the calculation correctness.🔍 Suggested improvement
assert summary.agent_latency_stats is not None assert summary.agent_latency_stats.count == 3 - assert summary.agent_latency_stats.mean is not None - assert summary.agent_latency_stats.mean > 0 + expected_mean = (1.5 + 2.0 + 1.8) / 3 # ≈ 1.7667 + assert summary.agent_latency_stats.mean is not None + assert abs(summary.agent_latency_stats.mean - expected_mean) < 0.0001 assert summary.agent_latency_stats.min_value == 1.5 assert summary.agent_latency_stats.max_value == 2.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_summary.py` around lines 199 - 222, The test test_with_api_latency currently asserts summary.agent_latency_stats.mean > 0 which is too loose; update the assertion after calling EvaluationSummary.from_results to check the mean against the expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison (e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4) or use pytest.approx) referencing the test function name and the agent_latency_stats.mean field to ensure the calculation is correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 523-529: The NumericStats fields on agent_latency (mean, median,
std, min_value, max_value, p95, p99) may be None and the current f"{...:.3f}"
formatting will raise; update the block that writes these stats in generator.py
to guard each field (e.g., check value is not None) and format conditionally—use
f"{value:.3f}s" when value is present and a safe fallback string like "N/A" (or
"0.000s" if preferred) when None so writing never raises; ensure all referenced
symbols (agent_latency.mean, agent_latency.median, agent_latency.std,
agent_latency.min_value, agent_latency.max_value, agent_latency.p95,
agent_latency.p99) are handled consistently.
- Around line 779-783: The JSON output drops newly added percentile fields
because _streaming_stats_to_dict (used when summary.streaming is not None in
generator.py) does not include p95/p99 for TTFT, duration, and throughput;
update _streaming_stats_to_dict to map and include the percentile fields (p95
and p99) from the incoming streaming stats object into the returned dict (e.g.,
add keys like ttft_p95, ttft_p99, duration_p95, duration_p99, throughput_p95,
throughput_p99 or preserve existing naming conventions) so summary.streaming
serializes the new percentile data into result["streaming_performance"].
In `@tests/unit/pipeline/evaluation/test_amender.py`:
- Around line 267-299: The test is flaky because it asserts turn.agent_latency >
0 based on real timing; patch time.perf_counter inside the test so
amend_single_turn's timing is deterministic: use
mocker.patch("time.perf_counter", side_effect=[START, END]) (or return_value for
a single call if amend_single_turn only reads once) before calling
APIDataAmender.amend_single_turn, choose START and END such that END > START
(e.g. 1.0, 1.5) so agent_latency becomes a known value, then assert the exact
expected latency (END-START) and token fields as before; reference
APIDataAmender.amend_single_turn and TurnData.agent_latency when making the
change.
---
Outside diff comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 119-122: In the APIError except block inside amender.py (the
except APIError as e: handler for the call that sets turn_data), record the
elapsed time and assign it to turn_data.agent_latency before logging and
returning: compute the latency using the same start/end timing mechanism used in
the successful path (or capture time at exception entry), set
turn_data.agent_latency = elapsed, then include that updated timing prior to
calling logger.error and returning error_msg, conversation_id so failed calls
reflect actual agent_latency.
---
Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 311-314: The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 71-76: The current assignment to turn_data.agent_latency
overwrites total API latency for cached responses (when
turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change
this by adding separate fields (e.g., turn_data.api_call_time = api_latency and
turn_data.agent_compute_time = ... ) and set agent_compute_time to api_latency
only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in
turn_data.api_call_time while keeping agent processing time in
turn_data.agent_compute_time; update any downstream usage of
turn_data.agent_latency to use the new fields or document the semantics clearly
in the turn_data class/definition.
In `@tests/unit/core/models/test_summary.py`:
- Around line 223-241: Add a new unit test that calls
EvaluationSummary.from_results with EvaluationData containing a mix of TurnData
entries where some have agent_latency == 0 and others have positive values;
verify that EvaluationSummary.agent_latency_stats is not None and that its
count, min_value, and max_value are computed only from the non-zero latencies
(e.g., ensure count==number of positive latencies and min/max reflect those
values). Use the existing test pattern and helper _make_result to create results
and assert the expected stats for the non-zero latencies.
- Around line 199-222: The test test_with_api_latency currently asserts
summary.agent_latency_stats.mean > 0 which is too loose; update the assertion
after calling EvaluationSummary.from_results to check the mean against the
expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison
(e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4)
or use pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e495129c-0580-484c-8dd8-b4f993bdcbae
📒 Files selected for processing (21)
README.mdconfig/system.yamldocs/EVALUATION_GUIDE.mdsrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/models/quality.pysrc/lightspeed_evaluation/core/models/summary.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/storage/sql_storage.pysrc/lightspeed_evaluation/pipeline/evaluation/amender.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pytests/unit/core/models/conftest.pytests/unit/core/models/test_quality.pytests/unit/core/models/test_summary.pytests/unit/core/output/test_generator.pytests/unit/core/output/test_statistics.pytests/unit/core/output/test_statistics_api.pytests/unit/core/output/test_statistics_detailed.pytests/unit/core/storage/test_sql_storage.pytests/unit/pipeline/evaluation/test_amender.py
be3c850 to
b07a6f5
Compare
VladimirKadlec
left a comment
There was a problem hiding this comment.
Renaming "api" to "agent" is really confusing (and it's probably confusing for Claude&spol as well). For example:
- in
system.yamlwe have:
- "api_input_tokens"
- "api_output_tokens"
- "agent_latency" - in
src/lightspeed_evaluation/core/models/data.py:
- fieldsagent_latencywith descriptions "API latency ..."
turn_data.agent_latency = api_latency def _write_agent_latency_stats( ..... ) -> None:
"""Write API latency statistics section."""
if agent_latency is None:
return # No API latency data availableand many other inconsistent examples in comments/field/variable naming.
@asamal4 I understand we are in the middle of moving to the "agents". Do we want to call all the existing "API" comments/field/variable as "agent"?
Otherwise, LGTM 👍
|
@VladimirKadlec @xmican10 regarding below
Few things may not also apply to all agent type; ex: TTFT. But probably safe to call it as agent_ttft. I am also trying to avoid any backward compatibility in future, so I was suggesting to change only the new fields. But I understand your concern, it is definitely confusing. How about we just change it for new user facing column/property names and method names (library mode) and avoid large change for the internal logic. And have a follow up PR where we make it consistent or at least add coments. |
asamal4
left a comment
There was a problem hiding this comment.
Thank you !!
Note: We have discussed about inconsistencies during caching usage.. We will have to discuss, what is going to be right approach here 1. use 0 for token/latency or 2. previous cached value.. Currently we have both mechanisms.
b07a6f5 to
87e9fd0
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
297-313: 💤 Low valueDocstring inconsistency with actual behavior.
The docstring states the return value "may have count=0 if no valid data", but
compute_numeric_statsreturnsNonefor an empty list (not aNumericStatswithcount=0). The function correctly returnsNonein both cases (no evaluation_data, or no valid latency values), so the docstring should reflect this.📝 Suggested docstring fix
"""Compute agent latency statistics from evaluation data. Args: evaluation_data: List of evaluation data containing turn-level latency values. Returns: - NumericStats instance (may have count=0 if no valid data), or None if no evaluation data. + NumericStats instance with computed statistics, or None if no evaluation data + or no valid (non-zero) latency values exist. """🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 297 - 313, Update the compute_agent_latency_stats docstring to accurately describe its return behavior: state that it returns a NumericStats instance when latency values exist and otherwise returns None (both when evaluation_data is empty and when there are no valid agent_latency values). Reference the compute_agent_latency_stats function and the helper compute_field_numeric_stats_from_evaluation_data so reviewers know where the behavior originates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 297-313: Update the compute_agent_latency_stats docstring to
accurately describe its return behavior: state that it returns a NumericStats
instance when latency values exist and otherwise returns None (both when
evaluation_data is empty and when there are no valid agent_latency values).
Reference the compute_agent_latency_stats function and the helper
compute_field_numeric_stats_from_evaluation_data so reviewers know where the
behavior originates.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ed3f2085-84bd-4331-967b-c0a2f6ff2157
📒 Files selected for processing (24)
README.mdconfig/system.yamldocs/EVALUATION_GUIDE.mdsrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/core/models/__init__.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/models/quality.pysrc/lightspeed_evaluation/core/models/statistics.pysrc/lightspeed_evaluation/core/models/summary.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/storage/sql_storage.pysrc/lightspeed_evaluation/pipeline/evaluation/amender.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/runner/evaluation.pytests/unit/core/models/conftest.pytests/unit/core/models/test_quality.pytests/unit/core/models/test_summary.pytests/unit/core/output/test_generator.pytests/unit/core/output/test_statistics.pytests/unit/core/output/test_statistics_api.pytests/unit/core/output/test_statistics_detailed.pytests/unit/core/storage/test_sql_storage.pytests/unit/pipeline/evaluation/test_amender.py
✅ Files skipped from review due to trivial changes (2)
- README.md
- docs/EVALUATION_GUIDE.md
🚧 Files skipped from review as they are similar to previous changes (14)
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/storage/test_sql_storage.py
- src/lightspeed_evaluation/core/storage/sql_storage.py
- tests/unit/core/models/test_summary.py
- src/lightspeed_evaluation/core/models/data.py
- tests/unit/core/models/conftest.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- config/system.yaml
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/output/test_statistics.py
- tests/unit/core/output/test_generator.py
- src/lightspeed_evaluation/core/output/generator.py
- tests/unit/core/models/test_quality.py
87e9fd0 to
244cc20
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/output/generator.py (1)
990-997:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStreaming percentile stats (p95/p99) are not included in JSON output.
_streaming_stats_to_dictexcludesp95andp99fields that are now part ofNumericStats, causing streaming performance metrics to lose percentile data in reports.📝 Proposed fix
if numeric is not None: result[field_name] = { "count": numeric.count, "mean": numeric.mean, "median": numeric.median, "std": numeric.std, "min": numeric.min_value, "max": numeric.max_value, + "p95": numeric.p95, + "p99": numeric.p99, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/generator.py` around lines 990 - 997, The _streaming_stats_to_dict function currently maps NumericStats fields into the result dict but omits the percentile fields p95 and p99; update the mapping for result[field_name] to include "p95": numeric.p95 and "p99": numeric.p99 alongside count/mean/median/std/min/max so streaming NumericStats percentile values are preserved in the JSON output.
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/models/summary.py (1)
76-78: 💤 Low valueConsider updating docstring to mention latency stats.
The docstring mentions "API token and streaming stats" but now also computes
agent_latency_stats.📝 Suggested docstring update
evaluation_data: Optional evaluation data for API token and streaming stats. + Also used to compute agent latency statistics when available.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/models/summary.py` around lines 76 - 78, Update the docstring for the function or class that takes "results" and "evaluation_data" (where agent_latency_stats is now computed) to explicitly mention that it also computes and returns agent latency statistics (agent_latency_stats) in addition to API token and streaming stats; locate the docstring in src/lightspeed_evaluation/core/models/summary.py near the parameters block (the function/method handling results and evaluation_data) and add a brief description of agent_latency_stats and its purpose/format.src/lightspeed_evaluation/core/output/generator.py (1)
593-594: 💤 Low valueUse
precisionparameter for std dev formatting.Standard deviation uses hardcoded
.3fwhile other fields use theprecisionparameter.📝 Suggested fix
if include_std and stats["count"] > 1: - f.write(f" Std Dev: {stats['std']:.3f}\n") + f.write(f" Std Dev: {stats['std']:{fmt}}\n")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/generator.py` around lines 593 - 594, The Std Dev line currently uses a hardcoded ".3f"; update the formatting to use the precision parameter like the other fields by replacing f.write(f" Std Dev: {stats['std']:.3f}\n") with an f-string that injects precision (e.g. f" Std Dev: {stats['std']:.{precision}f}\n") so the std dev respects the precision variable used elsewhere in the generator (refer to include_std, stats and precision in this file).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 990-997: The _streaming_stats_to_dict function currently maps
NumericStats fields into the result dict but omits the percentile fields p95 and
p99; update the mapping for result[field_name] to include "p95": numeric.p95 and
"p99": numeric.p99 alongside count/mean/median/std/min/max so streaming
NumericStats percentile values are preserved in the JSON output.
---
Nitpick comments:
In `@src/lightspeed_evaluation/core/models/summary.py`:
- Around line 76-78: Update the docstring for the function or class that takes
"results" and "evaluation_data" (where agent_latency_stats is now computed) to
explicitly mention that it also computes and returns agent latency statistics
(agent_latency_stats) in addition to API token and streaming stats; locate the
docstring in src/lightspeed_evaluation/core/models/summary.py near the
parameters block (the function/method handling results and evaluation_data) and
add a brief description of agent_latency_stats and its purpose/format.
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 593-594: The Std Dev line currently uses a hardcoded ".3f"; update
the formatting to use the precision parameter like the other fields by replacing
f.write(f" Std Dev: {stats['std']:.3f}\n") with an f-string that injects
precision (e.g. f" Std Dev: {stats['std']:.{precision}f}\n") so the std dev
respects the precision variable used elsewhere in the generator (refer to
include_std, stats and precision in this file).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4d99b3bc-178f-42c3-bcf0-bc4eece41999
📒 Files selected for processing (24)
README.mdconfig/system.yamldocs/EVALUATION_GUIDE.mdsrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/core/models/__init__.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/models/quality.pysrc/lightspeed_evaluation/core/models/statistics.pysrc/lightspeed_evaluation/core/models/summary.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/storage/sql_storage.pysrc/lightspeed_evaluation/pipeline/evaluation/amender.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/runner/evaluation.pytests/unit/core/models/conftest.pytests/unit/core/models/test_quality.pytests/unit/core/models/test_summary.pytests/unit/core/output/test_generator.pytests/unit/core/output/test_statistics.pytests/unit/core/output/test_statistics_api.pytests/unit/core/output/test_statistics_detailed.pytests/unit/core/storage/test_sql_storage.pytests/unit/pipeline/evaluation/test_amender.py
✅ Files skipped from review due to trivial changes (2)
- src/lightspeed_evaluation/core/storage/sql_storage.py
- README.md
🚧 Files skipped from review as they are similar to previous changes (18)
- tests/unit/core/storage/test_sql_storage.py
- src/lightspeed_evaluation/core/models/init.py
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/models/conftest.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/statistics.py
- tests/unit/core/output/test_statistics_api.py
- tests/unit/core/models/test_summary.py
- docs/EVALUATION_GUIDE.md
- tests/unit/pipeline/evaluation/test_amender.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/models/quality.py
- tests/unit/core/output/test_statistics.py
- tests/unit/core/output/test_generator.py
- tests/unit/core/models/test_quality.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/core/output/test_statistics_detailed.py
244cc20 to
6e03548
Compare
|
Rebased, resolved conflicts and addressed all comments. |
asamal4
left a comment
There was a problem hiding this comment.
Thank you !! LGTM..
Note: we will have to start incrementally change api to agent..
Description
This PR addresses the following acceptance criteria (LEADS-348):
For LEADS-364 API token statistics were added into
quality_report.jsontest_statisticsmodule was splited due to reaching file line limitsType of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Tests