Skip to content

[LEADS-348, LEADS-364] API Latency and Token Calculation#230

Merged
asamal4 merged 2 commits into
lightspeed-core:mainfrom
xmican10:LEADS-348-calculate-api-latency
May 20, 2026
Merged

[LEADS-348, LEADS-364] API Latency and Token Calculation#230
asamal4 merged 2 commits into
lightspeed-core:mainfrom
xmican10:LEADS-348-calculate-api-latency

Conversation

@xmican10
Copy link
Copy Markdown
Collaborator

@xmican10 xmican10 commented May 6, 2026

Description

This PR addresses the following acceptance criteria (LEADS-348):

  • Calculate api execution time and store in final result. agent_latency
  • Add statistics for all 3 latencies → 50, 95, 99 percentile.
  • Add information to all relevant reports including the new quality report (LEADS-349)

For LEADS-364 API token statistics were added into quality_report.json

test_statistics module was splited due to reaching file line limits

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude(e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • API latency tracking: per-turn and conversation latency with percentile metrics (p50, p95, p99) included in outputs and CSV.
    • Agent-focused token statistics and latency shown in summary and quality reports.
  • Documentation

    • Evaluation guide updated with a Performance Metrics section describing latency percentiles, streaming metrics, and token reporting.
  • Tests

    • Added and updated unit tests covering latency stats, agent token stats, CSV/export presence, and report generation.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 6, 2026

Warning

Rate limit exceeded

@xmican10 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 18 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 09109dbb-9f90-4618-a890-732d5281d067

📥 Commits

Reviewing files that changed from the base of the PR and between 244cc20 and 6e03548.

📒 Files selected for processing (24)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/statistics.py
  • src/lightspeed_evaluation/core/models/summary.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/output/test_statistics_detailed.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/pipeline/evaluation/test_amender.py

Walkthrough

Adds end-to-end agent/API latency measurement and percentile stats (p50/p95/p99): collects per-call latency, aggregates percentiles, persists values, exposes agent token usage with percentile statistics, updates quality/summary outputs, and adds tests and docs.

Changes

API Latency Tracking Feature

Layer / File(s) Summary
Docs and CSV config
README.md, docs/EVALUATION_GUIDE.md, config/system.yaml
Add README feature bullet, document CSV field and performance metrics section, include agent_latency in CSV columns.
Data models & public exports
src/lightspeed_evaluation/core/models/data.py, src/lightspeed_evaluation/core/models/__init__.py, src/lightspeed_evaluation/core/constants.py
Add agent_latency to TurnData and EvaluationResult, update supported CSV columns and package-level token usage export name.
Stat models & functions
src/lightspeed_evaluation/core/models/statistics.py, src/lightspeed_evaluation/core/output/statistics.py
Add p95/p99 to NumericStats, replace ApiTokenUsage with AgentTokenUsage/AgentTokenStats, extend compute_numeric_stats for percentiles, add helpers to compute field numeric stats and agent token stats/usage.
Summary & Quality report wiring
src/lightspeed_evaluation/core/models/summary.py, src/lightspeed_evaluation/core/models/quality.py
Compute/expose agent_token_usage and agent_latency_stats in EvaluationSummary and accept/populate these in QualityReport.create_report.
API call latency measurement
src/lightspeed_evaluation/pipeline/evaluation/amender.py
Measure API call latency with perf_counter(), set turn_data.agent_latency, skip cached responses (zero tokens), and record on APIError.
Per-request latency compute
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
Compute per-request agent_latency (turn-level or conversation sum) and include it in both success and error EvaluationResult payloads; rename token-count helper accordingly.
Database persistence
src/lightspeed_evaluation/core/storage/sql_storage.py
Add nullable agent_latency ORM column and persist EvaluationResult.agent_latency to DB records.
Output generation & rendering
src/lightspeed_evaluation/core/output/generator.py
Pass agent stats to QualityReport, add helper to build agent token stats JSON, include agent_latency_stats in JSON/text summary, extend per-result JSON with agent_latency and tokens_per_second, and render std/p95/p99 where available.
Runner integration
src/lightspeed_evaluation/runner/evaluation.py
Switch runner to AgentTokenUsage and compute_agent_token_usage and update _print_summary typing.
Test fixtures
tests/unit/core/models/conftest.py
Add fixtures returning NumericStats and AgentTokenStats for tests.
Quality report tests
tests/unit/core/models/test_quality.py
Update and add tests covering agent latency and agent token stats in QualityReport creations and edge cases.
Summary tests
tests/unit/core/models/test_summary.py
Add tests asserting agent latency stats computed from nonzero values, ignored zeros, and serialization expectations.
Output generator tests
tests/unit/core/output/test_generator.py
Update tests to pass agent stats and verify new JSON/text output fields.
Statistics tests
tests/unit/core/output/test_statistics.py, tests/unit/core/output/test_statistics_api.py
Refactor tests: narrow/truncate legacy file and add test_statistics_api.py covering compute_agent_token_usage, agent latency stats helper, and overall stats token tracking.
Detailed stats tests
tests/unit/core/output/test_statistics_detailed.py
Add comprehensive tests for compute_detailed_stats: grouping, outcome counts, rates, score stats, CI behavior, and tag breakdowns.
Amender & storage tests
tests/unit/pipeline/evaluation/test_amender.py, tests/unit/core/storage/test_sql_storage.py
Add amender latency measurement tests (mocked perf_counter, missing client, cached responses) and assert DB CSV column includes agent_latency.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

Suggested Reviewers

  • asamal4
  • VladimirKadlec
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main changes: API latency calculation and token calculation are the core features implemented across the PR, reflected in all major file modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch 4 times, most recently from a00a8f1 to 861d59e Compare May 13, 2026 08:10
@xmican10 xmican10 changed the title [LEADS-348] API Latency Calculation [LEADS-348, LEADS-364] API Latency Calculation May 13, 2026
@xmican10 xmican10 changed the title [LEADS-348, LEADS-364] API Latency Calculation [LEADS-348, LEADS-364] API Latency and Token Calculation May 13, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

119-122: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Set agent_latency in the error path for accurate timing.

When the API call fails, turn_data.agent_latency is never set. This means it will use the default value (likely 0.0), which misrepresents the actual time spent on the failed API call. Consider setting agent_latency based on the elapsed time even when an error occurs, to accurately track API call overhead.

🕒 Proposed fix to track latency in error path
         except APIError as e:
+            api_latency = time.perf_counter() - api_start_time
+            turn_data.agent_latency = api_latency
             error_msg = f"API Error for turn {turn_data.turn_id}: {e}"
             logger.error(error_msg)
             return error_msg, conversation_id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 119 -
122, In the APIError except block inside amender.py (the except APIError as e:
handler for the call that sets turn_data), record the elapsed time and assign it
to turn_data.agent_latency before logging and returning: compute the latency
using the same start/end timing mechanism used in the successful path (or
capture time at exception entry), set turn_data.agent_latency = elapsed, then
include that updated timing prior to calling logger.error and returning
error_msg, conversation_id so failed calls reflect actual agent_latency.
🧹 Nitpick comments (4)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

71-76: 💤 Low value

Consider tracking API call time separately from agent processing time.

The current logic sets agent_latency to 0.0 for cached responses (when tokens are zero), which discards network overhead and authentication time. While this appears intentional (tracking only agent processing time), it means you lose visibility into the total API call latency for cached responses.

Consider one of these alternatives:

  • Track both api_call_time (total) and agent_compute_time (processing only)
  • Document that agent_latency excludes cached response overhead
  • Verify that zero tokens reliably indicates a cached response in all scenarios
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 71 -
76, The current assignment to turn_data.agent_latency overwrites total API
latency for cached responses (when turn_data.api_input_tokens and
turn_data.api_output_tokens are zero); change this by adding separate fields
(e.g., turn_data.api_call_time = api_latency and turn_data.agent_compute_time =
... ) and set agent_compute_time to api_latency only when tokens > 0 (or to 0
otherwise), so you preserve total api_latency in turn_data.api_call_time while
keeping agent processing time in turn_data.agent_compute_time; update any
downstream usage of turn_data.agent_latency to use the new fields or document
the semantics clearly in the turn_data class/definition.
src/lightspeed_evaluation/core/output/statistics.py (1)

311-314: ⚡ Quick win

Use Google-style docstring for the new public helper.

calculate_field_numeric_stats_from_evaluation_data is a public API in src/ and should include explicit Args and Returns sections for consistency with repository standards.

As per coding guidelines src/**/*.py: "Use Google-style docstrings for all public APIs in Python".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 311 - 314,
The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.
tests/unit/core/models/test_summary.py (2)

223-241: 💤 Low value

Consider adding a test for mixed zero and non-zero latencies.

The current tests cover "all non-zero" and "all zero" scenarios. Consider adding a test where some turns have agent_latency=0 (e.g., cached responses) and others have positive values (actual API calls). This would verify the behavior when API calls are only partially measured.

📝 Example test case
def test_with_mixed_api_latency(self) -> None:
    """Test from_results with mixed zero and non-zero latencies."""
    results = [_make_result(turn_id="t1")]
    
    eval_data = [
        EvaluationData(
            conversation_group_id="conv1",
            turns=[
                TurnData(turn_id="t1", query="Query 1", agent_latency=1.5),
                TurnData(turn_id="t2", query="Query 2", agent_latency=0),  # cached
                TurnData(turn_id="t3", query="Query 3", agent_latency=2.0),
            ],
        )
    ]
    
    summary = EvaluationSummary.from_results(results, evaluation_data=eval_data)
    
    # Should compute stats only from non-zero values (1.5, 2.0)
    assert summary.agent_latency_stats is not None
    assert summary.agent_latency_stats.count == 2
    assert summary.agent_latency_stats.min_value == 1.5
    assert summary.agent_latency_stats.max_value == 2.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` around lines 223 - 241, Add a new
unit test that calls EvaluationSummary.from_results with EvaluationData
containing a mix of TurnData entries where some have agent_latency == 0 and
others have positive values; verify that EvaluationSummary.agent_latency_stats
is not None and that its count, min_value, and max_value are computed only from
the non-zero latencies (e.g., ensure count==number of positive latencies and
min/max reflect those values). Use the existing test pattern and helper
_make_result to create results and assert the expected stats for the non-zero
latencies.

199-222: 💤 Low value

Consider more precise assertion for mean latency.

Line 219 asserts mean > 0 which is very loose. Given the input values (1.5, 2.0, 1.8), the expected mean is approximately 1.7667. A more precise assertion would validate the calculation correctness.

🔍 Suggested improvement
         assert summary.agent_latency_stats is not None
         assert summary.agent_latency_stats.count == 3
-        assert summary.agent_latency_stats.mean is not None
-        assert summary.agent_latency_stats.mean > 0
+        expected_mean = (1.5 + 2.0 + 1.8) / 3  # ≈ 1.7667
+        assert summary.agent_latency_stats.mean is not None
+        assert abs(summary.agent_latency_stats.mean - expected_mean) < 0.0001
         assert summary.agent_latency_stats.min_value == 1.5
         assert summary.agent_latency_stats.max_value == 2.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` around lines 199 - 222, The test
test_with_api_latency currently asserts summary.agent_latency_stats.mean > 0
which is too loose; update the assertion after calling
EvaluationSummary.from_results to check the mean against the expected value
(mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison (e.g., assert
round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4) or use
pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 523-529: The NumericStats fields on agent_latency (mean, median,
std, min_value, max_value, p95, p99) may be None and the current f"{...:.3f}"
formatting will raise; update the block that writes these stats in generator.py
to guard each field (e.g., check value is not None) and format conditionally—use
f"{value:.3f}s" when value is present and a safe fallback string like "N/A" (or
"0.000s" if preferred) when None so writing never raises; ensure all referenced
symbols (agent_latency.mean, agent_latency.median, agent_latency.std,
agent_latency.min_value, agent_latency.max_value, agent_latency.p95,
agent_latency.p99) are handled consistently.
- Around line 779-783: The JSON output drops newly added percentile fields
because _streaming_stats_to_dict (used when summary.streaming is not None in
generator.py) does not include p95/p99 for TTFT, duration, and throughput;
update _streaming_stats_to_dict to map and include the percentile fields (p95
and p99) from the incoming streaming stats object into the returned dict (e.g.,
add keys like ttft_p95, ttft_p99, duration_p95, duration_p99, throughput_p95,
throughput_p99 or preserve existing naming conventions) so summary.streaming
serializes the new percentile data into result["streaming_performance"].

In `@tests/unit/pipeline/evaluation/test_amender.py`:
- Around line 267-299: The test is flaky because it asserts turn.agent_latency >
0 based on real timing; patch time.perf_counter inside the test so
amend_single_turn's timing is deterministic: use
mocker.patch("time.perf_counter", side_effect=[START, END]) (or return_value for
a single call if amend_single_turn only reads once) before calling
APIDataAmender.amend_single_turn, choose START and END such that END > START
(e.g. 1.0, 1.5) so agent_latency becomes a known value, then assert the exact
expected latency (END-START) and token fields as before; reference
APIDataAmender.amend_single_turn and TurnData.agent_latency when making the
change.

---

Outside diff comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 119-122: In the APIError except block inside amender.py (the
except APIError as e: handler for the call that sets turn_data), record the
elapsed time and assign it to turn_data.agent_latency before logging and
returning: compute the latency using the same start/end timing mechanism used in
the successful path (or capture time at exception entry), set
turn_data.agent_latency = elapsed, then include that updated timing prior to
calling logger.error and returning error_msg, conversation_id so failed calls
reflect actual agent_latency.

---

Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 311-314: The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 71-76: The current assignment to turn_data.agent_latency
overwrites total API latency for cached responses (when
turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change
this by adding separate fields (e.g., turn_data.api_call_time = api_latency and
turn_data.agent_compute_time = ... ) and set agent_compute_time to api_latency
only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in
turn_data.api_call_time while keeping agent processing time in
turn_data.agent_compute_time; update any downstream usage of
turn_data.agent_latency to use the new fields or document the semantics clearly
in the turn_data class/definition.

In `@tests/unit/core/models/test_summary.py`:
- Around line 223-241: Add a new unit test that calls
EvaluationSummary.from_results with EvaluationData containing a mix of TurnData
entries where some have agent_latency == 0 and others have positive values;
verify that EvaluationSummary.agent_latency_stats is not None and that its
count, min_value, and max_value are computed only from the non-zero latencies
(e.g., ensure count==number of positive latencies and min/max reflect those
values). Use the existing test pattern and helper _make_result to create results
and assert the expected stats for the non-zero latencies.
- Around line 199-222: The test test_with_api_latency currently asserts
summary.agent_latency_stats.mean > 0 which is too loose; update the assertion
after calling EvaluationSummary.from_results to check the mean against the
expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison
(e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4)
or use pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e495129c-0580-484c-8dd8-b4f993bdcbae

📥 Commits

Reviewing files that changed from the base of the PR and between 1951eb9 and 861d59e.

📒 Files selected for processing (21)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/summary.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/output/test_statistics_detailed.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/pipeline/evaluation/test_amender.py

Comment thread src/lightspeed_evaluation/core/output/generator.py Outdated
Comment thread src/lightspeed_evaluation/core/output/generator.py
Comment thread tests/unit/pipeline/evaluation/test_amender.py
@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch 3 times, most recently from be3c850 to b07a6f5 Compare May 13, 2026 12:53
@xmican10 xmican10 marked this pull request as ready for review May 13, 2026 14:51
VladimirKadlec
VladimirKadlec previously approved these changes May 14, 2026
Copy link
Copy Markdown
Member

@VladimirKadlec VladimirKadlec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming "api" to "agent" is really confusing (and it's probably confusing for Claude&spol as well). For example:

  • in system.yaml we have:
    - "api_input_tokens"
    - "api_output_tokens"
    - "agent_latency"
  • in src/lightspeed_evaluation/core/models/data.py:
    - fields agent_latency with descriptions "API latency ..."
turn_data.agent_latency = api_latency
    def _write_agent_latency_stats( ..... ) -> None:
        """Write API latency statistics section."""
        if agent_latency is None:
            return  # No API latency data available

and many other inconsistent examples in comments/field/variable naming.

@asamal4 I understand we are in the middle of moving to the "agents". Do we want to call all the existing "API" comments/field/variable as "agent"?

Otherwise, LGTM 👍

@xmican10 xmican10 mentioned this pull request May 14, 2026
15 tasks
@asamal4
Copy link
Copy Markdown
Collaborator

asamal4 commented May 15, 2026

@VladimirKadlec @xmican10 regarding below

@asamal4 I understand we are in the middle of moving to the "agents". Do we want to call all the existing "API" comments/field/variable as "agent"?

Few things may not also apply to all agent type; ex: TTFT. But probably safe to call it as agent_ttft. I am also trying to avoid any backward compatibility in future, so I was suggesting to change only the new fields. But I understand your concern, it is definitely confusing.

How about we just change it for new user facing column/property names and method names (library mode) and avoid large change for the internal logic. And have a follow up PR where we make it consistent or at least add coments.

Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !!

Note: We have discussed about inconsistencies during caching usage.. We will have to discuss, what is going to be right approach here 1. use 0 for token/latency or 2. previous cached value.. Currently we have both mechanisms.

Comment thread src/lightspeed_evaluation/pipeline/evaluation/evaluator.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

297-313: 💤 Low value

Docstring inconsistency with actual behavior.

The docstring states the return value "may have count=0 if no valid data", but compute_numeric_stats returns None for an empty list (not a NumericStats with count=0). The function correctly returns None in both cases (no evaluation_data, or no valid latency values), so the docstring should reflect this.

📝 Suggested docstring fix
     """Compute agent latency statistics from evaluation data.

     Args:
         evaluation_data: List of evaluation data containing turn-level latency values.

     Returns:
-        NumericStats instance (may have count=0 if no valid data), or None if no evaluation data.
+        NumericStats instance with computed statistics, or None if no evaluation data
+        or no valid (non-zero) latency values exist.
     """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 297 - 313,
Update the compute_agent_latency_stats docstring to accurately describe its
return behavior: state that it returns a NumericStats instance when latency
values exist and otherwise returns None (both when evaluation_data is empty and
when there are no valid agent_latency values). Reference the
compute_agent_latency_stats function and the helper
compute_field_numeric_stats_from_evaluation_data so reviewers know where the
behavior originates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 297-313: Update the compute_agent_latency_stats docstring to
accurately describe its return behavior: state that it returns a NumericStats
instance when latency values exist and otherwise returns None (both when
evaluation_data is empty and when there are no valid agent_latency values).
Reference the compute_agent_latency_stats function and the helper
compute_field_numeric_stats_from_evaluation_data so reviewers know where the
behavior originates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ed3f2085-84bd-4331-967b-c0a2f6ff2157

📥 Commits

Reviewing files that changed from the base of the PR and between 861d59e and 87e9fd0.

📒 Files selected for processing (24)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/statistics.py
  • src/lightspeed_evaluation/core/models/summary.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/output/test_statistics_detailed.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/pipeline/evaluation/test_amender.py
✅ Files skipped from review due to trivial changes (2)
  • README.md
  • docs/EVALUATION_GUIDE.md
🚧 Files skipped from review as they are similar to previous changes (14)
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/storage/test_sql_storage.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • tests/unit/core/models/test_summary.py
  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/core/models/conftest.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • config/system.yaml
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_generator.py
  • src/lightspeed_evaluation/core/output/generator.py
  • tests/unit/core/models/test_quality.py

@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch from 87e9fd0 to 244cc20 Compare May 18, 2026 11:33
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/output/generator.py (1)

990-997: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Streaming percentile stats (p95/p99) are not included in JSON output.

_streaming_stats_to_dict excludes p95 and p99 fields that are now part of NumericStats, causing streaming performance metrics to lose percentile data in reports.

📝 Proposed fix
         if numeric is not None:
             result[field_name] = {
                 "count": numeric.count,
                 "mean": numeric.mean,
                 "median": numeric.median,
                 "std": numeric.std,
                 "min": numeric.min_value,
                 "max": numeric.max_value,
+                "p95": numeric.p95,
+                "p99": numeric.p99,
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/generator.py` around lines 990 - 997,
The _streaming_stats_to_dict function currently maps NumericStats fields into
the result dict but omits the percentile fields p95 and p99; update the mapping
for result[field_name] to include "p95": numeric.p95 and "p99": numeric.p99
alongside count/mean/median/std/min/max so streaming NumericStats percentile
values are preserved in the JSON output.
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/models/summary.py (1)

76-78: 💤 Low value

Consider updating docstring to mention latency stats.

The docstring mentions "API token and streaming stats" but now also computes agent_latency_stats.

📝 Suggested docstring update
             evaluation_data: Optional evaluation data for API token and streaming stats.
+                Also used to compute agent latency statistics when available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/models/summary.py` around lines 76 - 78,
Update the docstring for the function or class that takes "results" and
"evaluation_data" (where agent_latency_stats is now computed) to explicitly
mention that it also computes and returns agent latency statistics
(agent_latency_stats) in addition to API token and streaming stats; locate the
docstring in src/lightspeed_evaluation/core/models/summary.py near the
parameters block (the function/method handling results and evaluation_data) and
add a brief description of agent_latency_stats and its purpose/format.
src/lightspeed_evaluation/core/output/generator.py (1)

593-594: 💤 Low value

Use precision parameter for std dev formatting.

Standard deviation uses hardcoded .3f while other fields use the precision parameter.

📝 Suggested fix
         if include_std and stats["count"] > 1:
-            f.write(f"  Std Dev: {stats['std']:.3f}\n")
+            f.write(f"  Std Dev: {stats['std']:{fmt}}\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/generator.py` around lines 593 - 594,
The Std Dev line currently uses a hardcoded ".3f"; update the formatting to use
the precision parameter like the other fields by replacing f.write(f"  Std Dev:
{stats['std']:.3f}\n") with an f-string that injects precision (e.g. f"  Std
Dev: {stats['std']:.{precision}f}\n") so the std dev respects the precision
variable used elsewhere in the generator (refer to include_std, stats and
precision in this file).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 990-997: The _streaming_stats_to_dict function currently maps
NumericStats fields into the result dict but omits the percentile fields p95 and
p99; update the mapping for result[field_name] to include "p95": numeric.p95 and
"p99": numeric.p99 alongside count/mean/median/std/min/max so streaming
NumericStats percentile values are preserved in the JSON output.

---

Nitpick comments:
In `@src/lightspeed_evaluation/core/models/summary.py`:
- Around line 76-78: Update the docstring for the function or class that takes
"results" and "evaluation_data" (where agent_latency_stats is now computed) to
explicitly mention that it also computes and returns agent latency statistics
(agent_latency_stats) in addition to API token and streaming stats; locate the
docstring in src/lightspeed_evaluation/core/models/summary.py near the
parameters block (the function/method handling results and evaluation_data) and
add a brief description of agent_latency_stats and its purpose/format.

In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 593-594: The Std Dev line currently uses a hardcoded ".3f"; update
the formatting to use the precision parameter like the other fields by replacing
f.write(f"  Std Dev: {stats['std']:.3f}\n") with an f-string that injects
precision (e.g. f"  Std Dev: {stats['std']:.{precision}f}\n") so the std dev
respects the precision variable used elsewhere in the generator (refer to
include_std, stats and precision in this file).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4d99b3bc-178f-42c3-bcf0-bc4eece41999

📥 Commits

Reviewing files that changed from the base of the PR and between 87e9fd0 and 244cc20.

📒 Files selected for processing (24)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/statistics.py
  • src/lightspeed_evaluation/core/models/summary.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/output/test_statistics_detailed.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/pipeline/evaluation/test_amender.py
✅ Files skipped from review due to trivial changes (2)
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • README.md
🚧 Files skipped from review as they are similar to previous changes (18)
  • tests/unit/core/storage/test_sql_storage.py
  • src/lightspeed_evaluation/core/models/init.py
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/models/conftest.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/models/test_summary.py
  • docs/EVALUATION_GUIDE.md
  • tests/unit/pipeline/evaluation/test_amender.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/models/quality.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/models/test_quality.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • tests/unit/core/output/test_statistics_detailed.py

@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch from 244cc20 to 6e03548 Compare May 18, 2026 12:06
@xmican10
Copy link
Copy Markdown
Collaborator Author

Rebased, resolved conflicts and addressed all comments.

Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !! LGTM..
Note: we will have to start incrementally change api to agent..

@asamal4 asamal4 merged commit 377a66b into lightspeed-core:main May 20, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants