[LEADS-348, LEADS-364] API Latency and Token Calculation by xmican10 · Pull Request #230 · lightspeed-core/lightspeed-evaluation

xmican10 · 2026-05-06T10:52:41Z

Description

This PR addresses the following acceptance criteria (LEADS-348):

Calculate api execution time and store in final result. agent_latency
Add statistics for all 3 latencies → 50, 95, 99 percentile.
Add information to all relevant reports including the new quality report (LEADS-349)

For LEADS-364 API token statistics were added into quality_report.json

test_statistics module was splited due to reaching file line limits

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude(e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

New Features
- API latency tracking: per-turn and conversation latency with percentile metrics (p50, p95, p99) included in outputs and CSV.
- Agent-focused token statistics and latency shown in summary and quality reports.
Documentation
- Evaluation guide updated with a Performance Metrics section describing latency percentiles, streaming metrics, and token reporting.
Tests
- Added and updated unit tests covering latency stats, agent token stats, CSV/export presence, and report generation.

coderabbitai · 2026-05-06T10:52:48Z

Warning

Rate limit exceeded

@xmican10 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 18 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 09109dbb-9f90-4618-a890-732d5281d067

📥 Commits

Reviewing files that changed from the base of the PR and between 244cc20 and 6e03548.

📒 Files selected for processing (24)

README.md
config/system.yaml
docs/EVALUATION_GUIDE.md
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/quality.py
src/lightspeed_evaluation/core/models/statistics.py
src/lightspeed_evaluation/core/models/summary.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/runner/evaluation.py
tests/unit/core/models/conftest.py
tests/unit/core/models/test_quality.py
tests/unit/core/models/test_summary.py
tests/unit/core/output/test_generator.py
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_statistics_api.py
tests/unit/core/output/test_statistics_detailed.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/pipeline/evaluation/test_amender.py

Walkthrough

Adds end-to-end agent/API latency measurement and percentile stats (p50/p95/p99): collects per-call latency, aggregates percentiles, persists values, exposes agent token usage with percentile statistics, updates quality/summary outputs, and adds tests and docs.

Changes

API Latency Tracking Feature

Layer / File(s)	Summary
Docs and CSV config `README.md`, `docs/EVALUATION_GUIDE.md`, `config/system.yaml`	Add README feature bullet, document CSV field and performance metrics section, include `agent_latency` in CSV columns.
Data models & public exports `src/lightspeed_evaluation/core/models/data.py`, `src/lightspeed_evaluation/core/models/__init__.py`, `src/lightspeed_evaluation/core/constants.py`	Add `agent_latency` to `TurnData` and `EvaluationResult`, update supported CSV columns and package-level token usage export name.
Stat models & functions `src/lightspeed_evaluation/core/models/statistics.py`, `src/lightspeed_evaluation/core/output/statistics.py`	Add `p95`/`p99` to `NumericStats`, replace `ApiTokenUsage` with `AgentTokenUsage`/`AgentTokenStats`, extend compute_numeric_stats for percentiles, add helpers to compute field numeric stats and agent token stats/usage.
Summary & Quality report wiring `src/lightspeed_evaluation/core/models/summary.py`, `src/lightspeed_evaluation/core/models/quality.py`	Compute/expose `agent_token_usage` and `agent_latency_stats` in EvaluationSummary and accept/populate these in QualityReport.create_report.
API call latency measurement `src/lightspeed_evaluation/pipeline/evaluation/amender.py`	Measure API call latency with perf_counter(), set `turn_data.agent_latency`, skip cached responses (zero tokens), and record on APIError.
Per-request latency compute `src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`	Compute per-request `agent_latency` (turn-level or conversation sum) and include it in both success and error EvaluationResult payloads; rename token-count helper accordingly.
Database persistence `src/lightspeed_evaluation/core/storage/sql_storage.py`	Add nullable `agent_latency` ORM column and persist `EvaluationResult.agent_latency` to DB records.
Output generation & rendering `src/lightspeed_evaluation/core/output/generator.py`	Pass agent stats to QualityReport, add helper to build agent token stats JSON, include `agent_latency_stats` in JSON/text summary, extend per-result JSON with `agent_latency` and `tokens_per_second`, and render std/p95/p99 where available.
Runner integration `src/lightspeed_evaluation/runner/evaluation.py`	Switch runner to `AgentTokenUsage` and `compute_agent_token_usage` and update `_print_summary` typing.
Test fixtures `tests/unit/core/models/conftest.py`	Add fixtures returning NumericStats and AgentTokenStats for tests.
Quality report tests `tests/unit/core/models/test_quality.py`	Update and add tests covering agent latency and agent token stats in QualityReport creations and edge cases.
Summary tests `tests/unit/core/models/test_summary.py`	Add tests asserting agent latency stats computed from nonzero values, ignored zeros, and serialization expectations.
Output generator tests `tests/unit/core/output/test_generator.py`	Update tests to pass agent stats and verify new JSON/text output fields.
Statistics tests `tests/unit/core/output/test_statistics.py`, `tests/unit/core/output/test_statistics_api.py`	Refactor tests: narrow/truncate legacy file and add test_statistics_api.py covering compute_agent_token_usage, agent latency stats helper, and overall stats token tracking.
Detailed stats tests `tests/unit/core/output/test_statistics_detailed.py`	Add comprehensive tests for compute_detailed_stats: grouping, outcome counts, rates, score stats, CI behavior, and tag breakdowns.
Amender & storage tests `tests/unit/pipeline/evaluation/test_amender.py`, `tests/unit/core/storage/test_sql_storage.py`	Add amender latency measurement tests (mocked perf_counter, missing client, cached responses) and assert DB CSV column includes `agent_latency`.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

lightspeed-core/lightspeed-evaluation#233: PR #233 introduced refactored statistics/token-usage plumbing that this change extends with agent latency and nested token statistics.

Suggested Reviewers

asamal4
VladimirKadlec

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main changes: API latency calculation and token calculation are the core features implemented across the PR, reflected in all major file modifications.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
119-122: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Set agent_latency in the error path for accurate timing.

When the API call fails, turn_data.agent_latency is never set. This means it will use the default value (likely 0.0), which misrepresents the actual time spent on the failed API call. Consider setting agent_latency based on the elapsed time even when an error occurs, to accurately track API call overhead.
🕒 Proposed fix to track latency in error path
         except APIError as e:
+            api_latency = time.perf_counter() - api_start_time
+            turn_data.agent_latency = api_latency
             error_msg = f"API Error for turn {turn_data.turn_id}: {e}"
             logger.error(error_msg)
             return error_msg, conversation_id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 119 -
122, In the APIError except block inside amender.py (the except APIError as e:
handler for the call that sets turn_data), record the elapsed time and assign it
to turn_data.agent_latency before logging and returning: compute the latency
using the same start/end timing mechanism used in the successful path (or
capture time at exception entry), set turn_data.agent_latency = elapsed, then
include that updated timing prior to calling logger.error and returning
error_msg, conversation_id so failed calls reflect actual agent_latency.

🧹 Nitpick comments (4)

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

71-76: 💤 Low value

Consider tracking API call time separately from agent processing time.

The current logic sets agent_latency to 0.0 for cached responses (when tokens are zero), which discards network overhead and authentication time. While this appears intentional (tracking only agent processing time), it means you lose visibility into the total API call latency for cached responses.

Consider one of these alternatives:

Track both api_call_time (total) and agent_compute_time (processing only)

Document that agent_latency excludes cached response overhead

Verify that zero tokens reliably indicates a cached response in all scenarios
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 71 -
76, The current assignment to turn_data.agent_latency overwrites total API
latency for cached responses (when turn_data.api_input_tokens and
turn_data.api_output_tokens are zero); change this by adding separate fields
(e.g., turn_data.api_call_time = api_latency and turn_data.agent_compute_time =
... ) and set agent_compute_time to api_latency only when tokens > 0 (or to 0
otherwise), so you preserve total api_latency in turn_data.api_call_time while
keeping agent processing time in turn_data.agent_compute_time; update any
downstream usage of turn_data.agent_latency to use the new fields or document
the semantics clearly in the turn_data class/definition.

src/lightspeed_evaluation/core/output/statistics.py (1)

311-314: ⚡ Quick win

Use Google-style docstring for the new public helper.

calculate_field_numeric_stats_from_evaluation_data is a public API in src/ and should include explicit Args and Returns sections for consistency with repository standards.

As per coding guidelines src/**/*.py: "Use Google-style docstrings for all public APIs in Python".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 311 - 314,
The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.

tests/unit/core/models/test_summary.py (2)

223-241: 💤 Low value

Consider adding a test for mixed zero and non-zero latencies.

The current tests cover "all non-zero" and "all zero" scenarios. Consider adding a test where some turns have agent_latency=0 (e.g., cached responses) and others have positive values (actual API calls). This would verify the behavior when API calls are only partially measured.

📝 Example test case

def test_with_mixed_api_latency(self) -> None:
    """Test from_results with mixed zero and non-zero latencies."""
    results = [_make_result(turn_id="t1")]
    
    eval_data = [
        EvaluationData(
            conversation_group_id="conv1",
            turns=[
                TurnData(turn_id="t1", query="Query 1", agent_latency=1.5),
                TurnData(turn_id="t2", query="Query 2", agent_latency=0),  # cached
                TurnData(turn_id="t3", query="Query 3", agent_latency=2.0),
            ],
        )
    ]
    
    summary = EvaluationSummary.from_results(results, evaluation_data=eval_data)
    
    # Should compute stats only from non-zero values (1.5, 2.0)
    assert summary.agent_latency_stats is not None
    assert summary.agent_latency_stats.count == 2
    assert summary.agent_latency_stats.min_value == 1.5
    assert summary.agent_latency_stats.max_value == 2.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` around lines 223 - 241, Add a new
unit test that calls EvaluationSummary.from_results with EvaluationData
containing a mix of TurnData entries where some have agent_latency == 0 and
others have positive values; verify that EvaluationSummary.agent_latency_stats
is not None and that its count, min_value, and max_value are computed only from
the non-zero latencies (e.g., ensure count==number of positive latencies and
min/max reflect those values). Use the existing test pattern and helper
_make_result to create results and assert the expected stats for the non-zero
latencies.

199-222: 💤 Low value

Consider more precise assertion for mean latency.

Line 219 asserts mean > 0 which is very loose. Given the input values (1.5, 2.0, 1.8), the expected mean is approximately 1.7667. A more precise assertion would validate the calculation correctness.

🔍 Suggested improvement

         assert summary.agent_latency_stats is not None
         assert summary.agent_latency_stats.count == 3
-        assert summary.agent_latency_stats.mean is not None
-        assert summary.agent_latency_stats.mean > 0
+        expected_mean = (1.5 + 2.0 + 1.8) / 3  # ≈ 1.7667
+        assert summary.agent_latency_stats.mean is not None
+        assert abs(summary.agent_latency_stats.mean - expected_mean) < 0.0001
         assert summary.agent_latency_stats.min_value == 1.5
         assert summary.agent_latency_stats.max_value == 2.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` around lines 199 - 222, The test
test_with_api_latency currently asserts summary.agent_latency_stats.mean > 0
which is too loose; update the assertion after calling
EvaluationSummary.from_results to check the mean against the expected value
(mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison (e.g., assert
round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4) or use
pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 523-529: The NumericStats fields on agent_latency (mean, median,
std, min_value, max_value, p95, p99) may be None and the current f"{...:.3f}"
formatting will raise; update the block that writes these stats in generator.py
to guard each field (e.g., check value is not None) and format conditionally—use
f"{value:.3f}s" when value is present and a safe fallback string like "N/A" (or
"0.000s" if preferred) when None so writing never raises; ensure all referenced
symbols (agent_latency.mean, agent_latency.median, agent_latency.std,
agent_latency.min_value, agent_latency.max_value, agent_latency.p95,
agent_latency.p99) are handled consistently.
- Around line 779-783: The JSON output drops newly added percentile fields
because _streaming_stats_to_dict (used when summary.streaming is not None in
generator.py) does not include p95/p99 for TTFT, duration, and throughput;
update _streaming_stats_to_dict to map and include the percentile fields (p95
and p99) from the incoming streaming stats object into the returned dict (e.g.,
add keys like ttft_p95, ttft_p99, duration_p95, duration_p99, throughput_p95,
throughput_p99 or preserve existing naming conventions) so summary.streaming
serializes the new percentile data into result["streaming_performance"].

In `@tests/unit/pipeline/evaluation/test_amender.py`:
- Around line 267-299: The test is flaky because it asserts turn.agent_latency >
0 based on real timing; patch time.perf_counter inside the test so
amend_single_turn's timing is deterministic: use
mocker.patch("time.perf_counter", side_effect=[START, END]) (or return_value for
a single call if amend_single_turn only reads once) before calling
APIDataAmender.amend_single_turn, choose START and END such that END > START
(e.g. 1.0, 1.5) so agent_latency becomes a known value, then assert the exact
expected latency (END-START) and token fields as before; reference
APIDataAmender.amend_single_turn and TurnData.agent_latency when making the
change.

---

Outside diff comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 119-122: In the APIError except block inside amender.py (the
except APIError as e: handler for the call that sets turn_data), record the
elapsed time and assign it to turn_data.agent_latency before logging and
returning: compute the latency using the same start/end timing mechanism used in
the successful path (or capture time at exception entry), set
turn_data.agent_latency = elapsed, then include that updated timing prior to
calling logger.error and returning error_msg, conversation_id so failed calls
reflect actual agent_latency.

---

Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 311-314: The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 71-76: The current assignment to turn_data.agent_latency
overwrites total API latency for cached responses (when
turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change
this by adding separate fields (e.g., turn_data.api_call_time = api_latency and
turn_data.agent_compute_time = ... ) and set agent_compute_time to api_latency
only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in
turn_data.api_call_time while keeping agent processing time in
turn_data.agent_compute_time; update any downstream usage of
turn_data.agent_latency to use the new fields or document the semantics clearly
in the turn_data class/definition.

In `@tests/unit/core/models/test_summary.py`:
- Around line 223-241: Add a new unit test that calls
EvaluationSummary.from_results with EvaluationData containing a mix of TurnData
entries where some have agent_latency == 0 and others have positive values;
verify that EvaluationSummary.agent_latency_stats is not None and that its
count, min_value, and max_value are computed only from the non-zero latencies
(e.g., ensure count==number of positive latencies and min/max reflect those
values). Use the existing test pattern and helper _make_result to create results
and assert the expected stats for the non-zero latencies.
- Around line 199-222: The test test_with_api_latency currently asserts
summary.agent_latency_stats.mean > 0 which is too loose; update the assertion
after calling EvaluationSummary.from_results to check the mean against the
expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison
(e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4)
or use pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e495129c-0580-484c-8dd8-b4f993bdcbae

📥 Commits

Reviewing files that changed from the base of the PR and between 1951eb9 and 861d59e.

📒 Files selected for processing (21)

README.md
config/system.yaml
docs/EVALUATION_GUIDE.md
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/quality.py
src/lightspeed_evaluation/core/models/summary.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/core/models/conftest.py
tests/unit/core/models/test_quality.py
tests/unit/core/models/test_summary.py
tests/unit/core/output/test_generator.py
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_statistics_api.py
tests/unit/core/output/test_statistics_detailed.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/pipeline/evaluation/test_amender.py

VladimirKadlec

Renaming "api" to "agent" is really confusing (and it's probably confusing for Claude&spol as well). For example:

in system.yaml we have:
- "api_input_tokens"
- "api_output_tokens"
- "agent_latency"
in src/lightspeed_evaluation/core/models/data.py:
- fields agent_latency with descriptions "API latency ..."

turn_data.agent_latency = api_latency

    def _write_agent_latency_stats( ..... ) -> None:
        """Write API latency statistics section."""
        if agent_latency is None:
            return  # No API latency data available

and many other inconsistent examples in comments/field/variable naming.

@asamal4 I understand we are in the middle of moving to the "agents". Do we want to call all the existing "API" comments/field/variable as "agent"?

Otherwise, LGTM 👍

asamal4 · 2026-05-15T10:03:00Z

@VladimirKadlec @xmican10 regarding below

@asamal4 I understand we are in the middle of moving to the "agents". Do we want to call all the existing "API" comments/field/variable as "agent"?

Few things may not also apply to all agent type; ex: TTFT. But probably safe to call it as agent_ttft. I am also trying to avoid any backward compatibility in future, so I was suggesting to change only the new fields. But I understand your concern, it is definitely confusing.

How about we just change it for new user facing column/property names and method names (library mode) and avoid large change for the internal logic. And have a follow up PR where we make it consistent or at least add coments.

asamal4

Thank you !!

Note: We have discussed about inconsistencies during caching usage.. We will have to discuss, what is going to be right approach here 1. use 0 for token/latency or 2. previous cached value.. Currently we have both mechanisms.

coderabbitai

🧹 Nitpick comments (1)

src/lightspeed_evaluation/core/output/statistics.py (1)

297-313: 💤 Low value

Docstring inconsistency with actual behavior.

The docstring states the return value "may have count=0 if no valid data", but compute_numeric_stats returns None for an empty list (not a NumericStats with count=0). The function correctly returns None in both cases (no evaluation_data, or no valid latency values), so the docstring should reflect this.

📝 Suggested docstring fix

     """Compute agent latency statistics from evaluation data.

     Args:
         evaluation_data: List of evaluation data containing turn-level latency values.

     Returns:
-        NumericStats instance (may have count=0 if no valid data), or None if no evaluation data.
+        NumericStats instance with computed statistics, or None if no evaluation data
+        or no valid (non-zero) latency values exist.
     """

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 297 - 313,
Update the compute_agent_latency_stats docstring to accurately describe its
return behavior: state that it returns a NumericStats instance when latency
values exist and otherwise returns None (both when evaluation_data is empty and
when there are no valid agent_latency values). Reference the
compute_agent_latency_stats function and the helper
compute_field_numeric_stats_from_evaluation_data so reviewers know where the
behavior originates.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 297-313: Update the compute_agent_latency_stats docstring to
accurately describe its return behavior: state that it returns a NumericStats
instance when latency values exist and otherwise returns None (both when
evaluation_data is empty and when there are no valid agent_latency values).
Reference the compute_agent_latency_stats function and the helper
compute_field_numeric_stats_from_evaluation_data so reviewers know where the
behavior originates.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ed3f2085-84bd-4331-967b-c0a2f6ff2157

📥 Commits

Reviewing files that changed from the base of the PR and between 861d59e and 87e9fd0.

📒 Files selected for processing (24)

README.md
config/system.yaml
docs/EVALUATION_GUIDE.md
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/quality.py
src/lightspeed_evaluation/core/models/statistics.py
src/lightspeed_evaluation/core/models/summary.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/runner/evaluation.py
tests/unit/core/models/conftest.py
tests/unit/core/models/test_quality.py
tests/unit/core/models/test_summary.py
tests/unit/core/output/test_generator.py
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_statistics_api.py
tests/unit/core/output/test_statistics_detailed.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/pipeline/evaluation/test_amender.py

✅ Files skipped from review due to trivial changes (2)

README.md
docs/EVALUATION_GUIDE.md

🚧 Files skipped from review as they are similar to previous changes (14)

src/lightspeed_evaluation/core/constants.py
tests/unit/core/storage/test_sql_storage.py
src/lightspeed_evaluation/core/storage/sql_storage.py
tests/unit/core/models/test_summary.py
src/lightspeed_evaluation/core/models/data.py
tests/unit/core/models/conftest.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
config/system.yaml
tests/unit/pipeline/evaluation/test_amender.py
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_generator.py
src/lightspeed_evaluation/core/output/generator.py
tests/unit/core/models/test_quality.py

coderabbitai

♻️ Duplicate comments (1)

src/lightspeed_evaluation/core/output/generator.py (1)

990-997: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Streaming percentile stats (p95/p99) are not included in JSON output.

_streaming_stats_to_dict excludes p95 and p99 fields that are now part of NumericStats, causing streaming performance metrics to lose percentile data in reports.

📝 Proposed fix

         if numeric is not None:
             result[field_name] = {
                 "count": numeric.count,
                 "mean": numeric.mean,
                 "median": numeric.median,
                 "std": numeric.std,
                 "min": numeric.min_value,
                 "max": numeric.max_value,
+                "p95": numeric.p95,
+                "p99": numeric.p99,
             }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/generator.py` around lines 990 - 997,
The _streaming_stats_to_dict function currently maps NumericStats fields into
the result dict but omits the percentile fields p95 and p99; update the mapping
for result[field_name] to include "p95": numeric.p95 and "p99": numeric.p99
alongside count/mean/median/std/min/max so streaming NumericStats percentile
values are preserved in the JSON output.

🧹 Nitpick comments (2)

src/lightspeed_evaluation/core/models/summary.py (1)

76-78: 💤 Low value

Consider updating docstring to mention latency stats.

The docstring mentions "API token and streaming stats" but now also computes agent_latency_stats.

📝 Suggested docstring update

             evaluation_data: Optional evaluation data for API token and streaming stats.
+                Also used to compute agent latency statistics when available.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/models/summary.py` around lines 76 - 78,
Update the docstring for the function or class that takes "results" and
"evaluation_data" (where agent_latency_stats is now computed) to explicitly
mention that it also computes and returns agent latency statistics
(agent_latency_stats) in addition to API token and streaming stats; locate the
docstring in src/lightspeed_evaluation/core/models/summary.py near the
parameters block (the function/method handling results and evaluation_data) and
add a brief description of agent_latency_stats and its purpose/format.

src/lightspeed_evaluation/core/output/generator.py (1)

593-594: 💤 Low value

Use precision parameter for std dev formatting.

Standard deviation uses hardcoded .3f while other fields use the precision parameter.

📝 Suggested fix

         if include_std and stats["count"] > 1:
-            f.write(f"  Std Dev: {stats['std']:.3f}\n")
+            f.write(f"  Std Dev: {stats['std']:{fmt}}\n")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/generator.py` around lines 593 - 594,
The Std Dev line currently uses a hardcoded ".3f"; update the formatting to use
the precision parameter like the other fields by replacing f.write(f"  Std Dev:
{stats['std']:.3f}\n") with an f-string that injects precision (e.g. f"  Std
Dev: {stats['std']:.{precision}f}\n") so the std dev respects the precision
variable used elsewhere in the generator (refer to include_std, stats and
precision in this file).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 990-997: The _streaming_stats_to_dict function currently maps
NumericStats fields into the result dict but omits the percentile fields p95 and
p99; update the mapping for result[field_name] to include "p95": numeric.p95 and
"p99": numeric.p99 alongside count/mean/median/std/min/max so streaming
NumericStats percentile values are preserved in the JSON output.

---

Nitpick comments:
In `@src/lightspeed_evaluation/core/models/summary.py`:
- Around line 76-78: Update the docstring for the function or class that takes
"results" and "evaluation_data" (where agent_latency_stats is now computed) to
explicitly mention that it also computes and returns agent latency statistics
(agent_latency_stats) in addition to API token and streaming stats; locate the
docstring in src/lightspeed_evaluation/core/models/summary.py near the
parameters block (the function/method handling results and evaluation_data) and
add a brief description of agent_latency_stats and its purpose/format.

In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 593-594: The Std Dev line currently uses a hardcoded ".3f"; update
the formatting to use the precision parameter like the other fields by replacing
f.write(f"  Std Dev: {stats['std']:.3f}\n") with an f-string that injects
precision (e.g. f"  Std Dev: {stats['std']:.{precision}f}\n") so the std dev
respects the precision variable used elsewhere in the generator (refer to
include_std, stats and precision in this file).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4d99b3bc-178f-42c3-bcf0-bc4eece41999

📥 Commits

Reviewing files that changed from the base of the PR and between 87e9fd0 and 244cc20.

📒 Files selected for processing (24)

README.md
config/system.yaml
docs/EVALUATION_GUIDE.md
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/quality.py
src/lightspeed_evaluation/core/models/statistics.py
src/lightspeed_evaluation/core/models/summary.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/runner/evaluation.py
tests/unit/core/models/conftest.py
tests/unit/core/models/test_quality.py
tests/unit/core/models/test_summary.py
tests/unit/core/output/test_generator.py
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_statistics_api.py
tests/unit/core/output/test_statistics_detailed.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/pipeline/evaluation/test_amender.py

✅ Files skipped from review due to trivial changes (2)

src/lightspeed_evaluation/core/storage/sql_storage.py
README.md

🚧 Files skipped from review as they are similar to previous changes (18)

tests/unit/core/storage/test_sql_storage.py
src/lightspeed_evaluation/core/models/init.py
src/lightspeed_evaluation/core/constants.py
tests/unit/core/models/conftest.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/statistics.py
tests/unit/core/output/test_statistics_api.py
tests/unit/core/models/test_summary.py
docs/EVALUATION_GUIDE.md
tests/unit/pipeline/evaluation/test_amender.py
src/lightspeed_evaluation/runner/evaluation.py
src/lightspeed_evaluation/core/models/quality.py
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_generator.py
tests/unit/core/models/test_quality.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
tests/unit/core/output/test_statistics_detailed.py

xmican10 · 2026-05-18T12:09:12Z

Rebased, resolved conflicts and addressed all comments.

asamal4

Thank you !! LGTM..
Note: we will have to start incrementally change api to agent..

xmican10 force-pushed the LEADS-348-calculate-api-latency branch 4 times, most recently from a00a8f1 to 861d59e Compare May 13, 2026 08:10

xmican10 changed the title ~~[LEADS-348] API Latency Calculation~~ [LEADS-348, LEADS-364] API Latency Calculation May 13, 2026

xmican10 changed the title ~~[LEADS-348, LEADS-364] API Latency Calculation~~ [LEADS-348, LEADS-364] API Latency and Token Calculation May 13, 2026

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/core/output/generator.py Outdated

Comment thread src/lightspeed_evaluation/core/output/generator.py

Comment thread tests/unit/pipeline/evaluation/test_amender.py

xmican10 force-pushed the LEADS-348-calculate-api-latency branch 3 times, most recently from be3c850 to b07a6f5 Compare May 13, 2026 12:53

xmican10 marked this pull request as ready for review May 13, 2026 14:51

VladimirKadlec previously approved these changes May 14, 2026

View reviewed changes

xmican10 mentioned this pull request May 14, 2026

[LEADS-348] evaluation latency #235

Open

15 tasks

asamal4 reviewed May 15, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/pipeline/evaluation/evaluator.py Outdated

asamal4 mentioned this pull request May 15, 2026

feat: add agent driver abstraction layer for evaluation pipeline #231

Merged

5 tasks

LEADS-349-calculate-aggregated-score-from-key-metrics

cbc9611

xmican10 dismissed VladimirKadlec’s stale review via 87e9fd0 May 18, 2026 09:33

xmican10 force-pushed the LEADS-348-calculate-api-latency branch from b07a6f5 to 87e9fd0 Compare May 18, 2026 09:33

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

xmican10 force-pushed the LEADS-348-calculate-api-latency branch from 87e9fd0 to 244cc20 Compare May 18, 2026 11:33

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

API latency calculation

6e03548

xmican10 force-pushed the LEADS-348-calculate-api-latency branch from 244cc20 to 6e03548 Compare May 18, 2026 12:06

asamal4 approved these changes May 19, 2026

View reviewed changes

asamal4 merged commit 377a66b into lightspeed-core:main May 20, 2026
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LEADS-348, LEADS-364] API Latency and Token Calculation#230

[LEADS-348, LEADS-364] API Latency and Token Calculation#230
asamal4 merged 2 commits into
lightspeed-core:mainfrom
xmican10:LEADS-348-calculate-api-latency

xmican10 commented May 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 6, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VladimirKadlec left a comment

Uh oh!

asamal4 commented May 15, 2026

Uh oh!

asamal4 left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

xmican10 commented May 18, 2026

Uh oh!

asamal4 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xmican10 commented May 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Possibly Related PRs

Suggested Reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VladimirKadlec left a comment

Choose a reason for hiding this comment

Uh oh!

asamal4 commented May 15, 2026

Uh oh!

asamal4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

xmican10 commented May 18, 2026

Uh oh!

asamal4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xmican10 commented May 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 6, 2026 •

edited

Loading