@wkalt wkalt commented Dec 17, 2025

BREAKING CHANGE: removes the cumulative_cpu metric from analyze plan output.

This adds a cumulative_wall metric to execution stats. This is useful for more clearly understanding the execution times experienced by the user. Unlike the existing timers, it includes IO. When IO is performed across children in parallel, the computed time range is based on the collective minimum and maximum bounds of the children.

This PR also removes the "cumulative_cpu" timer, to reduce the number of items presented across all operators.
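
Conceptually, the cumulative wall time is a fold over the plan tree: each node contributes its own observed start/end instants, and siblings are merged by taking the earliest start and latest end. A minimal sketch of that idea (hypothetical NodeStats type and method names, not the actual exec.rs implementation):

use std::time::{Duration, Instant};

// Hypothetical per-node stats; names here are illustrative, not the
// actual types in rust/lance-datafusion/src/exec.rs.
struct NodeStats {
    min_start: Option<Instant>,
    max_end: Option<Instant>,
    children: Vec<NodeStats>,
}

impl NodeStats {
    // Merge this node's bounds with those of all descendants, so IO
    // performed in parallel across children is covered by one collective
    // [min, max] range instead of being summed.
    fn wall_bounds(&self) -> (Option<Instant>, Option<Instant>) {
        let mut start = self.min_start;
        let mut end = self.max_end;
        for child in &self.children {
            let (cs, ce) = child.wall_bounds();
            start = match (start, cs) {
                (Some(a), Some(b)) => Some(a.min(b)),
                (a, b) => a.or(b),
            };
            end = match (end, ce) {
                (Some(a), Some(b)) => Some(a.max(b)),
                (a, b) => a.or(b),
            };
        }
        (start, end)
    }

    // The cumulative wall time is the span of the collective range.
    fn cumulative_wall(&self) -> Option<Duration> {
        let (start, end) = self.wall_bounds();
        Some(end?.duration_since(start?))
    }
}

Because overlap between siblings collapses into a single range, a node's wall time under this scheme counts concurrent IO once, and a child's range can never exceed the range reported by its parent.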
@github-actions github-actions bot added the enhancement (New feature or request) label Dec 17, 2025

wkalt commented Dec 17, 2025

I have a simple script here:

import lance

ds = lance.dataset("/home/wyatt/.lancedb/databases/wyatt/wikipedia_full.lance")
print(ds.scanner(filter="id < 'aaaaahhhh'").analyze_plan())

Running it prints this:

AnalyzeExec verbose=true, metrics=[], cumulative_cpu=548.306063ms, cumulative_wall=16.251096261s
  TracedExec, metrics=[], cumulative_cpu=548.306063ms, cumulative_wall=16.251096261s
    ProjectionExec: expr=[id@0 as id, url@2 as url, title@3 as title, text@4 as text], metrics=[output_rows=6458670, elapsed_compute=14.068535ms], cumulative_cpu=548.306063ms, cumulative_wall=16.251096261s
      Take: columns="id, _rowid, (url), (title), (text)", metrics=[output_rows=6458670, elapsed_compute=352.750872ms, batches_processed=789, bytes_read=19367085082, iops=8549, requests=8523], cumulative_cpu=534.237528ms, cumulative_wall=15.542186413s
        CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=6458670, elapsed_compute=17.215249ms], cumulative_cpu=181.486656ms, cumulative_wall=15.542186413s
          LanceRead: uri=home/wyatt/.lancedb/databases/wyatt/wikipedia_full.lance/data, projection=[id], num_fragments=12, range_before=None, range_after=None, row_id=true, row_addr=false, full_filter=id < Utf8("aaaaahhhh"), refine_filter=id < Utf8("aaaaahhhh"), metrics=[output_rows=6458670, elapsed_compute=164.271407ms, fragments_scanned=12, ranges_scanned=12, rows_scanned=6458670, bytes_read=101683557, iops=55, requests=55, task_wait_time=164.099785ms], cumulative_cpu=164.271407ms, cumulative_wall=15.522717259s

Prior to this commit it would print this:

AnalyzeExec verbose=true, metrics=[], cumulative_cpu=452.556045ms
  TracedExec, metrics=[], cumulative_cpu=452.556045ms
    ProjectionExec: expr=[id@0 as id, url@2 as url, title@3 as title, text@4 as text], metrics=[output_rows=6458670, elapsed_compute=1.953877ms], cumulative_cpu=452.556045ms
      Take: columns="id, _rowid, (url), (title), (text)", metrics=[output_rows=6458670, elapsed_compute=148.397487ms, batches_processed=789, bytes_read=19367085082, iops=8549, requests=8523], cumulative_cpu=450.602168ms
        CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=6458670, elapsed_compute=2.472034ms], cumulative_cpu=302.204681ms
          LanceRead: uri=home/wyatt/.lancedb/databases/wyatt/wikipedia_full.lance/data, projection=[id], num_fragments=12, range_before=None, range_after=None, row_id=true, row_addr=false, full_filter=id < Utf8("aaaaahhhh"), refine_filter=id < Utf8("aaaaahhhh"), metrics=[output_rows=6458670, elapsed_compute=299.732647ms, fragments_scanned=12, ranges_scanned=12, rows_scanned=6458670, bytes_read=101683557, iops=55, requests=55, task_wait_time=299.626144ms], cumulative_cpu=299.732647ms

The problem with this is that it doesn't convey that the query took 16s, which is really important information for analysis. If there is appetite for it, I would like to go a step further and replace "cumulative_cpu" with "cumulative_wall" entirely. We could even call it just "elapsed", since the meaning would be clear from context:

(hand edited)

AnalyzeExec verbose=true, metrics=[], elapsed=16.251096261s
  TracedExec, metrics=[], elapsed=16.251096261s
    ProjectionExec: expr=[id@0 as id, url@2 as url, title@3 as title, text@4 as text], metrics=[output_rows=6458670, elapsed_compute=14.068535ms], elapsed=16.251096261s
      Take: columns="id, _rowid, (url), (title), (text)", metrics=[output_rows=6458670, elapsed_compute=352.750872ms, batches_processed=789, bytes_read=19367085082, iops=8549, requests=8523], elapsed=15.542186413s
        CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=6458670, elapsed_compute=17.215249ms], elapsed=15.542186413s
          LanceRead: uri=home/wyatt/.lancedb/databases/wyatt/wikipedia_full.lance/data, projection=[id], num_fragments=12, range_before=None, range_after=None, row_id=true, row_addr=false, full_filter=id < Utf8("aaaaahhhh"), refine_filter=id < Utf8("aaaaahhhh"), metrics=[output_rows=6458670, elapsed_compute=164.271407ms, fragments_scanned=12, ranges_scanned=12, rows_scanned=6458670, bytes_read=101683557, iops=55, requests=55, task_wait_time=164.099785ms], elapsed=15.522717259s

There are a couple of issues in this plan I am still hunting down; in particular, I need to understand why the timings don't add up completely. But overall this is more compact than what we have today, and it indicates much more clearly to the user that the time is spent in LanceRead. The headline number is also the timing the user experienced, so nobody runs a 16s query and sees 452ms in the reporting.

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 94.64286% with 3 lines in your changes missing coverage. Please review.

Files with missing lines          | Patch % | Lines
rust/lance-datafusion/src/exec.rs | 94.44%  | 1 Missing and 2 partials ⚠️



wkalt commented Dec 17, 2025

Updated to remove cumulative_cpu and reorder the formatting. Here is the new plan:

AnalyzeExec verbose=true, elapsed=3.574797895s, metrics=[]
  TracedExec, elapsed=3.574797895s, metrics=[]
    ProjectionExec: elapsed=3.574797895s, expr=[id@0 as id, url@2 as url, title@3 as title, text@4 as text], metrics=[output_rows=6458670, elapsed_compute=21.283429ms]
      Take: elapsed=3.456628356s, columns="id, _rowid, (url), (title), (text)", metrics=[output_rows=6458670, elapsed_compute=701.453702ms, batches_processed=789, bytes_read=19623000893, iops=8587, requests=8561]
        CoalesceBatchesExec: elapsed=3.456628356s, target_batch_size=8192, metrics=[output_rows=6458670, elapsed_compute=28.1345ms]
          LanceRead: elapsed=3.442903622s, uri=home/wyatt/.lancedb/databases/wyatt/wikipedia_full.lance/data, projection=[id], num_fragments=12, range_before=None, range_after=None, row_id=true, row_addr=false, full_filter=id < Utf8("aaaaahhhh"), refine_filter=id < Utf8("aaaaahhhh"), metrics=[output_rows=6458670, elapsed_compute=91.143462ms, fragments_scanned=12, ranges_scanned=12, rows_scanned=6458670, bytes_read=101683557, iops=55, requests=55, task_wait_time=90.702025ms]

@wkalt wkalt changed the title feat: track cumulative wall time in analyze plan feat!: track cumulative wall time in analyze plan Dec 17, 2025
@wkalt wkalt force-pushed the task/improve-analyze-timers branch from 83e4b18 to 6b723e8 on December 17, 2025 14:04
@wkalt wkalt force-pushed the task/improve-analyze-timers branch from 5f80c6a to b1dd51c on December 17, 2025 14:39

wkalt commented Dec 17, 2025

Marked as breaking due to the removal of cumulative_cpu. However, I doubt there is a real API commitment around the stability of plan representations; if that's the case, we can remove the marker.


wkalt commented Dec 17, 2025

Here is a more interesting plan, for hybrid search:

Vector Search Plan:
AnalyzeExec verbose=true, elapsed=54.190985ms, metrics=[]
  TracedExec, elapsed=54.190985ms, metrics=[]
    ProjectionExec: elapsed=54.190985ms, expr=[id@3 as id, document_id@2 as document_id, author@4 as author, frame@5 as frame, frame_embedding@6 as frame_embedding, _distance@0 as _distance], metrics=[output_rows=1000, elapsed_compute=1.573µs]
      RemoteTake: elapsed=54.190905ms, columns="_distance, _rowid, document_id, id, author, frame, frame_embedding", metrics=[output_rows=1000, elapsed_compute=12.788163ms, output_batches=1, remote_takes=1000]
        CoalesceBatchesExec: elapsed=12.297833ms, target_batch_size=1024, metrics=[output_rows=1000, elapsed_compute=751ns]
          GlobalLimitExec: elapsed=12.295259ms, skip=0, fetch=1000, metrics=[output_rows=1000, elapsed_compute=3.466µs]
            FilterExec: elapsed=12.293285ms, document_id@2 != 12345 AND document_id@2 != 67890, metrics=[output_rows=1000, elapsed_compute=9.027µs]
              RemoteTake: elapsed=12.291372ms, columns="_distance, _rowid, document_id", metrics=[output_rows=1000, elapsed_compute=2.178325ms, output_batches=1, remote_takes=1000]
                CoalesceBatchesExec: elapsed=5.04686ms, target_batch_size=1024, metrics=[output_rows=1000, elapsed_compute=3.095µs]
                  SortExec: elapsed=4.907856ms, TopK(fetch=1000), expr=[_distance@0 ASC NULLS LAST, _rowid@1 ASC NULLS LAST], preserve_partitioning=[false], filter=[_distance@0 < 465.2244 OR _distance@0 = 465.2244 AND _rowid@1 < 30064885945], metrics=[output_rows=1000, elapsed_compute=688.843µs, row_replacements=4045]
                    ANNSubIndex: elapsed=4.907856ms, name=frame_embedding_idx, k=1000, deltas=1, metrics=[output_rows=20000, elapsed_compute=4.906173ms, index_comparisons=317537, indices_loaded=0, partitions_searched=20, parts_loaded=3]
                      ANNIvfPartition: elapsed=114.357µs, uuid=65615731-897c-4f36-8928-9b96aee2d3b9, minimum_nprobes=20, maximum_nprobes=Some(20), deltas=1, metrics=[output_rows=1, elapsed_compute=115.529µs, deltas_searched=1, index_comparisons=0, indices_loaded=0, partitions_ranked=488, parts_loaded=0]

FTS Search Plan:
AnalyzeExec verbose=true, elapsed=79.891015ms, metrics=[]
  TracedExec, elapsed=79.891015ms, metrics=[]
    ProjectionExec: elapsed=79.891015ms, expr=[id@3 as id, document_id@2 as document_id, author@4 as author, frame@5 as frame, frame_embedding@6 as frame_embedding, _score@1 as _score], metrics=[output_rows=1000, elapsed_compute=1.673µs]
      RemoteTake: elapsed=79.890845ms, columns="_rowid, _score, document_id, id, author, frame, frame_embedding", metrics=[output_rows=1000, elapsed_compute=14.171029ms, output_batches=1, remote_takes=1000]
        CoalesceBatchesExec: elapsed=37.605791ms, target_batch_size=1024, metrics=[output_rows=1000, elapsed_compute=862ns]
          GlobalLimitExec: elapsed=37.603026ms, skip=0, fetch=1000, metrics=[output_rows=1000, elapsed_compute=3.617µs]
            FilterExec: elapsed=37.601313ms, document_id@2 != 12345 AND document_id@2 != 67890, metrics=[output_rows=1000, elapsed_compute=10.64µs]
              RemoteTake: elapsed=37.599398ms, columns="_rowid, _score, document_id", metrics=[output_rows=1000, elapsed_compute=1.592838ms, output_batches=1, remote_takes=1000]
                CoalesceBatchesExec: elapsed=30.676205ms, target_batch_size=1024, metrics=[output_rows=1000, elapsed_compute=4.688µs]
                  MatchQuery: elapsed=30.65777ms, column=frame, query=history science technology, metrics=[output_rows=1000, elapsed_compute=30.648001ms, index_comparisons=64114, indices_loaded=0, parts_loaded=0]

for child in plan.children() {
    // Fold each child's stats into the parent's cumulative totals.
    cumulative_cpu += self.calculate_cumulative_cpu(child);
    let child_metrics = self.calculate_metrics(child);
    min_start = Self::min_option(min_start, child_metrics.min_start);
A reviewer asked:
Is there a real possibility that the child start time could be before the parent?

wkalt replied:

I wouldn't generally expect it.
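
The Self::min_option helper isn't shown in the hunk; a plausible reconstruction (an assumption, not the actual exec.rs code) is an Option-aware minimum that treats None as "no timestamp observed yet", which also keeps the fold well-defined even if a child's recorded start did somehow precede the parent's:

use std::time::Instant;

// Hypothetical reconstruction of the helper referenced in the hunk
// above; the real implementation may differ. None means "no timestamp
// observed yet" rather than acting as the minimum.
fn min_option(a: Option<Instant>, b: Option<Instant>) -> Option<Instant> {
    match (a, b) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}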


cmccabe commented Dec 17, 2025

Cumulative elapsed time does seem important to report. I guess cumulative CPU may not be as important, since the mental workflow is more like "X took a lot of time" -> "why is that?" -> "there was a lot of CPU on step Y", which cumulative CPU doesn't help that much with.


wkalt commented Dec 17, 2025

@cmccabe yeah, that's my thought too. I think usually the reason you're looking at one of these things is you experienced a slow wall time, so the wall time will generally lead you to the node you want to focus on. At that point the elapsed CPU (and potentially cumulative elapsed of the subplan) is useful, but I didn't want to clutter the whole tree with 2 cumulative timers.

@wjones127 wjones127 merged commit 574990c into lance-format:main Dec 17, 2025
32 checks passed