
topsql/reporter: add TopRU RU window aggregation and reporting pipeline#67089

Merged
ti-chi-bot[bot] merged 10 commits into master from codex/topru-pr3-rerun
Mar 20, 2026

Conversation

@zimulala
Contributor

@zimulala zimulala commented Mar 17, 2026

What problem does this PR solve?

Issue Number: close #67065

Problem Summary:

Add the reporter-side TopRU aggregation/output path while keeping the PR2/PR3 responsibilities split and independently reviewable.

What changed and how does it work?

  • Add RU aggregation datamodel and window aggregator:
    • ru_datamodel.go, ru_window_aggregator.go
  • Extend reporter pipeline to collect and emit RURecords on report tick:
    • reporter.go
  • Wire TopRU collector registration lifecycle in TopSQL setup/close:
    • topsql.go
  • Add reporter-side TopRU tests/cases:
    • ru_datamodel_test.go, ru_window_aggregator_test.go, topru_case_runner_test.go, topru_generated_cases_test.go, reporter_test.go
  • Split-only BUILD adjustment:
    • remove topru_structured_test.go from reporter test srcs (file no longer exists in current source branch).

Dependency note:
This PR depends on PR2 (RU delta collection in stmtstats/executor). RU collection is implemented in PR2; this PR handles only reporter-side aggregation/output.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • New Features

    • TopSQL accepts RU (resource-usage) increments, aggregates them in 15s buckets, and emits aligned 60s Top‑N RU summaries (per-user/SQL/plan). Ingestion is non‑blocking with drop metrics for backpressure; RU is included alongside existing report payloads.
  • Chores

    • RU collector is auto-registered/unregistered during TopSQL setup/teardown.
  • Tests

    • Extensive unit tests and benchmarks for RU data model, windowed aggregation, Top‑N eviction/“others” behavior, concurrency, backpressure and reporting.
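The bucket/window alignment described above (15s buckets rolled into aligned 60s reporting windows) comes down to simple floor arithmetic. A minimal Go sketch; the constant and function names are hypothetical, not the PR's actual identifiers:

```go
package main

import "fmt"

const (
	bucketSeconds = 15 // hypothetical constant: the PR aggregates in 15s buckets
	windowSeconds = 60 // hypothetical constant: 60s aligned reporting windows
)

// alignToBucket floors a unix timestamp to its 15s bucket start.
func alignToBucket(ts int64) int64 { return ts - ts%bucketSeconds }

// alignToWindow floors a unix timestamp to its 60s window start.
func alignToWindow(ts int64) int64 { return ts - ts%windowSeconds }

func main() {
	// A batch collected at t=59 lands in bucket 45 of window 0;
	// one collected at t=61 lands in bucket 60 of window 60.
	fmt.Println(alignToBucket(59), alignToWindow(59)) // 45 0
	fmt.Println(alignToBucket(61), alignToWindow(61)) // 60 60
}
```

This alignment is what later makes "late data" well-defined: any batch whose bucket start precedes the last reported window boundary arrives too late for its window.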

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 17, 2026
@pantheon-ai

pantheon-ai bot commented Mar 17, 2026

Review Complete

Findings: 0 issues
Posted: 0
Duplicates/Skipped: 0


@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 17, 2026
@tiprow

tiprow bot commented Mar 17, 2026

Hi @zimulala. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds TopRU collection and reporting: in-memory RU data model, a 15s‑bucket sliding RU window aggregator with 60s reporting windows, reporter wiring (non‑blocking RU queue, worker, API), RU included in ReportData, many unit tests/benchmarks, and BUILD/test dependency updates.

Changes

Cohort / File(s) Summary
RU Data Model
pkg/util/topsql/reporter/ru_datamodel.go, pkg/util/topsql/reporter/ru_datamodel_test.go
New bounded multi-level Top‑N RU model: ruItem/ruRecord, per-user & global collectors, two‑tier “others” eviction, merge/compact/top‑N logic, proto export, and comprehensive unit tests.
RU Window Aggregator
pkg/util/topsql/reporter/ru_window_aggregator.go, pkg/util/topsql/reporter/ru_window_aggregator_test.go
New ruWindowAggregator implementing 15s buckets and 60s window reporting, bucket rotation/compaction, late-data handling, top‑N enforcement, plus extensive concurrency and benchmark tests.
Reporter Integration
pkg/util/topsql/reporter/reporter.go, pkg/util/topsql/reporter/reporter_test.go
Adds CollectRUIncrements API, non‑blocking RU channel & worker, wires ruAggregator into reporting pipeline, propagates RURecords into ReportData, and adds backpressure/drop metrics and related tests.
TopRU Test Scenarios
pkg/util/topsql/reporter/topru_case_runner_test.go, pkg/util/topsql/reporter/topru_generated_cases_test.go
Deterministic test harness and data‑driven cases validating RU aggregation semantics, metadata emission, key aggregation, edge cases, and payload assertions.
Build & Setup
pkg/util/topsql/reporter/BUILD.bazel, pkg/util/topsql/topsql.go
BUILD/test deps updated for new files and Prometheus proto; topsql setup/close optionally register/unregister RUCollector.
Minor
pkg/util/topsql/stmtstats/aggregator_bench_test.go
One-line benchmark comment clarification.
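The "others" eviction mentioned for the RU data model keeps output bounded: entries beyond the top-N are not dropped but folded into a single catch-all row, so totals stay accurate. A toy Go sketch of that idea, with illustrative names and types rather than the PR's actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// topNWithOthers keeps the n highest-RU entries and folds the rest into
// a single "others" entry. This is a sketch of the eviction idea, not
// the two-tier per-user/global structure the PR actually implements.
func topNWithOthers(ru map[string]uint64, n int) map[string]uint64 {
	type kv struct {
		key string
		ru  uint64
	}
	all := make([]kv, 0, len(ru))
	for k, v := range ru {
		all = append(all, kv{k, v})
	}
	sort.Slice(all, func(i, j int) bool { return all[i].ru > all[j].ru })
	out := make(map[string]uint64, n+1)
	for i, e := range all {
		if i < n {
			out[e.key] = e.ru
		} else {
			out["others"] += e.ru // evicted tail is aggregated, not dropped
		}
	}
	return out
}

func main() {
	got := topNWithOthers(map[string]uint64{"a": 50, "b": 30, "c": 15, "d": 5}, 2)
	fmt.Println(got["a"], got["b"], got["others"]) // 50 30 20
}
```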

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Reporter as RemoteTopSQLReporter
    participant Worker as collectRUWorker
    participant Aggregator as ruWindowAggregator
    participant Sink as DataSink

    Client->>Reporter: CollectRUIncrements(data)
    Reporter->>Reporter: enqueue to collectRUIncrementsChan (drop metric if full)

    activate Worker
    Worker->>Reporter: dequeue increments
    Worker->>Aggregator: addBatchToBucket(ruIncrements)
    deactivate Worker

    Reporter->>Aggregator: takeReportRecords(nowTs, itemInterval)
    Aggregator->>Aggregator: align windows, merge 15s buckets, apply top‑N caps
    Aggregator-->>Reporter: []TopRURecord

    Reporter->>Reporter: attach RURecords to collectedData
    Reporter->>Sink: send ReportData (SQLMeta + RURecords)
    Sink-->>Sink: consume report
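The "drop metric if full" step in the diagram is the classic non-blocking send: a select with a default branch, so a slow consumer costs a counter increment instead of caller latency. A hedged Go sketch; the channel size, type, and names are illustrative, not the PR's actual values:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ruBatch stands in for the real RU increment payload type.
type ruBatch map[string]uint64

var (
	ruChan    = make(chan ruBatch, 2) // illustrative capacity
	dropCount atomic.Int64           // stands in for the drop metric
)

// collectRUIncrements enqueues without blocking the caller: if the channel
// is full, the batch is dropped and the drop counter is incremented instead.
func collectRUIncrements(b ruBatch) {
	select {
	case ruChan <- b:
	default:
		dropCount.Add(1)
	}
}

func main() {
	for i := 0; i < 5; i++ {
		collectRUIncrements(ruBatch{"sql": uint64(i)})
	}
	fmt.Println(len(ruChan), dropCount.Load()) // 2 3
}
```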

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

component/statistics

Suggested reviewers

  • nolouch
  • yibin87
  • XuHuaiyu

Poem

🐇 In fifteen‑second buckets I hop and take note,
I nibble hot keys while the long tails float.
Sixty seconds later I bundle and sing—
TopRU and TopSQL, neat numbers I bring.
A carrot for code, a tiny hopping ping!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The PR title clearly identifies the main change: adding TopRU RU window aggregation and reporting pipeline to the reporter package.
  • Description check: ✅ Passed. The description addresses the problem statement, detailed changes, test coverage, and dependency notes, though the release note section could be more explicit.
  • Linked Issues check: ✅ Passed. The PR closes issue #67065 and implements TopRU aggregation and reporting as required, with code changes matching the stated objectives.
  • Out of Scope Changes check: ✅ Passed. All changes are focused on TopRU reporter-side aggregation and reporting; only one minor comment clarification in topsql.go is tangential but related to TopRU setup.



@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed do-not-merge/needs-triage-completed labels Mar 17, 2026
@zimulala
Contributor Author

/retest


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
pkg/util/topsql/topsql.go (1)

65-67: Capture the registered RU collector instance to keep lifecycle symmetric.

Line 65 registers the collector from the current globalTopProfilingReport, while Line 86 unregisters from whatever instance globalTopProfilingReport points to at close time. If tests replace the global via SetupTopProfilingForTest between setup/close, the originally registered collector can remain registered.

♻️ Proposed refactor
 var (
 	globalTopProfilingReport reporter.TopSQLReporter
 	singleTargetDataSink     *reporter.SingleTargetDataSink
+	registeredRUCollector    stmtstats.RUCollector
 )
@@
 	stmtstats.RegisterCollector(globalTopProfilingReport)
 	if ruCollector, ok := globalTopProfilingReport.(stmtstats.RUCollector); ok {
 		stmtstats.RegisterRUCollector(ruCollector)
+		registeredRUCollector = ruCollector
 	}
 	stmtstats.SetupAggregator()
 }
@@
-	if ruCollector, ok := globalTopProfilingReport.(stmtstats.RUCollector); ok {
-		stmtstats.UnregisterRUCollector(ruCollector)
+	if registeredRUCollector != nil {
+		stmtstats.UnregisterRUCollector(registeredRUCollector)
+		registeredRUCollector = nil
 	}

Also applies to: 86-88

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/topsql/topsql.go` around lines 65 - 67, When registering the RU
collector from the current globalTopProfilingReport, capture and store the
specific collector instance (the ruCollector returned in the registration block)
so you can unregister that exact instance later; change the registration site
that calls stmtstats.RegisterRUCollector(ruCollector) to keep a
module-level/local field (e.g., savedRUCollector) and then update the
teardown/close code that currently calls stmtstats.UnregisterRUCollector(...) to
use savedRUCollector instead of re-reading globalTopProfilingReport; make the
same change for the TX collector path so SetupTopProfilingForTest replacements
don’t leave the original collector registered.
pkg/util/topsql/reporter/ru_datamodel.go (1)

123-142: Consider the O(n) timestamp lookup in high-throughput scenarios.

The add method uses a linear scan to find matching timestamps. For the current 15s-bucket design (max ~4 timestamps per 60s window), this is fine. If the bucket granularity ever changes to support many more timestamps per record, consider using a map-based lookup.
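For reference, the map-based lookup suggested here could look like the following sketch. The itemsMap field and the method shape are hypothetical additions, not code from the PR; only the ruItem/ruRecord names come from the review comment:

```go
package main

import "fmt"

// ruItem and ruRecord mirror the names used in the review comment.
type ruItem struct {
	timestamp uint64
	ru        uint64
}

type ruRecord struct {
	items    []*ruItem
	itemsMap map[uint64]*ruItem // hypothetical: timestamp -> item, O(1) lookup
	totalRU  uint64
}

// add merges an RU increment into the item for ts, creating it if absent.
// Both the slice and the map must stay consistent on every mutation.
func (r *ruRecord) add(ts, ru uint64) {
	if r.itemsMap == nil {
		r.itemsMap = make(map[uint64]*ruItem)
	}
	if it, ok := r.itemsMap[ts]; ok {
		it.ru += ru
	} else {
		it := &ruItem{timestamp: ts, ru: ru}
		r.items = append(r.items, it)
		r.itemsMap[ts] = it
	}
	r.totalRU += ru
}

func main() {
	var r ruRecord
	r.add(15, 10)
	r.add(15, 5) // merged into the existing t=15 item
	r.add(30, 1)
	fmt.Println(len(r.items), r.totalRU) // 2 16
}
```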

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/topsql/reporter/ru_datamodel.go` around lines 123 - 142, The add
method on ruRecord currently does an O(n) scan over r.items to match timestamp;
change ruRecord to maintain a map (e.g., itemsMap map[uint64]*ruItem) alongside
the items slice and update add to first look up the ruItem in itemsMap by
timestamp and update it (and r.totalRU), otherwise create a new ruItem, append
it to r.items and insert it into itemsMap; ensure any other methods that modify
r.items (removal, reset, serialization) also update itemsMap accordingly so both
structures stay consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/util/topsql/reporter/ru_datamodel.go`:
- Around line 123-142: The add method on ruRecord currently does an O(n) scan
over r.items to match timestamp; change ruRecord to maintain a map (e.g.,
itemsMap map[uint64]*ruItem) alongside the items slice and update add to first
look up the ruItem in itemsMap by timestamp and update it (and r.totalRU),
otherwise create a new ruItem, append it to r.items and insert it into itemsMap;
ensure any other methods that modify r.items (removal, reset, serialization)
also update itemsMap accordingly so both structures stay consistent.

In `@pkg/util/topsql/topsql.go`:
- Around line 65-67: When registering the RU collector from the current
globalTopProfilingReport, capture and store the specific collector instance (the
ruCollector returned in the registration block) so you can unregister that exact
instance later; change the registration site that calls
stmtstats.RegisterRUCollector(ruCollector) to keep a module-level/local field
(e.g., savedRUCollector) and then update the teardown/close code that currently
calls stmtstats.UnregisterRUCollector(...) to use savedRUCollector instead of
re-reading globalTopProfilingReport; make the same change for the TX collector
path so SetupTopProfilingForTest replacements don’t leave the original collector
registered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 56de3fb6-e361-4709-8ae5-d05ac16af549

📥 Commits

Reviewing files that changed from the base of the PR and between 16df0d1 and 794fb69.

📒 Files selected for processing (11)
  • pkg/util/topsql/reporter/BUILD.bazel
  • pkg/util/topsql/reporter/reporter.go
  • pkg/util/topsql/reporter/reporter_test.go
  • pkg/util/topsql/reporter/ru_datamodel.go
  • pkg/util/topsql/reporter/ru_datamodel_test.go
  • pkg/util/topsql/reporter/ru_window_aggregator.go
  • pkg/util/topsql/reporter/ru_window_aggregator_test.go
  • pkg/util/topsql/reporter/topru_case_runner_test.go
  • pkg/util/topsql/reporter/topru_generated_cases_test.go
  • pkg/util/topsql/stmtstats/aggregator_bench_test.go
  • pkg/util/topsql/topsql.go

Contributor

@XuHuaiyu XuHuaiyu left a comment


PR review: TopRU aggregation and reporting pipeline. Two findings below.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/util/topsql/reporter/ru_window_aggregator.go (1)

178-199: ⚠️ Potential issue | 🟠 Major

Apply the final 100x100 cap after merging sub-intervals.

intervalCompacted limits each 15s/30s slice, but mergedOutput.mergeFrom(...) can still accumulate up to 4× the configured user/SQL count across the 60s window. Returning mergedOutput.toTopRURecords(...) directly therefore breaks the file's own 100x100 contract for itemInterval=15 and 30.

♻️ Proposed fix
-	// Convert to proto at output.
-	return mergedOutput.toTopRURecords(keyspaceName)
+	finalOutput := mergedOutput.compactWithLimits(ruReportTopNUsers, ruReportTopNSQLsPerUser)
+	if finalOutput == nil {
+		return nil
+	}
+	return finalOutput.toTopRURecords(keyspaceName)

Please also add a regression for the 15s/30s over-cap cases, not just the 60s path.
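To see why a final cap is needed: four per-bucket top-N sets can merge into up to 4x N distinct keys. A toy Go illustration of the over-cap effect and the re-applied cap; names and limits are illustrative, not the aggregator's real code:

```go
package main

import (
	"fmt"
	"sort"
)

// capTopN keeps only the n highest-valued keys; a stand-in for the
// compactWithLimits step referenced in the review comment.
func capTopN(m map[string]uint64, n int) map[string]uint64 {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return m[keys[i]] > m[keys[j]] })
	out := map[string]uint64{}
	for i := 0; i < n && i < len(keys); i++ {
		out[keys[i]] = m[keys[i]]
	}
	return out
}

func main() {
	// Four 15s buckets, each already capped to top-2, but with disjoint keys.
	buckets := []map[string]uint64{
		{"u1": 9, "u2": 8}, {"u3": 7, "u4": 6},
		{"u5": 5, "u6": 4}, {"u7": 3, "u8": 2},
	}
	merged := map[string]uint64{}
	for _, b := range buckets {
		for k, v := range b {
			merged[k] += v
		}
	}
	// The merge alone holds 8 keys; re-capping restores the top-2 contract.
	fmt.Println(len(merged), len(capTopN(merged, 2))) // 8 2
}
```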

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/topsql/reporter/ru_window_aggregator.go` around lines 178 - 199, The
mergedOutput can exceed the configured top-N caps because you only compact each
sub-interval (intervalCompacted) but never re-apply compactWithLimits after
merging those intervalCompacted results; fix by calling mergedOutput =
mergedOutput.compactWithLimits(ruReportTopNUsers, ruReportTopNSQLsPerUser) (or
equivalent in-place compaction) immediately before converting to proto with
mergedOutput.toTopRURecords(keyspaceName), and ensure the singleBucket path
still returns a capped result (intervalCompacted is fine there). Also add
regression tests that exercise itemInterval=15 and itemInterval=30 to assert the
final output respects the 100x100 caps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/util/topsql/reporter/reporter.go`:
- Around line 172-185: CollectRUIncrements currently enqueues un-timestamped RU
batches onto collectRUIncrementsChan and batches are timestamped later in
collectRUWorker, which can reorder attribution around the reporting tick; to
fix, change the channel payload to include the producer timestamp (e.g. wrap
stmtstats.RUIncrementMap into a struct with a time.Time field) and have
CollectRUIncrements stamp the batch before sending, then update collectRUWorker
to drain all pending entries and process them in timestamp order at each tick
(or only process entries during the tick goroutine) so RU is attributed to the
correct window; apply the same pattern to the other enqueue points referenced
around collectRUIncrementsChan and the related enqueue/consumer code in the
collectRUWorker/reporting logic (the same timestamped-envelope +
drain-before-report approach).

---

Duplicate comments:
In `@pkg/util/topsql/reporter/ru_window_aggregator.go`:
- Around line 178-199: The mergedOutput can exceed the configured top-N caps
because you only compact each sub-interval (intervalCompacted) but never
re-apply compactWithLimits after merging those intervalCompacted results; fix by
calling mergedOutput = mergedOutput.compactWithLimits(ruReportTopNUsers,
ruReportTopNSQLsPerUser) (or equivalent in-place compaction) immediately before
converting to proto with mergedOutput.toTopRURecords(keyspaceName), and ensure
the singleBucket path still returns a capped result (intervalCompacted is fine
there). Also add regression tests that exercise itemInterval=15 and
itemInterval=30 to assert the final output respects the 100x100 caps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 645ec5ff-bab3-4bda-a805-13ac7ef3fe47

📥 Commits

Reviewing files that changed from the base of the PR and between 794fb69 and fc3bcfe.

📒 Files selected for processing (3)
  • pkg/util/topsql/reporter/reporter.go
  • pkg/util/topsql/reporter/ru_datamodel.go
  • pkg/util/topsql/reporter/ru_window_aggregator.go

@pingcap pingcap deleted a comment from tiprow bot Mar 17, 2026
@zimulala zimulala force-pushed the codex/topru-pr3-rerun branch from a6d1e3c to e760e1e on March 17, 2026 at 11:12

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
pkg/util/topsql/reporter/ru_datamodel.go (1)

123-142: Consider: Linear scan in ruRecord.add may become a bottleneck.

The add method performs a linear scan over items to find an existing timestamp. For typical 15s buckets with 1-4 timestamps this is acceptable, but if items grow larger (e.g., during merges), this could become O(n²).

A map-based lookup could improve performance if profiling shows this as a hotspot.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/topsql/reporter/ru_datamodel.go` around lines 123 - 142, The add
method on ruRecord currently does a linear scan over r.items to find matching
timestamp which can degrade to O(n²); replace this with a map-based index (e.g.,
add a field on ruRecord like itemsIndex map[uint64]int) and change ruRecord.add
to look up timestamp in itemsIndex, update the existing ruItem by index when
present, or append a new ruItem and record its index in itemsIndex; ensure you
update itemsIndex whenever you append, merge, or reorder items and keep
r.totalRU changes identical to the current logic in add.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/util/topsql/reporter/ru_datamodel.go`:
- Around line 123-142: The add method on ruRecord currently does a linear scan
over r.items to find matching timestamp which can degrade to O(n²); replace this
with a map-based index (e.g., add a field on ruRecord like itemsIndex
map[uint64]int) and change ruRecord.add to look up timestamp in itemsIndex,
update the existing ruItem by index when present, or append a new ruItem and
record its index in itemsIndex; ensure you update itemsIndex whenever you
append, merge, or reorder items and keep r.totalRU changes identical to the
current logic in add.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0ff2db95-dc25-4843-87f7-d766f4c1c89e

📥 Commits

Reviewing files that changed from the base of the PR and between 71313c1 and e760e1e.

📒 Files selected for processing (5)
  • pkg/util/topsql/reporter/BUILD.bazel
  • pkg/util/topsql/reporter/reporter.go
  • pkg/util/topsql/reporter/ru_datamodel.go
  • pkg/util/topsql/reporter/ru_window_aggregator.go
  • pkg/util/topsql/reporter/ru_window_aggregator_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/util/topsql/reporter/ru_window_aggregator_test.go

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 85.73798% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.3741%. Comparing base (83d3794) to head (4660412).
⚠️ Report is 44 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67089        +/-   ##
================================================
- Coverage   77.7079%   77.3741%   -0.3339%     
================================================
  Files          2013       1940        -73     
  Lines        551161     549623      -1538     
================================================
- Hits         428296     425266      -3030     
- Misses       121134     124304      +3170     
+ Partials       1731         53      -1678     
Flags:
  • integration: 41.0202% <14.7208%> (-7.1352%) ⬇️
  • unit: 76.6910% <84.7429%> (+0.4890%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components:
  • dumpling: 61.5065% <ø> (+4.7091%) ⬆️
  • parser: ∅ <ø> (∅)
  • br: 47.2781% <ø> (-13.6005%) ⬇️

Contributor

@yibin87 yibin87 left a comment


@zimulala
Contributor Author

Re: ru_datamodel.go `ruRecord.add` linear scan (CodeRabbit nitpick): For the current 15s-bucket design we have at most a few timestamps per record per 60s window, so the linear scan is intentional and acceptable. If we change granularity or profiling shows this as a hotspot, we can add a map-based index (e.g. `itemsMap map[uint64]*ruItem`) in a follow-up.

Contributor

@yibin87 yibin87 left a comment


LGTM

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 18, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/util/topsql/reporter/reporter_test.go (1)

1211-1236: Benchmarks should avoid unbounded sink accumulation.

initializeCache registers a sink that retains every ReportData; in benchmark loops this can dominate memory/time and blur reporter-path measurements. Prefer a no-op or bounded/drained sink for benchmark scenarios.
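A no-op sink as suggested might look like the following sketch. reportData and dataSink here are stand-ins for the reporter's real types, which are not reproduced; the point is only that a benchmark sink should discard payloads rather than retain them:

```go
package main

import "fmt"

// reportData stands in for the reporter's ReportData payload.
type reportData struct{ records int }

// dataSink stands in for the reporter's sink interface.
type dataSink interface{ trySend(d reportData) error }

// noopSink counts sends but retains nothing, so benchmark memory stays flat.
type noopSink struct{ sent int }

func (s *noopSink) trySend(reportData) error { s.sent++; return nil }

func main() {
	var s noopSink
	var sink dataSink = &s
	for i := 0; i < 1000; i++ {
		_ = sink.trySend(reportData{records: i}) // nothing is accumulated
	}
	fmt.Println(s.sent) // 1000
}
```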

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/topsql/reporter/reporter_test.go` around lines 1211 - 1236, The
benchmark currently uses initializeCache which registers a sink that retains
every ReportData and causes unbounded accumulation; change the benchmark to
register a no-op/draining sink instead of the retaining sink (or modify
initializeCache to accept a sink parameter) so ReportData is discarded promptly.
Concretely, for the BenchmarkReporterScenarios setup (where tsr is created for
TopSQLOnly, TopRUOnly, and TopSQLAndTopRU), create and use a sink that simply
returns nil or drains reports (does not append to a slice) and pass it into
initializeCache or call tsr.RegisterSink with that no-op sink before the loops
so populateCache, populateCacheWithRU and tsr.doReport do not retain ReportData.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/util/topsql/reporter/reporter_test.go`:
- Around line 247-285: Record the original TopSQL enabled state at the start
(e.g. origTopSQLEnabled := topsqlstate.TopSQLEnabled()) before calling
topsqlstate.DisableTopSQL(), and in the existing t.Cleanup restore it by calling
topsqlstate.EnableTopSQL() or topsqlstate.DisableTopSQL() based on
origTopSQLEnabled; use the same
TestEffectiveReportIntervalSecondsTopSQLIndependentFromTopRU test and
topsqlstate.EnableTopSQL/DisableTopSQL/TopSQLEnabled symbols so the global
TopSQL flag is returned to its prior state to avoid cross-test leakage.

In `@pkg/util/topsql/reporter/ru_window_aggregator_test.go`:
- Around line 69-80: The helper fillAggregatorSteadyState60sAt10kKeys currently
sets numUsers=200 and numSQLsPerUser=200 producing 40,000 keys; change the
constants so numUsers * numSQLsPerUser == 10_000 (e.g., numUsers=100 and
numSQLsPerUser=100) and rebuild the batch via makeRUBatch(numUsers,
numSQLsPerUser) so the function matches its name/comment and the benchmark
targets 10k keys; update only the constants in
fillAggregatorSteadyState60sAt10kKeys and any related comment text if needed.

---

Nitpick comments:
In `@pkg/util/topsql/reporter/reporter_test.go`:
- Around line 1211-1236: The benchmark currently uses initializeCache which
registers a sink that retains every ReportData and causes unbounded
accumulation; change the benchmark to register a no-op/draining sink instead of
the retaining sink (or modify initializeCache to accept a sink parameter) so
ReportData is discarded promptly. Concretely, for the BenchmarkReporterScenarios
setup (where tsr is created for TopSQLOnly, TopRUOnly, and TopSQLAndTopRU),
create and use a sink that simply returns nil or drains reports (does not append
to a slice) and pass it into initializeCache or call tsr.RegisterSink with that
no-op sink before the loops so populateCache, populateCacheWithRU and
tsr.doReport do not retain ReportData.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 551cab77-6549-4904-be6b-280a0a58ebc3

📥 Commits

Reviewing files that changed from the base of the PR and between e760e1e and 7030ca7.

📒 Files selected for processing (5)
  • pkg/util/topsql/reporter/reporter_test.go
  • pkg/util/topsql/reporter/ru_datamodel_test.go
  • pkg/util/topsql/reporter/ru_window_aggregator.go
  • pkg/util/topsql/reporter/ru_window_aggregator_test.go
  • pkg/util/topsql/reporter/topru_generated_cases_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/util/topsql/reporter/ru_window_aggregator.go

Member

@nolouch nolouch left a comment


lgtm

if len(batch.data) == 0 {
	continue
}
tsr.ruAggregator.addBatchToBucket(batch.timestamp, batch.data)
Contributor


High priority

Moving the timestamp to enqueue time is necessary, but this still leaves a cross-tick race because report flushing and RU draining are on separate goroutines.

Impact

  • Suppose a batch is enqueued at t = 59, but collectRUWorker does not consume it until after takeReportRecords(60) has advanced lastReportedEndTs to 60.
  • This line will then call addBatchToBucket(59, ...), which aligns to bucket 45.
  • ruWindowAggregator.addBatchToBucket(...) will treat that as late data (45 < 60) and drop it entirely.

So a batch produced before the report tick can still disappear from the closed window; this is not only a best-effort shift to the next window.

Test gap
TestTopRUBestEffortBoundaryShift covers a batch collected at t = 61 (already after the tick), but it does not cover the remaining problematic case: collected before the tick, drained after the tick.

Suggested direction
Before closing/reporting a window, drain pending RU batches into the aggregator, or serialize RU ingestion and report flushing on the same goroutine/event loop.
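The "drain pending RU batches before closing a window" direction can be sketched as a loop that empties the ingest queue before the flush proceeds, so a batch enqueued at t=59 can no longer race past the t=60 window close. All names here are illustrative, not the reporter's actual code:

```go
package main

import "fmt"

// batch stands in for a timestamped RU increment batch.
type batch struct{ ts int64 }

// drainThenReport empties the ingest channel before the window at reportTs
// is allowed to close; the aggregator call is elided as a comment.
func drainThenReport(in chan batch, reportTs int64) (absorbed int) {
	for {
		select {
		case b := <-in:
			_ = b // addBatchToBucket(b.ts, ...) would run here
			absorbed++
		default:
			// Queue is empty: only now is it safe to close the window
			// and take report records at reportTs.
			return absorbed
		}
	}
}

func main() {
	in := make(chan batch, 8)
	in <- batch{ts: 59} // enqueued before the t=60 tick
	in <- batch{ts: 59}
	fmt.Println(drainThenReport(in, 60)) // 2
}
```

Running ingestion and flushing on the same goroutine achieves the same serialization without the explicit drain.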

@zimulala zimulala force-pushed the codex/topru-pr3-rerun branch from 89b8852 to 0374ede on March 20, 2026 at 10:12
Contributor

@XuHuaiyu XuHuaiyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed the latest update. Serializing RU batch ingestion onto collectWorker and shifting late batches to the earliest still-open window addresses my previous concern about pre-tick batches being dropped by the report/drain race. The updated aggregator/reporter tests also look good from my side.

@ti-chi-bot

ti-chi-bot bot commented Mar 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XuHuaiyu, yibin87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 20, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 20, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-18 06:33:11.674611022 +0000 UTC m=+338718.762268559: ☑️ agreed by yibin87.
  • 2026-03-20 11:45:36.906661678 +0000 UTC m=+530263.994319215: ☑️ agreed by XuHuaiyu.

@zimulala
Contributor Author

/retest

@zimulala
Contributor Author

/retest

@zimulala
Contributor Author

/retest

@zimulala
Contributor Author

/retest

@zimulala
Contributor Author

/retest

@zimulala
Contributor Author

/retest


Labels

approved · lgtm · ok-to-test (Indicates a PR is ready to be tested.) · release-note-none (Denotes a PR that doesn't merit a release note.) · size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TopRU aggregation + reporting

5 participants