resourcegroup, session, metrics: cherry-pick runaway watch improvements for v8.5.5 by JmPotato · Pull Request #67352 · pingcap/tidb

JmPotato · 2026-03-27T05:02:53Z

What problem does this PR solve?

Issue Number: ref #65747, ref #65828, ref #66155, ref #65834, ref #66171, ref #66182

Problem Summary:

Cherry-pick a series of runaway watch improvements from master to the v8.5.5 release branch. These PRs collectively improve the performance, observability, and correctness of the runaway query watch mechanism.

What changed and how does it work?

Cherry-picked PRs (in order):

refactor(runaway): use info cache to check runaway watch tables #65747
feat(session): add schema upgrade v254 to optimize runaway watch index #65828 (adapted as version224 on release branch, with idx_start_time and idx_done_time indexes)
feat(runaway): paginate watch sync using id checkpoints #66155
feat(runaway): refactor record flushing with batch flusher abstraction #65834
refactor(runaway): make active watch counting lock-free #66171
feat(metrics/runaway): add syncer observability #66182

Adaptation fixes for the release branch:

metricscommon.New*Vec → local New*Vec (metricscommon package does not exist on this branch)
ast.NewCIStr/ast.CIStr → model.NewCIStr/model.CIStr (API location differs on this branch)
Schema upgrade version adapted from version254 (master) to version224 (release branch reserved range)
Grafana JSON regenerated from jsonnet source to resolve panel ID conflicts

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Summary by CodeRabbit

New Features
- Improved runaway handling with batched flushers and enhanced observability (9 new Grafana panels + new Prometheus metrics for flusher/syncer).
Chores
- Added secondary indexes to runaway watch tables and bumped bootstrap version to 224 for schema upgrade.
Tests
- Added/updated extensive tests for runaway manager, flusher, syncer, record handling, and the ver224 bootstrap upgrade.

coderabbitai · 2026-03-27T05:03:18Z

📝 Walkthrough

Walkthrough

Adds runaway flusher/syncer instrumentation and Grafana panels, implements a generic in-memory batch flusher with tests, refactors Manager to use atomic per-group counters and batch flushers, changes syncer to ID-based checkpoints, and bumps bootstrap version to 224 with index upgrades and related tests.

Changes

Cohort / File(s)	Summary
Metrics infra `pkg/metrics/resource_group.go`, `pkg/metrics/metrics.go`	Add multiple new Prometheus metric variables for runaway flusher/syncer and register them in `RegisterMetrics()`.
Grafana dashboard `pkg/metrics/grafana/tidb_resource_control.json`, `pkg/metrics/grafana/tidb_resource_control.jsonnet`	Insert 9 new "Runaway" panels (flusher/syncer ops, durations, intervals, batch size quantiles, checkpoint) and renumber panel IDs.
Batch flusher implementation & tests `pkg/resourcegroup/runaway/flusher.go`, `pkg/resourcegroup/runaway/flusher_test.go`	Add unexported generic `batchFlusher` with ticker/threshold-based flushing, merge/flush callbacks, SQL exec via `ExecRCRestrictedSQL`, metrics emission, and unit tests covering add/merge/flush behaviors.
Manager refactor & flush loop `pkg/resourcegroup/runaway/manager.go`, `pkg/resourcegroup/tests/resource_group_test.go`	Replace `ActiveGroup`+lock with `sync.Map` of `*atomic.Int64`, wire TTL cache callbacks to update atomic counters, replace single-timer flusher with three `batchFlusher` instances, add `doSync()` and instrument sync metrics; adjust tests to exercise new flush behavior and failpoints.
Runaway record SQL & tests `pkg/resourcegroup/runaway/record.go`, `pkg/resourcegroup/runaway/record_test.go`	Remove `recordKey.Hash()` and fnv dependency; add batch INSERT/DELETE SQL generators using table name helpers; update tests to rely on map key semantics instead of explicit hash checks.
Syncer changes `pkg/resourcegroup/runaway/syncer.go`	Add syncer metrics and `InfoCache` dependency, switch table-existence checks to infoschema cache, change checkpoint type from time to int64, push checkpoint as last fetched row ID, and add LIMIT+bind to selects.
Checker & tests `pkg/resourcegroup/runaway/checker.go`, `pkg/resourcegroup/runaway/checker_test.go`	Remove an `ActiveLock` RLock usage in `DeriveChecker`; add extensive unit tests for active counters, threshold logic, action selection, and concurrent marking.
Build/test config `pkg/resourcegroup/runaway/BUILD.bazel`, `pkg/resourcegroup/tests/BUILD.bazel`	Add new source/test files to Bazel targets, increase test shard counts, and add dependencies (metrics/proto/testify).
Bootstrap schema & tests `pkg/session/bootstrap.go`, `pkg/session/bootstrap_test.go`, `br/pkg/restore/snap_client/systable_restore_test.go`	Bump bootstrap version to 224, add `idx_start_time`/`idx_done_time` indexes via `upgradeToVer224()`, add upgrade test, and update expected bootstrap version in restore test.
Minor test expectation `pkg/executor/infoschema_reader_test.go`, `br/pkg/restore/snap_client/systable_restore_test.go`	Adjust expected counts/versions in tests (`count(*)` 81→83, bootstrap version 223→224).

Sequence Diagram(s)

sequenceDiagram
    participant RM as RunawayManager
    participant BF as batchFlusher
    participant T as Ticker
    participant SQL as ExecRCRestrictedSQL
    participant MET as Prometheus

    rect rgba(200,200,255,0.5)
    Note over RM,BF: Periodic batched flushing for runaway/quarantine/stale records
    end

    loop periodic
        T->>BF: tick
        BF->>BF: check buffer / threshold
        alt flush needed
            BF->>BF: dedupe/merge records
            BF->>SQL: gen & ExecRCRestrictedSQL(batch SQL, params)
            SQL-->>BF: success / error
            BF->>MET: observe duration, batch size, counters
            BF->>RM: reset buffer, update lastFlushTime
        else no-op
            BF-->>RM: continue
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat(metrics/runaway): add syncer observability #66182 — Overlaps runaway instrumentation, flusher/syncer logic, and dashboard changes.
pkg/resourcegroup: fix unstable TestResourceGroupRunaway (#65780) #67316 — Related adjustments to resourcegroup runaway tests and flushing behavior.

Suggested labels

cherry-pick-approved, type/cherry-pick-for-release-8.5, ok-to-test

Suggested reviewers

wjhuang2016
yudongusa
yibin87
hawkingrei

Poem

🐰 Batches fill and tickers chime,
Counters hop and histograms rhyme,
Sync by id, indexes align,
Flusher hums, metrics shine —
A rabbit cheers: logs look fine!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.95% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely identifies the main change: cherry-picking runaway watch improvements for the v8.5.5 release branch. It directly summarizes the primary objective of the changeset.
Description check	✅ Passed	The description follows the template structure, includes issue references (ref `#65747`, ref `#65828`, etc.), provides a clear problem summary explaining the purpose of cherry-picking improvements, details what changed with ordered PR citations, explains adaptation fixes for the release branch, and indicates unit tests were included. All critical sections are present and substantive.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (2)

pkg/session/bootstrap_test.go (1)
2612-2619: Close the recordset in getTableCreateSQLFn to avoid test-side resource leakage.

Each show create table call should close res after reading.
♻️ Proposed fix
 	getTableCreateSQLFn := func(se sessiontypes.Session, tableName string) string {
 		res := MustExecToRecodeSet(t, se, fmt.Sprintf("show create table mysql.%s", tableName))
+		defer func() { require.NoError(t, res.Close()) }()
 		chk := res.NewChunk(nil)
 		err = res.Next(ctx, chk)
 		require.NoError(t, err)
 		require.Equal(t, 1, chk.NumRows())
 		return string(chk.GetRow(0).GetBytes(1))
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/session/bootstrap_test.go` around lines 2612 - 2619, getTableCreateSQLFn
opens a recordset via MustExecToRecodeSet but never closes it, leaking
resources; fix by closing the recordset (res) after use—best add defer
res.Close() immediately after obtaining res in getTableCreateSQLFn (so the
recordset is always closed even on errors) while keeping the existing calls to
res.NewChunk, res.Next, and returns intact.
pkg/resourcegroup/runaway/record.go (1)
140-158: Factor the watch-row argument builder once.

genInsertionStmt and genBatchInsertWatchStmt now duplicate the watch-table field order, NullTime handling, and switch-group mapping. A shared helper would keep the single-row and batch paths in lockstep and make future backports less error-prone.

As per coding guidelines "Code SHOULD remain maintainable for future readers with basic TiDB familiarity, including readers who are not experts in the specific subsystem/feature."

Also applies to: 196-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/resourcegroup/runaway/record.go` around lines 140 - 158, genInsertionStmt
and genBatchInsertWatchStmt duplicate the same watch-table field ordering,
NullTime handling, and switch-group mapping; factor that logic into a single
helper (e.g., buildWatchRowParams or watchRowValues) that accepts a
*QuarantineRecord and returns the prepared []any for a single row (performing
the NullTime -> nil conversion and calling r.getSwitchGroupName()), then update
genInsertionStmt and genBatchInsertWatchStmt to reuse that helper for building
params while keeping the existing placeholder order produced by
writeInsert/getRunawayWatchTableName; ensure the helper is used for both
single-row and batch paths so ordering and conversions remain identical.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/resourcegroup/runaway/flusher.go`:
- Around line 127-139: The current flush logic always clears f.buffer and
advances f.lastFlushTime regardless of flush outcome; change it so that after
calling f.flushFn(f.buffer) and recording metrics (f.batchSizeObserver,
f.durationObserver), you only reset f.buffer (make(map[K]V, f.threshold)) and
set f.lastFlushTime = now when err == nil (successful flush). If err != nil,
increment f.flushErrorCounter and leave f.buffer and f.lastFlushTime untouched
so buffered records are preserved for retry; keep the existing
f.flushSuccessCounter increment only on success.

In `@pkg/resourcegroup/runaway/manager.go`:
- Around line 170-177: The merge function passed to newBatchFlusher for
quarantineRecordFlusher currently keeps the first buffered QuarantineRecord per
key, which can preserve stale data; change it to keep the newest buffered record
instead—either always overwrite m[k] with v in that closure or compare an
age/ID/timestamp field on QuarantineRecord and replace m[k] when v is newer
(this logic should align with addWatchList’s rule that the latest local watch
(ID==0) is authoritative). Update the merge closure used when constructing
quarantineRecordFlusher (and any similar flusher closures) so the newer record
wins.
- Around line 194-205: The created ticker runawayRecordGCTicker is never stopped
causing a goroutine leak; after the line that constructs runawayRecordGCTicker
:= time.NewTicker(gcInterval) add a defer to stop it (e.g., defer
runawayRecordGCTicker.Stop()) so the ticker is cleaned up when the function
returns; locate the ticker creation near the existing calls to
rm.runawayRecordChan(), rm.quarantineRecordChan(), and
rm.staleQuarantineRecordChan() and insert the defer immediately after that
construction.

In `@pkg/resourcegroup/tests/resource_group_test.go`:
- Around line 550-559: After enabling the failpoint via
failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush",
...) immediately register a cleanup to always disable it (use t.Cleanup(func(){
failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush")
}) or a guarded defer) so the failpoint is turned off even if subsequent
assertions (the three tk.QueryToErr calls with resource_group hint) fail; update
the block around the failpoint.Enable call to add this cleanup and remove the
later unconditional failpoint.Disable call.

In `@pkg/session/bootstrap_test.go`:
- Line 2607: The helper getTableCreateSQLFn creates a result set via
MustExecToRecodeSet but never closes it; after the call that assigns the result
set (the variable named res from MustExecToRecodeSet) add a defer res.Close()
immediately so the result is always closed and connection handles are released;
update getTableCreateSQLFn to defer res.Close() right after MustExecToRecodeSet
returns to prevent resource leaks during tests.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e9a1dbd1-aeaa-41b8-981e-cef789e1f423

📥 Commits

Reviewing files that changed from the base of the PR and between 1fa258b and f8a21db.

📒 Files selected for processing (18)

br/pkg/restore/snap_client/systable_restore_test.go
pkg/metrics/grafana/tidb_resource_control.json
pkg/metrics/grafana/tidb_resource_control.jsonnet
pkg/metrics/metrics.go
pkg/metrics/resource_group.go
pkg/resourcegroup/runaway/BUILD.bazel
pkg/resourcegroup/runaway/checker.go
pkg/resourcegroup/runaway/checker_test.go
pkg/resourcegroup/runaway/flusher.go
pkg/resourcegroup/runaway/flusher_test.go
pkg/resourcegroup/runaway/manager.go
pkg/resourcegroup/runaway/record.go
pkg/resourcegroup/runaway/record_test.go
pkg/resourcegroup/runaway/syncer.go
pkg/resourcegroup/tests/BUILD.bazel
pkg/resourcegroup/tests/resource_group_test.go
pkg/session/bootstrap.go
pkg/session/bootstrap_test.go

💤 Files with no reviewable changes (1)

pkg/resourcegroup/runaway/record_test.go

coderabbitai · 2026-03-27T05:23:54Z

+	err := f.flushFn(f.buffer)
+	duration := time.Since(start)
+
+	f.batchSizeObserver.Observe(float64(batchSize))
+	f.durationObserver.Observe(duration.Seconds())
+	if err != nil {
+		f.flushErrorCounter.Inc()
+	} else {
+		f.flushSuccessCounter.Inc()
+	}
+
+	f.lastFlushTime = now
+	f.buffer = make(map[K]V, f.threshold)


⚠️ Potential issue | 🔴 Critical

Preserve buffered records on flush failure.

Lines 127-139 clear f.buffer even when flushFn returns an error. That turns a transient ExecRCRestrictedSQL failure into permanent loss of pending runaway records, watch inserts, or stale-watch deletes. Only advance lastFlushTime and reset the buffer after a successful flush.

🛠️ Minimal fix

 	f.batchSizeObserver.Observe(float64(batchSize))
 	f.durationObserver.Observe(duration.Seconds())
 	if err != nil {
 		f.flushErrorCounter.Inc()
-	} else {
-		f.flushSuccessCounter.Inc()
+		return
 	}
-
-	f.lastFlushTime = now
-	f.buffer = make(map[K]V, f.threshold)
+	f.flushSuccessCounter.Inc()
+	f.lastFlushTime = now
+	f.buffer = make(map[K]V, f.threshold)

coderabbitai · 2026-03-27T05:23:54Z

+	quarantineRecordFlusher := newBatchFlusher(
+		"quarantine-record",
+		watchFlushInterval,
+		batchSize,
+		func(m map[string]*QuarantineRecord, k string, v *QuarantineRecord) {
+			if _, ok := m[k]; !ok {
+				m[k] = v
+			}


⚠️ Potential issue | 🟠 Major

Keep the newest buffered quarantine record for each key.

addWatchList() treats the latest ID == 0 local watch as authoritative until the system row arrives, but this merge function keeps the first buffered value instead. Two marks for the same key within one flush interval can therefore persist stale TTL/action data and then overwrite the newer in-memory watch on the next sync.

🛠️ Minimal fix

 		func(m map[string]*QuarantineRecord, k string, v *QuarantineRecord) {
-			if _, ok := m[k]; !ok {
-				m[k] = v
-			}
+			m[k] = v
 		},

Do we need to fix it?

@rleungx — The fix is not strictly required, but it's a minor correctness improvement worth considering.

Here's the reasoning:

Why it's mostly fine as-is:

quarantineRecordFlusher targets genBatchInsertWatchStmt, which generates a plain multi-row INSERT. Keeping the first record is actually a reasonable dedup strategy to avoid attempting to insert the same watch key twice.

The local watchList (TTL cache) already keeps the newest record — in addWatchList the item.ID == 0 branch always replaces with the latest value. So query enforcement on the hot path is always using the most up-to-date settings.

On the next UpdateNewAndDoneWatch cycle, the syncer reconciles DB state back into memory anyway, so any transient inconsistency is self-correcting.

Why the suggested fix has merit:

If two runaway events for the same key (resource group + SQL pattern) fire within one 1-second flush window with different TTL or Action settings, the DB row would capture the first (potentially stale) settings rather than the latest. This is a minor observability/persistence gap, not a query-enforcement gap.

The scenario is rare in practice (same query marked twice with different settings within 1 second), so the impact is low.

Verdict: It's a low-priority, low-risk cleanup. If you prefer to keep the code simple and avoid the edge case entirely, applying m[k] = v is fine. If the current behavior is intentional (write-once per flush to avoid duplicate INSERTs), it's also acceptable. Either way, query enforcement correctness is unaffected.
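
The two merge policies discussed above can be compared side by side. This is a sketch under simplifying assumptions: `record`, `keepFirst`, `keepLatest`, and `buffer` are hypothetical names, and the key and action strings are made-up sample values.

```go
package main

import "fmt"

// record is a hypothetical, trimmed-down quarantine record.
type record struct {
	Action string
}

// mergeFn decides what happens when a key is buffered more than once.
type mergeFn func(m map[string]*record, k string, v *record)

// keepFirst mirrors the behaviour under review: later duplicates are dropped.
func keepFirst(m map[string]*record, k string, v *record) {
	if _, ok := m[k]; !ok {
		m[k] = v
	}
}

// keepLatest mirrors the suggested change: the newest duplicate wins.
func keepLatest(m map[string]*record, k string, v *record) {
	m[k] = v
}

// buffer applies merge to each (key, record) pair in arrival order.
func buffer(merge mergeFn, keys []string, recs []*record) map[string]*record {
	m := make(map[string]*record)
	for i, k := range keys {
		merge(m, k, recs[i])
	}
	return m
}

func main() {
	keys := []string{"rg1/q1", "rg1/q1"}
	recs := []*record{{Action: "COOLDOWN"}, {Action: "KILL"}}
	fmt.Println(buffer(keepFirst, keys, recs)["rg1/q1"].Action)
	fmt.Println(buffer(keepLatest, keys, recs)["rg1/q1"].Action)
}
```

Both policies leave exactly one buffered entry per key, so either one avoids inserting the same watch key twice in a batch; they differ only in which settings reach the table.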

🧠 Learnings used

Learnt from: joechenrh Repo: pingcap/tidb PR: 66878 File: tests/realtikvtest/importintotest/import_into_test.go:253-267 Timestamp: 2026-03-11T06:26:14.065Z Learning: In pingcap/tidb, PR `#66878` (cherry-pick of `#58401` to release-7.5) also bundles the follow-up nil-check fix from PR `#63146`. `TestOnUpdateColumn` in `tests/realtikvtest/importintotest/import_into_test.go` was cherry-picked from PR `#63146` and tests that `IMPORT INTO` on a table with an `ON UPDATE CURRENT_TIMESTAMP` column does NOT panic due to a nil `onDup` map in `resolveGeneratedColumns`. It is NOT intended to cover the transitive generated-column / index-inconsistency fix from `#58401`/`#58400`.

Learnt from: terry1purcell Repo: pingcap/tidb PR: 66695 File: pkg/planner/cardinality/BUILD.bazel:62-66 Timestamp: 2026-03-13T01:48:14.212Z Learning: In pingcap/tidb cherry-pick PRs, unresolved merge conflicts (e.g., conflict markers in BUILD.bazel or other files) are expected artifacts of the cherry-pick process and are already tracked by the ti-chi-bot hold mechanism. Do NOT flag merge conflicts as review issues — they are not a reason to block or comment on the code review.

Learnt from: zimulala Repo: pingcap/tidb PR: 67265 File: pkg/util/topsql/reporter/ru_datamodel_test.go:259-308 Timestamp: 2026-03-25T03:46:10.574Z Learning: In `pkg/util/topsql/reporter/ru_datamodel_test.go` (pingcap/tidb PR `#67265`), `TestRUCollectingOthersWireLabelNoCollisionWithRuntimeUserShape` intentionally uses `"app127.0.0.1"` (not `othersUserWireLabel`) as the runtime user. The regression is scoped to runtime user shapes (`userhost` / empty string); broadening to arbitrary raw user strings matching the wire label is out of contract. The empty-user and merge-path regressions (`TestRUCollectingEmptyUserAndGlobalOthersRemainDistinct`, `TestRUCollectingMergeFromKeepsEmptyUserDistinctFromGlobalOthers`) plus aggregator-side assertion hardening collectively cover the fix.

Learnt from: cryo-zd Repo: pingcap/tidb PR: 65532 File: pkg/planner/core/preprocess.go:1887-1898 Timestamp: 2026-03-12T08:41:44.509Z Learning: In pkg/planner/core/preprocess.go (pingcap/tidb PR `#65532`), the `checkLockClauseTables` function intentionally accepts `FOR UPDATE OF <base_table>` even when the FROM item was introduced with an alias (`matched.alias.L != "" && !matchedByAlias`). Rather than returning an error (strict MySQL behavior), TiDB emits a warning to guide users toward the alias form. This backward-compatibility fallback is a deliberate acceptance criterion for the change — do not flag it as a MySQL-incompatibility bug in future reviews.

Learnt from: joechenrh Repo: pingcap/tidb PR: 66878 File: pkg/planner/core/logical_plan_builder.go:6455-6472 Timestamp: 2026-03-11T06:29:00.122Z Learning: Ensure code reviews verify the VirtualAssignmentsOffset semantics: the planner sets Update.VirtualAssignmentsOffset = len(update.List). The executor should only apply OrderedList[:VirtualAssignmentsOffset] when composing new rows, and only after the 'changed' check should it group/evaluate OrderedList[VirtualAssignmentsOffset:] per table. This pattern applies to files under pkg/planner/core and pkg/executor (e.g., common_plans.go and update.go). Reviewers should check that updates respect slicing behavior, that the offset is consistently derived from the planner, and that downstream code does not bypass the offset when creating new rows. Add tests validating both branches: the slice before the offset for new rows, and the per-table handling of the slice after the offset.

coderabbitai · 2026-03-27T05:23:54Z

+	runawayRecordGCTicker := time.NewTicker(gcInterval)
 	recordCh := rm.runawayRecordChan()
 	quarantineRecordCh := rm.quarantineRecordChan()
 	staleQuarantineRecordCh := rm.staleQuarantineRecordChan()
-	flushThreshold := flushThreshold()
-	// recordMap is used to deduplicate records which will be inserted into `mysql.tidb_runaway_queries`.
-	recordMap := make(map[recordKey]*Record, flushThreshold)
-
-	flushRunawayRecords := func() {
-		if len(recordMap) == 0 {
-			return
-		}
-		sql, params := genRunawayQueriesStmt(recordMap)
-		if _, err := ExecRCRestrictedSQL(rm.sysSessionPool, sql, params); err != nil {
-			logutil.BgLogger().Error("flush runaway records failed", zap.Error(err), zap.Int("count", len(recordMap)))
-		}
-		// reset the map.
-		recordMap = make(map[recordKey]*Record, flushThreshold)
-	}

 	for {
 		select {
 		case <-rm.exit:
+			runawayRecordFlusher.stop()
+			quarantineRecordFlusher.stop()
+			staleQuarantineFlusher.stop()
 			logutil.BgLogger().Info("runaway record flush loop exit")


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
find . -name "manager.go" -path "*/resourcegroup/runaway/*" -type f

Repository: pingcap/tidb

Length of output: 94

🏁 Script executed:

# Read the file to understand the context around lines 194-205
wc -l ./pkg/resourcegroup/runaway/manager.go

Repository: pingcap/tidb

Length of output: 98

🏁 Script executed:

# Get the full function containing the ticker creation, examining a broader range
sed -n '170,230p' ./pkg/resourcegroup/runaway/manager.go

Repository: pingcap/tidb

Length of output: 2047

🏁 Script executed:

# Let's see more context to understand the full function
sed -n '150,210p' ./pkg/resourcegroup/runaway/manager.go

Repository: pingcap/tidb

Length of output: 1614

🏁 Script executed:

# Check the function signature to understand the full context
sed -n '130,210p' ./pkg/resourcegroup/runaway/manager.go

Repository: pingcap/tidb

Length of output: 2174

🏁 Script executed:

# Let's see the complete function with all its defers and exit paths
sed -n '138,230p' ./pkg/resourcegroup/runaway/manager.go

Repository: pingcap/tidb

Length of output: 2893

Stop the GC ticker when the flush loop exits.

The runawayRecordGCTicker created at line 194 is never stopped. Add defer runawayRecordGCTicker.Stop() immediately after construction to prevent the ticker's internal goroutine from leaking when the function returns.

🛠️ Minimal fix

 	runawayRecordGCTicker := time.NewTicker(gcInterval)
+	defer runawayRecordGCTicker.Stop()
 	recordCh := rm.runawayRecordChan()

coderabbitai · 2026-03-27T05:23:54Z

+	require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush", `return(1)`))
 	// check thrice to make sure the runaway query be regarded as a repeated query.
 	err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.2) from t")
 	require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
 	err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.3) from t")
 	require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
-	// using FastRunawayGC to trigger flush
-	require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/FastRunawayGC", `return(1)`))
 	err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.4) from t")
 	require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
+	// resume the flush to ensure an aggregated record can be queried later.
+	require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))


⚠️ Potential issue | 🟡 Minor

Register skipFlush cleanup immediately after enabling it.

If any require or eventual assertion before Line 559 fails, this failpoint stays enabled and can cascade into unrelated runaway tests. Please move the disable into t.Cleanup (or a guarded defer) right after Enable.

🧹 Suggested cleanup guard

 require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush", `return(1)`))
+skipFlushEnabled := true
+t.Cleanup(func() {
+	if skipFlushEnabled {
+		require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))
+	}
+})
 ...
 // resume the flush to ensure an aggregated record can be queried later.
 require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))
+skipFlushEnabled = false

As per coding guidelines "Unit tests in a package that uses failpoints: MUST enable failpoints before tests and disable afterward."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush", `return(1)`))
// check thrice to make sure the runaway query be regarded as a repeated query.
err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.2) from t")
require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.3) from t")
require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
// using FastRunawayGC to trigger flush
require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/FastRunawayGC", `return(1)`))
err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.4) from t")
require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
// resume the flush to ensure an aggregated record can be queried later.
require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))

require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush", `return(1)`))
skipFlushEnabled := true
t.Cleanup(func() {
	if skipFlushEnabled {
		require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))
	}
})
// check thrice to make sure the runaway query be regarded as a repeated query.
err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.2) from t")
require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.3) from t")
require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.4) from t")
require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
// resume the flush to ensure an aggregated record can be queried later.
require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))
skipFlushEnabled = false

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/resourcegroup/tests/resource_group_test.go` around lines 550 - 559, After enabling the failpoint via failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush", ...) immediately register a cleanup to always disable it (use t.Cleanup(func(){ failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush") }) or a guarded defer) so the failpoint is turned off even if subsequent assertions (the three tk.QueryToErr calls with resource_group hint) fail; update the block around the failpoint.Enable call to add this cleanup and remove the later unconditional failpoint.Disable call.
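
The guarded-cleanup idea in this comment can be sketched without the real failpoint package. Everything below is an assumption made for illustration: `cleanupRegistrar` mirrors only the `Cleanup` method of `*testing.T`, `fakeT` replaces the test runner, and the `enabled` map replaces the failpoint registry.

```go
package main

import "fmt"

// cleanupRegistrar is the subset of *testing.T this sketch needs.
type cleanupRegistrar interface {
	Cleanup(func())
}

// fakeT is a hypothetical stand-in for *testing.T that runs cleanups in LIFO order.
type fakeT struct{ cleanups []func() }

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

func (f *fakeT) runCleanups() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

// enabled is a hypothetical in-memory registry standing in for a failpoint library.
var enabled = map[string]bool{}

// enableWithCleanup turns a switch on and registers its cleanup immediately.
// The returned function disables the switch early; after that, the
// registered cleanup becomes a no-op.
func enableWithCleanup(t cleanupRegistrar, name string) (disable func()) {
	enabled[name] = true
	active := true
	disable = func() {
		if active {
			delete(enabled, name)
			active = false
		}
	}
	t.Cleanup(disable)
	return disable
}

func main() {
	ft := &fakeT{}
	disable := enableWithCleanup(ft, "skipFlush")
	fmt.Println("enabled during test:", enabled["skipFlush"])
	disable() // resume normal behaviour mid-test
	fmt.Println("enabled after early disable:", enabled["skipFlush"])
	ft.runCleanups() // safe: the guarded cleanup does nothing now
}
```

Registering the cleanup in the same helper that enables the switch means a failed assertion between enable and disable cannot leave the switch on for later tests.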

coderabbitai · 2026-03-27T05:23:54Z

+	m := meta.NewMutator(txn)
+	err = m.FinishBootstrap(int64(ver223))
+	require.NoError(t, err)
+	RevertVersionAndVariables(t, seV223, ver223)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Verify whether uppercase helper symbols exist anywhere.
rg -nP --type=go 'func\s+(RevertVersionAndVariables|GetBootstrapVersion)\s*\('
# Show call sites in the changed test.
rg -nP --type=go '\b(RevertVersionAndVariables|GetBootstrapVersion)\s*\(' pkg/session/bootstrap_test.go
# Confirm lowercase helpers already present.
rg -nP --type=go 'func\s+(revertVersionAndVariables|getBootstrapVersion)\s*\(' pkg/session/bootstrap_test.go

Repository: pingcap/tidb

Length of output: 38

🏁 Script executed:

wc -l pkg/session/bootstrap_test.go

Repository: pingcap/tidb

Length of output: 90

🏁 Script executed:

sed -n '2595,2652p' pkg/session/bootstrap_test.go

Repository: pingcap/tidb

Length of output: 2401

🏁 Script executed:

# Also check what functions are actually defined at the beginning of the file
grep -n "^func.*revert\|^func.*get" pkg/session/bootstrap_test.go | head -20

Repository: pingcap/tidb

Length of output: 140

Missing resource cleanup in getTableCreateSQLFn helper.

The local helper at line 2612 creates a result set via MustExecToRecodeSet but never closes it. This can leak connection handles and cause non-deterministic test behavior in repeated runs.

Add defer res.Close() after creating the result set:

Proposed fix

 getTableCreateSQLFn := func(se sessiontypes.Session, tableName string) string {
 	res := MustExecToRecodeSet(t, se, fmt.Sprintf("show create table mysql.%s", tableName))
+	defer res.Close()
 	chk := res.NewChunk(nil)
 	err = res.Next(ctx, chk)
 	require.NoError(t, err)
 	require.Equal(t, 1, chk.NumRows())
 	return string(chk.GetRow(0).GetBytes(1))
 }

📝 Committable suggestion

Suggested change

RevertVersionAndVariables(t, seV223, ver223)
getTableCreateSQLFn := func(se sessiontypes.Session, tableName string) string {
	res := MustExecToRecodeSet(t, se, fmt.Sprintf("show create table mysql.%s", tableName))
	defer res.Close()
	chk := res.NewChunk(nil)
	err = res.Next(ctx, chk)
	require.NoError(t, err)
	require.Equal(t, 1, chk.NumRows())
	return string(chk.GetRow(0).GetBytes(1))
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/session/bootstrap_test.go` at line 2607, The helper getTableCreateSQLFn creates a result set via MustExecToRecodeSet but never closes it; after the call that assigns the result set (the variable named res from MustExecToRecodeSet) add a defer res.Close() immediately so the result is always closed and connection handles are released; update getTableCreateSQLFn to defer res.Close() right after MustExecToRecodeSet returns to prevent resource leaks during tests.
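
As an illustration of closing a result set on every return path, here is a minimal sketch. `rowSource` and `firstRow` are hypothetical names and are not TiDB's RecordSet API; the sketch also assumes that a Close error should be reported only when nothing else failed first.

```go
package main

import (
	"errors"
	"fmt"
)

// rowSource is a hypothetical result-set type; it is not TiDB's RecordSet.
type rowSource struct {
	rows   []string
	closed bool
}

// Close marks the source as released.
func (r *rowSource) Close() error {
	r.closed = true
	return nil
}

// firstRow returns the first row and closes the source on every return path.
func firstRow(r *rowSource) (row string, err error) {
	defer func() {
		// Surface a Close error only if nothing else failed first.
		if cerr := r.Close(); err == nil {
			err = cerr
		}
	}()
	if len(r.rows) == 0 {
		return "", errors.New("no rows")
	}
	return r.rows[0], nil
}

func main() {
	src := &rowSource{rows: []string{"CREATE TABLE ..."}}
	row, err := firstRow(src)
	fmt.Println(row, err, src.closed)
}
```

Because the `defer` is registered before any early return, the source is closed even when the empty-result branch is taken.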

codecov · 2026-03-27T05:33:38Z

Codecov Report

❌ Patch coverage is 61.97183% with 108 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release-8.5-20260323-v8.5.5@45689ac). Learn more about missing BASE report.

Additional details and impacted files

@@                       Coverage Diff                        @@
##             release-8.5-20260323-v8.5.5     #67352   +/-   ##
================================================================
  Coverage                               ?   55.4491%           
================================================================
  Files                                  ?       1816           
  Lines                                  ?     650811           
  Branches                               ?          0           
================================================================
  Hits                                   ?     360869           
  Misses                                 ?     263104           
  Partials                               ?      26838

Flag	Coverage Δ
integration	`39.1306% <43.6619%> (?)`
unit	`64.8838% <61.9718%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`52.9278% <0.0000%> (?)`
parser	`∅ <0.0000%> (?)`
br	`63.8755% <0.0000%> (?)`

coderabbitai

♻️ Duplicate comments (1)

pkg/session/bootstrap_test.go (1)

2612-2619: ⚠️ Potential issue | 🟡 Minor

Close the RecordSet in getTableCreateSQLFn to avoid leaks.

At Line 2613, MustExecToRecodeSet returns a RecordSet that is never closed before return. Please close it in the helper to prevent leaked handles across tests.

Proposed fix

 getTableCreateSQLFn := func(se sessiontypes.Session, tableName string) string {
 	res := MustExecToRecodeSet(t, se, fmt.Sprintf("show create table mysql.%s", tableName))
 	chk := res.NewChunk(nil)
 	err = res.Next(ctx, chk)
 	require.NoError(t, err)
 	require.Equal(t, 1, chk.NumRows())
-	return string(chk.GetRow(0).GetBytes(1))
+	ddl := string(chk.GetRow(0).GetBytes(1))
+	require.NoError(t, res.Close())
+	return ddl
 }

As per coding guidelines: "Keep error handling actionable and contextual; avoid silently swallowing errors."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/session/bootstrap_test.go` around lines 2612 - 2619, The helper
getTableCreateSQLFn is leaking the RecordSet returned by MustExecToRecodeSet;
after obtaining res (the RecordSet) you should close it before returning—e.g.,
call res.Close() (or defer res.Close() immediately after the call) and handle
any error from Close (use require.NoError on the returned error) so the
RecordSet is always released even if Next or assertions fail; keep references to
res, chk, ctx and ensure Close runs before the function returns.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a9b0b2bd-33dd-4087-a9fd-65173b00e433

📥 Commits

Reviewing files that changed from the base of the PR and between b2730f6 and f314dba.

📒 Files selected for processing (1)

pkg/session/bootstrap_test.go

EmmaDuDu · 2026-03-30T01:44:26Z

/test unit-test

tiprow · 2026-03-30T01:44:42Z

@EmmaDuDu: No presubmit jobs available for pingcap/tidb@release-8.5-20260323-v8.5.5

Details

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

rleungx · 2026-03-30T02:26:54Z

+	}
+
+	start := time.Now()
+	err := f.flushFn(f.buffer)


If the flush fails, will all buffered records be dropped?

Yes, this case is currently not handled.

Since this flush only affects certain observability records, it is acceptable to leave it unhandled for now.

EmmaDuDu · 2026-03-30T05:51:55Z

/test unit-test

tiprow · 2026-03-30T05:52:11Z

@EmmaDuDu: No presubmit jobs available for pingcap/tidb@release-8.5-20260323-v8.5.5

EmmaDuDu · 2026-03-30T06:49:37Z

/test unit-test

tiprow · 2026-03-30T06:49:52Z

@EmmaDuDu: No presubmit jobs available for pingcap/tidb@release-8.5-20260323-v8.5.5

ti-chi-bot · 2026-03-30T07:23:57Z

[LGTM Timeline notifier]

Timeline:

2026-03-30 06:27:08.303228251 +0000 UTC m=+160033.508588308: ☑️ agreed by rleungx.
2026-03-30 07:23:55.848705165 +0000 UTC m=+163441.054065221: ☑️ agreed by YuJuncen.

hawkingrei · 2026-03-30T12:31:48Z

/retest

hawkingrei · 2026-03-30T13:17:20Z

/retest

hawkingrei · 2026-03-30T14:03:21Z

/retest

hawkingrei · 2026-03-30T15:05:22Z

/retest

hawkingrei · 2026-03-30T21:36:28Z

/retest

hawkingrei · 2026-03-30T22:17:37Z

/retest

hawkingrei · 2026-03-30T23:18:28Z

/retest

EmmaDuDu · 2026-03-31T00:10:19Z

/test unit-test

- Replace metricscommon.New*Vec with local New*Vec (metricscommon package does not exist on the release branch)
- Replace ast.NewCIStr/ast.CIStr with model.NewCIStr/model.CIStr (CIStr was moved from parser/ast to parser/model on this branch)

Signed-off-by: JmPotato <github@ipotato.me>

The test was adapted from master's TestTiDBUpgradeToVer254 but kept master-only exported symbols that don't exist on the release branch:

- RevertVersionAndVariables → revertVersionAndVariables (unexported)
- GetBootstrapVersion → getBootstrapVersion (unexported)
- store.SetOption(StoreBootstrappedKey, nil) → unsetStoreBootstrapped(store.UUID())
- mysql.ModeNone → 0 (avoid adding parser/mysql import for a single constant)

Signed-off-by: JmPotato <github@ipotato.me>

coderabbitai

♻️ Duplicate comments (1)

pkg/resourcegroup/tests/resource_group_test.go (1)

550-559: ⚠️ Potential issue | 🟡 Minor

Register skipFlush cleanup immediately after enabling it.

If any assertion after Line 550 fails, skipFlush stays enabled and can bleed into later runaway tests.

🧹 Suggested cleanup

 require.NoError(t, failpoint.Enable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush", `return(1)`))
+skipFlushEnabled := true
+t.Cleanup(func() {
+	if skipFlushEnabled {
+		require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))
+	}
+})
 // check thrice to make sure the runaway query be regarded as a repeated query.
 err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.2) from t")
 require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
 err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.3) from t")
 require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
 err = tk.QueryToErr("select /*+ resource_group(rg1) */ sleep(0.4) from t")
 require.ErrorContains(t, err, "Query execution was interrupted, identified as runaway query")
 // resume the flush to ensure an aggregated record can be queried later.
 require.NoError(t, failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush"))
+skipFlushEnabled = false

As per coding guidelines "Unit tests in a package that uses failpoints: MUST enable failpoints before tests and disable afterward."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/resourcegroup/tests/resource_group_test.go` around lines 550 - 559,
Enablement of the failpoint
"github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush" must be paired
with immediate cleanup to avoid leaking into other tests; after the
failpoint.Enable(...) call, add a defer that calls
failpoint.Disable("github.com/pingcap/tidb/pkg/resourcegroup/runaway/skipFlush")
(and assert/ignore its error appropriately) so the failpoint is always disabled
even if subsequent requires/assertions fail — update the block around the
current failpoint.Enable(...) and the later failpoint.Disable(...) to remove the
standalone disable and rely on the deferred cleanup instead.

🧹 Nitpick comments (2)

pkg/resourcegroup/runaway/flusher_test.go (1)
26-45: Stop the helper ticker in test cleanup.

newTestBatchFlusher allocates a live time.Ticker for every test case and never stops it. Passing t into the helper and registering flusher.stop() (or ticker.Stop()) in t.Cleanup would keep these tests leak-free.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/resourcegroup/runaway/flusher_test.go` around lines 26 - 45,
newTestBatchFlusher currently creates a live time.Ticker that is never stopped;
change newTestBatchFlusher to accept a *testing.T parameter, create the
batchFlusher as before, then call t.Cleanup to stop the flusher's ticker (either
by calling flusher.stop() or ticker.Stop()) to avoid leaking goroutines; update
all test call sites to pass the *testing.T and ensure batchFlusher.stop() (or
equivalent) is implemented/used to stop the ticker.
pkg/resourcegroup/runaway/syncer.go (1)
177-179: Decouple checkpoint advancement from SELECT * column order.

The new ID checkpointing reads GetInt64(0) from raw rows, so it stays correct only while the key column remains the first physical column in both system tables. Selecting the key column explicitly would make the pagination change less fragile to future schema tweaks.

Also applies to: 216-218, 241-252
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/resourcegroup/runaway/syncer.go` around lines 177 - 179, The checkpoint
advancement currently reads the ID from raw rows using GetInt64(0), which
couples it to physical column order; update the code that sets reader.CheckPoint
(the blocks that use rs[len(rs)-1].GetInt64(0) — including the occurrences
around the reader.CheckPoint assignment and the similar blocks at the other
locations mentioned) to select the key column explicitly in the SQL (e.g.,
SELECT id, ... -> SELECT key_column, ...) and then read the value by column name
or by using a typed scanner into a struct/variable (e.g., scan into an explicit
id/key variable) before assigning reader.CheckPoint, so checkpointing no longer
depends on column position.
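
The pagination concern above can be illustrated with a small in-memory sketch in which the checkpoint advances from a named key field instead of "column 0". The `watchRow` type and the `fetchAfter` and `syncAll` functions are hypothetical and do not reflect the actual syncer.go code; the sketch also assumes positive IDs, so a checkpoint of 0 means "from the start".

```go
package main

import (
	"fmt"
	"sort"
)

// watchRow is a hypothetical row type; the key is a named field rather than "column 0".
type watchRow struct {
	ID   int64
	Text string
}

// fetchAfter simulates `SELECT id, ... WHERE id > ? ORDER BY id LIMIT ?`
// over an in-memory table.
func fetchAfter(table []watchRow, checkpoint int64, limit int) []watchRow {
	sorted := append([]watchRow(nil), table...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	var page []watchRow
	for _, r := range sorted {
		if r.ID > checkpoint && len(page) < limit {
			page = append(page, r)
		}
	}
	return page
}

// syncAll pages through the table, advancing the checkpoint from the named ID field.
func syncAll(table []watchRow, limit int) []int64 {
	var seen []int64
	checkpoint := int64(0)
	for {
		page := fetchAfter(table, checkpoint, limit)
		if len(page) == 0 {
			return seen
		}
		for _, r := range page {
			seen = append(seen, r.ID)
		}
		// The last row of an ordered page carries the next checkpoint.
		checkpoint = page[len(page)-1].ID
	}
}

func main() {
	table := []watchRow{{ID: 3, Text: "c"}, {ID: 1, Text: "a"}, {ID: 2, Text: "b"}}
	fmt.Println(syncAll(table, 2))
}
```

Because the checkpoint is read from `ID` by name, adding or reordering other fields in `watchRow` does not change which value drives the pagination.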

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 839f8c30-68a8-4a59-9bf5-ac3bffa58049

📥 Commits

Reviewing files that changed from the base of the PR and between f314dba and 55177e9.

📒 Files selected for processing (19)

br/pkg/restore/snap_client/systable_restore_test.go
pkg/executor/infoschema_reader_test.go
pkg/metrics/grafana/tidb_resource_control.json
pkg/metrics/grafana/tidb_resource_control.jsonnet
pkg/metrics/metrics.go
pkg/metrics/resource_group.go
pkg/resourcegroup/runaway/BUILD.bazel
pkg/resourcegroup/runaway/checker.go
pkg/resourcegroup/runaway/checker_test.go
pkg/resourcegroup/runaway/flusher.go
pkg/resourcegroup/runaway/flusher_test.go
pkg/resourcegroup/runaway/manager.go
pkg/resourcegroup/runaway/record.go
pkg/resourcegroup/runaway/record_test.go
pkg/resourcegroup/runaway/syncer.go
pkg/resourcegroup/tests/BUILD.bazel
pkg/resourcegroup/tests/resource_group_test.go
pkg/session/bootstrap.go
pkg/session/bootstrap_test.go

💤 Files with no reviewable changes (1)

pkg/resourcegroup/runaway/record_test.go

✅ Files skipped from review due to trivial changes (2)

pkg/executor/infoschema_reader_test.go
pkg/session/bootstrap_test.go

🚧 Files skipped from review as they are similar to previous changes (7)

pkg/resourcegroup/runaway/checker.go
br/pkg/restore/snap_client/systable_restore_test.go
pkg/resourcegroup/tests/BUILD.bazel
pkg/resourcegroup/runaway/BUILD.bazel
pkg/metrics/resource_group.go
pkg/session/bootstrap.go
pkg/resourcegroup/runaway/flusher.go

ti-chi-bot · 2026-03-31T02:46:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yudongusa, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [YuJuncen]
~~br/OWNERS~~ [YuJuncen]
~~pkg/session/OWNERS~~ [yudongusa]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 27, 2026

coderabbitai Bot reviewed Mar 27, 2026

rleungx reviewed Mar 30, 2026

rleungx previously approved these changes Mar 30, 2026

ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 30, 2026

YuJuncen previously approved these changes Mar 30, 2026

ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 30, 2026

okJiang previously approved these changes Mar 30, 2026

yudongusa approved these changes Mar 30, 2026

ti-chi-bot Bot added the approved label Mar 30, 2026

JmPotato added 10 commits March 31, 2026 09:58

refactor(runaway): use info cache to check runaway watch tables (pingcap#65747)

a59675f

ref pingcap#65746

feat(session): add schema upgrade v254 to optimize runaway watch index (pingcap#65828)

d4c8f3b

ref pingcap#65746

feat(runaway): paginate watch sync using id checkpoints (pingcap#66155)

d934329

ref pingcap#65746

feat(runaway): refactor record flushing with batch flusher abstraction (pingcap#65834)

5c83b6d

ref pingcap#65746

refactor(runaway): make active watch counting lock-free (pingcap#66171)

69bdffa

close pingcap#65746

feat(metrics/runaway): add syncer observability (pingcap#66182)

811bbbe

ref pingcap#65746

test: remove extra newline in bootstrap test

b3989b0

test: update index count expectation for runaway watch indexes

55177e9

The addition of idx_start_time and idx_done_time on runaway watch tables increased the total system index count from 81 to 83.

Signed-off-by: JmPotato <github@ipotato.me>

JmPotato dismissed stale reviews from okJiang, YuJuncen, and rleungx via 55177e9 March 31, 2026 01:59

JmPotato force-pushed the cherry-pick/runaway-v8.5.5 branch from f314dba to 55177e9 Compare March 31, 2026 01:59

ti-chi-bot Bot removed the approved label Mar 31, 2026

coderabbitai Bot reviewed Mar 31, 2026

YuJuncen approved these changes Mar 31, 2026

ti-chi-bot Bot added the approved label Mar 31, 2026

ti-chi-bot Bot merged commit bfce44e into pingcap:release-8.5-20260323-v8.5.5 Mar 31, 2026
24 checks passed

JmPotato deleted the cherry-pick/runaway-v8.5.5 branch March 31, 2026 03:50

This was referenced Mar 31, 2026

Release 8.5 20260225 v8.5.2 | pd=@pull-10500-9f9829d_linux_amd64 #67474

Open

*: fix bug of 'like' expression with NO_BACKSLASH_ESCAPES sql_mode | tidb-test=pr/2678 #65973

Merged

infoschema, server: add per connection TLS status (#62563) #67515

Open

coderabbitai Bot mentioned this pull request Apr 9, 2026

session: add upgrade policy for analyze distsql concurrency #65424

Merged

13 tasks

JmPotato mentioned this pull request Apr 14, 2026

resourcegroup/runaway: switch watch syncer to time-based cursor #67758

Merged

2 tasks

This was referenced May 8, 2026

session: backfill tidb_ignore_inlist_plan_digest during upgrade (#68137) #68223

Closed

metrics: enhance diagnostic capabilities for gRPC network issues (#67811) #68299

Open

coderabbitai Bot mentioned this pull request May 20, 2026

ttl: enhance TTL scan with secondary index #68397

Open

13 tasks

JmPotato mentioned this pull request Jun 3, 2026

resourcegroup, session, metrics: cherry-pick runaway watch improvements #68921

Merged

5 tasks
