Skip to content

executor: deflake TestStorageEnginesInSlowQuery#68168

Merged
ti-chi-bot[bot] merged 3 commits intopingcap:masterfrom
henrybw:issue/66727-deflake-storage-engine-in-slow-query
May 7, 2026
Merged

executor: deflake TestStorageEnginesInSlowQuery#68168
ti-chi-bot[bot] merged 3 commits intopingcap:masterfrom
henrybw:issue/66727-deflake-storage-engine-in-slow-query

Conversation

@henrybw
Copy link
Copy Markdown
Contributor

@henrybw henrybw commented May 6, 2026

What problem does this PR solve?

Issue Number: ref #66727

Problem Summary:

TestStorageEnginesInSlowQuery is still flaky after #66773. The slow-log write inside MustExec isn't always immediately visible to the immediately following read of information_schema.slow_query under CI load, so the row-count check fails with 0 rows.

What changed and how does it work?

  • Refactor each slow_query to call a new checkStorageEngines helper function that uses tk.EventuallyMustQueryAndCheck.
  • Add a t.Cleanup that dumps the slow-log file on test failure to disambiguate a missing entry from a present-but-not matching one if this test is still flaky.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Tests
    • Expanded slow-query test coverage to validate storage-engine usage across TiKV, TiFlash, and mixed reads for varied query shapes (point-get, index readers/lookups, index merges, TABLESAMPLE).
    • Added reusable checks to assert storage routing and introduced multiple scenarios exercising read-from-storage hints.
    • Improved test cleanup and slow-log handling to provide clearer diagnostics on failures.

@ti-chi-bot ti-chi-bot Bot added the release-note-none Denotes a PR that doesn't merit a release note. label May 6, 2026
@pantheon-ai
Copy link
Copy Markdown

pantheon-ai Bot commented May 6, 2026

@henrybw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 6, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e53700b0-1844-4e1f-afcd-8ed99b89e9fa

📥 Commits

Reviewing files that changed from the base of the PR and between c43b8de and 58dab3b.

📒 Files selected for processing (1)
  • pkg/executor/slow_query_sql_test.go

📝 Walkthrough

Walkthrough

Adds a test helper and expands TestStorageEnginesInSlowQuery in pkg/executor/slow_query_sql_test.go to assert storage engine usage (TiKV/TiFlash) across multiple query scenarios; replaces defer cleanup with t.Cleanup that can dump slow logs on failure.

Changes

Slow Query Test Improvements

Layer / File(s) Summary
Test Helper
pkg/executor/slow_query_sql_test.go
Added checkStorageEngines(t *testing.T, tk *testkit.TestKit, where, expected string) which polls information_schema.slow_query for storage_from_kv and storage_from_mpp within a 2s window and asserts the expected values.
Cleanup / Debug Wiring
pkg/executor/slow_query_sql_test.go
Replaced defer-based cleanup with t.Cleanup that restores global config, removes the slow log file, and conditionally dumps the slow log on test failure.
Test Scenarios / Assertions
pkg/executor/slow_query_sql_test.go
Refactored TestStorageEnginesInSlowQuery to call checkStorageEngines for multiple scenarios: simple SELECT, TiKV-only, TiFlash-only, TiKV+TiFlash mixed reads, point-get, index reader, index lookup, index merge, and TABLESAMPLE. Replaced inline slow_query checks with the helper and added several new cases exercising read_from_storage hints.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm

Suggested reviewers

  • XuHuaiyu
  • xzhangxian1008
  • gengliqi

Poem

I hopped through logs both slow and spry,
Polling engines as queries fly.
TiKV here, TiFlash there,
I count their footprints with rabbit care. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'executor: deflake TestStorageEnginesInSlowQuery' directly describes the main change: making the slow query test less flaky through refactoring.
Description check ✅ Passed The description follows the required template with all key sections completed: issue reference, clear problem summary, detailed explanation of changes, proper test checklist with unit test marked, and release note set to None.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

Command failed

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@henrybw
Copy link
Copy Markdown
Contributor Author

henrybw commented May 6, 2026

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.0757%. Comparing base (49398d2) to head (58dab3b).
⚠️ Report is 13 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68168        +/-   ##
================================================
- Coverage   77.7514%   77.0757%   -0.6757%     
================================================
  Files          1990       1972        -18     
  Lines        551828     552480       +652     
================================================
- Hits         429054     425828      -3226     
- Misses       121854     126650      +4796     
+ Partials        920          2       -918     
Flag Coverage Δ
integration 41.5054% <ø> (+1.7035%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4888% <ø> (ø)
parser ∅ <ø> (∅)
br 50.0597% <ø> (-13.0338%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/executor/slow_query_sql_test.go (1)

479-494: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

t.Cleanup diagnostic never fires — file is already removed by the time it runs.

In Go's testing model, defer statements in the test function body execute when the function returns, while t.Cleanup callbacks run after the function returns. Because the defer at line 479 calls both f.Close() and os.Remove(...), the slow-log file is gone before the t.Cleanup at line 487 attempts os.ReadFile. The if err == nil guard silently swallows the "no such file" error, making the diagnostic a no-op on every failure.

Fix: convert the file teardown from defer to a t.Cleanup registered before the diagnostic one. Since t.Cleanup is LIFO, the diagnostic (registered later) will run before the removal (registered earlier).

🐛 Proposed fix — swap defer for t.Cleanup with correct ordering
-	defer func() {
-		config.StoreGlobalConfig(originCfg)
-		require.NoError(t, f.Close())
-		require.NoError(t, os.Remove(newCfg.Log.SlowQueryFile))
-	}()
 	require.NoError(t, logutil.InitLogger(newCfg.Log.ToLogConfig()))
-	// On failure, dump the slow log to disambiguate a missing entry from one
-	// that's present but doesn't match the expected pattern (issue `#66727`).
+	// Registered first → runs last (LIFO): tear down file after the dump.
+	t.Cleanup(func() {
+		config.StoreGlobalConfig(originCfg)
+		require.NoError(t, f.Close())
+		require.NoError(t, os.Remove(newCfg.Log.SlowQueryFile))
+	})
+	// Registered second → runs first (LIFO): dump before removal.
+	// On failure, dump the slow log to disambiguate a missing entry from one
+	// that's present but doesn't match the expected pattern (issue `#66727`).
 	t.Cleanup(func() {
 		if !t.Failed() {
 			return
 		}
 		if data, err := os.ReadFile(f.Name()); err == nil {
 			t.Logf("slow log contents (%d bytes):\n%s", len(data), data)
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/executor/slow_query_sql_test.go` around lines 479 - 494, The teardown
that closes and removes the slow-log file is currently done with defer (calling
config.StoreGlobalConfig(originCfg), f.Close(), and
os.Remove(newCfg.Log.SlowQueryFile)) which runs before the t.Cleanup diagnostic
and causes the diagnostic's os.ReadFile(f.Name()) to always fail; change that
defer into a t.Cleanup registered before the existing diagnostic t.Cleanup so
cleanup runs LIFO (register t.Cleanup(func(){
config.StoreGlobalConfig(originCfg); require.NoError(t, f.Close());
require.NoError(t, os.Remove(newCfg.Log.SlowQueryFile)) }) before the diagnostic
one) ensuring the diagnostic t.Cleanup that reads f.Name() runs while the file
still exists; keep the loginit call to
logutil.InitLogger(newCfg.Log.ToLogConfig()) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/executor/slow_query_sql_test.go`:
- Around line 479-494: The teardown that closes and removes the slow-log file is
currently done with defer (calling config.StoreGlobalConfig(originCfg),
f.Close(), and os.Remove(newCfg.Log.SlowQueryFile)) which runs before the
t.Cleanup diagnostic and causes the diagnostic's os.ReadFile(f.Name()) to always
fail; change that defer into a t.Cleanup registered before the existing
diagnostic t.Cleanup so cleanup runs LIFO (register t.Cleanup(func(){
config.StoreGlobalConfig(originCfg); require.NoError(t, f.Close());
require.NoError(t, os.Remove(newCfg.Log.SlowQueryFile)) }) before the diagnostic
one) ensuring the diagnostic t.Cleanup that reads f.Name() runs while the file
still exists; keep the loginit call to
logutil.InitLogger(newCfg.Log.ToLogConfig()) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6600e1cb-297b-487a-91f4-2cdefe159b9b

📥 Commits

Reviewing files that changed from the base of the PR and between 49398d2 and c348240.

📒 Files selected for processing (1)
  • pkg/executor/slow_query_sql_test.go

@tiprow
Copy link
Copy Markdown

tiprow Bot commented May 6, 2026

Hi @henrybw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/executor/slow_query_sql_test.go (1)

517-517: 💤 Low value

'select%t_tiflash;' pattern also matches the subsequent join query — latent ordering dependency.

The join query on line 520 ends with from t_tikv, t_tiflash;, so it is also matched by select%t_tiflash;. The check at line 517 is currently safe only because it executes before line 520 runs. If these two blocks were reordered, or another query matching this suffix were inserted above line 516, checkStorageEngines at line 517 would see 2 rows and fail non-obviously.

Consider anchoring the pattern more tightly — for example, matching the full hint text — or document the ordering constraint with a comment:

♻️ Suggested fix
-	checkStorageEngines(t, tk, "query like 'select%t_tiflash;'", "0 1")
+	// Pattern must not match the later join query (also ends in "t_tiflash;"); keep this
+	// check before the join scenario below.
+	checkStorageEngines(t, tk, "query like 'select%read_from_storage(tiflash[t_tiflash])%t_tiflash;'", "0 1")

Also applies to: 520-521


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3bed41ac-5326-4da5-9075-2b3e9270b7cc

📥 Commits

Reviewing files that changed from the base of the PR and between c348240 and c43b8de.

📒 Files selected for processing (1)
  • pkg/executor/slow_query_sql_test.go

Copy link
Copy Markdown
Contributor

@windtalker windtalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ti-chi-bot ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 6, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lance6716, windtalker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [lance6716,windtalker]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 7, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 7, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-06 03:45:57.293650552 +0000 UTC m=+239430.167000524: ☑️ agreed by windtalker.
  • 2026-05-07 12:50:37.704439535 +0000 UTC m=+358510.577789517: ☑️ agreed by lance6716.

@ti-chi-bot ti-chi-bot Bot merged commit 7b2b384 into pingcap:master May 7, 2026
34 of 35 checks passed
@henrybw henrybw deleted the issue/66727-deflake-storage-engine-in-slow-query branch May 7, 2026 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants