ttl: honor scan task cancellation across statement boundaries #67285

Merged
ti-chi-bot[bot] merged 6 commits into pingcap:master from zanmato1984:issue-66982-flaky on Apr 4, 2026

@zanmato1984
Contributor

@zanmato1984 zanmato1984 commented Mar 25, 2026

What problem does this PR solve?

Issue Number: ref #66982

Problem Summary

TestCancelWhileScan is flaky because TTL scan cancellation can fall into a statement-boundary gap.

The scan task currently relies on KillStmt to interrupt the running internal SQL. If cancellation happens between statements, the next internal statement resets the statement context before execution, which clears the statement-bound kill state. As a result, the scan SELECT can still start and continue running even though the TTL task has already been canceled.

What is changed and how it works?

This PR fixes the issue in two parts:

  1. Pass the TTL scan task cancellation context into the actual internal SQL execution path, instead of relying only on KillStmt.
  2. After ResetContextOfStmt, immediately honor a canceled caller context before executing the next statement.

Together, these changes close the statement-boundary cancellation gap and make TTL scan cancellation respond to task cancellation directly.

This PR also adds targeted regression coverage for the statement-boundary cancellation case.

Check List

Tests

  • Unit test
  • Integration test
  • Lint

Side effects

This changes internal SQL behavior only when the caller context is already canceled. That behavior should be more correct and should have negligible performance impact beyond an additional ctx.Err() check and context propagation in TTL scan.

Release note

Fix flaky TTL scan cancellation caused by a statement-boundary gap between task cancellation and internal SQL execution.

Summary by CodeRabbit

  • Bug Fixes

    • TTL scans now honor cancellation promptly at SQL statement boundaries and propagate cancellation through scan operations, preventing stalled aborts.
    • Execution path now checks for cancellation before starting subsequent statements, stopping further work when a caller cancels.
    • Transaction finalization now more reliably commits or rolls back around canceled contexts.
  • Tests

    • Added integration and unit tests validating TTL scan cancellation timing and transaction behavior after cancellation.
  • Chores

    • Test shard configuration adjusted.

@ti-chi-bot ti-chi-bot Bot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes.), do-not-merge/needs-tests-checked, and do-not-merge/needs-triage-completed labels Mar 25, 2026
@pantheon-ai

pantheon-ai Bot commented Mar 25, 2026

Review failed due to infrastructure/execution failure after retries. Please re-trigger review.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 25, 2026
@tiprow

tiprow Bot commented Mar 25, 2026

Hi @zanmato1984. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Threads a cancellable per-scan context into TTL scan execution and sessions, injects a failpoint immediately before SQL-killer reset at statement cleanup, and enforces an explicit cancellation check after statement-context reset at statement boundaries.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Executor statement reset (pkg/executor/select.go) | Injects beforeResetSQLKillerForTTLScan failpoint immediately before vars.SQLKiller.Reset() when vars.InRestrictedSQL and vars.InternalSQLScanUserTable are true. |
| Session execution boundary (pkg/session/session.go) | After ResetContextOfStmt(...) in executeStmtImpl, check ctx.Err() and return early if canceled to honor caller cancellation at the statement boundary. |
| TTL scan flow (pkg/ttl/ttlworker/scan.go) | Create per-task scanCtx via context.WithCancel(ctx), cancel it from the kill goroutine (call cancelScanCtx() before rawSess.KillStmt()), use scanCtx for session creation and SQL execution, and make retry/wait/delete logic observe scanCtx (return scanCtx.Err()). |
| TTL session signature (pkg/ttl/ttlworker/session.go) | Change NewScanSession to accept ctx context.Context and use that context for internal ExecuteSQL calls (replacing context.Background()). |
| Tests: TTL and executor (pkg/ttl/ttlworker/scan_integration_test.go, pkg/ttl/ttlworker/session_integration_test.go, pkg/executor/test/executor/executor_test.go) | Add TestCancelWhileScanAtStatementBoundary integration test using failpoints to trigger cancellation at statement boundary; update test call sites to new NewScanSession signature; relax cancellation assertion to use errors.Is semantics. |
| Transaction handling tests & manager (pkg/dxf/framework/storage/task_state_test.go, pkg/dxf/framework/storage/task_table.go) | Add test for txn rollback on canceled context; change TaskManager.WithNewTxn defer to call se.CommitTxn(ctx) on success and se.RollbackTxn(...) (with internal source context) on failure instead of raw SQL exec. |
| Build config (pkg/dxf/framework/storage/BUILD.bazel) | Increase go_test shard_count from 28 to 29 for storage_test. |

Sequence Diagram

sequenceDiagram
    participant KG as Kill Goroutine
    participant Task as TTL Scan Task
    participant Sess as Scan Session / Executor
    participant Killer as SQL Killer

    Task->>Task: scanCtx = context.WithCancel(ctx)
    Task->>Sess: NewScanSession(scanCtx, ...)
    Task->>Sess: Execute SQL using scanCtx

    alt normal execution
        Sess->>Sess: run statement
        Note over Sess: failpoint beforeResetSQLKillerForTTLScan
        Sess->>Killer: vars.SQLKiller.Reset()
        Sess->>Sess: ResetContextOfStmt returns
        Sess->>Sess: check ctx.Err() -> continue if nil
    else cancellation path
        KG->>KG: receive kill trigger
        KG->>Task: cancelScanCtx()
        KG->>Killer: rawSess.KillStmt()
        Task->>Sess: scanCtx canceled -> operations return scanCtx.Err()
    end

    Sess-->>Task: propagate cancellation error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lcwangchao
  • YangKeao
  • wjhuang2016
  • hawkingrei

Poem

🐰 I threaded a quiet scan-bound thread,
A cancel whisper nudged the work to bed,
A failpoint paused the killer's reset beat,
One hop, one cancel — scan and killer meet,
I munch the logs and hop away, content.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: fixing TTL scan cancellation to work across statement boundaries. |
| Description check | ✅ Passed | The PR description is comprehensive and follows the template structure with all required sections properly filled. |

@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot Bot added the ok-to-test Indicates a PR is ready to be tested. label Mar 25, 2026
@zanmato1984 zanmato1984 requested a review from bb7133 March 25, 2026 07:22
@YangKeao YangKeao self-requested a review March 25, 2026 07:47
@codecov

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.6777%. Comparing base (8412422) to head (7659f52).
⚠️ Report is 14 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67285        +/-   ##
================================================
- Coverage   77.7173%   77.6777%   -0.0396%     
================================================
  Files          1959       1945        -14     
  Lines        543377     545796      +2419     
================================================
+ Hits         422298     423962      +1664     
- Misses       120238     121832      +1594     
+ Partials        841          2       -839     
| Flag | Coverage Δ |
| --- | --- |
| integration | 41.2423% <10.3448%> (+5.0675%) ⬆️ |
| unit | 76.8749% <93.1034%> (+0.5318%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
| --- | --- |
| dumpling | 61.5065% <ø> (ø) |
| parser | ∅ <ø> (∅) |
| br | 48.9152% <ø> (-12.0649%) ⬇️ |

@zanmato1984
Contributor Author

/retest

Member

@bb7133 bb7133 left a comment

LGTM

@ti-chi-bot ti-chi-bot Bot added the approved and needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) labels Mar 25, 2026
@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

@pantheon-ai please review

@pantheon-ai

pantheon-ai Bot commented Mar 25, 2026

Review Complete

Findings: 0 issues
Posted: 0
Duplicates/Skipped: 0

ℹ️ Learn more details on Pantheon AI.


@pantheon-ai pantheon-ai Bot left a comment

✅ Code looks good. No issues found.

@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

/check-issue-triage-complete

@zanmato1984
Contributor Author

/retest

@ti-chi-bot ti-chi-bot Bot removed the approved label Apr 1, 2026
@zanmato1984
Contributor Author

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/dxf/framework/storage/task_state_test.go`:
- Around line 151-153: The test currently calls sqlexec.ExecSQL(ctx,
se.GetSQLExecutor(), "select sleep(10)"), asserts require.NoError(t, err) and
then returns ctx.Err(); instead capture and return the ExecSQL error directly so
the test observes statement-level cancellation: replace the require.NoError
check with returning the err from the ExecSQL call (i.e., keep the call to
sqlexec.ExecSQL using ctx and se.GetSQLExecutor(), assign its error and return
that error) so cancellation surfaced by ExecSQL is asserted instead of
fabricating context.Canceled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e60c7db3-0ab7-4a56-b235-434f4553a6d6

📥 Commits

Reviewing files that changed from the base of the PR and between 1cf552f and af1968c.

📒 Files selected for processing (2)
  • pkg/dxf/framework/storage/task_state_test.go
  • pkg/dxf/framework/storage/task_table.go

Comment thread pkg/dxf/framework/storage/task_state_test.go
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 2, 2026
Member

@YangKeao YangKeao left a comment

LGTM

Though I doubt whether it'll be very helpful, because the original implementation will leak at most one statement, after all this PR makes things better 👍

@ti-chi-bot ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) label Apr 2, 2026
@ti-chi-bot

ti-chi-bot Bot commented Apr 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-25 18:31:02.987070613 +0000 UTC m=+379459.023140873: ☑️ agreed by bb7133.
  • 2026-04-02 07:17:23.336430063 +0000 UTC m=+422248.541790120: ☑️ agreed by YangKeao.

Contributor

@D3Hunter D3Hunter left a comment

rest lgtm

_, commitErr := sqlexec.ExecSQL(ctx, se.GetSQLExecutor(), sql)
if err == nil && commitErr != nil {
err = commitErr
commitErr := se.CommitTxn(ctx)
Contributor

Can you put in the comment why we use begin explicitly, but use named methods for commit/rollback?

@zanmato1984
Contributor Author

/retest

@ti-chi-bot

ti-chi-bot Bot commented Apr 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bb7133, D3Hunter, YangKeao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zanmato1984
Contributor Author

/retest

@hawkingrei
Member

/retest

@hawkingrei
Member

/retest

@hawkingrei
Member

/retest

@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

/retest

@zanmato1984
Contributor Author

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 3, 2026
@hawkingrei
Member

/retest

@zanmato1984
Contributor Author

/unhold

@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 3, 2026
@zanmato1984
Contributor Author

/retest

@ti-chi-bot ti-chi-bot Bot merged commit 7d3162c into pingcap:master Apr 4, 2026
35 checks passed

Labels

  • approved
  • lgtm
  • ok-to-test: Indicates a PR is ready to be tested.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants