Skip to content

pkg/ddl, planner: add TiKV space precheck for DXF add-index#68955

Open
expxiaoli wants to merge 53 commits into
pingcap:release-nextgen-202603from
expxiaoli:release-nextgen-202603-tikv-precheck
Open

pkg/ddl, planner: add TiKV space precheck for DXF add-index#68955
expxiaoli wants to merge 53 commits into
pingcap:release-nextgen-202603from
expxiaoli:release-nextgen-202603-tikv-precheck

Conversation

@expxiaoli

@expxiaoli expxiaoli commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

This PR cherry-picks #68490 to release-nextgen-202603 and adds follow-up changes for Next-Gen SST estimation and ingested SST observability.

What problem does this PR solve?

Issue Number: ref #68354

Problem Summary:

DXF add-index can generate a large amount of SST data and exhaust TiKV disk space, but TiDB does not estimate the physical write footprint before submitting the distributed task.

TiDB also lacks reliable task-level observability for comparing predicted bytes with the SST bytes successfully ingested, especially on Next-Gen clusters where the CSE storage format differs from classic TiKV SSTs.

What changed and how does it work?

Submission-time prediction and precheck

  • Collect an initial TiKV capacity snapshot from PD before the first DXF add-index task submission.
  • Use bounded block sampling to generate representative index KVs through the same helpers used by the actual backfill path.
  • Estimate the physical SST footprint:
    • For classic TiKV, model MVCC default/write CF SST encoding with Pebble.
    • For Next-Gen, model CSE block layout, compression, MVCC metadata, and BinaryFuse8 filter bytes.
    • Estimate the logical and MVCC footprints separately to report MVCC overhead.
  • Scale the sampled estimate by table row count and TiKV replica count.
  • Persist the single-replica estimate, all-replica estimate, MVCC overhead, replica information, sampling diagnostics, and initial capacity snapshot in the DXF task metadata.
  • Check the projected remaining capacity:
    • cluster remaining space must be at least 20% of total capacity
    • each store's projected remaining space must be at least 15% of its total capacity
  • Bound the complete submission-time capacity collection and prediction process to five seconds. Capability, collection, or prediction failures are logged and skipped rather than blocking task submission.
  • Control enforcement with the global variable enforce_disk_space_precheck_before_add_index:
    • OFF by default: log insufficient capacity without rejecting the DDL
    • ON: reject the add-index job when the capacity check fails

Ingested SST observability

  • Record successfully ingested SST identities and sizes from both classic and Next-Gen ingest paths.
  • Deduplicate SST identities so retries do not inflate the observed byte count.
  • Aggregate the counters from read-index and write-and-ingest subtasks.
  • At successful task completion, log:
    • block_sample_predicted_tikv_index_single_replica_bytes
    • block_sample_predicted_tikv_index_all_replica_bytes
    • block_sample_mvcc_overhead_total_bytes
    • logical_index_kv_bytes
    • ingested_sst_bytes
    • ingest source and reliability status
    • initial TiKV capacity and sampling diagnostics

Shared helper and correctness changes

  • Reuse index KV generation, virtual-column, multi-valued-index, global-index, and common-handle helpers between prediction and actual backfill paths.
  • Extract TiKV transaction SST key/value encoding helpers for prediction.
  • Fix restored-data handling for common handles in RECOVER INDEX.
  • Update Bazel dependencies and test shard metadata for the release branch.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test

Validation:

  • make bazel_prepare
  • Targeted failpoint unit tests for DDL prediction, capacity collection, precheck enforcement, SST observation, table index encoding, and the new system variable
  • RealTiKV TestDXFAddIndexRealtimeSummary, including verification of reported ingested SST bytes on Next-Gen
  • make lint
  • git diff --check
  • Next-Gen build, unit-test, MySQL-client, and real-cluster integration CI jobs

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Improve DXF add-index by estimating its physical TiKV SST footprint before task submission, optionally rejecting jobs with insufficient disk space, and reporting successfully ingested SST bytes for classic and Next-Gen clusters.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/invalid-title release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 4, 2026
@tiprow

tiprow Bot commented Jun 4, 2026

Copy link
Copy Markdown

Hi @expxiaoli. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds TiKV capacity prediction and optional disk-space precheck for distributed add-index; refactors index KV generation and handle-restoration APIs; records ingested SST identities/sizes; extracts Lightning SST encoders; expands mocks and tests; updates Bazel deps and wiring.

Changes

Add-Index TiKV Capacity Prediction with Refactoring

Layer / File(s) Summary
Metadata, config, and wiring
pkg/ddl/backfilling_dist_executor.go, pkg/sessionctx/vardef/tidb_vars.go, pkg/sessionctx/variable/sysvar.go
BackfillTaskMeta extended with TiKV capacity/prediction fields; TiKVStoreCapacity/TiKVClusterCapacity added; new sysvar TiDBEnforceDiskSpacePrecheckBeforeAddIndex and global atomic exposed; executor wiring passes initial-capacity presence flag.
Handle restoration & virtual-column helper exports
pkg/table/tables/tables.go, pkg/ddl/copr/copr_ctx.go, pkg/executor/admin.go, pkg/ddl/index_cop.go, pkg/ddl/copr/copr_ctx_test.go
Introduces exported TryGetHandleRestoredData; renames/exports virtual-column offsets/types helper; updates recover-index and test call sites; removes legacy getRestoreData.
Multi-value index helpers & encoding
pkg/table/tables/index.go, pkg/table/tables/index_test.go
Introduces BuildMultiValueIndexValueGroups and EncodeRawIndexKeyValues; MV expansion and raw-key encoding are tested for truncation and JSON-array expansion/deduplication.
Index KV generation refactor
pkg/ddl/index.go, pkg/ddl/backfilling_operators.go, pkg/ddl/backfilling_clean_s3.go
Centralizes per-row index KV emission into generateIndexKVsForRow, extracts indexKVHandleForPhysicalTable, updates writeChunk to accept physicalID, simplifies writeOneKV, and delegates cleanup metering to helper.
TiKV capacity prediction & precheck pipeline
pkg/ddl/index.go
Adds PD gRPC store-stats collection, block-sampled index-bytes prediction with MVCC estimation, replica-count resolution (placement/PD), scaling and optional disk-space precheck, records prediction inputs in BackfillTaskMeta, and logs observed ingested SST usage.
Region iteration & PD utilities
pkg/ddl/reorg_util.go, pkg/ddl/reorg_util_test.go
Adds helperFromKVStorage, pdHTTPClientFromStorage, listTableRegions*, and iterateTableRegionsWithClient to page PD region results and centralize region-based sampling; tests expanded with mocks.
Ingested SST recording & summary tracking
pkg/dxf/framework/taskexecutor/execute/interface.go, pkg/ddl/backfilling_import_cloud.go, pkg/ddl/backfilling_read_index.go
Adds SubtaskSummary ingest counters and IngestedSSTCollector; read/import executors and ingestCollector can record deduplicated ingested SST identities/sizes and update summary counters.
Ingest CLI & worker plumbing, tests
pkg/ingestor/ingestcli/*, pkg/lightning/backend/local/*, pkg/lightning/tikv/local_sst_writer.go, tests
nextGenSSTMeta gains Size; WriteResponse.GetSSTMeta() added; SST key/value encoders extracted; job worker records ingested SST meta (next-gen/classic identities) to collector; tests updated with recording collectors and multi-ingest partial-failure scenarios.
Tests & mocks for prediction/observed-capacity
pkg/ddl/reorg_util_test.go, tests/realtikvtest/addindextest2/global_sort_test.go
Adds PD/TiKV mock infrastructure (bufconn, mock PD HTTP client/replicate config), block-sample prediction helpers, and tests validating prediction, replica parsing/scaling and observed capacity via failpoint hook.
Build & Bazel deps
pkg/ddl/BUILD.bazel, pkg/table/tables/BUILD.bazel, pkg/lightning/backend/local/BUILD.bazel
Updates Bazel deps to include Pebble objstorage/sstable/vfs, PD client targets, lightning tikv, adds gRPC bufconn test dep, and bumps test shard count; local_test deps include execute package.

Sequence Diagram

sequenceDiagram
  participant Worker as AddIndex Worker
  participant PD as PD (gRPC/HTTP)
  participant TiKV as TiKV Stores
  participant Storage as Lightning / KV Storage
  participant Collector as SubtaskSummary Collector
  Worker->>PD: collect store stats and placement info
  Worker->>TiKV: block-sample region snapshot reads
  Worker->>Worker: predict index bytes, scale by replica count
  Worker->>PD: optionally enforce disk-space precheck decision
  Worker->>Storage: submit dist add-index task (SST ingest)
  Storage->>Worker: return ingest metadata (SST id/size)
  Worker->>Collector: record ingested SST identities/sizes into SubtaskSummary
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • pingcap/tidb#68833: Overlaps with partial-index handling used by index KV generation changes.
  • pingcap/tidb#68824: Touches backfill step wiring and executor initialization related to backfilling dist executor changes.

Suggested labels

cherry-pick-approved, type/cherry-pick-for-release-nextgen-202603, approved, lgtm

Suggested reviewers

  • wjhuang2016
  • D3Hunter
  • joechenrh
  • gengliqi

Poem

"I’m a rabbit counting bytes with care,
hopping through SSTs in the midnight air,
predicting TiKV space with a twitchy nose,
recording ingests where the data grows,
a tiny hop — and the build compiles fair. 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.87% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding TiKV space precheck for DXF add-index, which aligns with the extensive changes across DDL, prediction, precheck, and observability components described in the PR objectives.
Description check ✅ Passed The PR description comprehensively addresses the required template sections: it clearly identifies the issue (#68354), explains the problem (DXF add-index can exhaust TiKV disk space without estimation), details what changed (prediction, precheck, observability), includes a complete checklist with appropriate selections, covers side effects and documentation impacts, and provides a release note aligned with the style guide.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/ddl/index.go (1)

3195-3203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't skip observed-capacity logging when resuming an already-succeeded task.

This early return bypasses scheduleDistTaskObservedTiKVUsage(...), so a paused/resumed add-index job never emits the final predicted-vs-observed TiKV usage log even though the task history and metadata are still available.

Suggested fix
 		if task.State == proto.TaskStateSucceed {
 			w.updateDistTaskRowCount(taskKey, reorgInfo.Job.ID)
+			w.scheduleDistTaskObservedTiKVUsage(taskManager, taskKey, reorgInfo.Job.ID)
 			logutil.DDLLogger().Info(
 				"task succeed, start to resume the ddl job",
 				zap.String("task-key", taskKey))
 			return nil
 		}

Also applies to: 3421-3424

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 3195 - 3203, The early return when task.State
== proto.TaskStateSucceed skips scheduleDistTaskObservedTiKVUsage, so ensure the
observed-capacity logging still runs before resuming an already-succeeded task:
call scheduleDistTaskObservedTiKVUsage with the same identifiers used elsewhere
(use taskKey and reorgInfo.Job.ID) and any required context/params, then call
updateDistTaskRowCount and the resume log, and only then return; apply the same
change to the analogous block around lines 3421-3424 so both paths emit the
final predicted-vs-observed TiKV usage log.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/ddl/index.go`:
- Around line 3250-3331: Wrap the PD-backed precheck calls in a short timeout
derived from ctx (e.g. ctxWithTimeout, cancel := context.WithTimeout(ctx,
5*time.Second); defer cancel()) and pass ctxWithTimeout to the functions that
perform PD/TiKV RPCs so submission cannot hang: use the timed context when
calling w.predictTiKVIndexBytesBlockSample, w.resolveAddIndexTiKVReplicaCount,
canRunTiKVSpacePrecheck, and collectTiKVStoreCapacity (and any other
PD/region/capacity helpers invoked in this block); ensure you call cancel()
after creating the context.
- Around line 4144-4160: The check underestimates space when indexes are
placement-constrained because it divides predictedTiKVIndexBytes by
capacity.StoreCount and sums capacity.Stores indiscriminately; change the logic
in the cluster and per-store checks (where clusterRemaining,
perStorePredictBytes, remainingTiKVBytes, capacity.StoreCount and
capacity.Stores are used) to first compute the actual set of eligible stores
based on the index's placement/replica bundles (respecting pinned/targeted
stores), sum AvailableBytes only over those eligible stores for
clusterRemaining/clusterThreshold, and divide predictedTiKVIndexBytes across
only those eligible stores when computing perStorePredictBytes and
storeRemaining/storeThreshold; apply the same fix in the related block around
lines 4180-4250 that performs the same admission checks.
- Around line 3346-3358: The code stores peak MVCC bytes in the steady field:
blockSamplePrediction.PredictedBytes (set from mvccPhysicalBytes) is being
persisted into BlockSampleSteadyPredictedTiKVIndexBytes; change the assignment
so the MVCC peak value (mvccPhysicalBytes or the variable used when computing
peak) is saved to the appropriate "peak"/peak-named field and ensure
blockSamplePrediction.PredictedBytes is used for the true steady estimate;
update the three occurrences that mirror this pattern (around the symbols
BlockSampleSteadyPredictedTiKVIndexBytes, blockSamplePrediction.PredictedBytes,
and mvccPhysicalBytes) so steady vs peak are written to the correct fields and
logs.

In `@pkg/ddl/reorg_util_test.go`:
- Around line 355-360: Replace usage of context.Background() passed to
grpc.DialContext and RPC calls with a short timeout/deadline context to prevent
CI hangs: create a context with a timeout (e.g., context.WithTimeout) before
calling grpc.DialContext (the block that constructs conn via grpc.DialContext
and the RPC call sites around the listener.Dial() usage), use that context for
the dial and any subsequent RPCs, defer cancel() and handle the context deadline
error in the test (still calling require.NoError on the dial/RPC error) so tests
fail fast if the mock server is not responsive.

---

Outside diff comments:
In `@pkg/ddl/index.go`:
- Around line 3195-3203: The early return when task.State ==
proto.TaskStateSucceed skips scheduleDistTaskObservedTiKVUsage, so ensure the
observed-capacity logging still runs before resuming an already-succeeded task:
call scheduleDistTaskObservedTiKVUsage with the same identifiers used elsewhere
(use taskKey and reorgInfo.Job.ID) and any required context/params, then call
updateDistTaskRowCount and the resume log, and only then return; apply the same
change to the analogous block around lines 3421-3424 so both paths emit the
final predicted-vs-observed TiKV usage log.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 159a21d1-a5b2-42af-a059-5845d424fb3f

📥 Commits

Reviewing files that changed from the base of the PR and between e189973 and 1f63a3c.

📒 Files selected for processing (21)
  • pkg/ddl/BUILD.bazel
  • pkg/ddl/backfilling_clean_s3.go
  • pkg/ddl/backfilling_dist_executor.go
  • pkg/ddl/backfilling_operators.go
  • pkg/ddl/copr/copr_ctx.go
  • pkg/ddl/copr/copr_ctx_test.go
  • pkg/ddl/index.go
  • pkg/ddl/index_cop.go
  • pkg/ddl/reorg_util.go
  • pkg/ddl/reorg_util_test.go
  • pkg/executor/admin.go
  • pkg/lightning/tikv/local_sst_writer.go
  • pkg/sessionctx/vardef/tidb_vars.go
  • pkg/sessionctx/variable/sysvar.go
  • pkg/sessionctx/variable/sysvar_test.go
  • pkg/table/tables/BUILD.bazel
  • pkg/table/tables/index.go
  • pkg/table/tables/index_test.go
  • pkg/table/tables/mutation_checker_test.go
  • pkg/table/tables/tables.go
  • tests/realtikvtest/addindextest2/global_sort_test.go
💤 Files with no reviewable changes (1)
  • pkg/ddl/index_cop.go

Comment thread pkg/ddl/index.go
Comment thread pkg/ddl/index.go Outdated
Comment thread pkg/ddl/index.go
Comment on lines +4144 to +4160
clusterRemaining := remainingTiKVBytes(capacity.AvailableBytes, predictedTiKVIndexBytes)
clusterThreshold := uint64(float64(capacity.TotalBytes) * 0.20)
if clusterRemaining < clusterThreshold {
return dbterror.ErrIngestCheckEnvFailed.FastGenByArgs(
fmt.Sprintf("insufficient TiKV cluster capacity predicted for add index: cluster_available_bytes=%d predict_tikv_index_bytes=%d cluster_total_bytes=%d cluster_remaining_bytes=%d cluster_threshold_bytes=%d",
capacity.AvailableBytes, predictedTiKVIndexBytes, capacity.TotalBytes, clusterRemaining, clusterThreshold))
}

perStorePredictBytes := predictedTiKVIndexBytes / uint64(capacity.StoreCount)
for _, store := range capacity.Stores {
storeRemaining := remainingTiKVBytes(store.AvailableBytes, perStorePredictBytes)
storeThreshold := uint64(float64(store.TotalBytes) * 0.15)
if storeRemaining < storeThreshold {
return dbterror.ErrIngestCheckEnvFailed.FastGenByArgs(
fmt.Sprintf("insufficient TiKV store capacity predicted for add index: store_id=%d store_available_bytes=%d predict_tikv_index_bytes=%d tikv_store_count=%d per_store_predict_bytes=%d store_total_bytes=%d store_remaining_bytes=%d store_threshold_bytes=%d",
store.StoreID, store.AvailableBytes, predictedTiKVIndexBytes, capacity.StoreCount, perStorePredictBytes, store.TotalBytes, storeRemaining, storeThreshold))
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

This precheck underestimates space for placement-constrained indexes.

The prediction is scaled by replica count, but the admission check still spreads those bytes across capacity.StoreCount and sums availability across every TiKV store. If a placement bundle pins replicas to only a subset of stores, untargeted stores are counted as usable capacity and each target store is undercharged, so the job can pass even when the eligible stores will breach the 15% floor.

Also applies to: 4180-4250

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 4144 - 4160, The check underestimates space
when indexes are placement-constrained because it divides
predictedTiKVIndexBytes by capacity.StoreCount and sums capacity.Stores
indiscriminately; change the logic in the cluster and per-store checks (where
clusterRemaining, perStorePredictBytes, remainingTiKVBytes, capacity.StoreCount
and capacity.Stores are used) to first compute the actual set of eligible stores
based on the index's placement/replica bundles (respecting pinned/targeted
stores), sum AvailableBytes only over those eligible stores for
clusterRemaining/clusterThreshold, and divide predictedTiKVIndexBytes across
only those eligible stores when computing perStorePredictBytes and
storeRemaining/storeThreshold; apply the same fix in the related block around
lines 4180-4250 that performs the same admission checks.

Comment thread pkg/ddl/reorg_util_test.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/ddl/index.go (2)

3665-3674: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Create a fresh timeout context for the retry.

Line 3665's timeout context is reused for both GetStore attempts. If the first RPC consumed that deadline or canceled the context, the retry fails immediately and the TiKV capacity snapshot is skipped under exactly the transient PD conditions this branch is meant to tolerate.

Suggested fix
-	reqCtx, cancel := context.WithTimeout(ctx, pdStoreStatsRequestTimeout)
-	defer cancel()
-	reqCtx = serviceClient.BuildGRPCTargetContext(reqCtx, true)
-	resp, err := pdpb.NewPDClient(serviceClient.GetClientConn()).GetStore(reqCtx, req)
+	call := func(sc sd.ServiceClient) (*pdpb.GetStoreResponse, error) {
+		reqCtx, cancel := context.WithTimeout(ctx, pdStoreStatsRequestTimeout)
+		defer cancel()
+		reqCtx = sc.BuildGRPCTargetContext(reqCtx, true)
+		return pdpb.NewPDClient(sc.GetClientConn()).GetStore(reqCtx, req)
+	}
+	resp, err := call(serviceClient)
 	if needRetryPDStoreStats(serviceClient, resp, err) {
 		serviceClient = serviceDiscovery.GetServiceClient()
 		if serviceClient != nil && serviceClient.GetClientConn() != nil {
-			reqCtx = serviceClient.BuildGRPCTargetContext(reqCtx, true)
-			resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetStore(reqCtx, req)
+			resp, err = call(serviceClient)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 3665 - 3674, The retry is reusing the original
reqCtx which may have exhausted its deadline; when needRetryPDStoreStats(...) is
true, create a fresh timeout context (e.g., newReqCtx, newCancel :=
context.WithTimeout(ctx, pdStoreStatsRequestTimeout)) and call
serviceClient.BuildGRPCTargetContext(newReqCtx, true) before invoking
pdpb.NewPDClient(...).GetStore for the retry, and ensure newCancel() is called
to release resources instead of reusing the original reqCtx/cancel.

4977-4979: ⚠️ Potential issue | 🟠 Major

Bound region sampling cost for precheck

Passing 0 as maxRegions to listTableRegionsWithClient makes iterateTableRegionsWithClient (page size tableRegionsPageSize = 128) walk all regions in the table key range until it hits the end/empty page, since the maxRegions limit/break logic only applies when maxRegions > 0. listSamplePredictionRegions then builds a full regions slice for every returned regionInfo, and only afterward pickSamplePredictionRegionsWithLimit truncates to the small budget (selection.regionCount, max 5). This lets PD region listing scale with total region count rather than the sampled-region budget.

Thread selection.regionCount (or another positive cap) into the region-listing path so fetching is bounded before sampling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 4977 - 4979, The current call to
listTableRegionsWithClient passes 0 (no cap) causing
iterateTableRegionsWithClient (page size tableRegionsPageSize = 128) to scan all
regions before sampling; change the call site (where tableStart, tableEnd and
listTableRegionsWithClient are used) to pass a positive cap based on the
sampling budget (e.g., selection.regionCount or a small multiple thereof) so
listing stops early; update any upstream calls like
listSamplePredictionRegions/pickSamplePredictionRegionsWithLimit to thread
selection.regionCount (or a defined positive limit) into
listTableRegionsWithClient so region fetching is bounded before building the
full regions slice.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/ddl/index.go`:
- Around line 3665-3674: The retry is reusing the original reqCtx which may have
exhausted its deadline; when needRetryPDStoreStats(...) is true, create a fresh
timeout context (e.g., newReqCtx, newCancel := context.WithTimeout(ctx,
pdStoreStatsRequestTimeout)) and call
serviceClient.BuildGRPCTargetContext(newReqCtx, true) before invoking
pdpb.NewPDClient(...).GetStore for the retry, and ensure newCancel() is called
to release resources instead of reusing the original reqCtx/cancel.
- Around line 4977-4979: The current call to listTableRegionsWithClient passes 0
(no cap) causing iterateTableRegionsWithClient (page size tableRegionsPageSize =
128) to scan all regions before sampling; change the call site (where
tableStart, tableEnd and listTableRegionsWithClient are used) to pass a positive
cap based on the sampling budget (e.g., selection.regionCount or a small
multiple thereof) so listing stops early; update any upstream calls like
listSamplePredictionRegions/pickSamplePredictionRegionsWithLimit to thread
selection.regionCount (or a defined positive limit) into
listTableRegionsWithClient so region fetching is bounded before building the
full regions slice.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4acd83fa-b442-4158-a96a-38b8a80446f8

📥 Commits

Reviewing files that changed from the base of the PR and between 1f63a3c and 2299d03.

📒 Files selected for processing (12)
  • pkg/ddl/backfilling_dist_executor.go
  • pkg/ddl/backfilling_import_cloud.go
  • pkg/ddl/backfilling_read_index.go
  • pkg/ddl/index.go
  • pkg/ddl/reorg_util_test.go
  • pkg/dxf/framework/taskexecutor/execute/interface.go
  • pkg/ingestor/ingestcli/client.go
  • pkg/ingestor/ingestcli/client_test.go
  • pkg/ingestor/ingestcli/interface.go
  • pkg/lightning/backend/local/job_worker.go
  • pkg/lightning/backend/local/job_worker_test.go
  • pkg/lightning/backend/local/region_job.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/ddl/reorg_util_test.go

@codecov

codecov Bot commented Jun 7, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.91133% with 692 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release-nextgen-202603@aeffcf4). Learn more about missing BASE report.

Additional details and impacted files
@@                     Coverage Diff                     @@
##             release-nextgen-202603     #68955   +/-   ##
===========================================================
  Coverage                          ?   76.1559%           
===========================================================
  Files                             ?       1936           
  Lines                             ?     541301           
  Branches                          ?          0           
===========================================================
  Hits                              ?     412233           
  Misses                            ?     129068           
  Partials                          ?          0           
Flag Coverage Δ
unit 76.1559% <65.9113%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <0.0000%> (?)
parser ∅ <0.0000%> (?)
br 48.7912% <0.0000%> (?)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 8, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

1 similar comment
@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 8, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tiprow

tiprow Bot commented Jun 8, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 8, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 9, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tiprow

tiprow Bot commented Jun 9, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@expxiaoli expxiaoli changed the title Release nextgen 202603 tikv precheck pkg/ddl, planner: add TiKV space precheck for DXF add-index Jun 10, 2026
@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 12, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 12, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@YangKeao YangKeao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review is still WIP.

Comment thread pkg/ddl/index.go Outdated
zap.Int("worker-cnt", workerCntLimit), zap.Int("required-slots", requiredSlots),
zap.String("task-key", taskKey))
rowSize := estimateTableRowSize(w.workCtx, w.store, w.sess.GetRestrictedSQLExecutor(), t)
var (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it'd be better to have a standalone function / method to represent the logic of TiKV size pre-check.

Then, the nested else branch can be replaced by return directly in if branch. Also, we can re-org some of the condition of the code blocks (e.g. many if initialCapacity != nil { can be replaced by if initialCapacity == nil { return }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move it to helper method

Comment thread pkg/ddl/index.go
if err != nil {
return result, err
}
result.UseStats = useStats

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that UseStats is only used for log, so a table will not be enforced if it's using pseudo stats. Is it expected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable. If stats are pseudo, we should skip enforcement and only log, since row-count scaling is unreliable.

Comment thread pkg/ddl/index.go Outdated
}
}

func scalePredictedBytesByReplicaCount(predictedBytes, replicaCount uint64) uint64 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this function is useless. It's only used in test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I will move it to test code.

Comment thread pkg/ddl/index.go
return writerCfg
}

func generateIndexKVsForRow(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This callback style function looks strange but for now I don't have better ideas. I understand that it's used to abstract some common logic between the prediction and the real writeChunk. Let me think whether we can have better ways later 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it’s not ideal, but I will keep it for now to avoid duplicating KV generation logic between prediction and writeChunk.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 16, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@expxiaoli

Copy link
Copy Markdown
Contributor Author

/retest-required

@tiprow

tiprow Bot commented Jun 16, 2026

Copy link
Copy Markdown

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@YangKeao YangKeao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the common logic of prediction and writeChunk, I also didn't have better idea🤦🏻‍♂️.

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 17, 2026
@YangKeao

Copy link
Copy Markdown
Member

@wjhuang2016 PTAL

@YangKeao

Copy link
Copy Markdown
Member

/hold

Feel free to unhold once the master PR merged.

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2026
@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 30, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

[LGTM Timeline notifier]

Timeline:

  • 2026-06-17 01:00:32.710682051 +0000 UTC m=+1526533.780999451: ☑️ agreed by YangKeao.
  • 2026-06-30 05:47:42.522623445 +0000 UTC m=+104804.223002868: ☑️ agreed by wjhuang2016.

@yudongusa yudongusa left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open document PR on this

@ti-chi-bot

ti-chi-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: GMHDBJD, wjhuang2016, YangKeao, yudongusa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants