pkg/ddl, planner: add TiKV space precheck for DXF add-index by expxiaoli · Pull Request #68955 · pingcap/tidb

expxiaoli · 2026-06-04T09:35:22Z

This PR cherry-picks #68490 to release-nextgen-202603 and adds follow-up changes for Next-Gen SST estimation and ingested SST observability.

What problem does this PR solve?

Issue Number: ref #68354

Problem Summary:

DXF add-index can generate a large amount of SST data and exhaust TiKV disk space, but TiDB does not estimate the physical write footprint before submitting the distributed task.

TiDB also lacks reliable task-level observability for comparing predicted bytes with the SST bytes successfully ingested, especially on Next-Gen clusters where the CSE storage format differs from classic TiKV SSTs.

What changed and how does it work?

Submission-time prediction and precheck

Collect an initial TiKV capacity snapshot from PD before the first DXF add-index task submission.
Use bounded block sampling to generate representative index KVs through the same helpers used by the actual backfill path.
Estimate the physical SST footprint:
- For classic TiKV, model MVCC default/write CF SST encoding with Pebble.
- For Next-Gen, model CSE block layout, compression, MVCC metadata, and BinaryFuse8 filter bytes.
- Estimate the logical and MVCC footprints separately to report MVCC overhead.
Scale the sampled estimate by table row count and TiKV replica count.
Persist the single-replica estimate, all-replica estimate, MVCC overhead, replica information, sampling diagnostics, and initial capacity snapshot in the DXF task metadata.
Check the projected remaining capacity:
- cluster remaining space must be at least 20% of total capacity
- each store's projected remaining space must be at least 15% of its total capacity
Bound the complete submission-time capacity collection and prediction process to five seconds. Capability, collection, or prediction failures are logged and skipped rather than blocking task submission.
Control enforcement with the global variable enforce_disk_space_precheck_before_add_index:
- OFF by default: log insufficient capacity without rejecting the DDL
- ON: reject the add-index job when the capacity check fails

Ingested SST observability

Record successfully ingested SST identities and sizes from both classic and Next-Gen ingest paths.
Deduplicate SST identities so retries do not inflate the observed byte count.
Aggregate the counters from read-index and write-and-ingest subtasks.
At successful task completion, log:
- block_sample_predicted_tikv_index_single_replica_bytes
- block_sample_predicted_tikv_index_all_replica_bytes
- block_sample_mvcc_overhead_total_bytes
- logical_index_kv_bytes
- ingested_sst_bytes
- ingest source and reliability status
- initial TiKV capacity and sampling diagnostics

Shared helper and correctness changes

Reuse index KV generation, virtual-column, multi-valued-index, global-index, and common-handle helpers between prediction and actual backfill paths.
Extract TiKV transaction SST key/value encoding helpers for prediction.
Fix restored-data handling for common handles in RECOVER INDEX.
Update Bazel dependencies and test shard metadata for the release branch.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test

Validation:

make bazel_prepare
Targeted failpoint unit tests for DDL prediction, capacity collection, precheck enforcement, SST observation, table index encoding, and the new system variable
RealTiKV TestDXFAddIndexRealtimeSummary, including verification of reported ingested SST bytes on Next-Gen
make lint
git diff --check
Next-Gen build, unit-test, MySQL-client, and real-cluster integration CI jobs

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Improve DXF add-index by estimating its physical TiKV SST footprint before task submission, optionally rejecting jobs with insufficient disk space, and reporting successfully ingested SST bytes for classic and Next-Gen clusters.

tiprow · 2026-06-04T09:35:35Z

Hi @expxiaoli. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-06-04T09:35:46Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds TiKV capacity prediction and optional disk-space precheck for distributed add-index; refactors index KV generation and handle-restoration APIs; records ingested SST identities/sizes; extracts Lightning SST encoders; expands mocks and tests; updates Bazel deps and wiring.

Changes

Add-Index TiKV Capacity Prediction with Refactoring

Layer / File(s)	Summary
Metadata, config, and wiring `pkg/ddl/backfilling_dist_executor.go`, `pkg/sessionctx/vardef/tidb_vars.go`, `pkg/sessionctx/variable/sysvar.go`	`BackfillTaskMeta` extended with TiKV capacity/prediction fields; `TiKVStoreCapacity`/`TiKVClusterCapacity` added; new sysvar `TiDBEnforceDiskSpacePrecheckBeforeAddIndex` and global atomic exposed; executor wiring passes initial-capacity presence flag.
Handle restoration & virtual-column helper exports `pkg/table/tables/tables.go`, `pkg/ddl/copr/copr_ctx.go`, `pkg/executor/admin.go`, `pkg/ddl/index_cop.go`, `pkg/ddl/copr/copr_ctx_test.go`	Introduces exported `TryGetHandleRestoredData`; renames/exports virtual-column offsets/types helper; updates recover-index and test call sites; removes legacy `getRestoreData`.
Multi-value index helpers & encoding `pkg/table/tables/index.go`, `pkg/table/tables/index_test.go`	Introduces `BuildMultiValueIndexValueGroups` and `EncodeRawIndexKeyValues`; MV expansion and raw-key encoding are tested for truncation and JSON-array expansion/deduplication.
Index KV generation refactor `pkg/ddl/index.go`, `pkg/ddl/backfilling_operators.go`, `pkg/ddl/backfilling_clean_s3.go`	Centralizes per-row index KV emission into `generateIndexKVsForRow`, extracts `indexKVHandleForPhysicalTable`, updates `writeChunk` to accept `physicalID`, simplifies `writeOneKV`, and delegates cleanup metering to helper.
TiKV capacity prediction & precheck pipeline `pkg/ddl/index.go`	Adds PD gRPC store-stats collection, block-sampled index-bytes prediction with MVCC estimation, replica-count resolution (placement/PD), scaling and optional disk-space precheck, records prediction inputs in `BackfillTaskMeta`, and logs observed ingested SST usage.
Region iteration & PD utilities `pkg/ddl/reorg_util.go`, `pkg/ddl/reorg_util_test.go`	Adds `helperFromKVStorage`, `pdHTTPClientFromStorage`, `listTableRegions*`, and `iterateTableRegionsWithClient` to page PD region results and centralize region-based sampling; tests expanded with mocks.
Ingested SST recording & summary tracking `pkg/dxf/framework/taskexecutor/execute/interface.go`, `pkg/ddl/backfilling_import_cloud.go`, `pkg/ddl/backfilling_read_index.go`	Adds SubtaskSummary ingest counters and `IngestedSSTCollector`; read/import executors and ingestCollector can record deduplicated ingested SST identities/sizes and update summary counters.
Ingest CLI & worker plumbing, tests `pkg/ingestor/ingestcli/`, `pkg/lightning/backend/local/`, `pkg/lightning/tikv/local_sst_writer.go`, tests	`nextGenSSTMeta` gains `Size`; `WriteResponse.GetSSTMeta()` added; SST key/value encoders extracted; job worker records ingested SST meta (next-gen/classic identities) to collector; tests updated with recording collectors and multi-ingest partial-failure scenarios.
Tests & mocks for prediction/observed-capacity `pkg/ddl/reorg_util_test.go`, `tests/realtikvtest/addindextest2/global_sort_test.go`	Adds PD/TiKV mock infrastructure (bufconn, mock PD HTTP client/replicate config), block-sample prediction helpers, and tests validating prediction, replica parsing/scaling and observed capacity via failpoint hook.
Build & Bazel deps `pkg/ddl/BUILD.bazel`, `pkg/table/tables/BUILD.bazel`, `pkg/lightning/backend/local/BUILD.bazel`	Updates Bazel deps to include Pebble objstorage/sstable/vfs, PD client targets, lightning tikv, adds gRPC bufconn test dep, and bumps test shard count; local_test deps include execute package.

Sequence Diagram

sequenceDiagram
  participant Worker as AddIndex Worker
  participant PD as PD (gRPC/HTTP)
  participant TiKV as TiKV Stores
  participant Storage as Lightning / KV Storage
  participant Collector as SubtaskSummary Collector
  Worker->>PD: collect store stats and placement info
  Worker->>TiKV: block-sample region snapshot reads
  Worker->>Worker: predict index bytes, scale by replica count
  Worker->>PD: optionally enforce disk-space precheck decision
  Worker->>Storage: submit dist add-index task (SST ingest)
  Storage->>Worker: return ingest metadata (SST id/size)
  Worker->>Collector: record ingested SST identities/sizes into SubtaskSummary

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

pingcap/tidb#68833: Overlaps with partial-index handling used by index KV generation changes.
pingcap/tidb#68824: Touches backfill step wiring and executor initialization related to backfilling dist executor changes.

Suggested labels

cherry-pick-approved, type/cherry-pick-for-release-nextgen-202603, approved, lgtm

Suggested reviewers

wjhuang2016
D3Hunter
joechenrh
gengliqi

Poem

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.87% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding TiKV space precheck for DXF add-index, which aligns with the extensive changes across DDL, prediction, precheck, and observability components described in the PR objectives.
Description check	✅ Passed	The PR description comprehensively addresses the required template sections: it clearly identifies the issue (`#68354`), explains the problem (DXF add-index can exhaust TiKV disk space without estimation), details what changed (prediction, precheck, observability), includes a complete checklist with appropriate selections, covers side effects and documentation impacts, and provides a release note aligned with the style guide.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/ddl/index.go (1)

3195-3203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't skip observed-capacity logging when resuming an already-succeeded task.

This early return bypasses scheduleDistTaskObservedTiKVUsage(...), so a paused/resumed add-index job never emits the final predicted-vs-observed TiKV usage log even though the task history and metadata are still available.

Suggested fix

 		if task.State == proto.TaskStateSucceed {
 			w.updateDistTaskRowCount(taskKey, reorgInfo.Job.ID)
+			w.scheduleDistTaskObservedTiKVUsage(taskManager, taskKey, reorgInfo.Job.ID)
 			logutil.DDLLogger().Info(
 				"task succeed, start to resume the ddl job",
 				zap.String("task-key", taskKey))
 			return nil
 		}

Also applies to: 3421-3424

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 3195 - 3203, The early return when task.State
== proto.TaskStateSucceed skips scheduleDistTaskObservedTiKVUsage, so ensure the
observed-capacity logging still runs before resuming an already-succeeded task:
call scheduleDistTaskObservedTiKVUsage with the same identifiers used elsewhere
(use taskKey and reorgInfo.Job.ID) and any required context/params, then call
updateDistTaskRowCount and the resume log, and only then return; apply the same
change to the analogous block around lines 3421-3424 so both paths emit the
final predicted-vs-observed TiKV usage log.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/ddl/index.go`:
- Around line 3250-3331: Wrap the PD-backed precheck calls in a short timeout
derived from ctx (e.g. ctxWithTimeout, cancel := context.WithTimeout(ctx,
5*time.Second); defer cancel()) and pass ctxWithTimeout to the functions that
perform PD/TiKV RPCs so submission cannot hang: use the timed context when
calling w.predictTiKVIndexBytesBlockSample, w.resolveAddIndexTiKVReplicaCount,
canRunTiKVSpacePrecheck, and collectTiKVStoreCapacity (and any other
PD/region/capacity helpers invoked in this block); ensure you call cancel()
after creating the context.
- Around line 4144-4160: The check underestimates space when indexes are
placement-constrained because it divides predictedTiKVIndexBytes by
capacity.StoreCount and sums capacity.Stores indiscriminately; change the logic
in the cluster and per-store checks (where clusterRemaining,
perStorePredictBytes, remainingTiKVBytes, capacity.StoreCount and
capacity.Stores are used) to first compute the actual set of eligible stores
based on the index's placement/replica bundles (respecting pinned/targeted
stores), sum AvailableBytes only over those eligible stores for
clusterRemaining/clusterThreshold, and divide predictedTiKVIndexBytes across
only those eligible stores when computing perStorePredictBytes and
storeRemaining/storeThreshold; apply the same fix in the related block around
lines 4180-4250 that performs the same admission checks.
- Around line 3346-3358: The code stores peak MVCC bytes in the steady field:
blockSamplePrediction.PredictedBytes (set from mvccPhysicalBytes) is being
persisted into BlockSampleSteadyPredictedTiKVIndexBytes; change the assignment
so the MVCC peak value (mvccPhysicalBytes or the variable used when computing
peak) is saved to the appropriate "peak"/peak-named field and ensure
blockSamplePrediction.PredictedBytes is used for the true steady estimate;
update the three occurrences that mirror this pattern (around the symbols
BlockSampleSteadyPredictedTiKVIndexBytes, blockSamplePrediction.PredictedBytes,
and mvccPhysicalBytes) so steady vs peak are written to the correct fields and
logs.

In `@pkg/ddl/reorg_util_test.go`:
- Around line 355-360: Replace usage of context.Background() passed to
grpc.DialContext and RPC calls with a short timeout/deadline context to prevent
CI hangs: create a context with a timeout (e.g., context.WithTimeout) before
calling grpc.DialContext (the block that constructs conn via grpc.DialContext
and the RPC call sites around the listener.Dial() usage), use that context for
the dial and any subsequent RPCs, defer cancel() and handle the context deadline
error in the test (still calling require.NoError on the dial/RPC error) so tests
fail fast if the mock server is not responsive.

---

Outside diff comments:
In `@pkg/ddl/index.go`:
- Around line 3195-3203: The early return when task.State ==
proto.TaskStateSucceed skips scheduleDistTaskObservedTiKVUsage, so ensure the
observed-capacity logging still runs before resuming an already-succeeded task:
call scheduleDistTaskObservedTiKVUsage with the same identifiers used elsewhere
(use taskKey and reorgInfo.Job.ID) and any required context/params, then call
updateDistTaskRowCount and the resume log, and only then return; apply the same
change to the analogous block around lines 3421-3424 so both paths emit the
final predicted-vs-observed TiKV usage log.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 159a21d1-a5b2-42af-a059-5845d424fb3f

📥 Commits

Reviewing files that changed from the base of the PR and between e189973 and 1f63a3c.

📒 Files selected for processing (21)

pkg/ddl/BUILD.bazel
pkg/ddl/backfilling_clean_s3.go
pkg/ddl/backfilling_dist_executor.go
pkg/ddl/backfilling_operators.go
pkg/ddl/copr/copr_ctx.go
pkg/ddl/copr/copr_ctx_test.go
pkg/ddl/index.go
pkg/ddl/index_cop.go
pkg/ddl/reorg_util.go
pkg/ddl/reorg_util_test.go
pkg/executor/admin.go
pkg/lightning/tikv/local_sst_writer.go
pkg/sessionctx/vardef/tidb_vars.go
pkg/sessionctx/variable/sysvar.go
pkg/sessionctx/variable/sysvar_test.go
pkg/table/tables/BUILD.bazel
pkg/table/tables/index.go
pkg/table/tables/index_test.go
pkg/table/tables/mutation_checker_test.go
pkg/table/tables/tables.go
tests/realtikvtest/addindextest2/global_sort_test.go

💤 Files with no reviewable changes (1)

pkg/ddl/index_cop.go

coderabbitai · 2026-06-04T09:43:40Z

+	clusterRemaining := remainingTiKVBytes(capacity.AvailableBytes, predictedTiKVIndexBytes)
+	clusterThreshold := uint64(float64(capacity.TotalBytes) * 0.20)
+	if clusterRemaining < clusterThreshold {
+		return dbterror.ErrIngestCheckEnvFailed.FastGenByArgs(
+			fmt.Sprintf("insufficient TiKV cluster capacity predicted for add index: cluster_available_bytes=%d predict_tikv_index_bytes=%d cluster_total_bytes=%d cluster_remaining_bytes=%d cluster_threshold_bytes=%d",
+				capacity.AvailableBytes, predictedTiKVIndexBytes, capacity.TotalBytes, clusterRemaining, clusterThreshold))
+	}
+
+	perStorePredictBytes := predictedTiKVIndexBytes / uint64(capacity.StoreCount)
+	for _, store := range capacity.Stores {
+		storeRemaining := remainingTiKVBytes(store.AvailableBytes, perStorePredictBytes)
+		storeThreshold := uint64(float64(store.TotalBytes) * 0.15)
+		if storeRemaining < storeThreshold {
+			return dbterror.ErrIngestCheckEnvFailed.FastGenByArgs(
+				fmt.Sprintf("insufficient TiKV store capacity predicted for add index: store_id=%d store_available_bytes=%d predict_tikv_index_bytes=%d tikv_store_count=%d per_store_predict_bytes=%d store_total_bytes=%d store_remaining_bytes=%d store_threshold_bytes=%d",
+					store.StoreID, store.AvailableBytes, predictedTiKVIndexBytes, capacity.StoreCount, perStorePredictBytes, store.TotalBytes, storeRemaining, storeThreshold))
+		}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

This precheck underestimates space for placement-constrained indexes.

The prediction is scaled by replica count, but the admission check still spreads those bytes across capacity.StoreCount and sums availability across every TiKV store. If a placement bundle pins replicas to only a subset of stores, untargeted stores are counted as usable capacity and each target store is undercharged, so the job can pass even when the eligible stores will breach the 15% floor.

Also applies to: 4180-4250

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ddl/index.go` around lines 4144 - 4160, The check underestimates space when indexes are placement-constrained because it divides predictedTiKVIndexBytes by capacity.StoreCount and sums capacity.Stores indiscriminately; change the logic in the cluster and per-store checks (where clusterRemaining, perStorePredictBytes, remainingTiKVBytes, capacity.StoreCount and capacity.Stores are used) to first compute the actual set of eligible stores based on the index's placement/replica bundles (respecting pinned/targeted stores), sum AvailableBytes only over those eligible stores for clusterRemaining/clusterThreshold, and divide predictedTiKVIndexBytes across only those eligible stores when computing perStorePredictBytes and storeRemaining/storeThreshold; apply the same fix in the related block around lines 4180-4250 that performs the same admission checks.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/ddl/index.go`:
- Around line 3665-3674: The retry is reusing the original reqCtx which may have
exhausted its deadline; when needRetryPDStoreStats(...) is true, create a fresh
timeout context (e.g., newReqCtx, newCancel := context.WithTimeout(ctx,
pdStoreStatsRequestTimeout)) and call
serviceClient.BuildGRPCTargetContext(newReqCtx, true) before invoking
pdpb.NewPDClient(...).GetStore for the retry, and ensure newCancel() is called
to release resources instead of reusing the original reqCtx/cancel.
- Around line 4977-4979: The current call to listTableRegionsWithClient passes 0
(no cap) causing iterateTableRegionsWithClient (page size tableRegionsPageSize =
128) to scan all regions before sampling; change the call site (where
tableStart, tableEnd and listTableRegionsWithClient are used) to pass a positive
cap based on the sampling budget (e.g., selection.regionCount or a small
multiple thereof) so listing stops early; update any upstream calls like
listSamplePredictionRegions/pickSamplePredictionRegionsWithLimit to thread
selection.regionCount (or a defined positive limit) into
listTableRegionsWithClient so region fetching is bounded before building the
full regions slice.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4acd83fa-b442-4158-a96a-38b8a80446f8

📥 Commits

Reviewing files that changed from the base of the PR and between 1f63a3c and 2299d03.

📒 Files selected for processing (12)

pkg/ddl/backfilling_dist_executor.go
pkg/ddl/backfilling_import_cloud.go
pkg/ddl/backfilling_read_index.go
pkg/ddl/index.go
pkg/ddl/reorg_util_test.go
pkg/dxf/framework/taskexecutor/execute/interface.go
pkg/ingestor/ingestcli/client.go
pkg/ingestor/ingestcli/client_test.go
pkg/ingestor/ingestcli/interface.go
pkg/lightning/backend/local/job_worker.go
pkg/lightning/backend/local/job_worker_test.go
pkg/lightning/backend/local/region_job.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/ddl/reorg_util_test.go

codecov · 2026-06-07T10:40:40Z

Codecov Report

❌ Patch coverage is 65.91133% with 692 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release-nextgen-202603@aeffcf4). Learn more about missing BASE report.

Additional details and impacted files

@@                     Coverage Diff                     @@
##             release-nextgen-202603     #68955   +/-   ##
===========================================================
  Coverage                          ?   76.1559%           
===========================================================
  Files                             ?       1936           
  Lines                             ?     541301           
  Branches                          ?          0           
===========================================================
  Hits                              ?     412233           
  Misses                            ?     129068           
  Partials                          ?          0

Flag	Coverage Δ
unit	`76.1559% <65.9113%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`61.5065% <0.0000%> (?)`
parser	`∅ <0.0000%> (?)`
br	`48.7912% <0.0000%> (?)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

expxiaoli · 2026-06-08T07:56:12Z

/retest-required

tiprow · 2026-06-08T07:56:37Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

expxiaoli · 2026-06-08T08:27:42Z

/retest-required

expxiaoli · 2026-06-08T08:28:06Z

/retest-required

tiprow · 2026-06-08T08:28:10Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tiprow · 2026-06-08T08:28:34Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

expxiaoli · 2026-06-08T08:35:54Z

/retest-required

tiprow · 2026-06-08T08:36:22Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

expxiaoli · 2026-06-09T03:19:29Z

/retest-required

tiprow · 2026-06-09T03:19:55Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

(cherry picked from commit c2bdb41)

(cherry picked from commit cb5558d)

(cherry picked from commit 78f814b)

(cherry picked from commit 5408035)

(cherry picked from commit 5af3a42)

(cherry picked from commit 419c0d1)

(cherry picked from commit 2d1e9cf)

(cherry picked from commit 64bd5e5)

(cherry picked from commit bb76450)

(cherry picked from commit 84866c1)

(cherry picked from commit 6f33aa0)

(cherry picked from commit 6a09801)

tiprow · 2026-06-09T08:41:24Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

expxiaoli · 2026-06-12T02:22:10Z

/retest-required

tiprow · 2026-06-12T02:22:41Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

expxiaoli · 2026-06-12T03:41:27Z

/retest-required

tiprow · 2026-06-12T03:41:54Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

YangKeao

Review is still WIP.

YangKeao · 2026-06-15T15:50:24Z

 			zap.Int("worker-cnt", workerCntLimit), zap.Int("required-slots", requiredSlots),
 			zap.String("task-key", taskKey))
 		rowSize := estimateTableRowSize(w.workCtx, w.store, w.sess.GetRestrictedSQLExecutor(), t)
+		var (


IMO it'd be better to have a standalone function / method to represent the logic of TiKV size pre-check.

Then, the nested else branch can be replaced by return directly in if branch. Also, we can re-org some of the condition of the code blocks (e.g. many if initialCapacity != nil { can be replaced by if initialCapacity == nil { return }

I will move it to helper method

YangKeao · 2026-06-15T16:08:29Z

+	if err != nil {
+		return result, err
+	}
+	result.UseStats = useStats


It seems that UseStats is only used for log, so a table will not be enforced if it's using pseudo stats. Is it expected?

Reasonable. If stats are pseudo, we should skip enforcement and only log, since row-count scaling is unreliable.

YangKeao · 2026-06-15T16:11:31Z

+	}
+}
+
+func scalePredictedBytesByReplicaCount(predictedBytes, replicaCount uint64) uint64 {


It seems that this function is useless. It's only used in test.

Great catch. I will move it to test code.

YangKeao · 2026-06-15T17:09:10Z

 	return writerCfg
 }

+func generateIndexKVsForRow(


This callback style function looks strange but for now I don't have better ideas. I understand that it's used to abstract some common logic between the prediction and the real writeChunk. Let me think whether we can have better ways later 🤔

Agreed it’s not ideal, but I will keep it for now to avoid duplicating KV generation logic between prediction and writeChunk.

expxiaoli · 2026-06-16T07:51:38Z

/retest-required

tiprow · 2026-06-16T07:52:02Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

expxiaoli · 2026-06-16T10:10:08Z

/retest-required

tiprow · 2026-06-16T10:10:33Z

@expxiaoli: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

YangKeao

For the common logic of prediction and writeChunk, I also didn't have better idea🤦🏻‍♂️.

YangKeao · 2026-06-17T01:00:46Z

@wjhuang2016 PTAL

YangKeao · 2026-06-22T02:38:04Z

/hold

Feel free to unhold once the master PR merged.

ti-chi-bot · 2026-06-30T05:47:43Z

[LGTM Timeline notifier]

Timeline:

2026-06-17 01:00:32.710682051 +0000 UTC m=+1526533.780999451: ☑️ agreed by YangKeao.
2026-06-30 05:47:42.522623445 +0000 UTC m=+104804.223002868: ☑️ agreed by wjhuang2016.

yudongusa

Please open document PR on this

ti-chi-bot · 2026-07-02T06:47:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: GMHDBJD, wjhuang2016, YangKeao, yudongusa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [GMHDBJD,YangKeao,wjhuang2016]
~~pkg/ddl/OWNERS~~ [GMHDBJD,wjhuang2016]
~~pkg/dxf/OWNERS~~ [GMHDBJD,wjhuang2016]
~~pkg/ingestor/OWNERS~~ [GMHDBJD,wjhuang2016]
~~pkg/lightning/OWNERS~~ [GMHDBJD]
~~pkg/sessionctx/vardef/OWNERS~~ [yudongusa]
~~pkg/sessionctx/variable/OWNERS~~ [yudongusa]
~~pkg/table/OWNERS~~ [GMHDBJD,wjhuang2016]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot Bot added do-not-merge/invalid-title release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

expxiaoli added 12 commits June 9, 2026 15:23

ddl: log observed TiKV capacity increase for DXF add-index

7a55621

(cherry picked from commit c2bdb41)

pkg/ddl,planner: add TiKV space precheck for DXF add-index

41982a5

(cherry picked from commit cb5558d)

pkg/ddl: fix bazel test deps for add-index precheck

11baf9b

(cherry picked from commit 78f814b)

pkg/ddl: fix sample region pagination and flaky test

ccd12f9

(cherry picked from commit 5408035)

pkg/ddl,planner: fix CI lint issues

79b2096

(cherry picked from commit 5af3a42)

ddl: skip add-index precheck without PD HTTP client

d7622fa

(cherry picked from commit 419c0d1)

pkg/ddl: split sample TiKV index prediction modes

20963da

(cherry picked from commit 2d1e9cf)

pkg/ddl: sync bazel deps for sample predictor

6e4f786

(cherry picked from commit 64bd5e5)

pkg/ddl: prealloc add-index sampled KVs

a03f90b

(cherry picked from commit bb76450)

pkg/ddl: keep single add-index sample predictor

7a4bb7b

(cherry picked from commit 84866c1)

ddl: add block sample TiKV index prediction

f2458e1

(cherry picked from commit 6f33aa0)

ddl: skip add index prediction failures

732cc71

(cherry picked from commit 6a09801)

ddl: split next-gen CSE logical estimator

48b8408

expxiaoli changed the title ~~Release nextgen 202603 tikv precheck~~ pkg/ddl, planner: add TiKV space precheck for DXF add-index Jun 10, 2026

ti-chi-bot Bot removed the do-not-merge/invalid-title label Jun 10, 2026

expxiaoli added 2 commits June 11, 2026 17:43

ddl: weight add-index precheck replicas by partition rows

f8079a8

ddl: add missing bazel deps for grpc test imports

cff1e2d

YangKeao reviewed Jun 15, 2026

View reviewed changes

pkg/ddl: refine add index TiKV space precheck

88b9579

YangKeao approved these changes Jun 17, 2026

View reviewed changes

ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 17, 2026

ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2026

wjhuang2016 approved these changes Jun 30, 2026

View reviewed changes

ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 30, 2026

yudongusa approved these changes Jul 2, 2026

View reviewed changes

GMHDBJD approved these changes Jul 2, 2026

View reviewed changes

ti-chi-bot Bot added the approved label Jul 2, 2026

expxiaoli mentioned this pull request Jul 2, 2026

dxf: document ADD INDEX disk space precheck pingcap/docs#23190

Open

12 tasks

Uh oh!

Conversation

expxiaoli commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What changed and how does it work?

Submission-time prediction and precheck

Ingested SST observability

Shared helper and correctness changes

Check List

Release note

Uh oh!

tiprow Bot commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

expxiaoli commented Jun 8, 2026

Uh oh!

tiprow Bot commented Jun 8, 2026

Uh oh!

expxiaoli commented Jun 8, 2026

Uh oh!

expxiaoli commented Jun 8, 2026

Uh oh!

tiprow Bot commented Jun 8, 2026

Uh oh!

tiprow Bot commented Jun 8, 2026

Uh oh!

expxiaoli commented Jun 8, 2026

Uh oh!

tiprow Bot commented Jun 8, 2026

Uh oh!

expxiaoli commented Jun 9, 2026

Uh oh!

tiprow Bot commented Jun 9, 2026

Uh oh!

tiprow Bot commented Jun 9, 2026

Uh oh!

expxiaoli commented Jun 12, 2026

Uh oh!

tiprow Bot commented Jun 12, 2026

Uh oh!

expxiaoli commented Jun 12, 2026

Uh oh!

tiprow Bot commented Jun 12, 2026

Uh oh!

YangKeao left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

expxiaoli commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

codecov Bot commented Jun 7, 2026 •

edited

Loading