ingest: EtcdSafePointKV gets empty PD address in DDL path, causing create kv client to fail #66125

@King-Dylan

Bug Report

When add-index runs via the DDL distributed framework (DXF) with ingest, the backfill step can fail with:

  • [Lightning:KV:ErrCreateKVClient] create kv client error: context deadline exceeded
  • etcd client: dial tcp: missing address

So ingest backfill can fail because the etcd client used for safe point KV is created with empty PD endpoints.
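
The "missing address" text matches the error Go's standard dialer returns for an empty target. The following is a minimal sketch using only the standard library, assuming the etcd client ultimately hands its empty endpoint to the Go dialer; `dialEmpty` is a hypothetical helper and not part of TiDB or etcd.

```go
package main

import (
	"fmt"
	"net"
)

// dialEmpty (hypothetical helper) tries to open a TCP connection to an
// empty address, which is what a client configured with an empty
// endpoint ends up attempting.
func dialEmpty() error {
	conn, err := net.Dial("tcp", "")
	if err != nil {
		return err
	}
	// Not expected to be reached: an empty address is rejected before
	// any network I/O happens.
	return conn.Close()
}

func main() {
	fmt.Println(dialEmpty()) // dial tcp: missing address
}
```

Because the address is rejected on every attempt, retrying cannot succeed, which is consistent with the step running until its context deadline expires.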

Introduced by: #55433 (ddl: directly use BackendConfig rather than use lightning config)

That PR switched DDL ingest to build local.BackendConfig via genConfig() in pkg/ddl/ingest/config.go instead of the full Lightning config. genConfig() does not set PDAddr; it only sets fields like LocalStoreDir, KeyspaceName, concurrency, etc. So in the DDL ingest path, BackendConfig.PDAddr is always the zero value (empty string).
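
To illustrate why the field ends up empty, here is a small sketch of the zero-value behavior. The `backendConfig` struct and `genConfigSketch` function are hypothetical, trimmed-down stand-ins; they are not the real local.BackendConfig or genConfig().

```go
package main

import "fmt"

// backendConfig is a hypothetical stand-in for local.BackendConfig,
// reduced to three of the fields named in this report.
type backendConfig struct {
	PDAddr        string
	LocalStoreDir string
	KeyspaceName  string
}

// genConfigSketch mimics a generator that fills some fields but never
// assigns PDAddr, so PDAddr keeps Go's zero value for strings.
func genConfigSketch(dir, keyspace string) backendConfig {
	return backendConfig{
		LocalStoreDir: dir,
		KeyspaceName:  keyspace,
	}
}

func main() {
	cfg := genConfigSketch("/tmp/ingest", "")
	fmt.Printf("PDAddr=%q\n", cfg.PDAddr) // PDAddr=""
}
```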

pkg/lightning/backend/local/local.go then had two uses of PD addresses:

  • PD client: pdAddrs from pdSvcDiscovery.GetServiceURLs() when pdSvcDiscovery != nil (DDL path), so the PD client gets valid addresses.
  • Etcd safe point KV: it used config.PDAddr for NewEtcdSafePointKV(), which in the DDL path is never set and stays empty.

So when DXF runs add-index and creates a local backend on an executor node, the etcd client is created with empty endpoints → "missing address" and the step fails (often reported as context deadline exceeded).
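
One way to avoid the mismatch between the two address sources is to resolve a single endpoint list and reject an empty one before building any client. The sketch below is an assumption about how that could look, not the change made in #59757; `safePointEndpoints` is a hypothetical helper, and the discovered URLs stand in for whatever pdSvcDiscovery.GetServiceURLs() returns.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// safePointEndpoints (hypothetical helper) prefers the discovered PD URLs,
// falls back to the comma-separated configured address, and fails fast
// when neither source yields a usable endpoint.
func safePointEndpoints(discovered []string, configPDAddr string) ([]string, error) {
	var out []string
	for _, u := range discovered {
		if u = strings.TrimSpace(u); u != "" {
			out = append(out, u)
		}
	}
	if len(out) == 0 {
		// strings.Split on an empty string yields one empty element,
		// so blank entries are filtered here as well.
		for _, a := range strings.Split(configPDAddr, ",") {
			if a = strings.TrimSpace(a); a != "" {
				out = append(out, a)
			}
		}
	}
	if len(out) == 0 {
		return nil, errors.New("no PD endpoints available for etcd safe point KV")
	}
	return out, nil
}

func main() {
	// DDL path: the configured address is empty, discovery has addresses.
	eps, err := safePointEndpoints([]string{"http://pd-0:2379"}, "")
	fmt.Println(eps, err)

	// Both sources empty: fail immediately instead of dialing "".
	_, err = safePointEndpoints(nil, "")
	fmt.Println(err)
}
```

Failing fast with an explicit message would surface the configuration gap directly, rather than as a context deadline exceeded error after repeated dial attempts.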

Note: This is already fixed on master by #59757 (*: upgrade to the latest client-go).

1. Minimal reproduce step (Required)

  1. Deploy a TiDB cluster (e.g. 8.5.x) with PD and TiKV.
  2. Create a table with enough data so that add-index uses the ingest path (e.g. tens of thousands of rows or more).
  3. Ensure add-index runs via the distributed reorg path (e.g. tidb_enable_dist_task / DDL distributed framework enabled, or conditions that make the job use IsDistReorg).
  4. Run ALTER TABLE t ADD INDEX idx(c); (or add primary key / other index that uses ingest).
  5. Observe the backfill step on the executor node: it creates a local backend and then fails.

2. What did you expect to see? (Required)

Add-index backfill (ingest) should complete successfully: the executor creates a local backend, connects to PD/etcd with valid addresses, and the backfill step succeeds.

3. What did you see instead? (Required)

The backfill step fails with:

  • TiDB log: [Lightning:KV:ErrCreateKVClient] create kv client error: context deadline exceeded
  • TiDB log: ["build ingest backend failed"] ["job ID"=...] [error="[Lightning:KV:ErrCreateKVClient]create kv client error: context deadline exceeded"]
  • etcd client log (if visible): dial tcp: missing address and/or retrying of unary invoker failed ... latest balancer error: last connection error: ... dial tcp: missing address

The step often hits its context deadline (~5s) and then retries repeatedly with the same error.

4. What is your TiDB version? (Required)

v8.5.2
