What did you do?
Investigated flaky presubmit failures on PR #4426, where these two jobs failed with "Failed to start TiDB":
- pull-cdc-mysql-integration-light-next-gen, build 608
- pull-cdc-pulsar-integration-light-next-gen, build 84
Both archived artifacts show the same upstream bootstrap sequence:
- upstream PD logs: [PD:keyspace:ErrRegionSplitTimeout] region split timeout
- upstream PD logs: failed to create pre-alloc keyspace for keyspace1
- upstream tidb-system then retries: Load keyspace SYSTEM failed with ErrKeyspaceNotFound
- the integration bootstrap script eventually times out in check_tidb_health and prints Failed to start TiDB
The shared bootstrap path is tests/integration_tests/_utils/start_tidb_cluster_nextgen. It waits for PD health, but it does not wait for next-gen keyspace pre-allocation to finish before starting the first SYSTEM TiDB.
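One possible shape for the missing wait, as a minimal sketch rather than the actual fix: a generic retry helper plus a readiness probe that runs between the PD health check and the first SYSTEM TiDB start. The names wait_for and keyspace_ready are hypothetical and do not exist in the repository. The probe assumes PD exposes keyspace metadata at /pd/api/v2/keyspaces/<name> and reports a state of ENABLED once the keyspace is usable; both the endpoint and the response shape should be verified against the PD build used by the job.

```shell
#!/usr/bin/env bash

# wait_for: retry a probe command until it succeeds or the attempt budget is spent.
# Usage: wait_for <max_attempts> <sleep_seconds> <command> [args...]
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_for() {
    local max_attempts=$1 sleep_seconds=$2
    shift 2
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if "$@"; then
            return 0
        fi
        sleep "$sleep_seconds"
    done
    return 1
}

# keyspace_ready: hypothetical probe, NOT an existing helper in the test utils.
# Assumption: PD serves keyspace metadata at /pd/api/v2/keyspaces/<name> and the
# JSON body contains "state": "ENABLED" once the keyspace can be loaded.
keyspace_ready() {
    local pd_addr=$1 keyspace=$2
    curl -sf "http://${pd_addr}/pd/api/v2/keyspaces/${keyspace}" |
        grep -q '"state": *"ENABLED"'
}

# Example call site (placeholder address, illustrative attempt budget):
#   wait_for 60 2 keyspace_ready "127.0.0.1:2379" SYSTEM || {
#       echo "keyspace pre-allocation not ready, refusing to start SYSTEM TiDB"
#       exit 1
#   }
```

Failing at this point would make the job report the real cause (keyspace pre-allocation not ready) instead of the later, less specific Failed to start TiDB from check_tidb_health. It would not by itself recover from a PD that has already given up on pre-allocation after ErrRegionSplitTimeout.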
What did you expect to see?
Next-gen integration bootstrap should not start SYSTEM TiDB until the upstream keyspace pre-allocation is ready, and these presubmit jobs should not intermittently fail with "Failed to start TiDB".
What did you see instead?
The jobs failed before TiDB could accept MySQL connections because upstream keyspace pre-allocation had already timed out.
Relevant failure links:
Relevant log evidence:
- upstream PD: ErrRegionSplitTimeout, then failed to create pre-alloc keyspace
- upstream system TiDB: Load keyspace SYSTEM failed: ... ErrKeyspaceNotFound
Versions of the cluster
Upstream TiDB cluster version (from the failed job artifact):
Release Version: v9.0.0-beta.2.pre-1334-g5766c79
Git Commit Hash: 5766c79bbff7d2ac273d7cc7cfe71d29fbfc5488
Kernel Type: Next Generation
Upstream TiKV version (from the failed job artifact):
Release Version: 8.5.4+branch-HEAD
Git Commit Hash: 2cfd099039e3bd207aea7efbe8725a413beb4313
TiCDC version (from the failed job artifact):
release-version=v8.5.4-nextgen.202510.5-115-g9d0f1f4d
git-hash=9d0f1f4dda05e49cc449515750bc1ec36dfb295e