Skip to content

tests: isolate next gen upstream TiKV wal-sync dirs#4685

Merged
ti-chi-bot[bot] merged 2 commits intopingcap:masterfrom
tenfyzhong:fix/nextgen-upstream-walsync-dir
Apr 2, 2026
Merged

tests: isolate next gen upstream TiKV wal-sync dirs#4685
ti-chi-bot[bot] merged 2 commits intopingcap:masterfrom
tenfyzhong:fix/nextgen-upstream-walsync-dir

Conversation

@tenfyzhong
Copy link
Copy Markdown
Collaborator

@tenfyzhong tenfyzhong commented Apr 2, 2026

What problem does this PR solve?

Issue Number: close #4684

In next-gen integration startup, all upstream TiKV instances shared the same rfengine.wal-sync-dir. This can cause unstable startup behavior and lead to flaky failures when bootstrapping upstream system TiDB.

What is changed and how it works?

  • Remove [rfengine] with a shared wal-sync-dir from common upstream-tikv.toml.
  • Generate per-instance upstream TiKV config (upstream-tikv1.toml, upstream-tikv2.toml, upstream-tikv3.toml).
  • Append per-instance [rfengine] settings and set isolated wal-sync-dir for each TiKV.
  • Create per-instance wal-sync directories before launching TiKV.
  • Point each upstream TiKV process to its own config file.
  • Remove the fixed 30s sleep before starting TiDB.

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)

Test commands:

  • go test ./tests/integration_tests/_utils -run TestNextGenUpstreamTiKVWalSyncDirIsPerInstance -count=1
  • go test ./tests/integration_tests/_utils -count=1
  • bash -n tests/integration_tests/_utils/start_tidb_cluster_nextgen

Questions

Will it cause performance regression or break compatibility?

No.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

Fixes flaky next-gen integration startup where upstream TiKV instances shared one rfengine wal-sync-dir path.

Generate per-instance upstream TiKV configs with isolated wal-sync-dir and add a regression test.

Ref pingcap#4684

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

The integration test bootstrap script is modified to isolate upstream TiKV instances by generating per-instance configuration files with unique wal-sync directories (tikv1, tikv2, tikv3) instead of sharing a single path. A 30-second sleep waiting for keyspace pre-allocation is also removed from the startup sequence.

Changes

Cohort / File(s) Summary
Upstream TiKV Bootstrap Config
tests/integration_tests/_utils/start_tidb_cluster_nextgen
Modified to generate per-instance TiKV configs by cloning base config and appending instance-specific [rfengine] sections with isolated wal-sync directories (tikv1/, tikv2/, tikv3/ respectively). Creates per-instance WAL directories and passes per-instance config to tikv-server invocations. Removed 30-second sleep awaiting keyspace pre-allocation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

lgtm, approved, size/S, release-note-none

Suggested reviewers

  • wk989898
  • wlwilliamx
  • lidezhu

Poem

🐰 Three TiKVs now dance in isolated glee,
No shared raft-wal corruption can be,
Each tikv1, tikv2, and tikv3 plays apart,
Bootstrap's more stable, success from the start!
A thirty-second wait falls away,
Faster clusters now start their day ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'tests: isolate next gen upstream TiKV wal-sync dirs' accurately and concisely summarizes the main change: isolating WAL sync directories per TiKV instance to fix startup flakiness.
Linked Issues check ✅ Passed The PR successfully addresses all coding requirements from issue #4684: per-instance TiKV configs with isolated wal-sync-dirs, directory structure creation, and a regression test.
Out of Scope Changes check ✅ Passed The only modification is to the startup script to isolate upstream TiKV wal-sync directories, which directly addresses issue #4684 and the PR's stated objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The pull request description covers the problem being solved, the specific changes made, testing details with multiple test commands, answers to key questions, and provides a release note.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the start_tidb_cluster_nextgen script to ensure that each TiKV instance uses a unique wal-sync-dir path, preventing potential conflicts. A new integration test was added to verify this configuration. The reviewer suggested using the //go:embed directive in the test file to improve readability and eliminate runtime file I/O dependencies.

Comment on lines +1 to +34
package utils_test

import (
"os"
"path/filepath"
"runtime"
"strings"
"testing"
)

func TestNextGenUpstreamTiKVWalSyncDirIsPerInstance(t *testing.T) {
_, currentFile, _, ok := runtime.Caller(0)
if !ok {
t.Fatal("failed to get test file path")
}

scriptPath := filepath.Join(filepath.Dir(currentFile), "start_tidb_cluster_nextgen")
content, err := os.ReadFile(scriptPath)
if err != nil {
t.Fatalf("failed to read %s: %v", scriptPath, err)
}

script := string(content)

sharedWalDir := `wal-sync-dir = "$TEST_DATA_DIR/wal-sync/upstream/tikv/raft-wal"`
if strings.Contains(script, sharedWalDir) {
t.Fatalf("upstream TiKV instances must not share rfengine wal-sync-dir: %q", sharedWalDir)
}

perInstanceWalDir := `wal-sync-dir = "$TEST_DATA_DIR/wal-sync/upstream/tikv$idx/raft-wal"`
if !strings.Contains(script, perInstanceWalDir) {
t.Fatalf("expected per-instance rfengine wal-sync-dir snippet: %q", perInstanceWalDir)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For improved readability and robustness, you can use the //go:embed directive (available in Go 1.16+) to embed the script file directly into your test binary at compile time. This simplifies the code by removing the runtime file loading logic, makes the test self-contained, and prevents potential runtime I/O errors.

package utils_test

import (
	_ "embed"
	"strings"
	"testing"
)

//go:embed start_tidb_cluster_nextgen
var script string

func TestNextGenUpstreamTiKVWalSyncDirIsPerInstance(t *testing.T) {
	sharedWalDir := `wal-sync-dir = "$TEST_DATA_DIR/wal-sync/upstream/tikv/raft-wal"`
	if strings.Contains(script, sharedWalDir) {
		t.Fatalf("upstream TiKV instances must not share rfengine wal-sync-dir: %q", sharedWalDir)
	}

	perInstanceWalDir := `wal-sync-dir = "$TEST_DATA_DIR/wal-sync/upstream/tikv$idx/raft-wal"`
	if !strings.Contains(script, perInstanceWalDir) {
		t.Fatalf("expected per-instance rfengine wal-sync-dir snippet: %q", perInstanceWalDir)
	}
}

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration_tests/_utils/start_tidb_cluster_nextgen (1)

1-2: ⚠️ Potential issue | 🟡 Minor

Missing copyright header causes CI failure.

The pipeline reports check-copyright failed. Add the appropriate copyright header at the top of this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration_tests/_utils/start_tidb_cluster_nextgen` around lines 1 -
2, Add the project's standard copyright/license header to this script file by
inserting the same multi-line header used across the repo immediately after the
existing shebang (#!/usr/bin/env bash) in start_tidb_cluster_nextgen; ensure it
includes the correct year and owner and matches the formatting of other files so
check-copyright passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration_tests/_utils/start_tidb_cluster_nextgen_test.go`:
- Around line 1-9: Add the repository's standard copyright header at the top of
this file before the package declaration (package utils_test); ensure the header
matches the project's required format and years/owner, placed above the import
block so the CI check-copyright passes.

---

Outside diff comments:
In `@tests/integration_tests/_utils/start_tidb_cluster_nextgen`:
- Around line 1-2: Add the project's standard copyright/license header to this
script file by inserting the same multi-line header used across the repo
immediately after the existing shebang (#!/usr/bin/env bash) in
start_tidb_cluster_nextgen; ensure it includes the correct year and owner and
matches the formatting of other files so check-copyright passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 187cd87b-2342-4cb4-b596-5ad8ad1e35a2

📥 Commits

Reviewing files that changed from the base of the PR and between 1858556 and 93901bd.

📒 Files selected for processing (2)
  • tests/integration_tests/_utils/start_tidb_cluster_nextgen
  • tests/integration_tests/_utils/start_tidb_cluster_nextgen_test.go

Comment on lines +1 to +9
package utils_test

import (
"os"
"path/filepath"
"runtime"
"strings"
"testing"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing copyright header causes CI failure.

The pipeline reports check-copyright failed. Add the repository's standard copyright header before the package declaration.

Example fix structure
+// Copyright <year> PingCAP, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// ... (full header per repo standard)
+
 package utils_test
🧰 Tools
🪛 GitHub Actions: PR Build and Unit Test

[error] 1-1: check-copyright failed: The copyright information of this file is incorrect.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration_tests/_utils/start_tidb_cluster_nextgen_test.go` around
lines 1 - 9, Add the repository's standard copyright header at the top of this
file before the package declaration (package utils_test); ensure the header
matches the project's required format and years/owner, placed above the import
block so the CI check-copyright passes.

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
@ti-chi-bot ti-chi-bot Bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2026
@tenfyzhong
Copy link
Copy Markdown
Collaborator Author

/test next-gen

@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 2, 2026
@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 2, 2026
@tenfyzhong
Copy link
Copy Markdown
Collaborator Author

/test pull-cdc-storage-integration-heavy-next-gen

@ti-chi-bot ti-chi-bot Bot added the lgtm label Apr 2, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wk989898, wlwilliamx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [wk989898,wlwilliamx]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 2, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-02 03:21:50.40729317 +0000 UTC m=+408115.612653227: ☑️ agreed by wlwilliamx.
  • 2026-04-02 06:45:14.331532041 +0000 UTC m=+420319.536892097: ☑️ agreed by wk989898.

@ti-chi-bot ti-chi-bot Bot merged commit fc9050e into pingcap:master Apr 2, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Next-gen: TiDB bootstrap fails from shared upstream TiKV wal-sync-dir

3 participants