Skip to content

sync-diff-inspector: use the tiflow version package to print versions#12568

Open
kennytm wants to merge 2 commits intopingcap:masterfrom
kennytm:fix-12567
Open

sync-diff-inspector: use the tiflow version package to print versions#12568
kennytm wants to merge 2 commits intopingcap:masterfrom
kennytm:fix-12567

Conversation

@kennytm
Copy link
Copy Markdown
Contributor

@kennytm kennytm commented Mar 21, 2026

What problem does this PR solve?

Issue Number: close #12567

What is changed and how it works?

Change sync-diff-inspector to use functions in github.com/pingcap/tiflow/pkg/version instead of github.com/pingcap/tidb/pkg/util, because the env var are populated to the former.

Before:

[2026/04/02 14:15:06.453 +08:00] [INFO] [printer.go:49] ["Welcome to sync_diff_inspector"] ["Release Version"=None] ["Git Commit Hash"=None] ["Git Branch"=None] ["UTC Build Time"=None] ["Go Version"=go1.25.8]

After:

[2026/04/02 14:13:43.150 +08:00] [INFO] [version.go:47] ["Welcome to sync_diff_inspector"] [release-version=v9.0.0-beta.2.pre-72-gad3f336bd] [git-hash=ad3f336bd1462634ea31fcc6a6b888f8f8d5bd9a] [git-branch=fix-12567] [utc-build-time="2026-04-02 06:11:47"] [go-version=go1.25.8] [failpoint-build=false]

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    • Run the resulting program with -V, and run the program and checking the first line of the log.
  • No code

Questions

Will it cause performance regression or break compatibility?

No

Do you need to update user documentation, design documentation or monitoring documentation?

No

Release note

None

@kennytm kennytm added type/bugfix This PR fixes a bug. area/sync_diff Issues or PRs related to sync_diff_inspector. labels Mar 21, 2026
@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Mar 21, 2026
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the version information handling within the sync-diff-inspector tool. By switching to the tiflow version package, the tool will now accurately reflect its version details, aligning with the broader tiflow ecosystem's versioning practices. This change improves the reliability of version reporting without altering core functionality.

Highlights

  • Version Package Migration: The sync-diff-inspector tool has been updated to utilize the version utility functions from github.com/pingcap/tiflow/pkg/version instead of the previous github.com/pingcap/tidb/pkg/util.
  • Consistent Version Reporting: This change ensures that sync-diff-inspector now correctly retrieves and displays version information, as the environment variables are properly populated to the tiflow version package.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 21, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors sync-diff-inspector to use the github.com/pingcap/tiflow/pkg/version package for displaying version information, replacing the dependency on github.com/pingcap/tidb/pkg/util. This is a good change for consistency and ensures the correct, build-time populated version variables are used. The changes are correct and I have no suggestions for improvement.

@kennytm
Copy link
Copy Markdown
Contributor Author

kennytm commented Mar 23, 2026

/retest

https://do.pingcap.net/jenkins/job/pingcap/job/tiflow/job/pull_dm_integration_test/4975/execution/node/504/log/

[Mon Mar 23 02:57:18 CST 2026] <<<<<< start test_join_masters_and_worker >>>>>>
3 dm-master alive
4 dm-worker alive
0 dm-syncer alive
wait process dm-master.test exit...
...
wait process dm-master.test exit...
process dm-master.test didn't exit after 120 seconds, current processlist: jenkins    53838  2.3  0.1 16655424 409536 ?     Sl   02:56   0:04 /home/jenkins/agent/workspace/pingcap/tiflow/pull_dm_integration_test/tiflow/dm/bin/dm-master.test -test.coverprofile=/tmp/dm_test/cov.ha_cases.master.out DEVEL --master-addr=:8461 --log-file=/tmp/dm_test/ha_cases/master3/log/dm-master.log -L=debug --config=/home/jenkins/agent/workspace/pingcap/tiflow/pull_dm_integration_test/tiflow/dm/tests/ha_cases/conf/dm-master3.toml

@kennytm
Copy link
Copy Markdown
Contributor Author

kennytm commented Mar 24, 2026

process dm-master.test didn't exit after 120 seconds, current processlist: jenkins    35095  2.3  0.6 13465788 409968 ?     Sl   14:43   0:04 /home/jenkins/agent/workspace/pingcap/tiflow/pull_dm_integration_test/tiflow/dm/bin/dm-master.test -test.coverprofile=/tmp/dm_test/cov.ha_cases.master.out DEVEL --master-addr=:8361 --log-file=/tmp/dm_test/ha_cases/master2/log/dm-master.log -L=debug --config=/home/jenkins/agent/workspace/pingcap/tiflow/pull_dm_integration_test/tiflow/dm/tests/ha_cases/conf/dm-master2.toml

According to the captured /tmp/dm_test/ha_cases/master2/log/dm-master.log in https://prow.tidb.net/jenkins/job/pingcap/job/tiflow/job/pull_dm_integration_test/29/, this dm-master instance is refusing to quit even received SIGHUP:

[2026/03/24 14:44:44.814 +08:00] [INFO] [main.go:85] ["got signal to exit"] [signal=hangup]

compared with the other 2 dm-master, this instance failed to stop serving peer traffic (https://github.com/etcd-io/etcd/blob/v3.5.15/server/embed/etcd.go#L581-L584)

# these lines are *missing* in master2
[2026/03/24 14:44:44.848 +08:00] [INFO] [etcd.go:581] ["stopping serving peer traffic"] [component="embed etcd"] [address=127.0.0.1:8291]
[2026/03/24 14:44:44.848 +08:00] [INFO] [etcd.go:586] ["stopped serving peer traffic"] [component="embed etcd"] [address=127.0.0.1:8291]
[2026/03/24 14:44:44.848 +08:00] [INFO] [etcd.go:379] ["closed etcd server"] [component="embed etcd"] [name=master1] [data-dir=default.master1] [advertise-peer-urls="[http://127.0.0.1:8291]"] [advertise-client-urls="[http://127.0.0.1:8261]"]
[2026/03/24 14:44:44.848 +08:00] [INFO] [server.go:296] ["server closed"]
[2026/03/24 14:44:44.848 +08:00] [INFO] [main.go:92] ["dm-master exit"]

Analysis from codex suggested this is a bug

  1. We see that master1 and master3 successfully exited. So the remaining master2 has no quorum to do anything in the etcd cluster.
  2. When master2 received SIGHUP to close the server, one of the step is the close the election campaign loop:

    tiflow/dm/master/server.go

    Lines 305 to 307 in bcc51f7

    if s.election != nil {
    s.election.Close()
    }
  3. ... which will close the etcd session
    closeSession := func(se *concurrency.Session) {
    err2 := se.Close() // only log this error
    if err2 != nil {
    e.l.Error("fail to close etcd session", zap.Int64("lease", int64(se.Lease())), zap.Error(err2))
    }
    }
  4. ... which will send a Revoke request, which will timeout 60s, returning DeadlineExceeded. https://github.com/etcd-io/etcd/blob/v3.5.15/client/v3/concurrency/session.go#L95-L102
    [2026/03/24 14:45:44.833 +08:00] [ERROR] [election.go:204] ["fail to close etcd session"] [component=election] [lease=7048024696547951620] [error="context deadline exceeded"]
    [2026/03/24 14:45:44.833 +08:00] [INFO] [election.go:197] ["election is closed"] [component=election] ["current member"="{\"id\":\"master2\",\"addr\":\"127.0.0.1:8361\"}"]
    [2026/03/24 14:45:44.834 +08:00] [INFO] [etcd.go:377] ["closing etcd server"] [component="embed etcd"] [name=master2] [data-dir=default.master2] [advertise-peer-urls="[http://127.0.0.1:8292]"] [advertise-client-urls="[http://127.0.0.1:8361]"]
    [2026/03/24 14:45:44.834 +08:00] [WARN] [serve.go:160] ["stopping insecure grpc server due to error"] [component="embed etcd"] [error="accept tcp [::]:8361: use of closed network connection"]
    
  5. This probably caused the gRPC server to have a stale connection, making Server.stop() wait forever. https://github.com/grpc/grpc-go/blob/v1.77.0/server.go#L1949-L1952
  6. ... which blocked this https://github.com/etcd-io/etcd/blob/v3.5.15/server/embed/serve.go#L158-L163
  7. ... which makes the serversC channel not closed https://github.com/etcd-io/etcd/blob/v3.5.15/server/embed/serve.go#L116-L117
  8. ... which makes (*Etcd).Close() blocked on a for-loop of serversC waiting for it to close https://github.com/etcd-io/etcd/blob/v3.5.15/server/embed/etcd.go#L397-L403
  9. ... which in turn blocked the server close

    tiflow/dm/master/server.go

    Lines 314 to 317 in bcc51f7

    if s.etcd != nil {
    s.etcd.Close()
    }
    s.closed.Store(true)

In etcd v3.5.18 it fixed a bug which etcd.Close() has a deadlock etcd-io/etcd#19139. Let me try to upgrade etcd and see if it can fix the problem.

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 25, 2026
@kennytm
Copy link
Copy Markdown
Contributor Author

kennytm commented Mar 25, 2026

/retest

Network broken

@kennytm
Copy link
Copy Markdown
Contributor Author

kennytm commented Mar 27, 2026

/cc @joechenrh

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Mar 27, 2026

@kennytm: GitHub didn't allow me to request PR reviews from the following users: joechenrh.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @joechenrh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 27, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Mar 27, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-27 14:20:21.223507946 +0000 UTC m=+537217.259578196: ☑️ agreed by lance6716.

@ti-chi-bot ti-chi-bot bot added the approved label Mar 27, 2026
Copy link
Copy Markdown
Contributor

@joechenrh joechenrh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you paste the "the first line of the log" in the description?

/cc @GMHDBJD @D3Hunter

@ti-chi-bot ti-chi-bot bot requested review from D3Hunter and GMHDBJD March 30, 2026 09:26
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Mar 30, 2026

@joechenrh: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

/cc @GMHDBJD @D3Hunter

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Mar 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joechenrh, lance6716

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kennytm
Copy link
Copy Markdown
Contributor Author

kennytm commented Apr 2, 2026

Could you paste the "the first line of the log" in the description?

Done.

PTAL @GMHDBJD @D3Hunter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved area/sync_diff Issues or PRs related to sync_diff_inspector. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/bugfix This PR fixes a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sync-diff-inspector carries no version information in -V

3 participants