*: fix flow control deadlock (release-4.0) #1779


liuzix
Contributor

@liuzix liuzix commented May 14, 2021

What problem does this PR solve?

  • Fix a deadlock on release-4.0 caused by flow control when the sorter's input channel is blocked.
goroutine 2776 [select, 363 minutes]:
github.com/pingcap/ticdc/cdc.(*oldProcessor).runFlowControl.func1(0xc0895d7918, 0x2189526)
        github.com/pingcap/ticdc@/cdc/processor.go:990 +0x1d8
github.com/pingcap/ticdc/cdc/sink/common.(*TableMemoryQuota).ConsumeWithBlocking(0xc096495da0, 0x13, 0xc0895d7c20, 0x0, 0x0)
        github.com/pingcap/ticdc@/cdc/sink/common/flow_control.go:77 +0xfc
github.com/pingcap/ticdc/cdc/sink/common.(*TableFlowController).Consume(0xc096495dd0, 0x5e5a0ae09d400ea, 0x13, 0xc0895d7c20, 0x0, 0x0)
        github.com/pingcap/ticdc@/cdc/sink/common/flow_control.go:174 +0x8a
github.com/pingcap/ticdc/cdc.(*oldProcessor).runFlowControl(0xc0026606c0, 0x2f71160, 0xc095bf2440, 0x34, 0xc096495dd0, 0xc019019aa0, 0xc019019da0)
        github.com/pingcap/ticdc@/cdc/processor.go:981 +0x45d
github.com/pingcap/ticdc/cdc.(*oldProcessor).sorterConsume.func6(0xc0026606c0, 0x2f71160, 0xc095bf2440, 0x34, 0xc096495dd0, 0x2f5d420, 0xc096495a10, 0xc019019da0)
        github.com/pingcap/ticdc@/cdc/processor.go:1183 +0x82
created by github.com/pingcap/ticdc/cdc.(*oldProcessor).sorterConsume
        github.com/pingcap/ticdc@/cdc/processor.go:1182 +0xb76


goroutine 2774 [semacquire, 363 minutes]:
sync.runtime_SemacquireMutex(0xc096495db0, 0xc036227500, 0x1)
        runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc096495dac)
        sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        sync/mutex.go:81
github.com/pingcap/ticdc/cdc/sink/common.(*TableMemoryQuota).Release(0xc096495da0, 0x0)
        github.com/pingcap/ticdc@/cdc/sink/common/flow_control.go:105 +0x2f4
github.com/pingcap/ticdc/cdc/sink/common.(*TableFlowController).Release(0xc096495dd0, 0x5e5a0a21b240228)
        github.com/pingcap/ticdc@/cdc/sink/common/flow_control.go:215 +0x102
github.com/pingcap/ticdc/cdc.(*oldProcessor).sorterConsume.func5(0xc036227e30, 0xc03622788c)
        github.com/pingcap/ticdc@/cdc/processor.go:1175 +0x100
github.com/pingcap/ticdc/cdc.(*oldProcessor).sorterConsume(0xc0026606c0, 0x2f71160, 0xc095bf2440, 0x34, 0xc095c064e0, 0x11, 0x2f5d420, 0xc096495a10, 0xc0961d18c8, 0xc0961d18d0, ...)
        github.com/pingcap/ticdc@/cdc/processor.go:1256 +0x18ec
github.com/pingcap/ticdc/cdc.(*oldProcessor).addTable.func2.5(0xc0026606c0, 0x2f71160, 0xc095bf2440, 0x34, 0xc096256740, 0x2f5d420, 0xc096495a10, 0xc0961d18c8, 0xc0961d18d0, 0xc095c3e1b0, ...)
        github.com/pingcap/ticdc@/cdc/processor.go:881 +0xca
created by github.com/pingcap/ticdc/cdc.(*oldProcessor).addTable.func2
        github.com/pingcap/ticdc@/cdc/processor.go:880 +0x664
  • Goroutine 2776 is blocked in the blockCallback function, which sends an event to flowControlOutCh; the channel was full at that time.
  • The consumer of flowControlOutCh is itself blocked in sendResolvedTs2Sink (both are handled in the same select).
  • sendResolvedTs2Sink performs a flow controller release operation, which needs to acquire a lock, but that lock is held by goroutine 2776.
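The cycle described above can be reproduced in miniature. The following is only a sketch under assumptions, not the TiCDC code: `quota`, `consumeHoldingLock`, `release`, and `releaseBlockedDuringCallback` are hypothetical names, and the callback waits on a release directly instead of going through flowControlOutCh. A timeout stands in for the indefinite hang so the example terminates.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// quota is a hypothetical, simplified stand-in for a memory quota
// guarded by a mutex. It is not the TiCDC implementation.
type quota struct {
	mu       sync.Mutex
	consumed uint64
	limit    uint64
}

// consumeHoldingLock mimics the problematic pattern: the block callback
// is invoked while q.mu is still held.
func (q *quota) consumeHoldingLock(n uint64, blockCallback func()) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.consumed+n > q.limit {
		blockCallback() // q.mu is held for the whole call
	}
	q.consumed += n
}

// release returns quota and needs the same mutex.
func (q *quota) release(n uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.consumed -= n
}

// releaseBlockedDuringCallback reports whether a concurrent release is
// stuck while the callback runs. In the real deadlock the callback would
// wait forever; here a timeout breaks the cycle.
func releaseBlockedDuringCallback() bool {
	q := &quota{consumed: 10, limit: 10}
	blocked := false
	q.consumeHoldingLock(1, func() {
		done := make(chan struct{})
		go func() {
			q.release(1) // cannot proceed until the callback returns
			close(done)
		}()
		select {
		case <-done:
		case <-time.After(100 * time.Millisecond):
			blocked = true
		}
	})
	return blocked
}

func main() {
	fmt.Println("release blocked during callback:", releaseBlockedDuringCallback())
}
```

Because the callback depends on progress that only a release can provide, and the release depends on the mutex the callback's caller holds, neither side can advance without the timeout.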

What is changed and how it works?

  • Ensure that the flow controller's lock is not held while the block callback is being called.
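One way to apply this idea is to check the quota under the lock, drop the lock, run the callback, and only then re-acquire the lock to wait for capacity. The sketch below illustrates that pattern under assumptions; the names (`quota`, `consume`, `release`, `consumeAfterCallbackRelease`) are hypothetical and the code is not taken from this PR's diff.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// quota is a hypothetical, simplified memory quota; not the TiCDC type.
type quota struct {
	mu       sync.Mutex
	cond     *sync.Cond
	consumed uint64
	limit    uint64
}

func newQuota(consumed, limit uint64) *quota {
	q := &quota{consumed: consumed, limit: limit}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// consume checks the quota under the lock, but unlocks before calling
// blockCallback, so the callback may wait on work that needs release().
func (q *quota) consume(n uint64, blockCallback func()) error {
	if n > q.limit {
		return errors.New("request larger than quota")
	}
	q.mu.Lock()
	wouldBlock := q.consumed+n > q.limit
	q.mu.Unlock()

	if wouldBlock {
		blockCallback() // q.mu is NOT held here
	}

	q.mu.Lock()
	defer q.mu.Unlock()
	for q.consumed+n > q.limit {
		q.cond.Wait() // releases q.mu while waiting
	}
	q.consumed += n
	return nil
}

// release returns quota and wakes any waiting consumer.
func (q *quota) release(n uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if n > q.consumed {
		n = q.consumed
	}
	q.consumed -= n
	q.cond.Broadcast()
}

// consumeAfterCallbackRelease calls release from inside the callback,
// which would hang forever if consume still held the lock at that point.
func consumeAfterCallbackRelease() (uint64, error) {
	q := newQuota(10, 10)
	err := q.consume(1, func() { q.release(1) })
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.consumed, err
}

func main() {
	consumed, err := consumeAfterCallbackRelease()
	fmt.Println(consumed, err)
}
```

The trade-off in this sketch is that the quota can change between the check and the callback, so the wait loop re-checks the condition after the lock is taken again.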

Check List

Tests

  • Unit test

Release note

  • Fix a deadlock in flow control.

@liuzix
Contributor Author

liuzix commented May 14, 2021

/run-all-tests

@ti-chi-bot ti-chi-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 14, 2021
@amyangfei
Contributor

/run-all-tests

@amyangfei amyangfei added this to the v4.0.13 milestone May 14, 2021
@amyangfei amyangfei added release-blocker This issue blocks a release. Please solve it ASAP. type/bugfix This PR fixes a bug. labels May 14, 2021
@amyangfei
Contributor

/run-integration-tests

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label May 14, 2021
@amyangfei amyangfei added the status/ptal Could you please take a look? label May 14, 2021
@overvenus
Member

/lgtm

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • amyangfei
  • overvenus

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels May 14, 2021
@lonng lonng added cherry-pick-approved Cherry pick PR approved by release team. and removed do-not-merge/cherry-pick-not-approved status/ptal Could you please take a look? labels May 15, 2021
@lonng
Contributor

lonng commented May 15, 2021

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: d2709e7

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label May 15, 2021
@ti-chi-bot
Member

@liuzix: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@amyangfei
Contributor

/run-unit-tests

@codecov-commenter

Codecov Report

Merging #1779 (eeae6c3) into release-4.0 (c08d59e) will increase coverage by 0.0750%.
The diff coverage is 100.0000%.

@@                 Coverage Diff                 @@
##           release-4.0      #1779        +/-   ##
===================================================
+ Coverage      52.4990%   52.5740%   +0.0750%     
===================================================
  Files              153        153                
  Lines            15926      15928         +2     
===================================================
+ Hits              8361       8374        +13     
+ Misses            6654       6644        -10     
+ Partials           911        910         -1     

@ti-chi-bot ti-chi-bot merged commit 85847c3 into pingcap:release-4.0 May 15, 2021
Labels
cherry-pick-approved Cherry pick PR approved by release team. release-blocker This issue blocks a release. Please solve it ASAP. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
6 participants