store/tikv: fix a concurrency bug that may cause the batchClient timeout (#22239) #22336

Merged
merged 2 commits into from
Jan 11, 2021

Conversation

ti-srebot
Contributor

@ti-srebot ti-srebot commented Jan 11, 2021

cherry-pick #22239 to release-4.0
You can switch your code base to this Pull Request by using git-extras:

# In tidb repo:
git pr 22336

After applying modifications, you can push your changes to this PR via:

git push git@github.com:ti-srebot/tidb.git pr/22336:ti-srebot:release-4.0-ae7e43249a35

What problem does this PR solve?

closes #22334

Problem Summary:

The recycleIdleConnArray() logic has a bug: when one goroutine calls getConnArray() while another goroutine recycles the idle connection, the first goroutine may get a stale batchConn that has already been closed.

Calling sendBatchRequest() with that stale batchConn then blocks until it times out.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
g1: connArray := getConnArray(addr, enableBatch)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
                                        g2: c.Lock()
                                        g2: conn := c.conns[addr]
                                        g2: c.Unlock()
                                        g2: conn.Close()
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
g1: sendBatchRequest(connArray)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
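
The interleaving above can be replayed in a small standalone Go sketch. This is not the TiDB code: the `connArray` and `rpcClient` types, the `errConnClosed` error, the `staleSend` helper, and the `delete` from the map are simplified assumptions made for illustration. The sketch runs the three steps sequentially so the outcome is deterministic, and it returns an error where the real client would block until timeout.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConnClosed is a hypothetical error used only in this sketch;
// per the PR description, the real client blocks until timeout instead.
var errConnClosed = errors.New("send on closed connArray")

// connArray is a simplified stand-in for the real connection array.
type connArray struct {
	mu     sync.Mutex
	closed bool
}

// Close marks the connection array as closed.
func (a *connArray) Close() {
	a.mu.Lock()
	a.closed = true
	a.mu.Unlock()
}

// send fails if the connection array has already been closed.
func (a *connArray) send(req string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.closed {
		return errConnClosed
	}
	return nil
}

// rpcClient is a simplified stand-in for the client that owns the conn map.
type rpcClient struct {
	sync.RWMutex
	conns map[string]*connArray
}

// getConnArray holds the read lock only for the map lookup (the buggy pattern).
func (c *rpcClient) getConnArray(addr string) *connArray {
	c.RLock()
	defer c.RUnlock()
	return c.conns[addr]
}

// recycleIdleConnArray takes the entry under the write lock and closes it
// after unlocking. Removing the entry from the map is an assumption here.
func (c *rpcClient) recycleIdleConnArray(addr string) {
	c.Lock()
	conn := c.conns[addr]
	delete(c.conns, addr)
	c.Unlock()
	if conn != nil {
		conn.Close()
	}
}

// staleSend replays the g1/g2 interleaving step by step.
func staleSend() error {
	c := &rpcClient{conns: map[string]*connArray{"tikv-1": {}}}
	stale := c.getConnArray("tikv-1") // g1: lookup, lock released afterwards
	c.recycleIdleConnArray("tikv-1")  // g2: recycle and close
	return stale.send("req")          // g1: send on the stale reference
}

func main() {
	fmt.Println(staleSend()) // prints "send on closed connArray"
}
```

Nothing prevents g2 from running between the lookup and the send, because the lock protects only the map access, not the lifetime of the value read from it.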

What is changed and how it works?

What's Changed:

Previously, only the lookup was done under the read lock:

RLock()
conn := getConnArray()
RUnlock()
sendBatchRequest(conn)

This is not enough to protect the conn from being recycled and closed.
Now the whole sending process is protected by the read lock, and modifying the conn map must obtain the write lock.

How it Works:

As long as the sending operation holds the read lock, the recycle operation has to wait before it can obtain the write lock.
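
A minimal sketch of this locking scheme follows, assuming a `sync.RWMutex` guards the conn map. The type names, the `closed` flag, and the `runConcurrently` helper are hypothetical and are not taken from the actual patch. Closing the connection inside the write lock is also a simplification chosen for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// connArray is a simplified stand-in; closed is guarded by rpcClient's lock.
type connArray struct{ closed bool }

// rpcClient is a simplified stand-in for the client that owns the conn map.
type rpcClient struct {
	sync.RWMutex
	conns map[string]*connArray
}

// sendBatchRequest holds the read lock across both the lookup and the send.
// It reports whether a conn was found and whether it was seen closed.
func (c *rpcClient) sendBatchRequest(addr string) (sent, sawClosed bool) {
	c.RLock()
	defer c.RUnlock()
	conn, ok := c.conns[addr]
	if !ok {
		return false, false // already recycled; a real client would reconnect
	}
	// The "send" happens here, still under the read lock.
	return true, conn.closed
}

// recycleIdleConnArray modifies the map only under the write lock, so it
// waits until every in-flight send has released its read lock.
func (c *rpcClient) recycleIdleConnArray(addr string) {
	c.Lock()
	defer c.Unlock()
	if conn, ok := c.conns[addr]; ok {
		delete(c.conns, addr)
		conn.closed = true
	}
}

// runConcurrently races many senders against one recycler and counts how
// many sends observed a closed connection.
func runConcurrently(senders int) (staleSends int) {
	c := &rpcClient{conns: map[string]*connArray{"tikv-1": {}}}
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < senders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, sawClosed := c.sendBatchRequest("tikv-1"); sawClosed {
				mu.Lock()
				staleSends++
				mu.Unlock()
			}
		}()
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.recycleIdleConnArray("tikv-1")
	}()
	wg.Wait()
	return staleSends
}

func main() {
	fmt.Println("stale sends:", runConcurrently(100)) // prints "stale sends: 0"
}
```

A sender either finds the entry while it is still open or does not find it at all. The trade-off is that a slow send delays recycling, since the writer must wait for all readers to finish.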

Related changes

  • Need to cherry-pick to the release branch

Maybe we can cherry-pick it to 5.0; this bug is rarely seen in production environments.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Release note

  • Fix a concurrency bug that may cause the batch client to time out

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@tiancaiamao you're already a collaborator in bot's repo.

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 11, 2021
@hicqu
Contributor

hicqu commented Jan 11, 2021

LGTM

@lysu
Collaborator

lysu commented Jan 11, 2021

LGTM

@ti-srebot ti-srebot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jan 11, 2021
Member

@jackysp jackysp left a comment

LGTM

@ti-srebot ti-srebot added status/LGT3 The PR has already had 3 LGTM. and removed status/LGT2 Indicates that a PR has LGTM 2. labels Jan 11, 2021
@jebter

jebter commented Jan 11, 2021

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 11, 2021
@ti-srebot
Contributor Author

Your auto merge job has been accepted, waiting for:

  • 22319

@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot ti-srebot merged commit 1edefab into pingcap:release-4.0 Jan 11, 2021
@tiancaiamao tiancaiamao deleted the release-4.0-ae7e43249a35 branch January 11, 2021 08:22
Labels
component/tikv status/can-merge Indicates a PR has been approved by a committer. status/LGT3 The PR has already had 3 LGTM. type/bug-fix This PR fixes a bug. type/4.0-cherry-pick

6 participants