Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tikv: make requests fail quickly when the dial times out #12819

Merged
merged 13 commits into from
Oct 25, 2019
Merged

tikv: make requests fail quickly when the dial times out #12819

merged 13 commits into from
Oct 25, 2019

Conversation

lysu
Copy link
Contributor

@lysu lysu commented Oct 18, 2019

What problem does this PR solve?

current impl, we need wait request-timeout(cop:60s, 2pc:20s, point-get:20s) to trigger retry.

we should fail-fast if can't connect to target host, because request-timeout thinking about processing time, but for network unreachable error, it's better to timeout quickly and retry others.

What is changed and how it works?

  • add wait ready timeout again(remove by code review in the previous pr)
  • fail pending request quickly if wait ready timeouted

Check List

Tests

  • Unit test
  • Integration test
  • Manual test
random rewrite getConnArray to a unknow host.
and check can TiDB works well

Code changes

  • impl

Side effects

  • n/a

Related changes

  • Need to cherry-pick to the release branch

Release note

  • tikv: make requests fail-fast for dial timeouted

This change is Reviewable

@lysu lysu added type/enhancement The issue or PR belongs to an enhancement. component/tikv needs-cherry-pick-3.0 labels Oct 18, 2019
@lysu
Copy link
Contributor Author

lysu commented Oct 18, 2019

/run-all-tests

@codecov
Copy link

codecov bot commented Oct 18, 2019

Codecov Report

Merging #12819 into master will decrease coverage by 0.0449%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##            master     #12819       +/-   ##
==============================================
- Coverage   80.146%   80.1011%   -0.045%     
==============================================
  Files          465        465               
  Lines       107903     107398      -505     
==============================================
- Hits         86480      86027      -453     
  Misses       14951      14951               
+ Partials      6472       6420       -52

@sre-bot
Copy link
Contributor

sre-bot commented Oct 20, 2019

@coocood, @tiancaiamao, PTAL.

store/tikv/client_batch.go Outdated Show resolved Hide resolved
@lysu lysu requested a review from jackysp October 21, 2019 05:19
Copy link
Member

@jackysp jackysp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jackysp jackysp added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 22, 2019
@coocood
Copy link
Member

coocood commented Oct 22, 2019

LGTM

@coocood
Copy link
Member

coocood commented Oct 22, 2019

Have you run the manual test again after the latest modification?

@lysu
Copy link
Contributor Author

lysu commented Oct 22, 2019

@coocood I have added a new failpoint to mock the wrong addr by manually:

  1. build tidb with make failpoint-enable
  2. start a tidb with tikv/pd with GO_FAILPOINTS_HTTP=0.0.0.0:10001 ./bin/tidb-server -store tikv -path 127.0.0.1:2379
  3. inject wrong addr for once with curl -XPUT -d '1*return("182.16.5.31:8080")->return("")' http://0.0.0.0:10001/github.com/pingcap/tidb/store/tikv/injectWrongStoreAddr
  4. try send other request to tidb cluster should not block, and tidb log will see this after 5s
[2019/10/22 17:16:09.210 +08:00] [INFO] [region_cache.go:504] ["switch region peer to next due to send request fail"] [current="region ID: 6, meta: id:6 start_key:\"m\" end_key:\"n\" region_epoch:<conf_ver:1 version:5 > peers:<id:7 store_id:1 > , peer: id:7 store_id:1 , addr: 182.16.5.31:8080, idx: 0"] [needReload=true] [error="context deadline exceeded"]
[2019/10/22 17:16:09.214 +08:00] [INFO] [region_cache.go:330] ["invalidate current region, because others failed on same store"] [region=10] [store=127.0.0.1:9191]
[2019/10/22 17:16:09.214 +08:00] [INFO] [region_cache.go:330] ["invalidate current region, because others failed on same store"] [region=10] [store=127.0.0.1:9191]
[2019/10/22 17:16:09.212 +08:00] [INFO] [region_cache.go:330] ["invalidate current region, because others failed on same store"] [region=12] [store=127.0.0.1:9191]

@coocood
Copy link
Member

coocood commented Oct 22, 2019

LGTM

@lysu
Copy link
Contributor Author

lysu commented Oct 22, 2019

/run-all-tests

@lysu
Copy link
Contributor Author

lysu commented Oct 22, 2019

/run-all-tests

@lysu lysu added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 22, 2019
@lysu
Copy link
Contributor Author

lysu commented Oct 24, 2019

@jackysp @coocood need a approve thx~

@lysu lysu added the status/can-merge Indicates a PR has been approved by a committer. label Oct 25, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Oct 25, 2019

/run-all-tests

@sre-bot sre-bot merged commit e8631bf into pingcap:master Oct 25, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Oct 25, 2019

cherry pick to release-3.1 failed

@sre-bot
Copy link
Contributor

sre-bot commented Oct 25, 2019

cherry pick to release-3.0 failed

@lysu lysu changed the title tikv: make requests fail-fast for dial timeouted tikv: Make requests fail quickly when the dial times out Oct 28, 2019
@lysu lysu changed the title tikv: Make requests fail quickly when the dial times out tikv: make requests fail quickly when the dial times out Oct 28, 2019
sre-bot pushed a commit to sre-bot/tidb that referenced this pull request Nov 8, 2019
XiaTianliang pushed a commit to XiaTianliang/tidb that referenced this pull request Dec 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/tikv status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants