tikvclient: refine region-cache #10256

Merged
merged 28 commits into from May 21, 2019

@lysu (Member) commented Apr 24, 2019

What problem does this PR solve?

Tries to fix #10037 by continuing to retry a region's other stores, in order to:

  • work with the hibernate region feature in TiKV
  • keep using old cached data when TiDB cannot connect to PD

What is changed and how it works?

What does the region cache hold?

  • region: a region maintains a range of data, and one region has multiple stores. Regions are created and deleted frequently (region split, region merge, leader change, scheduling from one store to another).
  • stores: a store is a KV process. One machine can hold multiple stores, and one store can hold multiple regions. Stores change infrequently (copying old KV data to a new machine, changing a network interface...).

What failures or outdated data will the region cache meet?

  1. send data failure

Normally, this happens when a machine is down, when there is a network partition between TiDB and KV, or when the process crashes.
On the TiDB side, a send-data failure event is what identifies the store failure.

Occasionally it is caused by someone replacing a machine or changing a network interface, but people don't often do that.

  2. send succeeds but an error response comes back

This means the store is working well, but the info is mismatched.
On the TiDB side, sending data succeeds, but KV returns a failure response.

Normally it is the region info that has changed; seldom is it the store info.
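
To make the two failure classes concrete, here is a minimal Go sketch of how a client could tell them apart after one send attempt. The names `failureKind`, `response`, `classify`, and `errConnRefused` are hypothetical and used only for illustration; they are not the types used in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// failureKind names the two failure classes described above (hypothetical type).
type failureKind int

const (
	noFailure     failureKind = iota
	sendFailure               // the request could not be delivered: suspect the store
	errorResponse             // the store replied with an error: suspect cached region info
)

// response is a hypothetical stand-in for a KV reply.
// A non-empty regionErr means the store rejected the request.
type response struct {
	regionErr string
}

// errConnRefused is a sample transport error used in the demo.
var errConnRefused = errors.New("connection refused")

// classify maps the result of one send attempt to a failure class.
func classify(sendErr error, resp *response) failureKind {
	if sendErr != nil {
		return sendFailure
	}
	if resp != nil && resp.regionErr != "" {
		return errorResponse
	}
	return noFailure
}

func main() {
	fmt.Println(classify(errConnRefused, nil) == sendFailure)
	fmt.Println(classify(nil, &response{regionErr: "NotLeader"}) == errorResponse)
	fmt.Println(classify(nil, &response{}) == noFailure)
}
```

The point of the split is that each class suggests a different suspect: a send failure points at the store, while an error response points at stale cached info, most often region info.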

How it currently works

On send failure, the region cache entry is dropped and the store info is removed, which triggers a reload of both region and store info.

But this has 2 problems:

  • Re-fetching region info returns the old leader if KV has not triggered a new election, but with hibernate regions another peer must be contacted to start a new election.
  • Fetching the region from PD may fail if PD is behind a network partition.

How this PR changes it

  1. mark store after send
  • Mark a store as failed when sending data fails, so that TiDB blacks out this store for a following period (based on the consecutive failure count and the last failure timestamp).
  • Mark a store as successful when sending data succeeds.

"Blacking out a store on failure" keeps the chance that the failed peer can be used again later, without causing a retry flood.

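The store marking described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `storeState`, its methods, the constants, and the rule "blackout grows with the consecutive failure count, up to a cap" are all hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// storeState is a hypothetical record of a store's recent send results.
type storeState struct {
	failCount   int   // consecutive send failures
	lastFailSec int64 // unix seconds of the most recent failure
}

// Assumed tuning values: the blackout lasts failCount * baseBlackoutSec,
// capped at maxBlackoutSec.
const (
	baseBlackoutSec = 1
	maxBlackoutSec  = 10
)

// markFailed records one more consecutive failure at time nowSec.
func (s *storeState) markFailed(nowSec int64) {
	s.failCount++
	s.lastFailSec = nowSec
}

// markSucceeded clears the failure streak, ending any blackout.
func (s *storeState) markSucceeded() {
	s.failCount = 0
}

// available reports whether the store may be tried at time nowSec.
func (s *storeState) available(nowSec int64) bool {
	if s.failCount == 0 {
		return true
	}
	blackout := int64(s.failCount) * baseBlackoutSec
	if blackout > maxBlackoutSec {
		blackout = maxBlackoutSec
	}
	return nowSec-s.lastFailSec >= blackout
}

func main() {
	s := &storeState{}
	s.markFailed(100)
	fmt.Println(s.available(100), s.available(101))
	s.markSucceeded()
	fmt.Println(s.available(100))
}
```

Because the blackout expires on its own, a failed store gets retried eventually, while the growing window keeps a persistently failing store from being hit on every request.
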
  2. invalidate region cache

Cache items are never deleted; they are only invalidated, which triggers a re-fetch from PD.
If the fetch fails because PD is down, the item is made valid again so that the old data keeps being used.
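
A minimal Go sketch of "invalidate instead of delete" is shown below. All names (`region`, `regionCache`, `fetch`, `errPDUnavailable`) are hypothetical; `fetch` merely stands in for a PD lookup, and the sketch ignores locking and key-range lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// region is a hypothetical cached region entry.
type region struct {
	id         uint64
	leaderAddr string
	valid      bool
}

// errPDUnavailable is a sample error for an unreachable PD.
var errPDUnavailable = errors.New("pd unavailable")

// regionCache keeps entries forever; entries are only flagged invalid.
type regionCache struct {
	regions map[uint64]*region
	fetch   func(id uint64) (*region, error) // stand-in for a PD lookup
}

// invalidate flags an entry so the next get tries to reload it.
func (c *regionCache) invalidate(id uint64) {
	if r, ok := c.regions[id]; ok {
		r.valid = false
	}
}

// get returns a valid entry, reloading it when flagged invalid.
// If the reload fails and an old entry exists, the old entry is
// marked valid again and returned.
func (c *regionCache) get(id uint64) (*region, error) {
	old, ok := c.regions[id]
	if ok && old.valid {
		return old, nil
	}
	fresh, err := c.fetch(id)
	if err == nil {
		fresh.valid = true
		c.regions[id] = fresh
		return fresh, nil
	}
	if ok {
		old.valid = true // reload failed: fall back to the old data
		return old, nil
	}
	return nil, err
}

func main() {
	c := &regionCache{
		regions: map[uint64]*region{1: {id: 1, leaderAddr: "store-1", valid: true}},
		fetch:   func(uint64) (*region, error) { return nil, errPDUnavailable },
	}
	c.invalidate(1)
	r, err := c.get(1)
	fmt.Println(r.leaderAddr, r.valid, err)
}
```

In this sketch a failed reload costs one lookup attempt and then restores the previous behavior, so requests keep flowing on old data until the entry is invalidated again.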

Reviewing this PR may require a look at #6880 and #2792, and also at the GetRPCContext method, which isn't modified but is vital to this logic.

  3. main focus
  • Is the cache-hit code path fast and lock-free?
  • When to try other peers if the current leader is unreachable?
  • When to invalidate a region and trigger a region reload?
  • When, and for how long, to black out a store that keeps failing?
  • When to trigger a reload of store info?

Check List

Tests

  • Unit test (WIP, more to be added)
  • Integration test

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change

Side effects

  • Increased code complexity

Related changes

  • N/A

@lysu lysu requested a review from coocood Apr 24, 2019

@lysu (Member Author) commented Apr 24, 2019

/run-all-tests

@lysu lysu added the status/WIP label Apr 24, 2019

@lysu lysu force-pushed the lysu:dev-region-retry-fail branch 5 times, most recently from 96f3b2a to e1700ab Apr 24, 2019

@lysu lysu added the require-LGT3 label Apr 25, 2019

@lysu lysu force-pushed the lysu:dev-region-retry-fail branch from e1700ab to a55b6bc Apr 25, 2019

@lysu lysu requested review from disksing and tiancaiamao Apr 25, 2019

@lysu lysu removed the status/WIP label Apr 25, 2019

@lysu lysu marked this pull request as ready for review Apr 25, 2019

@lysu lysu requested a review from jackysp Apr 25, 2019

@lysu (Member Author) commented Apr 25, 2019

/run-all-tests

@zhouqiang-cl (Member) commented Apr 25, 2019

/rebuild

@zhouqiang-cl (Member) commented Apr 25, 2019

/rebuild

@lysu (Member Author) commented Apr 25, 2019

/run-all-tests

@zhouqiang-cl (Member) commented Apr 25, 2019

/rebuild

@mahjonp commented Apr 25, 2019

/rebuild

@lysu (Member Author) commented Apr 25, 2019

/run-unit-test

@lysu lysu added the status/WIP label Apr 28, 2019

@lysu lysu force-pushed the lysu:dev-region-retry-fail branch from 397bfb2 to fcf3ca1 Apr 28, 2019

@codecov commented Apr 28, 2019

Codecov Report

Merging #10256 into master will decrease coverage by 0.0033%.
The diff coverage is 72.3404%.

@@               Coverage Diff                @@
##             master     #10256        +/-   ##
================================================
- Coverage   77.8178%   77.8144%   -0.0034%     
================================================
  Files           410        410                
  Lines         84365      84438        +73     
================================================
+ Hits          65651      65705        +54     
- Misses        13813      13826        +13     
- Partials       4901       4907         +6

@codecov commented Apr 28, 2019

Codecov Report

Merging #10256 into master will decrease coverage by 0.0176%.
The diff coverage is 81.0096%.

@@               Coverage Diff                @@
##             master     #10256        +/-   ##
================================================
- Coverage   77.2779%   77.2603%   -0.0177%     
================================================
  Files           413        413                
  Lines         86986      87244       +258     
================================================
+ Hits          67221      67405       +184     
- Misses        14600      14647        +47     
- Partials       5165       5192        +27

@lysu lysu force-pushed the lysu:dev-region-retry-fail branch from ebe6de1 to cbfd25e May 20, 2019

NotLeader's leader maybe miss in current cache
kv return NotLeader before EpochNotMatch

@coocood (Member) commented May 20, 2019

LGTM

@tiancaiamao (Contributor) commented May 21, 2019

LGTM

@tiancaiamao (Contributor) commented May 21, 2019

/run-all-tests

@tiancaiamao tiancaiamao added status/LGT3 and removed status/LGT2 labels May 21, 2019

@tiancaiamao tiancaiamao merged commit f6346a1 into pingcap:master May 21, 2019

16 checks passed

  • ci/circleci: Your tests passed on CircleCI!
  • codecov/patch: 81.0096% of diff hit (target 0%)
  • codecov/project: Absolute coverage decreased by -0.0176% but relative coverage increased by +3.7316% compared to 2e9ddb2
  • idc-jenkins-ci-tidb/build: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/build_check_race: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/check_dev: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/check_dev_2: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/common-test: job succeeded
  • idc-jenkins-ci-tidb/integration-common-test: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/integration-compatibility-test: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/integration-ddl-test: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/mybatis-test: job succeeded
  • idc-jenkins-ci-tidb/sqllogic-test-1: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/sqllogic-test-2: Jenkins job succeeded.
  • idc-jenkins-ci-tidb/unit-test: Jenkins job succeeded.
  • license/cla: Contributor License Agreement is signed.

db-storage added a commit to db-storage/tidb that referenced this pull request May 27, 2019
tikvclient: refine region-cache (pingcap#10256)
1. mark store after send
* mark store be failed when sending data failure, it let tidb blackout this store in following in a period(continue fail count + last fail timestamp).
* mark store as success when sending data success.

2. invalidate region cache
* cache item never be deleted and only invalidate it to trigger pd re-fetch
* make cache item validate again to keep use old data if fetch failure caused by pd down.
db-storage added a commit to db-storage/tidb that referenced this pull request May 29, 2019
tikvclient: refine region-cache (pingcap#10256)