
BR suffers a 15x performance regression when single TiKV node down #42973

Closed
YuJuncen opened this issue Apr 12, 2023 · 5 comments · Fixed by #43099
Labels
affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. affects-6.1 affects-6.5 component/br This issue is related to BR of TiDB. severity/critical type/bug The issue is confirmed as a bug.

Comments

@YuJuncen
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Create a cluster with 4 or more TiKV nodes.
  2. Shut down one TiKV node.
  3. Execute the backup.

2. What did you expect to see? (Required)

The backup should be slightly slower than backing up a healthy cluster.

3. What did you see instead (Required)

The backup is about 15x slower than backing up a healthy cluster (4 mins vs 1 hour).

4. What is your TiDB version? (Required)

Near master, but the problem is not strongly tied to any particular version.

@YuJuncen YuJuncen added the type/bug The issue is confirmed as a bug. label Apr 12, 2023
@YuJuncen
Contributor Author

YuJuncen commented Apr 12, 2023

The reason is:

  • When we dial a (down) host, it takes a long time to fail.
  • BR SHOULD skip those (unreachable) stores, and BR did implement this:
    if s.GetState() != metapb.StoreState_Up {
        logutil.CL(lctx).Warn("skip store", zap.Stringer("State", s.GetState()))
        continue
    }
  • But in our case, the store status returned to BR is still Up, even though pd-ctl reports the store as Down.
  • Consequently, for every range we back up, we must wait about 30s for the dial to the down store to fail (30s is the dial timeout). We also cannot wait for these dials concurrently, because of the mutual exclusion in StoreManager.
  • So if a store is down, each object may take at least 1 minute to back up. In my TPC-C workload there are 116 objects (ranges), so a full backup of the cluster takes about 1 hour. (See the sketch after this list.)
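For illustration, here is a minimal sketch (not BR's actual code; probeStores, the address list, and the 1-second timeout are assumptions) of probing all stores concurrently with a short dial timeout, so a single dead store costs one probe interval instead of one full dial timeout per range:

import (
    "context"
    "sync"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

// probeStores dials every store address concurrently with a short timeout and
// reports which ones are reachable. Because the dials run in parallel and are
// bounded by the context deadline, an unreachable store stalls the probe for
// about one second rather than 30 seconds per backed-up range.
func probeStores(ctx context.Context, addrs []string) map[string]bool {
    alive := make(map[string]bool, len(addrs))
    var (
        mu sync.Mutex
        wg sync.WaitGroup
    )
    for _, addr := range addrs {
        wg.Add(1)
        go func(addr string) {
            defer wg.Done()
            dctx, cancel := context.WithTimeout(ctx, time.Second)
            defer cancel()
            // WithBlock makes DialContext wait until the connection is up (or
            // the deadline expires), so err != nil means "unreachable right
            // now" instead of "dial still pending in the background".
            conn, err := grpc.DialContext(dctx, addr,
                grpc.WithTransportCredentials(insecure.NewCredentials()),
                grpc.WithBlock())
            if err == nil {
                conn.Close()
            }
            mu.Lock()
            alive[addr] = err == nil
            mu.Unlock()
        }(addr)
    }
    wg.Wait()
    return alive
}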

@YuJuncen
Contributor Author

BTW, the current implementation will jump directly to the fine-grained backup when any store is unreachable:

if err != nil {
    // BR should be able to backup even some of stores disconnected.
    // The regions managed by this store can be retried at fine-grained backup then.
    logutil.CL(lctx).Warn("fail to connect store, skipping", zap.Error(err))
    return nil
}

This may also slow down the backup in this scenario.

@YuJuncen
Contributor Author

There isn't a trivial fix to the problem. I think we must first figure out why PD returns the wrong store state (note that StoreState_Up is the default value for StoreState).
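A tiny sketch of that last point (assuming the github.com/pingcap/kvproto/pkg/metapb package; not BR or PD code):

import (
    "fmt"

    "github.com/pingcap/kvproto/pkg/metapb"
)

func main() {
    // StoreState_Up is the zero value of the StoreState enum, so a Store whose
    // State field was never populated (or was lost somewhere along the way)
    // still reads as Up.
    var s metapb.Store
    fmt.Println(s.GetState() == metapb.StoreState_Up) // true
}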

@jebter jebter added component/br This issue is related to BR of TiDB. severity/critical labels Apr 17, 2023
@ti-chi-bot ti-chi-bot added may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 labels Apr 17, 2023
@tonyxuqqi

tonyxuqqi commented Apr 17, 2023

@nolouch, please take a look. The question is what's the best way to tell whether a store is healthy. Apparently StoreState_Up is not always accurate, since it may take up to 30 minutes for the state to become StoreState_Down.

@bufferflies
Contributor

BR gets store information from PD through the gRPC client, while pd-ctl uses the HTTP interface. In the HTTP handler, PD checks the store's down status:
https://github.com/tikv/pd/blob/c40e319f50822678cda71ae62ee2fd70a9cac010/server/api/store.go#L144-L151
The gRPC handler does not contain this check:
https://github.com/tikv/pd/blob/c40e319f50822678cda71ae62ee2fd70a9cac010/server/api/store.go#L144-L151
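Roughly, the HTTP-side check works like the sketch below (a paraphrase, not the exact PD code; deriveStateName and the maxStoreDownTime parameter are illustrative, and the core import path may differ by PD version):

import (
    "time"

    "github.com/pingcap/kvproto/pkg/metapb"
    "github.com/tikv/pd/server/core"
)

// deriveStateName turns the raw metapb state into the state name reported by
// the HTTP API: a store whose persisted state is still Up but which has not
// sent a heartbeat for longer than max-store-down-time is shown as Down. The
// gRPC GetStore path skips this step and returns the raw metapb.Store, which
// still says Up.
func deriveStateName(store *core.StoreInfo, maxStoreDownTime time.Duration) string {
    stateName := store.GetState().String()
    if store.GetState() == metapb.StoreState_Up && store.DownTime() > maxStoreDownTime {
        stateName = "Down"
    }
    return stateName
}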

@YuJuncen YuJuncen added affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. affects-6.1 affects-6.5 and removed may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 labels Apr 17, 2023