
Single table with 30k regions, two changefeeds sync this table, upstream insert tps: 8k, checkpoint moves only 5s after 1+hour #2055

Closed
Tammyxia opened this issue Jun 15, 2021 · 6 comments
Labels: area/ticdc (Issues or PRs related to TiCDC) · severity/moderate · type/bug (The issue is confirmed as a bug)

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • The upstream has a single table of 175 GB with 30K regions; meanwhile, sysbench runs inserts against the upstream at 6-8k TPS.
  • Two changefeeds sync this single table to different downstreams (see the hedged setup sketch after item 4 below). --> Both changefeeds have the problem that their checkpoints lag far behind.
  • Other changefeeds syncing other tables work normally, with up-to-date checkpoints.
  2. What did you expect to see?
  • The checkpoint stays at the latest time, or lags by less than 5 minutes, since this is a real-time sync.
  3. What did you see instead?
  • Changefeeds "replication-task-sysbench2" and "replication-task-b1" barely advance. Their source is the single table with 30k regions.
    [screenshot: checkpoints of the two changefeeds barely advancing]
  4. Versions of the cluster

    • Upstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):

      ```
      Release Version: v5.1.0-20210608
      Edition: Community
      Git Commit Hash: 47f0f15b14ed54fc2222f3e304e29df7b05e6805
      Git Branch: heads/refs/tags/v5.1.0-20210608
      UTC Build Time: 2021-06-08 07:21:52
      GoVersion: go1.16.4
      Race Enabled: false
      TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
      Check Table Before Drop: false
      ```

    • TiCDC version (execute `cdc version`):

      ```
      Release Version: v5.1.0-20210611
      Git Commit Hash: 3be68052276cabba42f98389798415508baa9c63
      Git Branch: heads/refs/tags/v5.1.0-20210611
      UTC Build Time: 2021-06-11 07:48:31
      Go Version: go version go1.16.4 linux/amd64
      ```
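
For illustration, here is a minimal sketch of the setup described in item 1, not the exact commands used in this test: the PD address, hosts, passwords, and sysbench parameters are placeholders, and only the changefeed IDs and the overall shape (two changefeeds over the same table plus a continuous insert workload) are taken from the report.

```
# Two changefeeds replicating the same upstream cluster (and therefore the same
# 30k-region table) to two different downstream MySQL-compatible clusters.
cdc cli changefeed create --pd=http://<pd-host>:2379 \
  --sink-uri="mysql://root:<password>@<downstream-1>:3306/" \
  --changefeed-id="replication-task-sysbench2"
cdc cli changefeed create --pd=http://<pd-host>:2379 \
  --sink-uri="mysql://root:<password>@<downstream-2>:3306/" \
  --changefeed-id="replication-task-b1"

# Continuous insert workload against the upstream (roughly 6-8k TPS in the report);
# table size and thread count are placeholders.
sysbench oltp_insert --mysql-host=<upstream-tidb> --mysql-port=4000 \
  --mysql-user=root --mysql-db=sbtest --tables=1 --table-size=100000000 \
  --threads=64 --time=0 run
```
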
@Tammyxia Tammyxia added type/bug The issue is confirmed as a bug. severity/critical labels Jun 15, 2021
amyangfei (Contributor) commented Jun 15, 2021

The replication becomes normal after restarting the cdc servers.

[screenshot: screenshot-20210615-213657]

Region initialization rate (count per minute) in TiKV on server 172.16.6.225; a sketch of how such counts can be derived from the TiKV log follows the list:

   2266 19:51
   2066 19:50
   1834 19:52
   1434 19:47
    413 18:25
    341 18:08
    303 18:14
    292 18:10
    284 18:16
    264 18:12
    261 18:09
    261 18:07
    257 18:19
    248 18:13
    247 18:15
    245 18:23
    242 18:18
    240 18:06
    231 18:11
    229 19:48
    222 18:22
    220 18:05
    216 18:17
    215 18:24
    180 18:04
    179 18:20
    171 18:21
    148 19:05
    129 18:26
    109 19:04
    102 18:33
     87 18:55
     87 18:34
     80 19:02
     78 18:59
     78 18:02
     74 18:27
     73 18:03
     70 18:53
     70 18:49
     69 18:54
     67 18:01
     65 18:00
     62 19:03
     57 19:00
     56 18:58
     54 19:08
     48 19:18
     46 19:01
     46 18:28
     45 18:50
     45 18:35
     43 18:52
     39 19:06
     37 18:57
     37 18:51
     36 18:36
     35 18:31
     34 18:32
     33 18:42
     32 19:07
     30 18:40
     30 18:37
     28 18:46
     26 18:56
     26 18:30
     24 18:48
     24 18:29
     23 19:16
     22 18:45
     21 19:14
     21 18:41
     20 18:44
     20 18:38
     18 18:39
     17 18:47
     16 19:19
     16 19:09
     15 19:27
     15 19:23
     15 19:21
     14 19:24
     14 19:15
     13 19:12
     12 19:38
     12 19:34
     11 19:30
     10 19:33
     10 19:26
     10 18:43
      9 19:25
      9 19:13
      8 19:39
      8 19:28
      8 19:11
      7 19:37
      7 19:20
      6 19:41
      6 19:35
      6 19:32
      6 19:17
      5 19:22
      4 19:31
      4 19:10
      3 19:40
      3 19:29
      2 19:36
      1 19:42
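
For reference, a sketch of how per-minute counts in the shape shown above could be pulled from a TiKV log. This is not the command used in this issue, and the grep pattern is an assumption: the exact wording of the region-initialization log line differs across TiKV versions, so adjust it to whatever your tikv.log actually prints when the incremental scan for a region completes.

```
# Bucket matching log lines by minute and count them, largest bucket first.
grep -i "region" tikv.log |
  grep -i "initialized" |
  # TiKV log lines start with a timestamp like [2021/06/15 19:51:02.123 +08:00];
  # keep only the HH:MM part so lines can be grouped per minute.
  sed -E 's/^\[[0-9/]+ ([0-9]{2}:[0-9]{2}).*/\1/' |
  sort | uniq -c | sort -rn
```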

This is not a bug; it is caused by the region scan limit in TiKV, which defaults to 4 concurrent workers and 6 concurrent scan tasks. We should also take the TiKV server load into consideration, since it is easy to observe that the region initialization rate varies a lot.
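
The limits mentioned above live in the TiKV configuration. Below is a minimal sketch, assuming the option names from the `[cdc]` section of tikv.toml; names and defaults may differ between TiKV versions, so treat it as an illustration rather than the exact settings of this cluster. If these are indeed the relevant knobs, raising them could speed up initialization of a table with many regions, at the cost of extra load on TiKV.

```
# tikv.toml excerpt -- a sketch only; option names and defaults are assumed.
[cdc]
# Worker threads used for incremental (region) scans
# (the "4 concurrent workers" mentioned above).
incremental-scan-threads = 4
# Maximum number of incremental scan tasks allowed to run concurrently
# (the "6 concurrent tasks" mentioned above).
incremental-scan-concurrency = 6
```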

@overvenus, what do you think about the case where the region count of a single table is extremely large? Is there any key metric that explains why the region initialization rate varies so much?

Besides, TiCDC has a flaw: when replicating the same table to N different sinks, the same data is pushed from TiKV to TiCDC N times.

amyangfei (Contributor) commented:

Besides, https://github.com/pingcap/ticdc/pull/2078 can also cause this.

Tammyxia (Author) commented Jul 6, 2021

Changing severity to major: this reflects TiCDC's current capability, but such a large latency in this scenario defies common sense.

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Dec 6, 2021
overvenus (Member) commented Dec 14, 2021

There are some optimizations (tikv/tikv#11385) in TiKV in the next release; they should help mitigate this issue. Changing severity to severity/moderate.

nongfushanquan (Contributor) commented:

/close

ti-chi-bot (Member) commented:

@nongfushanquan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
