TiFlash cop thread pool can not handle request with high QPS #3696

Closed
Tracked by #6438
JaySon-Huang opened this issue Dec 21, 2021 · 9 comments
Labels
type/enhancement Issue or PR for enhancement

Comments

@JaySon-Huang
Contributor

JaySon-Huang commented Dec 21, 2021

Enhancement

One of our users executes queries like select count(*) from table where (`url` like 'xxx%') and `uid` in (....) in TiDB. If the `uid` list contains more than several hundred values, TiDB chooses to route the request to TiFlash, at about 15 QPS.
image

However, TiFlash cannot handle those queries quickly enough, so the requests queue up in the coprocessor thread pool. As requests pile up, TiDB sees all of them as timed out and retries, which sends even more requests to TiFlash. Eventually, TiFlash runs out of memory.

"cop_dag" means the coprocessor requests that are being executed and ...
image

"cop" means the sum of the coprocessor requests that are being executed and the requests that are waiting in the queue.
image

@JaySon-Huang JaySon-Huang added the type/enhancement Issue or PR for enhancement label Dec 21, 2021
@JaySon-Huang
Contributor Author

JaySon-Huang commented Dec 22, 2021

I think if the number of pending cop requests exceeds k times the size of the coprocessor thread pool, TiFlash should simply reply with something like "TiFlash is busy" to the caller instead of leaving the request pending in the thread pool. That way TiFlash can recover from a large number of useless retry requests.
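
To make the proposal concrete, here is a minimal sketch of that kind of load shedding. It is only an illustration under assumptions: `AdmissionGate`, `BusyError`, `try_admit`, and `release` are hypothetical names, not TiFlash APIs, and the real implementation would live in the C++ coprocessor path.

```python
import threading


class BusyError(Exception):
    """Raised when a request is rejected instead of being queued."""


class AdmissionGate:
    """Hypothetical gate: rejects new requests once pending work reaches k * pool_size."""

    def __init__(self, pool_size: int, k: int) -> None:
        self.limit = k * pool_size  # maximum number of requests allowed to be pending
        self._pending = 0
        self._lock = threading.Lock()

    @property
    def pending(self) -> int:
        with self._lock:
            return self._pending

    def try_admit(self) -> None:
        """Count the request as pending, or fail fast if the limit is already reached."""
        with self._lock:
            # Admitting one more would push pending above k * pool_size, so reject.
            if self._pending >= self.limit:
                raise BusyError("TiFlash is busy")
            self._pending += 1

    def release(self) -> None:
        """Call when a request finishes (or is abandoned) to free its slot."""
        with self._lock:
            self._pending -= 1


if __name__ == "__main__":
    gate = AdmissionGate(pool_size=4, k=2)
    for _ in range(8):
        gate.try_admit()  # the first 8 requests are accepted
    try:
        gate.try_admit()  # the 9th is rejected immediately
    except BusyError as err:
        print(err)
```

The point of failing fast is that a rejected request costs almost nothing, whereas a queued request that the caller has already given up on still occupies memory and a thread pool slot.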

@LittleFall
Contributor

> I think if the number of pending cop requests exceeds k times the size of the coprocessor thread pool, TiFlash should simply reply with something like "TiFlash is busy" to the caller instead of leaving the request pending in the thread pool. That way TiFlash can recover from a large number of useless retry requests.

This behavior LGTM.

@JaySon-Huang
Contributor Author

Are there any plans to implement this behavior? @LittleFall

@JaySon-Huang
Contributor Author

We should also take the Elastic Thread Pool/Dynamic Thread Pool model into consideration. /cc @bestwoody @fuzhe1989

@fuzhe1989
Contributor

@JaySon-Huang It depends on both TiFlash and TiDB. Do we use an exponential backoff retry strategy?
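
For reference, a minimal sketch of what exponential backoff with jitter looks like on the caller side. The thread does not say what TiDB actually does, so the function names, the 100 ms base, and the 1000 ms cap below are all assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_ceiling_ms(attempt: int, base_ms: int, cap_ms: int) -> int:
    """Upper bound of the wait before retry number `attempt` (0-based).

    The bound doubles on every attempt and is clamped to cap_ms.
    """
    return min(cap_ms, base_ms * (2 ** attempt))


def backoff_delay_ms(attempt: int, base_ms: int, cap_ms: int,
                     rng: Optional[random.Random] = None) -> float:
    """Pick a random wait in [0, ceiling] ("full jitter").

    Randomizing the wait keeps many callers from retrying at the same instant.
    """
    rng = rng or random.Random()
    return rng.uniform(0, backoff_ceiling_ms(attempt, base_ms, cap_ms))


if __name__ == "__main__":
    # Assumed values: base 100 ms, cap 1000 ms, 5 retries.
    for attempt in range(5):
        ceiling = backoff_ceiling_ms(attempt, base_ms=100, cap_ms=1000)
        delay = backoff_delay_ms(attempt, base_ms=100, cap_ms=1000)
        print(f"retry {attempt}: wait {delay:.0f} ms (ceiling {ceiling} ms)")
```

With a fixed retry interval, every timed-out request comes back at the same rate it was sent; with a growing interval, the retry load on an overloaded server drops over time.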

@JaySon-Huang
Contributor Author

JaySon-Huang commented Dec 30, 2021

Reproduced when running a QA test that uses only 8c for TiKV. The CPU usage of all TiKV nodes is high, which makes all read index requests time out.

image
image
image

@JaySon-Huang
Contributor Author

JaySon-Huang commented Jun 29, 2022

Another similar problem from asktug: https://asktug.com/t/topic/694336/24
In this case, the number of cop tasks reached about 30k, which made TiFlash hit the limit in /proc/sys/vm/max_map_count. No more threads could be created, and TiFlash crashed.
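
For anyone checking a host against this failure mode, a small sketch for reading that limit on Linux. `read_max_map_count` is a hypothetical helper, not part of TiFlash; it returns `None` where the file is missing (for example on non-Linux systems). The relevance is that each thread's stack is typically backed by its own memory mapping, so a very large number of threads can exhaust the per-process mapping limit.

```python
from pathlib import Path
from typing import Optional


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> Optional[int]:
    """Return the per-process limit on memory mappings, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not an integer.
        return None


if __name__ == "__main__":
    limit = read_max_map_count()
    if limit is None:
        print("max_map_count is not available on this system")
    else:
        print(f"vm.max_map_count = {limit}")
```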

@LittleFall
Contributor

LittleFall commented Jan 13, 2023

Closed because #6438 has been basically implemented.

4 participants