Performance Issues. TCP Timeouts and slow updates. #1045
Comments
This might be related to #941
So, if I understand correctly, even after updating the KService resource, you're still receiving responses from the old revision when accessing the backend. If that's the case, it's possible that net-kourier did not synchronize with the 10 Gateway pods quickly enough. Have you already tried increasing the CPU limits/requests on both net-kourier-controller and the gateway, or using fewer Gateway pods?
Could you please provide the exact error message? Is the timeout error caused by the client's 10-second timeout? I think a 10-second client timeout is too short, because the response time includes the time it takes the ksvc pod to scale up. Also, when the issue occurs, could you confirm the following: a. Does Kourier Gateway have any access logs?
As I wrote above, overall I think you should first increase the resources and the client timeout. A controller with 0.5 CPU and 500Mi of memory would not be enough for 823 KServices and 10 gateways.
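As a rough illustration of the client-side part of that suggestion, here is a minimal Python sketch of a caller that allows more than 10 seconds per request and retries with backoff. The 60-second default, the retry count, and the helper names `fetch_url` and `call_with_retry` are assumptions for illustration only; they are not taken from the reporter's client or from Kourier.

```python
import time
import urllib.request


def fetch_url(url, timeout_s=60.0):
    """GET url, allowing up to timeout_s for each blocking socket operation.

    60 seconds is an assumed value, chosen only to leave room for a
    scale-from-zero cold start; tune it for your own services.
    """
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status, resp.read()


def call_with_retry(fn, attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(); on OSError, retry with exponential backoff.

    TimeoutError and urllib.error.URLError are both OSError subclasses,
    so connection timeouts are retried. HTTPError is one too, which means
    HTTP error statuses are also retried by this sketch.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)


# Hypothetical usage (the URL is a placeholder):
# status, body = call_with_retry(lambda: fetch_url("http://my-ksvc.example.com"))
```

Retrying only hides the symptom, so it is worth logging how often a retry was needed while the gateway and controller resources are being adjusted.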
@nak3 Thank you for responding. I switched our system to Contour in the meantime to work around the issues we were having, so I can't immediately test your suggestions. I do have a few questions and comments, though.
While we are not using Kourier now, we may go back. I would like to better understand how to scale Kourier. Do you have a rule of thumb for how many gateways per service or per request volume? Can we scale based on CPU and Memory usage? Thanks so much!
I see. I'm glad to hear that you could work around the issue. Please feel free to re-open the issue if you go back to Kourier. [1] #744 (comment)
Thanks @nak3, since things are working for us on Contour, we'll stay there for now. But it is good to know that there may be issues with it as we scale. In an ideal world, I'd expect Kourier to scale automatically using HPAs. If there are manual changes I need to make as it scales, it would be nice to have a document that explains how Kourier's individual components should be scaled. I appreciate your help! Thank you!
I have a GKE cluster running Knative 1.7.2 and Kourier. I updated Kourier to 1.10.0 to try to resolve this issue, but it persists.
I am seeing a couple of problems:
Very slow updates to a KService. It took 12 minutes for the KService to switch to the new revision, even though the pod started right away and became ready within a minute or two.
TCP connection timeouts when calling my KServices. I believe the client timeout is 10 seconds.
I am running 10 replicas of the gateways with these resources:
I am running 3 net-kourier-controllers with these resources:
I have 823 KServices. Some are used much more than others.
These performance problems are giving me high error rates. How can I diagnose the cause and fix it?
Thanks!
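One possible first step in diagnosing this is to separate TCP connect failures from slow responses. The Python sketch below times a single TCP connection attempt against a host and port; the helper name `time_tcp_connect` and the example host are hypothetical, and it only shows whether the connection itself is what times out, not why.

```python
import socket
import time


def time_tcp_connect(host, port, timeout_s=10.0):
    """Return (ok, elapsed_seconds) for one TCP connection attempt.

    ok is False if the connection is refused, times out, or fails in
    any other way reported as an OSError.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


# Hypothetical usage (host and port are placeholders for the gateway address):
# ok, elapsed = time_tcp_connect("kourier-gateway.example.com", 80)
# print(f"connected={ok} after {elapsed:.3f}s")
```

If connects fail or take close to the full timeout, the problem is likely in front of the KService pods (gateway or load balancer); if connects are fast but responses are slow, the revision's own startup and scaling are the more likely place to look.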