
Performance Issues. TCP Timeouts and slow updates. #1045

Closed
adamzr opened this issue May 10, 2023 · 7 comments

Comments

@adamzr

adamzr commented May 10, 2023

I have a GKE cluster running KNative 1.7.2 and Kourier. I updated Kourier to 1.10.0 to try and resolve this issue. But, it persists.

I am seeing a couple of problems:

  1. Very slow KService updates. It took 12 minutes for the KService to switch to the new revision, even though the pod started right away and became ready within a minute or two.

  2. TCP connection timeouts when calling my KServices. I believe the client timeout is 10 seconds.

I am running 10 replicas of the gateways with these resources:

Limits:
  cpu:     500m
  memory:  500Mi
Requests:
  cpu:     200m
  memory:  200Mi

I am running 3 net-kourier-controllers with these resources:

Limits:
  cpu:     500m
  memory:  4Gi
Requests:
  cpu:     200m
  memory:  200Mi

I have 823 KServices. Some are used much more than others.

These performance problems are giving me high error rates. How can I diagnose the cause and fix it?

Thanks!

@adamzr
Author

adamzr commented May 10, 2023

This might be related to #941

@dprotaso
Contributor

Cc @nak3 @skonto

@nak3
Contributor

nak3 commented May 11, 2023

  1. Very slow KService updates. It took 12 minutes for the KService to switch to the new revision, even though the pod started right away and became ready within a minute or two.

So, if I understand correctly, even after updating the KService resource, you're still receiving responses from the old revision when accessing the backend. If that's the case, it's possible that net-kourier did not synchronize with the 10 gateways quickly enough. Have you already tried increasing the CPU limits/requests on both net-kourier-controller and the gateway, or using fewer gateway pods?

  2. TCP connection timeouts when calling my KServices. I believe the client timeout is 10 seconds.

Could you please provide the exact error message? Is the timeout error caused by the client's 10-second timeout? I think a 10-second client timeout is too short, because the response time includes the ksvc pod's scale-up. Also, when the issue occurs, could you confirm the following:

a. Does Kourier Gateway have any access logs?
b. Has the ksvc's pod started up?

  • If a) has no logs, it may not be a Kourier issue and could instead be a node or infra-network problem.
  • If a) has logs and b) the ksvc pod has already started up, the issue could be in the activator or the ksvc pod itself.
  • If a) has logs but b) the ksvc pod hasn't started up, then it may be a Kourier Gateway performance issue. In that case, you could try increasing the CPU limits/requests to address it.
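The two checks above can be run with kubectl. The namespace and deployment names below assume a default Kourier install (`kourier-system` / `3scale-kourier-gateway`) and may differ in your cluster; `<ksvc-namespace>` and `<ksvc-name>` are placeholders for your own service.

```shell
# a) Check the Kourier gateway (Envoy) access logs around the failure window.
#    Names assume the default install; adjust namespace/deployment to your setup.
kubectl logs -n kourier-system deployment/3scale-kourier-gateway --since=10m

# b) Confirm the KService's pod actually started up and became ready.
kubectl get pods -n <ksvc-namespace> \
  -l serving.knative.dev/service=<ksvc-name> -o wide
```

An access-log line appearing in a) without a ready pod in b) points toward the gateway/activator path rather than the workload.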

@nak3
Contributor

nak3 commented May 11, 2023

As I wrote above, overall I think you should first increase the resources and the client timeout.

A controller limit of 0.5 CPU and 500Mi memory would not be enough for 823 KServices and 10 gateways.
Also, a 10-second client timeout is too short for scale-to-zero KServices.
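As a rough starting point, a resource bump on the net-kourier-controller deployment could look like the fragment below; the values are illustrative assumptions, not tested recommendations.

```yaml
# Illustrative values only - tune for your own KService count and churn rate.
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 4Gi
```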

@adamzr
Author

adamzr commented May 13, 2023

@nak3 Thank you for responding. I switched our system to Contour in the meantime to work around the issues we were having, so I can't immediately test your suggestions. I do have a few questions and comments, though.

  • We had tried increasing CPU and memory for the net-kourier-controller. The usage was not high after we did the increases, so we didn't think we needed additional increases.
  • Why would fewer gateway pods help? I thought more would help.
  • I was wrong about the client having a 10-second timeout; it was actually 30 seconds. How does KNative / Kourier handle TCP connections: does the connection not complete until the pod has started, or does it connect to the activator earlier?
  • We were seeing the pod start up quickly, but still not respond quickly. I did not check the gateway access logs. What would the access log entry look like?

While we are not using Kourier now, we may go back. I would like to better understand how to scale Kourier. Do you have a rule of thumb for how many gateways per service or per request volume? Can we scale based on CPU and Memory usage?

Thanks so much!

@nak3
Contributor

nak3 commented Jun 2, 2023

I see. I'm glad to hear that you could work around the issue.
Another user's comment says Contour does not scale, so they are using Kourier instead in production [1], which is why I'd like to understand your environment better.

Please feel free to re-open the issue if you come back to Kourier.

[1] #744 (comment)

We've tested Istio Gateway and Contour and we've noticed issues when it comes to deployment time and even scaling, especially for Contour: it does not scale very well once we reached around 700-1000 KSVC, and the deployment was taking too much time even without enabling proxy-protocol or local rate limit.

@nak3 nak3 closed this as completed Jun 2, 2023
@adamzr
Author

adamzr commented Jun 5, 2023

Thanks @nak3, since things are working for us on Contour, we'll stay there for now. But it is good to know that there may be issues with it as we scale.

In an ideal world, I'd expect Kourier to scale automatically using HPAs. If there are manual changes needed as it scales, it would be nice to have a document explaining how Kourier's individual components should be scaled.
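For reference, gateway autoscaling along those lines might look like the sketch below. The target deployment name assumes a default Kourier install (`3scale-kourier-gateway`), and the replica bounds and CPU threshold are placeholders, not recommendations.

```yaml
# Hypothetical HPA for the Kourier gateway; names and values are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kourier-gateway
  namespace: kourier-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: 3scale-kourier-gateway
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that an HPA scales only the gateway data plane; the net-kourier-controller would still need its resources sized for the number of KServices it reconciles.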

I appreciate your help! Thank you!
