
Performance Issues. TCP Timeouts and slow updates. #1045

Closed
adamzr opened this issue May 10, 2023 · 7 comments

Comments

@adamzr

adamzr commented May 10, 2023

I have a GKE cluster running KNative 1.7.2 and Kourier. I updated Kourier to 1.10.0 to try and resolve this issue. But, it persists.

I am seeing a couple of problems:

  1. Very slow KService updates. It took 12 minutes for the KService to switch to the new revision, even though the pod started right away and became ready within a minute or two.

  2. TCP connection timeouts when calling my KServices. I believe the client timeout is 10 seconds.

I am running 10 replicas of the gateways with these resources:

Limits:
  cpu:     500m
  memory:  500Mi
Requests:
  cpu:     200m
  memory:  200Mi

I am running 3 net-kourier-controllers with these resources:

Limits:
  cpu:     500m
  memory:  4Gi
Requests:
  cpu:     200m
  memory:  200Mi

I have 823 KServices. Some are used much more than others.

These performance problems are giving me high error rates. How can I diagnose the cause and fix it?

Thanks!

@adamzr
Author

adamzr commented May 10, 2023

This might be related to #941

@dprotaso
Contributor

Cc @nak3 @skonto

@nak3
Contributor

nak3 commented May 11, 2023

  1. Very slow KService updates. It took 12 minutes for the KService to switch to the new revision, even though the pod started right away and became ready within a minute or two.

So, if I understand correctly, even after updating the KService resource, you're still receiving responses from the old revision when accessing the backend. If that's the case, it's possible that net-kourier did not synchronize with the 10 gateways quickly enough. Have you already tried increasing the CPU limits/requests on both net-kourier-controller and the gateway, or using fewer gateway pods?

  2. TCP connection timeouts when calling my KServices. I believe the client timeout is 10 seconds.

Could you please provide the exact error message? Is the timeout error caused by the client's 10-second timeout? I think a 10-second client timeout is too short, because the response time includes the ksvc pod's scale-up. Also, when the issue occurs, could you confirm the following:

a. Does Kourier Gateway have any access logs?
b. Has the ksvc's pod started up?

  • If a) has no logs, it may not be a Kourier issue and could instead be a node or infra-network problem.
  • If a) has logs and b) the ksvc pod has already started up, the issue could be in the activator or the ksvc pod itself.
  • If a) has logs but b) the ksvc pod hasn't started up, then it may be a Kourier Gateway performance issue. In that case, you could try increasing the CPU limits/requests to address it.
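The two checks above can be run with kubectl. The namespace and deployment names below assume a default Kourier install (`kourier-system` / `3scale-kourier-gateway`) and may differ in your cluster; `<ksvc-namespace>` and `<ksvc-name>` are placeholders for your own service.

```shell
# a) Check the Kourier gateway (Envoy) access logs around the failure window.
#    Names assume the default install; adjust namespace/deployment to your setup.
kubectl logs -n kourier-system deployment/3scale-kourier-gateway --since=10m

# b) Confirm the KService's pod actually started up and became ready.
kubectl get pods -n <ksvc-namespace> \
  -l serving.knative.dev/service=<ksvc-name> -o wide
```

An access-log line appearing in a) without a ready pod in b) points toward the gateway/activator path rather than the workload.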

@nak3
Contributor

nak3 commented May 11, 2023

As I wrote above, overall I think you should first increase the resources and the client timeout.

A controller limit of 0.5 CPU and 500Mi memory would not be enough for 823 KServices and 10 gateways.
Also, a 10-second client timeout is too short for scale-to-zero KServices.
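As a rough starting point, a resource bump on the net-kourier-controller deployment could look like the fragment below; the values are illustrative assumptions, not tested recommendations.

```yaml
# Illustrative values only - tune for your own KService count and churn rate.
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 4Gi
```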

@adamzr
Author

adamzr commented May 13, 2023

@nak3 Thank you for responding. I switched our system to Contour in the meantime to work around the issues we were having, so I can't immediately test your suggestions. I do have a few questions and comments, though.

  • We had tried increasing CPU and memory for the net-kourier-controller. The usage was not high after we did the increases, so we didn't think we needed additional increases.
  • Why would fewer gateway pods help? I thought more would help.
  • I was wrong about the client having a 10-second timeout; it was actually 30 seconds. How does KNative / Kourier handle TCP connections: does the connection not complete until the pod has started, or does it connect to the activator earlier?
  • We were seeing the pod start up quickly, but still not respond quickly. I did not check the gateway access logs. What would the access log entry look like?

While we are not using Kourier now, we may go back. I would like to better understand how to scale Kourier. Do you have a rule of thumb for how many gateways per service or per request volume? Can we scale based on CPU and Memory usage?

Thanks so much!

@nak3
Contributor

nak3 commented Jun 2, 2023

I see. I'm glad to hear that you could work around the issue.
Another user's comment says Contour does not scale, so they are using Kourier instead in production [1], which is why I'd like to understand your environment better.

Please feel free to re-open the issue if you come back to Kourier.

[1] #744 (comment)

We've tested Istio Gateway and Contour and we've noticed issues when it comes to deployment time and even scaling, especially for Contour: it does not scale very well once we reached around 700-1000 KSVC, and the deployment was taking too much time even without enabling proxy-protocol or local rate limit.

@nak3 nak3 closed this as completed Jun 2, 2023
@adamzr
Author

adamzr commented Jun 5, 2023

Thanks @nak3, since things are working for us on Contour, we'll stay there for now. But it is good to know that there may be issues with it as we scale.

In an ideal world, I'd expect Kourier to scale automatically using HPAs. If there are manual changes needed as it scales, it would be nice to have a document explaining how Kourier's individual components should be scaled.
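For reference, gateway autoscaling along those lines might look like the sketch below. The target deployment name assumes a default Kourier install (`3scale-kourier-gateway`), and the replica bounds and CPU threshold are placeholders, not recommendations.

```yaml
# Hypothetical HPA for the Kourier gateway; names and values are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kourier-gateway
  namespace: kourier-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: 3scale-kourier-gateway
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that an HPA scales only the gateway data plane; the net-kourier-controller would still need its resources sized for the number of KServices it reconciles.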

I appreciate your help! Thank you!
