
Contour seems to stop sending updates when processing a large number of services #424

Closed
alexbrand opened this issue Jun 5, 2018 · 5 comments
Labels
blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it.

Comments

@alexbrand
Contributor

While doing performance testing, I ran into an issue where a subset of the envoy pods were not getting configured, which in turn resulted in bad benchmark results.

Environment setup:

  • Decoupled envoy/contour deployment
  • 2 contour pods
  • 5 envoy pods

The test involved creating 5,000 services in a backend Kubernetes cluster, waiting until they were all discovered by Gimbal, and finally running wrk2 against the Gimbal cluster.

The following chart seems to indicate that only one of the envoys was getting CDS updates, whereas the rest were not:

[chart: CDS updates per Envoy pod]

I then noticed that one Contour instance's memory consumption was different from the other's. It seems like we might have a memory leak:

[chart: memory usage per Contour pod]

The only interesting bits from the logs were a bunch of these:

time="2018-06-04T14:17:07Z" level=info msg="event channel is full, len: 128, cap: 128" context=buffer

The rest of the logs were stream_wait and skipping update messages.

I haven't had a chance to try reproducing the issue, but will most likely try this week. Happy to provide any other information that might be useful.

@davecheney
Contributor

Thanks for reporting this issue. Can you please test with 0.6.0.alpha.2 (I'm going to cut it soon)? My hope is that the bug fixed in #423 may solve the lack of updates.

@alexbrand I'm going to ask you to raise a separate issue for the memory usage. That is not expected, but I don't want to conflate these two issues.

time="2018-06-04T14:17:07Z" level=info msg="event channel is full, len: 128, cap: 128" context=buffer

This message is, to be honest, misleading. "The channel is full" is just informational: it's basically saying the k8s watcher ran ahead of processing by 128 items, and now processing those items is blocking further reads from the watcher. It says nothing more than that Contour is busy, which may or may not indicate a problem; I expect this message and have seen it in my own scale testing.
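To illustrate the pattern being described, here is a minimal sketch (an assumed structure for illustration, not Contour's actual code) of a buffered event channel between a watcher and its consumer: once the consumer lags by the channel's capacity, the producer's next send blocks, and a message like the one above would be logged.

```go
package main

import (
	"fmt"
	"time"
)

type event struct{ name string }

func main() {
	// Buffered channel between the "watcher" and the "translator".
	// The capacity of 128 mirrors the "cap: 128" in the log line above.
	events := make(chan event, 128)

	// Producer: stands in for the Kubernetes watcher feeding events.
	go func() {
		for i := 0; i < 500; i++ {
			if len(events) == cap(events) {
				// Informational only: the consumer is busy, so the next
				// send blocks until it catches up. Nothing is dropped.
				fmt.Printf("event channel is full, len: %d, cap: %d\n",
					len(events), cap(events))
			}
			events <- event{name: fmt.Sprintf("service-%d", i)}
		}
		close(events)
	}()

	// Consumer: stands in for the translator, deliberately slower than
	// the producer so the buffer fills up.
	for range events {
		time.Sleep(10 * time.Millisecond)
	}
}
```

The key point is that a full buffer applies backpressure (the watcher blocks) rather than dropping events.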

However, in alpha.2 the processing of endpoints no longer goes through the buffer (see #404), so this message should be less prevalent. I'm not sure what the right resolution is. I'm loath to remove the message; even though it's noisy, it's a good way of finding out when the translator is busy.

@davecheney davecheney added the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Jun 5, 2018
@rosskukulinski
Contributor

@davecheney would the max size of the channel & current size of the channel be useful metrics to expose? As an operator responsible for managing Contour, is this something I might need to have visibility into?
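If such metrics were wanted, one possible approach using prometheus/client_golang would be to register gauges for the buffer's current length and capacity and sample them periodically. This is purely a hypothetical sketch; the metric names below are invented for illustration and are not existing Contour metrics.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Hypothetical metric names, chosen for this example only.
	bufferDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "contour_event_buffer_depth",
		Help: "Current number of events waiting in the buffer (hypothetical).",
	})
	bufferCapacity = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "contour_event_buffer_capacity",
		Help: "Capacity of the event buffer (hypothetical).",
	})
)

func main() {
	prometheus.MustRegister(bufferDepth, bufferCapacity)

	// Stand-in for the real event buffer.
	events := make(chan interface{}, 128)
	bufferCapacity.Set(float64(cap(events)))

	// Sample the channel depth periodically rather than on every send,
	// so the instrumentation stays off the hot path.
	go func() {
		for range time.Tick(time.Second) {
			bufferDepth.Set(float64(len(events)))
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```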

@davecheney
Contributor

davecheney commented Jun 5, 2018 via email

@davecheney
Contributor

@alexbrand any update after trying 0.6-alpha.1?

@alexbrand
Contributor Author

Closing this issue as it did not recur while testing with 0.6-alpha.1.

sunjayBhatia added a commit that referenced this issue Jan 30, 2023
Also refactor so both tests actually have to pass not just the second
one and fix gateway nodeport example

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>