
Contour seems to stop sending updates when processing a large number of services #424

Closed
alexbrand opened this issue Jun 5, 2018 · 5 comments
Labels
blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it.

Comments

@alexbrand
Contributor

While doing performance testing, I ran into an issue where a subset of the envoy pods were not getting configured, which in turn resulted in bad benchmark results.

Environment setup:

  • Decoupled envoy/contour deployment
  • 2 contour pods
  • 5 envoy pods

The test involved creating 5,000 services in a backend Kubernetes cluster, waiting until they were all discovered by Gimbal, and finally running wrk2 against the Gimbal cluster.

The following chart seems to indicate that only one of the envoys was getting CDS updates, whereas the rest were not:

[chart: CDS updates per Envoy pod]

I then noticed that one Contour instance's memory consumption was different from the other's. It seems like we might have a memory leak:

[chart: memory usage per Contour pod]

The only interesting bits from the logs were a bunch of these:

time="2018-06-04T14:17:07Z" level=info msg="event channel is full, len: 128, cap: 128" context=buffer

The rest of the logs were stream_wait and skipping update messages.

I haven't had a chance to try reproducing the issue, but will most likely try this week. Happy to provide any other information that might be useful.

@davecheney
Contributor

Thanks for reporting this issue. Can you please test with 0.6.0.alpha.2 (I'm going to cut it soon)? My hope is that the bug fixed in #423 may solve the lack of updates.

@alexbrand I'm going to ask you to raise a separate issue for the memory usage. That is not expected, but I don't want to conflate these two issues.

time="2018-06-04T14:17:07Z" level=info msg="event channel is full, len: 128, cap: 128" context=buffer

This message is, to be honest, misleading. "The channel is full" is just informational: it's basically saying the k8s watcher ran ahead of processing by 128 items, and now processing those items is blocking further reads from the watcher. It says nothing more than that Contour is busy, which may or may not indicate a problem; I expect this message and have seen it in my own scale testing.
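To illustrate the pattern being described, here is a minimal sketch (an assumed structure for illustration, not Contour's actual code) of a buffered event channel between a watcher and its consumer: once the consumer lags by the channel's capacity, the producer's next send blocks, and a message like the one above would be logged.

```go
package main

import (
	"fmt"
	"time"
)

type event struct{ name string }

func main() {
	// Buffered channel between the "watcher" and the "translator".
	// The capacity of 128 mirrors the "cap: 128" in the log line above.
	events := make(chan event, 128)

	// Producer: stands in for the Kubernetes watcher feeding events.
	go func() {
		for i := 0; i < 500; i++ {
			if len(events) == cap(events) {
				// Informational only: the consumer is busy, so the next
				// send blocks until it catches up. Nothing is dropped.
				fmt.Printf("event channel is full, len: %d, cap: %d\n",
					len(events), cap(events))
			}
			events <- event{name: fmt.Sprintf("service-%d", i)}
		}
		close(events)
	}()

	// Consumer: stands in for the translator, deliberately slower than
	// the producer so the buffer fills up.
	for range events {
		time.Sleep(10 * time.Millisecond)
	}
}
```

The key point is that a full buffer applies backpressure (the watcher blocks) rather than dropping events.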

However, in alpha.2 the processing of endpoints no longer goes through the buffer (see #404), so this message should be less prevalent. I'm not sure what the right resolution is. I'm loath to remove the message; even though it's noisy, it's a good way of finding out when the translator is busy.

@davecheney davecheney added the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Jun 5, 2018
@rosskukulinski
Contributor

@davecheney would the max size of the channel & current size of the channel be useful metrics to expose? As an operator responsible for managing Contour, is this something I might need to have visibility into?
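If such metrics were wanted, one possible approach using prometheus/client_golang would be to register gauges for the buffer's current length and capacity and sample them periodically. This is purely a hypothetical sketch; the metric names below are invented for illustration and are not existing Contour metrics.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Hypothetical metric names, chosen for this example only.
	bufferDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "contour_event_buffer_depth",
		Help: "Current number of events waiting in the buffer (hypothetical).",
	})
	bufferCapacity = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "contour_event_buffer_capacity",
		Help: "Capacity of the event buffer (hypothetical).",
	})
)

func main() {
	prometheus.MustRegister(bufferDepth, bufferCapacity)

	// Stand-in for the real event buffer.
	events := make(chan interface{}, 128)
	bufferCapacity.Set(float64(cap(events)))

	// Sample the channel depth periodically rather than on every send,
	// so the instrumentation stays off the hot path.
	go func() {
		for range time.Tick(time.Second) {
			bufferDepth.Set(float64(len(events)))
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```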

@davecheney
Contributor

davecheney commented Jun 5, 2018 via email

@davecheney
Contributor

@alexbrand any update after trying 0.6-alpha.1?

@alexbrand
Contributor Author

Closing this issue as it did not recur while testing with 0.6-alpha.1.

sunjayBhatia added a commit that referenced this issue Jan 30, 2023
Also refactor so both tests actually have to pass not just the second
one and fix gateway nodeport example

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>