Huge CPU usage spike in protokube after cluster scale up #7427

Closed
jacksontj opened this issue Aug 16, 2019 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@jacksontj
Contributor

jacksontj commented Aug 16, 2019

1. What kops version are you running? The command kops version, will display
this information.

Version c54e758 (git-c54e758d4)

which is 1.12.2 with a few backports (made before the 1.12.3 release, but the same as 1.12.3)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Seemingly not relevant, but just in case:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1", GitCommit:"d4ab47518836c750f9949b9e0d387f20fb92260b", GitTreeState:"clean", BuildDate:"2018-04-12T14:26:04Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:25:46Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
We changed some of the instancegroups to scale up somewhat significantly (added about 60 nodes)

5. What happened after the commands executed?
Protokube's CPU usage spiked to consume ~1.5 cores (it usually sits around 6% of a single core).

[Image: Screenshot from 2019-08-16 10-39-42]

A few things to note from the graph:

  • some nodes were impacted significantly more than others as far as CPU usage
  • some nodes recovered much quicker than others

I've done some digging around upstream and have found a few issues that are maybe related, but don't exactly inspire confidence :/

6. What did you expect to happen?

  1. I expect the CPU usage of protokube's gossip layer (mesh) to scale linearly with the cluster size
  2. Ideally there would be options to configure cgroup limits for protokube, so that this sort of issue doesn't impact the rest of the host
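
To make the cgroup idea in point 2 concrete, here is a rough sketch of how a CPU cap in cores maps onto the CFS bandwidth settings of a cgroup v1 cpu controller (`cpu.cfs_quota_us` and `cpu.cfs_period_us`). The helper name `cfsQuotaMicros` and the example limits are hypothetical. How such a limit would actually be wired into protokube by kops is not covered here and is an assumption left open.

```go
package main

import (
	"errors"
	"fmt"
)

// cfsQuotaMicros converts a CPU limit expressed in cores into a CFS quota:
// the microseconds of CPU time a cgroup may use per scheduling period.
// Hypothetical helper for illustration; it is not part of kops or protokube.
func cfsQuotaMicros(cores float64, periodMicros int64) (int64, error) {
	if cores <= 0 || periodMicros <= 0 {
		return 0, errors.New("cores and period must be positive")
	}
	return int64(cores * float64(periodMicros)), nil
}

func main() {
	// 100ms is the usual default for cpu.cfs_period_us.
	const periodMicros = 100000

	// Example limits only; the right cap for protokube is an open question.
	for _, cores := range []float64{0.25, 0.5, 1.5} {
		quota, err := cfsQuotaMicros(cores, periodMicros)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("limit %.2f cores: cpu.cfs_quota_us=%d cpu.cfs_period_us=%d\n",
			cores, quota, periodMicros)
	}
}
```

With a cap like this in place, a runaway gossip loop would be throttled to its quota instead of consuming ~1.5 cores of the host, at the cost of slower convergence for protokube itself.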
@jacksontj
Contributor Author

I was able to repro this behavior in a test cluster and I have submitted a fix to upstream (weaveworks/mesh#107) -- waiting on feedback there.

@jacksontj
Contributor Author

I was able to make a build of protokube with the fix I proposed, but I am continuing to run into issues. The next one I ran into is weaveworks/mesh#108 -- TL;DR: the gossip protocol is bad (its cost seemingly scales exponentially with cluster size). So I'm going to spend a bit more time looking for a mechanism to reduce the cost -- but to get past this we'll likely have to move off of mesh (#7436)
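
One rough way to check a "seemingly exponential" impression is to compare cost at two cluster sizes on a log-log basis: if cost grows like n^k, the slope between two measurements estimates k (about 1 for linear, about 2 for quadratic, and ever larger as n grows if the growth is truly exponential). The sketch below assumes you have CPU measurements at two cluster sizes. The function name `scalingExponent` and the sample numbers are hypothetical and are not measurements from this issue.

```go
package main

import (
	"fmt"
	"math"
)

// scalingExponent estimates k in cost ~ n^k from two (size, cost) samples
// using the log-log slope. It returns NaN when the inputs cannot yield a slope.
// Hypothetical helper for illustration; not part of mesh or protokube.
func scalingExponent(n1, cost1, n2, cost2 float64) float64 {
	if n1 <= 0 || n2 <= 0 || cost1 <= 0 || cost2 <= 0 || n1 == n2 {
		return math.NaN()
	}
	return math.Log(cost2/cost1) / math.Log(n2/n1)
}

func main() {
	// Made-up samples: cluster size and CPU cost in arbitrary units.
	samples := []struct{ nodes, cost float64 }{
		{50, 1.0},
		{100, 4.0},
		{200, 16.0},
	}

	// Estimate the exponent between each pair of consecutive samples.
	// A roughly constant k suggests polynomial growth; a k that keeps
	// rising with cluster size points at something worse.
	for i := 1; i < len(samples); i++ {
		a, b := samples[i-1], samples[i]
		k := scalingExponent(a.nodes, a.cost, b.nodes, b.cost)
		fmt.Printf("%v -> %v nodes: estimated exponent k = %.2f\n", a.nodes, b.nodes, k)
	}
}
```

Two points are only enough for a crude estimate; measuring at several cluster sizes and looking at how k changes gives a more trustworthy picture.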

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 19, 2019
@jacksontj
Contributor Author

This has been taken care of in #7436
