
Only update peers if there was a significant change in nodes #234

Merged
merged 4 commits into projectcalico:master on Apr 23, 2019

Conversation

@cezarsa
Copy link

commented Apr 5, 2019

This PR updates confd to only recompute BGP peers when node updates include meaningful changes compared to the cached values.

A little backstory for the motivation behind this PR.

We run some moderately sized Kubernetes clusters on our infrastructure, and we were running Calico 3.0.5 without any problems. They run with Typha enabled, using Kubernetes as the datastore. We also disable the node-to-node mesh and control the mesh ourselves using the BGPPeers CRD with https://github.com/tsuru/remesher.

We decided to upgrade our development cluster to Calico 3.6.1. This cluster usually has around 63 nodes, and we have around 2500 BGPPeer entries:

$ kubectl get bgppeers.crd.projectcalico.org  | wc -l
    2622
$ kubectl get nodes | wc -l
      64

After migrating to 3.6.1 we saw a huge increase in CPU usage on all calico-node pods. The high CPU usage was tracked down to the calico-node -confd process:

[Image: confd-img1 (calico-node CPU usage after the upgrade)]

We went on to enable debug logging on the confd process and found that it was constantly trying to regenerate the rendered templates even though there were no changes to them. We also found that the peer regeneration was being triggered by node update events:

3d5210796bba[22368]: 2019-04-05 14:54:16.519 [DEBUG][28749] client.go 623: Update: api.Update{KVPair:model.KVPair{Key:model.ResourceKey{Name:"10.224.141.40", Namespace:"", Kind:"Node"}, Value:(*v3.Node)(0xc000388000), Revision:"152509306", UID:(*types.UID)(nil), TTL:0}, UpdateType:0x2}
...
3d5210796bba[22368]: 2019-04-05 14:54:16.524 [DEBUG][28749] client.go 721: Recompute BGP peerings
...
3d5210796bba[22368]: 2019-04-05 14:54:19.909 [DEBUG][28749] client.go 623: Update: api.Update{KVPair:model.KVPair{Key:model.ResourceKey{Name:"10.224.141.110", Namespace:"", Kind:"Node"}, Value:(*v3.Node)(0xc000388140), Revision:"152509307", UID:(*types.UID)(nil), TTL:0}, UpdateType:0x2}
...
3d5210796bba[22368]: 2019-04-05 14:54:19.911 [DEBUG][28749] client.go 721: Recompute BGP peerings 

Node update events on Kubernetes are tricky: they fire roughly every 10 seconds for every node because the kubelet updates the node heartbeat status. These updates then trigger confd to recompute all BGP peerings even though nothing meaningful has changed.

This PR changes the OnUpdates() calls to first compare each node attribute against the existing value in the cache. Recomputing BGP peers is then only triggered if the cache was actually updated. We introduced this change on one of our nodes and the result for CPU usage was fantastic: it went back to the levels we had when running 3.0.5:

[Image: confd-img2 (calico-node CPU usage after the change)]
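A rough sketch of the idea in Go, using placeholder names rather than the actual confd identifiers, just to illustrate the comparison step:

package sketch

// Illustrative sketch only: client, nodeUpdate and recomputeBGPPeers are
// placeholders, not the real confd types or functions.
type nodeUpdate struct{ key, value string }

type client struct {
	nodeCache map[string]string
}

// onNodeUpdates compares each incoming node value against the cache and only
// triggers a BGP peering recompute if at least one value actually changed.
func (c *client) onNodeUpdates(updates []nodeUpdate) {
	changed := false
	for _, u := range updates {
		if prev, ok := c.nodeCache[u.key]; ok && prev == u.value {
			continue // heartbeat-only update: nothing we care about changed
		}
		c.nodeCache[u.key] = u.value
		changed = true
	}
	if changed {
		c.recomputeBGPPeers()
	}
}

func (c *client) recomputeBGPPeers() {
	// Stand-in for the expensive peering recomputation.
}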

Only update peers if there was a significant change in nodes
This will prevent confd from regenerating all BGP peers for node status
updates that are unrelated to Calico. This has the potential to
drastically reduce CPU usage.
@CLAassistant


commented Apr 5, 2019

CLA assistant check
All committers have signed the CLA.

@caseydavenport

Member

commented Apr 5, 2019

@cezarsa thanks for the PR and the awesome investigation / write up!

@fasaxc

Member

commented Apr 5, 2019

@cezarsa Sounds like a good patch, thanks!

Couple of questions:

  • Are you using Typha with v3.6? Typha does similar deduping and, in v3.6, confd connects to Typha too, so I'd expect it to get the benefit.
  • Why are you controlling the mesh yourselves? If you're going to that trouble, I'd recommend using our in-cluster route reflector instead. That'd reduce your number of peerings right down and make BGP more efficient.
@cezarsa

Author

commented Apr 5, 2019

@fasaxc We're using the calico/typha:v3.6.1 image for Typha. Usually when upgrading a cluster we follow the recommended manifest with a few modifications on top of it; in this particular case we haven't switched to calico-ipam yet and are still running with host-local IPAM. But all Calico components were upgraded together.

Regarding the mesh configuration, one of the motivations for upgrading from 3.0 to 3.6 was being able to deploy an RR more easily using the in-cluster RR available since 3.3, I think. But we didn't want to do this in a single step: we wanted to first migrate to 3.6 as we are today and, after everything is working correctly, update our network topology to include RRs.

@cezarsa

Author

commented Apr 5, 2019

There was a bug in the previous version: empty values were never being properly added to the cache. I've just added a new commit with this change to ease review; feel free to squash everything into a single commit when merging.
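For clarity, a hypothetical before/after of the kind of bug being fixed here (not the actual confd code):

package sketch

// updateCacheBuggy skips empty values entirely, so they are never stored: a
// change from a non-empty value to an empty one goes unnoticed and a stale
// value lingers in the cache.
func updateCacheBuggy(cache map[string]string, key, value string) bool {
	if value == "" {
		return false
	}
	if prev, ok := cache[key]; ok && prev == value {
		return false
	}
	cache[key] = value
	return true
}

// updateCacheFixed stores empty values as well; only a real change reports true.
func updateCacheFixed(cache map[string]string, key, value string) bool {
	if prev, ok := cache[key]; ok && prev == value {
		return false // unchanged, including the "both empty" case
	}
	cache[key] = value
	return true
}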

@cezarsa

Author

commented Apr 5, 2019

Yet another bugfix commit; this was the cause of the previous CI failure on the etcd datastore. CI should be green now, sorry for the noise. :)

@cezarsa

Author

commented Apr 8, 2019

Another update now, not a bugfix this time but a performance improvement over the last changeset.

First, here is an updated graph from our internal monitoring showing the performance improvements over time. The graphs in the PR description are no longer completely accurate, since they reflect code that still contained a few bugs, which have since been fixed:

[Image: calico-node-annotated (annotated calico-node CPU usage over time)]

Now onto the reason behind this latest commit. When I started this PR I was under the incorrect assumption that every cache key must be associated with the latest revision. This caused confd to still try to render all templates every time a node change happened (because the revision was incremented), even when we were skipping the BGP peer recompute step.

Once I understood how this actually works, I realized that we don't need to update the revision for a cache key unless its value actually changes. This avoids re-rendering the templates needlessly.
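A rough sketch of that revision handling, again with placeholder names rather than the real confd code:

package sketch

// cache is a stand-in for confd's internal key/value store; the revision is
// what ultimately drives template re-rendering.
type cache struct {
	values   map[string]string
	revision uint64
}

// set stores a value and reports whether it changed; the revision is left
// untouched for no-op updates, so there is no reason to re-render templates.
func (c *cache) set(key, value string) bool {
	if prev, ok := c.values[key]; ok && prev == value {
		return false
	}
	c.values[key] = value
	c.revision++
	return true
}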

@caseydavenport
Member

left a comment

These changes look good to me - thanks for digging into this and providing a fix!

We'll target these for v3.8, but once they go through a bit of testing on master it might make sense to cherry-pick this to one or two earlier releases as well.

@caseydavenport caseydavenport merged commit 1b901f5 into projectcalico:master Apr 23, 2019

2 checks passed

license/cla Contributor License Agreement is signed.
semaphoreci The build passed on Semaphore.