Optimisation for Neighbourhood discovery on scale #55

myaser · 2022-10-26T13:24:46Z

Hello

Neighborhood check is expensive when running on the scale. For instance:

neighbor discovery creates a load on the API server proportional to the cluster size and frequency of the checks. The same applies to node watchers
we have n^2 TCP/IP handshakes every 5s, where n is the number of nodes. This also creates more packets into node network queues (FIFO queues), and regular network traffic for production applications could be hit by a latency increase

My proposal would be:

configurable check scheduling, so we can control to reduce the check frequency to once per minute, for example, when needed
optimize the way we query the API server for discovery information; one option is a SWIM-based solution like hashicorp/memberlist

While the first point is straightforward and does not change the behavior much, the latter requires discussion

djboris9 · 2022-10-26T14:07:05Z

Hi @myaser

The first point (make the interval configurable) is something that is definitely worth to implement and is not a big deal.

Your second point is very interesting and I need to have a deeper look at the SWIM paper. From what I've seen now, it should be a good mechanism to detect communication loss to other instances of the kubenurse while reducing the network traffic.

Our initial reason why we developed the kubenurse the way it is now was to understand where network issues happen. Therefore we wanted to do a real HTTP request and record the latencies between each instance to narrow down issues on routing, virtualisation, switches, firewalls, kubelet etc. This is easy to process when the check interval is static and low.

With the SWIM protocol I'm not sure if we can easily detect which network path is causing troubles because of the non uniform check interval and distribution. But it would be a good solution to find if something is fishy in the network at larger scales.

Therefore I see that the first point (make the interval configurable) can easily be implemented and is a quick win. For larger clusters we would probably have to implement a SWIM (or similar) based check additionally.

This is my impression after scratching the paper. But probably there are also other solutions to this problem. I definitely need to dig deeper into it. It looks interesting. What about you @zbindenren ?

djboris9 · 2022-10-26T14:09:12Z

btw, I started with #45 before my holidays and need to finish it when I got time. This could probably be impacted but is no big deal

myaser · 2022-10-26T14:39:12Z

Therefore, we wanted to do a real HTTP request and record the latencies between each instance to narrow down issues on routing, virtualization, switches, firewalls, kubelet, etc

Maybe we can use SWIM to track membership and not rely on k8s API server discovery and still call the /alwayshappy of every neighbor. This way, each pod queries k8s API only once at the bootstrap time and joins the memberlist. SWIM will then track the members as they join/leave without further queries to the k8s server. example

However, this leaves us with the problem of the number of TCP/IP handshakes that remain unsolved

szuecs · 2022-10-26T14:54:08Z

I am colleague of @myaser and suggested internally the use of SWIM and hashicorp memberlist.
I think having swim as discovery of new neighbors would be great already, reducing load to kube-apiserver is always great and kubenurse by design creates a lot of load to apiserver. You need only the initial list from the kube-apiserver.

Maybe we need not to use a full Mesh of TCP/IP handshakes or http requests, but just a subset that you can define via option. For example test 10% of your neighbors and find via memberlist who is your neighbor peer, so you can have the full ring tested without having "wholes". I agree testing http would be great.

zbindenren · 2022-10-26T19:47:12Z

Maybe using a cached client could reduce the load on the API server?

szuecs · 2022-10-27T12:23:36Z

If you use informers you have a cache and a maintained WATCH, which in case you get a new node with kubenurse it will send to all kubernurse. Then to get unstuck in most cases you should also use refresh interval and then the client automatically does a LIST (get all endpoints) and this from every node all time.Duration.

zbindenren · 2022-10-27T16:46:19Z

Yes that is the plan.

clementnuss · 2023-08-08T06:45:19Z

digging up this issue, as we (postfinance) are currently being impacted by scale issues on our largest cluster.

I will work on that in the coming weeks!

zbindenren · 2023-08-09T05:15:37Z

@clementnuss thanks for doing this. Here some of my thoughts:

When using a cached client we could get rid of this nodeCache struct and the load on the api server for getting all pods here would hopefully decrease too.

I propose to switch to controller-runtime's client, then it would be pretty easy to use their cached client like in this simplified example:

// TODO: handle errors
config, _ := rest.InClusterConfig()
if err != nil {
	return nil, fmt.Errorf("kubernetes config: %w", err)
}

c, _ := cache.New(config, cache.Options{
	ByObject: map[client.Object]cache.ByObject{
		&corev1.Pod{}: {},
	},
})

_ = c.Start(ctx)

You can even use labels for the cache. Here you can find the documentation: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/cache

permits to get rid of kubediscovery package and to simplify the codebase linked to: #55 Signed-off-by: Clément Nussbaumer <clement.nussbaumer@postfinance.ch>

clementnuss · 2024-02-07T05:48:02Z

So I've implemented caching with controller-runtime client 🙃 makes the codebase much simpler, and the effects on a 100 nodes cluster are promising:

here we have a reduction of around 20RPS on the API server, for a 103 nodes cluster, which does make sense given that we use the default interval of 5s between checks, and that each checks incurred a GET on /api/v1/namespaces/kubenurse/pods

Those GET requests are now transparently cached by controller-runtime's client.

To be noted: the nodes list was already being cached (that is, watched with an informer, etc.) So this 1st improvement reduces the load on the API server.

Now regarding the $O(n^2)$ scale issue when it comes to checking all neighbours, I've had an idea that revolves around consistent hashing. Will implement it and detail that later !

clementnuss · 2024-02-19T07:24:04Z

after a few days, there doesn't seem to be a memory leak 🙃 I'll release a new version with the caching already!

clementnuss · 2024-02-20T05:35:28Z

you can test caching with release v1.10.0. let me know if this has an impact

instead of querying all (valid) neighbouring pods, we filter and only query KUBENURSE_NEIGHBOUR_LIMIT other neighbours. To make sure that all nodes get the right number of queries, we hash each node name, put that in a sorted list (l), and each pod will query the e.g. 10 next nodes in the list, starting from the position given by the node name' hash. i.e. for node named n, we query the 10 next nodes in the list, starting from position l[hash(n)] + 1 cf #55 Signed-off-by: Clément Nussbaumer <clement.nussbaumer@postfinance.ch>

clementnuss · 2024-03-15T09:56:39Z

https://github.com/postfinance/kubenurse/blob/master/CHANGELOG.md#1110---2024-03-15

v1.11.0 is out!

most notable change is using hashing to distribute the neighbouring nodes checks. for details see the PR #120 and the commit description.

thanks to that, each pod only queries 10 other neighbours.
also thanks to that, I was able to add the request type (type) label to the prometheus metrics, which makes for more interesting histograms !

please try it out and let me know how this goes. this basically should get us from $O(n^2)$ checks to $O(n)$, where the number of neighbouring node checks is defined with the neihbour filter env variable, and defaults to 10

clementnuss · 2024-04-05T07:39:52Z

I've released 1.12.0, and also documented the node filtering feature in the README with a drawing:

I'll close this issue 🚀
If you have some feedback about the functionality, I'd be really happy to hear about it (on the K8s slack channel for example?) or per mail.

myaser mentioned this issue Oct 26, 2022

introduce a feature flags to control the frequency of checks #56

Merged

clementnuss closed this as completed Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimisation for Neighbourhood discovery on scale #55

Optimisation for Neighbourhood discovery on scale #55

myaser commented Oct 26, 2022

djboris9 commented Oct 26, 2022

djboris9 commented Oct 26, 2022

myaser commented Oct 26, 2022 •

edited

szuecs commented Oct 26, 2022 •

edited

zbindenren commented Oct 26, 2022

szuecs commented Oct 27, 2022

zbindenren commented Oct 27, 2022

clementnuss commented Aug 8, 2023

zbindenren commented Aug 9, 2023

clementnuss commented Feb 7, 2024

clementnuss commented Feb 19, 2024

clementnuss commented Feb 20, 2024

clementnuss commented Mar 15, 2024 •

edited

clementnuss commented Apr 5, 2024

Optimisation for Neighbourhood discovery on scale #55

Optimisation for Neighbourhood discovery on scale #55

Comments

myaser commented Oct 26, 2022

djboris9 commented Oct 26, 2022

djboris9 commented Oct 26, 2022

myaser commented Oct 26, 2022 • edited

szuecs commented Oct 26, 2022 • edited

zbindenren commented Oct 26, 2022

szuecs commented Oct 27, 2022

zbindenren commented Oct 27, 2022

clementnuss commented Aug 8, 2023

zbindenren commented Aug 9, 2023

clementnuss commented Feb 7, 2024

clementnuss commented Feb 19, 2024

clementnuss commented Feb 20, 2024

clementnuss commented Mar 15, 2024 • edited

clementnuss commented Apr 5, 2024

myaser commented Oct 26, 2022 •

edited

szuecs commented Oct 26, 2022 •

edited

clementnuss commented Mar 15, 2024 •

edited