Skip to content

NodeReconciler being invoked too frequently, may trigger errant reloads #1639

@mac-chaffee

Description

@mac-chaffee

In issues #1562 and #1606 , there were frequent ConfigReconciler calls caused by frequent Node changes. That was fixed by #1607 by adding a filters inside the ConfigReconciler that ignored some Node changes. On the bright side, I can confirm that fix worked for ConfigReconciler!

But I think we need to apply a similar filter to the NodeReconciler:

func (r *NodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
p := predicate.NewPredicateFuncs(
func(obj client.Object) bool {
node, ok := obj.(*v1.Node)
if !ok {
level.Error(r.Logger).Log("controller", "NodeReconciler", "error", "object is not node", "name", obj.GetName())
return false
}
return node.Name == r.NodeName
})

so that this doesn't flood the logs:

{"caller":"node_controller.go:42","controller":"NodeReconciler","level":"info","start reconcile":"/n4-mac","ts":"2022-10-12T14:59:50Z"}
{"caller":"speakerlist.go:271","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-10-12T14:59:50Z"}
{"caller":"node_controller.go:64","controller":"NodeReconciler","end reconcile":"/n4-mac","level":"info","ts":"2022-10-12T14:59:50Z"}
{"caller":"node_controller.go:42","controller":"NodeReconciler","level":"info","start reconcile":"/n4-mac","ts":"2022-10-12T15:00:00Z"}
{"caller":"speakerlist.go:271","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-10-12T15:00:00Z"}
{"caller":"node_controller.go:64","controller":"NodeReconciler","end reconcile":"/n4-mac","level":"info","ts":"2022-10-12T15:00:00Z"}

I don't have hard proof, but I have seen very rare cases where metallb mistakenly triggers a failover of the VIP due to one of the node change events, even though none of the nodes appeared to have had any issues.

I think what we should learn from #1606 is that ALL node watching code needs some kind of filter, because nodes change very frequently under normal conditions (maybe the conditions or the images sections?).

What kind of node update events does the NodeReconciler really care about? Nodes being added or deleted? Or also nodes being NotReady/Cordoned? I think we should put those checks inside the filter here to save CPU, reduce log spam, and avoid risks of errant failovers.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions