Always allow egressip-node-updates #4102

Open

tssurya wants to merge 2 commits into master from the always-allow-egress-node-updates branch

Conversation

@tssurya (Member) commented Jan 26, 2024

  1. ovnkube-node pod goes offline
  2. healthcheck fails
  3. node is deleted from EIP.status and the status is patched
  4. all OVN DB setup is deleted across zones
  5. CPIPC (CloudPrivateIPConfig) delete starts being processed on cloud platforms
  6. ovnkube-node pod comes back online
  7. healthcheck succeeds
  8. egress-node add happens, BUT the patch fails because the CPIPC is still being
     deleted: err: cloud update request failed, CloudPrivateIPConfig: 10.0.128.100 is being deleted
  9. the node add is put on the retry queue and we wait 30 seconds
  10. kubelet sends an update every 10 seconds -> the update succeeds and
      overrides the retry add entry because oldNode.ready == newNode.ready
  11. THUS WE FAIL TO PROCESS AN EVENT, and the patch never happens
  12. OVN setup never happens; we are stuck.

This PR removes the oldNode.ready == newNode.ready; return logic, which was only there for scale reasons, since removing it is needed to allow events to get synced correctly. Not proud of this fix, but this is why we badly need level-driven EIP controllers :)

Does this have scale implications? Maybe a 0.0001% chance, because addEgressNode, which is what will now always be called, checks that EIP.Spec != EIP.Status before calling reconcileEIP, so we should be fine...
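To make step 10 concrete, here is a minimal, self-contained sketch of the kind of gate being removed; the types and helper names are illustrative stand-ins, not the actual ovn-kubernetes handler (the real condition, quoted in the review thread below, also checks whether host CIDRs changed):

```go
package main

import "fmt"

// nodeState is a simplified stand-in for the fields the real handler compares.
type nodeState struct {
	name  string
	ready bool
}

// shouldProcess models the pre-PR gate in the node update handler: when the
// ready state did not change (and host CIDRs did not change), the update was
// dropped entirely, which is what let a routine kubelet heartbeat override the
// pending retry add in step 10 above.
func shouldProcess(oldNode, newNode nodeState, hostCIDRsAltered bool) bool {
	if oldNode.ready == newNode.ready && !hostCIDRsAltered {
		return false // old behaviour: return early, the event is effectively lost
	}
	return true
}

func main() {
	old := nodeState{name: "node-a", ready: true}
	cur := nodeState{name: "node-a", ready: true}
	// A routine kubelet status heartbeat: readiness unchanged, so the old code
	// skipped it. After this PR the handler always runs and relies on the cheap
	// EIP.Spec vs EIP.Status comparison to avoid doing real work.
	fmt.Println(shouldProcess(old, cur, false)) // prints: false
}
```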


@kyrtapz (Contributor) commented Feb 7, 2024

This approach also addresses an issue reported here: https://issues.redhat.com/browse/OCPBUGS-27266.
Without it, the retry mechanism for egressIP nodes doesn't really work for anything other than a reachability change, so I am in favor of getting this in. It would be best to move egressIP to level-driven entirely... :)

If we go with the proposed solution, I think we would have to modify the addNode function, as it currently lists all egress IPs from the API:

// call site in addNode:
egressIPs, err := eIPC.kube.GetEgressIPs()

// GetEgressIPs pages through the API server:
err := pager.New(func(ctx context.Context, opts metav1.ListOptions) (runtime.Object, error) {
	return k.EIPClient.K8sV1().EgressIPs().List(ctx, opts)
})

We might also consider adapting deleteEgressNode and setNodeEgressReachable if possible.

@tssurya (Member, Author) commented Feb 10, 2024

Great point @kyrtapz: yeah, let's start using the informer cache instead of making list calls to the API, because those will cause scale issues.
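As a rough, hedged illustration of that direction (not the actual ovn-kubernetes wiring, which has its own watch factory and generated EgressIP listers), here is a minimal client-go example that serves List calls from a shared informer's local cache instead of hitting the API server on every event; core Nodes stand in for the EgressIP CRD:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer factory keeps a local, watch-driven cache of objects.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	nodeLister := factory.Core().V1().Nodes().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served entirely from the local cache: no API round trip per event,
	// unlike a direct client.CoreV1().Nodes().List(...) call, which is the
	// pattern the current addNode path effectively uses for EgressIPs.
	nodes, err := nodeLister.List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("nodes in cache:", len(nodes))
}
```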

@tssurya (Member, Author) commented Feb 15, 2024

This is a bandaid fix that may or may not hurt another spot in the body :D

@tssurya (Member, Author) commented Feb 15, 2024

@jcaamano will be our Batman, who will apply the permanent fix when he does the level-driven controller.

@tssurya (Member, Author) commented Feb 15, 2024

I need to get back to Pat's offline comments for this one.

@jcaamano (Contributor) commented:

Side question: Do we update the reachability checker with added/removed nodes?

@jcaamano (Contributor) commented:

> Side question: Do we update the reachability checker with added/removed nodes?

Never mind, I was confused by the initialization receiving an array of nodes that is not used for anything:

// Note: the nodes argument is unused; this only starts the periodic reachability checker.
func (eIPC *egressIPClusterController) initEgressNodeReachability(nodes []interface{}) error {
	go eIPC.checkEgressNodesReachability()
	return nil
}

It's also confusing that this is run from the sync function instead of just at normal startup. Wondering if there is a reason behind that.

if !isNewReady {
	klog.Warningf("Node: %s is not ready, deleting it from egress assignment", newNode.Name)
	if err := h.eIPC.deleteEgressNode(newNode.Name); err != nil {
		return err
	}
} else if isNewReady && isNewReachable {
	klog.Infof("Node: %s is ready and reachable, adding it for egress assignment", newNode.Name)
	if isOldReady != isNewReady {
		klog.Infof("Node: %s is ready and reachable, adding it for egress assignment", newNode.Name)
@tssurya (Member, Author):

@kyrtapz suggested not to do this as it's redundant and not useful, so this is fixed now.

@tssurya force-pushed the always-allow-egress-node-updates branch from 422bfd4 to ff8d45c on February 16, 2024 at 15:28
1) ovnkube-node pod goes offline
2) healthcheck fails
3) node is deleted from EIP.status and the status is patched
4) all OVN DB setup is deleted across zones
5) CPIPC starts to process the delete on cloud platforms
6) ovnkube-node pod comes back online
7) healthcheck succeeds
8) egress-node add happens, BUT the patch fails because the CPIPC is still being
   deleted: err: cloud update request failed, CloudPrivateIPConfig: 10.0.128.100 is being deleted
9) the node add is put on the retry queue and we wait 30 seconds
10) kubelet sends an update every 10 seconds -> the update succeeds and
    overrides the retry add entry because oldNode.ready == newNode.ready
11) THUS WE FAIL TO PROCESS AN EVENT, and the patch never happens
12) OVN setup never happens; we are stuck.

This PR removes the oldNode.ready == newNode.ready; return logic, which
was only there for scale reasons, since removing it is needed to allow
events to get synced correctly. Not proud of this fix, but this is why
we badly need level-driven EIP controllers :)

Does this have scale implications? Maybe a 0.0001% chance, because
addEgressNode, which is what will now always be called, checks that
EIP.Spec != EIP.Status before calling reconcileEIP, so we should be fine...

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Since with the previous commit we will always call addEgressNode
for every kubelet update, which happens every 10 seconds for every node,
we must stop listing EIPs through the kapi client since that will not
scale well. Let us use the informer caches to pull EIPs; then, unless
there are EIPs whose status.items != spec.items, the rest will be
short-circuited in this function.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@tssurya force-pushed the always-allow-egress-node-updates branch from ff8d45c to f9bd58d on February 16, 2024 at 15:29
@tssurya (Member, Author) commented Feb 16, 2024

/assign @trozet

@@ -1579,91 +1579,6 @@ var _ = ginkgo.Describe("OVN cluster-manager EgressIP Operations", func() {
err := app.Run([]string{app.Name})
gomega.Expect(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("egress node update should not mark the node as reachable if there was no label/readiness change", func() {
Reviewer (Contributor):
So, the original code that bypasses adding/updating a node if the ready states are equal: 1f101eb#diff-796adb9d8a9b5b5636dc81e7350a00ef143e6e5bcb52ec30dc350c866cd9208cR784

This pre-dates the reachability check, so this test case was added later. The test case seems flawed to me in the first place: we should use any event to trigger us to check and ensure that the node is added, whether that is a node update or a reachability check success.

@@ -133,16 +132,12 @@ func (h *egressIPClusterControllerEventHandler) UpdateResource(oldObj, newObj in
}
return nil
}
if isOldReady == isNewReady && !isHostCIDRsAltered {
Reviewer (Contributor):

So, walking through the consequences of not skipping potentially duplicate events:

  • Delete path:
  1. We list all egress IPs with the kube client (bad, and we should fix that); negligible impact.
  2. If we already deleted this node, none of the statuses will match, so no reconcile and no perf impact.
  • Add path:
  1. Set node reachable: takes a global allocator lock. We could improve this with a per-key map lock, but it shouldn't be a big deal, since it's 15 threads competing for a lock on an operation that takes no time.
  2. Add egress node: we get egress IPs via the kube client again (wth...), then we go through every egress IP and reconcile... yikes.

This seems like a pretty big performance issue if there were a lot of egress IPs. We would constantly be churning through all egress IPs on every node update. Or am I missing something?

Would a better approach be to expose the reAddorDelete map that the reachability check uses, so that we also have the state of whether the node wasReachable? Then we could change this if statement to:

if isOldReady == isNewReady && !isHostCIDRsAltered && wasReachable == isReachable ?

@tssurya (Member, Author):

> Would a better approach be to expose the reAddorDelete map that the reachability check uses, so that we also have the state of whether the node wasReachable?

I had the same thought of keeping a previous reachability state, but the problem here is that for the kubelet updates we get, both the oldNode and newNode objects will have a healthy reachability state, so it still doesn't take action :( and doesn't help here.

@tssurya (Member, Author) commented Mar 14, 2024:

> This seems like a pretty big performance issue if there were a lot of egress IPs. We would constantly be churning through all egress IPs on every node update. Or am I missing something?

So this is a legit scale question, one I had too. But the thing to notice inside the add/delete node funcs is that we call reconcileEIP only when status != spec; so we only call reconcile if the EIP actually needs to be reconciled, not otherwise. I meant https://github.com/ovn-org/ovn-kubernetes/pull/4102/files#diff-f706ecf5fbc17ede0f5a27577b1d4b3c08c00fe2837f6c1be461503de71ce21bR760 for the update/add case.

The delete path seems to be doing something else... let me double check.
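A tiny self-contained model of that short-circuit, with field names as simplified stand-ins for the real EgressIP CRD (not the actual types or reconcile logic):

```go
package main

import "fmt"

// egressIP models only the two fields that matter for the short-circuit.
type egressIP struct {
	name   string
	spec   []string // requested egress IPs
	status []string // egress IPs currently assigned to nodes
}

// needsReconcile returns true only when status has drifted from spec, which is
// the condition under which the add/delete node paths actually call reconcile.
func needsReconcile(eip egressIP) bool {
	if len(eip.spec) != len(eip.status) {
		return true
	}
	assigned := make(map[string]bool, len(eip.status))
	for _, ip := range eip.status {
		assigned[ip] = true
	}
	for _, ip := range eip.spec {
		if !assigned[ip] {
			return true
		}
	}
	return false
}

func main() {
	steady := egressIP{name: "eip-ok", spec: []string{"10.0.128.100"}, status: []string{"10.0.128.100"}}
	drifted := egressIP{name: "eip-stuck", spec: []string{"10.0.128.100"}, status: nil}
	// Re-running the add path on every kubelet heartbeat is cheap for "steady"
	// and only does real work for "drifted".
	fmt.Println(needsReconcile(steady), needsReconcile(drifted)) // prints: false true
}
```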

gomega.Expect(egressIPs[0]).To(gomega.Equal(egressIP))
hcClient := fakeClusterManagerOVN.eIPC.allocator.cache[node.Name].healthClient.(*fakeEgressIPHealthClient)
hcClient.FakeProbeFailure = true
// explicitly call check reachability, periodic checker is not active
Reviewer (Contributor):

I assume even if we delete this test case, there is another test case that checks that the reachability functionality works correctly, right?

@tssurya (Member, Author):

Yeah, good point; let me check this.

@tssurya added the kind/bug (All issues that are bugs and PRs opened to fix bugs) and egress-ip (Issues related to EgressIP feature) labels on Mar 14, 2024
Labels: egress-ip (Issues related to EgressIP feature), kind/bug (All issues that are bugs and PRs opened to fix bugs)

Projects: none yet

Linked issues that merging may close: none yet

4 participants