Bug 1898160: Egress IP fail-over and health checks #394

alexanderConstantinescu · 2021-01-04T10:57:07Z

- What this PR does and why is it needed

This PR fixes the referenced bug. It brings in several commits, all of which only touch the egress IP code, that are needed for a smooth back-port. All commits have been cherry-picked without any additional modification.

These commits have fixed the incorrect behavior on master.

/assign @danwinship

- Special notes for reviewers

- How to verify it

- Description for the changelog

Now that local gateway mode is removed, we can clean up things like modeEgressIP and the like, and just use one data container as a controller for egress IP assignment/setup. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Cleanup egress IP conditions

There were a couple of nits on PR ovn-org/ovn-kubernetes#1668 that I didn't get the chance to fix before it merged, so I am doing that here instead. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Kubelet status is not enough to determine whether a node's networking is up. The NotReady status can be assigned to nodes for reasons other than networking being down and, even when it correctly indicates the node status, it might not do so fast enough. Thus add a recurring connectivity checker to all egress nodes to ascertain whether they are up. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
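For readers who want a picture of what a recurring connectivity checker looks like, here is a minimal sketch. It is not the code from this commit: the names `probeFunc`, `tcpProbe`, `checkNodes` and `runChecker`, the use of a plain TCP dial, and the node address format are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"net"
	"time"
)

// errUnreachable is a hypothetical sentinel returned when a probe fails.
var errUnreachable = errors.New("node unreachable")

// probeFunc checks a single node address and returns nil if it answered.
type probeFunc func(addr string) error

// tcpProbe returns a probe that attempts a TCP connection with a timeout.
// Assumption: a successful TCP dial is a good enough signal of "networking up".
func tcpProbe(timeout time.Duration) probeFunc {
	return func(addr string) error {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return errUnreachable
		}
		return conn.Close()
	}
}

// checkNodes probes every egress node once and reports reachability per node name.
func checkNodes(nodes map[string]string, probe probeFunc) map[string]bool {
	reachable := make(map[string]bool, len(nodes))
	for name, addr := range nodes {
		reachable[name] = probe(addr) == nil
	}
	return reachable
}

// runChecker repeats checkNodes on every tick until stop is closed,
// handing each result to report (for example, to trigger a re-assignment).
func runChecker(nodes map[string]string, probe probeFunc, interval time.Duration,
	stop <-chan struct{}, report func(map[string]bool)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			report(checkNodes(nodes, probe))
		case <-stop:
			return
		}
	}
}
```

Passing the probe in as a function keeps the loop independent of how reachability is actually measured, so a fake probe can stand in for the network in tests.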

Unrelated to the previous commits, this one fixes an update issue during the re-assignment procedure: if a re-assignment was attempted when there were no more nodes to assign to, we previously exited the function `reassignEgressIP` without updating the status. This commit changes the execution to continue on to the update phase and log the error once the re-assignment is finished. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
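The control-flow change described above can be sketched roughly as follows. This is an illustration only, not the body of `reassignEgressIP`: the types, the `assign` helper with its one-IP-per-node rule, and the `update` callback are hypothetical stand-ins. The assignment error is returned here so that the caller can log it after the status has been written.

```go
package main

import "errors"

// errNoNodes is a hypothetical error for "nothing left to assign to".
var errNoNodes = errors.New("no assignable nodes left")

// statusItem records which node currently hosts an egress IP.
type statusItem struct{ EgressIP, Node string }

// egressIP is a trimmed-down stand-in for the real object.
type egressIP struct {
	SpecIPs []string
	Status  []statusItem
}

// assign places each requested IP on its own node (a simplifying assumption)
// and returns whatever it managed to place, plus an error if nodes ran out.
func assign(ips, nodes []string) ([]statusItem, error) {
	items := []statusItem{}
	for i, ip := range ips {
		if i >= len(nodes) {
			return items, errNoNodes
		}
		items = append(items, statusItem{EgressIP: ip, Node: nodes[i]})
	}
	return items, nil
}

// reassign always continues to the update phase, even when assignment failed,
// so that stale status entries never survive a failed re-assignment.
func reassign(e *egressIP, nodes []string, update func(*egressIP) error) error {
	items, assignErr := assign(e.SpecIPs, nodes)
	e.Status = items // possibly empty or partial
	if err := update(e); err != nil {
		return err
	}
	return assignErr // reported only after the status has been written
}
```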

Previously we never actually reacted to the fact that an egress IP might have been deleted while we were trying to re-assign it. In such a case we should drop it from our re-assignment cache so that it is not retried anymore. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
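A minimal sketch of that pruning step, assuming the re-assignment cache is a set keyed by EgressIP name. The function name `pruneAndRetry` and the `exists` and `retry` callbacks are hypothetical; in the real controller the existence check would be a lookup against the API server or an informer cache.

```go
package main

// pruneAndRetry walks the re-assignment cache. Entries whose object no longer
// exists are dropped so they are never retried again; the rest are retried.
func pruneAndRetry(pending map[string]struct{}, exists func(name string) bool, retry func(name string)) {
	for name := range pending {
		if !exists(name) {
			// The object was deleted while queued for re-assignment.
			delete(pending, name) // deleting during range is safe in Go
			continue
		}
		retry(name)
	}
}
```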

Host network pods are not egress IP assignable, since that would break cluster networking and, moreover, apply the egress IP to all pods running on that node. However, no check existed ensuring that such pods are skipped for egress assignment. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
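The missing check amounts to a filter on the pod's host-network flag. The sketch below uses a local `pod` type and a hypothetical `egressCandidates` helper to stay self-contained; against the Kubernetes API the same flag is `pod.Spec.HostNetwork`.

```go
package main

// pod is a trimmed-down stand-in for a Kubernetes pod.
type pod struct {
	Name        string
	HostNetwork bool
}

// egressCandidates returns only the pods that may receive an egress IP,
// skipping every pod that runs in the host network namespace.
func egressCandidates(pods []pod) []pod {
	candidates := make([]pod, 0, len(pods))
	for _, p := range pods {
		if p.HostNetwork {
			continue // would apply the egress IP to the whole node
		}
		candidates = append(candidates, p)
	}
	return candidates
}
```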

This fixes an issue seen when multiple assignment retries are attempted for the same EgressIP object. When multiple nodes are being labelled, we attempt a re-assignment of all EgressIPs which did not get fully assigned in round N-1. However, we ended up deleting every EgressIP object once it was retried, and thus retried every object only once. This patch fixes that by only deleting the EgressIP once all requested .spec.egressIPs have been assigned. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
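The corrected retry rule can be illustrated like this. It is a sketch under assumptions, not the patch itself: the `egressIP` type, the `retryRound` function and the `tryAssign` callback are hypothetical, and "fully assigned" is modelled as the number of assigned IPs matching the number of requested ones.

```go
package main

// egressIP is a trimmed-down stand-in: SpecIPs mirrors .spec.egressIPs,
// AssignedIPs mirrors the IPs that already have a node.
type egressIP struct {
	SpecIPs     []string
	AssignedIPs []string
}

// retryRound attempts assignment for every pending object and removes an
// object from the retry cache only once all of its requested IPs are assigned.
func retryRound(pending map[string]*egressIP, tryAssign func(*egressIP)) {
	for name, e := range pending {
		tryAssign(e)
		if len(e.AssignedIPs) == len(e.SpecIPs) {
			delete(pending, name)
		}
		// Otherwise the object stays queued for round N+1.
	}
}
```

With two requested IPs and one node labelled per round, the object survives the first round and is removed after the second.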

openshift-ci-robot · 2021-01-04T10:57:15Z

@alexanderConstantinescu: This pull request references Bugzilla bug 1898160, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.6.z) matches configured target release for branch (4.6.z)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
dependent bug Bugzilla bug 1877273 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
dependent Bugzilla bug 1877273 targets the "4.7.0" release, which is one of the valid target releases: 4.7.0
bug has dependents

In response to this:

Bug 1898160: [release-4.6] Egress IP fail-over and health checks

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

danwinship · 2021-01-06T17:14:27Z

/lgtm

openshift-ci-robot · 2021-01-06T17:14:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danwinship]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-01-06T17:22:14Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-06T18:14:15Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-06T19:58:12Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-06T20:24:13Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-06T21:55:13Z

/retest

Please review the full test history for this PR and help us cut down flakes.

russellb · 2021-01-08T23:19:29Z

(patch manager) 4.6 blocker identified, so pulling unmerged patches from the queue

alexanderConstantinescu · 2021-01-11T11:17:39Z

> (patch manager) 4.6 blocker identified, so pulling unmerged patches from the queue

@russellb : just for my understanding: you added and then removed the cherry-pick-approved label on the same day. Did the CI jobs not pass by the time you removed the label, or was there another reason this patch didn't merge?

openshift-bot · 2021-01-22T00:27:21Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-22T02:11:21Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-22T02:37:25Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-22T02:50:21Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-22T04:08:07Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-01-22T04:34:07Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2021-01-22T05:34:14Z

@alexanderConstantinescu: All pull requests linked via external trackers have merged:

openshift/ovn-kubernetes#394

Bugzilla bug 1898160 has been moved to the MODIFIED state.

In response to this:

Bug 1898160: Egress IP fail-over and health checks

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alexanderConstantinescu added 7 commits January 4, 2021 11:33

Re-assign once kubelet starts posting NotReady status for node

ab4352f

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

openshift-ci-robot assigned danwinship Jan 4, 2021

openshift-ci-robot added the bugzilla/severity-high (Referenced Bugzilla bug's severity is high for the branch this PR is targeting.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels Jan 4, 2021

openshift-ci-robot requested review from danielmellado and trozet January 4, 2021 10:57

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged.) label Jan 6, 2021

openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label Jan 6, 2021

russellb added and removed the cherry-pick-approved (Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.) label Jan 8, 2021

crawford changed the title ~~Bug 1898160: [release-4.6] Egress IP fail-over and health checks~~ Bug 1898160: Egress IP fail-over and health checks Jan 21, 2021

crawford added the cherry-pick-approved (Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.) label Jan 21, 2021

openshift-merge-robot merged commit b4900a8 into openshift:release-4.6 Jan 22, 2021
