Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2039099: EgressIP fixes for 4.10 #917

Merged

Conversation

alexanderConstantinescu
Copy link
Contributor

/hold

until #915 and openshift/cluster-network-operator#1285 gets in

/assign @trozet

- What this PR does and why is it needed

- Special notes for reviewers

- How to verify it

- Description for the changelog

@openshift-ci openshift-ci bot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jan 19, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 19, 2022

@alexanderConstantinescu: This pull request references Bugzilla bug 2039099, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

Bug 2039099: EgressIP fixes for 4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 19, 2022
@alexanderConstantinescu
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 19, 2022
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2022
kyrtapz and others added 13 commits January 20, 2022 10:10
On CloudPrivateIPConfig deletion there can still be a finalizer left in the CR, instead of checking if there are any finalizers left check if new is nil.

Signed-off-by: Patryk Diak <pdiak@redhat.com>
(cherry picked from commit 391bbe2)
When ovnkube-master restarts we might have multiple objetct modification
being performed from WatchEgressNodes and WatchEgressIP, this will also
happen in quick succession: meaning we need to make sure we constantly
act on the latest state of the object, which might not be the state of
the informer cache. Hence: always retrieve the object from the API
server

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 2fa8386)
When deleting a egressip.status.items we are better off just utilizing
the fact that the setup made in the NB DB gets tagged with the
egress IP name in the external_id fields for the LRP and NAT objects.

Doing this fixes two issues:

1. If a ovnkube-master doesn't run while an egress IP matching pod gets
   deleted, we'll still be able to cleanup the setup made for that pod
   when we restart by doing this.
2. If an EgressIP gets deleted we will start deleting
   CloudPrivateIPConfigs. However, with the previous implementation we
   needed to fetch the EgressIP object and its podSelector /
   namespaceSelector, as to figure out which pods it matched. That
   won't be possible because the EgressIP object has been deleted, and
   so we were left with setups for all pods which matched the EgressIP
   object

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 4eb926e)
The pod assignment cache logic was completely off, leading to incorrect
setups in many cases. This fixes the logic

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 24dbb96)
The reconciliation logic for egress IP was a bit off in the sense that
we were processing our own updates, leading to unecessary and also
incorrect changes for each object.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit db0e0ae)
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit baaea66)
There are two caches used for egress IP: the allocator cache, tracking
which egress IP has been assigned to which node, and the podAssignment
cache, tracking what set up has been performed on a per-pod basis. The
logic in reconcileEgressIP only warmed up the cache when new assignments
were actually being performed, overlooking the fact that we need to add
everything that is already existing and which won't be re-assigned to
the cache as well.

Moreover, when running in a cloud environment we also never cleared the
`statusToRemove` from the allocator cache before possibly trying to
re-assign new IPs. This fixes that too

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 994ab9d)
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Annotate the CloudPrivateIPConfig with the EgressIP owner reference

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 1e50941)
Pre-this-patch-series: we always deleted the external GW setup for each
pod matching an egress IP and proceeded to wiring it up for egress IP.
That can happen around 20 times in a short time span since pod updates
are atomic and many controllers can change many fields on the pod when
it's created. With this series of patches: we now have something which
tracks what has been performed for each pod and do not perform repeating
work. This is also needed since we have several different handlers which
react to changes and can perform the egress IP setup, WatchEgressIP,
WatchEgressNodes, etc.

We need to be sure that the external GW setup created in addLogicalPort
has actually been performed once the egress IP code is supposed to
delete that. This was never the case before because it wasn't required
given that we always re-did the same work for every pod update and were
bound to "finish last" at some point.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 9353e94)
Pods should not match multiple EgressIP objects: that is considered a
user error and is undefined. However, in such events iterate all EgressIPs
and clean up as much as possible. By iterating all EgressIPs we also
cover the case where a pod has its labels changed from matching one
EgressIP to another, ex: EgressIP1 matching "blue pods" and EgressIP2
matching "red pods". If a pod with a red label gets changed to a blue
label: we need add and remove the set up for both EgressIP obejcts -
since we can't be sure of which EgressIP object we process first,
always iterate all.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit a75838a)
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit e920435)
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit 6d66ec2)
WatchEgressIPNamespaces and WatchEgressIPPods only use the informer
cache to retrieve the egress IPs when determining if namespace/pods
match. It is thus better if we initialize them first and allow
WatchEgressNodes / WatchEgressIP to initialize after. Those handlers
might change the assignments of the existing objects. If we do the
inverse and start WatchEgressIPNamespaces / WatchEgressIPPod last, we
risk performing a bunch of modifications on the EgressIP objects when
we restart and then have these handlers act on stale data when they
sync.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
(cherry picked from commit eca24d0)
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2022
@alexanderConstantinescu
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 20, 2022
@alexanderConstantinescu
Copy link
Contributor Author

/test e2e-gcp-ovn

@trozet
Copy link
Contributor

trozet commented Jan 20, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 20, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 20, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 20, 2022

@trozet: Overrode contexts on behalf of trozet: ci/prow/e2e-metal-ipi-ovn-dualstack

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-dualstack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@trozet
Copy link
Contributor

trozet commented Jan 20, 2022

/override ci/prow/e2e-metal-ipi-ovn-dualstack

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 20, 2022

@trozet: Overrode contexts on behalf of trozet: ci/prow/e2e-metal-ipi-ovn-dualstack

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-dualstack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

6 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@alexanderConstantinescu
Copy link
Contributor Author

/skip

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@alexanderConstantinescu
Copy link
Contributor Author

/skip

@alexanderConstantinescu
Copy link
Contributor Author

/test e2e-aws-ovn-upgrade

@trozet
Copy link
Contributor

trozet commented Jan 21, 2022

/override ci/prow/e2e-metal-ipi-ovn-dualstack

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 21, 2022

@trozet: Overrode contexts on behalf of trozet: ci/prow/e2e-metal-ipi-ovn-dualstack

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-dualstack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 21, 2022

@alexanderConstantinescu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade 2715f97 link false /test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
ci/prow/okd-e2e-gcp-ovn 2715f97 link false /test okd-e2e-gcp-ovn
ci/prow/e2e-vsphere-ovn 2715f97 link false /test e2e-vsphere-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@trozet
Copy link
Contributor

trozet commented Jan 21, 2022

aws-ovn-upgrade already passed twice on this commit. The previous run's only failure was:
etcdHighNumberOfLeaderChanges was at or above info for at least 3m57s (maxAllowed=0s): pending for 14m57s, firing for 3m57s:

Jan 21 13:05:30.373 - 119s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-155-187.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-155-187.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}

Unrelated to this PR.

/override ci/prow/e2e-aws-ovn-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 21, 2022

@trozet: Overrode contexts on behalf of trozet: ci/prow/e2e-aws-ovn-upgrade

In response to this:

aws-ovn-upgrade already passed twice on this commit. The previous run's only failure was:
etcdHighNumberOfLeaderChanges was at or above info for at least 3m57s (maxAllowed=0s): pending for 14m57s, firing for 3m57s:

Jan 21 13:05:30.373 - 119s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-155-187.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-155-187.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}

Unrelated to this PR.

/override ci/prow/e2e-aws-ovn-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 1e95054 into openshift:master Jan 21, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 21, 2022

@alexanderConstantinescu: All pull requests linked via external trackers have merged:

Bugzilla bug 2039099 has been moved to the MODIFIED state.

In response to this:

Bug 2039099: EgressIP fixes for 4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants