Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2091157: [release-4.10] Free IPs and delete resources for completed pods #1152

Merged
merged 7 commits into from Jun 27, 2022

Conversation

trozet
Copy link
Contributor

@trozet trozet commented Jun 21, 2022

Not a totally clean backport due to generic retry watcher library added in 4.11.

Billy99 and others added 6 commits June 21, 2022 09:28
This pull request is to delete the logicial ports associated with a pod when a
pods runs to completion. The intent of this change is to reduce the size of OVN
databases by removing entries that are no longer needed.

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Co-authored-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 999f344)
Get on port cache was returning the actual object in the cache instead
of a copy.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit b382400)
With the update to deleteLogicalPort with pod completed status, there
are several other places in the code that could have stale applications
of the old pod IP. This commit modifies those to also remove the pod IP
from their usage during pod update. Additionally, other places in the
code rely on the logical port cache to get the portInfo. However, this
may be removed after deleteLogicalPort. Therefore remove getting the
logical port where IPs can be retrieved via the kapi pod object itself.

Main paths affected:
 - Network Policy: updating port groups and addr sets
 - Namespace exgw: get pod ips from pod object, not port cache
 - Egress IP: update pod handling for completed pod
 - Hybrid Overlay: update pod handling for completed pod

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit b94351d)
The nested allocator calls were propagating an error type up during an
IP release. However in the bitmap allocator function it was never
possible to error during an IP release. Remove the return type.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 1a430bc)
Changes-Include:
 - During pod sync we allocate IPs of all existing pods, but we should
   ignore completed pods
 - During namespace add we add all of the pod IPs to the ns address set,
   but we should ignore completed pods
 - During processing of delete event for a completed pod, we were trying
   to delete the pod again, which would try to free the IP that was
   previously released and could be in use by another pod. We should
   ignore delete events for completed resources as they would have been
   handled during update.
 - On node add, we add all existing pods on that node back to retry as
   an "add". We should skip completed pods here.
 - We now check during deletion of a completed pod (should happen on
   update only) to make sure no other running pods are using this IP as
   a failsafe to ensure we never release an IP in use by another pod or
   the related OVN config

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 7bcc8da)
Upon fetching all of the NATs on a router, if the NATs or the router
dont exist this should not be an error for deletion.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 8ad0d79)
@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 21, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 21, 2022

@trozet: This pull request references Bugzilla bug 2091157, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.z) matches configured target release for branch (4.10.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 2026461 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2026461 targets the "4.11.0" release, which is one of the valid target releases: 4.11.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

Bug 2091157: [release-4.10] Free IPs and delete resources for completed pods

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2022
@trozet
Copy link
Contributor Author

trozet commented Jun 21, 2022

/hold

Investigating how "allows allocation after pods are completed" passes in 4.11/upstream. In this case the IP is provided in the pod status when myPod2 is created:
https://github.com/openshift/ovn-kubernetes/blob/release-4.11/go-controller/pkg/ovn/pods_test.go#L434

This is the same IP as the already created pod. So when first pod is deleted, the IP is not freed because we detect it is in use by myPod2. However, creating myPod2 fails because it does not have any annotation yet with this IP. This seems like a bug in the test case, not sure how it passes upstream.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 21, 2022
@trozet
Copy link
Contributor Author

trozet commented Jun 21, 2022

Figured out there was a regression only in 4.11: ovn-org/ovn-kubernetes#3045

The pods were being incorrectly created with the expected IP already
present in pod status field...before it was ever even allocated by OVNK.
This would cause the tests to function incorrectly as OVNK checks this
field to know whether or not it is safe to deallocate IPs (in case any
other pod has a duplicate IP).

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 2a17c21)
(cherry picked from commit 1b4e65f)
@trozet
Copy link
Contributor Author

trozet commented Jun 23, 2022

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2022
@trozet
Copy link
Contributor Author

trozet commented Jun 23, 2022

/assign @tssurya

@tssurya
Copy link
Contributor

tssurya commented Jun 24, 2022

/retest-required

Copy link
Contributor

@tssurya tssurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
First 4 commits are clean, 5th/6th/7th have minor conflicts, but they look good to me. CI passes as well (I hope that duplicate IP test case is also run on 4.10, if not maybe we should)

@@ -1240,8 +1246,18 @@ func (oc *Controller) WatchNodes() {
if err != nil {
klog.Errorf("Unable to list existing pods on node: %s, existing pods on this node may not function")
} else {
oc.addRetryPods(pods.Items)
oc.requestRetryPods()
filteredPods := make([]kapi.Pod, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -424,7 +425,7 @@ var _ = ginkgo.Describe("OVN Pod Operations", func() {
)

myPod2, err := fakeOvn.fakeClient.KubeClient.CoreV1().Pods(t.namespace).Create(context.TODO(),
newPod(t2.namespace, t2.podName, t2.nodeName, t2.podIP), metav1.CreateOptions{})
newPod(t2.namespace, t2.podName, t2.nodeName, ""), metav1.CreateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

)

myPod2, err := fakeOvn.fakeClient.KubeClient.CoreV1().Pods(t.namespace).Create(context.TODO(),
newPod(t2.namespace, t2.podName, t2.nodeName, t2.podIP), metav1.CreateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -527,7 +527,7 @@ var _ = ginkgo.Describe("OVN Pod Operations", func() {
)

myPod2, err := fakeOvn.fakeClient.KubeClient.CoreV1().Pods(t.namespace).Create(context.TODO(),
newPod(t2.namespace, t2.podName, t2.nodeName, t2.podIP), metav1.CreateOptions{})
newPod(t2.namespace, t2.podName, t2.nodeName, ""), metav1.CreateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh you fixed it in this commit! all good.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 24, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: trozet, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 24, 2022

@trozet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn ac81e65 link false /test e2e-vsphere-ovn
ci/prow/e2e-azure-ovn ac81e65 link false /test e2e-azure-ovn
ci/prow/e2e-ovn-hybrid-step-registry ac81e65 link false /test e2e-ovn-hybrid-step-registry

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@trozet
Copy link
Contributor Author

trozet commented Jun 27, 2022

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Jun 27, 2022
@trozet
Copy link
Contributor Author

trozet commented Jun 27, 2022

/assign @anuragthehatter

@anuragthehatter
Copy link

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jun 27, 2022
@openshift-ci openshift-ci bot merged commit 8500cdb into openshift:release-4.10 Jun 27, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 27, 2022

@trozet: All pull requests linked via external trackers have merged:

Bugzilla bug 2091157 has been moved to the MODIFIED state.

In response to this:

Bug 2091157: [release-4.10] Free IPs and delete resources for completed pods

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants