ESVC/EIP E2Es: stop sharing node IP address across multiple nodes #4149

Merged: 1 commit merged into ovn-org:master from martinkennelly:prevent-ip-clash on Feb 14, 2024

Conversation

@martinkennelly (Member) commented Feb 13, 2024

Two e2es shared the node IP 172.18.1.1 and assigned it to different nodes. This caused some e2es to time out due to a stale MAC binding entry.

See the issue below for full details.

Fixes: #4127

@@ -761,7 +761,7 @@ spec:
 	if utilnet.IsIPv6String(egress2Node.nodeIP) {
 		otherDstIP = "fc00:f853:ccd:e793:ffff::1"
 	} else {
-		otherDstIP = "172.18.1.1"
+		otherDstIP = "172.18.1.99"
@tssurya (Member) commented Feb 13, 2024:

haha, which other test is using this?
btw I am using 172.18.1.3 or something for another EFW test. But my AfterEach removes this secondary IP and cleans up, so other tests could potentially reuse it.
The key is to remove and clean up in the AfterEach of every test so that this crossover doesn't happen.
Are we giving the same IP out to two nodes in the same test?

martinkennelly (Member Author):

The other test is "Should validate the egress SVC SNAT functionality against host-networked pods".

martinkennelly (Member Author):

are we giving the same IP out to two nodes in the same test?

No. Two tests that I can see use the same IP, but not at the same time. The tests clean up after themselves by removing that IP.
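The per-test cleanup being described could look roughly like the sketch below, called from a test's AfterEach. This is a minimal illustration, not the repo's actual helper: the docker exec plumbing, the interface name "breth0", and the /32 prefix are assumptions.

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// removeSecondaryNodeIP deletes the secondary IP a test added to a kind
// node, so a later test can reuse the address without inheriting this
// node's MAC. Hypothetical helper; interface and prefix are assumptions.
func removeSecondaryNodeIP(nodeName, ip string) error {
	out, err := exec.Command("docker", "exec", nodeName,
		"ip", "addr", "del", ip+"/32", "dev", "breth0").CombinedOutput()
	if err != nil {
		return fmt.Errorf("removing %s from node %s: %v (output: %s)", ip, nodeName, err, out)
	}
	return nil
}
```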

tssurya (Member):

Hmm, so I am confused: if at a given time only one test is using the given IP, and it is cleaned up before the IP is applied to the next node, is the problem that the MAC binding entry's timeout has not been reached, so it still holds the MAC of the old node?

martinkennelly (Member Author):

yes

tssurya (Member):

Ack, LGTM from me.
Can you please update the PR description or commit description by elaborating more on:

Two e2es shared the node IP 172.18.1.1 and assigned it to different nodes. This caused some e2es to time out due to a stale MAC binding entry.

so that anyone passing by this PR can get the whole picture?
Once you do that I will merge this; sorry for being a pusher here.

tssurya (Member):

Self-note: it takes 5 minutes (300s) for the MAC binding entry to expire.
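One way to observe that staleness window is to check whether OVN's southbound MAC_Binding table still holds a row for the address before a test reuses it. This is a hedged sketch: where ovn-sbctl is reachable (the "ovnkube-db" exec target here) is an assumption for illustration.

```go
package e2e

import (
	"fmt"
	"os/exec"
	"strings"
)

// macBindingExists reports whether the OVN southbound MAC_Binding table
// still has an entry for ip. The "ovnkube-db" exec target is assumed.
func macBindingExists(ip string) (bool, error) {
	out, err := exec.Command("docker", "exec", "ovnkube-db",
		"ovn-sbctl", "--bare", "--columns=_uuid", "find", "mac_binding", "ip="+ip).CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("querying mac_binding for %s: %v (%s)", ip, err, out)
	}
	return strings.TrimSpace(string(out)) != "", nil
}
```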

tssurya (Member):

@martinkennelly also said this is just a short-term fix to get around the CI flake. We should have a better plan to avoid allocating the same IPs that already have a MAC binding? Even that, I guess, is a hack; ideally, once the IP gets deleted, we should not be blocked for ~5 minutes before we can reuse it :/
We should track this work somewhere.

martinkennelly (Member Author):

Done

@tssurya tssurya self-assigned this Feb 14, 2024
tssurya previously approved these changes Feb 14, 2024
@tssurya (Member) left a comment:

@martinkennelly, after CI runs on this one I will merge it.
Thanks for fixing this flake, nice catch!
Can't wait to see your comment detailing how you figured out the stale MAC entry :)

Updated commit message:

Two e2es assign the same node IP to different k8s nodes during their tests.
The two tests do not normally run concurrently.

When a test pod attempts to connect to this newly added node IP,
a MAC binding entry is created and persists for 300s.

If that node IP then moves to a different node, and therefore has a
different MAC address, and the test pod attempts to connect
to the IP again, OVN will use the stale MAC binding entry.

Follow-up work is needed within the e2e tests to create a global,
non-repeating IP allocator.

Fixes: ovn-org#4127

Signed-off-by: Martin Kennelly <mkennell@redhat.com>
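As a sketch of the follow-up the commit message calls for, a global non-repeating allocator for the e2e suite could be as simple as the following. The starting address, names, and package layout are illustrative assumptions, not an agreed design.

```go
package e2e

import (
	"fmt"
	"net"
	"sync"
)

// ipAllocator hands out IPv4 addresses that are never repeated within a
// test run, so a stale MAC binding for an earlier address is never hit again.
type ipAllocator struct {
	mu   sync.Mutex
	next net.IP
}

// newIPAllocator starts allocating from start, e.g. "172.18.1.100".
func newIPAllocator(start string) (*ipAllocator, error) {
	ip := net.ParseIP(start)
	if ip = ip.To4(); ip == nil {
		return nil, fmt.Errorf("%q is not a valid IPv4 address", start)
	}
	return &ipAllocator{next: ip}, nil
}

// AllocateNext returns a fresh address and advances the allocator.
func (a *ipAllocator) AllocateNext() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	ip := make(net.IP, len(a.next))
	copy(ip, a.next)
	// Increment the address for the next caller, carrying across octets.
	for i := len(a.next) - 1; i >= 0; i-- {
		a.next[i]++
		if a.next[i] != 0 {
			break
		}
	}
	return ip.String()
}
```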
@martinkennelly (Member Author) commented:

Force-pushed to update the commit message and add a TODO.

@coveralls commented Feb 14, 2024:

Coverage Status: coverage 52.0% (-0.004%) from 52.004% when pulling 78492ba on martinkennelly:prevent-ip-clash into 1aa87b4 on ovn-org:master.

@martinkennelly (Member Author):

can't wait to see your comment detailing how you figured out the stale mac entry :)

The power of ovn trace: #4127 (comment)

@tssurya (Member) commented Feb 14, 2024:

@almusil, since you had some ongoing threads about improving MAC binding aging and staleness issues on the OVN side, tagging you here.
#3678 (comment) is another discussion we have had in the past.
Even outside of the test framework here (even if we say "don't reuse IPs often"), if we run into this issue in the real world it seems a bit icky to say "wait for 5 minutes before reusing the IP"?

@martinkennelly (Member Author):

Exploring whether we can enable GARP by default within the Linux kernel to permanently fix this.
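Until a kernel-level default is settled on, one explicit workaround would be for the test itself to announce the address right after adding it, so neighbours refresh their MAC entry immediately. A hedged sketch: the node name, interface, and availability of iputils arping inside the kind node are assumptions.

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// announceIP sends a single gratuitous ARP for ip from the given node and
// interface, so peers learn the new owner's MAC right away instead of
// waiting out a stale binding.
func announceIP(nodeName, dev, ip string) error {
	out, err := exec.Command("docker", "exec", nodeName,
		"arping", "-U", "-c", "1", "-I", dev, ip).CombinedOutput()
	if err != nil {
		return fmt.Errorf("gratuitous ARP for %s on %s/%s: %v (%s)", ip, nodeName, dev, err, out)
	}
	return nil
}
```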

@tssurya (Member) commented Feb 14, 2024:

Summarizing 3 Failures:
  [FAIL] e2e egress IP validation [OVN network] Using different methods to disable a node's availability for egress Should validate the egress IP functionality against remote hosts [It] disabling egress nodes impeding GRCP health check
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419
  [FAIL] e2e egress IP validation [It] [OVN network] Should validate the egress IP SNAT functionality against host-networked pods
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419
  [FAIL] e2e egress IP validation [It] Should validate the egress IP SNAT functionality for stateful-sets
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419

OUCH @martinkennelly
https://github.com/ovn-org/ovn-kubernetes/actions/runs/7899523854/job/21560119103?pr=4149

@tssurya (Member) commented Feb 14, 2024:

2024-02-14T11:03:46.9927469Z   Feb 14 11:03:46.992: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:39913 --kubeconfig=/home/runner/ovn.conf --namespace=default get eip -o json'
2024-02-14T11:03:47.0455302Z   Feb 14 11:03:47.045: INFO: stderr: ""
2024-02-14T11:03:47.0460987Z   Feb 14 11:03:47.045: INFO: stdout: "{\n    \"apiVersion\": \"v1\",\n    \"items\": [\n        {\n            \"apiVersion\": \"k8s.ovn.org/v1\",\n            \"kind\": \"EgressIP\",\n            \"metadata\": {\n                \"creationTimestamp\": \"2024-02-14T11:01:46Z\",\n                \"generation\": 1,\n                \"name\": \"egressip\",\n                \"resourceVersion\": \"5420\",\n                \"uid\": \"9e6e6aaf-a2f9-4b8c-bb21-dd31c0a00c72\"\n            },\n            \"spec\": {\n                \"egressIPs\": [\n                    \"172.18.1.2\"\n                ],\n                \"namespaceSelector\": {\n                    \"matchLabels\": {\n                        \"name\": \"egressip-9873\"\n                    }\n                },\n                \"podSelector\": {\n                    \"matchLabels\": {\n                        \"wants\": \"egress\"\n                    }\n                }\n            }\n        }\n    ],\n    \"kind\": \"List\",\n    \"metadata\": {\n        \"resourceVersion\": \"\"\n    }\n}\n"
2024-02-14T11:03:47.0466098Z   Feb 14 11:03:47.045: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:39913 --kubeconfig=/home/runner/ovn.conf --namespace=default get eip -o json'
2024-02-14T11:03:47.0987380Z   Feb 14 11:03:47.098: INFO: stderr: ""
2024-02-14T11:03:47.0994144Z   Feb 14 11:03:47.098: INFO: stdout: "{\n    \"apiVersion\": \"v1\",\n    \"items\": [\n        {\n            \"apiVersion\": \"k8s.ovn.org/v1\",\n            \"kind\": \"EgressIP\",\n            \"metadata\": {\n                \"creationTimestamp\": \"2024-02-14T11:01:46Z\",\n                \"generation\": 1,\n                \"name\": \"egressip\",\n                \"resourceVersion\": \"5420\",\n                \"uid\": \"9e6e6aaf-a2f9-4b8c-bb21-dd31c0a00c72\"\n            },\n            \"spec\": {\n                \"egressIPs\": [\n                    \"172.18.1.2\"\n                ],\n                \"namespaceSelector\": {\n                    \"matchLabels\": {\n                        \"name\": \"egressip-9873\"\n                    }\n                },\n                \"podSelector\": {\n                    \"matchLabels\": {\n                        \"wants\": \"egress\"\n                    }\n                }\n            }\n        }\n    ],\n    \"kind\": \"List\",\n    \"metadata\": {\n        \"resourceVersion\": \"\"\n    }\n}\n"
2024-02-14T11:03:47.0998767Z   [FAILED] in [It] - /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419 @ 02/14/24 11:03:47.098

patch failed?

@tssurya (Member) commented Feb 14, 2024:

Is 172.18.1.2 also facing the same issue? Between:

  [FAIL] e2e egress IP validation [It] [OVN network] Should validate the egress IP SNAT functionality against host-networked pods
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419
  [FAIL] e2e egress IP validation [It] Should validate the egress IP SNAT functionality for stateful-sets
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419

?

@tssurya (Member) commented Feb 14, 2024:

I tried to grep for 172.18.1.99 in the local gateway lane and came up with no match in the logs... I do see it in the shared gateway lane, though, where everything passed.

@martinkennelly (Member Author):

Looks like a different issue. In the test "Using different methods to disable a node's availability for egress Should validate the egress IP functionality against remote hosts disabling egress nodes impeding GRCP health check", I see the CM (ovnkube-control-plane) isn't rolling out and is probably dead for subsequent tests:

2024-02-14T10:55:18.1451269Z   Feb 14 10:55:18.144: INFO: Waiting for deployment/ovnkube-control-plane rollout to finish: 0 of 1 updated pods are available (600 seconds elapsed)

I downloaded the kind logs and couldn't see logs at this timeframe, unfortunately.
But I looked at the kubelet logs and see this for that pod:

Feb 14 10:45:19 ovn-control-plane kubelet[668]: I0214 10:45:19.411800     668 kuberuntime_manager.go:1004] "computePodActions got for pod" podActions="<internal error: json: unsupported type: map[container.ContainerID]kuberuntime.containerToKillInfo>" pod="ovn-kubernetes/ovnkube-control-plane-5d9c49bc84-28fhr"

I see this repeating..

@tssurya (Member) commented Feb 14, 2024:

Ack. Anyway, this is unrelated to this change; if we see it happening again we can open a new flake issue.

@tssurya tssurya merged commit 0a0dd0b into ovn-org:master Feb 14, 2024
29 of 30 checks passed
@martinkennelly (Member Author):

For posterity, I see we set an env var within the CM container and then it restarts. The env var looks fine. Strange.

@tssurya tssurya added this to the v1.0.0 milestone Mar 12, 2024