ESVC/EIP E2Es: stop sharing node IP address across multiple nodes #4149

Merged: 1 commit merged into ovn-org:master from martinkennelly:prevent-ip-clash on Feb 14, 2024

Conversation

@martinkennelly (Member) commented Feb 13, 2024

Two e2es shared the node IP 172.18.1.1 and assigned it to different nodes. This caused some e2es to time out due to a stale MAC binding entry.

See the issue below for full details.

Fixes: #4127

@@ -761,7 +761,7 @@ spec:
 	if utilnet.IsIPv6String(egress2Node.nodeIP) {
 		otherDstIP = "fc00:f853:ccd:e793:ffff::1"
 	} else {
-		otherDstIP = "172.18.1.1"
+		otherDstIP = "172.18.1.99"
@tssurya (Member) commented Feb 13, 2024:

haha, which other test is using this?
btw I am using 172.18.1.3 or something for another EFW test. But my AfterEach removes this secondary IP and cleans up, so other tests could potentially reuse it.
The key is to remove and clean up in the AfterEach of every test so that this crossover doesn't happen.
Are we giving the same IP out to two nodes in the same test?

martinkennelly (Member Author):

The other test is "Should validate the egress SVC SNAT functionality against host-networked pods".

martinkennelly (Member Author):

are we giving the same IP out to two nodes in the same test?

No. Two tests that I can see use the same IP, but not at the same time. The tests clean up after themselves by removing that IP.
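The per-test cleanup being described could look roughly like the sketch below, called from a test's AfterEach. This is a minimal illustration, not the repo's actual helper: the docker exec plumbing, the interface name "breth0", and the /32 prefix are assumptions.

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// removeSecondaryNodeIP deletes the secondary IP a test added to a kind
// node, so a later test can reuse the address without inheriting this
// node's MAC. Hypothetical helper; interface and prefix are assumptions.
func removeSecondaryNodeIP(nodeName, ip string) error {
	out, err := exec.Command("docker", "exec", nodeName,
		"ip", "addr", "del", ip+"/32", "dev", "breth0").CombinedOutput()
	if err != nil {
		return fmt.Errorf("removing %s from node %s: %v (output: %s)", ip, nodeName, err, out)
	}
	return nil
}
```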

tssurya (Member):

Hmm, so I am confused: if at a given time only one test is using the given IP, and it is cleaned up before the IP is applied to the next node, is the problem that the MAC binding entry's timeout has not been reached, so it still holds the MAC of the old node?

martinkennelly (Member Author):

yes

tssurya (Member):

Ack, LGTM from me.
Can you please update the PR description or commit description by elaborating more on:

Two e2es shared the node IP 172.18.1.1 and assigned it to different nodes. This caused some e2es to time out due to a stale MAC binding entry.

so that anyone passing by this PR can get the whole picture?
Once you do that I will merge this; sorry for being a pusher here.

tssurya (Member):

Self-note: it takes 5 minutes (300s) for the MAC binding entry to expire.
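One way to observe that staleness window is to check whether OVN's southbound MAC_Binding table still holds a row for the address before a test reuses it. This is a hedged sketch: where ovn-sbctl is reachable (the "ovnkube-db" exec target here) is an assumption for illustration.

```go
package e2e

import (
	"fmt"
	"os/exec"
	"strings"
)

// macBindingExists reports whether the OVN southbound MAC_Binding table
// still has an entry for ip. The "ovnkube-db" exec target is assumed.
func macBindingExists(ip string) (bool, error) {
	out, err := exec.Command("docker", "exec", "ovnkube-db",
		"ovn-sbctl", "--bare", "--columns=_uuid", "find", "mac_binding", "ip="+ip).CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("querying mac_binding for %s: %v (%s)", ip, err, out)
	}
	return strings.TrimSpace(string(out)) != "", nil
}
```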

tssurya (Member):

@martinkennelly also said this is just a short-term fix to get around the CI flake. We should have a better plan to avoid allocating the same IPs that already have a MAC binding? Even that, I guess, is a hack; ideally, once the IP gets deleted, we should not be blocked for ~5 minutes before we can reuse it :/
We should track this work somewhere.

martinkennelly (Member Author):

Done

@tssurya tssurya self-assigned this Feb 14, 2024
tssurya previously approved these changes Feb 14, 2024
@tssurya (Member) left a comment:

@martinkennelly, after CI runs on this one I will merge it.
Thanks for fixing this flake, nice catch!
Can't wait to see your comment detailing how you figured out the stale MAC entry :)

Updated commit message:

Two e2es assign the same node IP to different k8s nodes during their tests.
The two tests do not normally run concurrently.

When a test pod attempts to connect to this newly added node IP,
a MAC binding entry is created and persists for 300s.

If that node IP then moves to a different node, and therefore has a
different MAC address, and the test pod attempts to connect
to the IP again, OVN will use the stale MAC binding entry.

Follow-up work is needed within the e2e tests to create a global,
non-repeating IP allocator.

Fixes: ovn-org#4127

Signed-off-by: Martin Kennelly <mkennell@redhat.com>
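As a sketch of the follow-up the commit message calls for, a global non-repeating allocator for the e2e suite could be as simple as the following. The starting address, names, and package layout are illustrative assumptions, not an agreed design.

```go
package e2e

import (
	"fmt"
	"net"
	"sync"
)

// ipAllocator hands out IPv4 addresses that are never repeated within a
// test run, so a stale MAC binding for an earlier address is never hit again.
type ipAllocator struct {
	mu   sync.Mutex
	next net.IP
}

// newIPAllocator starts allocating from start, e.g. "172.18.1.100".
func newIPAllocator(start string) (*ipAllocator, error) {
	ip := net.ParseIP(start)
	if ip = ip.To4(); ip == nil {
		return nil, fmt.Errorf("%q is not a valid IPv4 address", start)
	}
	return &ipAllocator{next: ip}, nil
}

// AllocateNext returns a fresh address and advances the allocator.
func (a *ipAllocator) AllocateNext() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	ip := make(net.IP, len(a.next))
	copy(ip, a.next)
	// Increment the address for the next caller, carrying across octets.
	for i := len(a.next) - 1; i >= 0; i-- {
		a.next[i]++
		if a.next[i] != 0 {
			break
		}
	}
	return ip.String()
}
```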
@martinkennelly (Member Author) commented:

Force-pushed to update the commit message and add a TODO.

@coveralls commented Feb 14, 2024:

Coverage Status: coverage 52.0% (-0.004%) from 52.004% when pulling 78492ba on martinkennelly:prevent-ip-clash into 1aa87b4 on ovn-org:master.

@martinkennelly (Member Author):

can't wait to see your comment detailing how you figured out the stale mac entry :)

The power of ovn trace: #4127 (comment)

@tssurya (Member) commented Feb 14, 2024:

@almusil, since you had some ongoing threads about improving MAC binding aging and staleness issues on the OVN side, tagging you here.
#3678 (comment) is another discussion we have had in the past.
Even outside of the test framework here (even if we say "don't reuse IPs often"), if we run into this issue in the real world it seems a bit icky to say "wait for 5 minutes before reusing the IP"?

@martinkennelly (Member Author):

Exploring whether we can enable GARP by default within the Linux kernel to permanently fix this.
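Until a kernel-level default is settled on, one explicit workaround would be for the test itself to announce the address right after adding it, so neighbours refresh their MAC entry immediately. A hedged sketch: the node name, interface, and availability of iputils arping inside the kind node are assumptions.

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// announceIP sends a single gratuitous ARP for ip from the given node and
// interface, so peers learn the new owner's MAC right away instead of
// waiting out a stale binding.
func announceIP(nodeName, dev, ip string) error {
	out, err := exec.Command("docker", "exec", nodeName,
		"arping", "-U", "-c", "1", "-I", dev, ip).CombinedOutput()
	if err != nil {
		return fmt.Errorf("gratuitous ARP for %s on %s/%s: %v (%s)", ip, nodeName, dev, err, out)
	}
	return nil
}
```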

@tssurya (Member) commented Feb 14, 2024:

Summarizing 3 Failures:
  [FAIL] e2e egress IP validation [OVN network] Using different methods to disable a node's availability for egress Should validate the egress IP functionality against remote hosts [It] disabling egress nodes impeding GRCP health check
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419
  [FAIL] e2e egress IP validation [It] [OVN network] Should validate the egress IP SNAT functionality against host-networked pods
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419
  [FAIL] e2e egress IP validation [It] Should validate the egress IP SNAT functionality for stateful-sets
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419

OUCH @martinkennelly
https://github.com/ovn-org/ovn-kubernetes/actions/runs/7899523854/job/21560119103?pr=4149

@tssurya (Member) commented Feb 14, 2024:

2024-02-14T11:03:46.9927469Z   Feb 14 11:03:46.992: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:39913 --kubeconfig=/home/runner/ovn.conf --namespace=default get eip -o json'
2024-02-14T11:03:47.0455302Z   Feb 14 11:03:47.045: INFO: stderr: ""
2024-02-14T11:03:47.0460987Z   Feb 14 11:03:47.045: INFO: stdout: "{\n    \"apiVersion\": \"v1\",\n    \"items\": [\n        {\n            \"apiVersion\": \"k8s.ovn.org/v1\",\n            \"kind\": \"EgressIP\",\n            \"metadata\": {\n                \"creationTimestamp\": \"2024-02-14T11:01:46Z\",\n                \"generation\": 1,\n                \"name\": \"egressip\",\n                \"resourceVersion\": \"5420\",\n                \"uid\": \"9e6e6aaf-a2f9-4b8c-bb21-dd31c0a00c72\"\n            },\n            \"spec\": {\n                \"egressIPs\": [\n                    \"172.18.1.2\"\n                ],\n                \"namespaceSelector\": {\n                    \"matchLabels\": {\n                        \"name\": \"egressip-9873\"\n                    }\n                },\n                \"podSelector\": {\n                    \"matchLabels\": {\n                        \"wants\": \"egress\"\n                    }\n                }\n            }\n        }\n    ],\n    \"kind\": \"List\",\n    \"metadata\": {\n        \"resourceVersion\": \"\"\n    }\n}\n"
2024-02-14T11:03:47.0466098Z   Feb 14 11:03:47.045: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:39913 --kubeconfig=/home/runner/ovn.conf --namespace=default get eip -o json'
2024-02-14T11:03:47.0987380Z   Feb 14 11:03:47.098: INFO: stderr: ""
2024-02-14T11:03:47.0994144Z   Feb 14 11:03:47.098: INFO: stdout: "{\n    \"apiVersion\": \"v1\",\n    \"items\": [\n        {\n            \"apiVersion\": \"k8s.ovn.org/v1\",\n            \"kind\": \"EgressIP\",\n            \"metadata\": {\n                \"creationTimestamp\": \"2024-02-14T11:01:46Z\",\n                \"generation\": 1,\n                \"name\": \"egressip\",\n                \"resourceVersion\": \"5420\",\n                \"uid\": \"9e6e6aaf-a2f9-4b8c-bb21-dd31c0a00c72\"\n            },\n            \"spec\": {\n                \"egressIPs\": [\n                    \"172.18.1.2\"\n                ],\n                \"namespaceSelector\": {\n                    \"matchLabels\": {\n                        \"name\": \"egressip-9873\"\n                    }\n                },\n                \"podSelector\": {\n                    \"matchLabels\": {\n                        \"wants\": \"egress\"\n                    }\n                }\n            }\n        }\n    ],\n    \"kind\": \"List\",\n    \"metadata\": {\n        \"resourceVersion\": \"\"\n    }\n}\n"
2024-02-14T11:03:47.0998767Z   [FAILED] in [It] - /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419 @ 02/14/24 11:03:47.098

patch failed?

@tssurya (Member) commented Feb 14, 2024:

Is 172.18.1.2 also facing the same issue? Between:

  [FAIL] e2e egress IP validation [It] [OVN network] Should validate the egress IP SNAT functionality against host-networked pods
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419
  [FAIL] e2e egress IP validation [It] Should validate the egress IP SNAT functionality for stateful-sets
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:419

?

@tssurya (Member) commented Feb 14, 2024:

I tried to grep for 172.18.1.99 in the local gateway lane and came up with no match in the logs... I do see it in the shared gateway lane, though, where everything passed.

@martinkennelly (Member Author):

Looks like a different issue. In the test "Using different methods to disable a node's availability for egress Should validate the egress IP functionality against remote hosts disabling egress nodes impeding GRCP health check", I see the CM (ovnkube-control-plane) isn't rolling out and is probably dead for subsequent tests:

2024-02-14T10:55:18.1451269Z   Feb 14 10:55:18.144: INFO: Waiting for deployment/ovnkube-control-plane rollout to finish: 0 of 1 updated pods are available (600 seconds elapsed)

I downloaded the kind logs and couldn't see logs at this timeframe, unfortunately.
But I looked at the kubelet logs and see this for that pod:

Feb 14 10:45:19 ovn-control-plane kubelet[668]: I0214 10:45:19.411800     668 kuberuntime_manager.go:1004] "computePodActions got for pod" podActions="<internal error: json: unsupported type: map[container.ContainerID]kuberuntime.containerToKillInfo>" pod="ovn-kubernetes/ovnkube-control-plane-5d9c49bc84-28fhr"

I see this repeating..

@tssurya (Member) commented Feb 14, 2024:

Ack. Anyway, this is unrelated to this change; if we see it happening again we can open a new flake issue.

@tssurya tssurya merged commit 0a0dd0b into ovn-org:master Feb 14, 2024
29 of 30 checks passed
@martinkennelly (Member Author):

For posterity, I see we set an env var within the CM container and then it restarts. The env var looks fine. Strange.

@tssurya tssurya added this to the v1.0.0 milestone Mar 12, 2024