ESVC/EIP E2Es: stop sharing node IP address across multiple nodes #4149
@@ -761,7 +761,7 @@ spec:
 	if utilnet.IsIPv6String(egress2Node.nodeIP) {
 		otherDstIP = "fc00:f853:ccd:e793:ffff::1"
 	} else {
-		otherDstIP = "172.18.1.1"
+		otherDstIP = "172.18.1.99"
haha, which other test is using this?
btw I am using 172.18.1.3 or something for another EFW test, but my AfterEach removes this secondary IP and cleans up, so other tests could potentially reuse it.
The key is to remove and clean up in the AfterEach of every test so that this crossover doesn't happen.
are we giving the same IP out to two nodes in the same test?
Should validate the egress SVC SNAT functionality against host-networked pods
are we giving the same IP out to two nodes in the same test?
No. I can see two tests using the same IP, but not at the same time. Each test cleans up after itself by removing that IP.
hmm so I am confused: if at a given time only one test is using the given IP, and it gets cleaned up before it is applied to the next node, is the problem that the MAC binding entry's timeout has not been reached, so it still holds the MAC of the old node?
yes
ack, this lgtm.
Can you please update the PR description or commit description by elaborating more on:
Two e2es shared the node IP 172.18.1.1 and assigned it to different nodes. This caused some e2es to time out due to a stale MAC binding entry.
so that anyone passing by this PR can get the whole picture.
Once you do that I will merge this, sorry for being a pusher here.
self-note: takes 5mins for the entry to expire
@martinkennelly also said this is just a short-term fix to get around the CI flake. We should have a better plan to avoid allocating the same IPs that already have a MAC binding? Even that is a hack, I guess. Ideally, once the IP gets deleted, we should not have issues reusing it for something like 5 minutes :/
We should track this work somewhere.
Done
@martinkennelly after CI runs on this one I will merge this
thanks for fixing this flake! nice catch!
can't wait to see your comment detailing how you figured out the stale mac entry :)
Two e2es assign the same node IP to different k8s nodes during their test. The two tests do not normally run concurrently.

When a test pod attempts to connect to this newly added node IP, a MAC binding entry is created and persists for 300s. If that node IP then moves to a different node, and therefore has a different MAC address, and the test pod attempts to connect to this IP again, OVN will use the stale MAC binding entry.

Follow-up work is needed within the e2e tests to create a global, non-repeating e2e IP allocator.

Fixes: ovn-org#4127
Signed-off-by: Martin Kennelly <mkennell@redhat.com>
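One possible shape for the non-repeating allocator mentioned as follow-up work is sketched below. This is an assumption about how it could look, not code from this PR or from the existing e2e framework: the `ipAllocator` type, `newIPAllocator`, `Allocate`, and the chosen range are all hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// ipAllocator (hypothetical) hands out each address in a prefix at most once
// per process, so two tests can never be given the same node IP.
type ipAllocator struct {
	mu     sync.Mutex
	prefix netip.Prefix
	next   netip.Addr
}

// newIPAllocator starts allocating at first, which must lie inside cidr.
func newIPAllocator(cidr, first string) (*ipAllocator, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	a, err := netip.ParseAddr(first)
	if err != nil {
		return nil, err
	}
	if !p.Contains(a) {
		return nil, fmt.Errorf("%s is not inside %s", a, p)
	}
	return &ipAllocator{prefix: p, next: a}, nil
}

// Allocate returns the next unused address and never returns it again.
// There is deliberately no Release: a freed IP may still have a live
// MAC binding, so it is not recycled.
func (a *ipAllocator) Allocate() (netip.Addr, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if !a.next.IsValid() || !a.prefix.Contains(a.next) {
		return netip.Addr{}, fmt.Errorf("range %s exhausted", a.prefix)
	}
	ip := a.next
	a.next = a.next.Next()
	return ip, nil
}

func main() {
	// The range and start address are placeholders, not values from the test suite.
	alloc, err := newIPAllocator("172.18.1.0/24", "172.18.1.100")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 2; i++ {
		ip, err := alloc.Allocate()
		if err != nil {
			panic(err)
		}
		fmt.Println(ip)
	}
}
```

This sketch does not skip the broadcast address or check for addresses already in use on the network; a real allocator for the e2e suite would need to handle both.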
Force-pushed from c5ace0f to 78492ba.
Force-pushed to update the commit message and add a TODO.
The power of ovn trace: #4127 (comment)
@almusil since you had some ongoing threads on improving MAC binding aging and staleness issues on the OVN side, tagging you here.
#3678 (comment) is another discussion we have had in the past.
Outside of the test framework (even if we say don't reuse IPs "often"), if we run into this issue in the real world too, it seems a bit icky to say: wait 5 minutes before reusing the IP?
Exploring if we can enable GARP by default within the Linux kernel to permanently fix this.
OUCH @martinkennelly
patch failed?
is 1.2 also facing same issue? between:
?
I tried to grep for 172.18.1.99 in the local gateway lane and came up with no match in the logs... I see it in the shared gateway lane though, where everything passed.
Looks like a different issue, in test
I downloaded the kind logs and couldn't see logs for this timeframe, unfortunately.
I see this repeating..
ack, anyway this is unrelated to this change; if we see this happening again we can open a new flake.
For posterity, I see we set an env var within the CM container and then it restarts. The env var looks fine. Strange.
Two e2es shared the node IP 172.18.1.1 and assigned it to different nodes. This caused some e2es to time out due to a stale MAC binding entry.
See the issue below for full details.
Fixes: #4127