New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ovn-ic: node update missing static routes #3724
Conversation
2d04856
to
2939719
Compare
Verified fix by starting cluster in kind using Also by ensuring that ipv6 static routes on all nodes have the appropriate routes:
|
84ad24f
to
8df1540
Compare
oh i had fixed this just a couple of weeks ago and its starting to flake again:
I will investigate why: link to failure for safe-keeping: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5408009751/jobs/9826941679?pr=3724 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for fixing this.. could you add a test if its not too complicated? we don't want to regress on this again.. to me it seems like we might not have dualstack UTs that cover the routes being created in the update from 0 to 1 subnets... I think @dcbw had written some UTs on similar-ish bug before so you could use that as inspiration.
LGTM |
8df1540
to
74d8af7
Compare
@tssurya : I have added tests in |
+1 for the new test cases and fixing the typo in the test cases |
return false | ||
} | ||
return true | ||
}, 10).Should(gomega.BeTrue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow it takes 10seconds? i'd expect millisecond order but this is fine i guess..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! It is a "copy and paste" value from line 1575 above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not a fan of such long timeouts but its fine, won't hold the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm on the main logic, few questions in the test portion
THANK YOU for adding the tests :)
} | ||
clusterRouter, err := libovsdbops.GetLogicalRouter(libovsdbOvnNBClient, &r) | ||
if err != nil { | ||
return fmt.Errorf("could not find cluster router %s in the nb db for default network : err - %w", r.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's weird.. why not use
gomega.Expect(err).NotTo(gomega.HaveOccurred())
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed! I copied that logic from the non-test code and failed to adapt this error checking. good catch!
} | ||
sr, err := libovsdbops.FindLogicalRouterStaticRoutesWithPredicate(libovsdbOvnNBClient, newPredicate) | ||
gomega.Expect(err).NotTo(gomega.HaveOccurred()) | ||
gomega.Expect(sr).Should(gomega.HaveLen(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm why only 1? why don't we find the v4 route as well since its a dualstack test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is bc the loop walks each one of the static route from the logical router and looks it up by uuid. It should be getting 1 at each iteration.
When node from a remote zone is updated, we only perform the actual update when necessary. This commit improved the logic for doing the remote update in cases where the subnets of the remote node change. That is particularly needed when node changes from ipv4 to dual stack (ipv4 + ipv6) Reported-at: https://issues.redhat.com/browse/SDN-3993 Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
74d8af7
to
980abd3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
misc fixes. TY @tssurya !
return false | ||
} | ||
return true | ||
}, 10).Should(gomega.BeTrue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! It is a "copy and paste" value from line 1575 above.
} | ||
clusterRouter, err := libovsdbops.GetLogicalRouter(libovsdbOvnNBClient, &r) | ||
if err != nil { | ||
return fmt.Errorf("could not find cluster router %s in the nb db for default network : err - %w", r.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed! I copied that logic from the non-test code and failed to adapt this error checking. good catch!
} | ||
sr, err := libovsdbops.FindLogicalRouterStaticRoutesWithPredicate(libovsdbOvnNBClient, newPredicate) | ||
gomega.Expect(err).NotTo(gomega.HaveOccurred()) | ||
gomega.Expect(sr).Should(gomega.HaveLen(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is bc the loop walks each one of the static route from the logical router and looks it up by uuid. It should be getting 1 at each iteration.
2 known UT failures in https://github.com/ovn-org/ovn-kubernetes/actions/runs/5433226854/jobs/9880779775?pr=3724 :
will retest... |
¯_(ツ)_/¯ |
this is getting a bit annoying now.. :/ its happening on all PRs unfortunately:
|
again unrelated failures:
See #3739 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
thanks Flavio
cc @jcaamano for approval review
return false | ||
} | ||
return true | ||
}, 10).Should(gomega.BeTrue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not a fan of such long timeouts but its fine, won't hold the PR
LGTM |
syncZoneIC = syncZoneIC || h.oc.isLocalZoneNode(oldNode) | ||
// Check if the node moved from local zone to remote zone and if so syncZoneIC should be set to true. | ||
// Also check if node subnet changed, so static routes are properly set | ||
syncZoneIC = syncZoneIC || h.oc.isLocalZoneNode(oldNode) || nodeSubnetChanged(oldNode, newNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: this is great, but not enough! It does not take into consideration cases where the node annotation set via zone_cluster_controller see node.Annotations[ovnTransitSwitchPortAddr]
changed.
See: #3770
When node from a remote zone is updated, we only perform the actual update when necessary. This commit improved the logic for doing the remote update in cases where the subnets of the remote node change. That is particularly needed when node changes from ipv4 to dual stack (ipv4 + ipv6)
Reported-at: https://issues.redhat.com/browse/SDN-3993