ovn-ic: node update missing static routes #3724

Merged
merged 1 commit into from Jul 6, 2023

flavio-fernandes
Contributor

When a node from a remote zone is updated, we only perform the actual update when necessary. This commit improves the logic for doing the remote update in cases where the subnets of the remote node change. That is particularly needed when a node changes from ipv4 to dual stack (ipv4 + ipv6).

Reported-at: https://issues.redhat.com/browse/SDN-3993
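
As a rough illustration of the check described above, here is a minimal sketch that compares the old and new subnet lists of a node and reports whether the node gained an IPv6 subnet. The helper names `subnetFamilies` and `subnetsChanged` are hypothetical and are not taken from the ovn-kubernetes code base; the real change works on node annotations rather than plain string slices.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// subnetFamilies reports whether the given CIDRs contain IPv4 and/or IPv6 subnets.
// Hypothetical helper, for illustration only.
func subnetFamilies(cidrs []string) (hasV4, hasV6 bool, err error) {
	for _, c := range cidrs {
		ip, _, perr := net.ParseCIDR(c)
		if perr != nil {
			return false, false, fmt.Errorf("invalid subnet %q: %w", c, perr)
		}
		if ip.To4() != nil {
			hasV4 = true
		} else {
			hasV6 = true
		}
	}
	return hasV4, hasV6, nil
}

// subnetsChanged returns true when the two subnet lists differ, ignoring order.
// Hypothetical helper, for illustration only.
func subnetsChanged(oldSubnets, newSubnets []string) bool {
	if len(oldSubnets) != len(newSubnets) {
		return true
	}
	a := append([]string(nil), oldSubnets...)
	b := append([]string(nil), newSubnets...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	// A remote node converted from single stack (ipv4) to dual stack.
	oldSubnets := []string{"10.244.1.0/24"}
	newSubnets := []string{"10.244.1.0/24", "fd00:10:244:2::/64"}

	_, oldV6, _ := subnetFamilies(oldSubnets)
	_, newV6, _ := subnetFamilies(newSubnets)

	fmt.Println("subnets changed:", subnetsChanged(oldSubnets, newSubnets))
	fmt.Println("gained IPv6:", !oldV6 && newV6)
}
```

In this sketch a changed subnet list is the signal that the remote node's static routes need to be re-synced, even though the node did not move between zones.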

@flavio-fernandes
Contributor Author

@trozet @tssurya @numansiddique PTAL

@flavio-fernandes
Contributor Author

flavio-fernandes commented Jun 29, 2023

Verified fix by starting cluster in kind using kind.sh -ci and then running ./contrib/kind-dual-stack-conversion.sh

Also verified that the ipv6 static routes on all nodes are the expected ones:

❯ which kn
kn='kubectl -n ovn-kubernetes'

❯ k get nodes
NAME                STATUS   ROLES           AGE   VERSION
ovn-control-plane   Ready    control-plane   11m   v1.26.0
ovn-worker          Ready    <none>          10m   v1.26.0
ovn-worker2         Ready    <none>          10m   v1.26.0

❯ kn get pods
NAME                                     READY   STATUS    RESTARTS        AGE
ovnkube-control-plane-5c67bbc48f-f4vkz   1/1     Running   1 (8m27s ago)   8m44s
ovnkube-node-22hxp                       7/7     Running   1 (7m58s ago)   8m16s
ovnkube-node-lgp9r                       7/7     Running   1               8m15s
ovnkube-node-wpbjg                       7/7     Running   1 (7m58s ago)   8m16s
ovs-node-dgcxd                           1/1     Running   0               12m
ovs-node-n2z7g                           1/1     Running   0               12m
ovs-node-sttjq                           1/1     Running   0               12m

❯ kn exec -ti ovnkube-node-wpbjg -c nb-ovsdb -- bash
[root@ovn-control-plane ~]# ovn-nbctl lr-route-list ovn_cluster_router
IPv4 Routes
Route Table <main>:
               100.64.0.2               168.254.0.2 dst-ip
               100.64.0.3               168.254.0.3 dst-ip
               100.64.0.4                100.64.0.4 dst-ip
            10.244.1.0/24               168.254.0.2 dst-ip
            10.244.2.0/24               168.254.0.3 dst-ip
            10.244.0.0/24                100.64.0.4 src-ip

IPv6 Routes
Route Table <main>:
                  fd98::2                   fd97::2 dst-ip  <==
                  fd98::3                   fd97::3 dst-ip  <==
                  fd98::4                   fd98::4 dst-ip
         fd00:10:244::/64                   fd97::2 dst-ip
       fd00:10:244:2::/64                   fd97::3 dst-ip
       fd00:10:244:1::/64                   fd98::4 src-ip
[root@ovn-control-plane ~]#
exit

@tssurya
Member

tssurya commented Jun 29, 2023

oh, I fixed this just a couple of weeks ago and it's starting to flake again:

2023-06-29T03:59:51.7506240Z Jun 29 03:59:51.750: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-06-29T03:59:52.7509490Z Jun 29 03:59:52.750: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-06-29T03:59:53.7507017Z Jun 29 03:59:53.750: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-06-29T03:59:54.7506905Z Jun 29 03:59:54.750: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-06-29T04:00:00.0460884Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-06-29T04:00:00.5498635Z STEP: patching service lbservice-test to allocateLoadBalancerNodePorts=false and externalTrafficPolicy=local
2023-06-29T04:00:00.6030910Z Jun 29 04:00:00.602: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:38913 --kubeconfig=/home/runner/ovn.conf --namespace=default get svc lbservice-test -o=jsonpath='{.spec.allocateLoadBalancerNodePorts}''
2023-06-29T04:00:00.8051635Z Jun 29 04:00:00.804: INFO: stderr: ""
2023-06-29T04:00:00.8052383Z Jun 29 04:00:00.804: INFO: stdout: "'false'"
2023-06-29T04:00:00.8907192Z Jun 29 04:00:00.890: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:38913 --kubeconfig=/home/runner/ovn.conf --namespace=default get svc lbservice-test -o=jsonpath='{.spec.externalTrafficPolicy}''
2023-06-29T04:00:01.0847896Z Jun 29 04:00:01.084: INFO: stderr: ""
2023-06-29T04:00:06.2577322Z Jun 29 04:00:01.084: INFO: stdout: "'Local'"
2023-06-29T04:00:06.2578102Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-06-29T04:00:47.0109478Z Jun 29 04:00:47.004: FAIL: Couldn't fetch the correct number of iptable rules, err: timed out waiting for the condition
2023-06-29T04:00:47.0110012Z Unexpected error:
2023-06-29T04:00:47.0110412Z     <*errors.errorString | 0xc000208280>: {
2023-06-29T04:00:47.0110885Z         s: "timed out waiting for the condition",
2023-06-29T04:00:47.0111151Z     }
2023-06-29T04:00:47.0111470Z     timed out waiting for the condition
2023-06-29T04:00:47.0111751Z occurred
2023-06-29T04:00:47.0111891Z 
2023-06-29T04:00:47.0111999Z Full Stack Trace

I will investigate why: link to failure for safe-keeping: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5408009751/jobs/9826941679?pr=3724

Member

@tssurya tssurya left a comment

thanks for fixing this.. could you add a test if its not too complicated? we don't want to regress on this again.. to me it seems like we might not have dualstack UTs that cover the routes being created in the update from 0 to 1 subnets... I think @dcbw had written some UTs on similar-ish bug before so you could use that as inspiration.

@numansiddique
Contributor

LGTM

@flavio-fernandes
Contributor Author

> thanks for fixing this.. could you add a test if its not too complicated? we don't want to regress on this again.. to me it seems like we might not have dualstack UTs that cover the routes being created in the update from 0 to 1 subnets... I think @dcbw had written some UTs on similar-ish bug before so you could use that as inspiration.

@tssurya : I have added tests in go-controller/pkg/ovn/zone_interconnect/zone_ic_handler_test.go to exercise the adding of ipv6 static routes in the ovn_cluster_router.
I also added a test in go-controller/pkg/ovn/master_test.go to ensure that a node update in a cluster with ipv6 does indeed add the ipv6prefix to the corresponding logical switch.
PTAL

@numansiddique
Contributor

+1 for the new test cases and fixing the typo in the test cases

return false
}
return true
}, 10).Should(gomega.BeTrue())
Member

wow it takes 10seconds? i'd expect millisecond order but this is fine i guess..

Contributor Author

Ha! It is a "copy and paste" value from line 1575 above.

Member

i am not a fan of such long timeouts but its fine, won't hold the PR

Member

@tssurya tssurya left a comment

lgtm on the main logic, few questions in the test portion
THANK YOU for adding the tests :)

}
clusterRouter, err := libovsdbops.GetLogicalRouter(libovsdbOvnNBClient, &r)
if err != nil {
return fmt.Errorf("could not find cluster router %s in the nb db for default network : err - %w", r.Name, err)
Member

that's weird.. why not use
gomega.Expect(err).NotTo(gomega.HaveOccurred()) here?

Contributor Author

indeed! I copied that logic from the non-test code and failed to adapt this error checking. good catch!

}
sr, err := libovsdbops.FindLogicalRouterStaticRoutesWithPredicate(libovsdbOvnNBClient, newPredicate)
gomega.Expect(err).NotTo(gomega.HaveOccurred())
gomega.Expect(sr).Should(gomega.HaveLen(1))
Member

hmm why only 1? why don't we find the v4 route as well since its a dualstack test?

Contributor Author

that is because the loop walks each of the static routes of the logical router and looks it up by uuid, so it should get exactly 1 at each iteration.
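
To make the answer above concrete, here is a small self-contained sketch of the pattern: the router holds a list of route UUIDs, and each iteration runs a predicate lookup for a single UUID, so each lookup yields one match regardless of how many routes (v4 and v6) the router has. The `staticRoute` type and `findRoutes` function are hypothetical stand-ins, not the libovsdbops API.

```go
package main

import "fmt"

// staticRoute is a simplified, hypothetical stand-in for a logical router static route row.
type staticRoute struct {
	UUID     string
	IPPrefix string
	Nexthop  string
}

// findRoutes returns every route in db for which pred is true.
func findRoutes(db []staticRoute, pred func(staticRoute) bool) []staticRoute {
	var out []staticRoute
	for _, r := range db {
		if pred(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Dual-stack case: one ipv4 and one ipv6 route exist in the database.
	db := []staticRoute{
		{UUID: "uuid-v4", IPPrefix: "10.244.1.0/24", Nexthop: "168.254.0.2"},
		{UUID: "uuid-v6", IPPrefix: "fd00:10:244::/64", Nexthop: "fd97::2"},
	}
	// The router references its routes by UUID.
	routerRoutes := []string{"uuid-v4", "uuid-v6"}

	for _, uuid := range routerRoutes {
		matches := findRoutes(db, func(r staticRoute) bool { return r.UUID == uuid })
		// Each UUID is unique, so every iteration finds exactly one route.
		fmt.Println(uuid, "matches:", len(matches))
	}
}
```

The dual-stack coverage comes from the loop visiting both routes, not from a single lookup returning two results.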

When node from a remote zone is updated, we only perform the
actual update when necessary. This commit improved the logic
for doing the remote update in cases where the subnets of the
remote node change. That is particularly needed when node
changes from ipv4 to dual stack (ipv4 + ipv6)

Reported-at: https://issues.redhat.com/browse/SDN-3993
Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
Contributor Author

@flavio-fernandes flavio-fernandes left a comment

misc fixes. TY @tssurya !

@flavio-fernandes
Contributor Author

flavio-fernandes commented Jul 1, 2023

2 known UT failures in https://github.com/ovn-org/ovn-kubernetes/actions/runs/5433226854/jobs/9880779775?pr=3724 :

2023-07-01T22:07:24.8914473Z [Fail] OVN for APB External Route Operations on setting namespace gateway static hop reconciles deleting a pod with namespace double exgw static gateway already set [It] No BFD
2023-07-01T22:07:24.8915588Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/external_gateway_apb_test.go:716
2023-07-01T22:07:24.8916016Z 
2023-07-01T22:07:24.8916905Z [Fail] OVN Egress Gateway Operations hybrid route policy operations in lgw mode [It] should keep the hybrid route policy after deleting the namespace gateway annotation when there is an APB External Route CR overlapping the same external gateway IP
2023-07-01T22:07:24.8917825Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/egressgw_test.go:2553
2023-07-01T22:07:24.8918160Z 
2023-07-01T22:07:24.8918447Z Ran 314 of 314 Specs in 562.448 seconds

will retest...

@flavio-fernandes
Contributor Author

¯_(ツ)_/¯

@tssurya
Member

tssurya commented Jul 3, 2023

this is getting a bit annoying now.. :/ its happening on all PRs unfortunately:

2023-07-02T20:29:02.5913771Z JUnit report was created: /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/_artifacts/junit-pkg_ovn.xml
2023-07-02T20:29:02.5916756Z 
2023-07-02T20:29:02.5916792Z 
2023-07-02T20:29:02.5917456Z Summarizing 1 Failure:
2023-07-02T20:29:02.5917729Z 
2023-07-02T20:29:02.5921680Z [Fail] OVN for APB External Route Operations on using bfd [It] should disable bfd when removing the static hop from the namespace
2023-07-02T20:29:02.5922895Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/external_gateway_apb_test.go:2253
2023-07-02T20:29:02.5923459Z 
2023-07-02T20:29:02.5923961Z Ran 314 of 314 Specs in 539.999 seconds
2023-07-02T20:29:02.5924513Z FAIL! -- 313 Passed | 1 Failed | 0 Pending | 0 Skipped
2023-07-02T20:29:02.5926174Z --- FAIL: TestClusterNode (540.29s)
2023-07-02T20:29:02.5928317Z FAIL
2023-07-02T20:29:02.6409697Z coverage: 68.4% of statements

@tssurya tssurya closed this Jul 3, 2023
@tssurya tssurya reopened this Jul 3, 2023
@tssurya tssurya closed this Jul 3, 2023
@tssurya tssurya reopened this Jul 3, 2023
@coveralls

Coverage Status

coverage: 53.465% (+0.02%) from 53.444% when pulling 980abd3 on flavio-fernandes:ovnic-static-routes into 6870575 on ovn-org:master.

@tssurya
Member

tssurya commented Jul 3, 2023

again unrelated failures:

2023-07-03T20:08:38.1031137Z Summarizing 2 Failures:
2023-07-03T20:08:38.1031380Z 
2023-07-03T20:08:38.1032992Z [Fail] e2e ingress traffic validation Validating ingress traffic [It] Should be allowed by nodeport services
2023-07-03T20:08:38.1033744Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/e2e.go:1844
2023-07-03T20:08:38.1034014Z 
2023-07-03T20:08:38.1034761Z [Fail] e2e ingress traffic validation Validating ingress traffic [It] Should be allowed by nodeport services
2023-07-03T20:08:38.1035420Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/e2e.go:1844
2023-07-03T20:08:38.1035693Z 
2023-07-03T20:08:38.1036032Z Ran 82 of 228 Specs in 3411.446 seconds
2023-07-03T20:08:38.1036544Z FAIL! -- 81 Passed | 1 Failed | 0 Flaked | 0 Pending | 146 Skipped
2023-07-03T20:08:38.1040425Z 
2023-07-03T20:08:38.1040908Z You're using deprecated Ginkgo functionality:
2023-07-03T20:08:38.1041463Z =============================================

See #3739
Saving link for debugging: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447544367/jobs/9910105970?pr=3724

Member

@tssurya tssurya left a comment

/lgtm
thanks Flavio
cc @jcaamano for approval review


@flavio-fernandes
Contributor Author

/assign @jcaamano
@jcaamano PTAL

@dcbw
Contributor

dcbw commented Jul 6, 2023

LGTM

@dcbw dcbw merged commit 137556d into ovn-org:master Jul 6, 2023
49 of 54 checks passed
- syncZoneIC = syncZoneIC || h.oc.isLocalZoneNode(oldNode)
+ // Check if the node moved from local zone to remote zone and if so syncZoneIC should be set to true.
+ // Also check if node subnet changed, so static routes are properly set
+ syncZoneIC = syncZoneIC || h.oc.isLocalZoneNode(oldNode) || nodeSubnetChanged(oldNode, newNode)
Contributor Author

Update: this is great, but not enough! It does not take into consideration cases where the node annotation set via zone_cluster_controller (see node.Annotations[ovnTransitSwitchPortAddr]) changes.

See: #3770
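
One way to picture the broader check hinted at in this update is a helper that reports a change if any of several watched annotations differ between the old and new node. This is only a sketch of the idea, not the fix that landed in #3770. The function name `annotationsChanged` is hypothetical, and the two annotation key strings are assumptions about what the constants resolve to.

```go
package main

import "fmt"

const (
	// Assumed annotation keys, shown for illustration; verify against the code base.
	nodeSubnetsKey = "k8s.ovn.org/node-subnets"
	transitAddrKey = "k8s.ovn.org/node-transit-switch-port-ifaddr"
)

// annotationsChanged returns true if any of the given keys has a different
// value in the old and new annotation maps. A missing key compares equal to
// an empty value. Hypothetical helper.
func annotationsChanged(oldAnn, newAnn map[string]string, keys ...string) bool {
	for _, k := range keys {
		if oldAnn[k] != newAnn[k] {
			return true
		}
	}
	return false
}

func main() {
	oldAnn := map[string]string{
		nodeSubnetsKey: `{"default":["10.244.1.0/24"]}`,
		transitAddrKey: `{"ipv4":"168.254.0.2/16"}`,
	}
	// Subnets are unchanged, but the transit switch port address gained ipv6.
	newAnn := map[string]string{
		nodeSubnetsKey: `{"default":["10.244.1.0/24"]}`,
		transitAddrKey: `{"ipv4":"168.254.0.2/16","ipv6":"fd97::2/64"}`,
	}

	fmt.Println("subnets only:", annotationsChanged(oldAnn, newAnn, nodeSubnetsKey))
	fmt.Println("subnets + transit addr:", annotationsChanged(oldAnn, newAnn, nodeSubnetsKey, transitAddrKey))
}
```

Checking only the subnet annotation misses the second case, which is the gap this comment points out.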
