
Fixes panic with delete NAT operations #3714

Merged · 1 commit · Jun 28, 2023

Conversation

@trozet (Contributor) commented Jun 28, 2023

Regression introduced by:
25d892c

Would end up appending a nil NAT and causing an NPE when trying to execute EquivalentNAT:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x178ea7b]

goroutine 140 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.isEquivalentNAT(0x0, 0xc002e6cea0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/libovsdbops/router.go:935 +0x9b
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.DeleteNATsOps({0x24aa068, 0xc0008c0fc0}, {0x0, 0x0, 0x0}, 0xc002e1a4e0, {0xc000130c48, 0x1, 0xc00106f6f8?})
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/libovsdbops/router.go:1087 +0x728

Seen downstream:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1726/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-ipv6/1673751230423240704/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-fb52f_ovnkube-master_previous.log


Signed-off-by: Tim Rozet <trozet@redhat.com>
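The broken pattern and the fix can be sketched in a minimal, self-contained form. This is an illustration only: `nat`, `getNAT`, `getRouterNATs`, and `errNotFound` below are stand-ins for the real `nbdb.NAT`, `GetNAT`, `GetRouterNATs`, and `libovsdbclient.ErrNotFound`, not the actual ovn-kubernetes code.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for nbdb.NAT.
type nat struct{ UUID string }

// Stand-in for libovsdbclient.ErrNotFound.
var errNotFound = errors.New("object not found")

// getNAT mimics a cache lookup: it returns (nil, errNotFound)
// when the UUID is not in the cache.
func getNAT(known map[string]*nat, uuid string) (*nat, error) {
	if n, ok := known[uuid]; ok {
		return n, nil
	}
	return nil, errNotFound
}

// getRouterNATs collects the NATs referenced by a router.
// The regression: `if err != nil && err != errNotFound` swallowed the
// not-found case and the nil pointer was still appended, which later
// dereferenced nil in isEquivalentNAT. The fix skips missing entries.
func getRouterNATs(known map[string]*nat, uuids []string) ([]*nat, error) {
	nats := []*nat{}
	for _, uuid := range uuids {
		n, err := getNAT(known, uuid)
		if err == errNotFound {
			continue // stale reference: do not append a nil NAT
		}
		if err != nil {
			return nil, err
		}
		nats = append(nats, n)
	}
	return nats, nil
}

func main() {
	known := map[string]*nat{"a": {UUID: "a"}}
	// "stale" is not in the cache; it is skipped rather than appended as nil.
	nats, err := getRouterNATs(known, []string{"a", "stale"})
	fmt.Println(len(nats), err)
}
```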
@tssurya (Member) left a comment:

We need to think more about why this panic wasn't caught upstream in our CI, but I won't hold the PR for that. Glad we caught it before it made it downstream.

```diff
@@ -1005,7 +1005,10 @@ func GetRouterNATs(nbClient libovsdbclient.Client, router *nbdb.LogicalRouter) (
 	nats := []*nbdb.NAT{}
 	for _, uuid := range r.Nat {
 		nat, err := GetNAT(nbClient, &nbdb.NAT{UUID: uuid})
-		if err != nil && err != libovsdbclient.ErrNotFound {
+		if err == libovsdbclient.ErrNotFound {
```
tssurya (Member) commented:
IIUC, we were hitting another error that wasn't a "not found" error, and that case wasn't being handled?

Also, was this the only spot we need to fix? I am unsure if I reviewed the initial fix, but we had other places where we made similar fixes, right @flavio-fernandes? Let's make sure we double-check those.
That said, based on the introducing commit hash 25d892c, this LGTM.

trozet (Contributor, Author) replied:

I briefly scanned #3647 last night and didn't see anything...but that was like after midnight soooo :)

flavio-fernandes (Contributor) replied:

@tssurya @trozet I went over #3647 as well and did not find any other places with the same mistake. ;)

@trozet (Contributor, Author) commented Jun 28, 2023:

hmm egress ip flake:

```
2023-06-28T07:05:30.8607565Z [Fail] e2e egress IP validation [It] Should validate the egress IP SNAT functionality against host-networked pods
2023-06-28T07:05:30.8608235Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:751
2023-06-28T07:05:30.8608488Z
2023-06-28T07:05:30.8608984Z [Fail] e2e egress IP validation [It] Should validate the egress IP SNAT functionality against host-networked pods
2023-06-28T07:05:30.8609883Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egressip.go:751
```

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5397299956/jobs/9802740204?pr=3714

@jcaamano (Contributor) commented:

> hmm egress ip flake: (log quoted above)
> https://github.com/ovn-org/ovn-kubernetes/actions/runs/5397299956/jobs/9802740204?pr=3714

Looks like an IP was not correctly removed from the external gateway address set; a pod for the egressip test was then re-allocated that IP and did not match the node no-reroute policy (prio 102) because it matched the external gateway policy (prio 501) first:

```
❯ ovn-nbctl --db unix:/tmp/ovs-offline/var-run/ovn_nb/ovnnb_db.sock lr-policy-list ovn_cluster_router
Routing Policies
      1004 inport == "rtos-ovn-worker" && ip4.dst == 172.18.0.4 /* ovn-worker */         reroute                10.244.1.2
       501 inport == "rtos-ovn-worker" && ip4.src == $a6142007691221330953 && ip4.dst != 10.244.0.0/16         reroute                100.64.0.3
       102 (ip4.src == $a4548040316634674295 || ip4.src == $a13607449821398607916) && ip4.dst == $a14918748166599097711           allow               pkt_mark=1008
       102 ip4.src == 10.244.0.0/16 && ip4.dst == 10.244.0.0/16           allow
       102 ip4.src == 10.244.0.0/16 && ip4.dst == 100.64.0.0/16           allow
       100                              ip4.src == 10.244.1.7         reroute                100.64.0.3
❯ ovn-nbctl --db unix:/tmp/ovs-offline/var-run/ovn_nb/ovnnb_db.sock list Address_Set a6142007691221330953
_uuid               : 160b5649-a76f-4bcf-a199-39fe5cf7bfcb
addresses           : ["10.244.1.7"]
external_ids        : {ip-family=v4, "k8s.ovn.org/id"="default-network-controller:HybridNodeRoute:ovn-worker:v4", "k8s.ovn.org/name"=ovn-worker, "k8s.ovn.org/owner-controller"=default-network-controller, "k8s.ovn.org/owner-type"=HybridNodeRoute}
name                : a6142007691221330953
❯ ovn-nbctl --db unix:/tmp/ovs-offline/var-run/ovn_nb/ovnnb_db.sock list Address_Set a4548040316634674295
_uuid               : 63de535f-b6a6-4c6b-9932-ae8eaa2c6864
addresses           : ["10.244.1.7"]
external_ids        : {ip-family=v4, "k8s.ovn.org/id"="default-network-controller:EgressIP:egressip-served-pods:v4", "k8s.ovn.org/name"=egressip-served-pods, "k8s.ovn.org/owner-controller"=default-network-controller, "k8s.ovn.org/owner-type"=EgressIP}
name                : a4548040316634674295
```

@flavio-fernandes (Contributor) left a comment:

/lgtm

```diff
@@ -1005,7 +1005,10 @@ func GetRouterNATs(nbClient libovsdbclient.Client, router *nbdb.LogicalRouter) (
 	nats := []*nbdb.NAT{}
 	for _, uuid := range r.Nat {
 		nat, err := GetNAT(nbClient, &nbdb.NAT{UUID: uuid})
-		if err != nil && err != libovsdbclient.ErrNotFound {
+		if err == libovsdbclient.ErrNotFound {
```
A Contributor commented:

NIT: I would use `errors.Is(err, libovsdbclient.ErrNotFound)` in case we wrap it in the future:

```go
if errors.Is(err, libovsdbclient.ErrNotFound) {
```

```diff
@@ -1005,7 +1005,10 @@ func GetRouterNATs(nbClient libovsdbclient.Client, router *nbdb.LogicalRouter) (
 	nats := []*nbdb.NAT{}
 	for _, uuid := range r.Nat {
 		nat, err := GetNAT(nbClient, &nbdb.NAT{UUID: uuid})
-		if err != nil && err != libovsdbclient.ErrNotFound {
+		if err == libovsdbclient.ErrNotFound {
+			continue
```
A Contributor commented:

duh! 🤦 I can't believe I missed that in the previous fix.

@jcaamano (Contributor) commented:

> hmm egress ip flake: (log quoted above)
> https://github.com/ovn-org/ovn-kubernetes/actions/runs/5397299956/jobs/9802740204?pr=3714

The second attempt fails because the host-networked pod is not able to open the port:

```
2023-06-28T06:51:11.9926850Z Jun 28 06:51:11.992: INFO: pod egressip-9277/ovn-worker2-host-net-pod logs:
2023-06-28T06:51:11.9928337Z AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using fc00:f853:ccd:e793::3. Set the 'ServerName' directive globally to suppress this message
2023-06-28T06:51:11.9929089Z (98)Address already in use: AH00072: make_sock: could not bind to address [::]:80
2023-06-28T06:51:11.9930077Z (98)Address already in use: AH00072: make_sock: could not bind to address 0.0.0.0:80
2023-06-28T06:51:11.9930498Z no listening sockets available, shutting down
2023-06-28T06:51:11.9930826Z AH00015: Unable to open logs
```

@flavio-fernandes (Contributor) commented:

This also seems wrong: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5397299956/jobs/9802739319?pr=3714

/me taking a closer look at it right now.

e2e-dual-conversion (noHA, interconnect-single-node-zones, 3, 1)

```
+ kubectl -n ovn-kubernetes delete pod -l name=ovnkube-db
No resources found
+ kubectl -n ovn-kubernetes delete pod -l name=ovnkube-zone-controller
No resources found    <==== ?!?!
+ kubectl -n ovn-kubernetes delete pod -l name=ovnkube-master
No resources found
+ kubectl -n ovn-kubernetes delete pod -l name=ovnkube-control-plane
pod "ovnkube-control-plane-7b8fd669f4-m94gz" deleted
+ kubectl -n ovn-kubernetes rollout restart daemonset ovnkube-node
Warning: spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
daemonset.apps/ovnkube-node restarted
+ kubectl -n ovn-kubernetes rollout status daemonset ovnkube-node
```

@trozet trozet mentioned this pull request Jun 28, 2023
@trozet (Contributor, Author) commented Jun 28, 2023:

We know the egress IP failure was due to an exgw issue fixed by another PR, and @flavio-fernandes and @tssurya identified that the dualstack conversion failure is due to missing IPv6 routes on ovn_cluster_router. Neither failure has anything to do with this PR. Will merge.

@trozet trozet merged commit 8bd07ff into ovn-org:master Jun 28, 2023
24 of 26 checks passed
@flavio-fernandes (Contributor) commented:

Note: The fix for the missing ipv6 route is in PR #3724
