
Enable shared gateway mode for OVN #727

Merged: 1 commit, Aug 2, 2020

Conversation

@trozet (Contributor) commented Jul 23, 2020

This patch migrates from using Local gateway mode to Shared gateway mode
with ovn-kubernetes. With shared gateway mode, the external network NIC
is now directly configured as part of an OVS bridge, which has a Layer 2
connection to OVN, effectively allowing OVN to share the NIC with the
host as a Layer 2 network.

Unlike Local gateway mode, this eliminates the need to route through the
kernel for certain OVN traffic to egress the host.

Signed-off-by: Tim Rozet <trozet@redhat.com>
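
For readers new to the two gateway modes, here is a minimal sketch of what "the external NIC becomes part of an OVS bridge" means in practice. It is illustrative only: the bridge and NIC names (br-ex, eth0) are assumptions, and the real setup is performed by the MCO change referenced below, not by these exact commands.

    # Illustrative sketch of shared gateway mode at the OVS level; names assumed.
    ovs-vsctl --may-exist add-br br-ex          # external bridge shared with OVN
    ovs-vsctl --may-exist add-port br-ex eth0   # attach the physical NIC to it
    # The host's IP configuration then moves from eth0 to br-ex, so the host and
    # OVN (patched into br-ex from br-int) share the NIC on one Layer 2 segment.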

@trozet (Contributor, Author) commented Jul 23, 2020

/hold

Wait for corresponding MCO change:
openshift/machine-config-operator#1860

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jul 23, 2020
@trozet force-pushed the enabled_shared_gw branch 2 times, most recently from dbcb3c1 to ae664c2 on July 24, 2020 22:32
@openshift-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 25, 2020
@danwinship (Contributor) commented:

lgtm and I think it can be un-hold-ed now, but it needs to be rebased anyway

@trozet (Contributor, Author) commented Jul 29, 2020

> lgtm and I think it can be un-hold-ed now, but it needs to be rebased anyway

waiting for openshift/ovn-kubernetes#216 which has a shared gateway fix in it

Enable shared gateway mode for OVN (commit message unchanged from the PR description above; Signed-off-by: Tim Rozet <trozet@redhat.com>)
@openshift-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 29, 2020
@trozet (Contributor, Author) commented Jul 29, 2020

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jul 29, 2020
@trozet (Contributor, Author) commented Jul 29, 2020

/retest

@trozet (Contributor, Author) commented Jul 29, 2020

/assign @danwinship

Note GCP will fail until we have a working fix for shared gw mode and routes. AWS should pass.

@trozet (Contributor, Author) commented Jul 30, 2020

/retest

2 similar comments

@trozet (Contributor, Author) commented Jul 30, 2020

@stbenjam introspection is failing here:

level=error
level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed', last error was 'Failed to inspect hardware. Reason: unable to start inspection: 'System' object has no attribute 'set_system_boot_options''"
level=error
level=error msg="  on ../../tmp/openshift-install-198912340/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"
level=error msg="   1: resource \"ironic_node_v1\" \"openshift-master-host\" {"
level=error
level=error

I'm guessing this networking change may have some effect on metal IPI. We now move the physical interface into OVS during CoreOS startup. With Ironic introspection in OpenStack, we used to load the introspection image first, run introspection, then reboot with the real image. How is it being done in OCP?

@knobunc (Contributor) commented Jul 30, 2020

/approve

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 30, 2020
@knobunc (Contributor) commented Jul 30, 2020

Closing the loop on an earlier comment by @trozet. @stbenjam says that the metal failure is not due to this PR. They have a fix at openshift-metal3/dev-scripts#1076

@stbenjam (Member) commented:

Things should be better now. We've fixed the introspection error and moved back to IPv6.

/test e2e-metal-ipi

@stbenjam (Member) commented:

The current e2e-metal-ipi failure looks real:

level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-m46pw is in CrashLoopBackOff State\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-b8jdn is in CrashLoopBackOff State\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-hctv9 is in CrashLoopBackOff State\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - last change 2020-07-30T20:41:19Z"

Bootstrap log bundle is @ https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/727/pull-ci-openshift-cluster-network-operator-master-e2e-metal-ipi/1288926158213091328/artifacts/e2e-metal-ipi/baremetalds-devscripts-gather/
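
For anyone digging into the CrashLoopBackOff, a hedged example of pulling the previous logs from one of the crashing pods named in that error (standard oc/kubectl flags; nothing here is specific to this PR):

    oc -n openshift-ovn-kubernetes logs ovnkube-node-m46pw --previous --all-containers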

@trozet (Contributor, Author) commented Jul 31, 2020

/retest

@trozet (Contributor, Author) commented Jul 31, 2020

> The current e2e-metal-ipi failure looks real:
>
> level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-m46pw is in CrashLoopBackOff State\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-b8jdn is in CrashLoopBackOff State\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-hctv9 is in CrashLoopBackOff State\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - last change 2020-07-30T20:41:19Z"
>
> Bootstrap log bundle is @ https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/727/pull-ci-openshift-cluster-network-operator-master-e2e-metal-ipi/1288926158213091328/artifacts/e2e-metal-ipi/baremetalds-devscripts-gather/

@stbenjam For some reason I don't see an ovnkube log in must-gather-ipi...tar -> bootstrap/containers

@stbenjam (Member) commented Jul 31, 2020

> @stbenjam For some reason I don't see an ovnkube log in must-gather-ipi...tar -> bootstrap/containers

Yeah, sorry: the log bundle is what we were able to capture (log-bundle-ipi-ci-op-92dzpci2-0c056-20200730T200748.tar). It doesn't look like things came up enough to collect a whole must-gather, but we do have some logs from the masters.

E.g., from ./control-plane/fd2e:6f44:5dd8:c956::14/containers/ovn-controller-9723b213aad102045724dc6b668432db9343f16cbd4afd5d4ecdf098992dafa9.log:

2020-07-30T21:08:00Z|00123|patch|ERR|Dropped 1 log messages in last 11 seconds (most recently, 11 seconds ago) due to excessive rate
2020-07-30T21:08:00Z|00124|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-07-30T21:08:18Z|00125|patch|ERR|Dropped 3 log messages in last 18 seconds (most recently, 18 seconds ago) due to excessive rate
2020-07-30T21:08:18Z|00126|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'

Not sure if that's relevant, but there's a bunch more logs in there from the other containers as well.
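
For context on that error: ovn-controller resolves a localnet port's network name through the ovn-bridge-mappings external-id in the local Open vSwitch database, so "bridge not found for localnet port ... with network name 'locnet'" means no bridge was mapped for locnet on that node. A hedged sketch of what such a mapping looks like (br-local is an assumed bridge name, not confirmed by these logs):

    # Map the OVN network name "locnet" to a local OVS bridge (bridge name assumed).
    ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=locnet:br-local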

@trozet (Contributor, Author) commented Jul 31, 2020

@stbenjam thanks. Now I see the error is:
F0730 21:08:03.130489 58258 ovnkube.go:130] failed to get default gateway interface

and the cause is visible in ovs-configuration.service.log:

Jul 30 20:38:13 master-2.ostest.test.metalkube.org systemd[1]: Starting Configures OVS with proper host networking configuration...
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + iface=
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + counter=0
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + '[' 0 -lt 12 ']'
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: ++ jq -r '.[0].dev'
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: ++ ip -j route show default
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + iface=null
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + '[' -n null ']'
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + echo 'Default gateway interface found: null'
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: Default gateway interface found: null
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + break
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + '[' null = br-ex ']'
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: + '[' -z null ']'
Jul 30 20:38:13 master-2.ostest.test.metalkube.org systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Jul 30 20:38:13 master-2.ostest.test.metalkube.org configure-ovs.sh[1898]: /usr/local/bin/configure-ovs.sh: line 35: /sys/class/net/null/address: No such file or directory

Looks like we get back "null" from jq, so we only try once there. That's a bug, but regardless, why was there no default gateway on this host? The node should come up and DHCP automatically before NetworkManager-wait-online. Are there system journals in this tarball?
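
A minimal sketch of the fix implied here, assuming the loop structure from the trace above: treat jq's literal "null" (which it prints when the route list is empty) the same as an empty result, so the loop actually retries instead of breaking on the first pass.

    # Sketch only; variable names follow the configure-ovs.sh trace above.
    iface=""
    counter=0
    while [ $counter -lt 12 ]; do
      # `ip -j` emits JSON; jq prints the string "null" for a missing field,
      # which is non-empty and previously passed the `[ -n ... ]` check.
      iface=$(ip -j route show default | jq -r '.[0].dev')
      if [ -n "$iface" ] && [ "$iface" != "null" ]; then
        echo "Default gateway interface found: $iface"
        break
      fi
      counter=$((counter+1))
      sleep 5
    done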

@openshift-ci-robot (Contributor) commented:

@knobunc: Overrode contexts on behalf of knobunc: ci/prow/e2e-gcp-ovn

In response to this:

> /override ci/prow/e2e-gcp-ovn


@trozet (Contributor, Author) commented Aug 2, 2020

@stbenjam FYI we do not have a fix yet for IPv6 with shared gw mode. @Billy99 is working on it. I would recommend reverting to IPv4 for a few days until he has that fixed (once this merges).

@trozet (Contributor, Author) commented Aug 2, 2020

/test e2e-aws-ovn

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

17 similar comments

@openshift-ci-robot (Contributor) commented Aug 2, 2020

@trozet: The following tests failed, say /retest to rerun all failed tests:

Test name              Commit   Details  Rerun command
ci/prow/e2e-metal-ipi  dbe8527  link     /test e2e-metal-ipi
ci/prow/e2e-azure      dbe8527  link     /test e2e-azure
ci/prow/e2e-aws-ovn    dbe8527  link     /test e2e-aws-ovn

Full PR test history. Your PR dashboard.


@trozet (Contributor, Author) commented Aug 2, 2020

AWS had 1 failure; GCP only failed the prometheus test on the last run @knobunc.

It would still be nice to see the OVN step registry or AWS jobs fully pass.

@knobunc (Contributor) commented Aug 2, 2020

/override ci/prow/e2e-gcp-ovn
It's failing a prometheus test. There seems to be a bug with how prometheus measures availability. Tracking with bug https://bugzilla.redhat.com/show_bug.cgi?id=1862806.

@openshift-ci-robot (Contributor) commented:

@knobunc: Overrode contexts on behalf of knobunc: ci/prow/e2e-gcp-ovn

In response to this:

> /override ci/prow/e2e-gcp-ovn
> It's failing a prometheus test. There seems to be a bug with how prometheus measures availability. Tracking with bug https://bugzilla.redhat.com/show_bug.cgi?id=1862806.


@openshift-merge-robot merged commit ed9ccd7 into openshift:master on Aug 2, 2020
@@ -537,7 +539,8 @@ spec:
        hostPath:
          path: /var/lib/ovn/data
      - name: run-openvswitch
-       emptyDir: {}
+       hostPath:
A Contributor commented on the diff above:

@trozet can you add a type: Directory here so we never accidentally create this?
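
A sketch of what the requested change might look like. The hostPath path below is an assumption (the hunk above truncates before it); the key point is the type: Directory field, which makes the kubelet fail the mount when the directory does not already exist instead of ever creating it:

    - name: run-openvswitch
      hostPath:
        path: /var/run/openvswitch  # assumed path; not shown in the hunk above
        type: Directory             # mount fails if missing; never auto-created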

@stbenjam (Member) commented Aug 3, 2020

> @stbenjam FYI we do not have a fix yet for IPv6 with shared gw mode. @Billy99 is working on it. I would recommend reverting to IPv4 for a few days until he has that fixed (once this merges).

Thanks -- is there a BZ tracking the IPv6 bug?

@danwinship (Contributor) commented:

shared gateway IPv6 support: ovn-org/ovn-kubernetes#1462

@trozet (Contributor, Author) commented Aug 10, 2020

> > @stbenjam FYI we do not have a fix yet for IPv6 with shared gw mode. @Billy99 is working on it. I would recommend reverting to IPv4 for a few days until he has that fixed (once this merges).
>
> Thanks -- is there a BZ tracking the IPv6 bug?

https://bugzilla.redhat.com/show_bug.cgi?id=1866464

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
9 participants