Start openvswitch and ovsdb-server when network is ovn/ovs #1636
Conversation
force-pushed from c58f5a5 to b06275a
force-pushed from 569f304 to 9a50347
ci/prow/e2e-gcp-op failure looks legit:
Interesting, looks like the units were still disabled:
Does bootstrap need to take network config as an input?
Never mind, it's using the default path to read it, so we have the same values the other operators have.
@cgwalters ptal as well
force-pushed from 9a50347 to b2333c8
/hold
Now that this is correctly failing, @mccv1r0 should be able to update his PR so the OVS daemonset detects whether the host OVS is running and, if so, poll-loops for as long as it stays up (basically: if host OVS is running, delegate to it). That will allow upgrades and downgrades. He can test this PR together with his in cluster-bot. Once he has it working, we can merge his, then this one should go green (we can also verify both forward and backward upgrades: PR1+PR2 -> master, master -> PR1+PR2, PR1+PR2 -> PR1+PR2). I'm sure we'll catch some other scenarios.
force-pushed from b2333c8 to d0f4e4b
@runcom @smarterclayton This PR seems to have rhcos start OVS only on master nodes; on worker nodes rhcos doesn't seem to be starting OVS.
force-pushed from d0f4e4b to 07518f5
A few notes from a local AWS cluster run using a custom payload with PR #1636 + openshift/cluster-network-operator#477 and setting
Do you have the artifacts? It would help to compare the OVN/OVS logs to what I've seen. Those logs added debug journalctl -xeu calls. To be safe, you might also need openshift/ovn-kubernetes#149 to deal with a permission issue, in addition to PR #1636 + openshift/cluster-network-operator#477.
I have destroyed the test cluster, but I do have the log-bundle and journal logs from the bootstrap node and one of the master nodes at https://sinnykumari.fedorapeople.org/ovn/ . I hope it helps.
ah ok, will include it next time
custom image build fails with PR openshift/ovn-kubernetes#149
The rpm repos needed cannot be reached. Can you use GCP for what you need to do, just in case the problem accessing the repos is platform specific?
Did a couple of runs with a custom payload including PRs openshift/ovn-kubernetes#149, openshift/cluster-network-operator#477 and #1636. Cluster install fails as multiple operators fail to update. Worker nodes did come up on both AWS and GCP. Have saved must-gather logs from the AWS and GCP cluster installs. @mccv1r0 maybe you can find something relevant which could be related to the ovn network switch.
Looking at the rendered config once bootstrapping is done, from one of the failed test clusters:
It looks like there is a mismatch between these services being enabled during bootstrapping and disabled after bootstrapping, and that's why the rendered-config mismatch happens on master nodes.
MCD log from one of the master nodes:
If we look at the mcs-machine-config-content.json content on one of the master nodes:
I wonder if .NetworkType is getting evaluated correctly 🤔
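For context, here is a minimal self-contained sketch of why an unset NetworkType would produce exactly this symptom. In Go's text/template an eq comparison against an empty string simply fails, so the conditional falls through to the disabled branch and the unit renders as disabled. The template string and struct below are illustrative, not the real MCO template:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderUnitState renders a toy template that enables a unit only when
// NetworkType is "OVNKubernetes" (illustrative, not the MCO template).
func renderUnitState(networkType string) string {
	tmpl := template.Must(template.New("unit").Parse(
		`{{if eq .NetworkType "OVNKubernetes"}}enabled{{else}}disabled{{end}}`))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, struct{ NetworkType string }{networkType}); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Println(renderUnitState(""))              // disabled: empty value falls through
	fmt.Println(renderUnitState("OVNKubernetes")) // enabled
}
```

That would explain enabled units during bootstrap (where the value is populated) but disabled units in the final rendered config (where it ends up empty).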
So, I did another test run with a few modifications (see https://github.com/openshift/machine-config-operator/pull/1786/files) in which I enabled openvswitch.service and ovsdb-server.service. Cluster came up successfully with NetworkType: OVNKubernetes 🎉 Can confirm that NetworkType gets set correctly. Log from the m-c-o pod. @runcom Looks to me that
uhm, that's surprising, masters evaluate that properly, right? This issue seems to be specific to workers somehow 🤔
No, both the final rendered master and worker configs evaluate to false, but during initial bootstrap the master config seems to get evaluated to true. See #1636 (comment).
force-pushed from 07518f5 to 2c94b50
Pushed a fix to make sure NetworkType doesn't end up empty and the templates evaluate correctly - this should fix the weird things we're seeing.
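A hypothetical sketch of that kind of fix (the function name and fallback value are assumptions for illustration, not the actual patch): resolve NetworkType before template evaluation so the conditionals never see an empty string.

```go
package main

import "fmt"

// effectiveNetworkType guarantees the templates never see an empty
// NetworkType by falling back to a default. "OpenShiftSDN" is assumed
// here as the fallback purely for illustration.
func effectiveNetworkType(fromConfig string) string {
	if fromConfig == "" {
		return "OpenShiftSDN" // assumed default, not necessarily the real one
	}
	return fromConfig
}

func main() {
	fmt.Println(effectiveNetworkType(""))              // OpenShiftSDN
	fmt.Println(effectiveNetworkType("OVNKubernetes")) // OVNKubernetes
}
```

With a guaranteed non-empty value, bootstrap and post-bootstrap renders agree, so the rendered-config mismatch on masters should go away.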
force-pushed from 2c94b50 to 567566e
Signed-off-by: Antonio Murdaca <runcom@linux.com>
force-pushed from 567566e to be883a8
@runcom: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
NetworkType is getting correctly set in MCO now! Tested the latest content of this PR with openshift/cluster-network-operator#477 and openshift/ovn-kubernetes#149; the cluster came up perfectly fine in both cases, with networkType set to OpenShiftSDN (the default) and to OVNKubernetes.
awesome, I guess we can merge this PR then /skip
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: runcom, sinnykumari The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
shall we cancel the hold now?
Hang on. Shouldn't we merge the changes in to CNO first so that we don't fight with system openvswitch? It looks like they're fighting back and forth, and it only happens to work because we |
Correct. Can this be rolled back?
Taking the discussion as to revert or not elsewhere for a moment with Casey and Mike :) |
Seems like this merge is causing CNO e2e-gcp-ovn job failures; all the jobs are failing at the moment.
Ensures changes to MCO (like openshift/machine-config-operator#1636) don't break ovn-kubernetes, Windows, hybrid overlay, etc.
Needed by CNO for now; the aim is that enablement of those two services will eventually be owned by CNO itself. Left some comments for things I'm not aware of.
Signed-off-by: Antonio Murdaca runcom@linux.com