
Upgrade from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 results in failed networking #1169

Closed
abaxo opened this issue Mar 20, 2022 · 15 comments

Comments

@abaxo

abaxo commented Mar 20, 2022

Describe the bug
After an upgrade from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213, existing nodes failed to update and ended up in dracut emergency mode with no ability to recover back to an older version. I am not certain that this event is related to the network issues that I am seeing in the cluster, but I am noting it in case it is relevant. This impacted 6 of 8 worker nodes; the 2 that were not affected were held back by the upgrade process.

The upgrade of the cluster itself, as far as the control plane and cluster operators go, was successful, including master node upgrades and reboots, so I provisioned 6 new worker nodes in the same machine set to replace the failed nodes gracefully, on the assumption that if the config is somehow broken on the old nodes, fresh configuration would be applied to fresh nodes. The new nodes came up, show as provisioned in the cluster, and join the cluster, but they do not enter a Ready state. It appears that the SDN is not coming up. The cluster is configured with OpenShiftSDN. The Multus pods have started on the new nodes, and the multus-additional-plugins container has successfully copied the plugins to /host/opt/cni/bin/. The SDN pod starts but fails during node setup because it cannot connect to OVS (logs below).
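A rough way to see which workers are stuck and whether the SDN and Multus pods are scheduled on them is something like the following (the node name is one of the affected workers; adjust as needed):

# Node readiness and the network cluster operator status
oc get nodes -o wide
oc get clusteroperator network

# SDN and Multus pods on one of the affected workers
oc -n openshift-sdn get pods -o wide --field-selector spec.nodeName=green-56w99-worker-s9c8w
oc -n openshift-multus get pods -o wide --field-selector spec.nodeName=green-56w99-worker-s9c8w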

Version
4.10.0-0.okd-2022-03-07-131213 - IPI on vSphere

How reproducible
Every new node provisioned by the cloud controller has this issue; I have provisioned 6 nodes so far.

Log bundle
OpenShift SDN container logs:

I0320 20:50:22.254658 1050350 cmd.go:128] Reading proxy configuration from /config/kube-proxy-config.yaml
I0320 20:50:22.258717 1050350 feature_gate.go:245] feature gates: &{map[]}
I0320 20:50:22.258812 1050350 cmd.go:233] Watching config file /config/kube-proxy-config.yaml for changes
I0320 20:50:22.258865 1050350 cmd.go:233] Watching config file /config/..2022_03_19_23_13_19.1555170389/kube-proxy-config.yaml for changes
I0320 20:50:22.324054 1050350 node.go:159] Initializing SDN node "green-56w99-worker-s9c8w" (10.50.108.213) of type "redhat/openshift-ovs-networkpolicy"
I0320 20:50:22.324644 1050350 cmd.go:174] Starting node networking (v0.0.0-alpha.0-476-g9d0cd6e)
I0320 20:50:22.324669 1050350 node.go:370] Starting openshift-sdn network plugin
I0320 20:50:22.718695 1050350 healthcheck_ovs.go:18] Starting OVS health check
F0320 20:50:22.719093 1050350 cmd.go:118] Failed to start sdn: node SDN setup failed: Error connecting to OVS: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
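The missing /var/run/openvswitch/db.sock suggests OVS never came up on the host. A rough way to confirm that from a node debug shell, assuming OVS runs as the host systemd units ovsdb-server and ovs-vswitchd on this release:

# Open a shell on the affected node and inspect the host OVS services
oc debug node/green-56w99-worker-s9c8w
chroot /host
systemctl status ovsdb-server ovs-vswitchd
ls -l /var/run/openvswitch/
journalctl -u ovsdb-server -u ovs-vswitchd --no-pager | tail -n 50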

Multus container logs:


Successfully copied files in /usr/src/multus-cni/rhel8/bin/ to /host/opt/cni/bin/
2022-03-20T20:48:05+00:00 WARN: {unknown parameter "-"}
2022-03-20T20:48:05+00:00 Entrypoint skipped copying Multus binary.
2022-03-20T20:48:05+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2022-03-20T20:48:05+00:00 Attempting to find master plugin configuration, attempt 0
2022-03-20T20:48:10+00:00 Attempting to find master plugin configuration, attempt 5
2022-03-20T20:48:15+00:00 Attempting to find master plugin configuration, attempt 10
2022-03-20T20:48:20+00:00 Attempting to find master plugin configuration, attempt 15
2022-03-20T20:48:25+00:00 Attempting to find master plugin configuration, attempt 20

Node Logs:

Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-gmt9b" podUID=48373302-e297-494e-beaf-7e69ba23bda7
Mar 20 20:19:37.730674 green-56w99-worker-s9c8w hyperkube[1781]: E0320 20:19:37.730502    1781 pod_workers.go:949] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-multus/network-metrics-daemon-kp6mx" podUID=ca970b1b-034f-4326-818f-550c9cf8c7c6
Mar 20 20:19:38.550000 green-56w99-worker-s9c8w audit[1016796]: AVC avc:  denied  { ioctl } for  pid=1016796 comm="iptables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Mar 20 20:19:38.556041 green-56w99-worker-s9c8w kernel: kauditd_printk_skb: 2 callbacks suppressed
Mar 20 20:19:38.556486 green-56w99-worker-s9c8w kernel: audit: type=1400 audit(1647807578.550:8311): avc:  denied  { ioctl } for  pid=1016796 comm="iptables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Mar 20 20:19:38.550000 green-56w99-worker-s9c8w audit[1016796]: SYSCALL arch=c000003e syscall=59 success=yes exit=0 a0=c00163a2e8 a1=c000ba74f0 a2=c001c75e00 a3=8 items=0 ppid=1781 pid=1016796 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Mar 20 20:19:38.571863 green-56w99-worker-s9c8w kernel: audit: type=1300 audit(1647807578.550:8311): arch=c000003e syscall=59 success=yes exit=0 a0=c00163a2e8 a1=c000ba74f0 a2=c001c75e00 a3=8 items=0 ppid=1781 pid=1016796 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Mar 20 20:19:38.572150 green-56w99-worker-s9c8w kernel: audit: type=1309 audit(1647807578.550:8311): argc=9 a0="iptables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Mar 20 20:19:38.550000 green-56w99-worker-s9c8w audit: EXECVE argc=9 a0="iptables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Mar 20 20:19:38.550000 green-56w99-worker-s9c8w audit: PROCTITLE proctitle=69707461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
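To double-check the two symptoms above (no CNI config written and the iptables AVC denial) directly on the node, something like this from a debug shell should work; ausearch is assumed to be present on the host:

# From oc debug node/<node>, then chroot /host
ls -l /etc/kubernetes/cni/net.d/ /var/run/multus/cni/net.d/
getenforce
ausearch -m avc -ts recent | tail -n 20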

@fortinj66
Contributor

It would be helpful if you are able to create a must-gather and post it somewhere accessible... They can be fairly large...

@abaxo
Author

abaxo commented Mar 20, 2022

Hi @fortinj66, of course - I'm on an iPad at the moment, trying to figure out the best way to do it.

@abaxo
Author

abaxo commented Mar 21, 2022

must-gather - The password for this is openshift

@abaxo
Author

abaxo commented Mar 21, 2022

Just a thought:

  • green-56w99-worker-qlqqx is a good example of a worker with the network issue, as is green-56w99-worker-nc647. These are both nodes newly provisioned after the upgrade.
  • green-56w99-worker-26wjr is an example of a node that failed to reboot during the OS upgrade phase.
  • green-56w99-worker-v9227 is an example of a worker that has not had an OS upgrade and is working, as is green-56w99-worker-rvt5m.
  • All 3 masters are working as expected.

@abaxo
Author

abaxo commented Mar 22, 2022

Fun test - I manually placed the same config into /host/var/run/multus/cni/net.d/80-openshift-network.conf, copied from a working node. Multus immediately recognised that the config existed, but the file was then wiped again (log below; a rough sketch of the test follows the log). Can you point me towards which container would ordinarily be managing the creation of that config file? Is it the Multus additional plugins container?

Successfully copied files in /usr/src/multus-cni/rhel8/bin/ to /host/opt/cni/bin/
2022-03-22T22:28:52+00:00 WARN: {unknown parameter "-"}
2022-03-22T22:28:52+00:00 Entrypoint skipped copying Multus binary.
2022-03-22T22:28:52+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2022-03-22T22:28:52+00:00 Attempting to find master plugin configuration, attempt 0
2022-03-22T22:28:57+00:00 Attempting to find master plugin configuration, attempt 5
2022-03-22T22:29:03+00:00 Attempting to find master plugin configuration, attempt 10
2022-03-22T22:29:08+00:00 Attempting to find master plugin configuration, attempt 15
2022-03-22T22:29:13+00:00 Attempting to find master plugin configuration, attempt 20
2022-03-22T22:29:18+00:00 Attempting to find master plugin configuration, attempt 25
2022-03-22T22:29:23+00:00 Attempting to find master plugin configuration, attempt 30
2022-03-22T22:29:28+00:00 Attempting to find master plugin configuration, attempt 35
2022-03-22T22:29:33+00:00 Attempting to find master plugin configuration, attempt 40
2022-03-22T22:29:38+00:00 Attempting to find master plugin configuration, attempt 45
2022-03-22T22:29:43+00:00 Attempting to find master plugin configuration, attempt 50
2022-03-22T22:29:48+00:00 Attempting to find master plugin configuration, attempt 55
2022-03-22T22:29:53+00:00 Attempting to find master plugin configuration, attempt 60
2022-03-22T22:29:59+00:00 Attempting to find master plugin configuration, attempt 65
2022-03-22T22:30:04+00:00 Attempting to find master plugin configuration, attempt 70
2022-03-22T22:30:09+00:00 Attempting to find master plugin configuration, attempt 75
2022-03-22T22:30:14+00:00 Attempting to find master plugin configuration, attempt 80
2022-03-22T22:30:19+00:00 Attempting to find master plugin configuration, attempt 85
2022-03-22T22:30:24+00:00 Attempting to find master plugin configuration, attempt 90
2022-03-22T22:30:26+00:00 Using MASTER_PLUGIN: 80-openshift-network.conf
2022-03-22T22:30:26+00:00 Nested capabilities string:
2022-03-22T22:30:26+00:00 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2022-03-22T22:30:26+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/80-openshift-network.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0.3.1", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
2022-03-22T22:30:26+00:00 Entering watch loop...
2022-03-22T22:31:05+00:00 Master plugin @ /host/var/run/multus/cni/net.d/80-openshift-network.conf has been deleted. Allowing 45 seconds for its restoration...
2022-03-22T22:31:50+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2022-03-22T22:31:51+00:00 Attempting to find master plugin configuration, attempt 0
2022-03-22T22:31:56+00:00 Attempting to find master plugin configuration, attempt 5
2022-03-22T22:32:01+00:00 Attempting to find master plugin configuration, attempt 10
2022-03-22T22:32:06+00:00 Attempting to find master plugin configuration, attempt 15
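For reference, the manual test above was roughly the following; node names are from this cluster, and the exact copy mechanism is not important, the point was just to make the file appear on the broken node:

# Dump the SDN-generated CNI config from a working node to a local file
oc debug node/green-56w99-worker-v9227 -- chroot /host cat /var/run/multus/cni/net.d/80-openshift-network.conf > 80-openshift-network.conf
# Then, in a debug shell on the broken node (chroot /host), write that content to
# /var/run/multus/cni/net.d/80-openshift-network.conf and watch the Multus entrypoint log above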

@abaxo
Author

abaxo commented Mar 28, 2022

I've been able to establish 2 things:

TLDR:

  1. Turn off VMware HA VM monitoring; it hard resets VMs during the update, which causes OS corruption.
  2. Update the template that workers use to provision to https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220227.3.0/x86_64/fedora-coreos-35.20220227.3.0-vmware.x86_64.ova

Slightly longer explanation:

  1. The cause of the corrupt disks lies with the OS update (correctly) making VMware Tools stop for a while. My VMware HA configuration is set to reset machines which don't respond for more than a couple of minutes. I've been able to repeat this behaviour, so the docs may need to reflect that (if they don't already; I couldn't see it anywhere but could easily have missed it).

  2. As far as the upgrade goes, it seems to have an issue pivoting from the OS content shipped with 4.9 to 4.10. I have been able to restore service to my cluster by grabbing the latest stable FCOS OVA (35.20220227.3.0) and using that as the machine template, and this has worked as expected in applying the correct configuration (a rough sketch is below). I'm not sure how repeatable this is - whether it is something specific to my environment (maybe I have messed with a config file that has caused this behaviour) or whether it is a common issue on the path from 4.9 to 4.10. I am also surprised that the masters did not have the same issue, which gives me some doubt that the issue is part of the upgrade process; it seems more likely to be environmental.
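A rough sketch of the template swap, assuming the govc CLI is configured (GOVC_URL and friends) and that the worker machineset references the vSphere template by name; the machineset name below is illustrative:

# Import the stable FCOS OVA into vSphere
curl -LO https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220227.3.0/x86_64/fedora-coreos-35.20220227.3.0-vmware.x86_64.ova
govc import.ova -name fedora-coreos-35.20220227.3.0 fedora-coreos-35.20220227.3.0-vmware.x86_64.ova

# Point the worker machineset at the new template; newly provisioned Machines clone from it
oc -n openshift-machine-api patch machineset green-56w99-worker \
  --type merge -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"template":"fedora-coreos-35.20220227.3.0"}}}}}}'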

@abaxo
Author

abaxo commented Mar 28, 2022

After a short time running with this configuration, I'm still seeing issues with the freshly provisioned workers, where the nodes are flapping between Ready and NotReady, so this doesn't totally solve the issue - just a step forward.

@rvanderp3
Contributor

Hi @abaxo, would you be able to grab the worker node logs and an updated must-gather? I'd like to see how the failure mechanism has changed.

@abaxo
Author

abaxo commented Mar 31, 2022

Hi @rvanderp3 - the latest must-gather is located at must-gather; the password is openshift.

@abaxo
Author

abaxo commented Mar 31, 2022

I tried replacing the template in my environment with the latest stable RHCOS for 4.10 and ran into the issue I had earlier in the process, where the SDN didn't come up, but I could also see that the cluster was still pivoting from that template to the defined FCOS machine content. Not really an improvement, but it could be useful information.

It is probably worth noting as well that this cluster has been running for over 2 years - I think I initially deployed it at 4.1 - so it could be representative of a customer journey with a long-running cluster.

@abaxo
Author

abaxo commented Apr 4, 2022

@rvanderp3 - I have just applied the user/group config as per #1168 (comment) and it appears to have done the trick, allowing me to successfully restart the SDN service on the host and the SDN pod.

Weirdly, I've only needed to do it on nodes that are flapping; it doesn't seem to be all nodes. Remember that these are all fresh nodes, provisioned from a base template of FCOS 35.20220227.3.0 (stable). I wonder if there is still some config being pushed out from the upgrade which changes permissions and causes a restart of the SDN.
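For completeness, the restart amounted to roughly the following; the host service here is assumed to be the OVS units (ovsdb-server/ovs-vswitchd), and the app=sdn label is assumed to be what the SDN daemonset pods carry:

# On the node (oc debug node/<name>, then chroot /host): restart the host OVS services
systemctl restart ovsdb-server ovs-vswitchd

# Then recreate the SDN pod for that node so it re-runs node setup
oc -n openshift-sdn delete pod -l app=sdn --field-selector spec.nodeName=green-56w99-worker-qlqqx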

@abaxo
Author

abaxo commented Apr 5, 2022

I am leaning further towards this issue being environmental - I have just discovered an issue with garbage collection, caused by a broken CRD, which is absolutely not going to have helped here.

@abaxo
Author

abaxo commented Apr 5, 2022

Okay, so having spent a few hours cleaning up and remediating the garbage collection, I had the cluster in a good enough state to progress with swapping the vSphere CSI driver from the VMware-provided driver to the built-in OpenShift version. As part of that process I did a full cluster stop and start, primarily to clear the disk attachments cache. This reboot has caused the SDN errors to recur, and has sent every worker except the one I applied the permission fix to into dracut emergency mode.

VMware shows that the guest OS halted the CPU, so none of those workers are bootable. Oh dear.

I think, but am not sure how to troubleshoot it, that I am running into coreos/fedora-coreos-tracker#784 or something similar. I'm not certain it is the same issue, because that one looks to have been fixed in an earlier CoreOS version.

@abaxo
Author

abaxo commented Apr 23, 2022

I've updated the machine config operator configmap machine-config-osimageurl in the openshift-machine-config-operator namespace, which specifies the OS image to use for the cluster, and set it to the previously 'good' OS image content (quay.io/openshift/okd-content@sha256:d9d805d53b5ac3836a893c5ad6b8e31405578add5eea9da745b50ec950114f18); this is the version from build quay.io/openshift/okd:4.9.0-0.okd-2022-02-12-140851.

This seems to have solved my SDN and OS instability issues, at least until it's time to try a fixed OS version.
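Something like the following reproduces that change (a sketch; the data key name osImageURL is assumed, and pinning the image this way is only a stop-gap until a fixed OS ships):

oc -n openshift-machine-config-operator patch configmap machine-config-osimageurl \
  --type merge -p '{"data":{"osImageURL":"quay.io/openshift/okd-content@sha256:d9d805d53b5ac3836a893c5ad6b8e31405578add5eea9da745b50ec950114f18"}}'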

@markusdd

Do you have SELinux modifications? The AVC messages above could indicate this. We had to make some, and with the newer upgrades we suddenly get failures on the nodes; this is because CoreOS does not handle this gracefully yet.
coreos/fedora-coreos-tracker#701

Run sudo ostree admin config-diff | grep selinux to see if any SELinux files are modified.
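If the diff does show modified SELinux files, one possible way back to the shipped defaults on an ostree-based host is to restore them from /usr/etc, where ostree keeps the pristine copies - a sketch using /etc/selinux/config as an example:

# Compare the modified file against the ostree default, then restore it if appropriate
diff /etc/selinux/config /usr/etc/selinux/config
sudo cp /usr/etc/selinux/config /etc/selinux/config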
