
update from 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-02-18-033438 failing #1527

Closed
nate-duke opened this issue Feb 28, 2023 · 6 comments

nate-duke commented Feb 28, 2023

Describe the bug

While updating from 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-02-18-033438, the control plane degrades because the first control plane member to be upgraded fails to come back.

It seems that two nodes were updated; one of them is a control plane member, which is what caused the upgrade to stop progressing.

During the update (after all of the operators had been updated), the vsphere-problem-detector logged the following:

I0228 11:39:51.577708 1 vsphere_check.go:236] CheckAccountPermissions failed: missing privileges for vcenter: Cns.Searchable, InventoryService.Tagging.DeleteCategory, Sessions.ValidateSession

After seeing that, I had our VMware infrastructure team add those privileges to the role our cluster service account uses in vSphere, and that issue cleared up. However, this all happened while the two nodes (one infra/worker and one master) were already in the state they're in now and in the must-gather.
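
For anyone else hitting the privilege error: I think the govc equivalent of what our VMware team did is roughly the following (a sketch only; ROLE_NAME and the GOVC_* values are placeholders for whatever role and credentials your cluster's vCenter service account actually uses):

$ export GOVC_URL='vcenter.example.com' GOVC_USERNAME='admin@vsphere.local' GOVC_PASSWORD='...'
$ govc role.ls ROLE_NAME    # list the privileges currently on the role
$ govc role.update -a ROLE_NAME Cns.Searchable InventoryService.Tagging.DeleteCategory Sessions.ValidateSession    # -a appends instead of replacing, if I read the govc help right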

Both systems are reachable should diagnostics outside of the must-gather be helpful:

❯ ssh core@10.138.5.216 rpm-ostree status # os-infra-dev-01-zh59r 
State: idle
Deployments:
● ostree-unverified-registry:quay.io/openshift/okd-content@sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8
                   Digest: sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8
                  Version: 37.20230110.3.1 (2023-02-28T11:39:57Z)

  pivot://quay.io/openshift/okd-content@sha256:bc4fe370cd76415d045b6cc2cf08e5f696ece912661cfe4370910020be9fe0b6
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202301141513-0 (2023-01-14T15:17:08Z)

❯ ssh core@10.138.5.243 rpm-ostree status # dev-nkjpp-master-1
State: idle
Deployments:
● ostree-unverified-registry:quay.io/openshift/okd-content@sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8
                   Digest: sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8
                  Version: 37.20230110.3.1 (2023-02-28T11:41:09Z)

  pivot://quay.io/openshift/okd-content@sha256:bc4fe370cd76415d045b6cc2cf08e5f696ece912661cfe4370910020be9fe0b6
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202301141513-0 (2023-01-14T15:17:08Z)
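
For completeness, the MCO's view of the stuck rollout can be checked with standard oc commands like these (the second line just greps the usual machineconfiguration.openshift.io annotations on the node):

$ oc get machineconfigpool
$ oc describe node dev-nkjpp-master-1 | grep machineconfiguration.openshift.io/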

Version

4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-02-18-033438
vSphere IPI

How reproducible

1 for 1 right now. We'll try our other clusters if we can keep this one from tipping over.

Log bundle

must-gather.local.8365916698519417107.zip

nate-duke changed the title from "updatefrom 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-02-18-033438 failing" to "update from 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-02-18-033438 failing" on Feb 28, 2023
@melledouwsma

Kubelet seems to be unavailable on both nodes, so the must-gather does not contain many logs from them. Does sudo systemctl status kubelet report anything useful on the nodes?

$ omg get nodes os-infra-dev-01-zh59r -o json | jq .status.conditions
[
  {
    "lastHeartbeatTime": "2023-02-28T11:39:56Z",
    "lastTransitionTime": "2023-02-28T11:40:37Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "MemoryPressure"
  },
  {
    "lastHeartbeatTime": "2023-02-28T11:39:56Z",
    "lastTransitionTime": "2023-02-28T11:40:37Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "DiskPressure"
  },
  {
    "lastHeartbeatTime": "2023-02-28T11:39:56Z",
    "lastTransitionTime": "2023-02-28T11:40:37Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "PIDPressure"
  },
  {
    "lastHeartbeatTime": "2023-02-28T11:39:56Z",
    "lastTransitionTime": "2023-02-28T11:40:37Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "Ready"
  }
]
$ omg get nodes dev-nkjpp-master-1 -o json | jq .status.conditions
[
  {
    "lastHeartbeatTime": "2023-02-28T11:40:32Z",
    "lastTransitionTime": "2023-02-28T11:42:12Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "MemoryPressure"
  },
  {
    "lastHeartbeatTime": "2023-02-28T11:40:32Z",
    "lastTransitionTime": "2023-02-28T11:42:12Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "DiskPressure"
  },
  {
    "lastHeartbeatTime": "2023-02-28T11:40:32Z",
    "lastTransitionTime": "2023-02-28T11:42:12Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "PIDPressure"
  },
  {
    "lastHeartbeatTime": "2023-02-28T11:40:32Z",
    "lastTransitionTime": "2023-02-28T11:42:12Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "Ready"
  }
]
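
If the API is still reachable, the live-cluster equivalent of the check above would be roughly:

$ oc get node os-infra-dev-01-zh59r -o jsonpath='{.status.conditions[?(@.type=="Ready")]}{"\n"}'
$ oc get node dev-nkjpp-master-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")]}{"\n"}'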

nate-duke commented Mar 1, 2023

Thanks for taking a look, @melledouwsma.

Yeah, the kubelet isn't running because /run/resolv-prepender-kni-conf-done is missing. That file appears to be managed by /etc/NetworkManager/dispatcher.d/30-resolv-prepender, which I'm digging into this morning to try to understand where the root of this issue lies.

[core@dev-nkjpp-master-1 ~]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─01-kubens.conf, 10-mco-default-madv.conf, 10-mco-on-prem-wait-resolv.conf, 20-logging.conf, 20-nodenet.conf
     Active: activating (auto-restart) (Result: exit-code) since Wed 2023-03-01 11:30:40 UTC; 4s ago
    Process: 2578 ExecCondition=/bin/bash -c test -f /run/resolv-prepender-kni-conf-done || exit 255 (code=exited, status=255/EXCEPTION)
        CPU: 2ms
        
[core@dev-nkjpp-master-1 ~]$ stat /run/resolv-prepender-kni-conf-done
stat: cannot statx '/run/resolv-prepender-kni-conf-done': No such file or directory

[core@dev-nkjpp-master-1 ~]$ systemctl status NetworkManager
● NetworkManager.service - Network Manager
     Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/NetworkManager.service.d
             └─NetworkManager-ovs.conf
     Active: active (running) since Wed 2023-03-01 11:16:05 UTC; 16min ago
       Docs: man:NetworkManager(8)
   Main PID: 1052 (NetworkManager)
      Tasks: 3 (limit: 38420)
     Memory: 8.5M
        CPU: 519ms
     CGroup: /system.slice/NetworkManager.service
             └─1052 /usr/sbin/NetworkManager --no-daemon

vrutkovs commented Mar 1, 2023

Check the logs on the node for "nm-dispatcher" - those would include the output from 30-resolv-prepender.
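
For example, something along these lines should surface the dispatcher output:

$ journalctl -b _COMM=nm-dispatcher
$ journalctl -b -u NetworkManager-dispatcher.service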

nate-duke commented Mar 1, 2023

Looks like maybe some more SELinux gremlins?

Mar 01 11:16:08 dev-nkjpp-master-1 audit[1063]: AVC avc:  denied  { read } for  pid=1063 comm="nm-dispatcher" name="dispatcher.d" dev="sda4" ino=4207358 scontext=system_u:system_r:NetworkManager_dispatcher_t:s0 tcontext=system_u:object_r:NetworkManager_initrc_exec_t:s0 tclass=dir permissive=0
Mar 01 11:16:08 dev-nkjpp-master-1 audit[1063]: SYSCALL arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=55fbd8e47ae0 a2=90800 a3=0 items=0 ppid=1 pid=1063 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="nm-dispatcher" exe="/usr/libexec/nm-dispatcher" subj=system_u:system_r:NetworkManager_dispatcher_t:s0 key=(null)
Mar 01 11:16:08 dev-nkjpp-master-1 audit: PROCTITLE proctitle="/usr/libexec/nm-dispatcher"
Mar 01 11:16:08 dev-nkjpp-master-1 nm-dispatcher[1063]: req:22 'connectivity-change': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied

Applying the workaround from #1425

[root@dev-nkjpp-master-1 ~]# restorecon -R -v /etc/NetworkManager/dispatcher.d/
Relabeled /etc/NetworkManager/dispatcher.d from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/pre-up.d from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/pre-up.d/10-ofport-request.sh from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/30-resolv-prepender from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
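
To confirm the relabel actually unblocks things, my rough plan is the following (standard systemd/coreutils commands; restarting NetworkManager should re-fire the dispatcher scripts and recreate the flag file that kubelet's ExecCondition waits for):

[root@dev-nkjpp-master-1 ~]# ls -Zd /etc/NetworkManager/dispatcher.d
[root@dev-nkjpp-master-1 ~]# systemctl restart NetworkManager
[root@dev-nkjpp-master-1 ~]# stat /run/resolv-prepender-kni-conf-done
[root@dev-nkjpp-master-1 ~]# systemctl status kubelet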

vrutkovs commented Mar 1, 2023

Looks like a dupe of #1475

@vrutkovs closed this as not planned on Mar 1, 2023
@nate-duke

Yeah, I think you may be right. I swear I tried this yesterday. Will open another issue if I encounter another problem.

Apologies for the oversight on my part and thank you very much for the eyes and brains @vrutkovs and @melledouwsma.
