
[OKD FCOS 4.15] OKD master unable to start up, boot logs location to see errors with oc #1995

Closed
parseltongued opened this issue Aug 4, 2024 · 4 comments

Comments

@parseltongued

Describe the bug
I have a 3-node OKD FCOS 4.15 air-gapped cluster with an on-prem Quay registry. On cluster restart, port 6443 shows green on HAProxy, but machine-config port 22623 stays red and the console doesn't come up. I can't query anything with oc from my bastion server, but I can SSH into each master. How can I check the boot-up Ignition errors on each individual master node to see why it's failing? I've queried coreos-ignition-write-issues.service, but it doesn't show any significant errors.
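For reference, this is roughly how I've been looking at logs on each master over SSH (standard journalctl invocations; I'm not sure these are the right places to look):

# Ignition only runs on first boot, so list the boots and query that journal
journalctl --list-boots
# Ignition messages are logged under the "ignition" syslog identifier
journalctl -t ignition --no-pager
# the service I already checked, for completeness
journalctl -u coreos-ignition-write-issues.service --no-pager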

On this note, I'd also like to check: besides communicating with the Quay repo, are the master.ign and worker.ign files served by the Apache httpd server important for cluster startup? Although my master.ign is accessible, I thought the bootstrap certificates expire after 24 hours, so the ignition files would no longer be needed after cluster initialization. Is my theory correct?

Cluster environment
OKD Cluster Version: 4.15.0-0.okd-2024-03-10-010116
Kubernetes version: v1.28.2-3598+6e2789bbd58938-dirty
Installation method: Bare-metal UPI (Airgapped, self hosted quay)

@melledouwsma

As port 6443 is up and you're able to SSH into the control plane nodes, I would use that route to gather information about the cluster. You can use the node-kubeconfig files in /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/ to connect to the API.

For example, when connected to one of the control-plane nodes:

# run as root on the node; localhost-recovery.kubeconfig talks to the
# local kube-apiserver directly, bypassing the external load balancer
export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
oc get clusterversion,co,nodes,csr

Based on the results, you can work out which part of the cluster needs your attention.
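For instance, a quick way to surface cluster operators that aren't in the healthy Available=True / Progressing=False / Degraded=False state (an illustrative convenience filter, not an official command):

# columns 3-5 of `oc get co` are AVAILABLE, PROGRESSING, DEGRADED
oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'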

@parseltongued
Author

Hi melledouwsma,

Thanks for your prompt reply.

Below is the output after running the commands above:
certificatesigningrequest.certificates.k8s.io/csr-57g7j   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
certificatesigningrequest.certificates.k8s.io/csr-9tsrs   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
certificatesigningrequest.certificates.k8s.io/csr-tstwl   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending

Prior to this, I had changed the network adapter settings in an attempt to move to another subnet range, but I had hoped the VM snapshot would have reverted all of that. Is there any way to further debug and fix this?

Thank you

@melledouwsma

This cluster has Pending internal certificates, which could be due to restoring the cluster from backup snapshots. To resolve this issue, follow these steps to approve the pending certificates:

  • SSH into one of the control plane nodes.
  • Reconfigure kubeconfig with a command like:
export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
  • Approve the pending CSRs manually using:
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
  • Wait about a minute and verify the certificates have been issued with:
oc get csr

When new Pending certificates appear, approve them the same way; several rounds may be needed, so the whole process can take a couple of minutes. When all certificates are issued and no new pending ones appear, check the external API endpoint.
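If you'd rather not re-run the approval by hand, a small loop can handle the rounds for you (a sketch using the same recovery kubeconfig; the 30-second sleep is an arbitrary choice):

# approve Pending CSRs in rounds; approving the client CSRs typically
# triggers new serving CSRs, so several passes are normal
while oc get csr | grep -q Pending; do
  oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
  sleep 30
done
oc get csr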

@parseltongued
Author

Hi Melle, my cluster is back up and everything works! Thanks so much for your instantaneous support on my case!
