
[OKD FCOS 4.15] OKD master unable to start up, boot logs location to see errors with oc #1995

Closed
parseltongued opened this issue Aug 4, 2024 · 4 comments

Comments

@parseltongued

Describe the bug
I have a 3-node OKD FCOS 4.15 air-gapped cluster with an on-prem Quay registry. On cluster restart, port 6443 shows green on HAProxy, but machine-config port 22623 stays red and the console doesn't come up. I can't query anything with oc from my bastion server, but I can SSH into each master. How can I check the boot-up Ignition errors on each individual master node to see why it's failing? I've queried coreos-ignition-write-issues.service, but it doesn't show any significant errors.
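For reference, this is roughly how I've been looking at logs on each master over SSH (standard journalctl invocations; I'm not sure these are the right places to look):

# Ignition only runs on first boot, so list the boots and query that journal
journalctl --list-boots
# Ignition messages are logged under the "ignition" syslog identifier
journalctl -t ignition --no-pager
# the service I already checked, for completeness
journalctl -u coreos-ignition-write-issues.service --no-pager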

On this note, I'd also like to check: besides communicating with the Quay repo, are the master.ign and worker.ign files served by the Apache httpd server important for cluster startup? Although my master.ign is accessible, I thought the bootstrap certificates expire after 24 hours, so the ignition files would no longer be needed after cluster initialization. Is my theory correct?

Cluster environment
OKD Cluster Version: 4.15.0-0.okd-2024-03-10-010116
Kubernetes version: v1.28.2-3598+6e2789bbd58938-dirty
Installation method: Bare-metal UPI (Airgapped, self hosted quay)

@melledouwsma

As port 6443 is up and you're able to SSH into the control plane nodes, I would use that route to gather information about the cluster. You can use the node-kubeconfig files in /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/ to connect to the API.

For example, when connected to one of the control-plane nodes:

# run as root on the node; localhost-recovery.kubeconfig talks to the
# local kube-apiserver directly, bypassing the external load balancer
export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
oc get clusterversion,co,nodes,csr

Based on the results, you can work out which part of the cluster needs your attention.
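For instance, a quick way to surface cluster operators that aren't in the healthy Available=True / Progressing=False / Degraded=False state (an illustrative convenience filter, not an official command):

# columns 3-5 of `oc get co` are AVAILABLE, PROGRESSING, DEGRADED
oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'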

@parseltongued
Author

Hi melledouwsma,

Thanks for your prompt reply.

Below is the output after running the commands above:
certificatesigningrequest.certificates.k8s.io/csr-57g7j   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
certificatesigningrequest.certificates.k8s.io/csr-9tsrs   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
certificatesigningrequest.certificates.k8s.io/csr-tstwl   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending

Prior to this, I had changed the network adapter settings in an attempt to move to another subnet range, but I had hoped the VM snapshot would have reverted all of that. Is there any way to further debug and fix this?

Thank you

@melledouwsma

This cluster has Pending internal certificates, which could be due to restoring the cluster from backup snapshots. To resolve this issue, follow these steps to approve the pending certificates:

  • SSH into one of the control plane nodes.
  • Reconfigure kubeconfig with a command like:
export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
  • Approve the pending CSRs manually using:
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
  • Wait about a minute and verify the certificates have been issued with:
oc get csr

When new Pending certificates appear, approve them the same way; several rounds may be needed, so the whole process can take a couple of minutes. When all certificates are issued and no new pending ones appear, check the external API endpoint.
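If you'd rather not re-run the approval by hand, a small loop can handle the rounds for you (a sketch using the same recovery kubeconfig; the 30-second sleep is an arbitrary choice):

# approve Pending CSRs in rounds; approving the client CSRs typically
# triggers new serving CSRs, so several passes are normal
while oc get csr | grep -q Pending; do
  oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
  sleep 30
done
oc get csr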

@parseltongued
Author

Hi Melle, my cluster is back up and everything works! Thanks so much for your instantaneous support on my case!
