Libvirt IPI: Workers not getting created on RHEL 8.6+ with virsh 8.0.0 #7004

Closed
pratham-m opened this issue Mar 21, 2023 · 10 comments
Labels: lifecycle/stale

pratham-m commented Mar 21, 2023

Version

$ openshift-install version
openshift-install unreleased-master-7855-gf11c21e5e98c0f19516a0c0b13b8350a8f636b36-dirty
built from commit f11c21e5e98c0f19516a0c0b13b8350a8f636b36
release image registry.ci.openshift.org/origin/release:4.13
release architecture ppc64le

Platform: Libvirt IPI

$ virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0

$ lsb_release -a
LSB Version:    :core-4.1-noarch:core-4.1-ppc64le
Distributor ID: RedHatEnterprise
Description:    Red Hat Enterprise Linux release 8.7 (Ootpa)
Release:        8.7
Codename:       Ootpa

What happened?

$ openshift-install create cluster --dir=$CLUSTER_DIR --log-level=debug fails while waiting for the worker nodes to come up.

level=info msg=Waiting up to 40m0s (until 10:10AM) for the cluster at https://api.ppc64le-qe53c.psi.redhat.com:6443/ to initialize...
level=debug msg=Still waiting for the cluster to initialize: Multiple errors are preventing progress:
level=debug msg=* Cluster operators authentication, image-registry, ingress, insights, kube-apiserver, kube-controller-manager, kube-scheduler, machine-api, monitoring, openshift-apiserver, openshift-controller-manager, openshift-samples, operator-lifecycle-manager-packageserver are not available
level=debug msg=* Could not update imagestream "openshift/driver-toolkit" (581 of 840): the server is down or not responding
level=debug msg=* Could not update oauthclient "console" (524 of 840): the server does not recognize this resource, check extension API servers
level=debug msg=* Could not update role "openshift-console-operator/prometheus-k8s" (757 of 840): resource may have been deleted
level=debug msg=* Could not update role "openshift-console/prometheus-k8s" (760 of 840): resource may have been deleted
level=debug msg=Still waiting for the cluster to initialize: Multiple errors are preventing progress:
level=debug msg=* Cluster operators authentication, image-registry, ingress, insights, kube-apiserver, kube-controller-manager, kube-scheduler, machine-api, monitoring, openshift-apiserver, openshift-controller-manager, openshift-samples, operator-lifecycle-manager-packageserver are not available
level=debug msg=* Could not update imagestream "openshift/driver-toolkit" (581 of 840): the server is down or not responding
level=debug msg=* Could not update oauthclient "console" (524 of 840): the server does not recognize this resource, check extension API servers
level=debug msg=* Could not update role "openshift-console-operator/prometheus-k8s" (757 of 840): resource may have been deleted
level=debug msg=* Could not update role "openshift-console/prometheus-k8s" (760 of 840): resource may have been deleted
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.13.0-rc.0
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.13.0-rc.0: 581 of 840 done (69% complete)
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=Still waiting for the cluster to initialize: Multiple errors are preventing progress:
level=debug msg=* Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=* Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (11 of 840)
level=debug msg=* Could not update prometheusrule "openshift-etcd-operator/etcd-prometheus-rules" (769 of 840)
level=debug msg=* Could not update servicemonitor "openshift-console/console" (762 of 840)
level=debug msg=* Could not update servicemonitor "openshift-ingress-operator/ingress-operator" (773 of 840)
level=debug msg=* Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (809 of 840)
level=debug msg=* Could not update servicemonitor "openshift-service-ca-operator/service-ca-operator" (831 of 840)
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, machine-api, monitoring are not available
...
...
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=error msg=failed to initialize the cluster: Cluster operators authentication, console, image-registry, ingress, machine-api, monitoring are not available
$ virsh list
 Id   Name                                 State
----------------------------------------------------
 15   ppc64le-qe53c-5pvdd-master-2         running
 16   ppc64le-qe53c-5pvdd-master-0         running
 17   ppc64le-qe53c-5pvdd-master-1         running
 19   ppc64le-qe53c-5pvdd-worker-0-7c6lw   running
 20   ppc64le-qe53c-5pvdd-worker-0-4vn7d   running
 21   ppc64le-qe53c-5pvdd-worker-0-kzwrk   running

$ oc get nodes
NAME                           STATUS   ROLES                  AGE   VERSION
ppc64le-qe53c-5pvdd-master-0   Ready    control-plane,master   44m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-1   Ready    control-plane,master   44m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-2   Ready    control-plane,master   44m   v1.26.2+06e8c46

$ oc get machines -A
NAMESPACE               NAME                                 PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ppc64le-qe53c-5pvdd-master-0         Running                               44m
openshift-machine-api   ppc64le-qe53c-5pvdd-master-1         Running                               44m
openshift-machine-api   ppc64le-qe53c-5pvdd-master-2         Running                               44m
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0-4vn7d   Provisioning                          41m
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0-7c6lw   Provisioning                          41m
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0-kzwrk   Provisioning                          41m

$ oc get machinesets -A
NAMESPACE               NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0   3         3                             44m
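
For reference, a few follow-up checks against the stuck Machines. This is only a diagnostic sketch (resource names are copied from the output above), not part of the failure output:

$ oc -n openshift-machine-api describe machine ppc64le-qe53c-5pvdd-worker-0-4vn7d    # events for one stuck Machine
$ oc -n openshift-machine-api logs deployments/machine-api-controllers --container=machine-controller | tail -n 100
$ sudo virsh console ppc64le-qe53c-5pvdd-worker-0-4vn7d                               # watch the guest's boot output directly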

What you expected to happen?

OCP cluster creation should succeed and all worker nodes should come up.
The expected output is shown below:

$ oc get nodes
NAME                                 STATUS   ROLES    AGE    VERSION
ppc64le-qe53c-5pvdd-master-0         Ready    master   150m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-1         Ready    master   150m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-2         Ready    master   150m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-worker-0-4vn7d   Ready    worker   145m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-worker-0-7c6lw   Ready    worker   141m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-worker-0-kzwrk   Ready    worker   145m   v1.26.2+06e8c46

How to reproduce it

Clone the repository and install the prerequisites as described at https://github.com/openshift/installer/tree/master/docs/dev/libvirt#libvirt-howto, then run:

$ TAGS=libvirt DEFAULT_ARCH=ppc64le hack/build.sh
$ openshift-install --dir=$CLUSTER_DIR create install-config
$ openshift-install --dir=$CLUSTER_DIR create manifests
$ openshift-install --dir=$CLUSTER_DIR create cluster --log-level=debug
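
As a side note, a minimal sketch of the host-side libvirt TCP setup the howto relies on (the installer and cluster-api-provider-libvirt connect over qemu+tcp). The libvirtd.conf option names are standard, but the socket-activation unit name and the connection URI below are assumptions for libvirt 8.0 on RHEL 8.6+, so verify them against the howto:

$ sudo sed -i -e 's/^#\?listen_tls.*/listen_tls = 0/' \
              -e 's/^#\?listen_tcp.*/listen_tcp = 1/' \
              -e 's/^#\?auth_tcp.*/auth_tcp = "none"/' /etc/libvirt/libvirtd.conf
$ sudo systemctl stop libvirtd
$ sudo systemctl enable --now libvirtd-tcp.socket    # on newer libvirt, socket activation replaces 'libvirtd --listen'
$ sudo systemctl start libvirtd
$ virsh -c qemu+tcp://<host-ip>/system list --all    # sanity-check the TCP endpoint; <host-ip> is a placeholder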

Anything else we need to know?

The issue is not specific to any OCP version and is reproducible on 4.12.x, 4.11.x, etc.
The same steps work fine on RHEL 8.5 with virsh 6.0.0.

References

The issues below might not be related, but they seem to have some similarities.

@pratham-m (Author)

$ cat install-config.yaml
apiVersion: v1
baseDomain: psi.redhat.com
compute:
- architecture: ppc64le
  hyperthreading: Enabled
  name: worker
  replicas: 3
controlPlane:
  architecture: ppc64le
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  creationTimestamp: null
  name: ppc64le-qe53c
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.128.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  libvirt:
    network:
      if: tt2
pullSecret: .........
sshKey: ............
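
Since the workers sit in Provisioning, one host-side check worth doing is whether the installer-created libvirt network (bridged on the tt2 interface referenced under platform.libvirt.network.if) is up and handing out DHCP leases to the worker domains. A hedged sketch; the network name passed to net-dhcp-leases is an assumption, so use whatever virsh net-list actually reports:

$ virsh net-list --all                    # the installer-created network should be active and set to autostart
$ virsh net-dhcp-leases ppc64le-qe53c     # network name is assumed; substitute the name shown by net-list
$ ip link show tt2                        # the bridge interface named in platform.libvirt.network.if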

@pratham-m (Author)

$ oc --namespace=openshift-machine-api get deployments
NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator          1/1     1            1           113d
cluster-baremetal-operator           1/1     1            1           113d
control-plane-machine-set-operator   1/1     1            1           113d
machine-api-controllers              1/1     1            1           113d
machine-api-operator                 1/1     1            1           113d

$  oc --namespace=openshift-machine-api logs deployments/machine-api-controllers --container=machine-controller
...
...
I0321 06:14:03.212356       1 controller.go:187] ppc64le-qe6a-mr77v-master-2: reconciling Machine
I0321 06:14:03.212429       1 actuator.go:224] Checking if machine ppc64le-qe6a-mr77v-master-2 exists.
I0321 06:14:03.214188       1 client.go:142] Created libvirt connection: 0xc000640e68
I0321 06:14:03.214738       1 client.go:317] Check if "ppc64le-qe6a-mr77v-master-2" domain exists
I0321 06:14:03.215200       1 client.go:158] Freeing the client pool
I0321 06:14:03.215357       1 client.go:164] Closing libvirt connection: 0xc000640e68
I0321 06:14:03.215807       1 controller.go:313] ppc64le-qe6a-mr77v-master-2: reconciling machine triggers idempotent update
I0321 06:14:03.215858       1 actuator.go:189] Updating machine ppc64le-qe6a-mr77v-master-2
I0321 06:14:03.218036       1 client.go:142] Created libvirt connection: 0xc000641158
I0321 06:14:03.218356       1 client.go:302] Lookup domain by name: "ppc64le-qe6a-mr77v-master-2"
I0321 06:14:03.218688       1 actuator.go:364] Updating status for ppc64le-qe6a-mr77v-master-2
I0321 06:14:03.220932       1 client.go:158] Freeing the client pool
I0321 06:14:03.221011       1 client.go:164] Closing libvirt connection: 0xc000641158
I0321 06:14:03.229977       1 controller.go:187] ppc64le-qe6a-mr77v-worker-0-65rjs: reconciling Machine
I0321 06:14:03.230057       1 actuator.go:224] Checking if machine ppc64le-qe6a-mr77v-worker-0-65rjs exists.
I0321 06:14:03.232083       1 client.go:142] Created libvirt connection: 0xc000818e18
I0321 06:14:03.232471       1 client.go:317] Check if "ppc64le-qe6a-mr77v-worker-0-65rjs" domain exists
I0321 06:14:03.232807       1 client.go:158] Freeing the client pool
I0321 06:14:03.232853       1 client.go:164] Closing libvirt connection: 0xc000818e18
I0321 06:14:03.233232       1 controller.go:313] ppc64le-qe6a-mr77v-worker-0-65rjs: reconciling machine triggers idempotent update
I0321 06:14:03.233271       1 actuator.go:189] Updating machine ppc64le-qe6a-mr77v-worker-0-65rjs
I0321 06:14:03.234835       1 client.go:142] Created libvirt connection: 0xc0008190d8
I0321 06:14:03.235255       1 client.go:302] Lookup domain by name: "ppc64le-qe6a-mr77v-worker-0-65rjs"
I0321 06:14:03.235568       1 actuator.go:364] Updating status for ppc64le-qe6a-mr77v-worker-0-65rjs
I0321 06:14:03.237846       1 client.go:158] Freeing the client pool
I0321 06:14:03.237968       1 client.go:164] Closing libvirt connection: 0xc0008190d8
I0321 06:14:03.248101       1 controller.go:187] ppc64le-qe6a-mr77v-worker-0-h7b2s: reconciling Machine
I0321 06:14:03.248126       1 actuator.go:224] Checking if machine ppc64le-qe6a-mr77v-worker-0-h7b2s exists.
I0321 06:14:03.250372       1 client.go:142] Created libvirt connection: 0xc000c54988
I0321 06:14:03.250726       1 client.go:317] Check if "ppc64le-qe6a-mr77v-worker-0-h7b2s" domain exists
I0321 06:14:03.251060       1 client.go:158] Freeing the client pool
I0321 06:14:03.251087       1 client.go:164] Closing libvirt connection: 0xc000c54988
I0321 06:14:03.251454       1 controller.go:313] ppc64le-qe6a-mr77v-worker-0-h7b2s: reconciling machine triggers idempotent update
I0321 06:14:03.251466       1 actuator.go:189] Updating machine ppc64le-qe6a-mr77v-worker-0-h7b2s
I0321 06:14:03.253286       1 client.go:142] Created libvirt connection: 0xc000c54c48
I0321 06:14:03.253599       1 client.go:302] Lookup domain by name: "ppc64le-qe6a-mr77v-worker-0-h7b2s"
I0321 06:14:03.253933       1 actuator.go:364] Updating status for ppc64le-qe6a-mr77v-worker-0-h7b2s
I0321 06:14:03.256177       1 client.go:158] Freeing the client pool
I0321 06:14:03.256196       1 client.go:164] Closing libvirt connection: 0xc000c54c48
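
The excerpt above only shows the periodic reconcile loop completing. A hedged way to surface any actual create/update failures from the same controller and from a stuck Machine object (field availability depends on the Machine API version):

$ oc -n openshift-machine-api logs deployments/machine-api-controllers --container=machine-controller | grep -iE 'error|fail|timed out'
$ oc -n openshift-machine-api get machine ppc64le-qe6a-mr77v-worker-0-65rjs -o yaml    # look at status.errorReason / status.errorMessage, if set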

@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label Jun 20, 2023
@pratham-m (Author)

/remove-lifecycle stale

@openshift-ci bot removed the lifecycle/stale label Jun 20, 2023
@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label Sep 18, 2023
@cfergeau (Contributor)

Since this happens on ppc64le, this is most likely https://issues.redhat.com/browse/OCPBUGS-17476, which is caused by a regression in SLOF. While that is being fixed, we can add a workaround in cluster-api-provider-libvirt.
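
A hedged way to check whether a given host hits the same SLOF regression: attach to one stuck worker's console and see if it hangs in the firmware stage before GRUB/Ignition, and record the firmware/QEMU packages for comparison with OCPBUGS-17476 (the domain name below is taken from the virsh list output earlier; the grep is only a loose filter, since package names vary):

$ sudo virsh console ppc64le-qe53c-5pvdd-worker-0-4vn7d    # a hang before GRUB/Ignition points at guest firmware rather than the installer
$ rpm -qa | grep -iE 'slof|qemu'                           # note the firmware/QEMU build versions on the hypervisor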

@dale-fu commented Sep 18, 2023

We have also seen this on s390x before. We are stuck on a specific libvirt build, libvirt-6.0.0-37.module+el8.5.0+12162+40884dd2, since anything later did not seem to work for the libvirt IPI installation.
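
For anyone who needs to stay on that build, a minimal sketch of pinning it with the dnf versionlock plugin. The plugin package name is an assumption for RHEL 8, and libvirt there is delivered via the virt module stream, which may need to be pinned separately:

$ sudo dnf install -y python3-dnf-plugin-versionlock
$ sudo dnf versionlock add libvirt-6.0.0-37.module+el8.5.0+12162+40884dd2
$ sudo dnf versionlock list    # confirm the lock so a later 'dnf update' keeps this build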

@pratham-m (Author)

/remove-lifecycle stale

@openshift-ci bot removed the lifecycle/stale label Sep 20, 2023
@pratham-m changed the title from "Libvirt IPI: Workers not getting created on RHEL 8.7 with virsh 8.0.0" to "Libvirt IPI: Workers not getting created on RHEL 8.6+ with virsh 8.0.0" on Sep 27, 2023
@pratham-m (Author)

@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label Dec 27, 2023