[libvirt] Default time for 30m for the cluster to initialize not enough #1428
Comments
A slow internet connection loses a lot of time in podman pull. Regards, |
@ssbano I am not on a slow internet connection. I am using GCE nested virtualization for this, and image pulls are quite fast. Have you tried the latest master of the installer today or yesterday with success? |
I changed the timeout in create.go to 60 minutes. Regards, |
I'll try tonight and give you some feedback. Regards, |
I was waiting to solve the certificate problem #1394 |
Check the pods' behaviour; some of the components keep retrying. $ oc get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-member-test1-dl2sn-master-0 1/1 Running 0 4h41m
openshift-apiserver-operator openshift-apiserver-operator-6cc4546d78-lztg7 1/1 Running 48 4h42m
openshift-apiserver apiserver-7xqg9 1/1 Running 0 4h35m
openshift-authentication-operator openshift-authentication-operator-5cdb858474-mscb8 1/1 Running 37 4h36m
openshift-authentication openshift-authentication-c9c5f79ff-98zdf 1/1 Running 0 92m
openshift-authentication openshift-authentication-c9c5f79ff-zkwcl 1/1 Running 0 93m
openshift-cloud-credential-operator cloud-credential-operator-6575487785-7fq5j 1/1 Running 0 4h40m
openshift-cluster-machine-approver machine-approver-7c7c9d9686-cnbpx 1/1 Running 0 4h42m
openshift-cluster-node-tuning-operator cluster-node-tuning-operator-74d66ffb55-fhcjq 1/1 Running 0 4h36m
openshift-cluster-node-tuning-operator tuned-42hj9 1/1 Running 0 4h35m
openshift-cluster-node-tuning-operator tuned-kd6tw 1/1 Running 0 4h35m
openshift-cluster-samples-operator cluster-samples-operator-7d69dc98c-g69fj 1/1 Running 0 4h36m
openshift-cluster-storage-operator cluster-storage-operator-7499867ff9-dkjrc 1/1 Running 5 4h36m
openshift-cluster-version cluster-version-operator-847975454d-tnbv4 1/1 Running 0 4h41m
openshift-console-operator console-operator-6969dc4d65-plgbj 1/1 Running 37 4h36m
openshift-console console-547c7b8846-29z89 1/1 Running 2 4h33m
openshift-console console-547c7b8846-8l8gm 1/1 Running 2 4h33m
openshift-controller-manager-operator openshift-controller-manager-operator-d54df77d7-9bw5t 1/1 Running 47 4h42m
openshift-controller-manager controller-manager-4b7sc 1/1 Running 8 39m
openshift-dns-operator dns-operator-58b496f677-ckfh2 1/1 Running 0 4h42m
openshift-dns dns-default-4d7km 2/2 Running 0 4h38m
openshift-dns dns-default-q6htd 2/2 Running 0 4h41m
openshift-image-registry cluster-image-registry-operator-bff67847f-zq5p7 1/1 Running 0 4h36m
openshift-image-registry image-registry-5d85d6c8d9-kgl28 1/1 Running 0 4h35m
openshift-image-registry node-ca-p7q26 1/1 Running 0 4h35m
openshift-image-registry node-ca-sc4h9 1/1 Running 0 4h35m
openshift-ingress-operator ingress-operator-84fcfc454d-rwvd2 1/1 Running 0 4h36m
openshift-ingress router-default-875d76bbf-bzsbx 0/1 Pending 0 4h35m
openshift-ingress router-default-875d76bbf-xbhb9 1/1 Running 0 4h35m
openshift-kube-apiserver-operator kube-apiserver-operator-7d7b576bd6-xbtrh 1/1 Running 47 4h42m
openshift-kube-apiserver installer-32-test1-dl2sn-master-0 0/1 Completed 0 142m
openshift-kube-apiserver installer-41-test1-dl2sn-master-0 0/1 OOMKilled 0 51m
openshift-kube-apiserver installer-42-test1-dl2sn-master-0 0/1 Completed 0 40m
openshift-kube-apiserver installer-43-test1-dl2sn-master-0 0/1 Completed 0 39m
openshift-kube-apiserver installer-44-test1-dl2sn-master-0 0/1 Completed 0 37m
openshift-kube-apiserver installer-45-test1-dl2sn-master-0 0/1 Completed 0 36m
openshift-kube-apiserver installer-46-test1-dl2sn-master-0 0/1 Completed 0 34m
openshift-kube-apiserver installer-47-test1-dl2sn-master-0 0/1 Completed 0 32m
openshift-kube-apiserver installer-48-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-apiserver installer-49-test1-dl2sn-master-0 0/1 Completed 0 22m
openshift-kube-apiserver kube-apiserver-test1-dl2sn-master-0 2/2 Running 0 21m
openshift-kube-apiserver revision-pruner-32-test1-dl2sn-master-0 0/1 Completed 0 136m
openshift-kube-apiserver revision-pruner-41-test1-dl2sn-master-0 0/1 OOMKilled 0 50m
openshift-kube-apiserver revision-pruner-42-test1-dl2sn-master-0 0/1 Completed 0 39m
openshift-kube-apiserver revision-pruner-43-test1-dl2sn-master-0 0/1 Completed 0 37m
openshift-kube-apiserver revision-pruner-44-test1-dl2sn-master-0 0/1 Completed 0 36m
openshift-kube-apiserver revision-pruner-45-test1-dl2sn-master-0 0/1 Completed 0 34m
openshift-kube-apiserver revision-pruner-46-test1-dl2sn-master-0 0/1 Completed 0 32m
openshift-kube-apiserver revision-pruner-47-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-apiserver revision-pruner-48-test1-dl2sn-master-0 0/1 Completed 0 22m
openshift-kube-apiserver revision-pruner-49-test1-dl2sn-master-0 0/1 Completed 0 16m
openshift-kube-controller-manager-operator kube-controller-manager-operator-7fd685cf44-7v7rb 1/1 Running 47 4h42m
openshift-kube-controller-manager installer-2-test1-dl2sn-master-0 0/1 Completed 0 4h40m
openshift-kube-controller-manager installer-22-test1-dl2sn-master-0 0/1 Completed 0 160m
openshift-kube-controller-manager installer-3-test1-dl2sn-master-0 0/1 Completed 0 4h39m
openshift-kube-controller-manager installer-37-test1-dl2sn-master-0 0/1 Completed 0 40m
openshift-kube-controller-manager installer-38-test1-dl2sn-master-0 0/1 Completed 0 40m
openshift-kube-controller-manager installer-39-test1-dl2sn-master-0 0/1 Completed 0 32m
openshift-kube-controller-manager installer-40-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-controller-manager installer-41-test1-dl2sn-master-0 0/1 Completed 0 22m
openshift-kube-controller-manager installer-42-test1-dl2sn-master-0 0/1 Completed 0 16m
openshift-kube-controller-manager installer-5-test1-dl2sn-master-0 0/1 Completed 0 4h34m
openshift-kube-controller-manager kube-controller-manager-test1-dl2sn-master-0 1/1 Running 3 15m
openshift-kube-controller-manager revision-pruner-2-test1-dl2sn-master-0 0/1 Completed 0 4h39m
openshift-kube-controller-manager revision-pruner-22-test1-dl2sn-master-0 0/1 Completed 0 159m
openshift-kube-controller-manager revision-pruner-3-test1-dl2sn-master-0 0/1 Completed 0 4h38m
openshift-kube-controller-manager revision-pruner-37-test1-dl2sn-master-0 0/1 Completed 0 40m
openshift-kube-controller-manager revision-pruner-38-test1-dl2sn-master-0 0/1 Completed 0 38m
openshift-kube-controller-manager revision-pruner-39-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-controller-manager revision-pruner-40-test1-dl2sn-master-0 0/1 Completed 0 22m
openshift-kube-controller-manager revision-pruner-41-test1-dl2sn-master-0 0/1 Completed 0 16m
openshift-kube-controller-manager revision-pruner-42-test1-dl2sn-master-0 0/1 Completed 0 14m
openshift-kube-controller-manager revision-pruner-5-test1-dl2sn-master-0 0/1 Completed 0 4h34m
openshift-kube-scheduler-operator openshift-kube-scheduler-operator-7fb87bf449-g4pvv 1/1 Running 47 4h42m
openshift-kube-scheduler installer-20-test1-dl2sn-master-0 0/1 Error 0 159m
openshift-kube-scheduler installer-21-test1-dl2sn-master-0 0/1 Error 0 157m
openshift-kube-scheduler installer-37-test1-dl2sn-master-0 0/1 Error 0 37m
openshift-kube-scheduler installer-38-test1-dl2sn-master-0 0/1 Completed 0 36m
openshift-kube-scheduler installer-39-test1-dl2sn-master-0 0/1 Completed 0 32m
openshift-kube-scheduler installer-40-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-scheduler installer-41-test1-dl2sn-master-0 0/1 Completed 0 22m
openshift-kube-scheduler installer-42-test1-dl2sn-master-0 0/1 Completed 0 15m
openshift-kube-scheduler openshift-kube-scheduler-test1-dl2sn-master-0 1/1 Running 3 15m
openshift-kube-scheduler revision-pruner-20-test1-dl2sn-master-0 0/1 Completed 0 158m
openshift-kube-scheduler revision-pruner-21-test1-dl2sn-master-0 0/1 Completed 0 156m
openshift-kube-scheduler revision-pruner-37-test1-dl2sn-master-0 0/1 Completed 0 36m
openshift-kube-scheduler revision-pruner-38-test1-dl2sn-master-0 0/1 OOMKilled 0 36m
openshift-kube-scheduler revision-pruner-39-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-scheduler revision-pruner-40-test1-dl2sn-master-0 0/1 Completed 0 28m
openshift-kube-scheduler revision-pruner-41-test1-dl2sn-master-0 0/1 Completed 0 16m
openshift-kube-scheduler revision-pruner-42-test1-dl2sn-master-0 0/1 Completed 0 14m
openshift-machine-api cluster-autoscaler-operator-5dfb8c9c49-wpxrw 1/1 Running 46 4h42m
openshift-machine-api clusterapi-manager-controllers-787f5f5dbf-t2dpb 4/4 Running 0 4h41m
openshift-machine-api machine-api-operator-7958456954-vw2bf 1/1 Running 0 4h41m
openshift-machine-config-operator machine-config-controller-55f66dcdb4-qjgvr 1/1 Running 0 4h41m
openshift-machine-config-operator machine-config-daemon-rwpx5 1/1 Running 0 4h38m
openshift-machine-config-operator machine-config-daemon-xhkmj 1/1 Running 0 4h40m
openshift-machine-config-operator machine-config-operator-6d55fdf7b9-kf9w5 1/1 Running 0 4h42m
openshift-machine-config-operator machine-config-server-n29s4 1/1 Running 0 4h40m
openshift-marketplace certified-operators-64cfd5dfc-x5pf4 1/1 Running 2 4h32m
openshift-marketplace community-operators-5dd94977d7-9925k 1/1 Running 2 4h32m
openshift-marketplace marketplace-operator-5957c59b55-t7hqt 1/1 Running 2 4h36m
openshift-marketplace redhat-operators-86f64c75b5-9gdz7 1/1 Running 2 4h32m
openshift-monitoring alertmanager-main-0 3/3 Running 0 4h31m
openshift-monitoring alertmanager-main-1 3/3 Running 0 4h30m
openshift-monitoring alertmanager-main-2 3/3 Running 0 4h30m
openshift-monitoring cluster-monitoring-operator-556478479c-xhgwx 1/1 Running 0 4h36m
openshift-monitoring grafana-6df47448b-9fc7l 2/2 Running 0 4h31m
openshift-monitoring kube-state-metrics-c886c4d49-4qxqg 3/3 Running 0 4h36m
openshift-monitoring node-exporter-rvjdg 2/2 Running 0 4h35m
openshift-monitoring node-exporter-wv2jk 2/2 Running 0 4h35m
openshift-monitoring prometheus-adapter-7867c57f4f-7xd77 1/1 Running 0 51m
openshift-monitoring prometheus-adapter-7867c57f4f-9mhrg 1/1 Running 0 51m
openshift-monitoring prometheus-k8s-0 6/6 Running 1 4h29m
openshift-monitoring prometheus-k8s-1 6/6 Running 1 4h29m
openshift-monitoring prometheus-operator-654865bfd9-5kbln 1/1 Running 2 4h32m
openshift-multus multus-dqpv5 1/1 Running 0 4h41m
openshift-multus multus-qpss2 1/1 Running 0 4h38m
openshift-network-operator network-operator-8474b95564-vhbd2 1/1 Running 0 4h42m
openshift-operator-lifecycle-manager catalog-operator-7555554bb6-vd8k2 1/1 Running 0 4h40m
openshift-operator-lifecycle-manager olm-operator-7554b549f9-zcxgg 1/1 Running 0 4h40m
openshift-operator-lifecycle-manager olm-operators-qmrdj 1/1 Running 0 4h41m
openshift-operator-lifecycle-manager packageserver-586d546f99-dtlnj 1/1 Running 0 33m
openshift-operator-lifecycle-manager packageserver-586d546f99-rxwxx 1/1 Running 0 33m
openshift-sdn ovs-4bf5l 1/1 Running 0 4h38m
openshift-sdn ovs-ldddl 1/1 Running 0 4h41m
openshift-sdn sdn-4vjvb 1/1 Running 0 4h38m
openshift-sdn sdn-controller-t5q9b 1/1 Running 46 4h41m
openshift-sdn sdn-wbvpw 1/1 Running 1 4h41m
openshift-service-ca-operator openshift-service-ca-operator-7b56d9576d-vhfvd 1/1 Running 45 4h42m
openshift-service-ca apiservice-cabundle-injector-5f4598456-vmkgn 1/1 Running 46 4h40m
openshift-service-ca configmap-cabundle-injector-6db5fd6746-6dr8h 1/1 Running 46 4h40m
openshift-service-ca service-serving-cert-signer-7f54646c5-znlst 1/1 Running 46 4h40m
openshift-service-catalog-apiserver-operator openshift-service-catalog-apiserver-operator-7798bcdbc7-ggrsr 1/1 Running 37 4h36m
openshift-service-catalog-controller-manager-operator openshift-service-catalog-controller-manager-operator-596b4mphw 1/1 Running 37 4h36m |
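The listing above is long, and the interesting rows (high restart counts, OOMKilled, Error, Pending) are easy to miss. The following is a minimal Go sketch for picking them out of pasted `oc get pods --all-namespaces` text. It is not part of the installer; the function name `suspicious` and the restart threshold of 10 are assumptions chosen for illustration, and the sample rows are copied from the listing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample holds a few rows copied from the listing above.
const sample = `NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-member-test1-dl2sn-master-0 1/1 Running 0 4h41m
openshift-apiserver-operator openshift-apiserver-operator-6cc4546d78-lztg7 1/1 Running 48 4h42m
openshift-ingress router-default-875d76bbf-bzsbx 0/1 Pending 0 4h35m
openshift-kube-apiserver installer-41-test1-dl2sn-master-0 0/1 OOMKilled 0 51m
openshift-kube-apiserver installer-42-test1-dl2sn-master-0 0/1 Completed 0 40m`

// suspicious returns one "namespace/name: reason" entry for every pod whose
// status is neither Running nor Completed, or whose restart count exceeds
// maxRestarts. Header lines and lines that do not parse are skipped.
func suspicious(listing string, maxRestarts int) []string {
	var out []string
	for _, line := range strings.Split(listing, "\n") {
		f := strings.Fields(line)
		if len(f) < 6 || f[0] == "NAMESPACE" {
			continue
		}
		status := f[3]
		restarts, err := strconv.Atoi(f[4])
		if err != nil {
			continue
		}
		switch {
		case status != "Running" && status != "Completed":
			out = append(out, f[0]+"/"+f[1]+": status "+status)
		case restarts > maxRestarts:
			out = append(out, f[0]+"/"+f[1]+": "+f[4]+" restarts")
		}
	}
	return out
}

func main() {
	// The threshold of 10 restarts is an arbitrary illustrative value.
	for _, s := range suspicious(sample, 10) {
		fmt.Println(s)
	}
}
```

Run against the full listing, a filter like this would surface the operator pods with 37 to 48 restarts alongside the OOMKilled installer and revision-pruner pods, which is the pattern the comment describes.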
I changed the bootstrap and master memory values to 32 GB. Are you using the default settings? |
oc describe , What problem does it report? |
@ssbano I use 12 GB for the master, not the default, and the bootstrap node doesn't run workloads; it is only there until etcd ownership is transferred to the master. |
I know, but if you adjust the values the bootstrap runs faster, and it is then destroyed by the installer afterwards. |
FWIW, I can verify that is usually the case, although I haven't checked how long it really takes for the cluster to be fully up. It's at least more than 30 mins since installer gives up before that. |
I've been busy, but I'll do a git fetch soon. Have you made any progress? Regards, |
On libvirt, this may be due to delays pulling all the images down to your local machines. But it can happen for other reasons too. For example, here we timed out on AWS because of the same CVO-roll-out delays discussed in openshift/cluster-authentication-operator#95. |
@ssbano Today, with the latest master, I am not seeing this, apart from the known issue openshift/cluster-storage-operator#19. Something may have improved regarding the CVO roll-out, but I will keep this open until we have the desired success with the libvirt provider. |
All right! Regards, |
I'm also hitting the 30m timeout when using the libvirt install method:
I'll try changing the timeout in create.go to 60m and will report back. |
@jsm84 sure, also can you tell which version of installer you are using? Is it from the master or a tag one? |
I was using the master branch. I just ran the install again, after rebasing against the latest changes to master and changing the cluster initialization timeout to 60m, and it still timed out. It reached 94% completion. I'm running this on a Lenovo P50 laptop (4 cores, 8 threads, 32GB RAM) which isn't exactly low-end hardware, and Fedora 29 Xfce Spin (lighter than usual DE = more free resources). |
We frequently see issues where 30m isn't enough for the cluster to come up. Waiting another 10-15 minutes would probably resolve the issue for us. This bumps the cluster timeout to 60 minutes, and lets users override all timeouts with environment variables. Fixes openshift#1428
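As an illustration of the idea in that change description (a default timeout that an environment variable can override), here is a minimal Go sketch. It is not the installer's actual code: the variable name `EXAMPLE_CLUSTER_INIT_TIMEOUT` and the helper names are hypothetical, and only the 60-minute default comes from the thread.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultTimeout mirrors the 60-minute value mentioned in the thread.
const defaultTimeout = 60 * time.Minute

// parseTimeout converts a string such as "45m" or "1h30m" into a duration.
// It falls back to def when raw is empty, malformed, or not positive.
func parseTimeout(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// timeoutFromEnv reads the override from the environment variable named key.
// The key used in main is a made-up name, not a real installer setting.
func timeoutFromEnv(key string, def time.Duration) time.Duration {
	return parseTimeout(os.Getenv(key), def)
}

func main() {
	timeout := timeoutFromEnv("EXAMPLE_CLUSTER_INIT_TIMEOUT", defaultTimeout)
	fmt.Printf("waiting up to %v for the cluster to initialize\n", timeout)
}
```

Falling back to the default on a malformed value keeps a typo in the environment from silently producing a zero timeout; a stricter variant could instead fail fast with an error.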
One thing that helped a lot for me in libvirt was to provide more vcpus. See https://github.com/cgwalters/xokdinst/blob/d45f422bed5a6cfaa78a1efb67aaecf5f77d6b37/src/xokdinst.rs#L146 The installer should probably detect the number of physical cores and match that. |
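A rough Go sketch of the "detect the host's cores and match that" suggestion follows. Note the caveat: Go's `runtime.NumCPU` reports logical CPUs usable by the process (hardware threads), not physical cores, so this only approximates what the comment asks for. The function name `suggestedVCPUs` and the floor of 4 are assumptions for illustration, not installer behaviour.

```go
package main

import (
	"fmt"
	"runtime"
)

// suggestedVCPUs returns the vCPU count to give a VM: the host CPU count,
// but never less than minimum. Both the policy and the floor are
// illustrative assumptions.
func suggestedVCPUs(hostCPUs, minimum int) int {
	if hostCPUs < minimum {
		return minimum
	}
	return hostCPUs
}

func main() {
	// runtime.NumCPU counts logical CPUs, not physical cores.
	host := runtime.NumCPU()
	fmt.Printf("host logical CPUs: %d, suggested vCPUs per VM: %d\n",
		host, suggestedVCPUs(host, 4))
}
```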
Still getting a timeout, even at 3hrs 20min (200min). I went digging further into the code, besides simply adjusting the timeout value in create.go, and found that this function is simply watching the Here are the contents of the ClusterVersion CR for my libvirt cluster:
It appears to be attempting to fetch updates and is failing due to an unknown version. This version tag appears to be created at compile time, which might explain why it fails to find anything. I just don't know the best course of action to circumvent this. It appears as though the cluster operator is stuck in a "never completed" state since it can't find updates for the version it's running. |
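The observation above explains why a longer timeout did not help: a wait loop of this kind only returns early when its condition becomes true, so if the condition can never be met, any timeout just delays the same failure. The Go sketch below shows the general pattern under that assumption. It is not the installer's implementation; `waitWithTimeout` and the stand-in check are hypothetical, and the real code watches the ClusterVersion resource rather than calling a plain function.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitWithTimeout calls check immediately and then once per interval until
// it reports true, returns an error, or the timeout elapses.
func waitWithTimeout(timeout, interval time.Duration, check func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline reached before the condition was met
		case <-ticker.C:
		}
	}
}

func main() {
	// Stand-in for "has the cluster version reported completion?".
	// It never succeeds, mimicking a roll-out that never finishes.
	neverDone := func() (bool, error) { return false, nil }

	err := waitWithTimeout(200*time.Millisecond, 20*time.Millisecond, neverDone)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timed out: the condition was never met")
	}
}
```

With a check that can never succeed, raising the first argument from 30 minutes to 200 minutes changes only how long the caller waits before seeing the same deadline error.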
Below is the pasted output:
I can mostly confirm #1371, as I noticed that I was unable to log in as kubeadmin, since I couldn't resolve the openshift-authentication route (missing the apps.test.openshift.local wildcard entry in libvirt's net-config for the cluster network). I can't completely confirm it, as I'm not getting to the "Waiting on Console" part of the installation. My timeout is occurring during the phase prior to that: Here's the log snippet:
And finally, here's the output of
|
I think increasing timeouts may not be the ideal solution. What I found is that after the bootstrap VM sets up the masters, it waits for the API to come up by pinging the public DNS name, which can take a very long time to propagate depending on where you are located and which DNS server you are using. I got 100% failure when installing from within the office network because we have our own corporate DNS server. Next I tried from home, and my ISP's DNS was also slow to see the new DNS entry. Finally, I changed /etc/resolv.conf to point to Google DNS and the install was successful. I think we are going to have a problem with enterprise customers trying to install from within the corporate network and failing. The further away from US-based AWS clusters, the worse the problem. Increasing the timeout to 1 hr or more does not make sense and is a band-aid solution. I am just a daft moron trying out the install, but is it possible to separate the install from the Route 53 DNS config? Maybe do a staged install, and ask users to manually verify that the DNS entry is resolvable from their machine before continuing with tearing down the bootstrap, etc.? Also, instead of pinging the public DNS, why not ping the private AWS DNS entry and finish the install? The user can wait for DNS to propagate and then do oc login when the entry is publicly visible. Just my 2c |
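The manual check proposed above ("verify that the DNS is resolvable from their machine") can be sketched in a few lines of Go. This is an illustration only, not something the installer does: the helper name `resolvable` is made up, and the host name is a placeholder under the reserved `.invalid` top-level domain, so the sketch as written always takes the "does not resolve" branch.

```go
package main

import (
	"fmt"
	"net"
)

// resolvable reports whether lookup returns at least one address for host.
// The lookup function is passed in so the check can be exercised without
// touching the network; net.LookupHost has the matching signature.
func resolvable(lookup func(host string) ([]string, error), host string) bool {
	addrs, err := lookup(host)
	return err == nil && len(addrs) > 0
}

func main() {
	// Placeholder name; substitute the cluster's real API host name.
	host := "api.example-cluster.invalid"

	if resolvable(net.LookupHost, host) {
		fmt.Println(host, "resolves from this machine")
	} else {
		fmt.Println(host, "does not resolve yet; DNS may still be propagating")
	}
}
```

Because `net.LookupHost` uses the machine's configured resolver, running a check like this from the office network, from home, and after editing /etc/resolv.conf would show the same differences the comment describes.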
That should only be an option from within the installer-created VPC, unless you're following a UPI flow. But you could certainly create an installer machine in your AWS account (in your target region or not, probably wouldn't matter) and run |
@rsriniva Have you set up the DNS overlay as instructed here and then used the workaround for the console issue? |
When using the kni-installer for baremetal in a virtualized setting, such as CI environments or classroom lab settings, the apiTimeout wait of 30m is not enough for the bootstrap VM to instantiate, due to the reduced speed of nested virtualization. I have noticed in these nested virtualized environments that the job "A start job is running for Ignition (disks)" can take anywhere from 12m to 22m to complete, depending on the underlying disk in the nested environment. By that time RHCOS has already been laid down on the master nodes, which have been waiting for the API, but the bootstrap VM is still in the process of coming up and the 30m timer is already running. I used to make the timeout adjustment myself in create.go, but now that we are working with release payloads where that is bundled in, I no longer have that ability. This lab is in Westford. |
/label platform/libvirt |
/priority critical-urgent |
I had the state that jsm84 described in his comment |
I am facing the timeout issue too:
OS image: Openshift Installer version : I have tried all alternatives from modifying the system resource value to increasing the timeout value, but no luck. |
@Sarang-Sangram Sorry to hear that you are facing the same problem, but if waiting longer for the installer doesn't help and the cluster never comes up, this is not the relevant issue. You might be facing #1893 |
After destroying the cluster I re-initiated the cluster creation, which again timed out. But I noticed the states below in the oc command output:
|
@praveenkumar - see - https://pbs.twimg.com/media/EFiKJ0OW4AAc4xn?format=png&name=large Regards, |
We already support And as for 30 min timeout customization: we run CI on most of our deployment models and they always take around 25-35 mins in total. We do accept bugs when a slowdown is caused by a bug, but not when it is caused by user environment restrictions. /close |
@abhinavdahiya: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Version
Platform (aws|libvirt|openstack):
libvirt
What happened?
Cluster initialisation times out after 30m, while the cluster takes around 45-50 mins to become healthy with the libvirt provider.
What you expected to happen?
The cluster should take only 30 mins on libvirt as well, as it does on AWS.
How to reproduce it (as minimally and precisely as possible)?