libvirt with kvm installation failed #2157

Closed · ahrtr opened this issue Aug 5, 2019 · 19 comments

ahrtr commented Aug 5, 2019

Version

$ openshift-install version
bin/openshift-install unreleased-master-1353-g526edca4f6646af4bbcdf48bc2c2ee1d44d9fada
built from commit 526edca4f6646af4bbcdf48bc2c2ee1d44d9fada
release image registry.svc.ci.openshift.org/origin/release:4.2

Platform:

[root@origin-build installer]# uname -a
Linux origin-build 3.10.0-957.21.3.el7.x86_64 #1 SMP Fri Jun 14 02:54:29 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@origin-build ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

What happened?

DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: no route to host 



DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: no route to host 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused 

DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 



DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Fetching "Install Config"...                 
DEBUG Loading "Install Config"...                  
DEBUG   Loading "SSH Key"...                       
DEBUG   Using "SSH Key" loaded from state file     
DEBUG   Loading "Base Domain"...                   
DEBUG     Loading "Platform"...                    
DEBUG     Using "Platform" loaded from state file  
DEBUG   Using "Base Domain" loaded from state file 
DEBUG   Loading "Cluster Name"...                  
DEBUG     Loading "Base Domain"...                 
DEBUG   Using "Cluster Name" loaded from state file 
DEBUG   Loading "Pull Secret"...                   
DEBUG   Using "Pull Secret" loaded from state file 
DEBUG   Loading "Platform"...                      
DEBUG Using "Install Config" loaded from state file 
DEBUG Reusing previously-fetched "Install Config"  
INFO Pulling debug logs from the bootstrap machine 
ERROR failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp: lookup test1-kf6ts-bootstrap on 127.0.0.1:53: no such host 
FATAL waiting for Kubernetes API: context deadline exceeded 

What you expected to happen?

I expected the installation to succeed.

How to reproduce it (as minimally and precisely as possible)?

I just followed the guide below:
https://github.com/openshift/installer/blob/master/docs/dev/libvirt/README.md
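
(For reference, roughly what that boils down to; this is a sketch rather than a verbatim copy of the doc, and the TAGS=libvirt build step is how I recall the dev docs describing a libvirt-enabled build, so double-check the README for the exact steps:)

# build the installer with libvirt support, then run the install with debug logging
TAGS=libvirt hack/build.sh
bin/openshift-install create cluster --log-level debug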

Other info

[root@origin-build auth]# virsh -c "qemu+tcp://192.168.122.1/system" domifaddr "test1-kf6ts-bootstrap"
setlocale: No such file or directory
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:2a:13:a8    ipv4         192.168.126.10/24

[root@origin-build auth]# virsh -c "qemu+tcp://192.168.122.1/system" domifaddr "test1-kf6ts-master-0"
setlocale: No such file or directory
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet1      52:54:00:a2:4e:97    ipv4         192.168.126.11/24

[root@origin-build auth]# virsh list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 5     test1-kf6ts-bootstrap          running
 6     test1-kf6ts-master-0           running


[root@origin-build .ssh]# ssh core@192.168.126.10

[core@test1-kf6ts-bootstrap ~]$ sudo journalctl -f -u bootkube -u openshift
-- Logs begin at Mon 2019-08-05 13:18:14 UTC. --
Aug 05 14:10:13 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:13 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:10:22 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:22 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:10:32 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:32 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:10:42 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:42 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml 

zeenix commented Aug 5, 2019

See if you can reach the nodes from the outside. If you can't reach them at all, you likely missed the required firewall settings. If you can reach them only through IP and not hostnames, then the DNS change likely didn't happen (make sure you told NetworkManager to reload its config).
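
(A quick way to tell those two cases apart from the host; the addresses and names below are just the ones from this report, so treat this as a rough sketch:)

# reach the nodes by IP first
ping -c1 192.168.126.10
ping -c1 192.168.126.11

# then check that the cluster names resolve through the local dnsmasq
dig +short api.test1.tt.testing @127.0.0.1
curl -k https://api.test1.tt.testing:6443/version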


ahrtr commented Aug 5, 2019

@zeenix Thanks for the feedback.

I could ssh into both the bootstrap and master nodes through IP, but it did not work using hostnames. I am pretty sure I configured & reloaded NetworkManager.

[root@origin-build .ssh]# ssh core@192.168.126.10
Red Hat Enterprise Linux CoreOS 420.8.20190708.2
WARNING: Direct SSH access to machines is not recommended.

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary service is "bootkube.service". To watch its status, run e.g.

  journalctl -b -f -u bootkube.service
Last login: Mon Aug  5 14:10:34 2019 from 192.168.126.1
[core@test1-kf6ts-bootstrap ~]$ 
[core@test1-kf6ts-bootstrap ~]$ sudo journalctl -f -u bootkube -u openshift
-- Logs begin at Mon 2019-08-05 13:18:14 UTC. --
Aug 05 14:21:43 test1-kf6ts-bootstrap bootkube.sh[11195]: Error: stat /assets/auth/kubeconfig-loopback: no such file or directory
Aug 05 14:21:47 test1-kf6ts-bootstrap bootkube.sh[11195]: Error: no container with name or ID etcd-signer found: no such container
Aug 05 14:21:47 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 14:21:47 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 3.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Aug 05 14:21:53 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://api.test1.tt.testing:6443/api?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
Aug 05 14:21:53 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:21:57 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:21:57.842183095 +0000 UTC m=+1.340224283 container create 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
Aug 05 14:22:00 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:22:00.479163547 +0000 UTC m=+3.977204195 container init 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
Aug 05 14:22:00 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:22:00.729559363 +0000 UTC m=+4.227599955 container start 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
Aug 05 14:22:00 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:22:00.734416093 +0000 UTC m=+4.232456801 container attach 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
^C
[core@test1-kf6ts-bootstrap ~]$ 



zeenix commented Aug 5, 2019

I could ssh into both the bootstrap and master nodes through IP, but it did not work using hostnames. I am pretty sure I configured & reloaded NetworkManager.

Right, DNS is your issue. Please verify that you followed every step precisely. Just to be sure, you have to use the fully-qualified hostnames. You can find them in the DHCP leases if you look at the XML of the cluster network that the installer creates.


zeenix commented Aug 5, 2019

Oh, and I recently wasted a lot of time when I missed the fact that both /etc/NetworkManager/conf.d/openshift.conf and /etc/NetworkManager/dnsmasq.d/openshift.conf need to be in place (and they have different content).
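
(For reference, a sketch of what the two files end up containing for the tt.testing base domain used here; adjust the domain and the 192.168.126.1 network address to your own setup, and double-check against the README:)

# /etc/NetworkManager/conf.d/openshift.conf: make NetworkManager run dnsmasq
cat <<'EOF' | sudo tee /etc/NetworkManager/conf.d/openshift.conf
[main]
dns=dnsmasq
EOF

# /etc/NetworkManager/dnsmasq.d/openshift.conf: forward the cluster domain to libvirt's dnsmasq
cat <<'EOF' | sudo tee /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/tt.testing/192.168.126.1
EOF

sudo systemctl reload NetworkManager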


ahrtr commented Aug 5, 2019

I just tried again, and I could actually ssh into the master node using the hostname as well; I just used the hostname printed in the log entries. So why did the installation fail in the end?

[root@origin-build .ssh]# ssh core@api.test1.tt.testing
The authenticity of host 'api.test1.tt.testing (192.168.126.11)' can't be established.
ECDSA key fingerprint is SHA256:R1QJs1q8d1s5xTeDRhB8TNlVJfvpw6uXEi0k7hKS8mg.
ECDSA key fingerprint is MD5:ff:d6:10:71:27:87:6a:d0:bc:35:22:cb:b5:92:a4:0c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'api.test1.tt.testing,192.168.126.11' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20190805.0
WARNING: Direct SSH access to machines is not recommended.

---
Last login: Mon Aug  5 14:14:39 2019 from 192.168.126.1
[core@test1-kf6ts-master-0 ~]$ 

The reason I previously failed to ssh into the nodes was that I used the wrong hostnames, namely the ones printed by virsh:

[root@origin-build auth]# virsh list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 5     test1-kf6ts-bootstrap          running
 6     test1-kf6ts-master-0           running


ahrtr commented Aug 5, 2019

You can find them in the DHCP leases if you look at the XML of the cluster network that the installer creates

Could you please provide more detailed info on how to get the fully-qualified hostnames?


ahrtr commented Aug 5, 2019

After sshing into the bootstrap node, I saw lots of error logs:

Aug 05 15:14:59 test1-c92ts-bootstrap openshift.sh[1364]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 15:15:01 test1-c92ts-bootstrap bootkube.sh[16999]: Starting etcd certificate signer...
Aug 05 15:15:02 test1-c92ts-bootstrap podman[19106]: 2019-08-05 15:15:02.624925772 +0000 UTC m=+0.737770954 container create 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:02 test1-c92ts-bootstrap podman[19106]: 2019-08-05 15:15:02.891953379 +0000 UTC m=+1.004798506 container init 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:02 test1-c92ts-bootstrap podman[19106]: 2019-08-05 15:15:02.974642661 +0000 UTC m=+1.087487889 container start 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:02 test1-c92ts-bootstrap bootkube.sh[16999]: 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd
Aug 05 15:15:02 test1-c92ts-bootstrap bootkube.sh[16999]: Waiting for etcd cluster...
Aug 05 15:15:03 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:03.716620023 +0000 UTC m=+0.701961654 container create ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:04.091268052 +0000 UTC m=+1.076609603 container init ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:04.134081564 +0000 UTC m=+1.119423024 container start ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:04.134180224 +0000 UTC m=+1.119521600 container attach ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap bootkube.sh[16999]: https://etcd-0.test1.tt.testing:2379 is healthy: successfully committed proposal: took = 2.658314ms
Aug 05 15:15:04 test1-c92ts-bootstrap openshift.sh[1364]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 15:15:04 test1-c92ts-bootstrap openshift.sh[1364]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 15:15:05 test1-c92ts-bootstrap bootkube.sh[16999]: etcd cluster up. Killing etcd certificate signer...
Aug 05 15:15:06 test1-c92ts-bootstrap podman[19248]: 2019-08-05 15:15:06.147191145 +0000 UTC m=+0.705867350 container died 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:06 test1-c92ts-bootstrap podman[19248]: 2019-08-05 15:15:06.63675544 +0000 UTC m=+1.195431457 container remove 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:06 test1-c92ts-bootstrap bootkube.sh[16999]: 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd
Aug 05 15:15:07 test1-c92ts-bootstrap bootkube.sh[16999]: Starting cluster-bootstrap...
Aug 05 15:15:07 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:07.75034373 +0000 UTC m=+0.614334098 container create 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:08 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:08.166334139 +0000 UTC m=+1.030323906 container init 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:08 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:08.216222363 +0000 UTC m=+1.080211908 container start 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:08 test1-c92ts-bootstrap bootkube.sh[16999]: Error: stat /assets/auth/kubeconfig-loopback: no such file or directory
Aug 05 15:15:08 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:08.216371491 +0000 UTC m=+1.080361120 container attach 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:10 test1-c92ts-bootstrap bootkube.sh[16999]: Error: no container with name or ID etcd-signer found: no such container
Aug 05 15:15:10 test1-c92ts-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 15:15:10 test1-c92ts-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Aug 05 15:15:10 test1-c92ts-bootstrap openshift.sh[1364]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://api.test1.tt.testing:6443/api?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
Aug 05 15:15:10 test1-c92ts-bootstrap openshift.sh[1364]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
^C


zeenix commented Aug 5, 2019

Could you please provide more detailed info on how to get the fully-qualified hostnames?

$ sudo virsh net-list
 Name           State    Autostart   Persistent
-------------------------------------------------
 default        active   yes         yes
 minikube-net   active   yes         yes
 test1-rrvvl    active   yes         yes
$ sudo virsh net-dumpxml test1-rrvvl
<network>
  <name>test1-rrvvl</name>
...
  <ip family='ipv4' address='192.168.126.1' prefix='24'>
    <dhcp>
      <range start='192.168.126.2' end='192.168.126.254'/>
      <host mac='52:54:00:8e:ab:76' name='test1-rrvvl-master-0.test1.tt.testing' ip='192.168.126.11'/>
      <host mac='52:54:00:4e:9a:4a' name='test1-rrvvl-bootstrap.test1.tt.testing' ip='192.168.126.10'/>
      <host mac='2a:94:ac:1a:85:fc' name='test1-rrvvl-worker-0-s7g7j' ip='192.168.126.51'/>
...

In this case, test1-rrvvl-master-0.test1.tt.testing and test1-rrvvl-bootstrap.test1.tt.testing are those fully-qualified hostnames.
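
(To pull just those DHCP host entries out without reading the whole XML, something like this works, using the network name from above:)

sudo virsh net-dumpxml test1-rrvvl | grep '<host '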


ahrtr commented Aug 6, 2019

I created a brand new VM and tried again following the guide precisely, and failed again. Once I configured NetworkManager to use dnsmasq and reloaded NetworkManager, all the original nameserver entries in /etc/resolv.conf were gone, and a new record "nameserver 127.0.0.1" was added.

Afterwards, when I tried to execute "bin/openshift-install create cluster --log-level debug", I got the following error:

DEBUG   Reusing previously-fetched "Worker Machines" 
DEBUG Generating "Terraform Variables"...          
INFO Fetching OS image: rhcos-42.80.20190725.1-qemu.qcow2 
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get libvirt Terraform variables: failed to use cached libvirt image: Get https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.2/42.80.20190725.1/rhcos-42.80.20190725.1-qemu.qcow2: dial tcp: lookup releases-art-rhcos.svc.ci.openshift.org on 127.0.0.1:53: server misbehaving 
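
(One thing worth checking at this point, as a sketch rather than something from the guide: whether the local dnsmasq still has an upstream resolver for non-cluster names. If not, an explicit server line in the same dnsmasq.d directory is a plausible workaround; the drop-in file name below is made up for illustration:)

# hypothetical drop-in: add an explicit upstream for everything outside the cluster domain
cat <<'EOF' | sudo tee /etc/NetworkManager/dnsmasq.d/upstream.conf
server=8.8.8.8
EOF
sudo systemctl reload NetworkManager

# both cluster and external names should now resolve via 127.0.0.1
dig +short api.test1.tt.testing @127.0.0.1
dig +short releases-art-rhcos.svc.ci.openshift.org @127.0.0.1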


jichenjc commented Aug 6, 2019

I actually used commit 885a442 (Aug 3), and at least I can create the VMs and the operators are mostly online. (I actually omitted the NetworkManager stuff.)

If it's just for test purposes, maybe you can leave the installer machine aside for now, ssh to the newly created machines, and check whether oc works?


zeenix commented Aug 6, 2019

I created a brand new VM

Ah, it's a nested virt case. Make sure you're not bitten by the famous CSR issue. About fully-qualified domain names, you need to be on the latest git master of the installer binary for that. I only recently fixed that.
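
(Checking for that usually amounts to looking for node CSRs stuck in Pending and approving them; a sketch with standard oc commands, where the kubeconfig path depends on your install directory:)

export KUBECONFIG=<install-dir>/auth/kubeconfig
oc get csr                                              # anything stuck in Pending?
oc get csr -o name | xargs oc adm certificate approve   # approve them all (fine for a throwaway dev cluster)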


ahrtr commented Aug 7, 2019

I actually used commit 885a442 (Aug 3), and at least I can create the VMs and the operators are mostly online. (I actually omitted the NetworkManager stuff.)

That's exactly what I am doing right now (i.e. ignoring the NetworkManager/DNS configuration). It took a long time for the majority of the operators/pods to be running, and both kubectl and oc work now. It seems I ran into the same issue as 1428. But the worker node did not get created, and the bootstrap node was not destroyed as expected, because the installer ended with "waiting for Kubernetes API: context deadline exceeded".

Ah, it's a nested virt case. Make sure you're not bitten by the famous CSR issue.

Thanks for the info. The fully-qualified domain issue was gone after using the latest master code. Regarding the CSR issue, I will take a look later.
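
(For the missing worker, these are the checks I plan to run; a sketch with standard oc commands, assuming the stock 4.x namespaces:)

oc get nodes
oc get machines -n openshift-machine-api   # is the worker Machine stuck provisioning?
oc get clusteroperators                    # which operators are still progressing or degraded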


ahrtr commented Aug 7, 2019

I tried again on a powerful physical machine instead of a VM, following the guide precisely, and it's much better now. Both the master node and the worker node were created, and the bootstrap node was destroyed as expected. But the installation still failed in the end:

DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
* Could not update deployment "openshift-insights/insights-operator" (297 of 447)
* Could not update deployment "openshift-insights/insights-operator" (354 of 447)
* Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (442 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (410 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (446 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-image-registry/image-registry" (416 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (426 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (430 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (434 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (145 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-machine-api/machine-api-operator" (96 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (436 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (419 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (422 of 447): the server does not recognize this resource, check extension API servers 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 90% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 90% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 97% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 98% complete, waiting on authentication, ingress, monitoring, openshift-samples 

DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, ingress, monitoring, openshift-samples 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, ingress, monitoring, openshift-samples 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, ingress, monitoring, openshift-samples 

Version

$ bin/openshift-install version
bin/openshift-install unreleased-master-1527-gdd407b0351339ec5fec540c82ecd556ec0dcb98c
built from commit dd407b0
release image registry.svc.ci.openshift.org/origin/release:4.2

Other info

Some pods are stuck in "Preempting" status, and some stay in "ContainerCreating":

openshift-console                                       console-54cf7c5b4b-d5snw                                          0/1     ContainerCreating   0          35m
openshift-console                                       console-54cf7c5b4b-kh45f                                          0/1     ContainerCreating   0          35m
openshift-console                                       console-6469748bb4-lswr4                                          0/1     ContainerCreating   0          35m
openshift-monitoring                                    alertmanager-main-0                                               0/3     ContainerCreating   0          35m
openshift-monitoring                                    alertmanager-main-1                                               0/3     ContainerCreating   0          35m
openshift-monitoring                                    alertmanager-main-2                                               0/3     ContainerCreating   0          35m
openshift-monitoring                                    cluster-monitoring-operator-6559f76df6-tvgt2                      1/1     Running             0          59m
openshift-monitoring                                    grafana-6f887959f5-rsdrq                                          0/2     ContainerCreating   0          35m
openshift-monitoring                                    prometheus-k8s-0                                                  0/6     ContainerCreating   0          29m
openshift-monitoring                                    prometheus-k8s-1                                                  0/6     ContainerCreating   0          29m
openshift-console-operator                              console-operator-7fd446b47-ptjll                                  0/1     Preempting          0          59m
openshift-operator-lifecycle-manager                    catalog-operator-779c4c9945-df6f5                                 0/1     Preempting          0          69m
openshift-service-catalog-apiserver-operator            openshift-service-catalog-apiserver-operator-6c69bd79ff-nbf25     0/1     Preempting          0          61m
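
(The usual way to see why these are stuck, as a sketch:)

oc -n openshift-monitoring describe pod prometheus-k8s-0                   # the Events section shows what it is waiting on
oc get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30       # recent cluster-wide events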


ahrtr commented Aug 8, 2019

Based on the definition of the "etcd-quorum-guard" deployment, the replica count is 3, and the 3 pods must run on different master nodes per the pod anti-affinity. So there should be at least 3 master nodes, otherwise 2 pods will always be in "Pending" status?

And there should be at least 2 worker nodes, because the two replicas of the "router-default" deployment must run on different worker nodes per the pod anti-affinity definition.

Since this is just for test purposes, it'd be better to be able to create only one master node and one worker node.
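
(That reasoning can be confirmed from the live objects; a sketch, and the namespaces may differ between releases:)

oc get deployment --all-namespaces | grep -E 'etcd-quorum-guard|router-default'
oc -n openshift-ingress get deployment router-default -o yaml | grep -A 12 'affinity:'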


zeenix commented Aug 8, 2019

I tried again on a powerful physical machine instead of a VM, following the guide precisely, and it's much better now. Both the master node and the worker node were created, and the bootstrap node was destroyed as expected. But the installation still failed in the end:

That's a very different issue now so please file a different one, unless there already is one (I think there is).

Did you check the CSRs on the VM? That's the only info needed here right now.


zeenix commented Aug 8, 2019

Did you check the CSRs on the VM? That's the only info needed here right now.

That's no longer needed if you create a new cluster now. But try reproducing the issue again on the VM, and if you can still reproduce the original issue, please check the CSRs to be sure.


ahrtr commented Aug 9, 2019

That's a very different issue now so please file a different one, unless there already is one (I think there is).

I assigned more memory and CPU to the master node, and it's working now.

But try reproducing the issue again on the VM, and if you can still reproduce the original issue, please check the CSRs to be sure

Installing OpenShift with libvirt on a VM is really time consuming; it normally takes at least half a day to reproduce/verify an issue. It took me a couple of days of fighting with the OpenShift installer with libvirt on a VM. I will treat this as a low-priority task, since I have just moved all my work to a powerful physical machine.


zeenix commented Aug 9, 2019

Installing OpenShift with libvirt on a VM is really time consuming; it normally takes at least half a day to reproduce/verify an issue. It took me a couple of days of fighting with the OpenShift installer with libvirt on a VM. I will treat this as a low-priority task, since I have just moved all my work to a powerful physical machine.

I understand. I'll close this for now then. Please let us know if you can reproduce the original issue and we can reopen it for you.

/close

@openshift-ci-robot

@zeenix: Closing this issue.

In response to this:

Installing OpenShift with libvirt on a VM is really time consuming; it normally takes at least half a day to reproduce/verify an issue. It took me a couple of days of fighting with the OpenShift installer with libvirt on a VM. I will treat this as a low-priority task, since I have just moved all my work to a powerful physical machine.

I understand. I'll close this for now then. Please let us know if you can reproduce the original issue and we can reopen it for you.

/close

