libvirt with kvm installation failed #2157

Closed · ahrtr opened this issue Aug 5, 2019 · 19 comments

ahrtr commented Aug 5, 2019

Version

$ openshift-install version
bin/openshift-install unreleased-master-1353-g526edca4f6646af4bbcdf48bc2c2ee1d44d9fada
built from commit 526edca4f6646af4bbcdf48bc2c2ee1d44d9fada
release image registry.svc.ci.openshift.org/origin/release:4.2

Platform:

[root@origin-build installer]# uname -a
Linux origin-build 3.10.0-957.21.3.el7.x86_64 #1 SMP Fri Jun 14 02:54:29 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@origin-build ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

What happened?

DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: no route to host 



DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: no route to host 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused 

DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 



DEBUG Still waiting for the Kubernetes API: Get https://api.test1.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: connection refused 
DEBUG Fetching "Install Config"...                 
DEBUG Loading "Install Config"...                  
DEBUG   Loading "SSH Key"...                       
DEBUG   Using "SSH Key" loaded from state file     
DEBUG   Loading "Base Domain"...                   
DEBUG     Loading "Platform"...                    
DEBUG     Using "Platform" loaded from state file  
DEBUG   Using "Base Domain" loaded from state file 
DEBUG   Loading "Cluster Name"...                  
DEBUG     Loading "Base Domain"...                 
DEBUG   Using "Cluster Name" loaded from state file 
DEBUG   Loading "Pull Secret"...                   
DEBUG   Using "Pull Secret" loaded from state file 
DEBUG   Loading "Platform"...                      
DEBUG Using "Install Config" loaded from state file 
DEBUG Reusing previously-fetched "Install Config"  
INFO Pulling debug logs from the bootstrap machine 
ERROR failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp: lookup test1-kf6ts-bootstrap on 127.0.0.1:53: no such host 
FATAL waiting for Kubernetes API: context deadline exceeded 

What you expected to happen?

I expected the installation to succeed.

How to reproduce it (as minimally and precisely as possible)?

I just followed the guide below:
https://github.com/openshift/installer/blob/master/docs/dev/libvirt/README.md
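
(For reference, roughly what that boils down to; this is a sketch rather than a verbatim copy of the doc, and the TAGS=libvirt build step is how I recall the dev docs describing a libvirt-enabled build, so double-check the README for the exact steps:)

# build the installer with libvirt support, then run the install with debug logging
TAGS=libvirt hack/build.sh
bin/openshift-install create cluster --log-level debug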

Other info

[root@origin-build auth]# virsh -c "qemu+tcp://192.168.122.1/system" domifaddr "test1-kf6ts-bootstrap"
setlocale: No such file or directory
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:2a:13:a8    ipv4         192.168.126.10/24

[root@origin-build auth]# virsh -c "qemu+tcp://192.168.122.1/system" domifaddr "test1-kf6ts-master-0"
setlocale: No such file or directory
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet1      52:54:00:a2:4e:97    ipv4         192.168.126.11/24

[root@origin-build auth]# virsh list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 5     test1-kf6ts-bootstrap          running
 6     test1-kf6ts-master-0           running


[root@origin-build .ssh]# ssh core@192.168.126.10

[core@test1-kf6ts-bootstrap ~]$ sudo journalctl -f -u bootkube -u openshift
-- Logs begin at Mon 2019-08-05 13:18:14 UTC. --
Aug 05 14:10:13 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:13 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:10:22 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:22 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:10:32 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:32 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:10:42 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 14:10:42 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml 

zeenix commented Aug 5, 2019

See if you can reach the nodes from the outside. If you can't reach them at all, you likely missed the required firewall settings. If you can reach them only through IP and not hostnames, then the DNS change likely didn't happen (make sure you told NetworkManager to reload its config).
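
(A quick way to tell those two cases apart from the host; the addresses and names below are just the ones from this report, so treat this as a rough sketch:)

# reach the nodes by IP first
ping -c1 192.168.126.10
ping -c1 192.168.126.11

# then check that the cluster names resolve through the local dnsmasq
dig +short api.test1.tt.testing @127.0.0.1
curl -k https://api.test1.tt.testing:6443/version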


ahrtr commented Aug 5, 2019

@zeenix Thanks for the feedback.

I could ssh into both the bootstrap and master nodes through IP, but it did not work using hostnames. I am pretty sure I configured & reloaded NetworkManager.

[root@origin-build .ssh]# ssh core@192.168.126.10
Red Hat Enterprise Linux CoreOS 420.8.20190708.2
WARNING: Direct SSH access to machines is not recommended.

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary service is "bootkube.service". To watch its status, run e.g.

  journalctl -b -f -u bootkube.service
Last login: Mon Aug  5 14:10:34 2019 from 192.168.126.1
[core@test1-kf6ts-bootstrap ~]$ 
[core@test1-kf6ts-bootstrap ~]$ sudo journalctl -f -u bootkube -u openshift
-- Logs begin at Mon 2019-08-05 13:18:14 UTC. --
Aug 05 14:21:43 test1-kf6ts-bootstrap bootkube.sh[11195]: Error: stat /assets/auth/kubeconfig-loopback: no such file or directory
Aug 05 14:21:47 test1-kf6ts-bootstrap bootkube.sh[11195]: Error: no container with name or ID etcd-signer found: no such container
Aug 05 14:21:47 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 14:21:47 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 3.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Aug 05 14:21:52 test1-kf6ts-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Aug 05 14:21:53 test1-kf6ts-bootstrap openshift.sh[1362]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://api.test1.tt.testing:6443/api?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
Aug 05 14:21:53 test1-kf6ts-bootstrap openshift.sh[1362]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 14:21:57 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:21:57.842183095 +0000 UTC m=+1.340224283 container create 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
Aug 05 14:22:00 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:22:00.479163547 +0000 UTC m=+3.977204195 container init 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
Aug 05 14:22:00 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:22:00.729559363 +0000 UTC m=+4.227599955 container start 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
Aug 05 14:22:00 test1-kf6ts-bootstrap podman[14031]: 2019-08-05 14:22:00.734416093 +0000 UTC m=+4.232456801 container attach 6b0770d5fbe99d7401ff2f720cbc4f71858e0f5574edfda2c89d8a0e447eda3d (image=registry.svc.ci.openshift.org/origin/release:4.2, name=reverent_germain)
^C
[core@test1-kf6ts-bootstrap ~]$ 



zeenix commented Aug 5, 2019

I could ssh into both the bootstrap and master nodes through IP, but it did not work using hostnames. I am pretty sure I configured & reloaded NetworkManager.

Right, DNS is your issue. Please verify that you followed every step precisely. Just to be sure, you have to use the fully-qualified hostnames. You can find them in the DHCP leases if you look at the XML of the cluster network that the installer creates.


zeenix commented Aug 5, 2019

Oh, and I recently wasted a lot of time when I missed the fact that both /etc/NetworkManager/conf.d/openshift.conf and /etc/NetworkManager/dnsmasq.d/openshift.conf need to be in place (and they have different content).
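
(For reference, a sketch of what the two files end up containing for the tt.testing base domain used here; adjust the domain and the 192.168.126.1 network address to your own setup, and double-check against the README:)

# /etc/NetworkManager/conf.d/openshift.conf: make NetworkManager run dnsmasq
cat <<'EOF' | sudo tee /etc/NetworkManager/conf.d/openshift.conf
[main]
dns=dnsmasq
EOF

# /etc/NetworkManager/dnsmasq.d/openshift.conf: forward the cluster domain to libvirt's dnsmasq
cat <<'EOF' | sudo tee /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/tt.testing/192.168.126.1
EOF

sudo systemctl reload NetworkManager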


ahrtr commented Aug 5, 2019

I just tried again, and I could actually ssh into the master node using the hostname as well; I just used the hostname printed in the log entries. So why did the installation fail in the end?

[root@origin-build .ssh]# ssh core@api.test1.tt.testing
The authenticity of host 'api.test1.tt.testing (192.168.126.11)' can't be established.
ECDSA key fingerprint is SHA256:R1QJs1q8d1s5xTeDRhB8TNlVJfvpw6uXEi0k7hKS8mg.
ECDSA key fingerprint is MD5:ff:d6:10:71:27:87:6a:d0:bc:35:22:cb:b5:92:a4:0c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'api.test1.tt.testing,192.168.126.11' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20190805.0
WARNING: Direct SSH access to machines is not recommended.

---
Last login: Mon Aug  5 14:14:39 2019 from 192.168.126.1
[core@test1-kf6ts-master-0 ~]$ 

The reason I previously failed to ssh into the nodes was that I used the wrong hostnames, namely the ones printed by virsh:

[root@origin-build auth]# virsh list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 5     test1-kf6ts-bootstrap          running
 6     test1-kf6ts-master-0           running


ahrtr commented Aug 5, 2019

You can find them in the DHCP leases if you look at the XML of the cluster network that the installer creates

Could you please provide more detailed info on how to get the fully-qualified hostnames?


ahrtr commented Aug 5, 2019

After sshing into the bootstrap node, I saw lots of error logs:

Aug 05 15:14:59 test1-c92ts-bootstrap openshift.sh[1364]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 15:15:01 test1-c92ts-bootstrap bootkube.sh[16999]: Starting etcd certificate signer...
Aug 05 15:15:02 test1-c92ts-bootstrap podman[19106]: 2019-08-05 15:15:02.624925772 +0000 UTC m=+0.737770954 container create 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:02 test1-c92ts-bootstrap podman[19106]: 2019-08-05 15:15:02.891953379 +0000 UTC m=+1.004798506 container init 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:02 test1-c92ts-bootstrap podman[19106]: 2019-08-05 15:15:02.974642661 +0000 UTC m=+1.087487889 container start 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:02 test1-c92ts-bootstrap bootkube.sh[16999]: 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd
Aug 05 15:15:02 test1-c92ts-bootstrap bootkube.sh[16999]: Waiting for etcd cluster...
Aug 05 15:15:03 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:03.716620023 +0000 UTC m=+0.701961654 container create ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:04.091268052 +0000 UTC m=+1.076609603 container init ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:04.134081564 +0000 UTC m=+1.119423024 container start ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap podman[19158]: 2019-08-05 15:15:04.134180224 +0000 UTC m=+1.119521600 container attach ebe78fd6604460cb8b361a6a87a9dce0f21f49525ef7c3445ce3088a38939bfe (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:7da423eb784e636f6183469e25f6cecb61910800b98d82dfbf7e6bb507dbfc1c, name=etcdctl)
Aug 05 15:15:04 test1-c92ts-bootstrap bootkube.sh[16999]: https://etcd-0.test1.tt.testing:2379 is healthy: successfully committed proposal: took = 2.658314ms
Aug 05 15:15:04 test1-c92ts-bootstrap openshift.sh[1364]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": no matches for kind "Secret" in version "v1"
Aug 05 15:15:04 test1-c92ts-bootstrap openshift.sh[1364]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
Aug 05 15:15:05 test1-c92ts-bootstrap bootkube.sh[16999]: etcd cluster up. Killing etcd certificate signer...
Aug 05 15:15:06 test1-c92ts-bootstrap podman[19248]: 2019-08-05 15:15:06.147191145 +0000 UTC m=+0.705867350 container died 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:06 test1-c92ts-bootstrap podman[19248]: 2019-08-05 15:15:06.63675544 +0000 UTC m=+1.195431457 container remove 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:60e198064a00f736a3e6338a68350c2c2efd8118d34897c6044c56640fb30e0e, name=etcd-signer)
Aug 05 15:15:06 test1-c92ts-bootstrap bootkube.sh[16999]: 4b47857fd0dd61ec0e5e9392ab83808ec3c8d3c5d44c700b47d085d1fd14d3fd
Aug 05 15:15:07 test1-c92ts-bootstrap bootkube.sh[16999]: Starting cluster-bootstrap...
Aug 05 15:15:07 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:07.75034373 +0000 UTC m=+0.614334098 container create 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:08 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:08.166334139 +0000 UTC m=+1.030323906 container init 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:08 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:08.216222363 +0000 UTC m=+1.080211908 container start 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:08 test1-c92ts-bootstrap bootkube.sh[16999]: Error: stat /assets/auth/kubeconfig-loopback: no such file or directory
Aug 05 15:15:08 test1-c92ts-bootstrap podman[19310]: 2019-08-05 15:15:08.216371491 +0000 UTC m=+1.080361120 container attach 2aeb5a293561dbef032c87cce063b88d215270d1e03419ec3fcd8d8a18be8ae1 (image=registry.svc.ci.openshift.org/origin/4.2-2019-08-05-143844@sha256:b098468ed18de0ac00d92afeab25044d00c4e7fb503ec1f182cf42adb1ea7061, name=goofy_dhawan)
Aug 05 15:15:10 test1-c92ts-bootstrap bootkube.sh[16999]: Error: no container with name or ID etcd-signer found: no such container
Aug 05 15:15:10 test1-c92ts-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 15:15:10 test1-c92ts-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Aug 05 15:15:10 test1-c92ts-bootstrap openshift.sh[1364]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://api.test1.tt.testing:6443/api?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
Aug 05 15:15:10 test1-c92ts-bootstrap openshift.sh[1364]: kubectl create --filename ./99_kubeadmin-password-secret.yaml failed. Retrying in 5 seconds...
^C


zeenix commented Aug 5, 2019

Could you please provide more detailed info on how to get the fully-qualified hostnames?

$ sudo virsh net-list
 Name           State    Autostart   Persistent
-------------------------------------------------
 default        active   yes         yes
 minikube-net   active   yes         yes
 test1-rrvvl    active   yes         yes
$ sudo virsh net-dumpxml test1-rrvvl
<network>
  <name>test1-rrvvl</name>
...
  <ip family='ipv4' address='192.168.126.1' prefix='24'>
    <dhcp>
      <range start='192.168.126.2' end='192.168.126.254'/>
      <host mac='52:54:00:8e:ab:76' name='test1-rrvvl-master-0.test1.tt.testing' ip='192.168.126.11'/>
      <host mac='52:54:00:4e:9a:4a' name='test1-rrvvl-bootstrap.test1.tt.testing' ip='192.168.126.10'/>
      <host mac='2a:94:ac:1a:85:fc' name='test1-rrvvl-worker-0-s7g7j' ip='192.168.126.51'/>
...

In this case, test1-rrvvl-master-0.test1.tt.testing and test1-rrvvl-bootstrap.test1.tt.testing are those fully-qualified hostnames.
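
(To pull just those DHCP host entries out without reading the whole XML, something like this works, using the network name from above:)

sudo virsh net-dumpxml test1-rrvvl | grep '<host '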


ahrtr commented Aug 6, 2019

I created a brand new VM and tried again following the guide precisely, and failed again. Once I configured NetworkManager to use dnsmasq and reloaded NetworkManager, all the original nameserver entries in /etc/resolv.conf were gone, and a new record "nameserver 127.0.0.1" was added.

Afterwards, when I tried to execute "bin/openshift-install create cluster --log-level debug", I got the following error:

DEBUG   Reusing previously-fetched "Worker Machines" 
DEBUG Generating "Terraform Variables"...          
INFO Fetching OS image: rhcos-42.80.20190725.1-qemu.qcow2 
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get libvirt Terraform variables: failed to use cached libvirt image: Get https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.2/42.80.20190725.1/rhcos-42.80.20190725.1-qemu.qcow2: dial tcp: lookup releases-art-rhcos.svc.ci.openshift.org on 127.0.0.1:53: server misbehaving 
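
(One thing worth checking at this point, as a sketch rather than something from the guide: whether the local dnsmasq still has an upstream resolver for non-cluster names. If not, an explicit server line in the same dnsmasq.d directory is a plausible workaround; the drop-in file name below is made up for illustration:)

# hypothetical drop-in: add an explicit upstream for everything outside the cluster domain
cat <<'EOF' | sudo tee /etc/NetworkManager/dnsmasq.d/upstream.conf
server=8.8.8.8
EOF
sudo systemctl reload NetworkManager

# both cluster and external names should now resolve via 127.0.0.1
dig +short api.test1.tt.testing @127.0.0.1
dig +short releases-art-rhcos.svc.ci.openshift.org @127.0.0.1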


jichenjc commented Aug 6, 2019

I actually used commit 885a442 (Aug 3), and at least I can create the VMs and the operators are mostly online. (I actually omitted the NetworkManager stuff.)

If it's just for test purposes, maybe you can leave the installer machine aside for now, ssh to the newly created machines, and check whether oc works?


zeenix commented Aug 6, 2019

I created a brand new VM

Ah, it's a nested virt case. Make sure you're not bitten by the famous CSR issue. About fully-qualified domain names, you need to be on the latest git master of the installer binary for that. I only recently fixed that.
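
(Checking for that usually amounts to looking for node CSRs stuck in Pending and approving them; a sketch with standard oc commands, where the kubeconfig path depends on your install directory:)

export KUBECONFIG=<install-dir>/auth/kubeconfig
oc get csr                                              # anything stuck in Pending?
oc get csr -o name | xargs oc adm certificate approve   # approve them all (fine for a throwaway dev cluster)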


ahrtr commented Aug 7, 2019

I actually used commit 885a442 (Aug 3), and at least I can create the VMs and the operators are mostly online. (I actually omitted the NetworkManager stuff.)

That's exactly what I am doing right now (i.e. ignoring the NetworkManager/DNS configuration). It took a long time for the majority of the operators/pods to be running, and both kubectl and oc work now. It seems I ran into the same issue as 1428. But the worker node did not get created, and the bootstrap node was not destroyed as expected, because the installer ended with "waiting for Kubernetes API: context deadline exceeded".

Ah, it's a nested virt case. Make sure you're not bitten by the famous CSR issue.

Thanks for the info. The fully-qualified domain issue was gone after using the latest master code. Regarding the CSR issue, I will take a look later.
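
(For the missing worker, these are the checks I plan to run; a sketch with standard oc commands, assuming the stock 4.x namespaces:)

oc get nodes
oc get machines -n openshift-machine-api   # is the worker Machine stuck provisioning?
oc get clusteroperators                    # which operators are still progressing or degraded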


ahrtr commented Aug 7, 2019

I tried again on a powerful physical machine instead of a VM, following the guide precisely, and it's much better now. Both the master node and the worker node were created, and the bootstrap node was destroyed as expected. But the installation still failed in the end:

DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
* Could not update deployment "openshift-insights/insights-operator" (297 of 447)
* Could not update deployment "openshift-insights/insights-operator" (354 of 447)
* Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (442 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (410 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (446 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-image-registry/image-registry" (416 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (426 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (430 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (434 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (145 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-machine-api/machine-api-operator" (96 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (436 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (419 of 447): the server does not recognize this resource, check extension API servers
* Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (422 of 447): the server does not recognize this resource, check extension API servers 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 90% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 90% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 97% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-07-024236: 98% complete, waiting on authentication, ingress, monitoring, openshift-samples 

DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, ingress, monitoring, openshift-samples 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, ingress, monitoring, openshift-samples 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, ingress, monitoring, openshift-samples 

Version

$ bin/openshift-install version
bin/openshift-install unreleased-master-1527-gdd407b0351339ec5fec540c82ecd556ec0dcb98c
built from commit dd407b0
release image registry.svc.ci.openshift.org/origin/release:4.2

Other info

Some pods are stuck in "Preempting" status, and some stay in "ContainerCreating":

openshift-console                                       console-54cf7c5b4b-d5snw                                          0/1     ContainerCreating   0          35m
openshift-console                                       console-54cf7c5b4b-kh45f                                          0/1     ContainerCreating   0          35m
openshift-console                                       console-6469748bb4-lswr4                                          0/1     ContainerCreating   0          35m
openshift-monitoring                                    alertmanager-main-0                                               0/3     ContainerCreating   0          35m
openshift-monitoring                                    alertmanager-main-1                                               0/3     ContainerCreating   0          35m
openshift-monitoring                                    alertmanager-main-2                                               0/3     ContainerCreating   0          35m
openshift-monitoring                                    cluster-monitoring-operator-6559f76df6-tvgt2                      1/1     Running             0          59m
openshift-monitoring                                    grafana-6f887959f5-rsdrq                                          0/2     ContainerCreating   0          35m
openshift-monitoring                                    prometheus-k8s-0                                                  0/6     ContainerCreating   0          29m
openshift-monitoring                                    prometheus-k8s-1                                                  0/6     ContainerCreating   0          29m
openshift-console-operator                              console-operator-7fd446b47-ptjll                                  0/1     Preempting          0          59m
openshift-operator-lifecycle-manager                    catalog-operator-779c4c9945-df6f5                                 0/1     Preempting          0          69m
openshift-service-catalog-apiserver-operator            openshift-service-catalog-apiserver-operator-6c69bd79ff-nbf25     0/1     Preempting          0          61m
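
(The usual way to see why these are stuck, as a sketch:)

oc -n openshift-monitoring describe pod prometheus-k8s-0                   # the Events section shows what it is waiting on
oc get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30       # recent cluster-wide events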


ahrtr commented Aug 8, 2019

Based on the definition of the "etcd-quorum-guard" deployment, the replica count is 3, and the 3 pods must run on different master nodes per the pod anti-affinity. So there should be at least 3 master nodes, otherwise 2 pods will always be in "Pending" status?

And there should be at least 2 worker nodes, because the two replicas of the "router-default" deployment must run on different worker nodes per the pod anti-affinity definition.

Since this is just for test purposes, it'd be better to be able to create only one master node and one worker node.
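
(That reasoning can be confirmed from the live objects; a sketch, and the namespaces may differ between releases:)

oc get deployment --all-namespaces | grep -E 'etcd-quorum-guard|router-default'
oc -n openshift-ingress get deployment router-default -o yaml | grep -A 12 'affinity:'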


zeenix commented Aug 8, 2019

I tried again on a powerful physical machine instead of a VM, following the guide precisely, and it's much better now. Both the master node and the worker node were created, and the bootstrap node was destroyed as expected. But the installation still failed in the end:

That's a very different issue now so please file a different one, unless there already is one (I think there is).

Did you check the CSRs on the VM? That's the only info needed here right now.


zeenix commented Aug 8, 2019

Did you check the CSRs on the VM? That's the only info needed here right now.

That's no longer needed if you create a new cluster now. But try reproducing the issue again on the VM, and if you can still reproduce the original issue, please check the CSRs to be sure.


ahrtr commented Aug 9, 2019

That's a very different issue now so please file a different one, unless there already is one (I think there is).

I assigned more memory and CPU to the master node, and it's working now.

But try reproducing the issue again on the VM, and if you can still reproduce the original issue, please check the CSRs to be sure

Installing OpenShift with libvirt on a VM is really time consuming; it normally takes at least half a day to reproduce/verify an issue. It took me a couple of days of fighting with the OpenShift installer with libvirt on a VM. I will treat this as a low-priority task, since I have just moved all my work to a powerful physical machine.


zeenix commented Aug 9, 2019

Installing OpenShift with libvirt on a VM is really time consuming; it normally takes at least half a day to reproduce/verify an issue. It took me a couple of days of fighting with the OpenShift installer with libvirt on a VM. I will treat this as a low-priority task, since I have just moved all my work to a powerful physical machine.

I understand. I'll close this for now then. Please let us know if you can reproduce the original issue and we can reopen it for you.

/close

@openshift-ci-robot

@zeenix: Closing this issue.

In response to this:

Installing OpenShift with libvirt on a VM is really time consuming; it normally takes at least half a day to reproduce/verify an issue. It took me a couple of days of fighting with the OpenShift installer with libvirt on a VM. I will treat this as a low-priority task, since I have just moved all my work to a powerful physical machine.

I understand. I'll close this for now then. Please let us know if you can reproduce the original issue and we can reopen it for you.

/close

