Upgrade from 3.9 to 3.10 fails while waiting for master node to be ready #10690

Open
dharmit opened this Issue Nov 14, 2018 · 7 comments

dharmit commented Nov 14, 2018

Description

I ran the upgrade playbook after modifying the hosts file to make it 3.10 friendly (mainly the openshift_node_groups changes), but it fails in the task TASK [openshift_node : Wait for node to be ready].

We have 1 master and 10 nodes in the setup, all on the CentOS 7.5.1804 release.

Version
  • Your ansible version per ansible --version

    $ ansible --version
    ansible 2.6.3
      config file = /etc/ansible/ansible.cfg
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /usr/bin/ansible
      python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

If you're running from playbooks installed via RPM

  • The output of rpm -q openshift-ansible

    $ rpm -q openshift-ansible
    openshift-ansible-3.10.68-1.git.0.f908cf5.el7.noarch
Steps To Reproduce
  1. Run the openshift_node_group playbook against the hosts file:

    $ ansible-playbook -i hosts.310 /usr/share/ansible/openshift-ansible/playbooks/openshift-master/openshift_node_group.yml
  2. Upgrade the cluster with the following playbook:

    $ ansible-playbook -i hosts.310 /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml
Expected Results

Successful cluster upgrade

Observed Results
TASK [openshift_node : Wait for node to be ready] ***************************************************************************************************[60/4553]
FAILED - RETRYING: Wait for node to be ready (36 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (35 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (34 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (33 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (32 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (31 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (30 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (29 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (28 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (27 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (26 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (25 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (24 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (23 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (22 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (21 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (20 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (19 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (18 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (17 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (16 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (15 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (14 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (13 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (12 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (11 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (10 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (9 retries left).                                                                                                
FAILED - RETRYING: Wait for node to be ready (8 retries left).                                                                                                
FAILED - RETRYING: Wait for node to be ready (7 retries left).                                                                                                
FAILED - RETRYING: Wait for node to be ready (6 retries left).                                                                                                
FAILED - RETRYING: Wait for node to be ready (5 retries left).                                                                                                
FAILED - RETRYING: Wait for node to be ready (4 retries left).
FAILED - RETRYING: Wait for node to be ready (3 retries left).
FAILED - RETRYING: Wait for node to be ready (2 retries left).
FAILED - RETRYING: Wait for node to be ready (1 retries left).
fatal: [os-master-1.example.com -> os-master-1.example.com]: FAILED! => {"attempts": 36, "changed": false, "results": {"cmd": "/usr/bin/oc get node os-master-1.example.com -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2018-10-26T11:53:30Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/os": "linux", "kubernetes.io/hostname": "os-master-1.example.com", "node-role.kubernetes.io/master": "true", "node-type": "metrics", "purpose": "infra", "region": "infra", "size": "large1", "zone": "default"}, "name": "os-master-1.example.com", "resourceVersion": "3784921", "selfLink": "/api/v1/nodes/os-master-1.example.com", "uid": "c19deef7-d915-11e8-9d8d-5254007d3f61"}, "spec": {"externalID": "os-master-1.example.com"}, "status": {"addresses": [{"address": "192.168.122.167", "type": "InternalIP"}, {"address": "os-master-1.example.com", "type": "Hostname"}], "allocatable": {"cpu": "16", "hugepages-2Mi": "0", "memory": "16162924Ki", "pods": "250"}, "capacity": {"cpu": "16", "hugepages-2Mi": "0", "memory": "16265324Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": "2018-11-14T04:06:07Z", "lastTransitionTime": "2018-11-14T04:01:49Z", "message": "kubelet has sufficient disk space available", "reason": "KubeletHasSufficientDisk", "status": "False", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-11-14T04:06:07Z", "lastTransitionTime": "2018-11-14T04:01:49Z", "message": "kubelet has sufficient memory available", "reason": "KubeletHasSufficientMemory", "status": "False", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-11-14T04:06:07Z", "lastTransitionTime": "2018-11-14T04:01:49Z", "message": "kubelet has no disk pressure", "reason": "KubeletHasNoDiskPressure", "status": "False", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-11-14T04:06:07Z", "lastTransitionTime": "2018-11-14T04:01:49Z", "message": "runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized", "reason": "KubeletNotReady", "status": "False", "type": "Ready"}, {"lastHeartbeatTime": "2018-11-14T04:06:07Z", "lastTransitionTime": "2018-11-14T03:31:57Z", "message": "kubelet has sufficient PID available", "reason": "KubeletHasSufficientPID", "status": "False", "type": "PIDPressure"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": [{"names": ["docker.io/openshift/jenkins-2-centos7@sha256:59b9b12acf8e048186fac8c9fce2759a4d7883fb1b1be33a778d236b573f720d"], "sizeBytes": 2220265842}, {"names": ["docker.io/mrsiano/grafana-ocp@sha256:c3df94b5c3aaf16c5b393780939d30073ac897b6bdd037b2aeb64e9a52581490", "docker.io/mrsiano/grafana-ocp:latest"], "sizeBytes": 1626551984}, {"names": ["192.168.122.25:5000/pipeline-images/ccp-openshift-slave@sha256:794a4c4afc734484fccbbc5643b8a7977ced892f15cea02c75ad9c7e808bd076", "192.168.122.25:5000/pipeline-images/ccp-openshift-slave:latest"], "sizeBytes": 1290266381}, {"names": ["docker.io/openshift/origin-haproxy-router@sha256:8f2ecdd9b0dc99b22d8f274970933ca205cc2252a0623db9657154493135949d", "docker.io/openshift/origin-haproxy-router:v3.9.0"], "sizeBytes": 1284810579}, {"names": ["docker.io/openshift/origin-node@sha256:6ce287ea36036a97c5cd01e4689bdad001218945ce912214c9097c03b76c4cb8", "docker.io/openshift/origin-node:v3.10"], "sizeBytes": 1272263681}, {"names": 
["docker.io/openshift/origin-deployer@sha256:59ad668b7ba2a216d89c88f83e39a8564c8e2056cdb5931a2bf9aafb6af3dd99", "docker.io/openshift/origin-deployer:v3.9.0"], "sizeBytes": 1261176213}, {"names": ["registry.centos.org/openshift/jenkins-slave-base-centos7@sha256:1497a56aaee28dae6432e733d117581232d3dcbd7b9cdba498c991713cc4ae3d", "registry.centos.org/openshift/jenkins-slave-base-centos7:latest"], "sizeBytes": 1039211073}, {"names": ["docker.io/openshift/origin-web-console@sha256:3e68a21afb90a66e1e8fcc4ac31272d397d5e6af188fdb9431d4d19a60ae5298", "docker.io/openshift/origin-web-console:v3.9.0"], "sizeBytes": 495221706}, {"names": ["docker.io/openshift/origin-docker-registry@sha256:4e0f264808067e1c20ae16acc07820374cfe68c1279cfab20e7abdaa5b5ba617", "docker.io/openshift/origin-docker-registry:v3.9.0"], "sizeBytes": 465026624}, {"names": ["docker.io/openshift/prometheus@sha256:35e2e0efc874c055be60a025874256816c98b9cebc10f259d7fb806bbe68badf", "docker.io/openshift/prometheus:v2.2.1"], "sizeBytes": 317896379}, {"names": ["docker.io/openshift/oauth-proxy@sha256:4b73830ee6f7447d0921eedc3946de50016eb8f048d66ea3969abc4116f1e42a", "docker.io/openshift/oauth-proxy:v1.0.0"], "sizeBytes": 228241928}, {"names": ["docker.io/openshift/origin-pod@sha256:efaaea43661e8cbdf4bb5527039925dfa78b58dd90453553f096c48dc1cf95b8", "docker.io/openshift/origin-pod:v3.10"], "sizeBytes": 223970706}, {"names": ["docker.io/openshift/origin-pod@sha256:38e2dcbe2edfa202c5aabbdd00932678602962ca05499195fab8adba7dc22c16", "docker.io/openshift/origin-pod:v3.9.0"], "sizeBytes": 222604299}, {"names": ["docker.io/openshift/prometheus-alertmanager@sha256:41eef9535dfd91cd98f4f1a81465c8fee660717b46145481810cab7f55984417", "docker.io/openshift/prometheus-alertmanager:v0.13.0"], "sizeBytes": 221548139}, {"names": ["docker.io/openshift/prometheus-node-exporter@sha256:26aef60fed4dc03fc6cb9b0020d2a4cd90debf31cef49315caaf8a45632873f9", "docker.io/openshift/prometheus-node-exporter:v0.15.2"], "sizeBytes": 216275407}, {"names": ["docker.io/openshift/prometheus-alert-buffer@sha256:076f8dd576806f5c2dde7e536d020c31aa7d2ec7dcea52da6cbb944895def7ba", "docker.io/openshift/prometheus-alert-buffer:v0.0.2"], "sizeBytes": 200521084}], "nodeInfo": {"architecture": "amd64", "bootID": "a851cee3-a36c-4263-a264-28107aefae5b", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-862.14.4.el7.x86_64", "kubeProxyVersion": "v1.10.0+b81c8f8", "kubeletVersion": "v1.10.0+b81c8f8", "machineID": "24a545a32b484880a63efa2d3c816b3e", "operatingSystem": "linux", "osImage": "CentOS Linux 7 (Core)", "systemUUID": "24A545A3-2B48-4880-A63E-FA2D3C816B3E"}}}], "returncode": 0}, "state": "list"}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry


PLAY RECAP ***************************************************************************************************************************************************
localhost                  : ok=13   changed=0    unreachable=0    failed=0                                                                                   
os-master-1.example.com    : ok=258  changed=48   unreachable=0    failed=1                                                                                   
os-node-1.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-10.example.com     : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-2.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-3.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-4.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-5.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-6.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-7.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-8.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
os-node-9.example.com      : ok=24   changed=1    unreachable=0    failed=0                                                                                   
                                                                                                                                                              
                                                                                                                                                              
                                                                                                                                                              
Failure summary:                                                                                                                                              
                                                                                                                                                              
                                                                                                                                                              
  1. Hosts:    os-master-1.example.com                                                                                                                        
     Play:     Update master nodes                                                                                                                            
     Task:     Wait for node to be ready                                                                                          
     Message:  Failed without returning a message.
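
The failing task is essentially polling oc get node and waiting for the Ready condition to turn True. The same condition can be pulled out directly with a jsonpath query like the one below (just an illustrative check, not something the playbook itself runs); per the JSON above, it keeps returning the CNI error:

    $ oc get node os-master-1.example.com -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}: {.status.conditions[?(@.type=="Ready")].message}'
    False: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized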

Output from the master:

[root@os-master-1 ~]# oc get nodes
NAME                      STATUS     ROLES     AGE       VERSION
os-master-1.example.com   NotReady   master    18d       v1.10.0+b81c8f8
os-node-1.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-10.example.com    Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-2.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-3.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-4.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-5.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-6.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-7.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-8.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
os-node-9.example.com     Ready      compute   18d       v1.9.1+a0ce1bc657
[root@os-master-1 ~]# oc version
oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://os-master-1.example.com:8443
openshift v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657
Additional Information

Inventory file:

JuozasA commented Nov 16, 2018

Are the previous openshift_node role tasks being skipped (the ones before this "Wait for node to be ready")? Also, what does the output of journalctl -xe on os-master-1.example.com say?

dharmit commented Nov 19, 2018

Here's the stdout before "Wait for node to be ready". I don't think there's anything unusual there, but I could be wrong.

Full output from journalctl -xe.

I don't get why it's not able to find the node-config.yaml file:

Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: if [[ "${retries}" -gt 40 ]]; then
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: echo "rror: Another process is currently listening on the CNI socket, exiting" 2>&1               
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: exit 1
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: fi
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: done
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: # if the node config doesn't exist yet, wait until it does
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: retries=0
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: while true; do
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: if [[ ! -f /etc/origin/node/node-config.yaml ]]; then
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: echo "warning: Cannot find existing node-config.yaml, waiting 15s ..." 2>&1
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: sleep 15 & wait
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: (( retries += 1 ))
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: else
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: break
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: fi
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: if [[ "${retries}" -gt 40 ]]; then
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: echo "error: No existing node-config.yaml, exiting" 2>&1
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: exit 1
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: fi
Nov 19 08:28:30 os-master-1.example.com origin-node[62740]: done

The file does exist:

[root@os-master-1 ~]# ls /etc/origin/node/node-config.yaml
/etc/origin/node/node-config.yaml
[root@os-master-1 ~]# ls -l /etc/origin/node/node-config.yaml
-rw-------. 1 root root 1824 Nov 19 08:33 /etc/origin/node/node-config.yaml
[root@os-master-1 ~]# getenforce 
Permissive

I'm not sure what's causing the other error about the CNI socket.
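
From the wrapper script above, the readiness checks boil down to a CNI socket and the CNI config directory, so both are worth a look. The socket path below is an assumption about where openshift-sdn keeps it on 3.10, so treat it as a guess:

    $ ls -l /var/run/openshift-sdn/    # assumed location of the SDN's CNI server socket
    $ ls -l /etc/cni/net.d/            # CNI config directory the kubelet reads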

dharmit commented Nov 19, 2018

Also, this is a fresh setup with nothing running on OpenShift. I installed 3.9 earlier this morning and then tried to upgrade. There's no Prometheus, no Grafana, no Jenkins, nothing.

dharmit commented Nov 19, 2018

It looks like etcd is failing to come up.

[root@os-master-1 ~]# master-logs etcd etcd
Component etcd is stopped or not running

There's no etcd container on the master node.
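
To be concrete, the kind of checks that show this (plain docker/systemctl, nothing from the playbooks):

    $ docker ps -a | grep -i etcd    # no etcd container shows up on the master
    $ systemctl is-active etcd       # whether the old rpm/systemd etcd unit is still running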

dharmit commented Nov 19, 2018

If I do

$ ansible-playbook -i hosts.310 /usr/share/ansible/openshift-ansible/playbooks/openshift-etcd/config.yml

it does bring up the etcd pod in the kube-system namespace, but its status stays in CrashLoopBackOff. This seems to be because etcd is already running as a systemd service on the master node. Here are the logs:

$ master-logs etcd etcd
2018-11-19 11:13:31.445199 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://192.168.122.91:2379
2018-11-19 11:13:31.445685 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt
2018-11-19 11:13:31.445743 I | pkg/flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
2018-11-19 11:13:31.445796 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/
2018-11-19 11:13:31.445827 I | pkg/flags: recognized and used environment variable ETCD_DEBUG=False
2018-11-19 11:13:31.445927 I | pkg/flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=2500
2018-11-19 11:13:31.446044 I | pkg/flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=500
2018-11-19 11:13:31.446100 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.122.91:2380
2018-11-19 11:13:31.446153 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=os-master-1.example.com=https://192.168.122.91:2380
2018-11-19 11:13:31.446194 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2018-11-19 11:13:31.446222 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
2018-11-19 11:13:31.446256 I | pkg/flags: recognized and used environment variable ETCD_KEY_FILE=/etc/etcd/server.key
2018-11-19 11:13:31.446288 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://192.168.122.91:2379
2018-11-19 11:13:31.446328 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://192.168.122.91:2380
2018-11-19 11:13:31.446362 I | pkg/flags: recognized and used environment variable ETCD_NAME=os-master-1.example.com
2018-11-19 11:13:31.446386 I | pkg/flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
2018-11-19 11:13:31.446415 I | pkg/flags: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=true
2018-11-19 11:13:31.446446 I | pkg/flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/etcd/peer.key
2018-11-19 11:13:31.446468 I | pkg/flags: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/ca.crt
2018-11-19 11:13:31.446552 I | pkg/flags: recognized and used environment variable ETCD_QUOTA_BACKEND_BYTES=4294967296
2018-11-19 11:13:31.446610 I | pkg/flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt
2018-11-19 11:13:31.446776 I | etcdmain: etcd Version: 3.2.22
2018-11-19 11:13:31.446809 I | etcdmain: Git SHA: 1674e682f
2018-11-19 11:13:31.446825 I | etcdmain: Go Version: go1.8.7
2018-11-19 11:13:31.446839 I | etcdmain: Go OS/Arch: linux/amd64
2018-11-19 11:13:31.446862 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
2018-11-19 11:13:31.447086 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-11-19 11:13:31.447258 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-11-19 11:13:31.447607 C | etcdmain: listen tcp 192.168.122.91:2380: bind: address already in use
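
That bind error means something is already listening on the etcd peer port, which fits etcd still running under systemd. A quick, illustrative way to see what owns the ports:

    $ ss -tlnp | grep -E ':2379|:2380'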

dharmit commented Nov 19, 2018

I stopped the etcd systemd service and executed the above etcd playbook again. That got etcd working as a pod, but the upgrade still fails with:

FAILED - RETRYING: Wait for node to be ready (3 retries left).                                                                                      [205/4757]
FAILED - RETRYING: Wait for node to be ready (2 retries left).                                                                                               
FAILED - RETRYING: Wait for node to be ready (1 retries left).                                                                                               
fatal: [os-master-1.example.com -> os-master-1.example.com]: FAILED! => {"attempts": 36, "changed": false, "results": {"cmd": "/usr/bin/oc get node os-master-
1.example.com -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-att
ach-detach": "true"}, "creationTimestamp": "2018-11-19T06:54:42Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/os": "linux", "kubernete
s.io/hostname": "os-master-1.example.com", "node-role.kubernetes.io/master": "true", "node-type": "metrics", "purpose": "infra", "region": "infra", "size": "l
arge1", "zone": "default"}, "name": "os-master-1.example.com", "resourceVersion": "44505", "selfLink": "/api/v1/nodes/os-master-1.example.com", "uid": "fdcf93
0a-ebc7-11e8-856e-5254005db2c9"}, "spec": {"externalID": "os-master-1.example.com"}, "status": {"addresses": [{"address": "192.168.122.91", "type": "InternalI
P"}, {"address": "os-master-1.example.com", "type": "Hostname"}], "allocatable": {"cpu": "16", "hugepages-2Mi": "0", "memory": "16162908Ki", "pods": "250"}, "
capacity": {"cpu": "16", "hugepages-2Mi": "0", "memory": "16265308Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": "2018-11-19T11:28:31Z", "lastTrans
itionTime": "2018-11-19T10:32:06Z", "message": "kubelet has sufficient disk space available", "reason": "KubeletHasSufficientDisk", "status": "False", "type":
 "OutOfDisk"}, {"lastHeartbeatTime": "2018-11-19T11:28:31Z", "lastTransitionTime": "2018-11-19T10:32:06Z", "message": "kubelet has sufficient memory available
", "reason": "KubeletHasSufficientMemory", "status": "False", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-11-19T11:28:31Z", "lastTransitionTime": "
2018-11-19T10:32:06Z", "message": "kubelet has no disk pressure", "reason": "KubeletHasNoDiskPressure", "status": "False", "type": "DiskPressure"}, {"lastHear
tbeatTime": "2018-11-19T11:28:31Z", "lastTransitionTime": "2018-11-19T10:32:06Z", "message": "runtime network not ready: NetworkReady=false reason:NetworkPlug
inNotReady message:docker: network plugin is not ready: cni config uninitialized", "reason": "KubeletNotReady", "status": "False", "type": "Ready"}, {"lastHea
rtbeatTime": "2018-11-19T11:28:31Z", "lastTransitionTime": "2018-11-19T07:56:40Z", "message": "kubelet has sufficient PID available", "reason": "KubeletHasSuf
ficientPID", "status": "False", "type": "PIDPressure"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": [{"names": ["docker.io/openshift/o
rigin-haproxy-router@sha256:8f2ecdd9b0dc99b22d8f274970933ca205cc2252a0623db9657154493135949d", "docker.io/openshift/origin-haproxy-router:v3.9.0"], "sizeBytes
": 1284810579}, {"names": ["docker.io/openshift/node@sha256:a6294e3d1bd6459c20e231b3276fb9dd47a8ac2db8c6a3cd258c7499e0d1d2a3", "docker.io/openshift/origin-nod
e@sha256:a6294e3d1bd6459c20e231b3276fb9dd47a8ac2db8c6a3cd258c7499e0d1d2a3", "docker.io/openshift/node:v3.10.0", "docker.io/openshift/origin-node:v3.10"], "siz
eBytes": 1272276032}, {"names": ["docker.io/openshift/origin-deployer@sha256:59ad668b7ba2a216d89c88f83e39a8564c8e2056cdb5931a2bf9aafb6af3dd99", "docker.io/ope
nshift/origin-deployer:v3.9.0"], "sizeBytes": 1261176213}, {"names": ["docker.io/openshift/origin-web-console@sha256:3e68a21afb90a66e1e8fcc4ac31272d397d5e6af1
88fdb9431d4d19a60ae5298", "docker.io/openshift/origin-web-console:v3.9.0"], "sizeBytes": 495221706}, {"names": ["docker.io/openshift/origin-docker-registry@sh
a256:4e0f264808067e1c20ae16acc07820374cfe68c1279cfab20e7abdaa5b5ba617", "docker.io/openshift/origin-docker-registry:v3.9.0"], "sizeBytes": 465026624}, {"names
": ["docker.io/openshift/origin-pod@sha256:2ffeb4d71a80922b9e62698100eef40284976385cf1f1b332edc8e921d48f4f5", "docker.io/openshift/origin-pod:v3.10"], "sizeBy
tes": 223970697}, {"names": ["docker.io/openshift/origin-pod@sha256:38e2dcbe2edfa202c5aabbdd00932678602962ca05499195fab8adba7dc22c16", "docker.io/openshift/or
igin-pod:v3.9.0"], "sizeBytes": 222604299}, {"names": ["quay.io/coreos/etcd@sha256:5b6691b7225a3f77a5a919a81261bbfb31283804418e187f7116a0a9ef65d21d", "quay.io
/coreos/etcd:latest"], "sizeBytes": 39454140}, {"names": ["quay.io/coreos/etcd@sha256:43fbc8a457aa0cb887da63d74a48659e13947cb74b96a53ba8f47abb6172a948", "quay
.io/coreos/etcd:v3.2.22"], "sizeBytes": 37269372}, {"names": ["quay.io/coreos/etcd@sha256:0a8a0dae8a4da722d594937f32d43d9dad231799ae65c78909bb2e5b95866c7b", "
quay.io/coreos/etcd:3.2"], "sizeBytes": 37232444}], "nodeInfo": {"architecture": "amd64", "bootID": "17d37f8e-dca5-4d08-bf72-a90d92ce2652", "containerRuntimeV
ersion": "docker://1.13.1", "kernelVersion": "3.10.0-862.14.4.el7.x86_64", "kubeProxyVersion": "v1.10.0+b81c8f8", "kubeletVersion": "v1.10.0+b81c8f8", "machin
eID": "5fef7650fcc442af93ea9d76d9f95c8e", "operatingSystem": "linux", "osImage": "CentOS Linux 7 (Core)", "systemUUID": "5FEF7650-FCC4-42AF-93EA-9D76D9F95C8E"
}}}], "returncode": 0}, "state": "list"}                                                                                                                 
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.retry               
                                                                                                                                                         
PLAY RECAP ***************************************************************************************************************************************************
localhost                  : ok=13   changed=0    unreachable=0    failed=0                                                                                   
os-master-1.example.com    : ok=256  changed=47   unreachable=0    failed=1                                                                                   
os-node-1.example.com      : ok=22   changed=0    unreachable=0    failed=0                                                                                   
os-node-10.example.com     : ok=22   changed=0    unreachable=0    failed=0

dharmit commented Nov 20, 2018

When I run oc describe node for the master node, it complains about the CNI config:

  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Tue, 20 Nov 2018 07:19:19 +0000   Tue, 20 Nov 2018 06:47:08 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Tue, 20 Nov 2018 07:19:19 +0000   Tue, 20 Nov 2018 06:47:08 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 20 Nov 2018 07:19:19 +0000   Tue, 20 Nov 2018 06:47:08 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            False   Tue, 20 Nov 2018 07:19:19 +0000   Tue, 20 Nov 2018 06:47:08 +0000   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
  PIDPressure      False   Tue, 20 Nov 2018 07:19:19 +0000   Mon, 19 Nov 2018 07:56:40 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available

Looking around for the error message in a few GitHub issues and RH Bugzilla reports, I found that an empty /etc/cni/net.d is what causes the trouble. I copied the contents of /etc/cni/net.d/80-openshift-network.conf from one of the nodes to the master, and oc get nodes started reporting Ready status for the master node.
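
For reference, that file is tiny; it's the standard openshift-sdn CNI config, roughly the following (the exact contents shown here are approximate):

    $ cat /etc/cni/net.d/80-openshift-network.conf
    {
      "cniVersion": "0.2.0",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }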

With that in place, I triggered the control plane upgrade using /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml once more, only to end up with the same error message and an empty /etc/cni/net.d/ directory.
