Installing 3.11 cluster fails with "Node start failed" #10774

Closed
dharmit opened this Issue Nov 27, 2018 · 3 comments

dharmit commented Nov 27, 2018

Description

I am trying to install OKD 3.11 on a cluster that previously had 3.9 installed and working fine. I removed 3.9 using the uninstall.yml playbook because I didn't get any help on the other issue I raised about the 3.9 to 3.10 upgrade failure #10690.

Version
  • Ansible

    $ ansible --version
    ansible 2.7.2
      config file = /etc/ansible/ansible.cfg
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /usr/bin/ansible
      python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
  • openshift-ansible RPM

    $ rpm -q openshift-ansible
    openshift-ansible-3.11.37-1.git.0.3b8b341.el7.noarch
Steps To Reproduce
  1. Try to install a 3.11 cluster with this hosts file.
Expected Results

Successful install

Observed Results

Installation fails with the following output:

TASK [openshift_control_plane : fail] ************************************************************************************************************************
fatal: [os-master-1.example.com]: FAILED! => {"changed": false, "msg": "Node start failed."}                                                                  
                                                                                                                                                              
NO MORE HOSTS LEFT *******************************************************************************************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.retry                                    
                                                                                               
PLAY RECAP ***************************************************************************************************************************************************
localhost                  : ok=11   changed=0    unreachable=0    failed=0                
os-master-1.example.com    : ok=286  changed=41   unreachable=0    failed=1                                               
os-node-1.example.com      : ok=101  changed=15   unreachable=0    failed=0               
os-node-10.example.com     : ok=101  changed=15   unreachable=0    failed=0                                                                                   
os-node-2.example.com      : ok=101  changed=15   unreachable=0    failed=0                                                                                   
os-node-3.example.com      : ok=101  changed=15   unreachable=0    failed=0                                                                                   
os-node-4.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-5.example.com      : ok=101  changed=15   unreachable=0    failed=0                                                            
os-node-6.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-7.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-8.example.com      : ok=101  changed=15   unreachable=0    failed=0
os-node-9.example.com      : ok=101  changed=15   unreachable=0    failed=0


INSTALLER STATUS *********************************************************************************************************************************************
Initialization              : Complete (0:01:54)
Health Check                : Complete (0:00:42)
Node Bootstrap Preparation  : Complete (0:05:39)
etcd Install                : Complete (0:01:04)
Master Install              : In Progress (0:02:16)

On the master node, the origin-node service fails to come up with the following error:

Nov 27 06:22:30 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:22:30 os-master-1.example.com systemd[1]: origin-node.service failed.
Nov 27 06:22:35 os-master-1.example.com systemd[1]: origin-node.service holdoff time over, scheduling restart.
Nov 27 06:22:35 os-master-1.example.com systemd[1]: Starting OpenShift Node...
-- Subject: Unit origin-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has begun starting up.
Nov 27 06:22:35 os-master-1.example.com origin-node[4173]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" d
Nov 27 06:22:35 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:22:35 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit origin-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit origin-node.service has failed.
-- 
-- The result is failed.

I don't understand why it's failing to read the file /etc/origin/node/node-config.yaml when it's there and has read permissions for the root user:

$ ls -l /etc/origin/node/node-config.yaml
-rw-------. 1 root root 1795 Nov 27 06:23 /etc/origin/node/node-config.yaml

Looking into the journal logs, I see this:

Nov 27 06:28:32 os-master-1.example.com origin-node[7312]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
Nov 27 06:28:32 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:28:32 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
Nov 27 06:28:32 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:28:32 os-master-1.example.com systemd[1]: origin-node.service failed.
Nov 27 06:28:37 os-master-1.example.com systemd[1]: origin-node.service holdoff time over, scheduling restart.
Nov 27 06:28:37 os-master-1.example.com systemd[1]: Starting OpenShift Node...
Nov 27 06:28:37 os-master-1.example.com origin-node[7378]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...
Nov 27 06:28:37 os-master-1.example.com systemd[1]: origin-node.service: main process exited, code=exited, status=1/FAILURE
Nov 27 06:28:38 os-master-1.example.com systemd[1]: Failed to start OpenShift Node.
Nov 27 06:28:38 os-master-1.example.com systemd[1]: Unit origin-node.service entered failed state.
Nov 27 06:28:38 os-master-1.example.com systemd[1]: origin-node.service failed.

It looks like the problem is not that the file can't be read, but that some of the configuration isn't written the way the node expects. I'm guessing this based on this log line:

Nov 27 06:28:37 os-master-1.example.com origin-node[7378]: Error: unable to read node config: could not load config file "/etc/origin/node/node-config.yaml" due to an error: error reading config: v1.NodeConfig.KubeletArguments: []string: decode slice: expect [ or n, but found 5, error found in #10 byte of ...|tainers":50,"maximum|..., bigger context ...|70"],"max-pods":["80"],"maximum-dead-containers":50,"maximum-dead-containers-per-container":2,"minim|...

And the corresponding configuration in the hosts file:

openshift_node_groups=[{'name': 'ccp-openshift-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/node-type=metrics', 'node-role.kubernetes.io/zone=default', 'node-role.kubernetes.io/infra=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['80']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}, {'name': 'ccp-openshift-node', 'labels': ['node-role.kubernetes.io/node-type=logging', 'node-role.kubernetes.io/zone=default'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['80']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}]

At least it's not an invalid Python dictionary, but something is probably still off.
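
For context, the kubeletArguments section of node-config.yaml is decoded into a map of flag name to list of strings (the v1.NodeConfig.KubeletArguments type named in the error), so every value has to be rendered as a list. Below is a minimal sketch of the shape the decoder expects, reconstructed from the error message rather than copied from the actual generated file:

# kubeletArguments in node-config.yaml: every value must be a list of strings
kubeletArguments:
  max-pods:
  - "80"
  image-gc-high-threshold:
  - "70"
  maximum-dead-containers:
  - "50"
# The "expect [ or n, but found 5" error means the generated file instead
# contained a bare scalar here (e.g. maximum-dead-containers: 50), which
# cannot be decoded into []string.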

The control plane pods (if I'm correct in assuming that these are the control plane pods) seem to be coming up just fine:

$ oc get pods -n kube-system
NAME                                         READY     STATUS    RESTARTS   AGE
master-api-os-master-1.example.com           1/1       Running   0          19h
master-controllers-os-master-1.example.com   1/1       Running   0          19h
master-etcd-os-master-1.example.com          1/1       Running   0          19h

Can someone please help?

dharmit commented Nov 28, 2018

This issue seems to have been fixed by wrapping every value being set for the value key in [ and ].

openshift_node_groups=[{'name': 'ccp-openshift-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/node-type=metrics', 'node-role.kubernetes.io/zone=default', 'node-role.kubernetes.io/infra=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['70']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}, {'name': 'ccp-openshift-node', 'labels': ['node-role.kubernetes.io/node-type=logging', 'node-role.kubernetes.io/zone=default'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['70']}, {'key': 'kubeletArguments.image-gc-high-threshold', 'value': ['70']}, {'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}, {'key': 'kubeletArguments.maximum-dead-containers', 'value': ['50']}, {'key': 'kubeletArguments.maximum-dead-containers-per-container', 'value' : ['2']}]}]
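
Comparing this with the original inventory above, the only edit whose value was not already written as a list was minimum-container-ttl-duration (the max-pods value also changed from '80' to '70' here, but that change is unrelated to the failure):

# before (bare string):
{'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': '2h'}
# after (list):
{'key': 'kubeletArguments.minimum-container-ttl-duration', 'value': ['2h']}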

vrutkovs commented Nov 28, 2018

Please avoid using INI-style inventories - they have a confusing syntax; use YAML instead
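
For reference, a minimal sketch of how the same node groups could be expressed in a YAML inventory. The OSEv3/masters/nodes/etcd group layout follows the usual openshift-ansible convention, but the host-to-group assignments and the trimmed label/edit lists below are illustrative, not copied from the reporter's inventory:

all:
  children:
    OSEv3:
      children:
        masters:
          hosts:
            os-master-1.example.com:
        etcd:
          hosts:
            os-master-1.example.com:
        nodes:
          hosts:
            os-master-1.example.com:
              openshift_node_group_name: ccp-openshift-master
            os-node-1.example.com:
              openshift_node_group_name: ccp-openshift-node
      vars:
        openshift_deployment_type: origin
        # In YAML every kubeletArguments value is unambiguously a list
        openshift_node_groups:
        - name: ccp-openshift-master
          labels:
          - node-role.kubernetes.io/master=true
          - node-role.kubernetes.io/infra=true
          edits:
          - key: kubeletArguments.max-pods
            value: ['80']
          - key: kubeletArguments.minimum-container-ttl-duration
            value: ['2h']
        - name: ccp-openshift-node
          labels:
          - node-role.kubernetes.io/node-type=logging
          edits:
          - key: kubeletArguments.max-pods
            value: ['80']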

vrutkovs closed this Nov 28, 2018

dharmit commented Nov 29, 2018

@vrutkovs Thanks for the tip. I think the examples in the documentation are INI-style. Could you please point me to a YAML-based example that you'd recommend instead?
