Ansible 2.7: Etcd install skipped -> Control plane pods didn't come up #10368

Open
lentzi90 opened this Issue Oct 10, 2018 · 18 comments

@lentzi90

lentzi90 commented Oct 10, 2018

Description

The etcd installation is skipped on a simple single node setup (in vagrant).

Version
  • Ansible version per ansible --version:
ansible 2.7.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/lennart/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/dist-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.15rc1 (default, Apr 15 2018, 21:51:34) [GCC 7.3.0]
  • The output of git describe:
openshift-ansible-3.10.53-1-2-gb839d825c
Steps To Reproduce
  1. ansible-playbook -i single-node.ini openshift-ansible/playbooks/prerequisites.yml
  2. ansible-playbook -i single-node.ini openshift-ansible/playbooks/deploy_cluster.yml
Expected Results

The cluster installation is successful.

Observed Results

The deploy_cluster.yml playbook fails with the following message:

TASK [openshift_control_plane : Report control plane errors] *******************************************************************************
fatal: [master]: FAILED! => {"changed": false, "msg": "Control plane pods didn't come up"}

NO MORE HOSTS LEFT *************************************************************************************************************************
	to retry, use: --limit @/home/lennart/workspace/elastisys/vagrant/openshift/openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP *********************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master                     : ok=248  changed=102  unreachable=0    failed=1   


INSTALLER STATUS ***************************************************************************************************************************
Initialization              : Complete (0:00:11)
Health Check                : Complete (0:00:38)
Node Bootstrap Preparation  : Complete (0:02:03)
etcd Install                : Complete (0:00:04)
Master Install              : In Progress (0:17:28)
	This phase can be restarted by running: playbooks/openshift-master/config.yml


Failure summary:


  1. Hosts:    master
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up

Full logs here: deploy.txt

Verbose logs here: deploy-verbose.txt

There were three warnings:

[WARNING]: Could not match supplied host pattern, ignoring: oo_lb_to_config
[WARNING]: Could not match supplied host pattern, ignoring: oo_nfs_to_config
[WARNING]: flush_handlers task does not support when conditional

I further debugged this and found this in the logs of the api pod:

I1010 08:38:44.914483       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I1010 08:38:44.914599       1 plugins.go:84] Registered admission plugin "Initializers"
I1010 08:38:44.914608       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I1010 08:38:44.914615       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I1010 08:38:44.914621       1 plugins.go:84] Registered admission plugin "AlwaysAdmit"
I1010 08:38:44.914626       1 plugins.go:84] Registered admission plugin "AlwaysPullImages"
I1010 08:38:44.914634       1 plugins.go:84] Registered admission plugin "LimitPodHardAntiAffinityTopology"
I1010 08:38:44.914643       1 plugins.go:84] Registered admission plugin "DefaultTolerationSeconds"
I1010 08:38:44.914648       1 plugins.go:84] Registered admission plugin "AlwaysDeny"
I1010 08:38:44.914655       1 plugins.go:84] Registered admission plugin "EventRateLimit"
I1010 08:38:44.914660       1 plugins.go:84] Registered admission plugin "DenyEscalatingExec"
I1010 08:38:44.914663       1 plugins.go:84] Registered admission plugin "DenyExecOnPrivileged"
I1010 08:38:44.914668       1 plugins.go:84] Registered admission plugin "ExtendedResourceToleration"
I1010 08:38:44.914675       1 plugins.go:84] Registered admission plugin "OwnerReferencesPermissionEnforcement"
I1010 08:38:44.914683       1 plugins.go:84] Registered admission plugin "ImagePolicyWebhook"
I1010 08:38:44.914688       1 plugins.go:84] Registered admission plugin "InitialResources"
I1010 08:38:44.914693       1 plugins.go:84] Registered admission plugin "LimitRanger"
I1010 08:38:44.914698       1 plugins.go:84] Registered admission plugin "NamespaceAutoProvision"
I1010 08:38:44.914703       1 plugins.go:84] Registered admission plugin "NamespaceExists"
I1010 08:38:44.914707       1 plugins.go:84] Registered admission plugin "NodeRestriction"
I1010 08:38:44.914712       1 plugins.go:84] Registered admission plugin "PersistentVolumeLabel"
I1010 08:38:44.914717       1 plugins.go:84] Registered admission plugin "PodNodeSelector"
I1010 08:38:44.914722       1 plugins.go:84] Registered admission plugin "PodPreset"
I1010 08:38:44.914726       1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I1010 08:38:44.914731       1 plugins.go:84] Registered admission plugin "ResourceQuota"
I1010 08:38:44.914736       1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I1010 08:38:44.914741       1 plugins.go:84] Registered admission plugin "Priority"
I1010 08:38:44.914747       1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I1010 08:38:44.914752       1 plugins.go:84] Registered admission plugin "ServiceAccount"
I1010 08:38:44.914757       1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I1010 08:38:44.914762       1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I1010 08:38:44.914766       1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
Invalid MasterConfig /etc/origin/master/master-config.yaml
  etcdClientInfo.ca: Invalid value: "/etc/origin/master/master.etcd-ca.crt": could not read file: stat /etc/origin/master/master.etcd-ca.crt: no such file or directory

Then I went back to the ansible logs and realized that it skipped almost all tasks when installing etcd. (Relevant part of logs here: etcd-installation-log.txt)

Why is this happening? Did I miss something in the inventory file?
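
A quick way for anyone else to check whether the etcd CA was ever generated (a sketch only, not part of the playbooks; the path comes from the error above):

- hosts: masters
  gather_facts: false
  tasks:
    - name: check for the etcd CA certificate referenced by master-config.yaml
      stat:
        path: /etc/origin/master/master.etcd-ca.crt
      register: etcd_ca

    - name: report whether the file exists
      debug:
        msg: "master.etcd-ca.crt {{ 'exists' if etcd_ca.stat.exists else 'is missing' }}"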

Additional Information

There are quite a few issues about control plane pods not starting but I don't think this is a duplicate of any of them.
Here are some of the issues that I looked at before reporting:

  • #10110 did not include the master node in the nodes group
  • #10047 fails much later when control plane is already up
  • #9973 and #7967 could be the same issue (I am also missing /etc/cni); however, these issues do not mention anything about etcd not being present.
  • #9894 seems to be a mistake in the inventory
  • #9852 seems related to AWS only if I understand correctly
  • Your operating system and version: CentOS Linux release 7.5.1804 (Core)
  • Your inventory file:
# single-node.ini mostly copied from inventory/hosts.localhost

master ansible_host=192.168.121.159 ansible_port=22 ansible_user='vagrant' ansible_ssh_private_key_file='/home/lennart/.vagrant.d/insecure_private_key'

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_become=yes
openshift_deployment_type=origin
openshift_portal_net=172.30.0.0/16
openshift_disable_check=disk_availability,memory_availability,docker_storage

openshift_node_groups=[{'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/compute=true']}]


[masters]
master

[etcd]
master

[nodes]
master openshift_node_group_name="node-config-all-in-one"
@vrutkovs

Contributor

vrutkovs commented Oct 10, 2018

Ansible 2.7 is not recommended yet - the issue should go away once you downgrade to 2.6

This is a perfectly described issue though - let's use it as the 'ansible 2.7 support' tracking bug
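
For example, if Ansible on the control host is managed via pip, pinning it to a 2.6 release could look roughly like this (a sketch only; the version number is just an example):

- name: install a supported 2.6-series Ansible
  pip:
    name: ansible
    version: "2.6.2"   # example version; any 2.6.x release should do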

@vrutkovs vrutkovs changed the title from Etcd install skipped -> Control plane pods didn't come up to Ansible 2.7: Etcd install skipped -> Control plane pods didn't come up Oct 10, 2018

@lentzi90

lentzi90 commented Oct 10, 2018

Thanks for the quick reply!
I switched to the containerized installer to get the correct version, and the control plane came up as expected :)

@watsonb

watsonb commented Oct 10, 2018

@lentzi90 thanks for pointing this out. I ran into the exact same issue jumping to 2.7.0. I reverted back to ansible 2.6.2 and the installation proceeded without error.

@DanyC97

Contributor

DanyC97 commented Oct 10, 2018

Just curious - why are you jumping to the latest Ansible version when the requirements make the supported version clear?

Not trying to start a debate; I just see more and more people ignoring that and jumping to the latest version.

@lentzi90

lentzi90 commented Oct 10, 2018

The laptop I'm using is running Ubuntu, which is still stuck with ansible 2.5 in the official repos, so I went with a PPA and got the latest.
To be honest I didn't think Ansible >= 2.6.2 implied that 2.7 was unsupported.

@nagonzalez

nagonzalez commented Oct 10, 2018

@DanyC97

I've got a CI job that I use to run daily tests of the localhost origin installer on the release-3.10 branch. As part of that job, I first install the requirements per the README.md. Reading the requirements, I got the impression it was OK to use the latest version, given I was installing origin and not OCP.

Requirements in README.md (release-3.10)

Requirements:

Ansible >= 2.4.3.0, 2.5.x is not currently supported for OCP installations
Jinja >= 2.7
pyOpenSSL
python-lxml

Code I used to prep the installer:

- name: install required pip packages for installer
  pip:
    name: "{{ item }}"
    state: latest
  loop:
    - ansible
    - pyOpenSSL
  tags:
    - skip_ansible_lint

Would it make sense to add a check in the prerequisites.yml playbook that fails if required packages are higher than the supported versions?
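
Something along these lines, perhaps (just a sketch; the exact bounds would need to match whatever the README declares as supported):

- name: fail early if the control host Ansible version is unsupported
  # sketch only: the version bounds below are illustrative, not the official requirement
  fail:
    msg: "Ansible {{ ansible_version.full }} is not supported; use a 2.6.x release (>= 2.6.2)"
  when: ansible_version.full is version('2.7.0', '>=') or
        ansible_version.full is version('2.6.2', '<')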

@vrutkovs

Contributor

vrutkovs commented Oct 11, 2018

state: latest

That pulls in the latest Ansible, which is currently 2.7. You should use requirements.txt instead
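
For the CI job above, that could look something like this (the checkout path variable is a placeholder, not an existing variable):

- name: install installer dependencies at the versions pinned by openshift-ansible
  pip:
    # sketch only: openshift_ansible_dir stands in for wherever the repo is checked out
    requirements: "{{ openshift_ansible_dir | default('openshift-ansible') }}/requirements.txt"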

@lentzi90

lentzi90 commented Oct 12, 2018

I think it would be a good idea to also mention requirements.txt in the readme. Something like:

You can ensure that you get supported versions of the packages by using requirements.txt when installing pip packages: pip install -r requirements.txt.

@zizzencs

zizzencs commented Oct 13, 2018

Faced the same issue today. Continuing the failed installation after downgrading to ansible 2.6.5 didn't work but a clean installation with ansible 2.6.5 did work.

@sivel

sivel commented Oct 26, 2018

I believe the problem that caused this should be fixed in Ansible 2.7.1.

In Ansible 2.7.0 we made a change to variable exposure from import_role as described at:

https://docs.ansible.com/ansible/latest/porting_guides/porting_guide_2.7.html#include-role-and-import-role-variable-exposure

However, that exposed a bug with regard to mutable defaults, which was resolved for the Ansible 2.7.1 release as part of ansible/ansible#46833

Based on what I understand of these playbooks, I believe that etcd_ca_setup: False carried over from one play to another, causing the when statement on additional etcd role calls to be skipped.

I've had 2 co-workers confirm that Ansible 2.7.1 is behaving properly with the etcd roles.
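
A stripped-down illustration of that carry-over (role and variable names here are hypothetical, mirroring the etcd_ca_setup case rather than quoting the actual plays):

# hypothetical reproduction of the Ansible 2.7.0 behaviour described above
# roles/demo_etcd/defaults/main.yml contains:  demo_ca_setup: True

- hosts: etcd
  tasks:
    - import_role:
        name: demo_etcd
      vars:
        demo_ca_setup: False    # intended to apply to this call only

- hosts: etcd
  tasks:
    - import_role:
        name: demo_etcd
      # on 2.7.0 the False above could leak into this play, so this guard gets skipped;
      # with 2.7.1 the role default (True) should apply again and the role runs
      when: demo_ca_setup | bool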

@sdodson

Member

sdodson commented Oct 26, 2018

/close
Fixed in Ansible 2.7.1

@sivel (and co-workers) thanks for confirming!

@openshift-ci-robot

openshift-ci-robot commented Oct 26, 2018

@sdodson: Closing this issue.

In response to this:

/close
Fixed in Ansible 2.7.1

@sivel (and co-workers) thanks for confirming!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sdodson

Member

sdodson commented Oct 26, 2018

/reopen
@mtnbikenc indicates other problems remain #10523 (comment)

@openshift-ci-robot

openshift-ci-robot commented Oct 26, 2018

@sdodson: Reopening this issue.

In response to this:

/reopen
@mtnbikenc indicates other problems remain #10523 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@calston

calston commented Oct 29, 2018

I had this issue as well trying to do a single node install.

A bad solution (edit: I originally wrote "the", but surely this is not "the" solution) was to ditch the node-config-all-in-one node group and use just node-config-master while also setting openshift_schedulable=True, then manually re-label the node and re-run deploy_cluster.

When using node-config-all-in-one, the Ansible playbook for some reason tries to configure node services before master services.

@calston

calston commented Oct 29, 2018

@DanyC97

I've got a CI job that I use to run daily tests

I was about to ask what OpenShift/OKD's CI covers in this regard, because over the last three releases the regressions I've seen in various parts of the install procedure suggest there's a big void in functional test coverage.

@leoluk

leoluk commented Nov 1, 2018

This breaks OKD deployment on CentOS + EPEL, since EPEL upgraded Ansible to 2.7 (not for the first time, either). The extras repo is still at 2.4, so EPEL is necessary. Maybe the CentOS SCL responsible for OpenShift should maintain its own Ansible packages compatible with openshift-ansible, like RHEL does?

Happy to contribute, if someone points me in the right direction.

@DanyC97

Contributor

DanyC97 commented Nov 9, 2018

So now with 3.11 out, you can see that a new ansible rpm has been released in the CentOS extras repo specifically to address the above issue.

If people are happy with it, I think we can close this issue.
