
Just hangs at: TASK: [os_firewall | Start and enable iptables service] *********************** #747

Closed
JaredBurck opened this issue Oct 26, 2015 · 23 comments


@JaredBurck

I am trying to install OpenShift v3 using the advanced method via Ansible playbooks and am hitting an issue very similar to #434. However, I am attempting to set up a cluster on OpenStack with a single (1) master, clustered (3) etcd, and 2 nodes. Everything seems to be set up and executing just fine until Ansible gets to the following task, where it simply hangs without completing, erroring, or otherwise ending.

TASK: [os_firewall | Start and enable iptables service] ***********************
changed: [master.ose.example.com]
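
For reference, a task like the one in that output is typically just an Ansible service task; a rough sketch of what such a task generally looks like (illustrative only, not necessarily the exact os_firewall role code):

# Illustrative sketch of a start-and-enable task for iptables
- name: Start and enable iptables service
  # Ensure iptables is running now and enabled at boot; starting/restarting
  # this service is what appears to disrupt the ssh connection discussed
  # later in this thread.
  service: name=iptables state=started enabled=yes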

It just hangs at this point no matter how long I wait. If I quit the execution and re-run the playbook, Ansible successfully completes the installation. There were no errors that I could find on the master or any of the etcd instances, and iptables had been enabled and started on each of them.

I could not attach my Ansible hosts file, so it is included below:

# This is an example of a bring your own (byo) host inventory

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
deployment_type=enterprise
ansible_ssh_user=root
osm_default_subdomain=apps.ose.example.com

# enable htpasswd authentication
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/openshift/openshift-passwd'}]

# host group for masters
[masters]
master.ose.example.com openshift_hostname=master.ose.example.com openshift_public_hostname=master.ose.example.com openshift_schedulable=True

# host group for etcd
[etcd]
master.ose.example.com
etcd1.ose.example.com
etcd2.ose.example.com

# host group for nodes
[nodes]
master.ose.example.com openshift_hostname=master.ose.example.com openshift_public_hostname=master.ose.example.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
node2.ose.example.com openshift_hostname=node2.ose.example.com openshift_public_hostname=node2.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
node3.ose.example.com openshift_hostname=node3.ose.example.com openshift_public_hostname=node3.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'east'}"

I am happy to provide additional information or help investigate this further. Thanks!

@detiber
Contributor

detiber commented Oct 27, 2015

@JaredBurck I suspect the issue is related to the iptables restart and the ssh connection hanging from that point. It is probably more prevalent in OpenStack environments due to the way networking is handled.

Are you able to reproduce this issue reliably?

@JaredBurck
Author

@detiber Yes, I am able to reproduce this reliably.

@detiber
Contributor

detiber commented Oct 27, 2015

@JaredBurck okay, just to clarify: you are seeing this same issue on all of the hosts the first time they attempt to restart iptables, but not on subsequent runs, correct?

What is the base image that you are using within OpenStack?

@JaredBurck
Author

@detiber No, I am not seeing the same issue on all the hosts; it only hangs on the master. It does not matter where the master host is in the list for this task: whether it is the first or second host listed, it hangs at that point. It only seems to hang on the master host.

I am using the ose3-base image, which is a custom-built image from rhel-server-7.1-x86_64-dvd.iso. I am using the OS1 environment, if you are familiar with that.

One other thing that may or may not be helpful: as things stand, we do uninstall NetworkManager. This was a prerequisite early on in OSEv3, but it appears to have been removed from the docs lately.

@detiber
Contributor

detiber commented Oct 27, 2015

@JaredBurck most interesting, I really would have expected the error to be on all hosts and not isolated to the master host.

Also odd is that the issue seems to be reproduced only by people using os1.

Some other things that may or may not have anything to do with the price of beans:

  • What version of ansible are you using?
  • Are you running openshift-ansible from inside or outside of os1?
  • How are you handling dns resolution for the environment? Is the dns server co-located on the master by chance?
  • Do the dns names used resolve to the external or internal ip addresses for the instances?

@detiber
Contributor

detiber commented Oct 27, 2015

As a side note, there should be no incompatibilities with running NetworkManager. We resolved any outstanding issues before 3.0 went GA.

@JaredBurck
Author

@detiber Thanks for the side note, that's good to know! It is odd that this is only reproduced in os1.

[root@master ~]# ansible --version
ansible 1.9.2
  configured module search path = None

I am running openshift-ansible from inside os1 (internal - if that matters) on the master.

An internal dns server is installed and configured on node2 for dns resolution. The dns server is not co-located with the master.

The dns names used resolve to the internal ip addresses for the instances, the 172.x.x.x addresses.

@detiber
Contributor

detiber commented Oct 28, 2015

@JaredBurck oh wait... you are running ansible on the master with dns being provided by node2?

I think you just helped me crack the code. I think the ultimate cause is the combination of the ssh client and ssh server experiencing a network interruption from the iptables restart...

Can you try one of the following workarounds:

  1. Run ansible from a separate host with the same inventory and outside of the separate ansible host the same host configuration.
  2. Run ansible as you currently are, but modify your inventory by appending ansible_connection=local and ansible_sudo=no to your master host definition in the [masters] group; this will tell ansible to run locally as opposed to over ssh. Alternatively, you should be able to use localhost (or 127.0.0.1) for the name of the master in the inventory, and it should automatically use the local connection plugin (the names listed in the inventory are for connection purposes and are not explicitly used by openshift-ansible in configuration).

There is a small chance that you may hit some type of a bug related to running with the second workaround, but that would manifest itself quite differently than the errors you are currently seeing.

@JamesRadtke-RedHat

@JaredBurck - hey Jared - I have run into the same issue this evening myself. However, this is on my laptop. This is the first time that I have attempted to have any functions co-located (i.e., master and etcd on one node). I have not had any issues like this in the past. I will return to using single-function nodes using the same base code of the ansible playbooks (Oct 31 release).
The ssh commands currently running:

root 4027 1 0 22:07 ? 00:00:00 ssh: /root/.ansible/cp/ansible-ssh-rh7osetcd01.aperture.lab-22-root [mux]
root 6764 6730 0 22:07 pts/0 00:00:00 ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 rh7osetcd01.aperture.lab /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1447816068.32-200955259988792/service; rm -rf /root/.ansible/tmp/ansible-tmp-1447816068.32-200955259988792/ >/dev/null 2>&1'

If I kill 6764, I get:

fatal: [rh7osetcd01.aperture.lab] => SSH Error: Shared connection to rh7osetcd01.aperture.lab closed.
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

TASK: [os_firewall | need to pause here, otherwise the iptables service starting can sometimes cause ssh to fail] ***
(^C-c = continue early, ^C-a = abort)
[rh7osemst01.aperture.lab, rh7osetcd02.aperture.lab]
Pausing for 10 seconds
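
For context, a delay like the one in that task output can be expressed with Ansible's pause module; a minimal sketch of what such a task generally looks like (illustrative only, not necessarily the exact os_firewall role code):

# Give the iptables service (re)start a moment to settle before ansible
# issues further commands over ssh
- name: need to pause here, otherwise the iptables service starting can sometimes cause ssh to fail
  pause: seconds=10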

@detiber
Contributor

detiber commented Nov 18, 2015

@jradtke-rh The only time I've seen this reproduced is when ansible is being run on one of the hosts that are being configured. Can you confirm this was the case for your run this evening?

If so, then modifying the inventory to set ansible_connection=local for the host you are running ansible on should do the trick.

@JaredBurck
Author

@detiber - my apologies, it's been a while. And thank you for all your assistance and patience!! FYI, my testing so far seems to agree with what you said about this occurring when ansible is being run on one of the hosts that are being configured.

However, as you suggested in workaround 2 (from 3 comments above), I modified the inventory to set ansible_connection=local and ansible_sudo=no on my master host definition in the [masters] group. I also tried those settings with the [etcd] group. It still hung for me on the master. I'm a little unclear on workaround 1 and what you would like me to try "outside of the separate ansible host the same host configuration". Could you explain a little more?

@jradtke-rh Could you try what @detiber suggested to see if this works for you? I have another workaround to get openshift installed and working, but not a solution to this issue.

@JamesRadtke-RedHat

@detiber - Unfortunately I did not see this in time. I killed the ssh process (PID 6764 in my example above), which let the ansible run continue; it finished but failed. I then re-ran the playbook and everything worked. I will be repeating my installation with this configuration a few times so I can try a few of the suggestions in this thread and gather more data.

One thing that I find strange... why does "localhost" show up in the "PLAY RECAP"?

PLAY RECAP ******************************************************************** 
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
rh7osemst01.aperture.lab   : ok=234  changed=7    unreachable=0    failed=0   
rh7osenod01.aperture.lab   : ok=47   changed=9    unreachable=0    failed=0   
rh7osenod02.aperture.lab   : ok=47   changed=9    unreachable=0    failed=0   
rh7osetcd01.aperture.lab   : ok=35   changed=0    unreachable=0    failed=0   
rh7osetcd02.aperture.lab   : ok=35   changed=0    unreachable=0    failed=0   

@detiber
Contributor

detiber commented Nov 18, 2015

@JaredBurck For workaround 1, I have no idea what I was thinking when I typed that. I was trying to say that you should run ansible from a separate host (one that is not part of the OpenShift deployment).

For workaround 2, I should have provided an example rather than try to explain it. Modifying your original inventory:

...
# host group for masters
[masters]
master.ose.example.com ansible_connection=local openshift_hostname=master.ose.example.com openshift_public_hostname=master.ose.example.com openshift_schedulable=True

# host group for etcd
[etcd]
master.ose.example.com ansible_connection=local
etcd1.ose.example.com
etcd2.ose.example.com

# host group for nodes
[nodes]
master.ose.example.com ansible_connection=local openshift_hostname=master.ose.example.com  openshift_public_hostname=master.ose.example.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
node2.ose.example.com openshift_hostname=node2.ose.example.com openshift_public_hostname=node2.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
node3.ose.example.com openshift_hostname=node3.ose.example.com openshift_public_hostname=node3.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
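
With this inventory, ansible should use the local connection plugin for master.ose.example.com instead of opening an ssh session to it. A quick way to confirm (command shown for illustration; substitute your actual inventory path):

ansible master.ose.example.com -i <path-to-inventory> -m ping -vvvv

The verbose output should show the master being handled without an ssh invocation, unlike the other hosts.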

@detiber
Contributor

detiber commented Nov 18, 2015

One thing that I find strange... why does "localhost" show up in the "PLAY RECAP"?

@jradtke-rh We perform some operations on 'localhost' in order to copy items from one remote host to another. The main use case is for copying the generated certificates from the CA host to the host that needs the certs. There are other various uses as well, but any operations on localhost would only create temporary files and/or variables.
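
As an illustration of that localhost-mediated pattern, here is a minimal sketch (hostnames and paths are hypothetical, and this is not the actual openshift-ansible task code):

# Pull the generated cert from the CA host down to the machine running ansible
- hosts: ca-host.example.com
  tasks:
    - name: Fetch the CA certificate to the local machine
      fetch: src=/etc/openshift/master/ca.crt dest=/tmp/cluster-certs/ flat=yes

# Push the cert from the local machine out to the host that needs it
- hosts: node1.example.com
  tasks:
    - name: Copy the CA certificate to the target host
      copy: src=/tmp/cluster-certs/ca.crt dest=/etc/openshift/node/ca.crt

The intermediate files only land temporarily on the machine running ansible, which is why localhost shows up in the PLAY RECAP.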

@jayunit100

I just saw this hang in EC2. The hang lasted several minutes, and it happened multiple times.

@jayunit100

Thanks for the workaround info. So:

  • by ansible_sudo=no, did we mean =false?
  • what is the purpose of modifying ansible_sudo here?
  • Is this sufficiently problematic that we might want to make it a requirement that we don't run ansible on the master host?

@jayunit100

By the way, on AWS I confirmed workaround 2 works for me; I don't think we have a need for ansible_sudo=no. Do we?

# host group for masters
[masters]
$MASTER ansible_connection=local

[nodes]
$NODE1

#openshift_node_labels="{'region': 'primary', 'zone': 'us-east-1'}"
#openshift_node_labels="{'region': 'primary', 'zone': 'us-east-1'}"

^ this seems to get past the iptables restart issue.

@detiber
Contributor

detiber commented Feb 12, 2016

@jayunit100 That is correct, you shouldn't need ansible_sudo=no.

As an aside, Ansible is pretty loose about boolean values, and yes/no are used in the Ansible documentation more commonly than true/false, so in general we use yes/no to conform with what Ansible users would expect.
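
For example, either of these host entries should behave identically, since Ansible accepts yes/no and true/false interchangeably for booleans (host name reused from the earlier example inventory):

master.ose.example.com ansible_connection=local ansible_sudo=no
master.ose.example.com ansible_connection=local ansible_sudo=false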

@jeremyeder
Contributor

@jayunit100 as a workaround, please use the jumpbox we have set up inside our VPC on EC2.

@jayunit100

TL;DR

  • the right way to install openshift is from an "installer node" which is completely separate from BOTH the master AND the nodes.
  • otherwise $MASTER ansible_connection=local works, but it's a little hacky.

@detiber
Contributor

detiber commented Feb 13, 2016

@jayunit100 having a dedicated jump host is great, but no more correct than using connection=local on the master.

In the installer wrapper we attempt to detect when we are running on the master and set connection=local automatically.

The real benefit of having a jump host is if you are managing other hosts using ansible or tearing down and rebuilding clusters frequently.

@jayunit100

Ok, so running from the master is supported. A jump host seems a better way to constrain installation problems and simplify docs, but as long as we are explicit that installing from the master is supported, that's fine.

@tbielawa
Contributor

This issue has been inactive for quite some time. Please update and reopen this issue if this is still a priority you would like to see action on.

tomassedovic pushed a commit to tomassedovic/openshift-ansible that referenced this issue Nov 7, 2017
…enstack (openshift#747)

* Allow for the specifying of server policies during OpenStack provisioning

* documentation for openstack server group policies

* add doc link detailing allowed policies

* changed default to anti-affinity