
Just hangs at: TASK: [os_firewall | Start and enable iptables service] *********************** #747

Closed
JaredBurck opened this issue Oct 26, 2015 · 23 comments


@JaredBurck

I am trying to install OpenShift v3 using the advanced method via Ansible playbooks and am hitting an issue very similar to #434. However, I am attempting to set up a cluster on OpenStack with a single (1) master, clustered (3) etcd, and 2 nodes. Everything seems to be set up and executing just fine until Ansible gets to the following task, where it simply hangs without completing, erroring, or otherwise ending.

TASK: [os_firewall | Start and enable iptables service] ***********************
changed: [master.ose.example.com]
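
For reference, a task like the one in that output is typically just an Ansible service task; a rough sketch of what such a task generally looks like (illustrative only, not necessarily the exact os_firewall role code):

# Illustrative sketch of a start-and-enable task for iptables
- name: Start and enable iptables service
  # Ensure iptables is running now and enabled at boot; starting/restarting
  # this service is what appears to disrupt the ssh connection discussed
  # later in this thread.
  service: name=iptables state=started enabled=yes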

It just hangs at this point no matter how long I wait. If I quit the execution and re-run the playbook, Ansible successfully completes the installation. There were no errors that I could find on the master or any of the etcd instances, and iptables had been enabled and started on each of them.

I could not attach my Ansible hosts file, so it is included below:

# This is an example of a bring your own (byo) host inventory

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
deployment_type=enterprise
ansible_ssh_user=root
osm_default_subdomain=apps.ose.example.com

# enable htpasswd authentication
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/openshift/openshift-passwd'}]

# host group for masters
[masters]
master.ose.example.com openshift_hostname=master.ose.example.com openshift_public_hostname=master.ose.example.com openshift_schedulable=True

# host group for etcd
[etcd]
master.ose.example.com
etcd1.ose.example.com
etcd2.ose.example.com

# host group for nodes
[nodes]
master.ose.example.com openshift_hostname=master.ose.example.com openshift_public_hostname=master.ose.example.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
node2.ose.example.com openshift_hostname=node2.ose.example.com openshift_public_hostname=node2.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
node3.ose.example.com openshift_hostname=node3.ose.example.com openshift_public_hostname=node3.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'east'}"

I am happy to provide additional information or help investigate this further. Thanks!

@detiber
Contributor

detiber commented Oct 27, 2015

@JaredBurck I suspect the issue is related to the iptables restart and the ssh connection hanging from that point. It is probably more prevalent in OpenStack environments due to the way networking is handled.

Are you able to reproduce this issue reliably?

@JaredBurck
Author

@detiber Yes, I am able to reproduce this reliably.

@detiber
Contributor

detiber commented Oct 27, 2015

@JaredBurck okay, just to clarify: you are seeing this same issue on all of the hosts the first time they attempt to restart iptables, but not on subsequent runs, correct?

What is the base image that you are using within OpenStack?

@JaredBurck
Author

@detiber No, I am not seeing the same issue on all the hosts; it only hangs on the master. It does not matter where the master host is in the list for this task: whether it is the first or second host listed, it hangs at that point. It only seems to hang on the master host.

I am using the ose3-base image, which is a custom-built image from rhel-server-7.1-x86_64-dvd.iso. I am using the OS1 environment, if you are familiar with that.

One other thing that may or may not be helpful: as things stand, we do uninstall NetworkManager. This was a prerequisite early on in OSEv3, but it appears to have been removed from the docs lately.

@detiber
Contributor

detiber commented Oct 27, 2015

@JaredBurck most interesting, I really would have expected the error to be on all hosts and not isolated to the master host.

Also odd is that the issue seems to be reproduced only by people using os1.

Some other things that may or may not have anything to do with the price of beans:

  • What version of ansible are you using?
  • Are you running openshift-ansible from inside or outside of os1?
  • How are you handling dns resolution for the environment? Is the dns server co-located on the master by chance?
  • Do the dns names used resolve to the external or internal ip addresses for the instances?

@detiber
Contributor

detiber commented Oct 27, 2015

As a side note, there should be no incompatibilities with running NetworkManager. We resolved any outstanding issues before 3.0 went GA.

@JaredBurck
Author

@detiber Thanks for the side note, that's good to know! It is odd that this is only reproduced in os1.

[root@master ~]# ansible --version
ansible 1.9.2
  configured module search path = None

I am running openshift-ansible from inside os1 (internal - if that matters) on the master.

An internal dns server is installed and configured on node2 for dns resolution. The dns server is not co-located with the master.

The dns names used resolve to the internal ip addresses for the instances, the 172.x.x.x addresses.

@detiber
Contributor

detiber commented Oct 28, 2015

@JaredBurck oh wait... you are running ansible on the master with dns being provided by node2?

I think you just helped me crack the code. I think the ultimate cause is the combination of the ssh client and ssh server experiencing a network interruption from the iptables restart...

Can you try one of the following workarounds:

  1. Run ansible from a separate host with the same inventory and outside of the separate ansible host the same host configuration.
  2. Run ansible as you currently are, but modify your inventory by appending ansible_connection=local and ansible_sudo=no to your master host definition in the [masters] group; this will tell ansible to run locally as opposed to over ssh. Alternatively, you should be able to use localhost (or 127.0.0.1) for the name of the master in the inventory, and it should automatically use the local connection plugin (the names listed in the inventory are for connection purposes and are not explicitly used by openshift-ansible in configuration).

There is a small chance that you may hit some type of a bug related to running with the second workaround, but that would manifest itself quite differently than the errors you are currently seeing.

@JamesRadtke-RedHat

@JaredBurck - hey Jared - I have run into the same issue this evening myself. However, this is on my laptop. This is the first time that I have attempted to have any functions co-located (i.e., master and etcd on one node). I have not had any issues like this in the past. I will return to using single-function nodes using the same base code of the ansible playbooks (Oct 31 release).
The ssh commands currently running:

root 4027 1 0 22:07 ? 00:00:00 ssh: /root/.ansible/cp/ansible-ssh-rh7osetcd01.aperture.lab-22-root [mux]
root 6764 6730 0 22:07 pts/0 00:00:00 ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 rh7osetcd01.aperture.lab /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1447816068.32-200955259988792/service; rm -rf /root/.ansible/tmp/ansible-tmp-1447816068.32-200955259988792/ >/dev/null 2>&1'

If I kill 6764, I get:

fatal: [rh7osetcd01.aperture.lab] => SSH Error: Shared connection to rh7osetcd01.aperture.lab closed.
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

TASK: [os_firewall | need to pause here, otherwise the iptables service starting can sometimes cause ssh to fail] ***
(^C-c = continue early, ^C-a = abort)
[rh7osemst01.aperture.lab, rh7osetcd02.aperture.lab]
Pausing for 10 seconds
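
For context, a delay like the one in that task output can be expressed with Ansible's pause module; a minimal sketch of what such a task generally looks like (illustrative only, not necessarily the exact os_firewall role code):

# Give the iptables service (re)start a moment to settle before ansible
# issues further commands over ssh
- name: need to pause here, otherwise the iptables service starting can sometimes cause ssh to fail
  pause: seconds=10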

@detiber
Contributor

detiber commented Nov 18, 2015

@jradtke-rh The only time I've seen this reproduced is when ansible is being run on one of the hosts that are being configured. Can you confirm this was the case for your run this evening?

If so, then modifying the inventory to set ansible_connection=local for the host you are running ansible on should do the trick.

@JaredBurck
Author

@detiber - my apologies, it's been a while. And thank you for all your assistance and patience!! FYI, my testing so far seems to agree with what you said about this occurring when ansible is being run on one of the hosts that are being configured.

However, as you suggested in workaround 2 (from 3 comments above), I modified the inventory to set ansible_connection=local and ansible_sudo=no on my master host definition in the [masters] group. I also tried those settings with the [etcd] group. It still hung for me on the master. I'm a little unclear on workaround 1 and what you would like me to try "outside of the separate ansible host the same host configuration". Could you explain a little more?

@jradtke-rh Could you try what @detiber suggested to see if this works for you? I have another workaround to get openshift installed and working, but not a solution to this issue.

@JamesRadtke-RedHat

@detiber - Unfortunately I did not see this in time. I killed the ssh process (PID 6764 in my example above), which let the ansible run continue; it finished but failed. I then re-ran the playbook and everything worked. I will be repeating my installation with this configuration a few times so I can try a few of the suggestions in this thread and gather more data.

One thing that I find strange... why does "localhost" show up in the "PLAY RECAP"?

PLAY RECAP ******************************************************************** 
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
rh7osemst01.aperture.lab   : ok=234  changed=7    unreachable=0    failed=0   
rh7osenod01.aperture.lab   : ok=47   changed=9    unreachable=0    failed=0   
rh7osenod02.aperture.lab   : ok=47   changed=9    unreachable=0    failed=0   
rh7osetcd01.aperture.lab   : ok=35   changed=0    unreachable=0    failed=0   
rh7osetcd02.aperture.lab   : ok=35   changed=0    unreachable=0    failed=0   

@detiber
Contributor

detiber commented Nov 18, 2015

@JaredBurck For workaround 1, I have no idea what I was thinking when I typed that. I was trying to say that you should run ansible from a separate host (one that is not part of the OpenShift deployment).

For workaround 2, I should have provided an example rather than try to explain it. Modifying your original inventory:

...
# host group for masters
[masters]
master.ose.example.com ansible_connection=local openshift_hostname=master.ose.example.com openshift_public_hostname=master.ose.example.com openshift_schedulable=True

# host group for etcd
[etcd]
master.ose.example.com ansible_connection=local
etcd1.ose.example.com
etcd2.ose.example.com

# host group for nodes
[nodes]
master.ose.example.com ansible_connection=local openshift_hostname=master.ose.example.com  openshift_public_hostname=master.ose.example.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
node2.ose.example.com openshift_hostname=node2.ose.example.com openshift_public_hostname=node2.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
node3.ose.example.com openshift_hostname=node3.ose.example.com openshift_public_hostname=node3.ose.example.com openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
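
With this inventory, ansible should use the local connection plugin for master.ose.example.com instead of opening an ssh session to it. A quick way to confirm (command shown for illustration; substitute your actual inventory path):

ansible master.ose.example.com -i <path-to-inventory> -m ping -vvvv

The verbose output should show the master being handled without an ssh invocation, unlike the other hosts.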

@detiber
Contributor

detiber commented Nov 18, 2015

One thing that I find strange... why does "localhost" show up in the "PLAY RECAP"?

@jradtke-rh We perform some operations on 'localhost' in order to copy items from one remote host to another. The main use case is for copying the generated certificates from the CA host to the host that needs the certs. There are other various uses as well, but any operations on localhost would only create temporary files and/or variables.
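
As an illustration of that localhost-mediated pattern, here is a minimal sketch (hostnames and paths are hypothetical, and this is not the actual openshift-ansible task code):

# Pull the generated cert from the CA host down to the machine running ansible
- hosts: ca-host.example.com
  tasks:
    - name: Fetch the CA certificate to the local machine
      fetch: src=/etc/openshift/master/ca.crt dest=/tmp/cluster-certs/ flat=yes

# Push the cert from the local machine out to the host that needs it
- hosts: node1.example.com
  tasks:
    - name: Copy the CA certificate to the target host
      copy: src=/tmp/cluster-certs/ca.crt dest=/etc/openshift/node/ca.crt

The intermediate files only land temporarily on the machine running ansible, which is why localhost shows up in the PLAY RECAP.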

@jayunit100

I just saw this hang in EC2. The hang lasted several minutes, and it happened multiple times.

@jayunit100

Thanks for the workaround info. So:

  • by ansible_sudo=no, did we mean =false?
  • what is the purpose of modifying ansible_sudo here?
  • Is this sufficiently problematic that we might want to make it a requirement that we don't run ansible on the master host?

@jayunit100

By the way, on AWS I confirmed workaround 2 works for me; I don't think we have a need for ansible_sudo=no. Do we?

# host group for masters
[masters]
$MASTER ansible_connection=local

[nodes]
$NODE1

#openshift_node_labels="{'region': 'primary', 'zone': 'us-east-1'}"
#openshift_node_labels="{'region': 'primary', 'zone': 'us-east-1'}"

^ this seems to get past the iptables restart issue.

@detiber
Contributor

detiber commented Feb 12, 2016

@jayunit100 That is correct, you shouldn't need ansible_sudo=no.

As an aside, Ansible is pretty loose about boolean values, and yes/no are used in the Ansible documentation more commonly than true/false, so in general we use yes/no to conform with what Ansible users would expect.
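
For example, either of these host entries should behave identically, since Ansible accepts yes/no and true/false interchangeably for booleans (host name reused from the earlier example inventory):

master.ose.example.com ansible_connection=local ansible_sudo=no
master.ose.example.com ansible_connection=local ansible_sudo=false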

@jeremyeder
Contributor

@jayunit100 as a workaround, please use the jumpbox we have set up inside our VPC on EC2.

@jayunit100

TL;DR

  • the right way to install openshift is from an "installer node" which is completely separate from BOTH the master AND the nodes.
  • otherwise $MASTER ansible_connection=local works, but it's a little hacky.

@detiber
Contributor

detiber commented Feb 13, 2016

@jayunit100 having a dedicated jump host is great, but no more correct than using connection=local on the master.

In the installer wrapper we attempt to detect when we are running on the master and set connection=local automatically.

The real benefit of having a jump host is if you are managing other hosts using ansible or tearing down and rebuilding clusters frequently.

@jayunit100

Ok, so running from the master is supported. A jump host seems a better way to constrain installation problems and simplify docs, but as long as we are explicit that installing from the master is supported, that's fine.

@tbielawa
Contributor

This issue has been inactive for quite some time. Please update and reopen this issue if this is still a priority you would like to see action on.

tomassedovic pushed a commit to tomassedovic/openshift-ansible that referenced this issue Nov 7, 2017
…enstack (openshift#747)

* Allow for the specifying of server policies during OpenStack provisioning

* documentation for openstack server group policies

* add doc link detailing allowed policies

* changed default to anti-affinity