Just hangs at: TASK: [os_firewall | Start and enable iptables service] *********************** #747
Comments
@JaredBurck I suspect the issue is related to the iptables restart and the ssh connection hanging from that point. It is probably more prevalent in OpenStack environments due to the way networking is handled. Are you able to reproduce this issue reliably?
@detiber Yes, I am able to reproduce this reliably.
@JaredBurck okay, just to clarify, you are seeing this same issue on all of the hosts once they attempt to restart iptables for the initial time but not subsequent runs, correct? What is the base image that you are using within OpenStack?
@detiber No, I am not seeing the issue on all of the hosts; it only hangs on the master. It does not matter where the master host is in the list: if it is the first host in this task it will hang, and if it is the second host in the task list it will hang there. It only seems to hang on the master host. I am using the ose3-base image, a custom-built image from rhel-server-7.1-x86_64-dvd.iso, in the OS1 environment, if you are familiar with that. One other thing that may or may not be helpful: we do uninstall NetworkManager. That was a prerequisite early on in OSEv3, but it appears to have been removed from the docs lately.
@JaredBurck most interesting, I really would have expected the error to be on all hosts and not isolated to the master host. Also odd is that the issue seems to be reproduced only by people using os1. Some other things that may or may not have anything to do with the price of beans:
As a side note, there should be no incompatibilities with running NetworkManager. We resolved any outstanding issues before 3.0 went GA.
@detiber Thanks for the side note, that's good to know! It is odd that this is only reproduced in os1.
I am running openshift-ansible from inside os1 (internal, if that matters) on the master. An internal dns server is installed and configured on node2 for dns resolution. The dns server is not co-located with the master. The dns names used resolve to the internal IP addresses for the instances, the 172.x.x.x addresses.
@JaredBurck oh wait... you are running ansible on the master, with dns being provided by node2? I think you just helped me crack the code. I think the combination of the ssh client and ssh server experiencing a network interruption from the iptables restart is the ultimate cause. Can you try one of the following workarounds:
There is a small chance that you may hit some type of bug related to running with the second workaround, but that would manifest itself quite differently than the errors you are currently seeing.
@JaredBurck - hey Jared - I have run into the same issue this evening myself, except on my laptop. This is the first time that I have attempted to colocate any functions (i.e. master and etcd on one node); I have not had any issues like this in the past. I will return to using single-function nodes with the same base code of the ansible-playbooks (Oct 31 release).

fatal: [rh7osetcd01.aperture.lab] => SSH Error: Shared connection to rh7osetcd01.aperture.lab closed.
TASK: [os_firewall | need to pause here, otherwise the iptables service starting can sometimes cause ssh to fail] ***
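The task name in that last line points at the mitigation: pause briefly after restarting the service so the ssh connection can recover. A rough sketch of that pattern in generic Ansible (not the project's actual task; the delay value is an assumption):

```yaml
# Generic restart-then-pause pattern suggested by the task name above.
# Not openshift-ansible's actual code; the 10-second delay is assumed.
- name: Start and enable iptables service
  service:
    name: iptables
    state: restarted
    enabled: yes

- name: Pause so the iptables restart does not drop the ssh connection
  pause:
    seconds: 10
```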
@jradtke-rh The only time I've seen this reproduced is when ansible is being run on one of the hosts that are being configured. Can you confirm this was the case for your run this evening? If so, then modifying the inventory to set ansible_connection=local for the host you are running ansible on should do the trick.
@detiber - my apologies, it's been a while, and thank you for all your assistance and patience! FYI, my testing so far seems to agree with what you have said about the issue occurring when ansible is being run on one of the hosts that are being configured. As you suggested (and as in workaround 2 from three comments above), I modified the inventory to set ansible_connection=local for that host.
@jradtke-rh Could you try what @detiber suggested to see if this works for you? I have another workaround to get openshift installed and working, but not a solution to this issue.
@detiber - Unfortunately I did not see this in time. I killed the process running ssh (PID: 6764 in my example above), which let the ansible run continue; it finished, but failed. I then re-ran the playbook and everything worked. I will be repeating my installation using this configuration a few times so I can try a few of the suggestions in this thread and gather more data. One thing that I find strange: why does "localhost" show up in the "PLAY RECAP"?
@JaredBurck For workaround 1, I have no idea what I was thinking when I typed that. I was trying to say that you should run ansible from a separate host (one that is not part of the OpenShift deployment). For workaround 2, I should have provided an example rather than try to explain it. Modifying your original inventory:
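A sketch of that inventory change, with a placeholder hostname (the key addition is ansible_connection=local on the host running ansible):

```ini
# Sketch only; placeholder hostname. The change is appending
# ansible_connection=local to the host that ansible runs on.
[masters]
ose3-master.example.com ansible_connection=local
```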
@jradtke-rh We perform some operations on 'localhost' in order to copy items from one remote host to another. The main use case is for copying the generated certificates from the CA host to the host that needs the certs. There are other various uses as well, but any operations on localhost would only create temporary files and/or variables.
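For illustration, that remote-to-remote pattern looks roughly like the following generic sketch (paths, group names, and file names are made up, not the project's actual tasks):

```yaml
# Generic sketch: copy a file between two remote hosts by staging it on
# the ansible control host (localhost). Paths and groups are made up.
- hosts: ca_host
  tasks:
    - name: Pull the generated cert down to the control host
      fetch:
        src: /etc/origin/master/ca.crt
        dest: /tmp/certs/ca.crt
        flat: yes

- hosts: new_nodes
  tasks:
    - name: Push the cert out to the hosts that need it
      copy:
        src: /tmp/certs/ca.crt
        dest: /etc/origin/node/ca.crt
```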
I just saw this hang in EC2. The hang was several minutes, and it happened multiple times.
Thanks for the workaround info. So, to confirm: for workaround 2, setting ansible_connection=local for the host running ansible is enough on its own, and I do not also need ansible_sudo=no?
By the way, on AWS I confirmed workaround 2 works for me; setting ansible_connection=local for the host running ansible gets past the iptables restart issue.
@jayunit100 That is correct, you shouldn't need ansible_sudo=no. As an aside, Ansible uses a pretty loose definition of true/false, and yes/no are commonly used in the Ansible documentation rather than true/false, so in general we use yes/no to conform with what Ansible users would expect.
@jayunit100 as a workaround, please use the jumpbox we have set up inside our VPC on EC2.
TL;DR: run the installer from a dedicated jump host rather than from one of the hosts being configured.
@jayunit100 having a dedicated jump host is great, but no more correct than using connection=local on the master. In the installer wrapper we attempt to detect when we are running on the master and set connection=local automatically. The real benefit of having a jump host is if you are managing other hosts using ansible or tearing down and rebuilding clusters frequently.
Ok, so running from the master is supported. A jump host seems a better way to constrain installation problems and simplify docs, but as long as we are explicit that installing from the master is supported, that's fine.
This issue has been inactive for quite some time. Please update and reopen this issue if this is still a priority you would like to see action on.
I am trying to install OpenShift v3 using the advanced method via ansible playbooks and am having an issue very similar to #434. However, I am attempting to set up a cluster on OpenStack with a single (1) master, clustered (3) etcd, and two (2) nodes. Everything seems to be set up and executing just fine until ansible gets to the following task, where it hangs without completing, erroring out, or otherwise ending.
It hangs at this point no matter how long I wait. If I quit the execution and re-run the ansible playbook, ansible successfully completes the installation. There were no errors that I could find on the master or any of the etcd instances, and iptables had been enabled and started on each of them.
I could not attach my ansible hosts file, so it is below:
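For reference, a representative sketch of an inventory for the topology described above (one master, three etcd, two nodes); hostnames and variable values are hypothetical, not the original file:

```ini
# Representative sketch only; hostnames and values are hypothetical.
[OSEv3:children]
masters
etcd
nodes

[OSEv3:vars]
ansible_ssh_user=root
deployment_type=enterprise

[masters]
ose3-master.example.com

[etcd]
ose3-etcd1.example.com
ose3-etcd2.example.com
ose3-etcd3.example.com

[nodes]
ose3-node1.example.com
ose3-node2.example.com
```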
I am happy to provide additional information or help investigate this further. Thanks!