IIAB 6.7/master fails to install on latest Ubuntu 18.04: "Unable to start service dnsmasq" #1306

Closed
mrdavidhaag opened this issue Nov 24, 2018 · 68 comments

Comments

@mrdavidhaag

I tried to install IIAB on a VM in Proxmox using the 6.7 1-line installer script, and the installation failed at stage 4:

TASK [4-server-options : Restart dnsmasq] ******************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "Unable to start service dnsmasq: Job for dnsmasq.service failed because the control process exited with error code.\nSee "systemctl status dnsmasq.service" and "journalctl -xe" for details.\n"}
to retry, use: --limit @/opt/iiab/iiab/iiab-stages.retry

PLAY RECAP *************************************************************************************
127.0.0.1 : ok=176 changed=122 unreachable=0 failed=1

I also tried the "Do Everything from Scratch" method, with the same result: dnsmasq: failed to create listening socket for port 53: Address already in use

netstat tells me that bind9 and systemd-resolve are both listening on port 53

root@box:/home/awm# netstat -nlpt | grep 53
tcp 0 0 10.10.35.20:53 0.0.0.0:* LISTEN 417/named
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 417/named
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 404/systemd-resolve
tcp 0 0 127.0.0.1:953 0.0.0.0:* LISTEN 417/named
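Note that grep 53 also matches port 953 (the last line above, which is named's control channel rather than a DNS listener). A minimal sketch of a narrower filter follows, assuming the netstat column layout shown above; the function name port53_listeners is hypothetical and not part of IIAB.

```shell
#!/bin/sh
# Filter `netstat -nlpt` output (read from stdin) down to sockets bound
# to port 53 exactly, printing "program local-address" per listener.
# Assumes the column layout shown above: field 4 is the local address
# and field 7 is "PID/program".
port53_listeners() {
    awk '$4 ~ /:53$/ { split($7, proc, "/"); print proc[2], $4 }'
}

# Usage (as root, so that program names are visible):
#   netstat -nlpt | port53_listeners
```

This only covers TCP listeners, as in the original command; UDP sockets would need a different column mapping.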

So I tried to stop bind9 service and run iiab install again - same error

root@box:/opt/iiab/iiab# systemctl status dnsmasq.service
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2018-11-24 10:58:06 WIT; 24s ago
Process: 5557 ExecStart=/etc/init.d/dnsmasq systemd-exec (code=exited, status=2)
Process: 5556 ExecStartPre=/usr/sbin/dnsmasq --test (code=exited, status=0/SUCCESS)

Nov 24 10:58:06 box.lan systemd[1]: Starting dnsmasq - A lightweight DHCP and caching DNS server...
Nov 24 10:58:06 box.lan dnsmasq[5556]: dnsmasq: syntax check OK.
Nov 24 10:58:06 box.lan dnsmasq[5557]: dnsmasq: failed to create listening socket for port 53: Address already in use
Nov 24 10:58:06 box.lan systemd[1]: dnsmasq.service: Control process exited, code=exited status=2
Nov 24 10:58:06 box.lan systemd[1]: dnsmasq.service: Failed with result 'exit-code'.
Nov 24 10:58:06 box.lan systemd[1]: Failed to start dnsmasq - A lightweight DHCP and caching DNS server.

If I stop systemd-resolved.service, then the tasks fail at the check for internet.

Any ideas or suggestions greatly appreciated. My Linux skills are thin.

@tim-moody
Contributor

@mrdavidhaag
Some settings in /etc/iiab/local_vars.yml may conflict as there has been confusion between default_vars.yml and various local_vars.yml files in the 1 line installer. Check /etc/iiab/local_vars.yml and see if you have the following settings. When both named and dnsmasq are installed and enabled the conflict you describe will occur. Then rerun the install.

dhcpd_install: False
dhcpd_enabled: False
named_install: False
named_enabled: False
block_DNS: False

dnsmasq_install: True
dnsmasq_enabled: True
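As a rough way to check for the conflict described above, the sketch below reads the two "enabled" settings from a vars file and reports whether both named and dnsmasq are turned on. The helper names yaml_value and dns_conflict are hypothetical, and the parsing assumes simple top-level "key: value" lines like those listed; it does not consult default_vars.yml when a key is missing.

```shell
#!/bin/sh
# Print the value of a top-level "key: value" line in a YAML file.
# Last occurrence wins; nested keys and quoted values are not handled.
yaml_value() {
    sed -n "s/^$1:[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$2" | tail -n 1
}

# Succeed when both named_enabled and dnsmasq_enabled are true in the
# given vars file, i.e. when both servers would compete for port 53.
dns_conflict() {
    named="$(yaml_value named_enabled "$1" | tr '[:upper:]' '[:lower:]')"
    dnsmasq="$(yaml_value dnsmasq_enabled "$1" | tr '[:upper:]' '[:lower:]')"
    [ "$named" = "true" ] && [ "$dnsmasq" = "true" ]
}

# Usage:
#   dns_conflict /etc/iiab/local_vars.yml && echo "named and dnsmasq both enabled"
```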

@mrdavidhaag
Author

mrdavidhaag commented Nov 24, 2018 via email

@holta added this to the 6.7 milestone Nov 25, 2018
@holta added the bug label Nov 25, 2018
@holta
Member

holta commented Nov 25, 2018

@mrdavidhaag
Some settings in /etc/iiab/local_vars.yml may conflict as there has been confusion between default_vars.yml and various local_vars.yml files in the 1 line installer. Check /etc/iiab/local_vars.yml and see if you have the following settings. When both named and dnsmasq are installed and enabled the conflict you describe will occur. Then rerun the install.

dhcpd_install: False
dhcpd_enabled: False
named_install: False
named_enabled: False
block_DNS: False

dnsmasq_install: True
dnsmasq_enabled: True

@jvonau can you confirm? And suggest alternatives for IIAB 6.7/master if dnsmasq is suddenly conflicting on Ubuntu 18.04 as is claimed above?

(Personally I've never had any such problems installing IIAB 6.7/master on Ubuntu 18.04 on VirtualBox, so I'd also ask if @mrdavidhaag 's use of Ubuntu 18.04 on ProxMox is perhaps somehow different than others' use of Ubuntu 18.04 on VirtualBox ?)

@mrdavidhaag
Author

mrdavidhaag commented Nov 25, 2018 via email

@mrdavidhaag
Author

mrdavidhaag commented Nov 25, 2018 via email

@holta
Member

holta commented Nov 25, 2018

Perhaps something with Ubuntu 18.04 or something changed in the iiab scripts.

Likely 18.04 and/or its latest version of dnsmasq changed?

Let's try to investigate on Raspbian Lite on Raspberry Pi 3 too, to shed light & solve this.

Profound thanks @mrdavidhaag for the detailed report.

@holta
Member

holta commented Nov 25, 2018

Related suggestion: PR #1303 ("bind9 and dnsmasq fighting over port 53")

@tim-moody
Contributor

I agree with Adam that I have not seen this problem on VirtualBox, so your environment should be considered.

Proxmox VE is a complete open-source platform for all-inclusive enterprise virtualization that tightly integrates KVM hypervisor and LXC containers, software-defined storage and networking functionality on a single platform, and easily manages high availability clusters and disaster recovery tools with the built-in web management interface.

My experience is that containers, certainly docker, come with some networking already configured that can conflict with IIAB installation.

@holta
Member

holta commented Nov 25, 2018

@jvonau writes:

Captive portal is starting DNSMASQ

Is it possible Captive Portal changes over the last 2 weeks have caused this problem?

In any case, @georgejhunt suggests trying: (#1303 (comment))

try "systemctl disable systemd-resolved", and reboot, before running our installer in the VM

@holta
Member

holta commented Nov 25, 2018

@jvonau writes:

Disable and stop is what appears to be needed but that may not be the complete solution..

@mrdavidhaag
Author

mrdavidhaag commented Nov 25, 2018 via email

@holta
Member

holta commented Nov 26, 2018

@jvonau writes:

I'm away from a computer ATM... Seems the subject VM/pre-canned hosting service is set up slightly differently from a stock real-hardware Ubuntu install. [?]

A bigger question I have: if this is a single-interface VM, why is captive portal being enabled at that point in the install, before the number of network interfaces is calculated? There would be no point in turning on captive portal with only a single interface involved, in the same way squid is only activated when there is a gateway present... I think CP should only be enabled when there is a LAN interface present.

@holta
Member

holta commented Nov 27, 2018

@georgejhunt & all, do you have a moment to respond to the ideas/questions above prior to Thursday's http://minutes.iiab.io 10AM NYC Time call?

Can someone with access to RPi 3 and Ubuntu 18.04 validate the situation there, in a known environment on known hardware or on a known VM environment?

@mrdavidhaag
Author

mrdavidhaag commented Nov 27, 2018 via email

@georgejhunt
Contributor

georgejhunt commented Nov 27, 2018 via email

@holta
Member

holta commented Nov 28, 2018

IIAB 6.7/master is definitely broken on Ubuntu 18.04 as tested on multiple NUCs: (unlike on Raspbian 2018-11-13 where IIAB 6.7/master still works)

TASK [4-server-options : Restart dnsmasq] **************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "Unable to start service dnsmasq: Job for dnsmasq.service failed because the control process exited with error code.\nSee "systemctl status dnsmasq.service" and "journalctl -xe" for details.\n"}

Further detail:

root@box:~# systemctl status dnsmasq
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
   Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2018-11-27 18:57:11 EST; 12min ago
  Process: 14669 ExecStart=/etc/init.d/dnsmasq systemd-exec (code=exited, status=2)
  Process: 14662 ExecStartPre=/usr/sbin/dnsmasq --test (code=exited, status=0/SUCCESS)

Nov 27 18:57:11 box.lan systemd[1]: Starting dnsmasq - A lightweight DHCP and caching DNS server...
Nov 27 18:57:11 box.lan dnsmasq[14662]: dnsmasq: syntax check OK.
Nov 27 18:57:11 box.lan dnsmasq[14669]: dnsmasq: failed to create listening socket for 172.18.96.1: Address already in use
Nov 27 18:57:11 box.lan systemd[1]: dnsmasq.service: Control process exited, code=exited status=2
Nov 27 18:57:11 box.lan systemd[1]: dnsmasq.service: Failed with result 'exit-code'.
Nov 27 18:57:11 box.lan systemd[1]: Failed to start dnsmasq - A lightweight DHCP and caching DNS server.

Thanks all for helping to narrow down what broke IIAB 6.7/master — which worked 2 weeks ago.

@mrdavidhaag
Author

mrdavidhaag commented Nov 28, 2018 via email

@holta changed the title from "Failed installation on ubuntu 18.04 (mini) using 6.7 1 line installer script" to "Failed installation on ubuntu 18.04 (mini) using IIAB 6.7 1-line installer script" on Nov 29, 2018
@holta
Member

holta commented Nov 30, 2018

Clean install of IIAB 6.7/master failed on a 100% fresh Ubuntu Server 18.04.1 VM:

TASK [4-server-options : Restart dnsmasq] **************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "Unable to start service dnsmasq: Job for dnsmasq.service failed because the control process exited with error code.\nSee "systemctl status dnsmasq.service" and "journalctl -xe" for details.\n"}

George / Tim indicated yesterday (during our voice call) that the above would work. It did not. Seems we have some kind of common/intermittent failure? FYI, all updates were applied to the 18.04.1 VM prior to beginning the IIAB install, using apt update && apt -y dist-upgrade && reboot, as per our "Do Everything from Scratch" install recommendations and 1-line installer.

@jvonau writes:

U-18 Server most likely has changed in that systemd-resolved is now enabled to provide a local DNS server.. ip a and the contents of [ /etc/resolv.conf ] would be helpful.. can't VPN in ATM heading to work...

If CP is to be stand-alone please clarify the use of code in iiab-gen-iptables that is part of the original PR.

@holta
Member

holta commented Nov 30, 2018

@jvonau requested output of ip a and more /etc/resolv.conf from the above failed IIAB 6.7/master install on Ubuntu Server 18.04.1 VM:

root@box:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:d1:14:a4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.161/24 brd 192.168.0.255 scope global enp0s3
       valid_lft forever preferred_lft forever
3: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 100
    link/none
    inet 10.8.0.50 peer 10.8.0.49/32 scope global tun0
       valid_lft forever preferred_lft forever
root@box:~# more /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "systemd-resolve --status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
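The nameserver line above shows this machine resolving through systemd-resolved's local stub at 127.0.0.53. A small sketch for checking this on other machines follows; the function name uses_resolved_stub is hypothetical, and the check only looks at the nameserver lines of the file it is given.

```shell
#!/bin/sh
# Succeed when the given resolv.conf (default: /etc/resolv.conf) lists
# systemd-resolved's stub resolver, 127.0.0.53, as a nameserver.
uses_resolved_stub() {
    grep -Eq '^nameserver[[:space:]]+127\.0\.0\.53([[:space:]]|$)' \
        "${1:-/etc/resolv.conf}"
}

# Usage:
#   uses_resolved_stub && echo "DNS lookups go through systemd-resolved"
```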

@holta
Member

holta commented Nov 30, 2018

"dnsmasq & systemd Causing Intermittent CPU Spikes"
https://unix.stackexchange.com/questions/417645/dnsmasq-systemd-causing-intermittent-cpu-spikes

@jvonau wonders if the above might be related?

@holta
Member

holta commented Nov 30, 2018

Clean install of IIAB 6.7/master failed on a 100% fresh Ubuntu Server 18.04.1 VM:

TASK [4-server-options : Restart dnsmasq] **************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "Unable to start service dnsmasq: Job for dnsmasq.service failed because the control process exited with error code.\nSee "systemctl status dnsmasq.service" and "journalctl -xe" for details.\n"}

FYI I get the exact same/above failure on 2 other Ubuntu Server 18.04.1 machines, that happen to be NUC PC's (10.8.0.6 & 10.8.0.34) rather than VM's.

FWIW these NUC PC's were updated using apt update && apt -y dist-upgrade && reboot prior to running cd /opt/iiab/iiab && git pull && ./iiab-install --reinstall

@holta changed the title from "Failed installation on ubuntu 18.04 (mini) using IIAB 6.7 1-line installer script" to "IIAB 6.7/master fails to install on latest Ubuntu 18.04: "Unable to start service dnsmasq"" on Nov 30, 2018
@jvonau
Contributor

jvonau commented Dec 1, 2018

The primary issue is that dnsmasq fails to start when it is installed by the system package manager, so this is an upstream problem: search Ubuntu's bug tracker for others who have run across this issue, or file a bug. I'll bet this could be reproduced by installing dnsmasq with just apt on a new VM, without using IIAB. The above observation on real hardware suggests that machines already deployed in the wild will most likely receive an update and suddenly break. The above link points to a possible solution for the conflict between systemd-resolved and dnsmasq.

@holta
Member

holta commented Dec 1, 2018

@jvonau & All, there are many concrete suggestions here:

How to avoid conflicts between dnsmasq and systemd-resolved?
https://unix.stackexchange.com/questions/304050/how-to-avoid-conflicts-between-dnsmasq-and-systemd-resolved

Which do you suggest we try?
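For reference, one commonly discussed option for this kind of conflict is turning off systemd-resolved's stub listener via the DNSStubListener setting in resolved.conf, instead of disabling the service outright. Nobody in this thread has confirmed that it fixes the IIAB install, so the following is only a sketch; the function name disable_stub_listener is hypothetical, and it uses GNU sed's -i option as found on Ubuntu.

```shell
#!/bin/sh
# Set DNSStubListener=no in a resolved.conf-style file, so that
# systemd-resolved stops binding 127.0.0.53:53. Replaces an existing
# (possibly commented-out) line, else appends one; appending assumes
# [Resolve] is the last section in the file.
disable_stub_listener() {
    conf="$1"
    if grep -q '^#\{0,1\}DNSStubListener=' "$conf"; then
        sed -i 's/^#\{0,1\}DNSStubListener=.*/DNSStubListener=no/' "$conf"
    else
        printf 'DNSStubListener=no\n' >> "$conf"
    fi
}

# Usage (as root), followed by a restart of the service:
#   disable_stub_listener /etc/systemd/resolved.conf
#   systemctl restart systemd-resolved
```

With the stub off, an /etc/resolv.conf that still points at 127.0.0.53 would stop resolving names, so it would likely also need to point at a real upstream resolver.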

@holta
Member

holta commented Mar 14, 2019

Seems unrelated to:

#1569 dnsmasq sometimes fails to start on Raspbian Desktop (possibly also Lite?)

@MrSteve2

MrSteve2 commented Mar 15, 2019 via email

@m-anish
Contributor

m-anish commented Mar 19, 2019

Okay, this is failing for me too:
http://paste.ubuntu.com/p/4njWzCnTtW/

Setup as follows:
Hardware:

  • Intel NUC

Network:

  • Internet through USB tethering
  • LAN port left empty
  • IIAB should set up in Gateway mode.

Latest commit hash: 23286e1

OS:
Ubuntu Server 18.04.2

@holta
Member

holta commented Mar 19, 2019

@m-anish Ubuntu 18.04.2 can mean kernel 4.15 or 4.18 — plz confirm your exact kernel?

The output of ip a and cat /etc/iiab/iiab.env also give critical context for @jvonau

@m-anish
Contributor

m-anish commented Mar 19, 2019

Running ./iiab-install --reinstall fixes this.

@m-anish
Contributor

m-anish commented Mar 19, 2019

  • iiab.env
  • iiab.ini
  • ip a
  • Linux box.lan 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@iiab iiab deleted a comment from m-anish Mar 19, 2019
@holta
Member

holta commented Mar 19, 2019

Running ./iiab-install --reinstall fixes this.

Did you happen to try repeated runs of ./iiab-network prior to that? What were the results if so?

(We need to publish a workaround if possible, as this error is proving to be quite common.)

@holta
Member

holta commented Mar 19, 2019

Here are the main Networking issues others are facing at this time — these might or might not be related — but are good to keep in mind as we seek a more bulletproof IIAB installation process:

@m-anish
Contributor

m-anish commented Mar 19, 2019

Did you happen to try repeated runs of ./iiab-network prior to that? What were the results if so?

No. This was the very first run.

After that, I tried starting dnsmasq manually and it worked.

After that, I tried ./iiab-install --reinstall and it worked.

@jvonau
Contributor

jvonau commented Mar 19, 2019

Manually restarting dnsmasq with success suggests that br0 was not fully initialized when dnsmasq was restarted...

@holta
Member

holta commented Mar 19, 2019

Manually restarting dnsmasq with success suggests that br0 was not fully initialized when dnsmasq was restarted...

Should the networking playbook enforce this prereq, and if so pause for 30sec or whatever until br0 appears before checking again?

(Or should we start with diagnostic/error messaging at that point in the playbook?)
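The "wait until br0 appears" idea could look roughly like the sketch below: poll for the interface, and print a diagnostic if it never shows up. This is not the playbook's actual code. The function name wait_for_iface and the NET_SYSFS override are hypothetical, and it assumes that the interface merely existing under /sys/class/net is a good enough signal, which may not hold if dnsmasq also needs the bridge's address to be configured.

```shell
#!/bin/sh
# Poll once per second until a network interface exists, giving up
# after a timeout. Usage: wait_for_iface IFACE [TIMEOUT_SECONDS]
# NET_SYSFS (hypothetical override, for testing) defaults to the
# Linux sysfs directory that has one entry per interface.
wait_for_iface() {
    iface="$1"
    timeout="${2:-30}"
    sysfs="${NET_SYSFS:-/sys/class/net}"
    waited=0
    while [ ! -e "$sysfs/$iface" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Gave up waiting ${timeout}s for $iface" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Usage:
#   wait_for_iface br0 30 && systemctl restart dnsmasq
```

Compared with a fixed pause, polling returns as soon as the bridge exists and still fails with a readable message on slow hardware.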

@holta
Member

holta commented Mar 19, 2019

@jvonau suggests:

Think the easiest would [be] to have a few second wait before restarting dnsmasq, to allow some time for the bridge to stabilize

Great. Let's make this fault-tolerant across different OS's & HW that each have their own wacky timings.

e.g. enforcing the prereqs we want and/or telling the implementer why it's failing if manual intervention is absolutely necessary.

@holta
Member

holta commented May 7, 2019

Working with @jvonau and Tony Anderson (PR #1636) on his Acer XC-885 desktop machine, there appears to be a similar timing issue with dnsmasq starting prior to the creation of br0 (bridge for LAN-side, Wi-Fi, hostapd):

Failed: Failed to start dnsmasq.service
see systemctl status dnsmasq.service for details
Failed: Failed to start: Network iiab-dnsmasq

It then hangs with [ok] Started Execute cloud user.final scripts.
[ok] Reached target Cloud-init target.

By default there's a built-in 1-second delay. We may now/soon want to change Line 43 of roles/network/defaults/main.yml from 1 second:

hostapd_wait: 1 

...up to something like 5 seconds?

hostapd_wait: 5

(And then of course run...)

cd /opt/iiab/iiab
./iiab-network
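The manual edit above could be scripted along the following lines. This is only a sketch: set_yaml_value is a hypothetical helper, it assumes a simple top-level "key: value" line, it drops any trailing comment on that line, and it relies on GNU sed's -i option.

```shell
#!/bin/sh
# Rewrite a top-level "key: value" line in a YAML file in place.
# Usage: set_yaml_value KEY VALUE FILE
# Fails (without touching the file) if KEY is not present. VALUE is
# inserted literally, so it should not contain "/" or "&".
set_yaml_value() {
    grep -q "^$1:" "$3" || return 1
    sed -i "s/^$1:.*/$1: $2/" "$3"
}

# Usage:
#   cd /opt/iiab/iiab
#   set_yaml_value hostapd_wait 5 roles/network/defaults/main.yml
#   ./iiab-network
```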

Context: this glitch only happens during IIAB's initial install, when IIAB's network role is run.

@holta
Member

holta commented Jul 9, 2019

@jvonau can you confirm this is now solved by your recent dnsmasq PRs?


8 participants