mwan3: opkg remove or stop on interface up .lock > lock... #13704

Closed
wulfy23 opened this issue Oct 18, 2020 · 7 comments

wulfy23 commented Oct 18, 2020

Maintainer: @feckert
Environment: master (and/or PR), aarch64
Description: opkg remove or stop on interface up .lock > lock...

not a big issue... but worth sending through FYI...

i've had a firstboot script that successfully removes current master ( and installs the PR ) which ran ok...

echo "mwan3 workaround https://github.com/openwrt/packages/pull/13169"
opkg remove --force-depends mwan3
opkg install $MWAN3IPK 2>/dev/null
/etc/init.d/mwan3 disable &>/dev/null
/etc/init.d/mwan3 stop &>/dev/null

a recent change to the logic... performs some iptables / firewall / interface up / ipset logic just prior to the above commands... and the system hangs... ( mwan3 stop? )

seems killing the first lock gets the system 'unhanged'... so something in the opkg install / mwan3 stop / startup process... blocks the stop command... guessing that would be within the PR code...

ps w ( at hang )

20699 root      1212 S    lock /var/run/mwan3.lock
20728 root      1444 S    /bin/sh /usr/sbin/mwan3track wan
20729 root      1444 S    /bin/sh /usr/sbin/mwan3track wan6
20730 root      1444 S    /bin/sh /usr/sbin/mwan3track wanb
20731 root      1588 S    /bin/sh /usr/sbin/mwan3rtmon ipv4
20732 root      1588 S    /bin/sh /usr/sbin/mwan3rtmon ipv6
20816 root      1212 S    lock /var/run/mwan3.lock
20826 root      1640 S    /bin/sh /etc/rc.common /etc/init.d/mwan3 stop
20844 root      1212 S    lock /var/run/mwan3.lock
20878 root      1212 S    lock /var/run/mwan3.lock

wulfy23 commented Oct 18, 2020

to clarify and unless opkg is involved I think there might be two issues here...

  1. insertion of veth(etc) while mwan3(pr?) is in 'lock/learn' mode causes it to hang?

and / or

  2. mwan3(pr?) does not respond to stop whilst in 'lock/learn' mode...

feckert self-assigned this Oct 20, 2020

aaronjg commented Oct 25, 2020

It appears that there is a deadlock somewhere.

Can you turn on debug logging in the mwan3 common.sh, and uncomment the log lines in mwan3_lock and mwan3_unlock, and then report the mwan3 lines from the system log?

Also, could you install the full ps command from procps-ng-ps and then show the process tree from ps auxf? That will provide additional information on what is causing the deadlock.
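
A rough sketch of how the requested information could be gathered on the router. The function names `mwan3_lines` and `collect_mwan3_debug` are made up for illustration; `logread` and `opkg` are the stock OpenWrt tools, and the sketch assumes the package feeds are reachable and that `ps` resolves to the procps-ng version once the package is installed. Turning on the debug lines in common.sh is a manual edit and is not shown, since the exact lines are not quoted in this thread.

```shell
#!/bin/sh
# mwan3_lines: keep only the lines that mention mwan3 (reads stdin).
mwan3_lines() {
    grep 'mwan3'
}

# collect_mwan3_debug: meant to be run by hand while the hang is in
# progress. It is only defined here, never called automatically.
collect_mwan3_debug() {
    opkg update && opkg install procps-ng-ps   # full ps, as suggested above
    ps auxf                                    # process tree with nesting
    logread | mwan3_lines                      # mwan3 entries from the system log
}
```

Calling `collect_mwan3_debug > /tmp/mwan3-debug.txt` from a second shell would capture everything in one file that can be attached to the issue.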

I have made mwan3 much more modular in the past couple of P/Rs, so we can likely get rid of a lot of the locks, but it still shouldn't deadlock under the current setup.

wulfy23 commented Oct 25, 2020

thanks for the debug tips... modding confs is tricky as the lock occurs post ipk install...

having said that... i've attempted a few non-install time remove-install / stop / disable whilst calling the previous firewall-ipset script... and it seems i'm unable to re-trigger this behavior...

one too many variables at play ... and considering the pushed version has newer code... best to close this report down me thinks... some other points in case something like this crops up again...

  • i'm beginning to suspect it was something to do with the previous config file and a part-time usb0 wan interface...

wulfy23 closed this as completed Oct 25, 2020
wulfy23 reopened this Nov 5, 2020

wulfy23 commented Nov 5, 2020

EDIT: just came across the cleanup pr ( #13853 ) which looks like you've already tracked this down / resolved... feel free to close this... thank you!

ok... we've triggered this baby again... this time with alternate conditions so debugging was somewhat simpler...

this time... : again firstboot... + current master.... issued stop early.... lock deadlock....

r14863-4a976beff4@mwan3-2.10.1-1

seems the issue is with 16-mwan user... and/or the 'running' logic ( [ -d /var/run/mwan3 ] ? )

```
[ "$MWAN3_SHUTDOWN" != 1 ] && mwan3_lock "$ACTION" "$DEVICE-user"

[ "$MWAN3_SHUTDOWN" != 1 ] && ! /etc/init.d/mwan3 running && {
	mwan3_unlock "$ACTION" "$DEVICE-user"
	exit 0
}
```

```
4388 root 1380 S sh /etc/custom/rc.custom 4306
4824 root 1376 S sh /etc/custom/firstboot/15-services
5606 root 1280 S /bin/sh /sbin/hotplug-call iface
5744 root 1496 S /bin/sh /sbin/hotplug-call iface
5746 root 1644 S /bin/sh /etc/rc.common /etc/init.d/mwan3 stop
5755 root 1212 S lock /var/run/mwan3.lock
5756 root 1636 S /bin/sh /etc/rc.common /etc/init.d/mwan3 running
5783 root 1212 S flock 1000
5784 root 1212 S lock /var/run/mwan3.lock
5892 root 1568 SN /usr/sbin/nlbwmon -o /var/lib/nlbwmon -b 104

[root@dca632 /usbstick 41°]# strace -p 5746
strace: Process 5746 attached
wait4(-1, ^Cstrace: Process 5746 detached
<detached ...>

[root@dca632 /usbstick 41°]# strace -p 5756
strace: Process 5756 attached
wait4(-1, ^Cstrace: Process 5756 detached
<detached ...>

[root@dca632 /usbstick 41°]# strace -p 5755
strace: Process 5755 attached
restart_syscall(<... resuming interrupted io_setup ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7fecd00da0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7fecd00da0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 5755 detached
<detached ...>

[root@dca632 /usbstick 42°]# strace -p 5784
strace: Process 5784 attached
flock(3, LOCK_EX^Cstrace: Process 5784 detached
<detached ...>

[root@dca632 /usbstick 41°]# cat /proc/5784/cmdline
lock/var/run/mwan3.lock

[root@dca632 /usbstick 41°]# cat /proc/5784/stat
5784 (lock) S 5746 1 1 0 -1 4194304 83 0 0 0 0 0 0 0 20 0 1 0 1774 1241088 147 18446744073709551615 4194304 4603048 549739605968 0 0 0 0 4096 0 1 0 0 17 0 0 0 0 0 0 4670576 4673569 356659200 549739610002 549739610027 549739610027 549739610094 0

[root@dca632 /usbstick 42°]# cat /proc/5746/cmdline
/bin/sh/etc/rc.common/etc/init.d/mwan3stop

[root@dca632 /usbstick 41°]# cat /proc/5756/cmdline
/bin/sh/etc/rc.common/etc/init.d/mwan3running

[root@dca632 /usbstick 41°]# cat /proc/5756/stat
5756 (sh) S 5744 1 1 0 -1 4194304 498 821 0 0 1 0 0 0 20 0 1 0 1772 1675264 359 18446744073709551615 4194304 4603048 549658265168 0 0 0 0 4100 65538 1 0 0 17 2 0 0 0 0 0 4670576 4673569 876306432 549658267414 549658267463 549658267463 549658267622 0

[root@dca632 /usbstick 41°]# cat /proc/5744/cmdline
/bin/sh/sbin/hotplug-calliface

[root@dca632 /usbstick 41°]# ps abcde
root 2553 0.0 0.0 1844 1184 ? S 03:26 0:00 /sbin/netifd
root 3307 0.0 0.0 1212 536 ? S 03:26 0:00 _ udhcpc -p /var/run/udhcpc-eth1.pid -s /lib/netifd/dhcp.script -f -
root 3306 0.0 0.0 972 716 ? S 03:26 0:00 _ odhcp6c -s /lib/netifd/dhcpv6.script -Ntry -P0 -t120 eth1
root 5606 0.0 0.0 1280 1076 ? S 03:26 0:00 _ /bin/sh /sbin/hotplug-call iface
root 5744 0.0 0.0 1496 1100 ? S 03:26 0:00 _ /bin/sh /sbin/hotplug-call iface
root 5756 0.0 0.0 1636 1436 ? S 03:26 0:00 _ /bin/sh /etc/rc.common /etc/init.d/mwan3 running
root 5783 0.0 0.0 1212 528 ? S 03:26 0:00 _ flock 1000

root 4306 0.0 0.0 1308 1048 ? S 03:26 0:00 /bin/sh /etc/rc.common /etc/rc.d/S95done boot
root 4328 0.0 0.0 1376 1176 ? S 03:26 0:00 _ sh /etc/rc.local
root 4388 0.0 0.0 1380 1180 ? S 03:26 0:00 _ sh /etc/custom/rc.custom 4306
root 4824 0.0 0.0 1376 1116 ? S 03:26 0:00 _ sh /etc/custom/firstboot/15-services
root 5746 0.0 0.0 1644 1388 ? S 03:26 0:00 _ /bin/sh /etc/rc.common /etc/init.d/mwan3 stop
root 5784 0.0 0.0 1212 588 ? S 03:26 0:00 _ lock /var/run/mwan3.lock
```
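
The parent lookups above were done by hand from `/proc/PID/stat`, where the fourth field is the parent PID. A small helper along these lines could walk the chain automatically. It is only a sketch: `ppid_from_stat` and `ancestry` are hypothetical names, and it assumes a Linux-style `/proc`.

```shell
#!/bin/sh
# ppid_from_stat: read one /proc/PID/stat line on stdin and print field 4
# (the parent PID). Everything up to the last ") " is dropped first, so a
# command name containing spaces does not shift the fields.
ppid_from_stat() {
    read -r line || [ -n "$line" ] || return 1
    rest=${line##*') '}
    set -- $rest            # $1 = state, $2 = ppid
    echo "$2"
}

# ancestry PID: print "pid<TAB>cmdline" for PID and each of its parents.
ancestry() {
    pid=$1
    while [ "${pid:-0}" -gt 0 ] && [ -r "/proc/$pid/stat" ]; do
        printf '%s\t%s\n' "$pid" "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
        pid=$(ppid_from_stat < "/proc/$pid/stat")
    done
}
```

For example, `ancestry 5784` during the hang would list the blocked `lock` process followed by whichever scripts spawned it, without needing the full procps `ps`.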

aaronjg commented Nov 6, 2020

The cleanup commit that you mention is just code cleanup, and wouldn't change behavior. (@feckert, correct me if I'm wrong here).

With the information provided, I can reproduce the problem, and it is in the line you pointed to.

Basically, procd holds the lock on /tmp/lock/procd_mwan3.lock but not the one on /var/run/mwan3.lock, while netifd (via hotplug) holds the lock on /var/run/mwan3.lock but not the one on /tmp/lock/procd_mwan3.lock. Each side is waiting for the lock the other holds, so we have a deadlock.
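
The pattern can be reproduced outside mwan3 with two scratch lock files. This is a minimal sketch, not the mwan3 code: the paths, `try_lock`, `worker` and `run_demo` are illustrative names, and it assumes a `flock` utility that accepts a file descriptor number and `-n`. Polling with a retry limit is used so that the demo gives up instead of hanging the way the real processes did.

```shell
#!/bin/sh
# Scratch files standing in for the two real locks; nothing here touches
# an actual mwan3 installation.
demo_dir=$(mktemp -d)
LOCK_A="$demo_dir/procd_mwan3.lock"   # stand-in for the procd lock
LOCK_B="$demo_dir/mwan3.lock"         # stand-in for /var/run/mwan3.lock

# try_lock FD TRIES: attempt a non-blocking flock once per second.
try_lock() {
    fd=$1 tries=$2
    while [ "$tries" -gt 0 ]; do
        flock -n "$fd" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# worker NAME FIRST SECOND: take FIRST, pause, then try to take SECOND.
worker() {
    (
        flock 8 || exit 1   # first lock
        sleep 1             # give the peer time to take its own first lock
        if try_lock 9 2; then
            echo "$1: got both locks"
        else
            echo "$1: gave up waiting for its second lock"
        fi
        sleep 1             # keep holding the first lock so the peer times out too
    ) 8>>"$2" 9>>"$3"
}

# run_demo: the two workers take the same locks in opposite order.
run_demo() {
    worker stop    "$LOCK_A" "$LOCK_B" &
    worker hotplug "$LOCK_B" "$LOCK_A" &
    wait
}
```

On an otherwise idle system, `run_demo` should end with both workers giving up, because each holds the lock the other wants. If both workers are given the locks in the same order, the second one simply waits for the first to finish and both complete.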

The purpose of the lock is to make sure that hotplug state transitions and mwan3 state transitions do not interfere with each other. Now that everything is on procd, we don't need separate locking mechanisms, so we can replace the locks on /var/run/mwan3.lock with calls to procd_lock from the procd API.
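
To illustrate why a single lock avoids the problem, here is a sketch in which both code paths go through one shared lock file. It does not call the procd API: `with_service_lock`, `bump` and `run_single_lock_demo` are hypothetical names, and the lock file is a scratch file created by `mktemp`.

```shell
#!/bin/sh
# One scratch lock file shared by every caller.
DEMO_LOCK=$(mktemp)

# with_service_lock CMD...: run CMD while holding the single shared lock.
with_service_lock() {
    (
        flock 9 || exit 1   # blocks until the lock is free
        "$@"
    ) 9>>"$DEMO_LOCK"
}

# bump FILE: a read-modify-write that would race without the lock.
bump() {
    n=$(cat "$1")
    sleep 1
    echo $((n + 1)) > "$1"
}

# run_single_lock_demo: two concurrent callers, serialized by the lock.
run_single_lock_demo() {
    counter=$(mktemp)
    echo 0 > "$counter"
    with_service_lock bump "$counter" &   # e.g. the stop path
    with_service_lock bump "$counter" &   # e.g. the hotplug path
    wait
    cat "$counter"
}
```

With only one lock there is no ordering to get wrong: the second caller waits for the first, then runs, so both increments are applied.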

Thanks for identifying the issue. I'll work on a fix for this, and post again when I have a potential solution to test.

aaronjg added a commit to aaronjg/openwrt-packages that referenced this issue Nov 6, 2020
Replace locks on /var/run/mwan3.lock with locks via procd.

This fixes a deadlock issue where mwan3 stop would have a procd
lock, but a hotplug script would have the /var/run/mwan3.lock

Locking can be removed from mwan3rtmon since:
1) procd will have sent the KILL signal to the process during
shutdown, so it will not add routes to already removed interfaces on
mwan3 shutdown and
2) mwan3rtmon checks if an interface is active based on the
mwan3_iface_in_<IFACE> entry in iptables, and the hotplug script
always adds this before creating the route table and removes it
before deleting the route table

Fixes github issue openwrt#13704
(openwrt#13704)

aaronjg commented Nov 6, 2020

@wulfy23 Can you give this P/R a try and see if that fixes the issue?

#13857

feckert pushed a commit to TDT-AG/packages that referenced this issue Nov 6, 2020
aaronjg added a commit to aaronjg/openwrt-packages that referenced this issue Nov 6, 2020
Signed-off-by: Aaron Goodman <aaronjg@stanford.edu>

wulfy23 commented Nov 7, 2020

closing as issue has been addressed, thank you...

wulfy23 closed this as completed Nov 7, 2020
pprindeville pushed a commit to pprindeville/packages that referenced this issue Dec 19, 2020