Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base-files: sysupgrade: Add 2 sec sleep into process KILL loop #13402

Merged
merged 1 commit into from Sep 12, 2023

Conversation

hnyman
Copy link
Contributor

@hnyman hnyman commented Sep 3, 2023

cc @robimarko @Ansuel

Add 2 seconds sleep between loop rounds in the final process termination loop in sysupgrade stage2.

This is needed especially for qualcommax/ipq807x, where ath11k wireless driver may have a long 10-20 seconds delay after termination before actually getting killed. This breaks often sysupgrade.

The current KILL loop in kill_remaining does all 10 kill attempts consecutively without any delay, as evidenced here in a failing sysupgrade. It does not allow any time for the process to finalize its internal termination.

Sat Sep  2 19:05:56 EEST 2023 upgrade: Sending TERM to remaining processes ...
Sat Sep  2 19:05:56 EEST 2023 upgrade: Sending signal TERM to hostapd (2122)
Sat Sep  2 19:05:56 EEST 2023 upgrade: Sending signal TERM to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending KILL to remaining processes ...
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2122)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Failed to kill all processes.
sysupgrade aborted with return code: 256

The change in this commit adds a 2 seconds delay after each kill loop round in order to allow some processes to more gracefully handle their internal termination.

The result is like this:

Sun Sep  3 11:15:10 EEST 2023 upgrade: Sending TERM to remaining processes ... 
Sun Sep  3 11:15:10 EEST 2023 upgrade: Sending signal TERM to hostapd (2309) 
Sun Sep  3 11:15:10 EEST 2023 upgrade: Sending signal TERM to hostapd (2324) 
Sun Sep  3 11:15:14 EEST 2023 upgrade: Sending KILL to remaining processes ... 
Sun Sep  3 11:15:14 EEST 2023 upgrade: Sending signal KILL to hostapd (2309) 
[  699.827521] br-lan: port 7(hn5wpa2r) entered disabled state 
[  699.908673] device hn5wpa2r left promiscuous mode 
[  699.908721] br-lan: port 7(hn5wpa2r) entered disabled state 
[  701.038029] br-lan: port 6(hn5wpa3) entered disabled state 
Sun Sep  3 11:15:16 EEST 2023 upgrade: Sending signal KILL to hostapd (2324) 
[  702.058256] br-lan: port 5(hn2wlan) entered disabled state 
[  709.250063] stage2 (8237): drop_caches: 3
Sun Sep  3 11:15:25 EEST 2023 upgrade: Switching to ramdisk...

The delay introduced here only kicks in if there is some process that does not get terminated by the first TERM call. Then there is at least one 2 sec wait after the first KILL loop round.

This commit is related to discussion in PRs #12235 and #12632

Compile & run-tested for qualcommax/ipq807x DL-WRX36

@github-actions github-actions bot added the core packages pull request/issue for core (in-tree) packages label Sep 3, 2023
@xabolcs
Copy link
Contributor

xabolcs commented Sep 3, 2023

This change slows down the kill_remain function to O(n) where n is the count of the slow terminating processes like the ath11k.

Wouldn't be better to wait between the loop rounds and not between the process kills (as in your first proposal)?

Or - hopefully - I read the code wrong! 😅

@hnyman
Copy link
Contributor Author

hnyman commented Sep 3, 2023

You are right.
The current code causes a 2 sec delay for each process-to-be-killed found in the KILL loop round (but nothing on TERM round). In practice, the sleep only gets applied if a process has been found for KILL, so there shouldn't be too many sleeps triggered. But yes, there might be quite much sleep if there are lots of impossible to kill processes.

My first proposal discussed in the other PR was to place a "sleep 2" to the end of the while loop. But that caused a 2 sec delay also for the first TERM round. And then another 2 sec at the end of the first KILL round even if there had been nothing to be killed. That sounded bad to me, so I changed to the current approach.

If we want to avoid the (probably rare) possibility of really multiple processes-to-be-KILLed causing a long delay inside the for statement, we might again place the "sleep 2" as the last item of the while loop, but make it conditional for $run=true , which only gets set true if we are on the KILL round and there has been something to be killed.

[ "$run" = true ] && sleep 2

Then there would be just one possible sleep per while loop round.

@@ -129,6 +129,7 @@
                        v "Failed to kill all processes."
                        exit 1
                }
+               [ "$run" = true ] && sleep 2 && v "Waited 2 seconds."
        done
 }
 

(Still, the quite simplest would be to add the simple "sleep 2" there, like I initially did, but that would prolong sysupgrade by 4 secs always.)

EDIT:
This would lead into:

Watchdog did not previously reset the system
[67468.883095] br-lan: port 5(hn2wlan) entered disabled state
[67468.922767] device hn2wlan left promiscuous mode
[67468.922814] br-lan: port 5(hn2wlan) entered disabled state
Mon Sep  4 18:43:32 EEST 2023 upgrade: Sending TERM to remaining processes ...
Mon Sep  4 18:43:32 EEST 2023 upgrade: Sending signal TERM to hostapd (2307)
Mon Sep  4 18:43:32 EEST 2023 upgrade: Sending signal TERM to hostapd (2323)
Mon Sep  4 18:43:36 EEST 2023 upgrade: Sending KILL to remaining processes ...
Mon Sep  4 18:43:36 EEST 2023 upgrade: Sending signal KILL to hostapd (2307)
Mon Sep  4 18:43:36 EEST 2023 upgrade: Sending signal KILL to hostapd (2323)
[67474.135183] br-lan: port 7(hn5wpa2r) entered disabled state
[67474.172533] device hn5wpa2r left promiscuous mode
[67474.172575] br-lan: port 7(hn5wpa2r) entered disabled state
Mon Sep  4 18:43:38 EEST 2023 upgrade: Waited 2 seconds.
[67475.282011] br-lan: port 6(hn5wpa3) entered disabled state
[67481.272647] stage2 (14674): drop_caches: 3
Mon Sep  4 18:43:44 EEST 2023 upgrade: Switching to ramdisk...

@asvio
Copy link

asvio commented Sep 4, 2023

I've tried it on nbg7815(ipq8074) and it works even while doing iperf3 with a wifi client to wired server.
Thanks.

PD: I've algo tried with 23.05 and algo work. I think it is a must to have also there.

@hnyman
Copy link
Contributor Author

hnyman commented Sep 4, 2023

Yeah, all three approaches work.

1 )
always sleep 2 sec at the end of loop round

@@ -129,6 +129,7 @@
 			v "Failed to kill all processes."
 			exit 1
 		}
+		sleep 2
 	done
 }
 

2 )
sleep 2 sec for each process killed (inside the for iteration statement)

@@ -121,7 +121,7 @@ kill_remaining() { # [ <signal> [ <loop> ] ]
 			v "Sending signal $sig to $name ($pid)"
 			kill -$sig $pid 2>/dev/null
 
-			[ $loop -eq 1 ] && run=true
+			[ $loop -eq 1 ] && sleep 2 && run=true
 		done
 
 		let loop_limit--

3 )
sleep 2 sec at the end of loop round if something was killed (and write a log item)

@@ -129,6 +129,7 @@
 			v "Failed to kill all processes."
 			exit 1
 		}
+		[ "$run" = true ] && sleep 2 && v "Waited 2 seconds."
 	done
 }
 

This PR's current commit reflects alternative 2),
while 1) would be the simplest and most straightforward approach,
and 3) would probably be the most correct one (that only sleeps if necessary).

All three approaches have so far produced me 100% success rate (in 20-30 test sysupgrades done in total).

@mpratt14
Copy link
Contributor

mpratt14 commented Sep 4, 2023

is 1 second not enough?

or combine that with increasing the number of loops before ignoring it?

@robimarko
Copy link
Contributor

Without a delay these loops will go through extremely fast, so this LGTM.

Reviewed-by: Robert Marko <robimarko@gmail.com>

@hnyman
Copy link
Contributor Author

hnyman commented Sep 6, 2023

@robimarko
I added your reviewed-by

@xabolcs
Copy link
Contributor

xabolcs commented Sep 6, 2023

This PR's current commit reflects alternative 2),

@hnyman, may I ask to improve the commit description? Current description (see below) suggests the 1st approach.

The change in this commit adds a 2 seconds delay after each kill loop
round in order to allow some processes to more gracefully handle their
internal termination.

@hnyman
Copy link
Contributor Author

hnyman commented Sep 8, 2023

may I ask to improve the commit description? Current description (see below) suggests the 1st approach.

I edited the commit message to reflect the behaviour more accurately .

@robimarko @Ansuel
Somebody to merge it?

@robimarko
Copy link
Contributor

I dont have commit rights, so somebody else has do to merge it

@Ansuel
Copy link
Member

Ansuel commented Sep 8, 2023 via email

Add 2 seconds sleep after each forcibly killed/tried-to-kill process
in the final process termination loop in sysupgrade stage2.

This is needed especially for qualcommax/ipq807x, where ath11k
wireless driver may have a long 10-20 seconds delay after termination
before actually getting killed. This often breaks sysupgrade.

The current KILL loop in kill_remaining does all 10 kill attempts
consecutively without any delay, as evidenced here in a failing sysupgrade.
It does not allow any time for the process to finalize its internal
termination.

Sat Sep  2 19:05:56 EEST 2023 upgrade: Sending TERM to remaining processes ...
Sat Sep  2 19:05:56 EEST 2023 upgrade: Sending signal TERM to hostapd (2122)
Sat Sep  2 19:05:56 EEST 2023 upgrade: Sending signal TERM to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending KILL to remaining processes ...
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2122)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Sending signal KILL to hostapd (2138)
Sat Sep  2 19:06:00 EEST 2023 upgrade: Failed to kill all processes.
sysupgrade aborted with return code: 256

The change in this commit adds a 2 seconds delay after each kill attempt
in order to allow some processes to more gracefully handle their
internal termination.

The result is like this:

Sun Sep  3 11:15:10 EEST 2023 upgrade: Sending TERM to remaining processes ...
Sun Sep  3 11:15:10 EEST 2023 upgrade: Sending signal TERM to hostapd (2309)
Sun Sep  3 11:15:10 EEST 2023 upgrade: Sending signal TERM to hostapd (2324)
Sun Sep  3 11:15:14 EEST 2023 upgrade: Sending KILL to remaining processes ...
Sun Sep  3 11:15:14 EEST 2023 upgrade: Sending signal KILL to hostapd (2309)
[  699.827521] br-lan: port 7(hn5wpa2r) entered disabled state
[  699.908673] device hn5wpa2r left promiscuous mode
[  699.908721] br-lan: port 7(hn5wpa2r) entered disabled state
[  701.038029] br-lan: port 6(hn5wpa3) entered disabled state
Sun Sep  3 11:15:16 EEST 2023 upgrade: Sending signal KILL to hostapd (2324)
[  702.058256] br-lan: port 5(hn2wlan) entered disabled state
[  709.250063] stage2 (8237): drop_caches: 3
Sun Sep  3 11:15:25 EEST 2023 upgrade: Switching to ramdisk...

The delay introduced here only kicks in if there is some process that
does not get terminated by the first TERM call. Then there is at least
one 2 sec wait after the first KILL loop round.

This commit is related to discussion in PRs openwrt#12235 and openwrt#12632

Signed-off-by: Hannu Nyman <hannu.nyman@iki.fi>
Reviewed-by: Robert Marko <robimarko@gmail.com>
@openwrt-bot openwrt-bot merged commit 47d56ae into openwrt:main Sep 12, 2023
6 checks passed
@Ansuel
Copy link
Member

Ansuel commented Sep 12, 2023

@hnyman thanks merged and backported to 23.05.
(and in theory this would make ipq807x stable and ready for 23.05)

@hnyman hnyman deleted the killsleep branch September 12, 2023 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core packages pull request/issue for core (in-tree) packages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants