watchcat: clarify restart logs and add optional failure timer reset#29326
watchcat: clarify restart logs and add optional failure timer reset#29326dhrm1k wants to merge 2 commits into
Conversation
30398ef to
457f29e
Compare
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adjusts watchcat network monitoring behavior so the failure timer is reset after recovery actions complete, reducing repeated restarts during sustained outages, and clarifies related log messaging.
Changes:
- Update log messages to clarify that actions occur only after the configured failure period is reached.
- Reset the failure timer based on the time when the recovery action finishes.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
457f29e to
2320ed8
Compare
|
Fixed it, thanks. I replaced that |
|
@dhrm1k I am not in favour of this change. I think restarting the interface repeatedly is unavoidable if we want to avoid failing to restart once the outage is over. I believe it would defeat the purpose of watchcat in the event of an outage. |
|
I thought mainly around the timing behavior after the restart But I understand your concern: if the repeated restart behavior is treated In that case, I am happy to drop this change and keep the log wording |
|
@dhrm1k Sounds good. Thank you. |
2320ed8 to
01d7742
Compare
|
I have just kept the log changes now. |
|
LGTM. Can you update PR description (and close the associated issue, with a reference to the discussion here so others know why)? |
|
@dhrm1k I've been thinking about this. I think there a couple of difference scenarios here:
This perhaps should be a configuration option, where you can either insist on a reboot/restart after an outage, or where you only want a restart/reboot if the outage persists. It sounds like you have a 2. scenario, whereas I am used to a 1. scenario. Would you be up for adding such an option? |
|
Yes, I’d be up for that. I think you’re right that these are really two different use cases. One is where you always want the recovery action once the outage window has The other is where you only want to take the recovery action if the outage is My use case was definitely closer to the second one. I can look at adding this as a configuration option so the current behavior |
|
I’ve added the configuration option version on top of the existing branch. The current behavior remains the default. The new I also tested the behavior on OpenWrt 24.10.4 x86/64 to confirm the |
danielfdickinson
left a comment
There was a problem hiding this comment.
Generally looks really good. I've made some comments/queries inline. Thank you.
| option pinghosts '8.8.8.8' | ||
| option forcedelay '30' | ||
| # For restart_iface and run_script, start a fresh failure window after | ||
| # each recovery action finishes before allowing another one. |
There was a problem hiding this comment.
Minor tweak: instead of 'another one' could you say 'another restart'?
There was a problem hiding this comment.
yes, i did that now.
| procd_set_param command /usr/bin/watchcat.sh \ | ||
| "restart_iface" "$period" "$pinghosts" "$pingperiod" \ | ||
| "$pingsize" "$interface" "$mmifacename" "$unlockbands" \ | ||
| "$addressfamily" "" "$reset_failure_timer" |
There was a problem hiding this comment.
Thanks. Appreciate you breaking up that command line. and the one below.
| mm_iface_unlock_bands="$7" | ||
| address_family="$8" | ||
| script="$9" | ||
| reset_failure_timer="${10}" |
There was a problem hiding this comment.
Does "${10}" work in ash? Might it be better use shift and use "$9" ?
There was a problem hiding this comment.
yes, "${10}" work in ash, still i have switched the handling to use shift plus named
variables rather than relying on ${10} / ${11} in the mode dispatch now.
| # args from init script: period pinghosts pingperiod pingsize interface | ||
| # mmifacename unlockbands addressfamily script reset_failure_timer | ||
| watchcat_monitor_network "$2" "$3" "$4" "$5" "$6" "$7" "$8" \ | ||
| "$9" "${10}" "${11}" |
There was a problem hiding this comment.
As above with "${10}" and "${11}". Might be better to use shift. Also (this is is an enhancement, not a blocker) named variables that capture the incoming parameters and use those variables here.
There was a problem hiding this comment.
it no longer relies on ${10} / ${11} there now
|
Could you also bump PKG_RELEASE in the Makefile? |
c4b6df2 to
75d124e
Compare
Clarify the restart_iface logging so the message reflects that the configured action happens only after the failure period is reached. Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>
75d124e to
46dae48
Compare
|
I have pushed a new commit with requested changes. I have also bumped the PKG_RELEASE in Makefile. (Also repushed a older commit because that was failing the CI tests because i didn't sign that commit.) |
danielfdickinson
left a comment
There was a problem hiding this comment.
LGTM. Thank you!
|
@BKPepe Any chance of a Copilot on this? |
| # Optionally start a fresh failure window after the recovery action | ||
| # finishes instead of continuing to count the original outage. | ||
| if [ "$reset_failure_timer" -eq 1 ]; then | ||
| time_now="$(cat /proc/uptime)" | ||
| time_now="${time_now%%.*}" |
There was a problem hiding this comment.
This sounds plausible. Having the best of both worlds would be good. I've noticed some timing is not as I would expect in my PR for other changes: #29417 , so this is worth a more complete analysis (e.g. diagramming and documenting the expected timing and behaviour and making the the script delivers).
There was a problem hiding this comment.
That makes sense.
I agree this probably needs a more explicit look at the timing model.
I’ll map out the expected behavior for the different cases first:
- current/default behavior
- reset_failure_timer disabled
- reset_failure_timer enabled
and then check the script against that before changing the timing logic
further.
There was a problem hiding this comment.
This sounds plausible. Having the best of both worlds would be good. I've noticed some timing is not as I would expect in my PR for other changes: #29417, so this is worth a more complete analysis (e.g. diagramming and documenting the expected timing and behaviour and making the the script delivers).
I want to make sure I understand what you meant here.
For example, if pings have already been failing long enough to trigger the
recovery action, and that recovery action itself takes some time to finish,
should the default path still count that recovery time as part of the same
outage window, or should it treat the post-action timestamp as the new
baseline even when reset_failure_timer is not enabled?
I can see both interpretations, so I just want to make sure I am reading the
intended behavior correctly.
Can you elaborate a bit please.
There was a problem hiding this comment.
I'd have to diagram to be sure what I think is best. I think if the recovery action completes after the network is back, that it should not restart again, but if the outage exceeds two triggers there should be two restarts, in the default case, and only one in with the new flag you are adding (IIUC).
It's a question of a) what caused the outage, and b) what is needed to recover from it.
For some interfaces (e.g. WireGuard or OpenVPN) if there is an internet outage, the interface will still not be working when the internet comes back, until the interface is restarted. Since we are pinging through the interface that is down, we cannot know the internet came back up so we need the restart to continue to be triggered throughout the outage, so that once the internet is back the interface comes back up. For a regular interface OTOH, this is overkill, hence adding the flag.
There was a problem hiding this comment.
Thanks, this help. I think I have a much beter feel now for what you mean.
I tried writing out the timing in a very simple example, mainly to check that
I am understanding the split between the default behavior and the new flag the
way you intended.
Say:
failure_period=60- pings are failing continuously
- the recovery action itself takes 15 seconds
Then the rough timeline is:
t=0 outage starts
t=60 failure period reached, recovery action starts
t=75 recovery action finishes
The way I am reading your comment, the default behavior should keep retrying
through a sustained outage.
So in the default case, if the outage is still ongoing, the time spent inside
the recovery action should still contribute enough to the same outage window
that multiple restarts can happen during a long outage.
That would look something like:
t=0 outage starts
t=60 restart #1
t=75 restart #1 finishes
t=120 restart #2
That makes sense to me for the WireGuard/OpenVPN type of case you described:
the upstream internet may be back, but the monitored path through the tunnel
is still down, so watchcat needs to keep retrying during the outage or the
tunnel may never get kicked back into service.
Then with reset_failure_timer=1, I read that as the more conservative mode:
once the recovery action finishes, a fresh failure window starts from there,
so the action duration no longer counts toward the next trigger.
That would look more like:
t=0 outage starts
t=60 restart #1
t=75 restart #1 finishes
t=135 restart #2 would be the earliest next retry
So if I am understanding you correctly:
- default mode should continue retrying through a sustained outage
reset_failure_timer=1should suppress that repeated retry behavior by
starting a fresh failure window after the action completes
The one part I still wanted to make sure I was reading correctly was this:
if the recovery action completes after the network is back, that it should
not restart again
My reading of that is:
- in the default case, retries should continue only while failed checks keep
happening - so if connectivity has actually recovered by the next check, there should be
no further restart - but if failed checks continue and the outage is long enough to cross another
trigger window, another restart is expected
Does this match what you had in your mind?
There was a problem hiding this comment.
Thank you for the work you have put in!
The timing looks like what I was thinking, but being concrete gives me more confidence in what I am saying.
My reading of that is:
in the default case, retries should continue only while failed checks keep happening
so if connectivity has actually recovered by the next check, there should be no further restart
but if failed checks continue and the outage is long enough to cross another
trigger window, another restart is expected
Does this match what you had in your mind?
Yes, that is what was thinking. Thank you again.
There was a problem hiding this comment.
I would suggest adding a TIMINGS.md in the directory with the package Makefile, mostly so you can point Copilot it at it in the PR description, so Copilot doesn't complain about the timing windows (since they are as expected). It also capture the information for future efforts.
There was a problem hiding this comment.
thank you for being considerate about adding a new option! Also on volunteering to be a maintainer!
i will add a TIMINGS.md in the watchcat package directory that documents the
default behavior and the reset_failure_timer=1 behavior with a concrete
example, so the intended timing windows are written down for future work and
for review context.
| # Restart timer cycle. | ||
| # Optionally start a fresh failure window after the recovery action | ||
| # finishes instead of continuing to count the original outage. | ||
| if [ "$reset_failure_timer" -eq 1 ]; then |
There was a problem hiding this comment.
I will do this in the next commit.
46dae48 to
d995f21
Compare
|
I’ve pushed a new commit. It has:
Please have a look when you get a chance. If there is anything else you would |
Add an opt-in reset_failure_timer option for restart_iface and run_script modes. When enabled, watchcat starts a fresh failure window after the recovery action finishes before allowing another recovery action. The existing behavior remains the default. Document the intended default and reset_failure_timer timing behavior in TIMINGS.md and use a safer string comparison for the reset_failure_timer check. Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>
d995f21 to
cf4347e
Compare
|
There was a typo and not required summary in last commit that I noticed today. I have fixed it. |
|
Sorry for the delay, will look at this on the weekend. |
|
@dhrm1k Got sidetracked with a rewrite of the NUT scripts for OpenWrt to satisfy Copilot and the new CI tests. I hope to come back to this by mid-week. |
📦 Package Details
Maintainer: Roger D rogerdammit@gmail.com (@roger- )
Description:
watchcatmonitors connectivity and can reboot the device, restart aninterface, or run a script after a configured failure period.
This PR fixes
restart_ifacetiming so the failure timer is reset afterthe recovery action completes, which avoids repeated restarts during a
sustained outage when the restart itself takes longer than the configured
period.
It also updates the related log messages so they clearly state that the
configured action will happen only after the failure period is reached.
Fixes: #29318
🧪 Run Testing Details
✅ Formalities
If your PR contains a patch:
git am