watchcat: clarify restart logs and add optional failure timer reset by dhrm1k · Pull Request #29326 · openwrt/packages

dhrm1k · 2026-05-07T03:05:59Z

📦 Package Details

Maintainer: Roger D rogerdammit@gmail.com (@roger- )

Description:
watchcat monitors connectivity and can reboot the device, restart an
interface, or run a script after a configured failure period.

This PR fixes restart_iface timing so the failure timer is reset after
the recovery action completes, which avoids repeated restarts during a
sustained outage when the restart itself takes longer than the configured
period.

It also updates the related log messages so they clearly state that the
configured action will happen only after the failure period is reached.

Fixes: #29318

🧪 Run Testing Details

OpenWrt Version: 24.10.4
OpenWrt Target/Subtarget: x86/64
OpenWrt Device: VM

✅ Formalities

I have reviewed the CONTRIBUTING.md file for detailed contributing guidelines.

If your PR contains a patch:

It can be applied using git am
It has been refreshed to avoid offsets, fuzzes, etc., using
```
make package/<your-package>/refresh V=s
```

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjusts watchcat network monitoring behavior so the failure timer is reset after recovery actions complete, reducing repeated restarts during sustained outages, and clarifies related log messaging.

Changes:

Update log messages to clarify that actions occur only after the configured failure period is reached.
Reset the failure timer based on the time when the recovery action finishes.

dhrm1k · 2026-05-08T19:51:56Z

Fixed it, thanks.

I replaced that cat /proc/uptime usage with a direct read from
/proc/uptime and folded it into the branch.
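
For readers unfamiliar with the idiom, reading /proc/uptime directly could look roughly like the sketch below. The function name `uptime_seconds` and its optional file argument are hypothetical and only exist to make the example self-contained; the PR's actual code may differ.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): print uptime in whole seconds.
# The optional argument exists only so the function can be exercised against
# a sample file; it defaults to /proc/uptime.
uptime_seconds() {
	uptime_file="${1:-/proc/uptime}"
	# /proc/uptime holds two fields, e.g. "12345.67 54321.00".
	# read is a shell builtin, so no external cat process is spawned.
	read -r up _idle < "$uptime_file" || return 1
	# Strip the fractional part, keeping only whole seconds.
	echo "${up%%.*}"
}
```

Calling `uptime_seconds` with no argument on a Linux system prints the seconds since boot as an integer.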

danielfdickinson · 2026-05-09T14:58:08Z

@dhrm1k I am not in favour of this change. I think restarting the interface repeatedly is unavoidable if we want to avoid failing to restart once the outage is over. I believe it would defeat the purpose of watchcat in the event of an outage.

dhrm1k · 2026-05-09T15:17:49Z

I was mainly thinking about the timing behavior after the restart
action itself completes. If the restart took longer than
the configured period, the next failed check could immediately trigger
another restart without really giving the interface a fresh failure window.

But I understand your concern: if the repeated restart behavior is treated
as intentional, then changing that timing could make watchcat less
responsive once connectivity comes back.

In that case, I am happy to drop this change and keep the log wording
cleanup, since that part still seems useful independently.

danielfdickinson · 2026-05-09T15:19:27Z

@dhrm1k Sounds good. Thank you.

dhrm1k · 2026-05-09T20:22:25Z

I have just kept the log changes now.

danielfdickinson · 2026-05-09T21:23:48Z

LGTM. Can you update PR description (and close the associated issue, with a reference to the discussion here so others know why)?

danielfdickinson · 2026-05-10T21:31:42Z

@dhrm1k I've been thinking about this. I think there are a couple of different scenarios here:

1. Where we must do a restart after an outage because some service will have been interrupted and needs the interface restart to 'kick' it back into action (e.g. a VPN).
2. Where we only want to restart if the outage exceeds the reboot/restart period.

This perhaps should be a configuration option, where you can either insist on a reboot/restart after an outage, or where you only want a restart/reboot if the outage persists.

It sounds like you have a 2. scenario, whereas I am used to a 1. scenario.

Would you be up for adding such an option?

dhrm1k · 2026-05-11T02:54:08Z

Yes, I’d be up for that.

I think you’re right that these are really two different use cases.

One is where you always want the recovery action once the outage window has
been reached, because something behind the interface may need that restart to
recover properly.

The other is where you only want to take the recovery action if the outage is
still happening at that point.

My use case was definitely closer to the second one.

I can look at adding this as a configuration option so the current behavior
stays available by default, while also allowing the more conditional behavior
for setups where repeated or delayed restarts are less desirable.

dhrm1k · 2026-05-11T18:19:06Z

I’ve added the configuration option version on top of the existing branch.

The current behavior remains the default. The new
reset_failure_timer option applies to restart_iface and run_script
modes and starts a fresh failure window after the recovery action finishes.
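
As a rough illustration only, enabling the option from a shell on the device might look like the sketch below. It assumes the option lives in the first `watchcat` section of /etc/config/watchcat and that this section uses restart_iface or run_script mode; the section index is an assumption, not something stated in this PR.

```shell
#!/bin/sh
# Sketch under assumptions: the first watchcat section is the one to change.
# The guard only keeps the snippet harmless on non-OpenWrt systems.
if command -v uci >/dev/null 2>&1; then
	# Opt in to a fresh failure window after each recovery action.
	uci set "watchcat.@watchcat[0].reset_failure_timer=1"
	uci commit watchcat
	# Restart the service so the new value is passed to watchcat.sh.
	/etc/init.d/watchcat restart
else
	echo "uci not found; run this on the OpenWrt device" >&2
fi
```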

I also tested the behavior on OpenWrt 24.10.4 x86/64 to confirm the
default path still retriggers immediately after a long recovery action, while
the opt-in path waits for a fresh failure window.

danielfdickinson

Generally looks really good. I've made some comments/queries inline. Thank you.

danielfdickinson · 2026-05-11T21:21:42Z

 	option pinghosts '8.8.8.8'
 	option forcedelay '30'
+	# For restart_iface and run_script, start a fresh failure window after
+	# each recovery action finishes before allowing another one.


Minor tweak: instead of 'another one' could you say 'another restart'?

yes, i did that now.

danielfdickinson · 2026-05-11T21:22:53Z

+		procd_set_param command /usr/bin/watchcat.sh \
+			"restart_iface" "$period" "$pinghosts" "$pingperiod" \
+			"$pingsize" "$interface" "$mmifacename" "$unlockbands" \
+			"$addressfamily" "" "$reset_failure_timer"


Thanks. Appreciate you breaking up that command line, and the one below.

danielfdickinson · 2026-05-11T21:23:58Z

 	mm_iface_unlock_bands="$7"
 	address_family="$8"
 	script="$9"
+	reset_failure_timer="${10}"


Does "${10}" work in ash? Might it be better to use shift and use "$9"?

Yes, "${10}" works in ash. Still, I have switched the handling to use shift plus named
variables rather than relying on ${10} / ${11} in the mode dispatch.
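
A minimal sketch of the shift-plus-named-variables approach discussed here is shown below. The argument order follows the comment in the diff hunk quoted in this thread; the function name `parse_watchcat_args` and the default of `0` for a missing flag are assumptions made for illustration, not the PR's code.

```shell
#!/bin/sh
# Hypothetical dispatcher helper: captures positional parameters into named
# variables so later code never needs ${10} or ${11}.
parse_watchcat_args() {
	mode="$1"
	shift	# drop the mode; $1 is now the period
	period="$1"
	pinghosts="$2"
	pingperiod="$3"
	pingsize="$4"
	interface="$5"
	mmifacename="$6"
	unlockbands="$7"
	addressfamily="$8"
	script="$9"
	# Only shift when a tenth argument exists, so a short argument list
	# does not make shift fail.
	if [ "$#" -gt 9 ]; then
		shift 9	# the former tenth argument becomes $1
	else
		set --
	fi
	# Assumed default: flag off when the caller does not pass it.
	reset_failure_timer="${1:-0}"
}
```

After `parse_watchcat_args restart_iface 60 8.8.8.8 10 standard wan "" "" any "" 1`, the flag is available as `$reset_failure_timer` and the interface as `$interface`.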

danielfdickinson · 2026-05-11T21:26:50Z

+	# args from init script: period pinghosts pingperiod pingsize interface
+	# mmifacename unlockbands addressfamily script reset_failure_timer
+	watchcat_monitor_network "$2" "$3" "$4" "$5" "$6" "$7" "$8" \
+		"$9" "${10}" "${11}"


As above with "${10}" and "${11}". Might be better to use shift. Also (this is an enhancement, not a blocker), consider named variables that capture the incoming parameters, and use those variables here.

it no longer relies on ${10} / ${11} there now

danielfdickinson · 2026-05-11T22:01:29Z

Could you also bump PKG_RELEASE in the Makefile?

Clarify the restart_iface logging so the message reflects that the configured action happens only after the failure period is reached. Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>

dhrm1k · 2026-05-12T03:05:06Z

I have pushed a new commit with the requested changes. I have also bumped PKG_RELEASE in the Makefile. (I also re-pushed an older commit, because it was failing the CI tests since I didn't sign that commit.)

danielfdickinson

LGTM. Thank you!

danielfdickinson · 2026-05-12T03:24:15Z

@BKPepe Any chance of a Copilot on this?

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

danielfdickinson · 2026-05-12T05:42:07Z

+			# Optionally start a fresh failure window after the recovery action
+			# finishes instead of continuing to count the original outage.
+			if [ "$reset_failure_timer" -eq 1 ]; then
+				time_now="$(cat /proc/uptime)"
+				time_now="${time_now%%.*}"


This sounds plausible. Having the best of both worlds would be good. I've noticed some timing is not as I would expect in my PR for other changes: #29417, so this is worth a more complete analysis (e.g. diagramming and documenting the expected timing and behaviour and making sure the script delivers).

That makes sense.

I agree this probably needs a more explicit look at the timing model.

I’ll map out the expected behavior for the different cases first:

current/default behavior

reset_failure_timer disabled

reset_failure_timer enabled

and then check the script against that before changing the timing logic
further.

This sounds plausible. Having the best of both worlds would be good. I've noticed some timing is not as I would expect in my PR for other changes: #29417, so this is worth a more complete analysis (e.g. diagramming and documenting the expected timing and behaviour and making sure the script delivers).

I want to make sure I understand what you meant here.

For example, if pings have already been failing long enough to trigger the
recovery action, and that recovery action itself takes some time to finish,
should the default path still count that recovery time as part of the same
outage window, or should it treat the post-action timestamp as the new
baseline even when reset_failure_timer is not enabled?

I can see both interpretations, so I just want to make sure I am reading the
intended behavior correctly.

Can you elaborate a bit please.

I'd have to diagram to be sure what I think is best. I think if the recovery action completes after the network is back, that it should not restart again, but if the outage exceeds two triggers there should be two restarts in the default case, and only one with the new flag you are adding (IIUC).
It's a question of a) what caused the outage, and b) what is needed to recover from it.

For some interfaces (e.g. WireGuard or OpenVPN) if there is an internet outage, the interface will still not be working when the internet comes back, until the interface is restarted. Since we are pinging through the interface that is down, we cannot know the internet came back up so we need the restart to continue to be triggered throughout the outage, so that once the internet is back the interface comes back up. For a regular interface OTOH, this is overkill, hence adding the flag.

Thanks, this helps. I think I have a much better feel now for what you mean.

I tried writing out the timing in a very simple example, mainly to check that
I am understanding the split between the default behavior and the new flag the
way you intended.

Say:

failure_period=60

pings are failing continuously

the recovery action itself takes 15 seconds

Then the rough timeline is:

t=0 outage starts
t=60 failure period reached, recovery action starts
t=75 recovery action finishes

The way I am reading your comment, the default behavior should keep retrying
through a sustained outage.

So in the default case, if the outage is still ongoing, the time spent inside
the recovery action should still contribute enough to the same outage window
that multiple restarts can happen during a long outage.

That would look something like:

t=0 outage starts
t=60 restart #1
t=75 restart #1 finishes
t=120 restart #2

That makes sense to me for the WireGuard/OpenVPN type of case you described:
the upstream internet may be back, but the monitored path through the tunnel
is still down, so watchcat needs to keep retrying during the outage or the
tunnel may never get kicked back into service.

Then with reset_failure_timer=1, I read that as the more conservative mode:
once the recovery action finishes, a fresh failure window starts from there,
so the action duration no longer counts toward the next trigger.

That would look more like:

t=0 outage starts
t=60 restart #1
t=75 restart #1 finishes
t=135 restart #2 would be the earliest next retry

So if I am understanding you correctly:

default mode should continue retrying through a sustained outage

reset_failure_timer=1 should suppress that repeated retry behavior by
starting a fresh failure window after the action completes
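
The arithmetic behind the two timelines can be sketched as a tiny model. This is not code from watchcat.sh; the function `next_restart` and its parameters are hypothetical, and it assumes the default window is measured from when the previous restart started, as the example timeline suggests.

```shell
#!/bin/sh
# Hypothetical model of the timelines above; prints the earliest time of the
# next restart during a sustained outage.
#   $1 = time the previous restart started
#   $2 = failure period in seconds
#   $3 = how long the recovery action takes
#   $4 = reset_failure_timer (1 = fresh window after the action finishes)
next_restart() {
	started="$1"; period="$2"; duration="$3"; reset="$4"
	if [ "$reset" = "1" ]; then
		# Fresh window: the action's duration no longer counts.
		echo $((started + duration + period))
	else
		# Default: the window keeps running while the action executes.
		echo $((started + period))
	fi
}

next_restart 60 60 15 0   # default: prints 120
next_restart 60 60 15 1   # reset_failure_timer=1: prints 135
```

With a 60-second period and a 15-second action, the two modes differ by exactly the action's duration.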

The one part I still wanted to make sure I was reading correctly was this:

if the recovery action completes after the network is back, that it should
not restart again

My reading of that is:

in the default case, retries should continue only while failed checks keep
happening

so if connectivity has actually recovered by the next check, there should be
no further restart

but if failed checks continue and the outage is long enough to cross another
trigger window, another restart is expected

Does this match what you had in your mind?

Thank you for the work you have put in!

The timing looks like what I was thinking, but being concrete gives me more confidence in what I am saying.

My reading of that is:

in the default case, retries should continue only while failed checks keep happening
so if connectivity has actually recovered by the next check, there should be no further restart
but if failed checks continue and the outage is long enough to cross another
trigger window, another restart is expected

Does this match what you had in your mind?

Yes, that is what I was thinking. Thank you again.

I would suggest adding a TIMINGS.md in the directory with the package Makefile, mostly so you can point Copilot at it in the PR description, so Copilot doesn't complain about the timing windows (since they are as expected). It also captures the information for future efforts.

thank you for being considerate about adding a new option! Also on volunteering to be a maintainer!

i will add a TIMINGS.md in the watchcat package directory that documents the
default behavior and the reset_failure_timer=1 behavior with a concrete
example, so the intended timing windows are written down for future work and
for review context.

dhrm1k · 2026-05-12T15:35:56Z

-			# Restart timer cycle.
+			# Optionally start a fresh failure window after the recovery action
+			# finishes instead of continuing to count the original outage.
+			if [ "$reset_failure_timer" -eq 1 ]; then


I will do this in the next commit.

dhrm1k · 2026-05-13T15:50:29Z

I’ve pushed a new commit.

It has:

the safer reset_failure_timer comparison in watchcat.sh without -eq.
a TIMINGS.md in the watchcat package directory documenting the intended
default behavior and the reset_failure_timer=1 behavior
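
To illustrate why a string comparison is considered safer than -eq here: with -eq, an empty or non-numeric value makes `[` print an error instead of quietly evaluating to false. The helper `reset_enabled` below is a hypothetical sketch, not the check as written in the PR.

```shell
#!/bin/sh
# Hypothetical helper illustrating the comparison style.
reset_enabled() {
	# "=" is a string test, so an empty or non-numeric value is simply
	# "not 1" and never triggers an integer-parsing error.
	[ "$1" = "1" ]
}

for value in 1 0 "" yes; do
	if reset_enabled "$value"; then
		echo "'$value' -> fresh failure window"
	else
		echo "'$value' -> default timing"
	fi
done
```

Only the literal value `1` selects the opt-in path; everything else falls back to the default timing.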

Please have a look when you get a chance. If there is anything else you would
like adjusted, feel free to point it out.

Add an opt-in reset_failure_timer option for restart_iface and run_script modes. When enabled, watchcat starts a fresh failure window after the recovery action finishes before allowing another recovery action. The existing behavior remains the default. Document the intended default and reset_failure_timer timing behavior in TIMINGS.md and use a safer string comparison for the reset_failure_timer check. Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>

dhrm1k · 2026-05-15T03:34:30Z

There was a typo and an unneeded summary in the last commit that I noticed today. I have fixed it.

danielfdickinson · 2026-05-15T07:16:27Z

Sorry for the delay, will look at this on the weekend.

danielfdickinson · 2026-05-24T13:57:38Z

@dhrm1k Got sidetracked with a rewrite of the NUT scripts for OpenWrt to satisfy Copilot and the new CI tests. I hope to come back to this by mid-week.

GeorgeSapkin force-pushed the watchcat-restart-loop branch from 30398ef to 457f29e · May 7, 2026 15:51

BKPepe requested a review from Copilot · May 8, 2026 05:24

Copilot AI reviewed · May 8, 2026

Comment thread utils/watchcat/files/watchcat.sh (outdated)

Copilot started reviewing on behalf of BKPepe · May 8, 2026 05:33

dhrm1k force-pushed the watchcat-restart-loop branch from 457f29e to 2320ed8 · May 8, 2026 19:50

dhrm1k force-pushed the watchcat-restart-loop branch from 2320ed8 to 01d7742 · May 9, 2026 20:21

dhrm1k changed the title from "watchcat: reset timer after restart action" to "~watchcat: reset timer after restart action~ watchcat: clarify restart logs and add optional failure timer reset" · May 11, 2026

dhrm1k changed the title from "~watchcat: reset timer after restart action~ watchcat: clarify restart logs and add optional failure timer reset" to "watchcat: clarify restart logs and add optional failure timer reset" · May 11, 2026

danielfdickinson suggested changes · May 11, 2026

dhrm1k force-pushed the watchcat-restart-loop branch from c4b6df2 to 75d124e · May 12, 2026 02:03

watchcat: clarify restart log wording

2033b30

Clarify the restart_iface logging so the message reflects that the configured action happens only after the failure period is reached. Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>

dhrm1k force-pushed the watchcat-restart-loop branch from 75d124e to 46dae48 · May 12, 2026 02:20

dhrm1k requested a review from danielfdickinson · May 12, 2026 03:07

danielfdickinson approved these changes · May 12, 2026

BKPepe requested a review from Copilot · May 12, 2026 04:45

Copilot AI reviewed · May 12, 2026

Copilot started reviewing on behalf of BKPepe · May 12, 2026 14:08

dhrm1k force-pushed the watchcat-restart-loop branch from 46dae48 to d995f21 · May 13, 2026 15:45

dhrm1k force-pushed the watchcat-restart-loop branch from d995f21 to cf4347e · May 15, 2026 03:33
