watchcat: clarify restart logs and add optional failure timer reset #29326

Open
dhrm1k wants to merge 2 commits into openwrt:master from dhrm1k:watchcat-restart-loop

Conversation

@dhrm1k
Contributor

@dhrm1k dhrm1k commented May 7, 2026

📦 Package Details

Maintainer: Roger D rogerdammit@gmail.com (@roger- )

Description:
watchcat monitors connectivity and can reboot the device, restart an
interface, or run a script after a configured failure period.

This PR fixes restart_iface timing so the failure timer is reset after
the recovery action completes, which avoids repeated restarts during a
sustained outage when the restart itself takes longer than the configured
period.

It also updates the related log messages so they clearly state that the
configured action will happen only after the failure period is reached.

Fixes: #29318


🧪 Run Testing Details

  • OpenWrt Version: 24.10.4
  • OpenWrt Target/Subtarget: x86/64
  • OpenWrt Device: VM

✅ Formalities

  • I have reviewed the CONTRIBUTING.md file for detailed contributing guidelines.

If your PR contains a patch:

  • It can be applied using git am
  • It has been refreshed to avoid offsets, fuzzes, etc., using
    make package/<your-package>/refresh V=s

@GeorgeSapkin GeorgeSapkin force-pushed the watchcat-restart-loop branch from 30398ef to 457f29e Compare May 7, 2026 15:51
@BKPepe BKPepe requested a review from Copilot May 8, 2026 05:24

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjusts watchcat network monitoring behavior so the failure timer is reset after recovery actions complete, reducing repeated restarts during sustained outages, and clarifies related log messaging.

Changes:

  • Update log messages to clarify that actions occur only after the configured failure period is reached.
  • Reset the failure timer based on the time when the recovery action finishes.

Comment thread utils/watchcat/files/watchcat.sh Outdated
@dhrm1k
Contributor Author

dhrm1k commented May 8, 2026

Fixed it, thanks.

I replaced that cat /proc/uptime usage with a direct read from
/proc/uptime and folded it into the branch.
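
A minimal sketch of what reading the uptime without spawning cat might look like, assuming a POSIX shell such as BusyBox ash. The helper name uptime_seconds and its optional file argument are hypothetical, added only so the snippet can be exercised outside a router; the actual change in watchcat.sh may differ.

```sh
#!/bin/sh
# Hypothetical helper (not taken from watchcat.sh): print the uptime in
# whole seconds. The optional argument exists only for testing and
# defaults to /proc/uptime.
uptime_seconds() {
	uptime_file="${1:-/proc/uptime}"
	# /proc/uptime holds two fields, e.g. "12345.67 54321.00".
	# The read builtin picks up the first field without forking cat.
	read -r up idle < "$uptime_file" || return 1
	# Drop the fractional part, as the script already does with %%.*
	printf '%s\n' "${up%%.*}"
}
```

Called as `time_now="$(uptime_seconds)"`, this yields the same integer that the `cat` plus `%%.*` pair produced.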

@danielfdickinson
Contributor

@dhrm1k I am not in favour of this change. I think restarting the interface repeatedly is unavoidable if we want to avoid failing to restart once the outage is over. I believe it would defeat the purpose of watchcat in the event of an outage.

@dhrm1k
Contributor Author

dhrm1k commented May 9, 2026

I was thinking mainly about the timing behavior after the restart
action itself completes. If the restart took longer than
the configured period, the next failed check could immediately trigger
another restart without really giving the interface a fresh failure window.

But I understand your concern: if the repeated restart behavior is treated
as intentional, then changing that timing could make watchcat less
responsive once connectivity comes back.

In that case, I am happy to drop this change and keep the log wording
cleanup, since that part still seems useful independently.

@danielfdickinson
Contributor

@dhrm1k Sounds good. Thank you.

@dhrm1k dhrm1k force-pushed the watchcat-restart-loop branch from 2320ed8 to 01d7742 Compare May 9, 2026 20:21
@dhrm1k
Contributor Author

dhrm1k commented May 9, 2026

I have just kept the log changes now.

@danielfdickinson
Contributor

LGTM. Can you update the PR description (and close the associated issue, with a reference to the discussion here so others know why)?

@danielfdickinson
Contributor

@dhrm1k I've been thinking about this. I think there are a couple of different scenarios here:

  1. Where we must do a restart after an outage because some service will have been interrupted and needs the interface restart to 'kick' it back into action (e.g. a VPN).
  2. Where we only want to restart if the outage exceeds the reboot/restart period.

This should perhaps be a configuration option, where you can either insist on a reboot/restart after an outage, or only restart/reboot if the outage persists.

It sounds like you have scenario 2, whereas I am used to scenario 1.

Would you be up for adding such an option?

@dhrm1k
Contributor Author

dhrm1k commented May 11, 2026

Yes, I’d be up for that.

I think you’re right that these are really two different use cases.

One is where you always want the recovery action once the outage window has
been reached, because something behind the interface may need that restart to
recover properly.

The other is where you only want to take the recovery action if the outage is
still happening at that point.

My use case was definitely closer to the second one.

I can look at adding this as a configuration option so the current behavior
stays available by default, while also allowing the more conditional behavior
for setups where repeated or delayed restarts are less desirable.

@dhrm1k dhrm1k changed the title from "watchcat: reset timer after restart action" to "~watchcat: reset timer after restart action~ watchcat: clarify restart logs and add optional failure timer reset" May 11, 2026
@dhrm1k dhrm1k changed the title from "~watchcat: reset timer after restart action~ watchcat: clarify restart logs and add optional failure timer reset" to "watchcat: clarify restart logs and add optional failure timer reset" May 11, 2026
@dhrm1k
Contributor Author

dhrm1k commented May 11, 2026

I’ve added the configuration option version on top of the existing branch.

The current behavior remains the default. The new
reset_failure_timer option applies to restart_iface and run_script
modes and starts a fresh failure window after the recovery action finishes.

I also tested the behavior on OpenWrt 24.10.4 x86/64 to confirm the
default path still retriggers immediately after a long recovery action, while
the opt-in path waits for a fresh failure window.

Contributor

@danielfdickinson danielfdickinson left a comment

Generally looks really good. I've made some comments/queries inline. Thank you.

Comment thread utils/watchcat/files/watchcat.config Outdated
option pinghosts '8.8.8.8'
option forcedelay '30'
# For restart_iface and run_script, start a fresh failure window after
# each recovery action finishes before allowing another one.
Contributor

Minor tweak: instead of 'another one' could you say 'another restart'?

Contributor Author

Yes, I did that now.

procd_set_param command /usr/bin/watchcat.sh \
"restart_iface" "$period" "$pinghosts" "$pingperiod" \
"$pingsize" "$interface" "$mmifacename" "$unlockbands" \
"$addressfamily" "" "$reset_failure_timer"
Contributor

Thanks. Appreciate you breaking up that command line, and the one below.

Contributor

LGTM

Comment thread utils/watchcat/files/watchcat.sh Outdated
mm_iface_unlock_bands="$7"
address_family="$8"
script="$9"
reset_failure_timer="${10}"
Contributor

Does "${10}" work in ash? Might it be better to use shift and then "$9"?

Contributor Author

Yes, "${10}" works in ash. Still, I have switched the handling to use shift plus named
variables rather than relying on ${10} / ${11} in the mode dispatch.
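
For context, POSIX requires braces for positional parameters above 9: "$10" expands as "$1" followed by a literal 0, while "${10}" is the tenth argument. The shift-based alternative discussed here could look roughly like the sketch below. The function name parse_monitor_args is hypothetical, the argument order follows the comment in the reviewed hunk, and the default of 0 for a missing tenth argument is an assumption.

```sh
#!/bin/sh
# Hypothetical sketch (not the PR's code): capture the incoming positional
# parameters in named variables, using shift instead of ${10}.
parse_monitor_args() {
	mode="$1"
	shift	# drop the mode so the monitor arguments start at $1
	period="$1"
	ping_hosts="$2"
	ping_period="$3"
	ping_size="$4"
	iface="$5"
	mm_iface_name="$6"
	mm_iface_unlock_bands="$7"
	address_family="$8"
	script="$9"
	# Only shift when a tenth argument exists; shifting past the end of
	# the argument list is an error in some shells.
	if [ "$#" -gt 9 ]; then
		shift 9
		reset_failure_timer="$1"
	else
		reset_failure_timer=0	# assumed default: option off
	fi
}
```

After the call, the dispatch code can refer to "$period" or "$reset_failure_timer" by name instead of by position.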

Comment thread utils/watchcat/files/watchcat.sh Outdated
# args from init script: period pinghosts pingperiod pingsize interface
# mmifacename unlockbands addressfamily script reset_failure_timer
watchcat_monitor_network "$2" "$3" "$4" "$5" "$6" "$7" "$8" \
"$9" "${10}" "${11}"
Contributor

As above with "${10}" and "${11}": it might be better to use shift. Also (this is an enhancement, not a blocker), consider named variables that capture the incoming parameters, and use those variables here.

Contributor Author

It no longer relies on ${10} / ${11} there now.

@danielfdickinson
Contributor

Could you also bump PKG_RELEASE in the Makefile?

@dhrm1k dhrm1k force-pushed the watchcat-restart-loop branch from c4b6df2 to 75d124e Compare May 12, 2026 02:03
Clarify the restart_iface logging so the message reflects that the
configured action happens only after the failure period is reached.

Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>
@dhrm1k dhrm1k force-pushed the watchcat-restart-loop branch from 75d124e to 46dae48 Compare May 12, 2026 02:20
@dhrm1k
Contributor Author

dhrm1k commented May 12, 2026

I have pushed a new commit with the requested changes. I have also bumped PKG_RELEASE in the Makefile. (I also re-pushed an older commit, because it was failing the CI tests as I had not signed that commit.)

@dhrm1k dhrm1k requested a review from danielfdickinson May 12, 2026 03:07
Contributor

@danielfdickinson danielfdickinson left a comment

LGTM. Thank you!

@danielfdickinson
Contributor

@BKPepe Any chance of a Copilot review on this?

@BKPepe BKPepe requested a review from Copilot May 12, 2026 04:45

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +182 to +186
# Optionally start a fresh failure window after the recovery action
# finishes instead of continuing to count the original outage.
if [ "$reset_failure_timer" -eq 1 ]; then
time_now="$(cat /proc/uptime)"
time_now="${time_now%%.*}"
Contributor

This sounds plausible. Having the best of both worlds would be good. I've noticed some timing is not as I would expect in my PR for other changes: #29417, so this is worth a more complete analysis (e.g. diagramming and documenting the expected timing and behaviour, and making sure the script delivers).

Contributor Author

That makes sense.

I agree this probably needs a more explicit look at the timing model.

I’ll map out the expected behavior for the different cases first:

  • current/default behavior
  • reset_failure_timer disabled
  • reset_failure_timer enabled

and then check the script against that before changing the timing logic
further.

Contributor Author

> This sounds plausible. Having the best of both worlds would be good. I've noticed some timing is not as I would expect in my PR for other changes: #29417, so this is worth a more complete analysis (e.g. diagramming and documenting the expected timing and behaviour, and making sure the script delivers).

I want to make sure I understand what you meant here.

For example, if pings have already been failing long enough to trigger the
recovery action, and that recovery action itself takes some time to finish,
should the default path still count that recovery time as part of the same
outage window, or should it treat the post-action timestamp as the new
baseline even when reset_failure_timer is not enabled?

I can see both interpretations, so I just want to make sure I am reading the
intended behavior correctly.

Could you elaborate a bit, please?

Contributor

I'd have to diagram it to be sure what I think is best. I think that if the recovery action completes after the network is back, it should not restart again, but if the outage exceeds two triggers there should be two restarts in the default case, and only one with the new flag you are adding (IIUC).
It's a question of a) what caused the outage, and b) what is needed to recover from it.

For some interfaces (e.g. WireGuard or OpenVPN) if there is an internet outage, the interface will still not be working when the internet comes back, until the interface is restarted. Since we are pinging through the interface that is down, we cannot know the internet came back up so we need the restart to continue to be triggered throughout the outage, so that once the internet is back the interface comes back up. For a regular interface OTOH, this is overkill, hence adding the flag.

Contributor Author

@dhrm1k dhrm1k May 13, 2026

Thanks, this helps. I think I have a much better feel now for what you mean.

I tried writing out the timing in a very simple example, mainly to check that
I am understanding the split between the default behavior and the new flag the
way you intended.

Say:

  • failure_period=60
  • pings are failing continuously
  • the recovery action itself takes 15 seconds

Then the rough timeline is:

t=0    outage starts
t=60   failure period reached, recovery action starts
t=75   recovery action finishes

The way I am reading your comment, the default behavior should keep retrying
through a sustained outage.

So in the default case, if the outage is still ongoing, the time spent inside
the recovery action should still contribute enough to the same outage window
that multiple restarts can happen during a long outage.

That would look something like:

t=0    outage starts
t=60   restart #1
t=75   restart #1 finishes
t=120  restart #2

That makes sense to me for the WireGuard/OpenVPN type of case you described:
the upstream internet may be back, but the monitored path through the tunnel
is still down, so watchcat needs to keep retrying during the outage or the
tunnel may never get kicked back into service.

Then with reset_failure_timer=1, I read that as the more conservative mode:
once the recovery action finishes, a fresh failure window starts from there,
so the action duration no longer counts toward the next trigger.

That would look more like:

t=0    outage starts
t=60   restart #1
t=75   restart #1 finishes
t=135  restart #2 would be the earliest next retry

So if I am understanding you correctly:

  • default mode should continue retrying through a sustained outage
  • reset_failure_timer=1 should suppress that repeated retry behavior by
    starting a fresh failure window after the action completes
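
The two timelines above can be reproduced with a toy model. This is a sketch of the timing described in this comment, not of watchcat.sh itself: the function name simulate_restarts and its arguments are hypothetical, and it assumes pings fail continuously from t=0 and that a new action cannot start while the previous one is still running.

```sh
#!/bin/sh
# Hypothetical toy model: print the start times of recovery actions
# during a continuous outage.
#   $1 failure period (s)    $2 duration of one recovery action (s)
#   $3 outage length (s)     $4 reset_failure_timer (1 = fresh window)
simulate_restarts() {
	period="$1"; action="$2"; outage="$3"; reset="$4"
	window_start=0	# start of the current failure window
	busy_until=0	# time at which the previous action finishes
	out=""
	while :; do
		trigger=$((window_start + period))
		# An action cannot start before the previous one has finished.
		[ "$trigger" -ge "$busy_until" ] || trigger=$busy_until
		# Stop once the next trigger would fall after the outage ends.
		[ "$trigger" -le "$outage" ] || break
		out="${out:+$out }$trigger"
		busy_until=$((trigger + action))
		if [ "$reset" = "1" ]; then
			# Opt-in mode: the window restarts when the action finishes.
			window_start=$busy_until
		else
			# Default mode: the window restarts when the action begins.
			window_start=$trigger
		fi
	done
	printf '%s\n' "$out"
}
```

With a period of 60, a 15 second action, and an outage lasting 140 seconds, `simulate_restarts 60 15 140 0` prints `60 120` and `simulate_restarts 60 15 140 1` prints `60 135`, matching the timelines above.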

The one part I still wanted to make sure I was reading correctly was this:

> if the recovery action completes after the network is back, that it should
> not restart again

My reading of that is:

  • in the default case, retries should continue only while failed checks keep
    happening
  • so if connectivity has actually recovered by the next check, there should be
    no further restart
  • but if failed checks continue and the outage is long enough to cross another
    trigger window, another restart is expected

Does this match what you had in your mind?

Contributor

Thank you for the work you have put in!

The timing looks like what I was thinking, but being concrete gives me more confidence in what I am saying.

> My reading of that is:
>
>   • in the default case, retries should continue only while failed checks keep happening
>   • so if connectivity has actually recovered by the next check, there should be no further restart
>   • but if failed checks continue and the outage is long enough to cross another trigger window, another restart is expected
>
> Does this match what you had in your mind?

Yes, that is what I was thinking. Thank you again.

Contributor

I would suggest adding a TIMINGS.md in the directory with the package Makefile, mostly so you can point Copilot at it in the PR description, so Copilot doesn't complain about the timing windows (since they are as expected). It also captures the information for future efforts.

Contributor Author

Thank you for being considerate about adding a new option! Also for volunteering to be a maintainer!

I will add a TIMINGS.md in the watchcat package directory that documents the
default behavior and the reset_failure_timer=1 behavior with a concrete
example, so the intended timing windows are written down for future work and
for review context.

Comment thread utils/watchcat/files/watchcat.sh Outdated
# Restart timer cycle.
# Optionally start a fresh failure window after the recovery action
# finishes instead of continuing to count the original outage.
if [ "$reset_failure_timer" -eq 1 ]; then
Contributor Author

I will do this in the next commit.

@dhrm1k
Contributor Author

dhrm1k commented May 13, 2026

I’ve pushed a new commit.

It has:

  • the safer reset_failure_timer comparison in watchcat.sh without -eq.
  • a TIMINGS.md in the watchcat package directory documenting the intended
    default behavior and the reset_failure_timer=1 behavior
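
As an illustration of why a string comparison is the safer choice: `[ "$x" -eq 1 ]` reports an error rather than simply evaluating to false when `x` is empty or not a number, whereas `[ "$x" = "1" ]` quietly fails. A minimal sketch follows; the helper name reset_enabled is hypothetical and the actual check in watchcat.sh may be written inline.

```sh
#!/bin/sh
# Hypothetical helper: succeeds only when the option value is exactly "1".
# A string comparison stays quiet for empty or non-numeric values, where
# an integer comparison with -eq would report an error.
reset_enabled() {
	[ "$1" = "1" ]
}

# Example use, where do_reset stands in for the timer reset logic:
#   reset_enabled "$reset_failure_timer" && do_reset
```

An unset option, `0`, or an unexpected value such as `yes` all leave the default behavior in place under this check.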

Please have a look when you get a chance. If there is anything else you would
like adjusted, feel free to point it out.

Add an opt-in reset_failure_timer option for restart_iface and
run_script modes.

When enabled, watchcat starts a fresh failure window after the
recovery action finishes before allowing another recovery action.
The existing behavior remains the default.

Document the intended default and reset_failure_timer timing
behavior in TIMINGS.md and use a safer string comparison for the
reset_failure_timer check.

Signed-off-by: Dharmik Parmar <dharmikparmar2004@yahoo.com>
@dhrm1k dhrm1k force-pushed the watchcat-restart-loop branch from d995f21 to cf4347e Compare May 15, 2026 03:33
@dhrm1k
Contributor Author

dhrm1k commented May 15, 2026

There was a typo and an unneeded summary in the last commit that I noticed today. I have fixed both.

@danielfdickinson
Contributor

Sorry for the delay, will look at this on the weekend.

@danielfdickinson
Contributor

@dhrm1k Got sidetracked with a rewrite of the NUT scripts for OpenWrt to satisfy Copilot and the new CI tests. I hope to come back to this by mid-week.

Development

Successfully merging this pull request may close these issues.

watchcat: restart_iface can repeatedly restart an interface during a outage

3 participants