nut-2.8.2 does not seem to honor DEADTIME #2454

Closed
avg-I opened this issue May 23, 2024 · 8 comments
Labels
question Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved

@avg-I
Contributor

avg-I commented May 23, 2024

I have a single UPS and several computers powered off it.
One is a master that runs upsd and upsmon, others are slaves that run just upsmon.
Here is a snippet from upsmon.conf on one of the slaves:

MONITOR labups@neo.home.arpa 1  xxx xxx slave
POLLFREQ 15
POLLFREQALERT 5
DEADTIME 120

Recently, during a blackout, I had a glitch where the network interface on that slave went down for about 4 seconds.
Its upsmon started powering off the machine in about 5 seconds, which was not ideal in the situation.

Here are some logs:

May 23 05:55:50 super kernel: lagg0: link state changed to DOWN
May 23 05:55:54 super kernel: lagg0: link state changed to UP
May 23 05:55:55 super upsmon[1078]: Poll UPS [labups@neo.home.arpa] failed - Server disconnected
May 23 05:55:55 super upsmon[1078]: Communications with UPS labups@neo.home.arpa lost
May 23 05:55:55 super upsmon[1078]: UPS [labups@neo.home.arpa] was last known to be not fully online and currently is not communicating, assuming dead
May 23 05:55:55 super upsmon[1078]: Executing automatic power-fail shutdown
May 23 05:55:55 super upsmon[1078]: Auto logout and shutdown proceeding

I expected that upsmon would wait for DEADTIME before doing that.
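For reference, the timing in the log excerpt above can be checked with a minimal sketch. The timestamps and the DEADTIME value are copied from this report; the year 2024 is an assumption taken from the date of the issue, since syslog lines do not carry it, and the helper names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log excerpt; the year is NOT in the log,
# 2024 is assumed from the date of this report.
LINK_DOWN = "May 23 05:55:50"
LINK_UP = "May 23 05:55:54"
SHUTDOWN = "May 23 05:55:55"
DEADTIME = 120  # seconds, from the upsmon.conf snippet


def parse(stamp: str) -> datetime:
    """Parse a syslog-style timestamp, assuming the year 2024."""
    return datetime.strptime(f"2024 {stamp}", "%Y %b %d %H:%M:%S")


def seconds_between(start: str, end: str) -> int:
    """Whole seconds elapsed between two syslog-style timestamps."""
    return int((parse(end) - parse(start)).total_seconds())


outage = seconds_between(LINK_DOWN, LINK_UP)
reaction = seconds_between(LINK_DOWN, SHUTDOWN)
print(f"link outage: {outage}s, shutdown after: {reaction}s, DEADTIME: {DEADTIME}s")
```

The shutdown started well inside the configured DEADTIME window, which is the gap between expectation and behavior reported here.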

What additional information should I provide?

@jimklimov
Member

jimklimov commented May 23, 2024

I suppose "link down/up" transitions broke the TCP session, so the upsmon client was forcefully disconnected from the upsd data server while in a critical state, and behaved by design.

@avg-I
Contributor Author

avg-I commented May 23, 2024

So if there is any communication problem between upsd and upsmon while on battery, upsmon is supposed to start powering off immediately?

Then what is DEADTIME for?

# DEADTIME - Interval to wait before declaring a stale ups "dead"
# 
# upsmon requires a UPS to provide status information every few seconds
# (see POLLFREQ and POLLFREQALERT) to keep things updated.  If the status
# fetch fails, the UPS is marked stale.  If it stays stale for more than
# DEADTIME seconds, the UPS is marked dead.
# 
# A dead UPS that was last known to be on battery is assumed to have gone
# to a low battery condition.  This may force a shutdown if it is providing
# a critical amount of power to your system.

Is that applicable only to a local (serial or USB connected) UPS?
Is there any control like that for network communication?
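The rule described in the quoted DEADTIME comment can be sketched as a small state tracker. This is a hypothetical model of the documented wording only, not upsmon's actual code: the class and method names are invented, and measuring staleness from the first failed poll is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StaleTracker:
    """Hypothetical model of the stale/dead rule in the DEADTIME comment."""

    deadtime: float                      # DEADTIME in seconds
    stale_since: Optional[float] = None  # time of the first failed poll, if any

    def record_poll(self, now: float, ok: bool) -> str:
        """Return "ok", "stale" or "dead" after one status fetch at time `now`."""
        if ok:
            # A successful fetch clears the stale state.
            self.stale_since = None
            return "ok"
        if self.stale_since is None:
            # First failure: the UPS becomes stale (assumed reference point).
            self.stale_since = now
        if now - self.stale_since > self.deadtime:
            # Stale for more than DEADTIME seconds: declared dead.
            return "dead"
        return "stale"


tracker = StaleTracker(deadtime=120)
states = [tracker.record_poll(t, ok=False) for t in (0, 60, 125)]
print(states)
```

Under this reading, a 4-second outage with DEADTIME 120 would only ever produce "stale", never "dead".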

@jimklimov
Member

jimklimov commented May 24, 2024

The data server regularly updates the connected clients like upsmon with broadcasts about device information. For your corner case, "connected" is the critical word. Link flickered, IP address probably disappeared for a few seconds, TCP session got broken, server is assumed abruptly powered off (and/or its OS went down without waiting for clients to disconnect, so its upsd is off). And since the UPS was last known to be on battery, we haven't got much more time to reconnect or investigate either. To keep data safe, gotta run to stop services, flush filesystems ASAP.

This seems similar to the documented example of networking gear turning off because it is not on a UPS (or is on a weaker one), which is among the reasons for an emergency shutdown of a client. In your case the network was lost without any switch or router actually powering off.

@jimklimov jimklimov added question Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved labels May 24, 2024
@avg-I
Contributor Author

avg-I commented May 24, 2024

I see your point.

At the same time, the UPS was not really critical: it was on battery, but not on low battery.

It would be nice if users had some control over the behavior.
Immediate shutdown on any glitch is not suitable for all.
In some scenarios a UPS is used just to give enough time for an orderly shutdown.
But in other scenarios people want to keep services running for as long as possible (e.g., with regularly scheduled blackouts).

We give the master server DEADTIME to restore communications with a UPS device.
But we do not give slaves any time to restore communications with the master.
Seems like an omission.
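The two behaviors being contrasted here can be sketched as one decision function with a toggle. Everything in this block is hypothetical: the function and the `honor_deadtime_on_battery` parameter are invented for illustration, are not the name of any NUT option, and do not claim to match how upsmon implements either behavior.

```python
def should_shutdown_on_comm_loss(
    on_battery: bool,
    seconds_without_data: float,
    deadtime: float,
    honor_deadtime_on_battery: bool,
) -> bool:
    """Decide whether lost communications should trigger a shutdown.

    honor_deadtime_on_battery is a made-up toggle:
      False -> shut down as soon as communications are lost while on battery
               (the behavior observed in this report);
      True  -> wait until the data has been missing for more than DEADTIME.
    """
    if not on_battery:
        # A UPS last known to be on line power is not assumed to be critical.
        return False
    if honor_deadtime_on_battery:
        return seconds_without_data > deadtime
    return True


# The incident from this report: on battery, about 5 seconds without data.
immediate = should_shutdown_on_comm_loss(True, 5, 120, honor_deadtime_on_battery=False)
patient = should_shutdown_on_comm_loss(True, 5, 120, honor_deadtime_on_battery=True)
print(immediate, patient)
```

With the toggle off, the 5-second glitch is enough to shut down; with it on, the slave gets the same grace period the master gets for its local device.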

@jimklimov
Member

Fair point, at least for the non-critical OB state. Would you care to post a PR for the new toggle?

For a bit more context about the current/default behavior, note however that as a UPS or its batteries age, the original assumptions about what constitutes an actual critical state can become obsolete (part of why some devices offer calibration functionality). So, based on invalid assumptions, we can think there's a lot of juice in the battery, while in fact the UPS is a glorified power strip or close to that.

@avg-I
Contributor Author

avg-I commented Jun 5, 2024

@jimklimov, I created #2462. Not sure if that matches your idea on how the issue should be resolved.
In my opinion, going back to the traditional behavior is the best solution.

jimklimov added a commit to jimklimov/nut that referenced this issue Jun 10, 2024
…he log warning message [networkupstools#2454]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 10, 2024
…ls#2462

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to avg-I/nut that referenced this issue Jun 10, 2024
…ls#2462

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov
Member

Primary PR merged. Exploratory one (to return the log message) left out for now, per discussion. Maybe will come back to it for debug-only logging just in case, though.

jimklimov added a commit to jimklimov/nut that referenced this issue Jun 10, 2024
…he debug log message [networkupstools#2454]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@desertwitch
Contributor

desertwitch commented Jun 10, 2024

Just reading up on this some more after being out and about for a while. I agree that criticality should not be triggered by a minor communication staleness, even when OB. Respecting DEADTIME regardless of line state seems a wise decision here. The default value of 15 seconds should be short enough not to cause any major crises if the UPS is in fact dead, but also long enough not to bring down a server prematurely due to some minor, short-lived communication hiccups.

In any case, I'm always happy with user-configurability where it makes sense (and especially for shutdown criteria), and instant criticality might have indeed been a bit too strict here in retrospect (but with good intentions nonetheless).
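As a rough aid for picking a DEADTIME value, here is a minimal sketch of a sanity check. It assumes the rule of thumb from the sample upsmon.conf comments as recalled here (DEADTIME should be a multiple of POLLFREQ and POLLFREQALERT, and roughly three times the larger of the two); the function names are invented, and the rule itself should be verified against the sample config shipped with your NUT version.

```python
from typing import List


def suggested_deadtime(pollfreq: int, pollfreqalert: int) -> int:
    """Rule of thumb (assumed): three times the larger polling interval."""
    return 3 * max(pollfreq, pollfreqalert)


def check_deadtime(deadtime: int, pollfreq: int, pollfreqalert: int) -> List[str]:
    """Return human-readable warnings for a DEADTIME/POLLFREQ combination."""
    warnings = []
    for name, freq in (("POLLFREQ", pollfreq), ("POLLFREQALERT", pollfreqalert)):
        if deadtime % freq != 0:
            warnings.append(f"DEADTIME {deadtime} is not a multiple of {name} {freq}")
    minimum = suggested_deadtime(pollfreq, pollfreqalert)
    if deadtime < minimum:
        warnings.append(f"DEADTIME {deadtime} is below the suggested {minimum}")
    return warnings


# Values from the upsmon.conf snippet in the original report.
print(check_deadtime(deadtime=120, pollfreq=15, pollfreqalert=5))
# A 15-second DEADTIME combined with the reporter's 15-second POLLFREQ.
print(check_deadtime(deadtime=15, pollfreq=15, pollfreqalert=5))
```

The reporter's settings pass this check, so the early shutdown was not a matter of DEADTIME being too short for the polling intervals.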

3 participants