Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove guessed limits for some health sensors, documentation for sensor classes #10327

Merged
merged 6 commits into from Jun 21, 2019

Conversation

Projects
None yet
3 participants
@martijn-schmidt
Copy link
Contributor

commented Jun 8, 2019

Original behaviour

The file includes/discovery/functions.inc.php contains sensor_limit_low() and sensor_limit() which both attempt to guess a sane value for sensors when no explicitly defined low_limit or high_limit can be found during discovery.

Both switch control structures used in those two functions have empty case statements which means that if one of those matches, it's going to fall through and run the code for each subsequent case until a break is reached. See also: https://softwareengineering.stackexchange.com/a/162582

For example, when we call function sensor_low_limit(dbm, -13.036) it will return the value -12.3842 instead of null. That is because there will be a match at case 'dbm': which falls through all the way to case 'cooling':, where it performs $limit = -13.036 * 0.95 before hitting a break.

Changed behaviour

I want to remove power_consumed and count guessed low_limit and high_limit, I personally added those sensor classes in PR #9471 when I didn't understand that a switch control structure has fall-through behaviour so I can guarantee that guessing limits for those is a mistake on my behalf. It should not be there, only power_factor can have guessed limits. Apologies for the issue, I'm still a beginning programmer. 馃槂

Furthermore, I want to remove guessed high_limit values for current and power because these are supposed to draw higher values as more devices or components are installed on for example a PDU or a chassis.

Finally, I want to remove guessed low_limit and high_limit for dbm sensors, there is a much too large variance in power budget on commercially available optical transceivers for there to be a sensible window where you can guess these values. admiralspark on Discord is reporting the same problem, like myself he is running some DWDM optics over a larger distance where variances in signal strength are common.

Refactoring

I have removed all cases that would have returned $limit = null and replaced them with a default that performs a return null directly. This shortens the function significantly which makes codeclimate's 25-lines-per-function check happy, lowers the risk of unintentional changes in the future, and I have also left a comment to warn other beginning programmers like myself about unintentionally hitting the fall-through behaviour.

Because there is no documentation on how to add a sensor class at all right now, I've gone ahead and written some documentation to make it easier to find the availability of these guessed functions. It also saves the next guy/gal that wants to implement a sensor class some time grepping through the code, so that's positive thing as well.

I didn't reference adding the unit in get_unit_for_sensor_class() from includes/html/functions.inc.php in the new documentation since that functionality is being moved to Laravel localization files like resources/lang/$lang/sensors.php in one of my other pull requests - PR #10287.

Bugfix

There are a lot of cases in the latest test data where the "group": null, has changed to "group: "", for sensors that were discovered with yaml as a result of a change made in PR #9667:

murrant@8e80121#diff-67aaf7dd786395ce4cbd2c3996cf1bdcR1035

The original yaml discovery code had "group": null as a fallback, and the discovery function discover_sensor() still has $group = null as default value if nothing else is specified, so I think it might be a bug. I have updated the test data for groups along with the bugfix that makes sure yaml discovery group processing reverts to null if nothing truthy was returned.

Running a quick grep -ir "\"group\": \"\"," | uniq | wc -l in the tests/data/ directory shows that around 61 test files are affected for what's currently in the master branch.

It looks like this also updates rounding for some sensor limits and current values.

Cleaning up past mistakes?

Since both functions are only called during discovery OR when the existing limit is null, there are going to be a lot of stale/bogus low_limit and high_limit values in existing LibreNMS deployments. Do you have any thoughts on how to perform a one-time clean-up of the limits for all the affected sensor classes that weren't manually overridden by the user? Maybe a database migration could do the trick? As long as the clean-up isn't repeated those sensors which can pull sensible low_limit and high_limit values from SNMP will re-learn them during the next discovery run after the LibreNMS update.

DO NOT DELETE THIS TEXT

Please note

Please read this information carefully. You can run ./scripts/pre-commit.php to check your code before submitting.

Testers

If you would like to test this pull request then please run: ./scripts/github-apply <pr_id>, i.e ./scripts/github-apply 10327
After you are done testing, you can remove the changes with ./scripts/github-remove. If there are schema changes, you can ask on discord how to revert.

@martijn-schmidt martijn-schmidt marked this pull request as ready for review Jun 8, 2019

@martijn-schmidt

This comment has been minimized.

Copy link
Contributor Author

commented Jun 8, 2019

It looks like this PR is causing a lot of unit tests to fail due to the fact that we are no longer erroneously making up low_limit and high_limit values for certain sensor classes. I'll run ./scripts/save-test-data.php -m sensors and commit some updated json test data.

@martijn-schmidt

This comment has been minimized.

Copy link
Contributor Author

commented Jun 8, 2019

There are a lot of cases in the latest test data where the "group": null, has changed to "group: "", for sensors that were discovered with yaml as a result of a change made in PR #9667:

murrant@8e80121#diff-67aaf7dd786395ce4cbd2c3996cf1bdcR1035

@murrant @PipoCanaja do you know if that is intentional and whether we should update the test data for groups along with the discovery function discover_sensor() which still has $group = null as default value if nothing else is specified?

Or is it better to update the function discovery_process() so yaml discovery group processing reverts to null if nothing truthy was returned? Could be as simple as appending ?: null to the following line:

https://github.com/librenms/librenms/blob/master/includes/discovery/functions.inc.php#L1078

Running a quick grep -ir "\"group\": \"\"," | uniq | wc -l in the tests/data/ directory shows that around 60 test files are affected for what's currently in the master branch.

@martijn-schmidt

This comment has been minimized.

Copy link
Contributor Author

commented Jun 9, 2019

I've been thinking about this some more.. in some cases the switch fall-through behaviour may have been used intentionally, for example current and power are directly related to each other. Of course the power draw (and therefore the current) can change during normal operation so that shouldn't be a guessable limit, but there may be others where this is a valid consideration.

The dbm sensor class also certainly shouldn't have guessable limits, optical transceivers can have very wide acceptable transmit/receive windows (for example, one of our 10G-ZR DWDM optics may according to its specifications transmit between 0 and 5 dBm and receive between -23 dBm and -6 dBm). And with rx/tx dbm signals sometimes hovering around zero, the 0.95 calculation may result in a very small guessed window which causes constant alerting.

Having added the power_consumed, power_factor, and count sensor classes myself I know for sure that only power_factor can have a sensible guessed limit, the other two are mistakes on my behalf due to not knowing about the switch fall-through behaviour. We also know the empty cases at the end of the switch that don't hit a case which contains code are not used, those would've returned null anyway.

But I'm not sure whether the empty cases with fall-through were intentional for airflow, snr, frequency, pressure, which all use the code from the case for cooling. Looking at the git history they were added by @laf throughout 2017 as part of commits 72a0d82 d48be5f 592e8ac, so it would be nice to have some feedback on what'd be the best thing to do with those - leave alone or remove? 馃槂

martijn-schmidt added a commit to martijn-schmidt/librenms that referenced this pull request Jun 12, 2019

Partially revert 4d1fdb2 for some sensors
As discussed in PR librenms#10327, some sensors classes were legitimately
using switch fall-through behaviour. Revert 4d1fdb2 for those sensors.

Keep current, power, power_consumed, and dbm removed on purpose.

martijn-schmidt added a commit to martijn-schmidt/librenms that referenced this pull request Jun 12, 2019

Partially revert 4d1fdb2 for some sensors
As discussed in PR librenms#10327, some sensors classes were legitimately
using switch fall-through behaviour. Revert 4d1fdb2 for those sensors.

Keep current, power, power_consumed, and dbm removed on purpose.

@martijn-schmidt martijn-schmidt force-pushed the martijn-schmidt:switch-fallthrough-fix branch from c955aa4 to 0baeb07 Jun 12, 2019

@martijn-schmidt

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

Starting with a clean batch of changes so I can undo some of the mistakes I made in the first attempt. :)

@martijn-schmidt martijn-schmidt changed the title Guessed limits for health sensors: fix switch fall-through bug, documentation for adding sensor classes WIP - Changes to guessed limits for health sensors, documentation for adding sensor classes Jun 12, 2019

martijn-schmidt added some commits Jun 12, 2019

Changes to guessed limit functions for sensors.
Original behaviour
===================

The file `includes/discovery/functions.inc.php` contains
`sensor_limit_low()` and `sensor_limit()` which both attempt to
guess a sane value for sensors when no explicitly defined
low_limit or high_limit can be found during discovery.

Both switch control structures used in those two functions
have empty case statements which means that if one of those matches,
it's going to fall through and run the code for each subsequent
case until a `break` is reached.

For example, when we call `function sensor_low_limit(dbm, -13.036)`
it will return the value `-12.3842` instead of `null`. That is
because there will be a match at `case 'dbm':` which falls through
all the way to `case 'cooling':`, where it performs
`$limit = -13.036 * 0.95` before hitting a `break`.

Changed behaviour
===================

Removed `power_consumed` and `count` guessed low_limit and
high_limit, I personally added those sensor classes in PR #9471
when I didn't understand that a switch control structure has
fall-through behaviour so I can guarantee that guessing limits for
those is a mistake on my behalf. It should not be there, only
power_factor can have guessed limits. Apologies for the issue,
I'm still a beginning programmer!

Furthermore, I removed guessed high_limit values for `current`
and `power` because these are supposed to draw higher values as
more devices or components are installed on for example a PDU or
a chassis.

Finally, I removed guessed low_limit and high_limit for `dbm`
sensors, there is a much too large variance in power budget on
commercially available optical transceivers for there to be a
sensible window where you can guess these values.

@martijn-schmidt martijn-schmidt changed the title WIP - Changes to guessed limits for health sensors, documentation for adding sensor classes Changes to guessed limits for health sensors, documentation for sensor classes Jun 12, 2019

@martijn-schmidt

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

Should be a lot better now - after starting with a clean branch I applied the yaml group discovery fallback to null instead of "" patch and collected test data so we can see the effects of that in a separate commit, and then I applied the "changes to guessed limits" patch plus collected test data again to make sure the Travis build passes.

Also updated the PR description with the key differences of before vs after, as pointed out on Discord it was not clear enough.

Finally, would love some feedback on how to approach resetting the incorrectly guessed limits on existing LibreNMS deployments without messing up manually configured limits (even if we one-off clear limits that are pulled from SNMP as bycatch, those ought to be picked back up immediately with the next discovery run).

@murrant

This comment has been minimized.

Copy link
Member

commented Jun 16, 2019

If you set the fields to null in the DB, I think the auto-guessing will run again.

@murrant
Copy link
Member

left a comment

This seems ok. I wish we could do a better job of guessing, but I agree in some situations guessing wrong is worse.

@martijn-schmidt

This comment has been minimized.

Copy link
Contributor Author

commented Jun 16, 2019

If you set the fields to null in the DB, I think the auto-guessing will run again.

Yes, that's also how I interpreted the code - though of course some guesses will no longer happen so the limit will stay null which is good. However what I mean is, how do we clean up the limits for sensor classes that were incorrectly guessed? This seems to come up as a support question every now and then, it would be ideal if we could preempt the questions by taking care of the "past mistakes" by one-time updating the database upon loading this code for the first time.

@murrant

This comment has been minimized.

Copy link
Member

commented Jun 16, 2019

You can't, there is no way to differentiate from guessed limits and user set/adjusted limits.

@murrant murrant changed the title Changes to guessed limits for health sensors, documentation for sensor classes Remove guessed limits for some health sensors, documentation for sensor classes Jun 21, 2019

@murrant murrant merged commit 048fd0f into librenms:master Jun 21, 2019

6 checks passed

Inspection Summary
Details
Node: analysis
Details
Travis CI - Pull Request Build Passed
Details
WIP Ready for review
Details
codeclimate All good!
Details
license/cla Contributor License Agreement is signed.
Details

@martijn-schmidt martijn-schmidt deleted the martijn-schmidt:switch-fallthrough-fix branch Jun 22, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can鈥檛 perform that action at this time.