Device State Monitoring #2694

Closed
Rosiak opened this issue Jan 5, 2016 · 11 comments

@Rosiak
Member

Rosiak commented Jan 5, 2016

This is an issue to discuss the implementation of a proper device state system.

Here are some of my notes so far:

The idea is to improve and possibly overhaul the current (almost non-existent) state monitoring.
Quite a few people are missing this feature.

We should have a generic way to store and fetch this information.

We should be able to define custom value translations, since data presented through SNMP does not always make sense on its own. For example, @SaaldjorMike and I have some Dell iDrac devices whose RAID state is reported as an integer from 1 to 6, which translates to "1=Other, 2=Unknown, 3=OK, 4=Non-critical, 5=Critical, 6=Non-recoverable". Devices could also just report back the actual state as a string.
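
A minimal sketch of what such a value translation could look like, assuming a plain lookup table per device type. The names `IDRAC_RAID_STATE` and `translate_state` are hypothetical and only illustrate the idea; nothing here is existing LibreNMS code.

```python
# Hypothetical translation table built from the iDrac RAID values quoted above.
IDRAC_RAID_STATE = {
    1: "Other",
    2: "Unknown",
    3: "OK",
    4: "Non-critical",
    5: "Critical",
    6: "Non-recoverable",
}


def translate_state(raw, table):
    """Return a human-readable label for a raw SNMP value."""
    # Devices that already report the state as a string are passed through.
    if isinstance(raw, str) and not raw.strip().isdigit():
        return raw
    # Numeric values (int or numeric string) are looked up in the table;
    # anything without an entry is flagged instead of being dropped.
    return table.get(int(raw), f"Unmapped({raw})")


print(translate_state(3, IDRAC_RAID_STATE))
print(translate_state("degraded", IDRAC_RAID_STATE))
```

Keeping the table as data rather than code would let each device type ship its own translation without touching the lookup logic.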

I can't decide whether we should define static severity levels ("Information", "Warning", "Disaster", etc.) or just preserve and pass on the custom ones from the value translation part.

  • I think the UI is the easy part of this task.
  • Values we need to store: device_id, OID/MIB entry, value, translation/mapping id, etc.
  • Could the logic from Device Components (#2623) be reused?

Tagging #1365 #1236

@paulgear
Member

Good thoughts, and we need to discuss more.

@paulgear
Member

We probably should just merge this into #1236.

@Rosiak
Member Author

Rosiak commented Jan 30, 2016

Possible returned states:

Dell iDrac Powersupply

  • other(1)
  • unknown(2)
  • ok(3)
  • nonCritical(4)
  • critical(5)
  • nonRecoverable(6)

Cisco 4500X

  • ok(1)
  • unavailable(2)
  • nonoperational(3)

Generic LibreNMS State:
Because we cannot rely on the states returned by the devices themselves, I think we should map each returned state to a generic LibreNMS state. Those LibreNMS state values could be:

  • 1 = unknown
  • 2 = ok
  • 3 = warning
  • 4 = critical

The mapping could look like:

Dell iDrac Powersupply

  • other(1) = 1
  • unknown(2) = 1
  • ok(3) = 2
  • nonCritical(4) = 3
  • critical(5) = 4
  • nonRecoverable(6) = 4

Cisco 4500X

  • ok(1) = 2
  • unavailable(2) = 4
  • nonoperational(3) = 4

Or is this totally off?
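
The mapping above could be sketched roughly as follows, using the 1 to 4 scale proposed in this comment. The dictionary keys (`dell-idrac-psu`, `cisco-4500x`) and the function name `generic_state` are made up for illustration.

```python
# Generic states as proposed in this comment (a proposal, not a settled API).
UNKNOWN, OK, WARNING, CRITICAL = 1, 2, 3, 4

# Hypothetical per-device maps: raw device value -> generic state.
STATE_MAPS = {
    "dell-idrac-psu": {
        1: UNKNOWN,   # other
        2: UNKNOWN,   # unknown
        3: OK,        # ok
        4: WARNING,   # nonCritical
        5: CRITICAL,  # critical
        6: CRITICAL,  # nonRecoverable
    },
    "cisco-4500x": {
        1: OK,        # ok
        2: CRITICAL,  # unavailable
        3: CRITICAL,  # nonoperational
    },
}


def generic_state(device_type, raw_value):
    """Map a raw device state to a generic state, defaulting to UNKNOWN."""
    return STATE_MAPS.get(device_type, {}).get(raw_value, UNKNOWN)


print(generic_state("dell-idrac-psu", 4))
print(generic_state("cisco-4500x", 2))
```

Falling back to `UNKNOWN` for unmapped devices or values is one possible design choice; it avoids reporting a state as OK just because no mapping exists.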

@paulgear
Member

I like the OK, warning, critical, unknown set, as it matches exactly what Nagios checks use, which is a very commonly understood paradigm. In fact, making them use the same values as Nagios (0-3, if I recall correctly) would make a lot of sense to me.
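
For reference, the Nagios plugin return codes are 0 = OK, 1 = WARNING, 2 = CRITICAL and 3 = UNKNOWN. A small sketch of how the 1 to 4 values proposed earlier in this thread could be renumbered to match; `NagiosState` and `PROPOSED_TO_NAGIOS` are illustrative names only.

```python
from enum import IntEnum


class NagiosState(IntEnum):
    """Return codes used by Nagios plugins."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Hypothetical conversion from the 1-4 scale proposed earlier in the thread
# (1 = unknown, 2 = ok, 3 = warning, 4 = critical) to the Nagios values.
PROPOSED_TO_NAGIOS = {
    1: NagiosState.UNKNOWN,
    2: NagiosState.OK,
    3: NagiosState.WARNING,
    4: NagiosState.CRITICAL,
}

for proposed, nagios in PROPOSED_TO_NAGIOS.items():
    print(proposed, "->", int(nagios), nagios.name)
```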

@Rosiak
Member Author

Rosiak commented Feb 9, 2016

@paulgear
Agree on that.

Proposed Table Layout:

Column                Type
state_id              int(11)
state_descr           varchar(255)
state_draw_graph      tinyint(1)
state_value           tinyint(1)
state_generic_value   tinyint(1)
state_lastupdated     timestamp
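
To try the layout out quickly, here is a rough sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `device_states` is an assumption (no name was proposed above), `state_id` is declared as `INTEGER PRIMARY KEY` to suit SQLite rather than MySQL's `int(11)`, and the example row assumes the Nagios-style generic values (1 = warning).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the proposed layout; the table name is hypothetical.
conn.execute(
    """
    CREATE TABLE device_states (
        state_id            INTEGER PRIMARY KEY,
        state_descr         VARCHAR(255),
        state_draw_graph    TINYINT(1),
        state_value         TINYINT(1),
        state_generic_value TINYINT(1),
        state_lastupdated   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# Example row: iDrac nonCritical(4) stored with generic value 1 (warning).
conn.execute(
    "INSERT INTO device_states "
    "(state_descr, state_draw_graph, state_value, state_generic_value) "
    "VALUES (?, ?, ?, ?)",
    ("nonCritical", 0, 4, 1),
)

row = conn.execute(
    "SELECT state_descr, state_value, state_generic_value FROM device_states"
).fetchone()
print(row)
```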

@laf
Member

laf commented Feb 9, 2016

I think trying to squeeze all states into just three types will be limiting. Can you not look that info up from the MIB to convert the state into a human-readable value?

@paulgear
Member

I think that if we're trying to provide a framework which allows intuitive colouring of components and alerting, we have to keep it generic and simple, and the Nagios 0-3 OK-Unknown levels are something that pretty much anyone who has worked with monitoring systems will understand. https://nagios-plugins.org/doc/guidelines.html#AEN78

@Rosiak
Member Author

Rosiak commented Feb 13, 2016

Think I discussed this with @laf on IRC afterwards.

@vpsman

vpsman commented Feb 13, 2016

I like the idea of mapping to Nagios-style statuses in the format "{librenms_state}: {extended_state}", e.g.

"CRITICAL: Cache module critical failure"
or
"WARNING: Rebuilding"
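
A quick sketch of that format, assuming the generic state is stored as a Nagios-style integer (0 to 3). `LABELS` and `format_status` are hypothetical names.

```python
# Nagios-style labels for the generic state value.
LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_status(generic_value, extended_state):
    """Build a "{librenms_state}: {extended_state}" status string."""
    # Values outside 0-3 are reported as UNKNOWN rather than raising.
    label = LABELS.get(generic_value, "UNKNOWN")
    return f"{label}: {extended_state}"


print(format_status(2, "Cache module critical failure"))
print(format_status(1, "Rebuilding"))
```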

I'd like to discuss the relationship between the sensors table and this new table. I don't think we can use sensor_index or sensor_oid, so we may need a junction table. There is nothing wrong with that, but it means we need to make the connection at discovery time, so we'll need to update everything in includes/discovery/states/*.inc.php. That looks manageable at this stage.
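
As a rough illustration of the junction-table idea, here is a sketch using Python's `sqlite3` with an in-memory database. The table and column names (`states`, `sensor_state_link`, `sensor_descr`) and the sample rows are assumptions for the example, not the real LibreNMS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical minimal schema: sensors, states, and a junction table
# that discovery would populate to connect the two.
conn.executescript(
    """
    CREATE TABLE sensors (
        sensor_id    INTEGER PRIMARY KEY,
        sensor_descr TEXT
    );
    CREATE TABLE states (
        state_id            INTEGER PRIMARY KEY,
        state_descr         TEXT,
        state_generic_value INTEGER
    );
    CREATE TABLE sensor_state_link (
        sensor_id INTEGER NOT NULL REFERENCES sensors(sensor_id),
        state_id  INTEGER NOT NULL REFERENCES states(state_id),
        PRIMARY KEY (sensor_id, state_id)
    );
    INSERT INTO sensors VALUES (1, 'RAID controller 0');
    INSERT INTO states VALUES (10, 'Rebuilding', 1);
    INSERT INTO sensor_state_link VALUES (1, 10);
    """
)

# Resolve a sensor to its state through the junction table.
rows = conn.execute(
    """
    SELECT s.sensor_descr, st.state_descr, st.state_generic_value
    FROM sensors s
    JOIN sensor_state_link l ON l.sensor_id = s.sensor_id
    JOIN states st ON st.state_id = l.state_id
    """
).fetchall()
print(rows)
```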

@allenacevedo

Can we add some more Status checks to the Dell iDrac raid controller?
OMSA_Storage_Disk_1
oid : .1.3.6.1.4.1.674.10893.1.20.140.1.1.1.1
vbvalue : 1

OMSA_Storage_Disk_2
oid : .1.3.6.1.4.1.674.10893.1.20.140.1.1.1.2
vbvalue : 2

OMSA_Storage_Disk_3
oid : .1.3.6.1.4.1.674.10893.1.20.140.1.1.1.3
vbvalue : 3

It should respond with the same level of:
ok(3) = 2
nonCritical(4) = 3
critical(5) = 4
nonRecoverable(6) = 4

@lock

lock bot commented May 17, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed.

@lock lock bot locked as resolved and limited conversation to collaborators May 17, 2018

5 participants