Device State Monitoring #2694

Rosiak · 2016-01-05T15:38:40Z

This is an issue to discuss the implementation of a proper device state system.

Here are some of my notes so far:

The idea is to improve, and possibly overhaul, the current (almost non-existent) state monitoring.
Quite a few people are missing this feature.

We should have a generic way to store and fetch this information.

We should be able to define custom value translations, since data presented through SNMP does not always make sense on its own. For example, @SaaldjorMike and I have some Dell iDrac devices that report their RAID state as one of "1,2,3,4,5,6", which translates to "1=Other, 2=Unknown, 3=OK, 4=Non-critical, 5=Critical, 6=Non-recoverable". Devices could also just report the actual state as a string.

I can't decide whether we should define static severity levels ("Information, Warning, Disaster", etc.) or just preserve and pass on the custom ones from the value translation part.

I think the UI is the easy part of this task.
Values we need to store: device_id, OID/MIB entry, value, translation/mapping ID, etc.
Could the logic from Device Components (#2623) be reused?

Tagging #1365 #1236

paulgear · 2016-01-13T23:25:43Z

Good thoughts, and we need to discuss more.

paulgear · 2016-01-17T02:21:28Z

We probably should just merge this into #1236.

Rosiak · 2016-01-30T22:55:13Z

Possible returned states:

Dell iDrac Powersupply

other(1)
unknown(2)
ok(3)
nonCritical(4)
critical(5)
nonRecoverable(6)

Cisco 4500X

ok(1)
unavailable(2)
nonoperational(3)

Generic LibreNMS State:
Since we cannot rely on the states returned by the devices themselves, I think we should map each returned state to a generic LibreNMS state. Those LibreNMS state values could be:

1 = unknown
2 = ok
3 = warning
4 = critical

The mapping could look like:

Dell iDrac Powersupply

other(1) = 1
unknown(2) = 1
ok(3) = 2
nonCritical(4) = 3
critical(5) = 4
nonRecoverable(6) = 4

Cisco 4500X

ok(1) = 2
unavailable(2) = 4
nonoperational(3) = 4

Or is this totally off?
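
To make the mapping above concrete, here is a minimal sketch of a per-device translation table. It is only an illustration of the idea, not how LibreNMS implements it: the keys `dell-idrac-psu` and `cisco-4500x`, the function name `translate_state`, and the fallback to "unknown" for unmapped values are all assumptions.

```python
# Generic LibreNMS states as proposed above (1-4).
UNKNOWN, OK, WARNING, CRITICAL = 1, 2, 3, 4

# Raw SNMP value -> (device label, generic state).
# The device keys are hypothetical names chosen for this sketch.
STATE_TRANSLATIONS = {
    "dell-idrac-psu": {
        1: ("other", UNKNOWN),
        2: ("unknown", UNKNOWN),
        3: ("ok", OK),
        4: ("nonCritical", WARNING),
        5: ("critical", CRITICAL),
        6: ("nonRecoverable", CRITICAL),
    },
    "cisco-4500x": {
        1: ("ok", OK),
        2: ("unavailable", CRITICAL),
        3: ("nonoperational", CRITICAL),
    },
}


def translate_state(device_type, raw_value):
    """Return (label, generic_state) for a raw SNMP value.

    Values without a mapping fall back to the generic "unknown" state
    (an assumption of this sketch).
    """
    table = STATE_TRANSLATIONS.get(device_type, {})
    return table.get(raw_value, ("unmapped", UNKNOWN))


print(translate_state("dell-idrac-psu", 4))
print(translate_state("cisco-4500x", 2))
```

Keeping the device label next to the generic state means the UI can still show the vendor's wording while alerting only looks at the generic value.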

paulgear · 2016-01-31T22:56:07Z

I like the OK, warning, critical, unknown set, as it matches exactly what Nagios checks use, which is a very commonly understood paradigm. In fact, making them use the same values as Nagios (0-3, if I recall correctly) would make a lot of sense to me.
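
As a rough sketch of what using the Nagios values would look like: Nagios plugin return codes are 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. The enum name `GenericState` and the helper `to_nagios` are hypothetical; the conversion table simply renumbers the 1-4 values proposed earlier in this thread.

```python
from enum import IntEnum


class GenericState(IntEnum):
    """Generic states using the Nagios plugin return codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Renumbering from the 1-4 proposal (1=unknown, 2=ok, 3=warning, 4=critical).
PROPOSED_TO_NAGIOS = {
    1: GenericState.UNKNOWN,
    2: GenericState.OK,
    3: GenericState.WARNING,
    4: GenericState.CRITICAL,
}


def to_nagios(proposed_value):
    """Convert a 1-4 value to a Nagios code; anything else becomes UNKNOWN."""
    return PROPOSED_TO_NAGIOS.get(proposed_value, GenericState.UNKNOWN)


print(to_nagios(2).name, int(to_nagios(2)))
```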

Rosiak · 2016-02-09T19:35:20Z

@paulgear
Agree on that.

Proposed Table Layout:

state_id
int(11)

state_descr
varchar(255)

state_draw_graph
tinyint(1)

state_value
tinyint(1)

state_generic_value
tinyint(1)

state_lastupdated
timestamp
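
For anyone who wants to play with this layout, here is a throwaway sketch that creates it in an in-memory SQLite database. The table name `state`, the primary key on `state_id`, and the default on `state_lastupdated` are assumptions made for the sketch; the real schema would be MySQL and may differ. The sample row uses an iDrac raw value of 3 (ok) with a Nagios-style generic value of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns follow the proposed layout; table name and constraints are assumed.
conn.execute(
    """
    CREATE TABLE state (
        state_id            INTEGER PRIMARY KEY,
        state_descr         VARCHAR(255),
        state_draw_graph    TINYINT(1),
        state_value         TINYINT(1),
        state_generic_value TINYINT(1),
        state_lastupdated   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# Example row: raw device value 3 (ok on iDrac) stored with generic value 0 (OK).
conn.execute(
    "INSERT INTO state (state_descr, state_draw_graph, state_value, state_generic_value) "
    "VALUES (?, ?, ?, ?)",
    ("Power Supply 1", 1, 3, 0),
)

row = conn.execute(
    "SELECT state_descr, state_value, state_generic_value FROM state"
).fetchone()
print(row)
```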

laf · 2016-02-09T19:48:31Z

I think trying to squeeze all states into just three types will be limiting. Can you not look that info up from the MIB to convert the state into a human-readable value?

paulgear · 2016-02-13T03:25:54Z

I think that if we're trying to provide a framework which allows intuitive colouring of components and alerting, we have to keep it generic and simple, and the Nagios 0-3 OK-Unknown levels are something that pretty much anyone who has worked with monitoring systems will understand. https://nagios-plugins.org/doc/guidelines.html#AEN78

Rosiak · 2016-02-13T11:06:29Z

Think I discussed this with @laf on IRC afterwards.

vpsman · 2016-02-13T20:43:25Z

I like the idea of mapping to Nagios-style statuses in the format "{librenms_state}: {extended_state}", e.g.

"CRITICAL: Cache module critical failure"
or
"WARNING: Rebuilding"

I'd like to discuss the relationship between the sensors table and this new table. I don't think we can use sensor_index or sensor_oid, so we may need a junction table. Nothing wrong with that, but I guess it means we need to make the connection at discovery time, so we'll need to update everything in includes/discovery/states/*.inc.php. That looks manageable at this stage.
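
A minimal sketch of the "{librenms_state}: {extended_state}" format mentioned above, assuming the generic state is stored as a Nagios code (0-3). The names `NAGIOS_LABELS` and `format_status` are made up for this example.

```python
# Nagios plugin return codes and their conventional labels.
NAGIOS_LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_status(generic_value, extended_state):
    """Build "<LABEL>: <device-specific text>"; unrecognised codes show as UNKNOWN."""
    label = NAGIOS_LABELS.get(generic_value, "UNKNOWN")
    return f"{label}: {extended_state}"


print(format_status(2, "Cache module critical failure"))
print(format_status(1, "Rebuilding"))
```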

allenacevedo · 2017-03-22T19:22:30Z

Can we add some more status checks for the Dell iDrac RAID controller?
OMSA_Storage_Disk_1
vbvalue : 1
oid : .1.3.6.1.4.1.674.10893.1.20.140.1.1.1.1

OMSA_Storage_Disk_2
oid : .1.3.6.1.4.1.674.10893.1.20.140.1.1.1.2
vbvalue : 2

OMSA_Storage_Disk_3
oid : .1.3.6.1.4.1.674.10893.1.20.140.1.1.1.3
vbvalue : 3

They should map to the same levels as above:
ok(3) = 2
nonCritical(4) = 3
critical(5) = 4
nonRecoverable(6) = 4

lock · 2018-05-17T06:42:10Z

This thread has been automatically locked since there has not been any recent activity after it was closed.

paulgear added Design Core labels Jan 13, 2016

Rosiak mentioned this issue Jan 17, 2016

Dell TL4000 tape library #2752

Closed

Rosiak mentioned this issue Feb 26, 2016

Proper State Monitoring #3102

Merged

laf closed this as completed Feb 27, 2016

lock bot locked as resolved and limited conversation to collaborators May 17, 2018
