New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Node exporter monitoring mixin #941

Open
wants to merge 10 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@tomwilkie
Member

tomwilkie commented May 9, 2018

This is currently a work in progress.

Takes a bunch of the non-kubernetes-specific alerts and rules from the kubernetes-mixin and combines them with alerts from various places.

This is not supposed to be complete, just a starting point. For instructions on how to use it, best see the kubernetes-mixin README.

@tomwilkie tomwilkie changed the title from [WIPNode exporte to [WIP] Node exporter monitoring mixin May 9, 2018

Beginnings of a node-exporter monitoring mixin.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
@brian-brazil

While I'm behind the idea, this set of rules is inconsistent and does not follow our best practices in a few areas. I'd suggest a read of https://prometheus.io/docs/practices/rules/

||| % $._config,
'for': '1h',
labels: {
severity: 'warning',

This comment has been minimized.

@brian-brazil

brian-brazil May 9, 2018

Member

Our convention is page and ticket for severity

@brian-brazil

brian-brazil May 9, 2018

Member

Our convention is page and ticket for severity

This comment has been minimized.

@tomwilkie

tomwilkie May 9, 2018

Member

I'm not aware of this being elevated to the level of "our convention" - where is this documented?

@tomwilkie

tomwilkie May 9, 2018

Member

I'm not aware of this being elevated to the level of "our convention" - where is this documented?

This comment has been minimized.

@brian-brazil

brian-brazil May 9, 2018

Member

There was a long debate at some point as to what to use in our examples, which I can't find now. Page and ticket won out in the end.

@brian-brazil

brian-brazil May 9, 2018

Member

There was a long debate at some point as to what to use in our examples, which I can't find now. Page and ticket won out in the end.

This comment has been minimized.

@grobie

grobie Aug 9, 2018

Member

They didn't win, there was no agreement on such convention and I closed the ticket as it wasn't that important to me to change the one example where you used it.

@grobie

grobie Aug 9, 2018

Member

They didn't win, there was no agreement on such convention and I closed the ticket as it wasn't that important to me to change the one example where you used it.

Show outdated Hide outdated node-mixin/alerts/alerts.libsonnet
Show outdated Hide outdated node-mixin/dashboards/node.libsonnet
Show outdated Hide outdated node-mixin/dashboards/node.libsonnet
Show outdated Hide outdated node-mixin/dashboards/node.libsonnet
},
{
// Total memory per node
record: 'instance:node_memory_bytes_total:sum',

This comment has been minimized.

@brian-brazil

brian-brazil May 9, 2018

Member

The total here is confusing, this is not a counter.

It's generally not a good idea to have a recording rule that's just renaming a metric name.

@brian-brazil

brian-brazil May 9, 2018

Member

The total here is confusing, this is not a counter.

It's generally not a good idea to have a recording rule that's just renaming a metric name.

Show outdated Hide outdated node-mixin/rules/rules.libsonnet
Show outdated Hide outdated node-mixin/rules/rules.libsonnet
Show outdated Hide outdated node-mixin/rules/rules.libsonnet
||| % $._config,
},
{
record: 'instance:node_net_utilisation:sum_irate',

This comment has been minimized.

@brian-brazil

brian-brazil May 9, 2018

Member

This is just eth0, this metric is misleading.

@brian-brazil

brian-brazil May 9, 2018

Member

This is just eth0, this metric is misleading.

This comment has been minimized.

@tomwilkie

tomwilkie May 10, 2018

Member

I've made this cover ~"eth[0-9]+", trying to filter out the veth, docker, lo etc.

@tomwilkie

tomwilkie May 10, 2018

Member

I've made this cover ~"eth[0-9]+", trying to filter out the veth, docker, lo etc.

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Don't filter. The latest kernel naming goes beyond eth - and even before that things like bridges and wifi would not be covered.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Don't filter. The latest kernel naming goes beyond eth - and even before that things like bridges and wifi would not be covered.

tomwilkie added some commits May 10, 2018

Lower case binary operators and fix indentation.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Switch to irate[1m] for node dashboard.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Switch gauges to percentunit.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Fix up some of the USE metrics.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Remove k8s from dashboard title, make gauges use datasource variable.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
@discordianfish

This comment has been minimized.

Show comment
Hide comment
@discordianfish

discordianfish May 13, 2018

Member

Nice! Can you add a few words/link to the readme how to use them?

Member

discordianfish commented May 13, 2018

Nice! Can you add a few words/link to the readme how to use them?

@tomwilkie

This comment has been minimized.

Show comment
Hide comment
@tomwilkie

tomwilkie Jun 5, 2018

Member

Sure! Still a bit WIP, I need to come back and update lots of things for the latest release. Instructions will be mostly same as the kubernetes-mixin, so I'm thinking of centralising them to prevent problems keeping everything up to date.

Member

tomwilkie commented Jun 5, 2018

Sure! Still a bit WIP, I need to come back and update lots of things for the latest release. Instructions will be mostly same as the kubernetes-mixin, so I'm thinking of centralising them to prevent problems keeping everything up to date.

metalmatze and others added some commits Jul 13, 2018

Merge pull request #1 from metalmatze/mixin
Bug fixes and improvements on the current mixin
@@ -41,9 +41,9 @@
{
alert: 'NodeFilesystemOutOfSpace',
expr: |||
node_filesystem_avail{%(nodeExporterSelector)s,%(fsSelectors)s} / node_filesystem_size{%(nodeExporterSelector)s,%(fsSelectors)s} * 100 < 5

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

5% slave left is not out of space, and overlaps with the above alerts.

@brian-brazil

brian-brazil Aug 9, 2018

Member

5% slave left is not out of space, and overlaps with the above alerts.

@@ -41,9 +41,9 @@
{
alert: 'NodeFilesystemOutOfSpace',
expr: |||
node_filesystem_avail{%(nodeExporterSelector)s,%(fsSelectors)s} / node_filesystem_size{%(nodeExporterSelector)s,%(fsSelectors)s} * 100 < 5
node_filesystem_avail{%(nodeExporterSelector)s,%(fsSelectors)s} / node_filesystem_size{%(nodeExporterSelector)s,%(fsSelectors)s} * 100 < 5

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

these are the old metric names

@brian-brazil

brian-brazil Aug 9, 2018

Member

these are the old metric names

||| % $._config,
'for': '1h',
labels: {
severity: 'critical',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Is network contention really a page? Any user-facing problem will be caught by user-visible errors or latency This generally applies to all node exporter alerts, as the machine is not a user-facing service so any alerts will be for causes - not symptoms.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Is network contention really a page? Any user-facing problem will be caught by user-visible errors or latency This generally applies to all node exporter alerts, as the machine is not a user-facing service so any alerts will be for causes - not symptoms.

{
_config+:: {
// Selectors are inserted between {} in Prometheus queries.
nodeExporterSelector: 'job="node-exporter"',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

node is the convention, and hyphens should be avoided in job names.

You;re monitoring the node, it just happens to be via an exporter.

@brian-brazil

brian-brazil Aug 9, 2018

Member

node is the convention, and hyphens should be avoided in job names.

You;re monitoring the node, it just happens to be via an exporter.

This comment has been minimized.

@discordianfish

discordianfish Aug 14, 2018

Member

We need to parameterize this anyway, it's rarely "node" in my experience.

@discordianfish

discordianfish Aug 14, 2018

Member

We need to parameterize this anyway, it's rarely "node" in my experience.

This comment has been minimized.

@brian-brazil

brian-brazil Aug 14, 2018

Member

That's already the case, anyone can override this.

@brian-brazil

brian-brazil Aug 14, 2018

Member

That's already the case, anyone can override this.

nodeExporterSelector: 'job="node-exporter"',
// Mainly extracted because they are repetitive, but also useful to customize.
fsSelectors: 'fstype=~"ext.|xfs",mountpoint!="/var/lib/docker/aufs"',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

We filter docker in the node exporter by default, no need to mention it

This is over-excluding filesystems. More than ext4 and xfs are used.

@brian-brazil

brian-brazil Aug 9, 2018

Member

We filter docker in the node exporter by default, no need to mention it

This is over-excluding filesystems. More than ext4 and xfs are used.

||| % $._config,
},
{
record: 'instance:node_net_utilisation:sum_irate',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Don't filter. The latest kernel naming goes beyond eth - and even before that things like bridges and wifi would not be covered.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Don't filter. The latest kernel naming goes beyond eth - and even before that things like bridges and wifi would not be covered.

||| % $._config,
},
{
record: 'instance:node_net_utilisation:sum_irate',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Utilisation is a percentage, not a bytes/s. This is not a useful value, half-duplex network connections aren't a thing these days.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Utilisation is a percentage, not a bytes/s. This is not a useful value, half-duplex network connections aren't a thing these days.

||| % $._config,
},
{
record: 'instance:node_net_saturation:sum_irate',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

Metric name does not match source names

@brian-brazil

brian-brazil Aug 9, 2018

Member

Metric name does not match source names

// Disk saturation (ms spent, by rate() it's bound by 1 second)
record: 'instance:node_disk_saturation:sum_irate',
expr: |||
sum by (instance) (

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

These rules are losing all target labels, prefer without to by.

@brian-brazil

brian-brazil Aug 9, 2018

Member

These rules are losing all target labels, prefer without to by.

},
{
// CPU utilisation is % CPU is not idle.
record: 'instance:node_cpu_utilisation:avg1m',

This comment has been minimized.

@brian-brazil

brian-brazil Aug 9, 2018

Member

This is not an avg1m, it's an avg_rate1m - but even then it isn't as it's excluding one label. Try to avoid subtraction which takes you away from the source metric names.

@brian-brazil

brian-brazil Aug 9, 2018

Member

This is not an avg1m, it's an avg_rate1m - but even then it isn't as it's excluding one label. Try to avoid subtraction which takes you away from the source metric names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment