Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Node exporter monitoring mixin #941

Closed
wants to merge 13 commits into from
Closed

Conversation

@tomwilkie
Copy link
Member

@tomwilkie tomwilkie commented May 9, 2018

This is currently a work in progress.

Takes a bunch of the non-kubernetes-specific alerts and rules from the kubernetes-mixin and combines them with alerts from various places.

This is not supposed to be complete, just a starting point. For instructions on how to use it, best see the kubernetes-mixin README.

@tomwilkie tomwilkie changed the title [WIPNode exporte [WIP] Node exporter monitoring mixin May 9, 2018
@tomwilkie tomwilkie force-pushed the tomwilkie:mixin branch from 5c033a4 to 3df7b10 May 9, 2018
Copy link
Member

@brian-brazil brian-brazil left a comment

While I'm behind the idea, this set of rules is inconsistent and does not follow our best practices in a few areas. I'd suggest a read of https://prometheus.io/docs/practices/rules/

node-mixin/alerts/alerts.libsonnet Show resolved Hide resolved
node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
node-mixin/dashboards/node.libsonnet Outdated Show resolved Hide resolved
node-mixin/dashboards/node.libsonnet Outdated Show resolved Hide resolved
node-mixin/dashboards/node.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
@discordianfish
Copy link
Member

@discordianfish discordianfish commented May 13, 2018

Nice! Can you add a few words/link to the readme how to use them?

@tomwilkie
Copy link
Member Author

@tomwilkie tomwilkie commented Jun 5, 2018

Sure! Still a bit WIP, I need to come back and update lots of things for the latest release. Instructions will be mostly same as the kubernetes-mixin, so I'm thinking of centralising them to prevent problems keeping everything up to date.

node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
node-mixin/alerts/alerts.libsonnet Show resolved Hide resolved
node-mixin/config.libsonnet Show resolved Hide resolved
node-mixin/config.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
@metalmatze
Copy link
Contributor

@metalmatze metalmatze commented Oct 12, 2018

I'd be happy to continue with this PR, if @tomwilkie allows and it somehow is possible for me to push changes. ☺️

@discordianfish
Copy link
Member

@discordianfish discordianfish commented Nov 5, 2018

I'd really like to see this revived too! @tomwilkie What is the state of this? Can @metalmatze help with this?

tomwilkie and others added 11 commits May 8, 2018
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
@metalmatze metalmatze force-pushed the tomwilkie:mixin branch from ce0f388 to 53e4093 Nov 19, 2018
@metalmatze
Copy link
Contributor

@metalmatze metalmatze commented Nov 19, 2018

I've updated the rules and alerts for v0.16. Additionally I've rebased my commits to sign them off.

metalmatze added 2 commits Nov 19, 2018
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
@beorn7
Copy link
Member

@beorn7 beorn7 commented Jul 1, 2019

@metalmatze what's your state on this? As I have recently worked on (prometheus/prometheus#4474 , I would give this PR a similar treatment (respond to open comments, general grooming, add README, Makefile, etc.). I just would like to avoid the two of us working on it concurrently. So please let me know if it is OK if I take the lock here. And if you have any work on this not yet pushed, now would be a good time to do so.
Thanks!

@metalmatze
Copy link
Contributor

@metalmatze metalmatze commented Jul 1, 2019

Other than the few commits that I've pushed, I did not put more time since then into it. Feel free to go ahead! 😊

@beorn7
Copy link
Member

@beorn7 beorn7 commented Jul 2, 2019

As you can see from the issues referencing this above, this PR is getting quite a bit of popularity.
My plan is to first clean up formal concerns in here and then take comments about more alerts to add and how to improve alerts herein.

It might also be more workable to merge this in a not-blatantly-wrong-but-still-WIP state and then let several other people create PRs affecting subsets of the alerts/dashboards.

Copy link
Member

@brian-brazil brian-brazil left a comment

Just a few small things I missed last time.

// Full utilisation would be all disks on each node spending an average of
// 1 sec per second doing I/O, normalize by node count for stacked charts
g.queryPanel(|||
instance:node_disk_utilisation:sum_irate / scalar(sum(up{%(nodeExporterSelector)s}))

This comment has been minimized.

@brian-brazil

brian-brazil Jul 2, 2019
Member

sum of irate is just :irate

This comment has been minimized.

@beorn7

beorn7 Jul 16, 2019
Member

Hmm, it's still the sum over all devices, right? Let's discuss in my follow-up PR. This rule name has changed completely anyway.

)
.addPanel(
g.panel('CPU Saturation (Load1)') +
g.queryPanel('instance:node_cpu_saturation_load1:{instance="$instance"}', 'Saturation') +

This comment has been minimized.

@brian-brazil

brian-brazil Jul 2, 2019
Member

Missing an operation, and load is not a useful saturation indication.

This comment has been minimized.

@beorn7

beorn7 Jul 16, 2019
Member

I have changed the name already, and I've added a TODO to the dashboard where this is used, questioning if load average dashboards are useful at all.

@brian-brazil
Copy link
Member

@brian-brazil brian-brazil commented Jul 2, 2019

Yeah, let's fix the various problems this already has before expanding further.

@beorn7
Copy link
Member

@beorn7 beorn7 commented Jul 2, 2019

To clarify: I'm working on the formal and fundamental problems right now. Will drop a note here when done, and then we can decide if we want to refine things in this PR or better in later, separate PRs.

@metalmatze
Copy link
Contributor

@metalmatze metalmatze commented Jul 2, 2019

Sounds all good to me! Thanks for picking up this PR! 🎉

@beorn7
Copy link
Member

@beorn7 beorn7 commented Jul 16, 2019

FYI: Since I cannot update Tom's branch, I'll create a new PR based on this one, with all the changes I have made included. Stay tuned…

@beorn7
Copy link
Member

@beorn7 beorn7 commented Jul 16, 2019

New PR with all comments either completely addressed or at least acknowledged with a TODO is here: #1429 . I'll close this .

@beorn7 beorn7 closed this Jul 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.