Fix healthcheck query to correctly monitor CoreOS nodes. #24

Merged: 6 commits into master from sanbox-roberto-fix-query, Apr 30, 2019

Conversation

@robertodauria (Contributor) commented on Apr 26, 2019

This is needed because the criteria for monitoring CoreOS nodes differ from those for legacy nodes.

Changes:

  • Check only service="ssh" (port 22)
  • Use rate(ndt_test_total[15m]) > 0 instead of the file creation rate (a sketch of both conditions follows below)

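For illustration, a rough sketch of the two conditions in PromQL. The selectors are taken from this PR; how NodeQuery in healthcheck/prometheus.go actually combines them, and which labels it joins on, may differ.

# Sketch only, not the committed NodeQuery.
# 1) The machine's SSH (port 22) probe has not succeeded within the window:
sum_over_time(probe_success{service="ssh", module="ssh_v4_online"}[15m]) == 0

# 2) The machine is still serving NDT tests, so it should not be rebooted:
rate(ndt_test_total[15m]) > 0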
This change is Reviewable

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob and @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

var (
	NodeQuery = `label_replace(sum_over_time(probe_success{service="ssh", module="ssh_v4_online"}[%[1]dm]) == 0,

Does this continue to work on the legacy platform nodes?

While we monitor port 22, I thought there were iptables rules to block most port 22 traffic on the legacy platform.

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Does this continue to work on the legacy platform nodes?

While we monitor port 22, I thought there were iptables rules to block most port 22 traffic on the legacy platform.

It definitely doesn't, because of the SSH port difference. Do we need to keep compatibility with legacy platform nodes?

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

It definitely doesn't, because of the SSH port difference. Do we need to keep compatibility with legacy platform nodes?

We need something like rebot to operate for the legacy nodes until there is no more legacy platform. What is your thinking for how to achieve that?

Shared out-of-band: it sounds like your thinking is to pin a version on EB that supports the legacy platform and develop rebot in parallel for the new platform exclusively.

What will happen when we begin upgrading some of the production nodes and have a hybrid platform? Please read these as clarifying questions: you may have thought this all through already; I'm just unclear what the plan is from this PR. Is the plan written down anywhere?

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

We need something like rebot to operate for the legacy nodes until there is no more legacy platform. What is your thinking for how to achieve that?

Shared out-of-band: it sounds like your thinking is to pin a version on EB that supports the legacy platform and develop rebot in parallel for the new platform exclusively.

What will happen when we begin upgrading some of the production nodes and have a hybrid platform? Please read these as clarifying questions: you may have thought this all through already; I'm just unclear what the plan is from this PR. Is the plan written down anywhere?

Yes, the idea is to pin the version currently in HEAD so it can keep working with legacy nodes and running on EB.

Future versions - which will include this query change and will use the Reboot API - will be deployed on GKE, one instance per environment. They will also just consider CoreOS nodes.

This query is supposed to exclude nodes that are reachable over port 806 (i.e. legacy nodes) in the same way the previous one excluded those reachable over port 22 (i.e. CoreOS nodes). As you mentioned elsewhere, it currently doesn't, and it incorrectly reports legacy platform nodes as offline.

I have a fix for that, which re-introduces the unless on(machine) label_replace(sum_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) > 0, [...]. Before I commit it, though, does the above sound good to you?
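For reference, a rough sketch of the shape of that exclusion in PromQL. The label_replace arguments are elided above, so the ones below (mapping instance to a machine label) are assumptions rather than the committed query:

# Sketch only: CoreOS nodes whose SSH (port 22) probe has been down for 15m,
# minus anything still reachable over port 806 (i.e. legacy nodes).
label_replace(
  sum_over_time(probe_success{service="ssh", module="ssh_v4_online"}[15m]) == 0,
  "machine", "$1", "instance", "(.+):.*")
unless on(machine)
label_replace(
  sum_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) > 0,
  "machine", "$1", "instance", "(.+):.*")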

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

Yes, the idea is to pin the version currently in HEAD so it can keep working with legacy nodes and running on EB.

Future versions - which will include this query change and will use the Reboot API - will be deployed on GKE, one instance per environment. They will also just consider CoreOS nodes.

This query is supposed to exclude nodes that are reachable over port 806 (i.e. legacy nodes) in the same way the previous one excluded those reachable over port 22 (i.e. CoreOS nodes). As you mentioned elsewhere, it currently doesn't, and it incorrectly reports legacy platform nodes as offline.

I have a fix for that, which re-introduces the unless on(machine) label_replace(sum_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) > 0, [...]. Before I commit it, though, does the above sound good to you?

The idea is to have the legacy-platform rebot continue to run from EB and only consider legacy-platform nodes (it does this today). As we migrate more nodes to the new platform, the legacy-platform rebot will consider fewer and fewer nodes.

At the same time, the k8s-platform rebot will be deployed to GKE and only consider k8s-platform nodes (with a modified query). As we migrate more nodes to the new platform, the k8s-platform rebot will consider more and more nodes.

Is this all correct? If so, this plan sounds good to me.

It sounds like you have one more change to make to this PR?

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

The idea is to have the legacy-platform rebot continue to run from EB and only consider legacy-platform nodes (it does this today). As we migrate more nodes to the new platform, the legacy-platform rebot will consider fewer and fewer nodes.

At the same time, the k8s-platform rebot will be deployed to GKE and only consider k8s-platform nodes (with a modified query). As we migrate more nodes to the new platform, the k8s-platform rebot will consider more and more nodes.

Is this all correct? If so, this plan sounds good to me.

It sounds like you have one more change to make to this PR?

That is all correct.

However, after a bit more thinking, I now believe there is no reason why the same query can't monitor both. Please have a look at the latest revision. This query should effectively work for both legacy platform nodes and CoreOS nodes. This means:

  1. We don't need to pin the version running on EB
  2. We don't need an instance on EB anymore; it can just run on GKE. The instance in mlab-oti will monitor and reboot both kinds of nodes as we start migrating production nodes.

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

That is all correct.

However, after a bit more thinking, I now believe there is no reason why the same query can't monitor both. Please have a look at the latest revision. This query should effectively work for both legacy platform nodes and CoreOS nodes. This means:

  1. We don't need to pin the version running on EB
  2. We don't need an instance on EB anymore; it can just run on GKE. The instance in mlab-oti will monitor and reboot both kinds of nodes as we start migrating production nodes.

One instance sounds much simpler. And, if we can't tell the difference between a legacy and a k8s node when it's offline, then having a single rebot instance to take action sounds better.

Also, I recall that there were recent changes to how k8s nodes are entered into lame_duck_node, e.g. kube_node_spec_taint{cluster="platform-cluster", value="lame-duck"} != bool 1 or similar. Can this query take both styles into account? I got that example from the mlab-ns query.

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

One instance sounds much simpler. And, if we can't tell the difference between a legacy and a k8s node when it's offline, then having a single rebot instance to take action sounds better.

Also, I recall that there were recent changes to how k8s nodes are entered into lame_duck_node, e.g. kube_node_spec_taint{cluster="platform-cluster", value="lame-duck"} != bool 1 or similar. Can this query take both styles into account? I got that example from the mlab-ns query.

Happy to keep this simpler and use a single query and a single version of Rebot.

About kube_node_spec_taint: which Prometheus instance holds that metric for k8s platform nodes?

My understanding is that the k8s platform's Prometheus holds the lame_duck status for the platform nodes. However, all the other metrics we use in this query are scraped by the Prometheus instance on GKE - e.g. blackbox-exporter metrics.

Are we going to move all the monitoring to the k8s platform? If so, I can add another unless on (machine) kube_node_spec_taint{...} here right now, which won't do anything until we begin using the k8s platform's Prometheus. Otherwise, do we need to query two different Prometheus instances to determine if a node must be rebooted?

I guess this would be a good topic for an in-person discussion. What do you think?

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

Happy to keep this simpler and use a single query and a single version of Rebot.

About kube_node_spec_taint: which Prometheus instance holds that metric for k8s platform nodes?

My understanding is that the k8s platform's Prometheus holds the lame_duck status for the platform nodes. However, all the other metrics we use in this query are scraped by the Prometheus instance on GKE - e.g. blackbox-exporter metrics.

Are we going to move all the monitoring to the k8s platform? If so, I can add another unless on (machine) kube_node_spec_taint{...} here right now, which won't do anything until we begin using the k8s platform's Prometheus. Otherwise, do we need to query two different Prometheus instances to determine if a node must be rebooted?

I guess this would be a good topic for an in-person discussion. What do you think?

As discussed this morning, the kube_node_spec_taint metric is available from the Prometheus federation instance via the currently defined federation scrape rules. For now, this could add at most another minute of delay.

Would it help to talk more?

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

As discussed this morning, the kube_node_spec_taint metric is available from the Prometheus federation instance via the currently defined federation scrape rules. For now, this could add at most another minute of delay.

Would it help to talk more?

Thank you. I've changed the query to include kube_node_spec_taint{key="lame-duck"} and added a comment for NodeQuery.
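For reference, a rough sketch of the kind of clause that adds. The relabeling from node to machine and the == 1 comparison are assumptions; the matchers and grouping in the committed NodeQuery may differ:

# Sketch only: the set of machines currently tainted as lame-duck on the k8s
# platform, to be subtracted from the offline set with "unless on(machine)".
label_replace(
  kube_node_spec_taint{key="lame-duck"} == 1,
  "machine", "$1", "node", "(.+)")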

@stephen-soltesz left a comment

:lgtm:

This query is complex enough that it may be worth writing unit tests for it, using something like https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/

I can't see a problem in the query, but I've missed things before. Unit tests make the state space more explicit.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@robertodauria (Contributor, Author) left a comment

I've opened an issue for myself to remember to write tests for this query. Thanks.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@robertodauria merged commit 09f096d into master on Apr 30, 2019
@robertodauria deleted the sanbox-roberto-fix-query branch on April 30, 2019 at 16:39