Fix healthcheck query to correctly monitor CoreOS nodes. #24

Merged: 6 commits into master from sanbox-roberto-fix-query, Apr 30, 2019

Conversation

@robertodauria (Contributor) commented on Apr 26, 2019

This is needed because the criteria for monitoring CoreOS nodes differ from those for legacy nodes.

Changes:

  • Check only service="ssh" (port 22)
  • Use rate(ndt_test_total[15m]) > 0 instead of the file creation rate (a sketch of both conditions follows below)

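For illustration, a rough sketch of the two conditions in PromQL. The selectors are taken from this PR; how NodeQuery in healthcheck/prometheus.go actually combines them, and which labels it joins on, may differ.

# Sketch only, not the committed NodeQuery.
# 1) The machine's SSH (port 22) probe has not succeeded within the window:
sum_over_time(probe_success{service="ssh", module="ssh_v4_online"}[15m]) == 0

# 2) The machine is still serving NDT tests, so it should not be rebooted:
rate(ndt_test_total[15m]) > 0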
This change is Reviewable

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob and @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

var (
	NodeQuery = `label_replace(sum_over_time(probe_success{service="ssh", module="ssh_v4_online"}[%[1]dm]) == 0,

Does this continue to work on the legacy platform nodes?

While we monitor port 22, I thought there were iptables rules to block most port 22 traffic on the legacy platform.

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Does this continue to work on the legacy platform nodes?

While we monitor port 22, I thought there were iptables rules to block most port 22 traffic on the legacy platform.

It definitely doesn't, because of the SSH port difference. Do we need to keep compatibility with legacy platform nodes?

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

It definitely doesn't, because of the SSH port difference. Do we need to keep compatibility with legacy platform nodes?

We need something like rebot to operate for the legacy nodes until there is no more legacy platform. What is your thinking for how to achieve that?

Shared out-of-band: it sounds like your thinking is to pin a version on EB that supports the legacy platform and develop rebot in parallel for the new platform exclusively.

What will happen when we begin upgrading some of the production nodes and have a hybrid platform? Please read these as clarifying questions: you may have thought this all through already; I'm just unclear what the plan is from this PR. Is the plan written down anywhere?

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

We need something like rebot to operate for the legacy nodes until there is no more legacy platform. What is your thinking for how to achieve that?

Shared out-of-band: it sounds like your thinking is to pin a version on EB that supports the legacy platform and develop rebot in parallel for the new platform exclusively.

What will happen when we begin upgrading some of the production nodes and have a hybrid platform? Please read these as clarifying questions: you may have thought this all through already; I'm just unclear what the plan is from this PR. Is the plan written down anywhere?

Yes, the idea is to pin the version currently in HEAD so it can keep working with legacy nodes and running on EB.

Future versions - which will include this query change and will use the Reboot API - will be deployed on GKE, one instance per environment. They will also just consider CoreOS nodes.

This query is supposed to exclude nodes that are reachable over port 806 (i.e. legacy nodes) in the same way the previous one excluded those reachable over port 22 (i.e. CoreOS nodes). As you mentioned elsewhere, it currently doesn't, and it incorrectly reports legacy platform nodes as offline.

I have a fix for that, which re-introduces the unless on(machine) label_replace(sum_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) > 0, [...]. Before I commit it, though, does the above sound good to you?
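For reference, a rough sketch of the shape of that exclusion in PromQL. The label_replace arguments are elided above, so the ones below (mapping instance to a machine label) are assumptions rather than the committed query:

# Sketch only: CoreOS nodes whose SSH (port 22) probe has been down for 15m,
# minus anything still reachable over port 806 (i.e. legacy nodes).
label_replace(
  sum_over_time(probe_success{service="ssh", module="ssh_v4_online"}[15m]) == 0,
  "machine", "$1", "instance", "(.+):.*")
unless on(machine)
label_replace(
  sum_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) > 0,
  "machine", "$1", "instance", "(.+):.*")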

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

Yes, the idea is to pin the version currently in HEAD so it can keep working with legacy nodes and running on EB.

Future versions - which will include this query change and will use the Reboot API - will be deployed on GKE, one instance per environment. They will also just consider CoreOS nodes.

This query is supposed to exclude nodes that are reachable over port 806 (i.e. legacy nodes) in the same way the previous one excluded those reachable over port 22 (i.e. CoreOS nodes). As you mentioned elsewhere, it currently doesn't, and it incorrectly reports legacy platform nodes as offline.

I have a fix for that, which re-introduces the unless on(machine) label_replace(sum_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) > 0, [...]. Before I commit it, though, does the above sound good to you?

The idea is to have the legacy-platform rebot continue to run from EB and only consider legacy-platform nodes (it does this today). As we migrate more nodes to the new platform, the legacy-platform rebot will consider fewer and fewer nodes.

At the same time, the k8s-platform rebot will be deployed to GKE and only consider k8s-platform nodes (with a modified query). As we migrate more nodes to the new platform, the k8s-platform rebot will consider more and more nodes.

Is this all correct? If so, this plan sounds good to me.

It sounds like you have one more change to make to this PR?

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

The idea is to have the legacy-platform rebot continue to run from EB and only consider legacy-platform nodes (it does this today). As we migrate more nodes to the new platform, the legacy-platform rebot will consider fewer and fewer nodes.

At the same time, the k8s-platform rebot will be deployed to GKE and only consider k8s-platform nodes (with a modified query). As we migrate more nodes to the new platform, the k8s-platform rebot will consider more and more nodes.

Is this all correct? If so, this plan sounds good to me.

It sounds like you have one more change to make to this PR?

That is all correct.

However, after a bit more thinking, I now believe there is no reason why the same query can't monitor both. Please have a look at the latest revision. This query should effectively work for both legacy platform nodes and CoreOS nodes. This means:

  1. We don't need to pin the version running on EB
  2. We don't need an instance on EB anymore; it can just run on GKE. The instance in mlab-oti will monitor and reboot both kinds of nodes as we start migrating production nodes.

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

That is all correct.

However, after a bit more thinking, I now believe there is no reason why the same query can't monitor both. Please have a look at the latest revision. This query should effectively work for both legacy platform nodes and CoreOS nodes. This means:

  1. We don't need to pin the version running on EB
  2. We don't need an instance on EB anymore; it can just run on GKE. The instance in mlab-oti will monitor and reboot both kinds of nodes as we start migrating production nodes.

One instance sounds much simpler. And, if we can't tell the difference between a legacy and a k8s node when it's offline, then having a single rebot instance to take action sounds better.

Also, I recall that there were recent changes to how k8s nodes are entered into lame_duck_node, e.g. kube_node_spec_taint{cluster="platform-cluster", value="lame-duck"} != bool 1 or similar. Can this query take both styles into account? I got that example from the mlab-ns query.

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

One instance sounds much simpler. And, if we can't tell the difference between a legacy and a k8s node when it's offline, then having a single rebot instance to take action sounds better.

Also, I recall that there were recent changes to how k8s nodes are entered into lame_duck_node, e.g. kube_node_spec_taint{cluster="platform-cluster", value="lame-duck"} != bool 1 or similar. Can this query take both styles into account? I got that example from the mlab-ns query.

Happy to keep this simpler and use a single query and a single version of Rebot.

About kube_node_spec_taint: which Prometheus instance holds that metric for k8s platform nodes?

My understanding is that the k8s platform's Prometheus holds the lame_duck status for the platform nodes. However, all the other metrics we use in this query are scraped by the Prometheus instance on GKE - e.g. blackbox-exporter metrics.

Are we going to move all the monitoring to the k8s platform? If so, I can add another unless on (machine) kube_node_spec_taint{...} here right now, which won't do anything until we begin using the k8s platform's Prometheus. Otherwise, do we need to query two different Prometheus instances to determine if a node must be rebooted?

I guess this would be a good topic for an in-person discussion. What do you think?

@stephen-soltesz left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @evfirerob)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, evfirerob (Roberto D'Auria) wrote…

Happy to keep this simpler and use a single query and a single version of Rebot.

About kube_node_spec_taint: which Prometheus instance holds that metric for k8s platform nodes?

My understanding is that the k8s platform's Prometheus holds the lame_duck status for the platform nodes. However, all the other metrics we use in this query are scraped by the Prometheus instance on GKE - e.g. blackbox-exporter metrics.

Are we going to move all the monitoring to the k8s platform? If so, I can add another unless on (machine) kube_node_spec_taint{...} here right now, which won't do anything until we begin using the k8s platform's Prometheus. Otherwise, do we need to query two different Prometheus instances to determine if a node must be rebooted?

I guess this would be a good topic for an in-person discussion. What do you think?

As discussed this morning, the kube_node_spec_taint metric is available from the Prometheus federation instance via the currently defined federation scrape rules. For now, this could add at most another minute of delay.

Would it help to talk more?

@robertodauria (Contributor, Author) left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @stephen-soltesz)


healthcheck/prometheus.go, line 15 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

As discussed this morning, the kube_node_spec_taint metric is available from the Prometheus federation instance via the currently defined federation scrape rules. For now, this could add at most another minute of delay.

Would it help to talk more?

Thank you. I've changed the query to include kube_node_spec_taint{key="lame-duck"} and added a comment for NodeQuery.
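For reference, a rough sketch of the kind of clause that adds. The relabeling from node to machine and the == 1 comparison are assumptions; the matchers and grouping in the committed NodeQuery may differ:

# Sketch only: the set of machines currently tainted as lame-duck on the k8s
# platform, to be subtracted from the offline set with "unless on(machine)".
label_replace(
  kube_node_spec_taint{key="lame-duck"} == 1,
  "machine", "$1", "node", "(.+)")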

@stephen-soltesz left a comment

:lgtm:

This query is complex enough that it may be worth writing unit tests for it, using something like https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/

I can't see a problem in the query, but I've missed things before. Unit tests make the state space more explicit.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@robertodauria (Contributor, Author) left a comment

I've opened an issue for myself to remember to write tests for this query. Thanks.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@robertodauria merged commit 09f096d into master on Apr 30, 2019
@robertodauria deleted the sanbox-roberto-fix-query branch on April 30, 2019 at 16:39