Change error handling when scraping metrics #551

Merged: 1 commit merged into main on Dec 6, 2021

Conversation

@lkysow (Member) commented Jul 5, 2021

Fixes some issues with metrics merging. Namely: if there were any errors getting metrics from Envoy, we'd return a 200 with empty metrics instead of a 500; and if the service returned an error, e.g. a 400, we'd stick the body of its response into our metrics output, which would break Prometheus. (See the sketch after the list below.)

  • If Envoy returns an error, also respond with a 500 in our merged
    metrics response so that Prometheus knows we had an error rather than
    thinking there are no metrics.
  • If the service metrics come back with a non-2xx status code, don't
    include the response body in the merged metrics. This stops issues
    where users accidentally turn on metrics merging without having an
    exporter, so their metrics endpoint returns a 404. I could have
    responded with a 500 in this case to indicate an error, but I think
    it's more likely that users are accidentally turning on metrics
    merging, and the error indication is accomplished via a new metric
    (see below).
  • Append a new metric that indicates the success of the service
    scraping. Users can alert on this, since the response code of the
    service metrics response is discarded:
    • success: consul_metrics_merging_service_metrics_success 1
    • fail: consul_metrics_merging_service_metrics_success 0
  • Modify logging to use key/value pairs.
  • Fixes #546 (Metrics Merging should ignore non-2xx service metrics).

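Here is a minimal Go sketch of the behavior described above. The handler name, the service metrics address, the listen port, and the overall structure are assumptions for illustration only, not the PR's actual implementation; only the Envoy address and the success-metric name come from this PR.

```go
// Hypothetical sketch of the merged-metrics behavior described above; the
// real handler in consul-k8s is structured differently.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

const (
	envoyMetricsAddr = "http://127.0.0.1:19000/stats/prometheus"
	// serviceMetricsAddr is an assumed placeholder for the service's own exporter.
	serviceMetricsAddr = "http://127.0.0.1:8080/metrics"
)

func mergedMetricsHandler(rw http.ResponseWriter, _ *http.Request) {
	envoyResp, err := http.Get(envoyMetricsAddr)
	if err != nil || envoyResp.StatusCode != http.StatusOK {
		// Envoy failure (error or non-200): return a 500 so Prometheus records
		// a failed scrape instead of a successful scrape with no metrics.
		http.Error(rw, "error scraping Envoy metrics", http.StatusInternalServerError)
		return
	}
	defer envoyResp.Body.Close()
	envoyMetrics, _ := io.ReadAll(envoyResp.Body)
	rw.Write(envoyMetrics)

	// Only include the service's body on a 2xx response; otherwise (e.g. a 404
	// because no exporter is configured) drop it so the output stays parseable.
	success := 0
	if svcResp, err := http.Get(serviceMetricsAddr); err == nil {
		defer svcResp.Body.Close()
		if svcResp.StatusCode >= 200 && svcResp.StatusCode < 300 {
			svcMetrics, _ := io.ReadAll(svcResp.Body)
			rw.Write(svcMetrics)
			success = 1
		}
	}
	// Append the success-indicator metric so users can alert on failed service scrapes.
	fmt.Fprintf(rw, "consul_metrics_merging_service_metrics_success %d\n", success)
}

func main() {
	// The listen address and path are assumed placeholders for this sketch.
	http.HandleFunc("/stats/prometheus", mergedMetricsHandler)
	log.Fatal(http.ListenAndServe(":20100", nil))
}
```
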
How I've tested this PR:

  • manually and unit tests

How I expect reviewers to test this PR:

  • code

Checklist:

  • Tests added
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

@lkysow (Member, Author) commented Jul 5, 2021

@ndhanushkodi, interested in your thoughts on this. (see also exporter guidelines: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes)

@ndhanushkodi (Contributor) left a comment

The behavior looks quite nice to me. Just thinking out loud, could it make sense to have a metric similar to consul_metrics_merging_service_metrics_success for Envoy? But I think what you're doing here makes more sense, so that the user sees the Prometheus scrape fail with a 500/404 from Envoy rather than a successful metrics scrape when something is definitely configured wrong.

I think all of the changes you've listed make sense.

```go
envoyMetricsAddr = "http://127.0.0.1:19000/stats/prometheus"
// prometheusServiceMetricsSuccessKey is the key of the prometheus metrics used to
// indicate if service metrics were scraped successfully.
prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
```
Contributor review comment on this change:
Suggested change:

```diff
- prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
+ prometheusServiceMetricsSuccessKey = "consul_merged_service_metrics_success"
```

This was just the only idea I could come up with to shorten the name of this metric and still be specific, feel free to take it or not.
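For reference, a minimal sketch of how a key like this might be emitted as the final line of the merged output. The package name and helper function are hypothetical; only the key name and the 0/1 values come from this PR.

```go
package metrics

import "fmt"

const prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"

// appendServiceMetricsSuccess is a hypothetical helper that appends the
// success-indicator metric line (1 on success, 0 on failure) to the merged
// metrics output.
func appendServiceMetricsSuccess(merged []byte, ok bool) []byte {
	val := 0
	if ok {
		val = 1
	}
	return append(merged, fmt.Sprintf("%s %d\n", prometheusServiceMetricsSuccessKey, val)...)
}
```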

@lkysow (Member, Author) commented Nov 23, 2021

> could it make sense to have a metric similar to consul_metrics_merging_service_metrics_success for the envoy

I think that metric now comes out of us returning a 500/200 depending on what happened. I'm sure there's a way to track that in Prometheus?
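For context, a failed scrape (such as the 500 above) causes Prometheus to set the target's `up` metric to 0, so an alert on `up == 0` for the merged-metrics target should cover the Envoy-failure case, while the new consul_metrics_merging_service_metrics_success metric covers the service side.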

@lkysow lkysow marked this pull request as ready for review November 23, 2021 23:01
@lkysow lkysow requested review from a team, kschoche and ishustava and removed request for a team and kschoche November 23, 2021 23:01
@ishustava (Contributor) left a comment
Looks good!

@lkysow (Member, Author) commented Dec 2, 2021

Will rebase the changelog after approval.

@ndhanushkodi (Contributor) left a comment

Looks great! Thanks for fixing this!!

@lkysow lkysow merged commit d08821b into main Dec 6, 2021
@lkysow lkysow deleted the lkysow/metrics-merging branch December 6, 2021 23:04