
receiver/prometheus: add "up" metric for instances #2918

Closed

Conversation

odeke-em
Member

@odeke-em odeke-em commented Apr 13, 2021

Make a receiver-specific view that will be registered and used to record the
"up" status as either "0.0" or "1.0" when an instance can't or can be
scraped, respectively.

This ensures that the collector can act as a passthrough for these statuses;
it currently outputs:

    # HELP up Whether the endpoint is alive or not
    # TYPE up gauge
    up{instance="0.0.0.0:8888"} 1
    up{instance="localhost:9999"} 0

I did not take the approach of plainly sending up-suffixed metric names and
then recommending that users apply relabelling inside the exporter itself, like:

    - source_labels: [__name__]
      regex: "(.+)_up"
      target_label: "__name__"
      replacement: "up"

because:

  • it would apply ConstLabels to every *_up metric, while we only want "instance=$INSTANCE"
  • other exporters would not be able to use the "up" metric as-is if we
    injected rewrites

Regardless of whether we used a label rewrite, the end result would be the
following:

up{instance="localhost:8888",job="otlc"}
up{exported_instance="0.0.0.0:9999",instance="localhost:8888",job="otlc"}
up{exported_instance="0.0.0.0:1234",instance="localhost:8888",job="otlc"}

which this change accomplishes without having to inject any label rewrites,
but simply through the new imports and the upgrade of the Prometheus
exporter.

Fixes open-telemetry/wg-prometheus#8
Requires census-ecosystem/opencensus-go-exporter-prometheus#24

@codecov

codecov bot commented Apr 13, 2021

Codecov Report

Merging #2918 (0b9bc14) into main (73b55f0) will decrease coverage by 0.37%.
The diff coverage is 100.00%.

❗ Current head 0b9bc14 differs from pull request most recent head 4462b2a. Consider uploading reports for the commit 4462b2a to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2918      +/-   ##
==========================================
- Coverage   92.06%   91.69%   -0.38%     
==========================================
  Files         313      313              
  Lines       15439    15362      -77     
==========================================
- Hits        14214    14086     -128     
- Misses        817      870      +53     
+ Partials      408      406       -2     
Impacted Files Coverage Δ
receiver/prometheusreceiver/metrics_receiver.go 69.69% <ø> (ø)
receiver/prometheusreceiver/internal/metrics.go 100.00% <100.00%> (ø)
...iver/prometheusreceiver/internal/metricsbuilder.go 100.00% <100.00%> (ø)
service/telemetry.go 85.45% <100.00%> (+0.26%) ⬆️
service/zpages.go 0.00% <0.00%> (-80.00%) ⬇️
internal/version/version.go 80.00% <0.00%> (-20.00%) ⬇️
extension/zpagesextension/zpagesextension.go 78.57% <0.00%> (-14.29%) ⬇️
internal/collector/telemetry/telemetry.go 94.44% <0.00%> (-5.56%) ⬇️
translator/trace/protospan_translation.go 92.46% <0.00%> (-4.91%) ⬇️
exporter/prometheusremotewriteexporter/exporter.go 89.33% <0.00%> (-1.21%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update faa78aa...4462b2a. Read the comment docs.

@odeke-em odeke-em force-pushed the receiver-prometheus-add-up-metric branch 3 times, most recently from 0c2524f to 0cc35c5 Compare April 13, 2021 07:09
@odeke-em odeke-em marked this pull request as ready for review April 13, 2021 07:09
@odeke-em odeke-em requested a review from a team as a code owner April 13, 2021 07:09
@odeke-em odeke-em force-pushed the receiver-prometheus-add-up-metric branch from 0cc35c5 to 1ffc5e3 Compare April 13, 2021 07:13
@odeke-em
Member Author

We shall now have:

[screenshot: Screen Shot 2021-04-13 at 12.14.17 AM]

Kindly cc-ing @Aneurysm9 @rakyll @alolita @bogdandrutu @brian-brazil

@odeke-em odeke-em force-pushed the receiver-prometheus-add-up-metric branch from 1ffc5e3 to 0b9bc14 Compare April 16, 2021 23:35
@rakyll
Contributor

rakyll commented Apr 20, 2021

Nice! Maybe you can update the description saying it's fixing open-telemetry/wg-prometheus#41.


var tagInstance, _ = tag.NewKey("instance")

var statUpStatus = stats.Int64("up", "Whether the endpoint is alive or not", stats.UnitDimensionless)
Contributor

Should this be part of the metrics returned by the prometheus receiver itself rather than a self-observability metric? I agree it could be implemented either as a self-obs metric or as an additional metric from the receiver, but the receiver approach has a few advantages:

  1. Users probably expect to see an 'up' metric when they make a pipeline with a prometheus receiver, and don't expect to have to do additional things.
  2. I probably want to apply the same set of transformations to the 'up' metric that I want to apply to the rest of my prometheus metrics, since they should have the same resource labels (e.g. instance, job).
  3. It will make it easier for us to pass prom compliance, since we won't need to route self-obs metrics to the PRW exporter.

Contributor

If this is a self-obs metric, will it still satisfy the prometheus compliance tests?

Member

@bogdandrutu bogdandrutu left a comment

Consider using the metrics package from OpenCensus instead of stats for this, if we only have one label, no other labels coming from tags, and no plans to change the encoding:
See https://github.com/census-instrumentation/opencensus-go/blob/master/metric/gauge.go
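
A minimal sketch of that suggestion, assuming the gauge API from go.opencensus.io/metric (the function names below are illustrative and not from this PR):

    package internal

    import (
        "go.opencensus.io/metric"
        "go.opencensus.io/metric/metricdata"
        "go.opencensus.io/metric/metricproducer"
    )

    // newUpGauge creates an "up" gauge with a single "instance" label and
    // registers its registry with the global producer manager so that any
    // metrics exporter can read it.
    func newUpGauge() (*metric.Float64Gauge, error) {
        r := metric.NewRegistry()
        metricproducer.GlobalManager().AddProducer(r)
        return r.AddFloat64Gauge("up",
            metric.WithDescription("Whether the endpoint is alive or not"),
            metric.WithLabelKeys("instance"))
    }

    // setUp sets the gauge entry for the given instance to 1 or 0.
    func setUp(g *metric.Float64Gauge, instance string, healthy bool) error {
        entry, err := g.GetEntry(metricdata.NewLabelValue(instance))
        if err != nil {
            return err
        }
        if healthy {
            entry.Set(1)
        } else {
            entry.Set(0)
        }
        return nil
    }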

@bogdandrutu
Member

@rakyll @odeke-em I am confused about this (I missed this initially). Should the "up" metric be emitted via the same pipeline as the other collected metrics, or is it a "collector" metric, i.e. exposed as part of the collector's internal telemetry?

@odeke-em
Member Author

odeke-em commented May 3, 2021

"up" should only be exported for Prometheus exporters, so essentially it'll need to be exported for the 3 different users:

  • exporter/prometheus
  • exporter/prometheusremotewrite
  • service/prometheus exporter

And not be exported to all the general pipelines. I have an unmailed change that allows exporter/prometheus and exporter/prometheusremotewrite to hook into receiver/prometheus so that gauges for "up" can be sent directly to the target exporters.

@bogdandrutu
Member

what is "service/prometheus exporter"?

@dashpole
Contributor

dashpole commented May 4, 2021

Why only for Prometheus exporters? For GKE's use case, in which we scrape Prometheus endpoints and send to Google Cloud Monitoring, we would still like to know when a Prometheus endpoint is available. IMO it would even be useful to replicate the "up" metric for other scrape-based endpoints, such as when we scrape the kubelet's stats/summary endpoint.

@vishiy

vishiy commented May 4, 2021

"up" should only be exported for Prometheus exporters, so essentially it'll need to be exported for the 3 different users:

  • exporter/prometheus
  • exporter/prometheusremotewrite
  • service/prometheus exporter

And not be exported to all the general pipelines. I have an unmailed change that allows exporter/prometheus and exporter/prometheusremotewrite to hook into receiver/prometheus so that gauges for "up" can be sent directly to the target exporters.

If the Prometheus receiver is a drop-in replacement for the Prometheus server, shouldn't this be emitted by the receiver (maybe optionally), rather than in the exporter(s)?

@odeke-em
Member Author

odeke-em commented May 4, 2021

what is "service/prometheus exporter"?

@bogdandrutu when the service pipeline is started, it emits telemetry metrics as per

    pe, err := prometheus.NewExporter(opts)
    if err != nil {
        return err
    }
    view.RegisterExporter(pe)
    logger.Info(
        "Serving Prometheus metrics",
        zap.String("address", metricsAddr),
        zap.Int8("level", int8(level)), // TODO: make it human friendly
        zap.String(conventions.AttributeServiceInstance, instanceID),
    )
    mux := http.NewServeMux()
    mux.Handle("/metrics", pe)

Why only for Prometheus exporters? For GKE's use case, in which we scrape Prometheus endpoints and send to Google Cloud Monitoring, we would still like to know when a Prometheus endpoint is available. IMO it would even be useful to replicate the "up" metric for other scrape-based endpoints, such as when we scrape the kubelet's stats/summary endpoint.

@dashpole, the task was to implement a pass-through for the Prometheus server, whereby the collector only exposes the "up" metric and delivers it to the Prometheus server. However, if you need all other exporters to receive the "up" metric, then that simplifies things even further and allows me to revert to prior versions.

If the Prometheus receiver is a drop-in replacement for the Prometheus server, shouldn't this be emitted by the receiver (maybe optionally), rather than in the exporter(s)?

@vishiy it is emitted by the Prometheus receiver, but consumed and relayed to the Prometheus exporters -- we are implementing a pass-through. The Prometheus server doesn't typically expose this metric for scraping either; it uses it internally.

However, you raise a great point that the Prometheus receiver should just generate the "up" metric for all of them.

@@ -93,29 +93,40 @@ func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error
		b.numTimeseries++
		b.droppedTimeseries++
		return errMetricNameNotFound

Contributor

What would happen if we had a separate case for metricName "up" here, in which we didn't return (which filters out the "up" metric)? That seems simpler than filtering it out here, and adding a new internal metric above.
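
For example, a hypothetical (and heavily simplified) fragment of AddDataPoint along those lines; the surrounding logic of the real metricsbuilder.go is elided here:

    func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error {
        metricName := ls.Get(model.MetricNameLabel)
        switch metricName {
        case "":
            b.numTimeseries++
            b.droppedTimeseries++
            return errMetricNameNotFound
        case "up":
            // Deliberately no early return: execution continues after the
            // switch, so "up" is handled like any other scraped series and
            // flows down the regular metrics pipeline instead of being
            // diverted into a separate internal metric.
        }
        // ... existing data-point handling continues here ...
        return nil
    }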

@gillg
Copy link
Contributor

gillg commented May 6, 2021

Hello, I started debugging this too, coming from my initial issue. I can contribute on your PR; my approach is a little more global and also fixes the internal scrape_ metrics. Don't merge it too quickly! ^^
I will probably need help to fix one remaining thing.

@gillg
Contributor

gillg commented May 6, 2021

Other approach here: #3116, but I need some review/help to finish it cleanly. That approach avoids recording metrics manually, avoids introducing new concepts, and works for all internal metrics (not only up).

@alolita
Member

alolita commented May 12, 2021

@odeke-em what's the next step here? What do you suggest?

@gillg
Contributor

gillg commented May 12, 2021

For information, I don't understand why the tests are not breaking here. From what I experimented with, at least the Prometheus exporter tests should break, because if this works they should receive a new, unexpected metric in their test cases.
And all internal tests that count metrics should break too, due to the new number of metrics.
I have the feeling that something is wrong.


@github-actions
Contributor

github-actions bot commented Jun 1, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 1, 2021
@odeke-em
Member Author

odeke-em commented Jun 2, 2021

Superseded.


Successfully merging this pull request may close these issues.

Support "up" metric for Prometheus
7 participants