
receiver/prometheus: add "up" metric for instances #2918

Closed

Conversation

odeke-em
Member

@odeke-em odeke-em commented Apr 13, 2021

Make a receiver-specific view that will be registered and used to record the
"up" status as either "0.0" or "1.0" when an instance can't or can be
scraped, respectively.

This ensures that the collector can act as a passthrough for these statuses;
it currently outputs:

    # HELP up Whether the endpoint is alive or not
    # TYPE up gauge
    up{instance="0.0.0.0:8888"} 1
    up{instance="localhost:9999"} 0

I did not take the approach of plainly sending up-suffixed metric names and
then recommending that users apply relabelling inside the exporter itself, like:

    - source_labels: [__name__]
      regex: "(.+)_up"
      target_label: "__name__"
      replacement: "up"

because:

  • it would apply ConstLabels to every *_up metric, while we only want "instance=$INSTANCE"
  • other exporters would not be able to use the "up" metric as-is if we
    injected rewrites

Regardless of whether we used a label rewrite, the end result would be the
following:

up{instance="localhost:8888",job="otlc"}
up{exported_instance="0.0.0.0:9999",instance="localhost:8888",job="otlc"}
up{exported_instance="0.0.0.0:1234",instance="localhost:8888",job="otlc"}

which this change accomplishes without having to inject any label rewrites,
but simply through the new imports and the upgrade of the Prometheus
exporter.

Fixes open-telemetry/wg-prometheus#8
Requires census-ecosystem/opencensus-go-exporter-prometheus#24

@codecov

codecov bot commented Apr 13, 2021

Codecov Report

Merging #2918 (0b9bc14) into main (73b55f0) will decrease coverage by 0.37%.
The diff coverage is 100.00%.

❗ Current head 0b9bc14 differs from pull request most recent head 4462b2a. Consider uploading reports for the commit 4462b2a to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2918      +/-   ##
==========================================
- Coverage   92.06%   91.69%   -0.38%     
==========================================
  Files         313      313              
  Lines       15439    15362      -77     
==========================================
- Hits        14214    14086     -128     
- Misses        817      870      +53     
+ Partials      408      406       -2     
Impacted Files Coverage Δ
receiver/prometheusreceiver/metrics_receiver.go 69.69% <ø> (ø)
receiver/prometheusreceiver/internal/metrics.go 100.00% <100.00%> (ø)
...iver/prometheusreceiver/internal/metricsbuilder.go 100.00% <100.00%> (ø)
service/telemetry.go 85.45% <100.00%> (+0.26%) ⬆️
service/zpages.go 0.00% <0.00%> (-80.00%) ⬇️
internal/version/version.go 80.00% <0.00%> (-20.00%) ⬇️
extension/zpagesextension/zpagesextension.go 78.57% <0.00%> (-14.29%) ⬇️
internal/collector/telemetry/telemetry.go 94.44% <0.00%> (-5.56%) ⬇️
translator/trace/protospan_translation.go 92.46% <0.00%> (-4.91%) ⬇️
exporter/prometheusremotewriteexporter/exporter.go 89.33% <0.00%> (-1.21%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update faa78aa...4462b2a. Read the comment docs.

@odeke-em odeke-em force-pushed the receiver-prometheus-add-up-metric branch 3 times, most recently from 0c2524f to 0cc35c5 Compare April 13, 2021 07:09
@odeke-em odeke-em marked this pull request as ready for review April 13, 2021 07:09
@odeke-em odeke-em requested a review from a team as a code owner April 13, 2021 07:09
@odeke-em odeke-em force-pushed the receiver-prometheus-add-up-metric branch from 0cc35c5 to 1ffc5e3 Compare April 13, 2021 07:13
@odeke-em
Member Author

We shall now have:

[screenshot: Screen Shot 2021-04-13 at 12.14.17 AM]

Kindly cc-ing @Aneurysm9 @rakyll @alolita @bogdandrutu @brian-brazil

@odeke-em odeke-em force-pushed the receiver-prometheus-add-up-metric branch from 1ffc5e3 to 0b9bc14 Compare April 16, 2021 23:35
@rakyll
Contributor

rakyll commented Apr 20, 2021

Nice! Maybe you can update the description saying it's fixing open-telemetry/wg-prometheus#41.


var tagInstance, _ = tag.NewKey("instance")

var statUpStatus = stats.Int64("up", "Whether the endpoint is alive or not", stats.UnitDimensionless)
Contributor

Should this be part of the metrics returned by the prometheus receiver itself rather than a self-observability metric? I agree it could be implemented either as a self-obs metric or as an additional metric from the receiver, but the receiver approach has a few advantages:

  1. Users probably expect to see an 'up' metric when they make a pipeline with a prometheus receiver, and don't expect to have to do additional things.
  2. I probably want to apply the same set of transformations to the 'up' metric that I want to apply to the rest of my prometheus metrics, since they should have the same resource labels (e.g. instance, job).
  3. It will make it easier for us to pass prom compliance, since we won't need to route self-obs metrics to the PRW exporter.

Contributor

If this is a self-obs metric, will it still satisfy the prometheus compliance tests?

Member

@bogdandrutu bogdandrutu left a comment

Consider using the metrics package from OpenCensus instead of stats for this, if we only have one label, no other labels coming from tags, and no plans to change the encoding:
See https://github.com/census-instrumentation/opencensus-go/blob/master/metric/gauge.go
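
A minimal sketch of that suggestion, assuming the gauge API from go.opencensus.io/metric (the function names below are illustrative and not from this PR):

    package internal

    import (
        "go.opencensus.io/metric"
        "go.opencensus.io/metric/metricdata"
        "go.opencensus.io/metric/metricproducer"
    )

    // newUpGauge creates an "up" gauge with a single "instance" label and
    // registers its registry with the global producer manager so that any
    // metrics exporter can read it.
    func newUpGauge() (*metric.Float64Gauge, error) {
        r := metric.NewRegistry()
        metricproducer.GlobalManager().AddProducer(r)
        return r.AddFloat64Gauge("up",
            metric.WithDescription("Whether the endpoint is alive or not"),
            metric.WithLabelKeys("instance"))
    }

    // setUp sets the gauge entry for the given instance to 1 or 0.
    func setUp(g *metric.Float64Gauge, instance string, healthy bool) error {
        entry, err := g.GetEntry(metricdata.NewLabelValue(instance))
        if err != nil {
            return err
        }
        if healthy {
            entry.Set(1)
        } else {
            entry.Set(0)
        }
        return nil
    }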

@bogdandrutu
Member

@rakyll @odeke-em I am confused about this (I missed this initially). Should the "up" metric be emitted via the same pipeline as the other collected metrics, or is it a "collector" metric, i.e. exposed as part of the collector's internal telemetry?

@odeke-em
Member Author

odeke-em commented May 3, 2021

"up" should only be exported for Prometheus exporters, so essentially it'll need to be exported for the 3 different users:

  • exporter/prometheus
  • exporter/prometheusremotewrite
  • service/prometheus exporter

And not be exported to all the general pipelines. I have an unmailed change that allows exporter/prometheus and exporter/prometheusremotewrite to hook into receiver/prometheus so that gauges for "up" can be sent directly to the target exporters.

@bogdandrutu
Member

what is "service/prometheus exporter"?

@dashpole
Contributor

dashpole commented May 4, 2021

Why only for Prometheus exporters? For GKE's use case, in which we scrape Prometheus endpoints and send to Google Cloud Monitoring, we would still like to know when a Prometheus endpoint is available. IMO it would even be useful to replicate the "up" metric for other scrape-based endpoints, such as when we scrape the kubelet's stats/summary endpoint.

@vishiy

vishiy commented May 4, 2021

"up" should only be exported for Prometheus exporters, so essentially it'll need to be exported for the 3 different users:

  • exporter/prometheus
  • exporter/prometheusremotewrite
  • service/prometheus exporter

And not be exported to all the general pipelines. I have an unmailed change that allows exporter/prometheus and exporter/prometheusremotewrite to hook into receiver/prometheus so that gauges for "up" can be sent directly to the target exporters.

If the Prometheus receiver is a drop-in replacement for the Prometheus server, shouldn't this be emitted by the receiver (maybe optionally), rather than in the exporter(s)?

@odeke-em
Member Author

odeke-em commented May 4, 2021

what is "service/prometheus exporter"?

@bogdandrutu when the service pipeline is started, it emits telemetry metrics as per

    pe, err := prometheus.NewExporter(opts)
    if err != nil {
        return err
    }
    view.RegisterExporter(pe)
    logger.Info(
        "Serving Prometheus metrics",
        zap.String("address", metricsAddr),
        zap.Int8("level", int8(level)), // TODO: make it human friendly
        zap.String(conventions.AttributeServiceInstance, instanceID),
    )
    mux := http.NewServeMux()
    mux.Handle("/metrics", pe)

Why only for Prometheus exporters? For GKE's use case, in which we scrape Prometheus endpoints and send to Google Cloud Monitoring, we would still like to know when a Prometheus endpoint is available. IMO it would even be useful to replicate the "up" metric for other scrape-based endpoints, such as when we scrape the kubelet's stats/summary endpoint.

@dashpole, the task was to implement a pass-through for the Prometheus server, whereby the collector only exposes the "up" metric and delivers it to the Prometheus server. However, if you need all other exporters to receive the "up" metric, then that simplifies things even further and allows me to revert to prior versions.

If the Prometheus receiver is a drop-in replacement for the Prometheus server, shouldn't this be emitted by the receiver (maybe optionally), rather than in the exporter(s)?

@vishiy it is emitted by the Prometheus receiver, but consumed and relayed to the Prometheus exporters -- we are implementing a pass-through. The Prometheus server doesn't typically expose this metric for scraping either; it uses it internally.

However, you raise a great point that the Prometheus receiver should just generate the "up" metric for all of them.

@@ -93,29 +93,40 @@ func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error
		b.numTimeseries++
		b.droppedTimeseries++
		return errMetricNameNotFound

Contributor

What would happen if we had a separate case for metricName "up" here, in which we didn't return (which filters out the "up" metric)? That seems simpler than filtering it out here, and adding a new internal metric above.
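
For example, a hypothetical (and heavily simplified) fragment of AddDataPoint along those lines; the surrounding logic of the real metricsbuilder.go is elided here:

    func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error {
        metricName := ls.Get(model.MetricNameLabel)
        switch metricName {
        case "":
            b.numTimeseries++
            b.droppedTimeseries++
            return errMetricNameNotFound
        case "up":
            // Deliberately no early return: execution continues after the
            // switch, so "up" is handled like any other scraped series and
            // flows down the regular metrics pipeline instead of being
            // diverted into a separate internal metric.
        }
        // ... existing data-point handling continues here ...
        return nil
    }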

@gillg
Copy link
Contributor

gillg commented May 6, 2021

Hello, I started debugging this too, coming from my initial issue. I can contribute on your PR; my approach is a little more global and also fixes the internal scrape_ metrics. Don't merge it too quickly! ^^
I will probably need help to fix one remaining thing.

@gillg
Contributor

gillg commented May 6, 2021

Other approach here: #3116, but I need some review/help to finish it cleanly. That approach avoids recording metrics manually, avoids introducing new concepts, and works for all internal metrics (not only up).

@alolita
Member

alolita commented May 12, 2021

@odeke-em what's the next step here? What do you suggest?

@gillg
Contributor

gillg commented May 12, 2021

For information, I don't understand why the tests are not breaking here. From what I experimented with, at least the Prometheus exporter tests should break, because if this works they should receive a new, unexpected metric in their test cases.
And all internal tests that count metrics should break too, due to the new number of metrics.
I have the feeling that something is wrong.


@github-actions
Contributor

github-actions bot commented Jun 1, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 1, 2021
@odeke-em
Member Author

odeke-em commented Jun 2, 2021

Superseded.


Successfully merging this pull request may close these issues.

Support "up" metric for Prometheus
7 participants