
Log when dropping metrics due to missing process_start_time_seconds #1921

Merged: 3 commits into open-telemetry:master on Oct 14, 2020
Conversation

@nilebox (Member) commented Oct 8, 2020

Description:
- Log a message when metrics are dropped by the Prometheus receiver due to a missing process_start_time_seconds metric.
- Report the error via obsreport.EndMetricsReceiveOp and return the error from the transaction's Commit().

Link to tracking Issue: Fixes #969
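For illustration only, here is a minimal self-contained sketch of the behaviour described above, where Commit() drops the scraped batch and returns an error when start-time adjustment is requested but process_start_time_seconds was not scraped. The transaction struct and the errNoStartTimeMetrics name are simplified stand-ins, not the merged code:

package main

import (
	"errors"
	"fmt"
)

// Assumed error name; the merged change may use a different variable.
var errNoStartTimeMetrics = errors.New("process_start_time_seconds metric is missing")

// transaction is a stripped-down stand-in for the receiver's transaction type,
// with just enough fields to show the Commit() behaviour described in this PR.
type transaction struct {
	useStartTimeMetric bool
	startTime          float64 // value of process_start_time_seconds; 0 if it was not scraped
}

// Commit drops the scraped batch and returns an error when start-time
// adjustment is requested but the start-time metric is missing.
func (tr *transaction) Commit() error {
	if tr.useStartTimeMetric && tr.startTime == 0.0 {
		// In the receiver this error is also passed to
		// obsreport.EndMetricsReceiveOp so the drop shows up in internal metrics.
		return errNoStartTimeMetrics
	}
	// ... otherwise adjust start times and forward the metrics ...
	return nil
}

func main() {
	tr := &transaction{useStartTimeMetric: true}
	if err := tr.Commit(); err != nil {
		fmt.Println("scrape dropped:", err)
	}
}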

@nilebox (Member Author) commented Oct 8, 2020

/cc @james-bebbington @rf232

codecov bot commented Oct 8, 2020

Codecov Report

Merging #1921 into master will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1921      +/-   ##
==========================================
- Coverage   91.36%   91.35%   -0.02%     
==========================================
  Files         280      280              
  Lines       16640    16641       +1     
==========================================
- Hits        15203    15202       -1     
- Misses       1006     1007       +1     
- Partials      431      432       +1     
Impacted Files                                          Coverage Δ
receiver/prometheusreceiver/internal/transaction.go     95.40% <100.00%> (+0.05%) ⬆️
translator/internaldata/resource_to_oc.go               89.04% <0.00%> (-2.74%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68085ca...abb50a3.

@@ -159,6 +159,12 @@ func (tr *transaction) Commit() error {
	if tr.useStartTimeMetric {
		// AdjustStartTime - startTime has to be non-zero in this case.
		if tr.metricBuilder.startTime == 0.0 {
			// Unable to adjust start time because of missing start time metric
			tr.logger.Info(
Member:

Should this be a warning?

Member Author (@nilebox):

I considered that initially, but since it's not an issue with the collector or its config, but rather with the target applications, Info seems more appropriate?

i.e. the collector itself performs correctly; it just informs us that some applications are misconfigured.

@serathius:

I would consider it a warning, as it informs about degraded behavior and there is a user action that can be taken to resolve it.

Member Author (@nilebox):

> We should not log on bad input

@tigrannajaryan could you suggest an alternative solution then?
As described in #1921, we currently don't have any visibility into these metrics being dropped or why they are being dropped. Different people at Google have run into this issue in the last few months and spent hours or days debugging it.
Having a log message would help a lot.

As @serathius pointed out above, in most cases this does require human intervention: either fixing the application or changing the collector config.

Member Author (@nilebox) Oct 9, 2020:

FYI the error returned here is handled by Prometheus code, which will log it (but won't include the Prometheus job and instance): https://github.com/prometheus/prometheus/blob/3240cf83f08e448e0b96a4a1f96c0e8b2d51cf61/scrape/scrape.go#L1074-L1077

Member Author (@nilebox):

Actually the logger does contain the Prometheus target: https://github.com/prometheus/prometheus/blob/3240cf83f08e448e0b96a4a1f96c0e8b2d51cf61/scrape/scrape.go#L259

so the extra log message is redundant; I will remove it from here.

Member Author (@nilebox):

Done.

Member:

LGTM

Member (@tigrannajaryan):

Sorry for the late reply, I was away for a few days.

IMO, the right approach is to record the failures in an internal metric. The guidelines mention it:

For such high-frequency events instead of logging consider adding an internal metric and increment it when the event happens.

I think obsreport.EndMetricsReceiveOp should do that.

If you want to also log the failure then I believe it is better to use logger.Debug() so that it is not enabled by default. Another alternative, if it must have more visibility, is to log an error once and clearly indicate in the error message that it will only be logged once. A third alternative is to use log rate limiting; the zap logger seems to support it (I haven't tried it).
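If a follow-up goes the rate-limiting route, a rough sketch using zap's sampler core might look like the following (assuming go.uber.org/zap v1.12 or later; the one-minute window and the 1/100 first/thereafter counts are arbitrary illustration values, not something from this PR):

package main

import (
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	base, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer base.Sync()

	// Wrap the logger's core in a sampler: within each one-minute window, keep
	// the first occurrence of a given message and then only every 100th repeat.
	sampled := base.WithOptions(zap.WrapCore(func(core zapcore.Core) zapcore.Core {
		return zapcore.NewSamplerWithOptions(core, time.Minute, 1, 100)
	}))

	// High-frequency event: most of these calls are dropped by the sampler.
	for i := 0; i < 1000; i++ {
		sampled.Warn("dropping scraped metrics: process_start_time_seconds is missing")
	}
}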

receiver/prometheusreceiver/internal/transaction.go (outdated review thread, resolved)
@nilebox (Member Author) commented Oct 8, 2020

The contrib-test failure doesn't seem related to this change:

go test ./... in ./exporter/honeycombexporter
--- FAIL: TestSampleRateAttribute (0.00s)
    honeycomb_test.go:406

@tigrannajaryan self-assigned this Oct 8, 2020
@tigrannajaryan merged commit 4359f40 into open-telemetry:master on Oct 14, 2020
@tigrannajaryan (Member):

@nilebox I merged the PR, please feel free to submit a follow-up PR if you want to introduce debug or rate-limited logging.

@nilebox deleted the prometheus-log-start-time branch on October 14, 2020 20:01
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021
* Added Reason to Contributing and Updated TracerConfig

* PR comment fixup

* Changed how span Options work.

* Fix Markdown linting

* Added meter configs.

* Fixes from PR comments

* fix for missing instrument

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this pull request Apr 27, 2023
* Update recommended installation methods

* Update internal/buildscripts/packaging/installer/install.sh

Co-authored-by: Ryan Fitzpatrick <rmfitzpatrick@users.noreply.github.com>

Co-authored-by: Ryan Fitzpatrick <rmfitzpatrick@users.noreply.github.com>
Labels: none yet
Projects: none yet
Development

Successfully merging this pull request may close these issues.

Prometheus receiver: log error message when process_start_time_seconds gauge is missing
4 participants