
Log when dropping metrics due to missing process_start_time_seconds #1921

Merged: 3 commits into open-telemetry:master on Oct 14, 2020
Conversation

@nilebox (Member) commented Oct 8, 2020

Description:
- Log a message when metrics are dropped by the Prometheus receiver due to a missing process_start_time_seconds metric.
- Report the error via obsreport.EndMetricsReceiveOp and return the error from the transaction's Commit().

Link to tracking Issue: Fixes #969
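For illustration only, here is a minimal self-contained sketch of the behaviour described above, where Commit() drops the scraped batch and returns an error when start-time adjustment is requested but process_start_time_seconds was not scraped. The transaction struct and the errNoStartTimeMetrics name are simplified stand-ins, not the merged code:

package main

import (
	"errors"
	"fmt"
)

// Assumed error name; the merged change may use a different variable.
var errNoStartTimeMetrics = errors.New("process_start_time_seconds metric is missing")

// transaction is a stripped-down stand-in for the receiver's transaction type,
// with just enough fields to show the Commit() behaviour described in this PR.
type transaction struct {
	useStartTimeMetric bool
	startTime          float64 // value of process_start_time_seconds; 0 if it was not scraped
}

// Commit drops the scraped batch and returns an error when start-time
// adjustment is requested but the start-time metric is missing.
func (tr *transaction) Commit() error {
	if tr.useStartTimeMetric && tr.startTime == 0.0 {
		// In the receiver this error is also passed to
		// obsreport.EndMetricsReceiveOp so the drop shows up in internal metrics.
		return errNoStartTimeMetrics
	}
	// ... otherwise adjust start times and forward the metrics ...
	return nil
}

func main() {
	tr := &transaction{useStartTimeMetric: true}
	if err := tr.Commit(); err != nil {
		fmt.Println("scrape dropped:", err)
	}
}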

@nilebox (Member Author) commented Oct 8, 2020

/cc @james-bebbington @rf232

codecov bot commented Oct 8, 2020

Codecov Report

Merging #1921 into master will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1921      +/-   ##
==========================================
- Coverage   91.36%   91.35%   -0.02%     
==========================================
  Files         280      280              
  Lines       16640    16641       +1     
==========================================
- Hits        15203    15202       -1     
- Misses       1006     1007       +1     
- Partials      431      432       +1     
Impacted Files                                          Coverage Δ
receiver/prometheusreceiver/internal/transaction.go     95.40% <100.00%> (+0.05%) ⬆️
translator/internaldata/resource_to_oc.go               89.04% <0.00%> (-2.74%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68085ca...abb50a3.

@@ -159,6 +159,12 @@ func (tr *transaction) Commit() error {
	if tr.useStartTimeMetric {
		// AdjustStartTime - startTime has to be non-zero in this case.
		if tr.metricBuilder.startTime == 0.0 {
			// Unable to adjust start time because of missing start time metric
			tr.logger.Info(
Member:

Should this be a warning?

Member Author (@nilebox):

I considered that initially, but since it's not an issue with the collector or its config, but rather with the target applications, Info seems more appropriate?

i.e. the collector itself performs correctly; it just informs us that some applications are misconfigured.

@serathius:

I would consider it a warning, as it informs about degraded behavior and there is a user action that can be taken to resolve it.

Member Author (@nilebox):

> We should not log on bad input

@tigrannajaryan could you suggest an alternative solution then?
As described in #1921, we currently don't have any visibility into these metrics being dropped or why they are being dropped. Different people at Google have run into this issue in the last few months and spent hours or days debugging it.
Having a log message would help a lot.

As @serathius pointed out above, in most cases this does require human intervention: either fixing the application or changing the collector config.

Member Author (@nilebox) Oct 9, 2020:

FYI the error returned here is handled by Prometheus code, which will log it (but won't include the Prometheus job and instance): https://github.com/prometheus/prometheus/blob/3240cf83f08e448e0b96a4a1f96c0e8b2d51cf61/scrape/scrape.go#L1074-L1077

Member Author (@nilebox):

Actually the logger does contain the Prometheus target: https://github.com/prometheus/prometheus/blob/3240cf83f08e448e0b96a4a1f96c0e8b2d51cf61/scrape/scrape.go#L259

so the extra log message is redundant; I will remove it from here.

Member Author (@nilebox):

Done.

Member:

LGTM

Member (@tigrannajaryan):

Sorry for the late reply, I was away for a few days.

IMO, the right approach is to record the failures in an internal metric. The guidelines mention it:

For such high-frequency events instead of logging consider adding an internal metric and increment it when the event happens.

I think obsreport.EndMetricsReceiveOp should do that.

If you want to also log the failure then I believe it is better to use logger.Debug() so that it is not enabled by default. Another alternative, if it must have more visibility, is to log an error once and clearly indicate in the error message that it will only be logged once. A third alternative is to use log rate limiting; the zap logger seems to support it (I haven't tried it).
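If a follow-up goes the rate-limiting route, a rough sketch using zap's sampler core might look like the following (assuming go.uber.org/zap v1.12 or later; the one-minute window and the 1/100 first/thereafter counts are arbitrary illustration values, not something from this PR):

package main

import (
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	base, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer base.Sync()

	// Wrap the logger's core in a sampler: within each one-minute window, keep
	// the first occurrence of a given message and then only every 100th repeat.
	sampled := base.WithOptions(zap.WrapCore(func(core zapcore.Core) zapcore.Core {
		return zapcore.NewSamplerWithOptions(core, time.Minute, 1, 100)
	}))

	// High-frequency event: most of these calls are dropped by the sampler.
	for i := 0; i < 1000; i++ {
		sampled.Warn("dropping scraped metrics: process_start_time_seconds is missing")
	}
}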

receiver/prometheusreceiver/internal/transaction.go (outdated review thread, resolved)
@nilebox (Member Author) commented Oct 8, 2020

The contrib-test failure doesn't seem related to this change:

go test ./... in ./exporter/honeycombexporter
--- FAIL: TestSampleRateAttribute (0.00s)
    honeycomb_test.go:406

@tigrannajaryan self-assigned this Oct 8, 2020
@tigrannajaryan merged commit 4359f40 into open-telemetry:master on Oct 14, 2020
@tigrannajaryan (Member):

@nilebox I merged the PR, please feel free to submit a follow-up PR if you want to introduce debug or rate-limited logging.

@nilebox deleted the prometheus-log-start-time branch on October 14, 2020 20:01
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021
* Added Reason to Contributing and Updated TracerConfig

* PR comment fixup

* Changed how span Options work.

* Fix Markdown linting

* Added meter configs.

* Fixes from PR comments

* fix for missing instrument

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this pull request Apr 27, 2023
* Update recommended installation methods

* Update internal/buildscripts/packaging/installer/install.sh

Co-authored-by: Ryan Fitzpatrick <rmfitzpatrick@users.noreply.github.com>

Co-authored-by: Ryan Fitzpatrick <rmfitzpatrick@users.noreply.github.com>
Labels: none yet
Projects: none yet
Development

Successfully merging this pull request may close these issues.

Prometheus receiver: log error message when process_start_time_seconds gauge is missing
4 participants