
[otlp] Fix panic in dropped count (again!) #3538

Merged
merged 1 commit from fix-otel-panics-once-and-forall into main on Nov 28, 2022

Conversation

gouthamve
Member

@gouthamve gouthamve commented Nov 28, 2022

This doesn't accurately count the dropped samples. For example, if a single metric with multiple samples is faulty, we get a single error rather than an error per sample.

But I believe it's the best best-effort measurement.

Before, we used to compute `DatapointCount() - samplesInMap()`.

The problem is the following:

  1. target_info is a synthetic metric added in Prometheus, so the final sample count could be higher.
  2. A single histogram datapoint in OTLP corresponds to many samples in Prometheus.
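
To make the arithmetic concrete, here is a minimal, self-contained Go sketch (all names and numbers are illustrative, not Mimir's actual code) of why the old subtraction can go negative and why counting conversion errors instead stays non-negative:

```go
// Hypothetical sketch of the dropped-sample accounting problem described
// above; names and numbers are illustrative, not Mimir's actual code.
package main

import "fmt"

func main() {
	// Suppose an OTLP write request carries 2 datapoints:
	// one gauge datapoint and one histogram datapoint.
	datapointCount := 2

	// After conversion to Prometheus samples:
	//   - the histogram datapoint expands into several series
	//     (sum, count, and one series per bucket), and
	//   - a synthetic target_info sample is appended.
	// So the converted sample count can exceed datapointCount.
	convertedSamples := 1 /* gauge */ + 6 /* histogram expansion */ + 1 /* target_info */

	// Old accounting: dropped = datapoints - samples. This goes negative
	// here, which is what ultimately triggered the panic.
	oldDropped := datapointCount - convertedSamples
	fmt.Println("old dropped count:", oldDropped) // -6

	// Best-effort accounting: count one drop per conversion error instead.
	// If a whole metric with many datapoints fails, this still counts a
	// single error, which is the inaccuracy acknowledged above.
	conversionErrors := 0 // e.g. incremented inside the conversion loop
	fmt.Println("new dropped count:", conversionErrors)
}
```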

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@gouthamve gouthamve requested a review from a team as a code owner November 28, 2022 12:44
@gouthamve gouthamve force-pushed the fix-otel-panics-once-and-forall branch from f3c0849 to f2fe3f4 on November 28, 2022 at 12:45
@gouthamve
Member Author

cc @pstibrany @replay

@gouthamve gouthamve force-pushed the fix-otel-panics-once-and-forall branch from f2fe3f4 to 46f0333 on November 28, 2022 at 12:50
@gouthamve gouthamve force-pushed the fix-otel-panics-once-and-forall branch from 46f0333 to 03cbd10 on November 28, 2022 at 12:57
@replay replay mentioned this pull request Nov 28, 2022
Contributor

@replay replay left a comment


🚀

Member

@pstibrany pstibrany left a comment


Code change looks good to me.

It's a little strange to me that the tenant check is only done in case of errors. Should we do it before we even try to convert metrics?

The added unit tests don't seem to be checking for dropped metrics at all. In other words, they seem unrelated to the PR. 🤔

@gouthamve
Member Author

> It's a little strange to me that the tenant check is only done in case of errors. Should we do it before we even try to convert metrics?

It is on an authenticated route, so the check essentially NEVER fails. It only exists to load the userID value.

> The added unit tests don't seem to be checking for dropped metrics at all. In other words, they seem unrelated to the PR. 🤔

I first wrote the unit tests to reproduce the panic and then proceeded to fix it, which is why the unit tests seem unrelated.

@replay replay merged commit 2ad7eb4 into main Nov 28, 2022
@replay replay deleted the fix-otel-panics-once-and-forall branch November 28, 2022 18:44
replay pushed a commit that referenced this pull request Nov 28, 2022
replay added a commit that referenced this pull request Nov 29, 2022
grafanabot pushed a commit that referenced this pull request Nov 29, 2022
replay pushed a commit that referenced this pull request Nov 29, 2022
stevesg pushed a commit that referenced this pull request Nov 30, 2022
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022