Collector docs on single-writer principle #4433
Conversation
@open-telemetry/collector-approvers ptal
There is a gateway deployment configured to handle all traffic for three other collectors in the same system.
If the collectors are not uniquely identified and the SDK fails to distinguish between them, they may
send identical data to the gateway collector from different points in time. In this scenario,
Can you give a more concrete example here? Having multiple instances of a collector behind a load balancer is certainly common practice, and there's no inherent problem in having an SDK send data via this load balancer, causing different data points for the same workload to land at different collector instances.
There are a few situations that need to be accounted for when scaling, like using the target allocator for pull-based scraping (nothing to do with OTLP, though), or tail-sampling (due to the stateful nature of that component).
Yeah, what I have here isn't really specific enough. I can certainly provide an example here.
All metric data streams within OTLP must have a [single writer](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer)
This is because gateway collector deployments can involve multiple collectors in the same system.
It is possible in this case for instances to receive and process data from the same resources,
which is a violation of the Single-Writer principle. A potential consequence of this may be that
I'm not sure I agree with this. Here's what the link above says:
All metric data streams within OTLP MUST have one logical writer. This means, conceptually, that any Timeseries created from the Protocol MUST have one originating source of truth. In practical terms, this implies the following:
- All metric data streams produced by OTel SDKs SHOULD have globally unique identity at any given point in time. Metric identity is defined above.
- Aggregations of metric streams MUST only be written from a single logical source at any given point time. Note: This implies aggregated metric streams must reach one destination.
Multiple instances of the collector can certainly receive and process different OTLP requests for the same resource without problems.
In the context of the collector, the single-writer principle is relevant for receivers that create metrics on their own, such as scraping receivers or things like the host metrics receiver. In that case, there should really be only one "host metrics receiver" instance per host.
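As a rough illustration (the interval, scrapers, and backend endpoint below are placeholder choices, not from this PR), a minimal collector configuration built around the host metrics receiver might look like this; the single-writer principle then implies running exactly one such instance per host, for example as a Kubernetes DaemonSet:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s # arbitrary example interval
    scrapers:
      cpu:
      memory:

exporters:
  otlp:
    endpoint: backend:4317 # placeholder backend address
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
```

Running two copies of this config on the same host would produce two writers for the same metric streams, which is exactly the situation the principle rules out.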
Ok, makes sense. This is a much more particular problem than my framing would suggest.
In hindsight,
It is possible in this case for instances to receive and process data from the same resources,
which is a violation of the Single-Writer principle. A potential consequence of this may be that
is not really correct. That's far too general, especially considering the case you've highlighted.
Thanks for the feedback, I will get to revising this.
There are patterns in the data that may provide some insight into whether this is happening or not.
For example, upon visual inspection, unexplained gaps or jumps in the same series may be a clue that
multiple collectors are sending the same samples. Unexplained behavior in a time series could potentially
point to the backend scraping data from multiple sources.
Another common way to find this out is when the backend complains about "out of order samples": if a data point for the state of a counter at T2 was received, and later a data point for the same counter at T1 arrives, the backend might discard the late data point.
inconsistent data or data loss. Collisions resulting from inconsistent timestamps may lead to an unstable or inconsistent
representation of metrics, such as CPU usage.
### Scaling considerations |
This section doesn't seem like it belongs in deployment, I don't think, actually.
I'm wondering if it's even necessary to mention.
This information is basically already covered in Scaling the Collector,
and I'm wondering if it's redundant, or whether that document should instead include some additional mention of the single-writer principle.
Scaling collectors is another way of saying that we need multiple collectors, so this page is bound to repeat some/much of what's in the scaling page. How about we just add a "single-writer principle" section to that one instead?
In a system with multiple collectors, the single-writer principle is most relevant for receivers that generate their
own metrics, such as scraping receivers or the host metrics receiver.
This is most important for receivers that create their own metrics, such as pull based scraping
that targets a specific metric source.
I believe the last two statements are saying the same thing. Perhaps complement the first with extra info from the second?
Suggested change:
that targets a specific metric source.
In a system with multiple collectors, the single-writer principle is most relevant for receivers that generate their
own metrics or target a specific metric source, such as scraping receivers, host metrics receiver, kubelet stats receiver, and similar.
content/en/docs/collector/scaling.md (Outdated)
Partial or incomplete traces may be consequential for an implementation of
tail-sampling, as the goal is to capture all or most of the spans within the
trace in order to inform sampling decisions. When using the target allocator to
The target allocator is not related to tail-based sampling; this seems out of place here.
Does it make sense to mention it (tail-based sampling) at all here? It's relevant, but I'm a bit unsure whether it needs to be mentioned here.
I think what I mean is that this topic is already addressed in scaling collectors:
To overcome this, you can deploy a layer of Collectors containing the
load-balancing exporter in front of your Collectors doing the tail-sampling or
the span-to-metrics processing. The load-balancing exporter will hash the trace
ID or the service name consistently and determine which collector backend should
receive spans for that trace.
addresses scaling with a load balancer/hashing traces. I don't know if it makes sense for me to include this in the section here about single writers.
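For reference, a minimal sketch of that load-balancing layer, assuming the loadbalancing exporter from collector-contrib (the backend hostnames are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID # hash by trace ID so all spans of a trace reach one backend
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames: # placeholder backend collectors doing the tail-sampling
          - collector-1:4317
          - collector-2:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

Because the exporter hashes on the trace ID, all spans of a trace reach the same backend collector, which is what makes stateful components like tail-sampling work behind this layer.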
content/en/docs/collector/scaling.md (Outdated)
```yaml
service: 'my-service'
```

You can also use
This is also out of place; it doesn't have anything to do with tail-based sampling.
content/en/docs/collector/scaling.md (Outdated)
the application or service to better delineate the targets.

```yaml
scrape_configs:
```
This is a Prometheus scrape configuration, not an OTel Collector. Either provide the full configuration for the collector, possibly embedding this config you have here.
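For example (job name, target, and backend endpoint below are placeholders), a complete collector configuration embedding such a scrape config under the prometheus receiver might look roughly like this:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-service' # placeholder job name
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888'] # placeholder scrape target

exporters:
  otlp:
    endpoint: backend:4317 # placeholder backend address
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```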
content/en/docs/collector/scaling.md (Outdated)
```yaml
scrape_configs:
```
Same here: users might get confused about where this fits, so please provide a complete OTel Collector config.
Ok, I'll include a complete configuration to make sure this is clear.
…lier in the doc that these considerations should be made
Force-pushed from 18efeda to f1a7d4b
Single-writer principle
/fix:all
You triggered fix:all action run at https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427840190
fix:all run failed, please check https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427840190 for details
/fix:markdown
You triggered fix:markdown action run at https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427849345
fix:markdown run failed, please check https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427849345 for details
@michael2893 do not worry too much about the markdown, link issues, etc. Let's make sure that the content is right first, and then we can fix the PR accordingly. Have you addressed the feedback by @jpkrohling? If so, we need another round of reviews :-)
@svrnm Hi, yes, I did address the content issues from the comments above.
I would like to get the opinion of another Collector approver, as well as an Operator approver. The reason is that I'm not seeing the benefit for most of this change, especially in scaling.md: the point about scaling pull-based scrapers is already covered in "Scaling the Scrapers". In general, I feel like the single-writer principle could be covered with smaller mentions in existing docs, instead of new docs or new sections.
@mx-psi, is this PR following what you had in mind for #4368?
Ah, ok, right. I think with this change, I could enhance context in the existing docs with additional notes, without restating as much as I have, depending on what the desired outcome for #4368 is!
I have not had time to look into this PR yet; I will try to do it next week. What I had in mind, in general terms, is a page I can refer users to when they face trouble because they have an incorrect setup where the same metric stream is being produced by multiple writers. For an example, see open-telemetry/opentelemetry-collector-contrib#32043 (comment). I'll try to give a more thorough review soon.
Summary
This change addresses the request for documentation on the Single-Writer principle (#4368).
Description
deployment/gateway
Open questions