Collector docs on single-writer principle #4433
Conversation
@open-telemetry/collector-approvers ptal
There is a gateway deployment configured to handle all traffic for three other collectors in the same system.
If the collectors are not uniquely identified and the SDK fails to distinguish between them, they may
send identical data to the gateway collector from different points in time. In this scenario,
Can you give a more concrete example here? Having multiple instances of a collector behind a load balancer is certainly common practice, and there's no inherent problem in having an SDK send data via this load balancer, causing different data points for the same workload to land at different collector instances.
There are a few situations that need to be accounted for when scaling, like using the target allocator for pull-based scraping (nothing to do with OTLP, though), or tail-sampling (due to the stateful nature of that component).
Yeah, what I have here isn't really specific enough. I can certainly provide an example here.
All metric data streams within OTLP must have a [single writer](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer)
This is because gateway collector deployments can involve multiple collectors in the same system.
It is possible in this case for instances to receive and process data from the same resources,
which is a violation of the Single-Writer principle. A potential consequence of this may be that
I'm not sure I agree with this. Here's what the link above says:
All metric data streams within OTLP MUST have one logical writer. This means, conceptually, that any Timeseries created from the Protocol MUST have one originating source of truth. In practical terms, this implies the following:
- All metric data streams produced by OTel SDKs SHOULD have globally unique identity at any given point in time. Metric identity is defined above.
- Aggregations of metric streams MUST only be written from a single logical source at any given point time. Note: This implies aggregated metric streams must reach one destination.
Multiple instances of the collector can certainly receive and process different OTLP requests for the same resource without problems.
In the context of the collector, the single-writer principle is relevant for receivers that create metrics on their own, such as scraping receivers or things like the host metrics receiver. In that case, there should really be only one "host metrics receiver" instance per host.
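As a rough illustration (the interval, scrapers, and backend endpoint below are placeholder choices, not from this PR), a minimal collector configuration built around the host metrics receiver might look like this; the single-writer principle then implies running exactly one such instance per host, for example as a Kubernetes DaemonSet:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s # arbitrary example interval
    scrapers:
      cpu:
      memory:

exporters:
  otlp:
    endpoint: backend:4317 # placeholder backend address
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
```

Running two copies of this config on the same host would produce two writers for the same metric streams, which is exactly the situation the principle rules out.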
Ok, makes sense. This is a much more particular problem than my framing would suggest.
In hindsight,
It is possible in this case for instances to receive and process data from the same resources,
which is a violation of the Single-Writer principle. A potential consequence of this may be that
is not really correct. That's far too general, especially considering the case you've highlighted.
Thanks for the feedback, I will get to revising this.
There are patterns in the data that may provide some insight into whether this is happening or not.
For example, upon visual inspection, unexplained gaps or jumps in the same series may be a clue that
multiple collectors are sending the same samples. Unexplained behavior in a time series could potentially
point to the backend scraping data from multiple sources.
Another common way to find this out is when the backend complains about "out of order samples": if a data point for the state of a counter at T2 was received, and later a data point for the same counter at T1 arrives, the backend might discard the late data point.
inconsistent data or data loss. Collisions resulting from inconsistent timestamps may lead to an unstable or inconsistent
representation of metrics, such as CPU usage.
### Scaling considerations |
This section doesn't seem like it belongs in deployment, I don't think, actually.
I'm wondering if it's even necessary to mention.
This information is basically already covered in Scaling the Collector,
and I'm wondering if it's redundant, or whether that document should instead include some additional mention of the single-writer principle.
Scaling collectors is another way of saying that we need multiple collectors, so this page is bound to repeat some/much of what's in the scaling page. How about we just add a "single-writer principle" section to that one instead?
In a system with multiple collectors, the single-writer principle is most relevant for receivers that generate their
own metrics, such as scraping receivers or the host metrics receiver.
This is most important for receivers that create their own metrics, such as pull based scraping
that targets a specific metric source.
I believe the last two statements are saying the same thing. Perhaps complement the first with extra info from the second?
Suggested change:
that targets a specific metric source.
In a system with multiple collectors, the single-writer principle is most relevant for receivers that generate their
own metrics or target a specific metric source, such as scraping receivers, host metrics receiver, kubelet stats receiver, and similar.
content/en/docs/collector/scaling.md (Outdated)
Partial or incomplete traces may be consequential for an implementation of
tail-sampling, as the goal is to capture all or most of the spans within the
trace in order to inform sampling decisions. When using the target allocator to
The target allocator is not related to tail-based sampling; this seems out of place here.
Does it make sense to mention it (tail-based sampling) at all here? It's relevant, but I'm a bit unsure whether it needs to be mentioned here.
I think what I mean is that this topic is already addressed in scaling collectors:
To overcome this, you can deploy a layer of Collectors containing the
load-balancing exporter in front of your Collectors doing the tail-sampling or
the span-to-metrics processing. The load-balancing exporter will hash the trace
ID or the service name consistently and determine which collector backend should
receive spans for that trace.
addresses scaling with a load balancer/hashing traces. I don't know if it makes sense for me to include this in the section here about single writers.
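For reference, a minimal sketch of that load-balancing layer, assuming the loadbalancing exporter from collector-contrib (the backend hostnames are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID # hash by trace ID so all spans of a trace reach one backend
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames: # placeholder backend collectors doing the tail-sampling
          - collector-1:4317
          - collector-2:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

Because the exporter hashes on the trace ID, all spans of a trace reach the same backend collector, which is what makes stateful components like tail-sampling work behind this layer.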
content/en/docs/collector/scaling.md (Outdated)
```yaml
service: 'my-service'
```

You can also use
This is also out of place; it doesn't have anything to do with tail-based sampling.
content/en/docs/collector/scaling.md (Outdated)
the application or service to better delineate the targets.

```yaml
scrape_configs:
```
This is a Prometheus scrape configuration, not an OTel Collector. Either provide the full configuration for the collector, possibly embedding this config you have here.
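For example (job name, target, and backend endpoint below are placeholders), a complete collector configuration embedding such a scrape config under the prometheus receiver might look roughly like this:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-service' # placeholder job name
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888'] # placeholder scrape target

exporters:
  otlp:
    endpoint: backend:4317 # placeholder backend address
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```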
content/en/docs/collector/scaling.md (Outdated)
```yaml
scrape_configs:
```
Same here: users might get confused about where this fits, so please provide a complete OTel Collector config.
Ok, I'll include a complete configuration to make sure this is clear.
…lier in the doc that these considerations should be made
Force-pushed from 18efeda to f1a7d4b
Single-writer principle
/fix:all
You triggered fix:all action run at https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427840190
fix:all run failed, please check https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427840190 for details
/fix:markdown
You triggered fix:markdown action run at https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427849345
fix:markdown run failed, please check https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427849345 for details
@michael2893 do not worry too much about the markdown, link issues, etc. Let's make sure that the content is right first, and then we can fix the PR accordingly. Have you addressed the feedback by @jpkrohling? If so, we need another round of reviews :-)
@svrnm Hi, yes, I did address the content issues from the comments above.
I would like to get the opinion of another Collector approver, as well as an Operator approver. The reason is that I'm not seeing the benefit for most of this change, especially in scaling.md: the point about scaling pull-based scrapers is already covered in "Scaling the Scrapers". In general, I feel like the single-writer principle could be covered with smaller mentions in existing docs, instead of new docs or new sections.
@mx-psi, is this PR following what you had in mind for #4368?
Ah, ok, right. I think with this change, I could enhance context in the existing docs with additional notes, without restating as much as I have, depending on what the desired outcome for #4368 is!
I have not had time to look into this PR yet; I will try to do it next week. What I had in mind, in general terms, is a page I can refer users to when they face trouble because they have an incorrect setup where the same metric stream is being produced by multiple writers. For an example, see open-telemetry/opentelemetry-collector-contrib#32043 (comment). I'll try to give a more thorough review soon.
Summary
This change addresses the request for documentation on the Single-Writer principle (#4368).
Description
deployment/gateway
Open questions