
Collector docs on single-writer principle #4433

Open · wants to merge 12 commits into main

Conversation

@michael2893

Summary

This change addresses the request for documentation on the single-writer principle (#4368).

Description

  • Add a section on multiple collector deployments in deployment/gateway
  • Define the single-writer principle
  • Provide examples and context

Open questions

  • Can I provide examples from open issues to help better capture this problem?

@michael2893 changed the title from "Michael2893 update collector documentation" to "#4368 - update collector deployment documentation" on May 7, 2024
@michael2893 marked this pull request as ready for review on May 7, 2024
@michael2893 requested review from a team as code owners on May 7, 2024
@michael2893 requested review from atoulme and removed the request for a team on May 7, 2024
@svrnm (Member) commented on May 7, 2024

@open-telemetry/collector-approvers ptal


There is a gateway deployment configured to handle all traffic for three other collectors in the same system.
If the collectors are not uniquely identified and the SDK fails to distinguish between them, they may
send identical data to the gateway collector from different points in time. In this scenario,
Member:

Can you give a more concrete example here? Having multiple instances of a collector behind a load balancer is certainly common practice, and there's no inherent problem in having an SDK send data via this load balancer, causing different data points for the same workload to land at different collector instances.

There are a few situations that need to be accounted for when scaling, like using the target allocator for pull-based scraping (nothing to do with OTLP, though), or tail sampling (due to the stateful nature of that component).
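For readers unfamiliar with the target-allocator pattern mentioned above, here is a hedged sketch of how it looks with the OpenTelemetry Operator: the allocator shards pull-based scrape targets across collector replicas so each target has a single scraper. The resource name, job, scrape target, and the debug exporter are placeholders for illustration:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prom # hypothetical name
spec:
  mode: statefulset
  replicas: 3
  targetAllocator:
    enabled: true # each replica is handed a disjoint set of targets
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: my-app # hypothetical job
              static_configs:
                - targets: ['my-app:8080']
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```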

Author:

Yeah, what I have here isn't really specific enough. I can certainly provide an example here.

All metric data streams within OTLP must have a [single writer](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer)
This is because gateway collector deployments can involve multiple collectors in the same system.
It is possible in this case for instances to receive and process data from the same resources,
which is a violation of the Single-Writer principle. A potential consequence of this may be that
Member:

I'm not sure I agree with this. Here's what the link above says:

All metric data streams within OTLP MUST have one logical writer. This means, conceptually, that any Timeseries created from the Protocol MUST have one originating source of truth. In practical terms, this implies the following:

  • All metric data streams produced by OTel SDKs SHOULD have globally unique identity at any given point in time. Metric identity is defined above.
  • Aggregations of metric streams MUST only be written from a single logical source at any given point in time. Note: This implies aggregated metric streams must reach one destination.

Multiple instances of the collector can certainly receive and process different OTLP requests for the same resource without problems.

In the context of the collector, the single-writer principle is relevant for receivers that create metrics on their own, such as scraping receivers or things like the host metrics receiver. In that case, there should really be only one "host metrics receiver" instance per host.
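To make that point concrete, here is a minimal sketch of a per-host agent configuration with exactly one hostmetrics receiver instance; the collection interval and the debug exporter are illustrative choices, not recommendations:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s # illustrative
    scrapers:
      cpu:
      memory:

exporters:
  debug: {}

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]
```

Run one such agent per host (for example, as a Kubernetes DaemonSet) so each host's metric streams have a single writer.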

Author:

OK, makes sense. This is a much more specific problem than my framing would suggest.

In hindsight,

It is possible in this case for instances to receive and process data from the same resources,
which is a violation of the Single-Writer principle. A potential consequence of this may be that

is not really correct. That's far too general, especially considering the case you've highlighted.

Thanks for the feedback, I will get to revising this.

There are patterns in the data that may provide some insight into whether this is happening or not.
For example, upon visual inspection, unexplained gaps or jumps in the same series may be a clue that
multiple collectors are sending the same samples. Unexplained behavior in a time series could potentially
point to the backend scraping data from multiple sources.
Member:

Another common way to find this out is when the backend complains about "out of order samples": if a data point for the state of a counter at T2 was received, and later a data point for the state of the same counter at T1 arrives, a backend might discard the late data point.

inconsistent data or data loss. Collisions resulting from inconsistent timestamps may lead to an unstable or inconsistent
representation of metrics, such as CPU usage.

### Scaling considerations
Author:

This selection doesn't seem like it belongs in the deployment doc, actually.

I'm wondering if it's even necessary to mention.

This information is basically already covered in Scaling the Collector, so I'm wondering if it's redundant here, or whether the single-writer principle should instead get an additional mention in that document.

Member:

Scaling collectors is another way of saying that we need multiple collectors, so this page is bound to repeat some or much of what's in the scaling page. How about we just add a "single-writer principle" section to that one instead?

In a system with multiple collectors, the single-writer principle is most relevant for receivers that generate their
own metrics, such as scraping receivers or the host metrics receiver.
This is most important for receivers that create their own metrics, such as pull based scraping
that targets a specific metric source.
Member:

I believe the last two statements are saying the same thing. Perhaps complement the first with extra info from the second?

Suggested change
that targets a specific metric source.
In a system with multiple collectors, the single-writer principle is most relevant for receivers that generate their
own metrics or target a specific metric source, such as scraping receivers, host metrics receiver, kubelet stats receiver, and similar.


@michael2893 changed the title from "#4368 - update collector deployment documentation" to "#4368 - update collector documentation on Single-writer principle" on May 8, 2024

Partial or incomplete traces may be consequential for an implementation of
tail-sampling, as the goal is to capture all or most of the spans within the
trace in order to inform sampling decisions. When using the target allocator to
Member:

The target allocator is not related to tail-based sampling; this seems out of place here.

Author:

Does it make sense to mention it (tail-based sampling) at all here? It's relevant, but I'm a bit unsure whether it needs to be mentioned here.

I think what I mean is that this topic is already addressed earlier, in Scaling the Collector:

To overcome this, you can deploy a layer of Collectors containing the
load-balancing exporter in front of your Collectors doing the tail-sampling or
the span-to-metrics processing. The load-balancing exporter will hash the trace
ID or the service name consistently and determine which collector backend should
receive spans for that trace.

which addresses scaling with a load balancer and trace hashing. I don't know if it makes sense for me to include this in the section here about single writers.
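For context, here is a minimal sketch of that load-balancing layer using the contrib loadbalancing exporter; the hostnames, the static resolver, and the insecure TLS setting are placeholder choices for illustration, not recommendations:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID # keep all spans of a trace going to the same backend
    protocol:
      otlp:
        tls:
          insecure: true # placeholder; configure TLS appropriately
    resolver:
      static:
        hostnames: # placeholder backend collectors doing the tail-sampling
          - collector-1:4317
          - collector-2:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```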

```yaml
service: 'my-service'
```

You can also use
Member:

This is also out of place; it doesn't have anything to do with tail-based sampling.

the application or service to better delineate the targets.

```yaml
scrape_configs:
```
Member:

This is a Prometheus scrape configuration, not an OTel Collector one. Please provide the full configuration for the collector, possibly embedding this config you have here.
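For illustration, a hedged sketch of what such a complete collector configuration could look like, embedding the scrape config in the prometheus receiver; the job name, target, and backend endpoint are placeholders:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: my-service # hypothetical job name
          scrape_interval: 30s
          static_configs:
            - targets: ['my-service:8080'] # hypothetical target

exporters:
  otlp:
    endpoint: backend:4317 # hypothetical backend
    tls:
      insecure: true # placeholder; configure TLS appropriately

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```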

```yaml
scrape_configs:
```
Member:

Same here: users might get confused about where this fits, so please provide a complete OTel Collector config.

Author:

Ok, I'll include a complete configuration to make sure this is clear.

@chalin force-pushed the michael2893-update-collector-documentation branch from 18efeda to f1a7d4b on June 8, 2024
@chalin changed the title from "#4368 - update collector documentation on Single-writer principle" to "Collector docs on single-writer principle" on Jun 8, 2024
@chalin (Contributor) commented on Jun 8, 2024

/fix:all

@opentelemetrybot (Contributor)

You triggered fix:all action run at https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427840190

@opentelemetrybot (Contributor)

fix:all run failed, please check https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427840190 for details

@chalin (Contributor) commented on Jun 8, 2024

/fix:markdown

@opentelemetrybot (Contributor)

You triggered fix:markdown action run at https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427849345

@opentelemetrybot (Contributor)

fix:markdown run failed, please check https://github.com/open-telemetry/opentelemetry.io/actions/runs/9427849345 for details

@svrnm (Member) commented on Jun 10, 2024

@michael2893 Do not worry too much about the markdown, link issues, etc. Let's make sure that the content is right, and then we can fix the PR accordingly.

Have you addressed the feedback from @jpkrohling? If so, we need another round of reviews :-)

@michael2893 (Author)

@svrnm Hi, yes, I did address the content issues from the comments above.

@jpkrohling (Member) left a comment

I would like to get the opinion of another Collector approver, as well as an Operator approver. The reason is that I'm not seeing the benefit of most of this change, especially in scaling.md: the point about scaling pull-based scrapers is already covered in "Scaling the Scrapers". In general, I feel like the single-writer principle could be covered with smaller mentions in existing docs, instead of new docs or new sections.

@mx-psi , is this PR following what you had in mind for #4368?

@michael2893 (Author)

Ah, OK, right. I think with this change I could enhance the context in the existing docs with additional notes, without restating as much as I have, depending on what the desired outcome for #4368 is!

@mx-psi (Member) commented on Jun 14, 2024

I have not had time to look into this PR yet; I will try to do it next week. What I had in mind, in general terms, is a page I can refer users to when they face trouble because they have an incorrect setup where the same metric stream is being produced by multiple writers. For an example, see open-telemetry/opentelemetry-collector-contrib#32043 (comment). I'll try to give a more thorough review soon.
