[component] Component status reporting rfc #10413

TylerHelmuth · 2024-06-14T21:21:44Z

Description

Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current, implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:

codecov · 2024-06-14T21:27:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.35%. Comparing base (3ffb41e) to head (d56bf56).

❗ Current head d56bf56 differs from pull request most recent head ca4fd66

Please upload reports for the commit ca4fd66 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10413      +/-   ##
==========================================
- Coverage   92.35%   92.35%   -0.01%     
==========================================
  Files         386      386              
  Lines       18408    18370      -38     
==========================================
- Hits        17001    16965      -36     
+ Misses       1053     1052       -1     
+ Partials      354      353       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

docs/rfcs/component-status-reporting.md

Co-authored-by: Matthew Wear <matthew.wear@gmail.com>

…ylerHelmuth/opentelemetry-collector into component-status-reporting-rfc

docs/rfcs/component-status-reporting.md

atoulme · 2024-06-17T22:00:52Z

docs/rfcs/component-status-reporting.md

+
+A couple solutions:
+1. Accept this reality as an exception to Goal 1.
+2. Add back `Host.ReportFatalError`.


Let's not do that please.

@atoulme is there a specific objection to Host.ReportFatalError?

If we keep the behavior of having the collector shutdown when it experiences a FatalError, I think we will need to reinstate Host.ReportFatalError to comply with goal 5:

Component health reporting must only be a mechanism for reporting health - it should have no mechanisms for taking actions on the health it reports. How consumers of the health reporting system respond to component updates is not a concern of the health reporting system.

I would not be opposed to bringing back Host.ReportFatalError. If we did we can update the automation to report a FatalError for the component that reported it. It keeps component status reporting out of the FatalError API, but still allows us to generate a fatal error event for Watchers.

Host.ReportFatalError seems like a violation of a goal 5 because it is essentially reporting a status AND saying what to do about it.

If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.

If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.

If we want to achieve Goal 1 (components are not required to emit status from within their own code), then we'll need some mechanism for a component to tell the Host to shutdown (assuming we choose to keep that feature).

Host.ReportFatalError alone isn't in violation of Goal 5 if it is added back exactly as we had it before. The contract would be "A component can call this function to stop the Collector". The Host would not be taking any action on a Status reported from the Health Reporting System, it would be taking action based on the Host interfaces contract.

Adding back Host.ReportFatalError before 1.0 Component is essentially saying we agree that the component.Host interface is a stable way to interact with the Host, but the Health Reporting System is not stable yet.

I would be in favor of not allowing components to directly stop the collector. Instead, I think there should only be able to report a permanent/fatal error (i.e. whatever is the maximum severity error) and then it is always up to the collector as a whole to decide whether that warrants shutting down.

I think there should only be able to report a permanent/fatal error (i.e. whatever is the maximum severity error) and then it is always up to the collector as a whole to decide whether that warrants shutting down.

This is what we have today, but it effectively is a direct way to stop the collector. So the feature "I want to stop the collector" is locked behind the Status Reporting system.

atoulme · 2024-06-17T22:01:30Z

docs/rfcs/component-status-reporting.md

+A couple solutions:
+1. Accept this reality as an exception to Goal 1.
+2. Add back `Host.ReportFatalError`.
+3. Remove the ability for components to stop the collector be removing `FatalError`.


Sounds like the right future proof approach. We can and should have the collector decide of its behavior when an error is reported.

If we remove this feature, we'd need to decide how components that are currently doing

if errGrpc := jr.grpc.Serve(ln); !errors.Is(errGrpc, grpc.ErrServerStopped) && errGrpc != nil { jr.settings.ReportStatus(component.NewFatalErrorEvent(errGrpc)) }

Would handle the error returned from .Serve.

One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional. So for components that are not opt-in to component status reporting, and no longer have Host.ReportFatalError, what would they do?

One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional.

I don't understand the point of goal 1. Why isn't it acceptable to expect them report a permanent error?

Why isn't it acceptable to expect them report a permanent error?

At the moment I am working under the assumption that our current Status Reporting System is too new to be marked stable in component 1.0. Under that assumption, it needs to be ok for a component to NOT be using ReportStatus in order to interact with the Collector Host, which, until Status Reporting was added, was always done via component.Host.

If we want all the custom components out in the wild to be as compatible with component 1.0 as possible, then relying more on component.Host to interact with the Collector Host is better than status reporting.

Maybe there is a future where component.Host isn't needed anymore and components only use status reporting, but I don't think we're there yet. At the moment I view status reporting as a great feature that components SHOULD take advantage of, but are not required to.

When you say "interact with the Collector Host", are we only talking about components being able to force the collector to shut down?

Yes, but the other ways a component can interact with the host are accessing the available extensions and accessing available factories.

It sounds like some notion of reporting a fatal error at any time is the only status-related interaction that we need to pin down for a 1.0 of component. I propose that we commit to such a mechanism for 1.0, but we do not commit to a specific behavior in reaction to the event.

This leaves us room to offer multiple strategies for reacting to a such an event, including:

Immediate shutdown of the collector.

Some automatic attempt to restart the component and/or component graph.

Custom reactions directed via opamp.

A full blown component status system which includes the notion of a fatal/permanent error.

docs/rfcs/component-status-reporting.md

mx-psi · 2024-06-18T10:50:29Z

docs/rfcs/component-status-reporting.md

+### Should component health reporting be an opt-in for `component.Host` implementations?
+
+The current implementation of component status reporting does not add anything to `component.Host` to force a `component.Host` implementation, such as `service.Service`, to be compatible with component status reporting.  Instead, it adds `ReportStatus func(*StatusEvent)` to `component.TelemetrySettings` and things that instantiate components, such as `serivce.Service`, should, but are not required, to pass in a value for `ReportStatus`.
+
+As a result, `component.Host` implementation is not required to engage with the component status reporting system.  This could lead to situations where a user adds a status watcher extension that can do nothing because the `component.Host` is not reporting component status updates.
+
+Is this acceptable? Should we:
+1. Require the `component.Host` implementations be compatible with the component status reporting framework?
+2. Add some sort of configuration/build flag then enforces the `component.Host` implementation be compatible (or not) with component status reporting?
+3. Accept this edge case.
+
+### Component TelemetrySettings Requirements
+
+The current implementation of component status reporting added a new field to `component.TelemetrySettings`, `ReportStatus`.  This field is technically optional, but would be marked as stable with component 1.0. Are we ok with 1 of the following?
+
+1. Including a component status reporting feature, `component.TelemetrySettings.ReportStatus`, in the 1.0 version of `component.TelemetrySettings`?
+2. Marking `component.TelemetrySettings.ReportStatus` as experimentatal via godoc comments in the 1.0 version of `component.TelemetrySettings`?
+
+Or should we refactor `component` somehow to remove `ReportStatus` from `component.TelemetrySettings`?
+


My gut feeling is that ReportStatus should be either part of component.Host or an optional interface to be implemented by component.Hosts

I like the idea of an optional interface that component.Hosts can implement.

The larger goal is for components to report runtime status, see #9957. One issue with implementing ReportStatus on host, is that it's passed into start, and for a component to have access to ReportStatus outside of Start it needs to save a handle, either to the function, or to the host. Few do this, but most components save a handle to TelemetrySettings, which provides similar mechanisms for the component to report data about itself. It's not impossible for it be implemented on Host, but its an inconvenient location.

Few do this

I think this is something we can remediate with good examples, documentation and possibly some autogenerated test where we assert common types of status transitions (e.g. failed to create a server).

The only difference I see is that you can't report status during component creation, but I see that as a feature, not a bug (there's going to be no status transition before a component has been initialized).

I am more interested in the ergonomics for component authors. I think its natural for the API to report status to be in the same place as I find a meter, or logger for a component. The extra step of explicitly having to save a separate handle to host, or its ReportStatus method to use outside of Start makes the API more cumbersome to use.

The advantages I see if we put this in component.Host are:

Allowing component status to evolve independently from the rest of the core APIs and decoupling this effort from Collector 1.0 (the most important benefit in my mind)

Making it impossible for components to misuse ReportStatus during creation time

Potentially removing the awkwardness in having both a component.TelemetrySettings and servicetelemetry.TelemetrySettings, and simplifying the implementation of Allow overriding telemetry providers #4970

I feel like these outweigh component.TelemetrySettings benefits, but it would be great to have a similar list so we can consider pros/cons

On point 3, we will actually introduce the same awkwardness in a different spot. We'll end up with a servicehost.Host and a component.Host. This transition from host to telemetry settings was made once before during the original implementation. See this commit for how things looked when ReportStatus was on Host: b45540e.

Thanks for the context, I think I wasn't super clear on (3). We have several vendors interested in providing their own meter provider, tracer provider, logger... (see #4970) so moving this awkwardness to component.Host would potentially simplify adding support for this: all other fields in component.TelemetrySettings are governed by the service::telemetry config section and we could more easily use a factory pattern if we were left with the other fields. That said, I don't think (3) should be the deciding factor, I see (1) as the most important one.

We would also need to confirm that we can still support sharedcomponent with moving the API to host. The sharedcomponent requirements were not fully understood until sometime after b45540e. The requirements I'm referring to are discussed here: #10059 (comment). They are also outlined in the new document: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md#implementation-details.

My first impression is that supporting sharedcomponent will be a challenge if we move ReportStatus from TelemetrySettings to Host. We were able to solve this problem from within sharedcomponent by chaining together the ReportStatus functions for each instance the component represents. This is done through the TelemetrySettings instances passed to LoadOrStore which is called during create. It's important that we have a ReportStatus function that can report for all instances of a sharecomponent before start-time, because the component needs to report the starting events for its instances at the same time. Trying to discover the instances from within the sharedcomponent at start-time will not work because its already too late. We would need a way to know what instances a sharedcomponent represents from "outside" the sharedcomponent to make this work. That said, the current sharedcomponent situation is not ideal and perhaps needs improvement anyways.

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>

This PR adds documentation for the collector status reporting system. It describes the current state of the system and has a section for best practices that we intend to evolve as we develop them. The intended audience is future users of the system and anyone interested in getting a deeper look into how the system works without having to read all of the code. This is intended to be complementary to the [in-progress RFC](#10413). [Here is a preview](https://github.com/open-telemetry/opentelemetry-collector/blob/61abf91b4faec42905b409c352e0e234e5b75ac9/docs/component-status.md) with the diagrams properly rendered. --------- Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com> Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>

…porting-rfc

djaglowski

When designing the existing component status system, we faced many challenges related to component instances. For example, a single processor configuration may be used in multiple pipelines, in which case multiple instances of the processor will be instantiated. Another example, a receiver or exporter which is used in pipelines of different data types will have one instance per data type. Should these instances share a single status or do they each have their own? If each instance has its own status, how do we uniquely identify these instances and communicate each one's status to users?

In my opinion this is a problem which needs to be addressed in any status system design. Alternatively, I think we should consider ways to remove these problems by making changes to the way components are instanced such that all instances correspond to exactly one configuration in the collector config.

djaglowski · 2024-07-01T14:47:37Z

docs/rfcs/component-status-reporting.md

+1. Runtime component health reporting, which cannot be automatically reported, must be opt-in for components.  Collector components must not be required to use the component health reporting system. This keeps component as compatible as possible with the Collector’s framework as we approach 1.0
+    - The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system.


I think having a full opt-in system will create a lot of unnecessary disparity between components. The collector controls the basic lifecycle of components, so why wouldn't it report this?

Why wouldn't the collector automatically report the basic lifecycle of all components (starting -> running -> stopping -> stopped) and allow each component to opt-in to additional detail (recoverable error -> recovered).

This particular goal is supposed to be scoped just to components. The intention is to allow components to be usable in a collector without utilizing any health reporting status framework.

Allowing the Collector Host to report the basic lifecycle checks it can handle is handle via Goal 4.

Should this first goal be reworded to make it clearer that this is only about components?

Yeah I think it could be clearer. I think the question of whether or not we will force implementations of component.Host to implement the basic lifecycle should probably be answered first, since the language in this goal depends on whether or not that can be assumed.

djaglowski · 2024-07-01T14:51:14Z

docs/rfcs/component-status-reporting.md

+2. Component health reporting must be opt-in for collector users.  While the underlying components are always allowed to report their health via the system, the Collector Host or any other listener may only take action when the user has configured the collector accordingly.
+   - As one example of compliance, the current health reporting system is dependent on the user configuring an extension that can watch for status updates.
+3. Component health must be representable as a finite state machine with clear transitions between states.
+4. The Collector Host may report statuses Starting, Ok,  Stopping and PermanentError on behalf of components as documented in the automation section.


What is the Collector Host?

In this case it is the Collector Service (I think technically it is serviceHost on serivce.Service, which is implementing component.Host.

Is there a more technical way we could word this, maybe just say component.Host?

Maybe "service's component.Host implementation"?

I attempted to clarify the term on line 13:

For the sake of simplicity, when `Host` is referenced in this document you can think of the Collector service, which manages all the collector pipelines and the components within.

But we can use service.Service instead.

djaglowski · 2024-07-01T14:55:42Z

docs/rfcs/component-status-reporting.md

+
+A couple solutions:
+1. Accept this reality as an exception to Goal 1.
+2. Add back `Host.ReportFatalError`.


Host.ReportFatalError seems like a violation of a goal 5 because it is essentially reporting a status AND saying what to do about it.

If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.

djaglowski · 2024-07-01T15:00:24Z

docs/rfcs/component-status-reporting.md

+A couple solutions:
+1. Accept this reality as an exception to Goal 1.
+2. Add back `Host.ReportFatalError`.
+3. Remove the ability for components to stop the collector be removing `FatalError`.


One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional.

I don't understand the point of goal 1. Why isn't it acceptable to expect them report a permanent error?

djaglowski · 2024-07-01T15:01:46Z

docs/rfcs/component-status-reporting.md

+### No way to identify components that are not reporting status
+Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature.


Suggested change

### No way to identify components that are not reporting status

Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature.

All components should have their basic lifecycle reported automatically. Whether the component itself reports additional granularity about it's state may be optional, but I don't see why it's important.

All components should have their basic lifecycle reported automatically

Agreed, thats handled by Goal 4. Goal 1 is attempting to claim that a component should not be required to call settings.ReportStatus in order to be called a component and to function successfully in a collector build.

TylerHelmuth requested review from codeboten and mwear June 14, 2024 21:22

TylerHelmuth added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jun 14, 2024

Add a component status reporting RFC doc

649dead

TylerHelmuth force-pushed the component-status-reporting-rfc branch from 97a059a to 649dead Compare June 17, 2024 20:42

TylerHelmuth marked this pull request as ready for review June 17, 2024 20:43

TylerHelmuth requested a review from a team as a code owner June 17, 2024 20:43

mwear reviewed Jun 17, 2024

View reviewed changes

TylerHelmuth and others added 3 commits June 17, 2024 15:22

Apply suggestions from code review

e10ff8d

Co-authored-by: Matthew Wear <matthew.wear@gmail.com>

Apply feedback

e87c606

Merge branch 'component-status-reporting-rfc' of https://github.com/T…

2eb2ed7

…ylerHelmuth/opentelemetry-collector into component-status-reporting-rfc

atoulme reviewed Jun 17, 2024

View reviewed changes

docs/rfcs/component-status-reporting.md Outdated Show resolved Hide resolved

atoulme reviewed Jun 17, 2024

View reviewed changes

mwear mentioned this pull request Jun 17, 2024

[component] Status Reporting Documentation #10422

Merged

mx-psi reviewed Jun 18, 2024

View reviewed changes

Update docs/rfcs/component-status-reporting.md

d56bf56

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>

TylerHelmuth added 2 commits June 20, 2024 16:32

Merge remote-tracking branch 'upstream/main' into component-status-re…

a2d38e2

…porting-rfc

Link to published status reporting doc

ca4fd66

djaglowski reviewed Jul 1, 2024

View reviewed changes

		1. Runtime component health reporting, which cannot be automatically reported, must be opt-in for components. Collector components must not be required to use the component health reporting system. This keeps component as compatible as possible with the Collector’s framework as we approach 1.0
		- The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system.

		### No way to identify components that are not reporting status
		Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature.

[component] Component status reporting rfc #10413

Are you sure you want to change the base?

[component] Component status reporting rfc #10413

Conversation

TylerHelmuth commented Jun 14, 2024 • edited Loading

Description

codecov bot commented Jun 14, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TylerHelmuth Jul 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mwear Jun 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

djaglowski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TylerHelmuth commented Jun 14, 2024 •

edited

Loading

codecov bot commented Jun 14, 2024 •

edited

Loading

TylerHelmuth Jul 3, 2024 •

edited

Loading

mwear Jun 18, 2024 •

edited

Loading