-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[component] Component status reporting rfc #10413
base: main
Are you sure you want to change the base?
[component] Component status reporting rfc #10413
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #10413 +/- ##
==========================================
- Coverage 92.35% 92.35% -0.01%
==========================================
Files 386 386
Lines 18408 18370 -38
==========================================
- Hits 17001 16965 -36
+ Misses 1053 1052 -1
+ Partials 354 353 -1 ☔ View full report in Codecov by Sentry. |
97a059a
to
649dead
Compare
Co-authored-by: Matthew Wear <matthew.wear@gmail.com>
…ylerHelmuth/opentelemetry-collector into component-status-reporting-rfc
|
||
A couple solutions: | ||
1. Accept this reality as an exception to Goal 1. | ||
2. Add back `Host.ReportFatalError`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not do that please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atoulme is there a specific objection to Host.ReportFatalError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep the behavior of having the collector shutdown when it experiences a FatalError, I think we will need to reinstate Host.ReportFatalError
to comply with goal 5:
- Component health reporting must only be a mechanism for reporting health - it should have no mechanisms for taking actions on the health it reports. How consumers of the health reporting system respond to component updates is not a concern of the health reporting system.
I would not be opposed to bringing back Host.ReportFatalError
. If we did we can update the automation to report a FatalError for the component that reported it. It keeps component status reporting out of the FatalError API, but still allows us to generate a fatal error event for Watchers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Host.ReportFatalError
seems like a violation of a goal 5 because it is essentially reporting a status AND saying what to do about it.
If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.
If we want to achieve Goal 1 (components are not required to emit status from within their own code), then we'll need some mechanism for a component to tell the Host to shutdown (assuming we choose to keep that feature).
Host.ReportFatalError
alone isn't in violation of Goal 5 if it is added back exactly as we had it before. The contract would be "A component can call this function to stop the Collector". The Host would not be taking any action on a Status reported from the Health Reporting System, it would be taking action based on the Host interfaces contract.
Adding back Host.ReportFatalError
before 1.0 Component is essentially saying we agree that the component.Host
interface is a stable way to interact with the Host, but the Health Reporting System is not stable yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of not allowing components to directly stop the collector. Instead, I think there should only be able to report a permanent/fatal error (i.e. whatever is the maximum severity error) and then it is always up to the collector as a whole to decide whether that warrants shutting down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should only be able to report a permanent/fatal error (i.e. whatever is the maximum severity error) and then it is always up to the collector as a whole to decide whether that warrants shutting down.
This is what we have today, but it effectively is a direct way to stop the collector. So the feature "I want to stop the collector" is locked behind the Status Reporting system.
A couple solutions: | ||
1. Accept this reality as an exception to Goal 1. | ||
2. Add back `Host.ReportFatalError`. | ||
3. Remove the ability for components to stop the collector be removing `FatalError`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like the right future proof approach. We can and should have the collector decide of its behavior when an error is reported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove this feature, we'd need to decide how components that are currently doing
if errGrpc := jr.grpc.Serve(ln); !errors.Is(errGrpc, grpc.ErrServerStopped) && errGrpc != nil {
jr.settings.ReportStatus(component.NewFatalErrorEvent(errGrpc))
}
Would handle the error returned from .Serve
.
One option is for the component to return a NewPermanent
error, but according to goal 1, component status reporting must be optional. So for components that are not opt-in to component status reporting, and no longer have Host.ReportFatalError
, what would they do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional.
I don't understand the point of goal 1. Why isn't it acceptable to expect them report a permanent error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't it acceptable to expect them report a permanent error?
At the moment I am working under the assumption that our current Status Reporting System is too new to be marked stable in component 1.0. Under that assumption, it needs to be ok for a component to NOT be using ReportStatus
in order to interact with the Collector Host, which, until Status Reporting was added, was always done via component.Host
.
If we want all the custom components out in the wild to be as compatible with component 1.0 as possible, then relying more on component.Host
to interact with the Collector Host is better than status reporting.
Maybe there is a future where component.Host
isn't needed anymore and components only use status reporting, but I don't think we're there yet. At the moment I view status reporting as a great feature that components SHOULD take advantage of, but are not required to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say "interact with the Collector Host", are we only talking about components being able to force the collector to shut down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the other ways a component can interact with the host are accessing the available extensions and accessing available factories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like some notion of reporting a fatal error at any time is the only status-related interaction that we need to pin down for a 1.0 of component. I propose that we commit to such a mechanism for 1.0, but we do not commit to a specific behavior in reaction to the event.
This leaves us room to offer multiple strategies for reacting to a such an event, including:
- Immediate shutdown of the collector.
- Some automatic attempt to restart the component and/or component graph.
- Custom reactions directed via opamp.
- A full blown component status system which includes the notion of a fatal/permanent error.
### Should component health reporting be an opt-in for `component.Host` implementations? | ||
|
||
The current implementation of component status reporting does not add anything to `component.Host` to force a `component.Host` implementation, such as `service.Service`, to be compatible with component status reporting. Instead, it adds `ReportStatus func(*StatusEvent)` to `component.TelemetrySettings` and things that instantiate components, such as `serivce.Service`, should, but are not required, to pass in a value for `ReportStatus`. | ||
|
||
As a result, `component.Host` implementation is not required to engage with the component status reporting system. This could lead to situations where a user adds a status watcher extension that can do nothing because the `component.Host` is not reporting component status updates. | ||
|
||
Is this acceptable? Should we: | ||
1. Require the `component.Host` implementations be compatible with the component status reporting framework? | ||
2. Add some sort of configuration/build flag then enforces the `component.Host` implementation be compatible (or not) with component status reporting? | ||
3. Accept this edge case. | ||
|
||
### Component TelemetrySettings Requirements | ||
|
||
The current implementation of component status reporting added a new field to `component.TelemetrySettings`, `ReportStatus`. This field is technically optional, but would be marked as stable with component 1.0. Are we ok with 1 of the following? | ||
|
||
1. Including a component status reporting feature, `component.TelemetrySettings.ReportStatus`, in the 1.0 version of `component.TelemetrySettings`? | ||
2. Marking `component.TelemetrySettings.ReportStatus` as experimentatal via godoc comments in the 1.0 version of `component.TelemetrySettings`? | ||
|
||
Or should we refactor `component` somehow to remove `ReportStatus` from `component.TelemetrySettings`? | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut feeling is that ReportStatus
should be either part of component.Host
or an optional interface to be implemented by component.Host
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of an optional interface that component.Host
s can implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The larger goal is for components to report runtime status, see #9957. One issue with implementing ReportStatus
on host, is that it's passed into start, and for a component to have access to ReportStatus
outside of Start
it needs to save a handle, either to the function, or to the host. Few do this, but most components save a handle to TelemetrySettings, which provides similar mechanisms for the component to report data about itself. It's not impossible for it be implemented on Host, but its an inconvenient location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few do this
I think this is something we can remediate with good examples, documentation and possibly some autogenerated test where we assert common types of status transitions (e.g. failed to create a server).
The only difference I see is that you can't report status during component creation, but I see that as a feature, not a bug (there's going to be no status transition before a component has been initialized).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am more interested in the ergonomics for component authors. I think its natural for the API to report status to be in the same place as I find a meter, or logger for a component. The extra step of explicitly having to save a separate handle to host, or its ReportStatus method to use outside of Start makes the API more cumbersome to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The advantages I see if we put this in component.Host
are:
- Allowing component status to evolve independently from the rest of the core APIs and decoupling this effort from Collector 1.0 (the most important benefit in my mind)
- Making it impossible for components to misuse
ReportStatus
during creation time - Potentially removing the awkwardness in having both a
component.TelemetrySettings
andservicetelemetry.TelemetrySettings
, and simplifying the implementation of Allow overriding telemetry providers #4970
I feel like these outweigh component.TelemetrySettings
benefits, but it would be great to have a similar list so we can consider pros/cons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On point 3, we will actually introduce the same awkwardness in a different spot. We'll end up with a servicehost.Host
and a component.Host
. This transition from host to telemetry settings was made once before during the original implementation. See this commit for how things looked when ReportStatus was on Host: b45540e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context, I think I wasn't super clear on (3). We have several vendors interested in providing their own meter provider, tracer provider, logger... (see #4970) so moving this awkwardness to component.Host
would potentially simplify adding support for this: all other fields in component.TelemetrySettings
are governed by the service::telemetry
config section and we could more easily use a factory pattern if we were left with the other fields. That said, I don't think (3) should be the deciding factor, I see (1) as the most important one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would also need to confirm that we can still support sharedcomponent with moving the API to host. The sharedcomponent requirements were not fully understood until sometime after b45540e. The requirements I'm referring to are discussed here: #10059 (comment). They are also outlined in the new document: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md#implementation-details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first impression is that supporting sharedcomponent will be a challenge if we move ReportStatus from TelemetrySettings to Host. We were able to solve this problem from within sharedcomponent by chaining together the ReportStatus functions for each instance the component represents. This is done through the TelemetrySettings instances passed to LoadOrStore which is called during create. It's important that we have a ReportStatus function that can report for all instances of a sharecomponent before start-time, because the component needs to report the starting events for its instances at the same time. Trying to discover the instances from within the sharedcomponent at start-time will not work because its already too late. We would need a way to know what instances a sharedcomponent represents from "outside" the sharedcomponent to make this work. That said, the current sharedcomponent situation is not ideal and perhaps needs improvement anyways.
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
This PR adds documentation for the collector status reporting system. It describes the current state of the system and has a section for best practices that we intend to evolve as we develop them. The intended audience is future users of the system and anyone interested in getting a deeper look into how the system works without having to read all of the code. This is intended to be complementary to the [in-progress RFC](#10413). [Here is a preview](https://github.com/open-telemetry/opentelemetry-collector/blob/61abf91b4faec42905b409c352e0e234e5b75ac9/docs/component-status.md) with the diagrams properly rendered. --------- Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com> Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When designing the existing component status system, we faced many challenges related to component instances. For example, a single processor configuration may be used in multiple pipelines, in which case multiple instances of the processor will be instantiated. Another example, a receiver or exporter which is used in pipelines of different data types will have one instance per data type. Should these instances share a single status or do they each have their own? If each instance has its own status, how do we uniquely identify these instances and communicate each one's status to users?
In my opinion this is a problem which needs to be addressed in any status system design. Alternatively, I think we should consider ways to remove these problems by making changes to the way components are instanced such that all instances correspond to exactly one configuration in the collector config.
1. Runtime component health reporting, which cannot be automatically reported, must be opt-in for components. Collector components must not be required to use the component health reporting system. This keeps component as compatible as possible with the Collector’s framework as we approach 1.0 | ||
- The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a full opt-in system will create a lot of unnecessary disparity between components. The collector controls the basic lifecycle of components, so why wouldn't it report this?
Why wouldn't the collector automatically report the basic lifecycle of all components (starting -> running -> stopping -> stopped) and allow each component to opt-in to additional detail (recoverable error -> recovered).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular goal is supposed to be scoped just to components. The intention is to allow components to be usable in a collector without utilizing any health reporting status framework.
Allowing the Collector Host to report the basic lifecycle checks it can handle is handle via Goal 4.
Should this first goal be reworded to make it clearer that this is only about components?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it could be clearer. I think the question of whether or not we will force implementations of component.Host
to implement the basic lifecycle should probably be answered first, since the language in this goal depends on whether or not that can be assumed.
2. Component health reporting must be opt-in for collector users. While the underlying components are always allowed to report their health via the system, the Collector Host or any other listener may only take action when the user has configured the collector accordingly. | ||
- As one example of compliance, the current health reporting system is dependent on the user configuring an extension that can watch for status updates. | ||
3. Component health must be representable as a finite state machine with clear transitions between states. | ||
4. The Collector Host may report statuses Starting, Ok, Stopping and PermanentError on behalf of components as documented in the automation section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the Collector Host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it is the Collector Service (I think technically it is serviceHost
on serivce.Service
, which is implementing component.Host
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more technical way we could word this, maybe just say component.Host
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "service's component.Host
implementation"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I attempted to clarify the term on line 13:
For the sake of simplicity, when `Host` is referenced in this document you can think of the Collector service, which manages all the collector pipelines and the components within.
But we can use service.Service
instead.
|
||
A couple solutions: | ||
1. Accept this reality as an exception to Goal 1. | ||
2. Add back `Host.ReportFatalError`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Host.ReportFatalError
seems like a violation of a goal 5 because it is essentially reporting a status AND saying what to do about it.
If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.
A couple solutions: | ||
1. Accept this reality as an exception to Goal 1. | ||
2. Add back `Host.ReportFatalError`. | ||
3. Remove the ability for components to stop the collector be removing `FatalError`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional.
I don't understand the point of goal 1. Why isn't it acceptable to expect them report a permanent error?
### No way to identify components that are not reporting status | ||
Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### No way to identify components that are not reporting status | |
Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature. |
All components should have their basic lifecycle reported automatically. Whether the component itself reports additional granularity about it's state may be optional, but I don't see why it's important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All components should have their basic lifecycle reported automatically
Agreed, thats handled by Goal 4. Goal 1 is attempting to claim that a component should not be required to call settings.ReportStatus
in order to be called a component and to function successfully in a collector build.
Description
Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current, implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:
component
public API #6506