Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[component] Component status reporting rfc #10413

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

TylerHelmuth
Copy link
Member

@TylerHelmuth TylerHelmuth commented Jun 14, 2024

Description

Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current, implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:

@TylerHelmuth TylerHelmuth added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jun 14, 2024
Copy link

codecov bot commented Jun 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.35%. Comparing base (3ffb41e) to head (d56bf56).

Current head d56bf56 differs from pull request most recent head ca4fd66

Please upload reports for the commit ca4fd66 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10413      +/-   ##
==========================================
- Coverage   92.35%   92.35%   -0.01%     
==========================================
  Files         386      386              
  Lines       18408    18370      -38     
==========================================
- Hits        17001    16965      -36     
+ Misses       1053     1052       -1     
+ Partials      354      353       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TylerHelmuth TylerHelmuth force-pushed the component-status-reporting-rfc branch from 97a059a to 649dead Compare June 17, 2024 20:42
@TylerHelmuth TylerHelmuth marked this pull request as ready for review June 17, 2024 20:43
@TylerHelmuth TylerHelmuth requested a review from a team as a code owner June 17, 2024 20:43
docs/rfcs/component-status-reporting.md Show resolved Hide resolved
docs/rfcs/component-status-reporting.md Outdated Show resolved Hide resolved
docs/rfcs/component-status-reporting.md Outdated Show resolved Hide resolved
docs/rfcs/component-status-reporting.md Outdated Show resolved Hide resolved
docs/rfcs/component-status-reporting.md Outdated Show resolved Hide resolved

A couple solutions:
1. Accept this reality as an exception to Goal 1.
2. Add back `Host.ReportFatalError`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not do that please.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atoulme is there a specific objection to Host.ReportFatalError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep the behavior of having the collector shutdown when it experiences a FatalError, I think we will need to reinstate Host.ReportFatalError to comply with goal 5:

  1. Component health reporting must only be a mechanism for reporting health - it should have no mechanisms for taking actions on the health it reports. How consumers of the health reporting system respond to component updates is not a concern of the health reporting system.

I would not be opposed to bringing back Host.ReportFatalError. If we did we can update the automation to report a FatalError for the component that reported it. It keeps component status reporting out of the FatalError API, but still allows us to generate a fatal error event for Watchers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Host.ReportFatalError seems like a violation of a goal 5 because it is essentially reporting a status AND saying what to do about it.

If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.

If we want to achieve Goal 1 (components are not required to emit status from within their own code), then we'll need some mechanism for a component to tell the Host to shutdown (assuming we choose to keep that feature).

Host.ReportFatalError alone isn't in violation of Goal 5 if it is added back exactly as we had it before. The contract would be "A component can call this function to stop the Collector". The Host would not be taking any action on a Status reported from the Health Reporting System, it would be taking action based on the Host interfaces contract.

Adding back Host.ReportFatalError before 1.0 Component is essentially saying we agree that the component.Host interface is a stable way to interact with the Host, but the Health Reporting System is not stable yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favor of not allowing components to directly stop the collector. Instead, I think there should only be able to report a permanent/fatal error (i.e. whatever is the maximum severity error) and then it is always up to the collector as a whole to decide whether that warrants shutting down.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should only be able to report a permanent/fatal error (i.e. whatever is the maximum severity error) and then it is always up to the collector as a whole to decide whether that warrants shutting down.

This is what we have today, but it effectively is a direct way to stop the collector. So the feature "I want to stop the collector" is locked behind the Status Reporting system.

A couple solutions:
1. Accept this reality as an exception to Goal 1.
2. Add back `Host.ReportFatalError`.
3. Remove the ability for components to stop the collector be removing `FatalError`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like the right future proof approach. We can and should have the collector decide of its behavior when an error is reported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove this feature, we'd need to decide how components that are currently doing

if errGrpc := jr.grpc.Serve(ln); !errors.Is(errGrpc, grpc.ErrServerStopped) && errGrpc != nil {
    jr.settings.ReportStatus(component.NewFatalErrorEvent(errGrpc))
}

Would handle the error returned from .Serve.

One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional. So for components that are not opt-in to component status reporting, and no longer have Host.ReportFatalError, what would they do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional.

I don't understand the point of goal 1. Why isn't it acceptable to expect them report a permanent error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't it acceptable to expect them report a permanent error?

At the moment I am working under the assumption that our current Status Reporting System is too new to be marked stable in component 1.0. Under that assumption, it needs to be ok for a component to NOT be using ReportStatus in order to interact with the Collector Host, which, until Status Reporting was added, was always done via component.Host.

If we want all the custom components out in the wild to be as compatible with component 1.0 as possible, then relying more on component.Host to interact with the Collector Host is better than status reporting.

Maybe there is a future where component.Host isn't needed anymore and components only use status reporting, but I don't think we're there yet. At the moment I view status reporting as a great feature that components SHOULD take advantage of, but are not required to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say "interact with the Collector Host", are we only talking about components being able to force the collector to shut down?

Copy link
Member Author

@TylerHelmuth TylerHelmuth Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the other ways a component can interact with the host are accessing the available extensions and accessing available factories.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like some notion of reporting a fatal error at any time is the only status-related interaction that we need to pin down for a 1.0 of component. I propose that we commit to such a mechanism for 1.0, but we do not commit to a specific behavior in reaction to the event.

This leaves us room to offer multiple strategies for reacting to a such an event, including:

  • Immediate shutdown of the collector.
  • Some automatic attempt to restart the component and/or component graph.
  • Custom reactions directed via opamp.
  • A full blown component status system which includes the notion of a fatal/permanent error.

docs/rfcs/component-status-reporting.md Outdated Show resolved Hide resolved
Comment on lines +152 to +171
### Should component health reporting be an opt-in for `component.Host` implementations?

The current implementation of component status reporting does not add anything to `component.Host` to force a `component.Host` implementation, such as `service.Service`, to be compatible with component status reporting. Instead, it adds `ReportStatus func(*StatusEvent)` to `component.TelemetrySettings` and things that instantiate components, such as `serivce.Service`, should, but are not required, to pass in a value for `ReportStatus`.

As a result, `component.Host` implementation is not required to engage with the component status reporting system. This could lead to situations where a user adds a status watcher extension that can do nothing because the `component.Host` is not reporting component status updates.

Is this acceptable? Should we:
1. Require the `component.Host` implementations be compatible with the component status reporting framework?
2. Add some sort of configuration/build flag then enforces the `component.Host` implementation be compatible (or not) with component status reporting?
3. Accept this edge case.

### Component TelemetrySettings Requirements

The current implementation of component status reporting added a new field to `component.TelemetrySettings`, `ReportStatus`. This field is technically optional, but would be marked as stable with component 1.0. Are we ok with 1 of the following?

1. Including a component status reporting feature, `component.TelemetrySettings.ReportStatus`, in the 1.0 version of `component.TelemetrySettings`?
2. Marking `component.TelemetrySettings.ReportStatus` as experimentatal via godoc comments in the 1.0 version of `component.TelemetrySettings`?

Or should we refactor `component` somehow to remove `ReportStatus` from `component.TelemetrySettings`?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is that ReportStatus should be either part of component.Host or an optional interface to be implemented by component.Hosts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of an optional interface that component.Hosts can implement.

Copy link
Member

@mwear mwear Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The larger goal is for components to report runtime status, see #9957. One issue with implementing ReportStatus on host, is that it's passed into start, and for a component to have access to ReportStatus outside of Start it needs to save a handle, either to the function, or to the host. Few do this, but most components save a handle to TelemetrySettings, which provides similar mechanisms for the component to report data about itself. It's not impossible for it be implemented on Host, but its an inconvenient location.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few do this

I think this is something we can remediate with good examples, documentation and possibly some autogenerated test where we assert common types of status transitions (e.g. failed to create a server).

The only difference I see is that you can't report status during component creation, but I see that as a feature, not a bug (there's going to be no status transition before a component has been initialized).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am more interested in the ergonomics for component authors. I think its natural for the API to report status to be in the same place as I find a meter, or logger for a component. The extra step of explicitly having to save a separate handle to host, or its ReportStatus method to use outside of Start makes the API more cumbersome to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advantages I see if we put this in component.Host are:

  1. Allowing component status to evolve independently from the rest of the core APIs and decoupling this effort from Collector 1.0 (the most important benefit in my mind)
  2. Making it impossible for components to misuse ReportStatus during creation time
  3. Potentially removing the awkwardness in having both a component.TelemetrySettings and servicetelemetry.TelemetrySettings, and simplifying the implementation of Allow overriding telemetry providers #4970

I feel like these outweigh component.TelemetrySettings benefits, but it would be great to have a similar list so we can consider pros/cons

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On point 3, we will actually introduce the same awkwardness in a different spot. We'll end up with a servicehost.Host and a component.Host. This transition from host to telemetry settings was made once before during the original implementation. See this commit for how things looked when ReportStatus was on Host: b45540e.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context, I think I wasn't super clear on (3). We have several vendors interested in providing their own meter provider, tracer provider, logger... (see #4970) so moving this awkwardness to component.Host would potentially simplify adding support for this: all other fields in component.TelemetrySettings are governed by the service::telemetry config section and we could more easily use a factory pattern if we were left with the other fields. That said, I don't think (3) should be the deciding factor, I see (1) as the most important one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would also need to confirm that we can still support sharedcomponent with moving the API to host. The sharedcomponent requirements were not fully understood until sometime after b45540e. The requirements I'm referring to are discussed here: #10059 (comment). They are also outlined in the new document: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md#implementation-details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first impression is that supporting sharedcomponent will be a challenge if we move ReportStatus from TelemetrySettings to Host. We were able to solve this problem from within sharedcomponent by chaining together the ReportStatus functions for each instance the component represents. This is done through the TelemetrySettings instances passed to LoadOrStore which is called during create. It's important that we have a ReportStatus function that can report for all instances of a sharecomponent before start-time, because the component needs to report the starting events for its instances at the same time. Trying to discover the instances from within the sharedcomponent at start-time will not work because its already too late. We would need a way to know what instances a sharedcomponent represents from "outside" the sharedcomponent to make this work. That said, the current sharedcomponent situation is not ideal and perhaps needs improvement anyways.

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
mx-psi added a commit that referenced this pull request Jun 20, 2024
This PR adds documentation for the collector status reporting system. It
describes the current state of the system and has a section for best
practices that we intend to evolve as we develop them. The intended
audience is future users of the system and anyone interested in getting
a deeper look into how the system works without having to read all of
the code. This is intended to be complementary to the [in-progress
RFC](#10413).

[Here is a
preview](https://github.com/open-telemetry/opentelemetry-collector/blob/61abf91b4faec42905b409c352e0e234e5b75ac9/docs/component-status.md)
with the diagrams properly rendered.

---------

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When designing the existing component status system, we faced many challenges related to component instances. For example, a single processor configuration may be used in multiple pipelines, in which case multiple instances of the processor will be instantiated. Another example, a receiver or exporter which is used in pipelines of different data types will have one instance per data type. Should these instances share a single status or do they each have their own? If each instance has its own status, how do we uniquely identify these instances and communicate each one's status to users?

In my opinion this is a problem which needs to be addressed in any status system design. Alternatively, I think we should consider ways to remove these problems by making changes to the way components are instanced such that all instances correspond to exactly one configuration in the collector config.

Comment on lines +45 to +46
1. Runtime component health reporting, which cannot be automatically reported, must be opt-in for components. Collector components must not be required to use the component health reporting system. This keeps component as compatible as possible with the Collector’s framework as we approach 1.0
- The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a full opt-in system will create a lot of unnecessary disparity between components. The collector controls the basic lifecycle of components, so why wouldn't it report this?

Why wouldn't the collector automatically report the basic lifecycle of all components (starting -> running -> stopping -> stopped) and allow each component to opt-in to additional detail (recoverable error -> recovered).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular goal is supposed to be scoped just to components. The intention is to allow components to be usable in a collector without utilizing any health reporting status framework.

Allowing the Collector Host to report the basic lifecycle checks it can handle is handle via Goal 4.

Should this first goal be reworded to make it clearer that this is only about components?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it could be clearer. I think the question of whether or not we will force implementations of component.Host to implement the basic lifecycle should probably be answered first, since the language in this goal depends on whether or not that can be assumed.

2. Component health reporting must be opt-in for collector users. While the underlying components are always allowed to report their health via the system, the Collector Host or any other listener may only take action when the user has configured the collector accordingly.
- As one example of compliance, the current health reporting system is dependent on the user configuring an extension that can watch for status updates.
3. Component health must be representable as a finite state machine with clear transitions between states.
4. The Collector Host may report statuses Starting, Ok, Stopping and PermanentError on behalf of components as documented in the automation section.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the Collector Host?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it is the Collector Service (I think technically it is serviceHost on serivce.Service, which is implementing component.Host.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more technical way we could word this, maybe just say component.Host?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "service's component.Host implementation"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to clarify the term on line 13:

For the sake of simplicity, when `Host` is referenced in this document you can think of the Collector service, which manages all the collector pipelines and the components within.

But we can use service.Service instead.


A couple solutions:
1. Accept this reality as an exception to Goal 1.
2. Add back `Host.ReportFatalError`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Host.ReportFatalError seems like a violation of a goal 5 because it is essentially reporting a status AND saying what to do about it.

If we want these to be decoupled, then isn't the solution to just allow components to report fatal/permanent errors w/o any expectations for how that will be handled? Then a separate layer can decide what to do about it. e.g. shut down the collector, attempt to restart components, etc.

A couple solutions:
1. Accept this reality as an exception to Goal 1.
2. Add back `Host.ReportFatalError`.
3. Remove the ability for components to stop the collector be removing `FatalError`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is for the component to return a NewPermanent error, but according to goal 1, component status reporting must be optional.

I don't understand the point of goal 1. Why isn't it acceptable to expect them report a permanent error?

Comment on lines +65 to +66
### No way to identify components that are not reporting status
Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### No way to identify components that are not reporting status
Goal 1 states that consumers of component status reporting must be able to identify components in use that have not opted in to component status reporting. Our current implementation does not have this feature.

All components should have their basic lifecycle reported automatically. Whether the component itself reports additional granularity about it's state may be optional, but I don't see why it's important.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All components should have their basic lifecycle reported automatically

Agreed, thats handled by Goal 4. Goal 1 is attempting to claim that a component should not be required to call settings.ReportStatus in order to be called a component and to function successfully in a collector build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants