Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate Host.ReportFatalError(), replace by Component Health Reporting #6344

Closed
tigrannajaryan opened this issue Oct 18, 2022 · 8 comments
Closed
Milestone

Comments

@tigrannajaryan
Copy link
Member

The Problem

Currently many components use Host.ReportFatalError() to indicate problems during the startup.

ReportFatalError() is called asynchronously after the component's Start() function returns.

Unfortunately this creates a problem for anyone who wants to know whether the Collector has started successfully or no. It is currently simply impossible to know. You may have a Collector that starts all the pipelines, you see nice output in the log with zero error messages and assume all is good. However, arbitrary time after that the Collector can fail with a fatal error.

There is no point in time when you can be sure that you have a running Collector that is not going to crash the next second due some component calling ReportFatalError() when it pleases so.

This makes the following capability difficult or impossible: to know if the configuration that the Collector is using a good one. This is necessary for a notion of "last known good config" that we want to use when the Collector is reconfigured by config.Provider watchers.

Proposal

I suggest to get rid of Host.ReportFatalError() altogether.

The Start() function must block until it is certain that the component is up and running.

I looked at our usage of Host.ReportFatalError(). Vast majority of the calls are from the failures to start an HTTP Server. This is totally unnecessary. The errors to start an HTTP Server can be reported synchronously from Start().

Of course Start() function should not block the startup indefinitely. However, problems that can happen at an unknown time after Start() is invoked are not startup problems. They are health problems. Such problems should not result in failing the entire Collector. They should result in the component reporting that it is unhealthly.

My proposal is the following:

  1. Deprecate Host.ReportFatalError()
  2. In all components which call Host.ReportFatalError() to report HTTP Server start failure replace it by proper synchronous return of the error from Start(). This is vast majority of uses.
  3. In components which call Host.ReportFatalError() for other reasons carefully analyze the usage. If it is clearly a startup failure that can happen within known limited time (e.g. 10 seconds) then make sure it is blocking the Start() and return an error from Start(). Otherwise report it as an error in the log and indicate component's bad health (using the proposed component health reporting capability).
  4. Change Host.ReportFatalError() to log an error instead of terminating the Collector.
  5. After some graceful period remove Host.ReportFatalError().
  6. Optionally: modify the Collector startup to call Start() functions concurrently to avoid one serializing the blocking operations. We still need to honour the startup sequence of pipelines (exporters->processors->receivers).

Related to #6226

@tigrannajaryan
Copy link
Member Author

@open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers please let me know what you think.

@tigrannajaryan
Copy link
Member Author

cc @portertech

@bogdandrutu
Copy link
Member

bogdandrutu commented Oct 18, 2022

In all components which call Host.ReportFatalError() to report HTTP Server start failure replace it by proper synchronous return of the error from Start(). This is vast majority of uses.

Does not work like that, "Start" is blocking call, so not sure how and when you decide to "return", welcome to golang world.

Unless you provide a viable solution for this, your proposal from my opinion is not possible to implement.

@bogdandrutu
Copy link
Member

You should also look into #5304, which you may have to implement and use an Extension for your use-case.

@tigrannajaryan
Copy link
Member Author

In all components which call Host.ReportFatalError() to report HTTP Server start failure replace it by proper synchronous return of the error from Start(). This is vast majority of uses.

Does not work like that, "Start" is blocking call, so not sure how and when you decide to "return", welcome to golang world.

Unless you provide a viable solution for this, your proposal from my opinion is not possible to implement.

Yes, I know "Start" is a blocking call. It must block until the server is up, i.e. it can listen on the port. It doesn't need to block after that and any error that happen after that do not need to be fatal errors.

I mean this:

func (c *MyComponent) Start(_ context.Context, host component.Host) error {
  server := http.Server{...}
  ln, err := c.config.TCPAddr.Listen()
  if err != nil {
    return err
  }
  go func() {
    err :=server.Serve(ln)
    if err != nil && !errors.Is(err, http.ErrServerClosed) {
      // Something happened that was not supposed to happen. Listen succeeded so Serve() should work. 
      c.logger.Error(err) // Just log. There is no need to call ReportFatalError() here like we often do
      host.ReportBadHealth(c, err) // Use the new component health reporting feature. 
    }
  }()
}

There are some paths that http.Server.Serve() may still fail (e.g. HTTP/2 config) after Listen() is successful. We may need to check if it matters for our purposes.

@bogdandrutu
Copy link
Member

So your proposal, is not necessary to remove the "ReportFatalError" but to try to report synchronously as many failures as possible, and we still need a "heath/status" channel.

@tigrannajaryan
Copy link
Member Author

So your proposal, is not necessary to remove the "ReportFatalError" but to try to report synchronously as many failures as possible, and we still need a "heath/status" channel.

Yes, exactly. I mean instead of ReportFatalError we should report "bad health". That's it.

@tigrannajaryan
Copy link
Member Author

So your proposal, is not necessary to remove the "ReportFatalError" but to try to report synchronously as many failures as possible, and we still need a "heath/status" channel.

Yes, exactly. I mean instead of ReportFatalError we should report "bad health". That's it.

In some components, it may be useful to refactor the code to move certain parts that currently happen asynchronously after Start() returns to happen instead within Start() if we believe they are a critically necessary condition for starting the Collector. I don't know if such code exist, just speculating at the moment. Again, I believe vast majority of existing ReportFatalError() calls are because of failed Server.Serve(), which is unnecessary.

@tigrannajaryan tigrannajaryan changed the title Eliminate Host.ReportFatalError() Eliminate Host.ReportFatalError(), replace by Component Health Reporting Nov 2, 2022
@tigrannajaryan tigrannajaryan added this to the OpAMP milestone Nov 21, 2022
mx-psi added a commit that referenced this issue Feb 9, 2024
**Description:** 
 Remove `host.ReportFatalError`. It has been deprecated since 0.87.0.

**Link to tracking Issue:**
#6344

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
jeffhostetler added a commit to git-ecosystem/trace2receiver that referenced this issue Mar 6, 2024
host.ReportFatalError() was deprecated and removed.  Update
Windows named pipe startup code to use ReportStatus() instead.

open-telemetry/opentelemetry-collector#6344
open-telemetry/opentelemetry-collector#9506

Signed-off-by: Jeff Hostetler <jeffhostetler@github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants