Skip to content

Commit

Permalink
docs(specs): Add specification for startup-error behavior (#14920)
Browse files Browse the repository at this point in the history
  • Loading branch information
srebhan committed Mar 7, 2024
1 parent 4664b6d commit 66bd72b
Showing 1 changed file with 102 additions and 0 deletions.
102 changes: 102 additions & 0 deletions docs/specs/tsd-006-startup-error-behavior.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
# Startup Error Behavior

## Objective

Unified, configurable behavior on retriable startup errors.

## Keywords

inputs, outputs, startup, error, retry

## Overview

Many Telegraf plugins connect to an external service either on the same machine
or via network. On automated startup of Telegraf (e.g. via service) there is no
guarantee that those services are fully started yet, especially when they reside
on a remote host. More and more plugins implement mechanisms to retry reaching
their related service if they failed to do so on startup.

This specification intends to unify the naming of configuration-options, the
values of those options, and their semantic meaning. It describes the behavior
for the different options on handling startup-errors.

Startup errors are all errors occurring in calls to `Start()` for inputs and
service-inputs or `Connect()` for outputs. The behaviors described below
should only be applied in cases where the plugin *explicitly* states that an
startup error is *retriable*. This includes for example network errors
indicating that the host or service is not yet reachable or external
resources, like a machine or file, which are not yet available, but might become
available later. To indicate a retriable startup error the plugin should return
a predefined error-type.

In cases where the error cannot be generally determined be retriable by
the plugin, the plugin might add configuration settings to let the user
configure that property. For example, where an error code indicates a fatal,
non-recoverable error in one case but a non-fatal, recoverable error in another
case.

## Configuration Options and Behaviors

Telegraf must introduce a unified `startup_error_behavior` configuration option
for inputs and output plugins. The option is handled directly by the Telegraf
agent and is not passed down to the plugins. The setting must be available on a
per-plugin basis and defines how Telegraf behaves on startup errors.

For all config option values Telegraf might retry to start the plugin for a
limited number of times during the startup phase before actually processing
data. This corresponds to the current behavior of Telegraf to retry three times
with a fifteen second interval before continuing processing of the plugins.

### `error` behavior

The `error` setting for the `startup_error_behavior` option causes Telegraf to
fail and exit on startup errors. This must be the default behavior.

### `retry` behavior

The `retry` setting for the `startup_error_behavior` option Telegraf must *not*
fail on startup errors and should continue running. Telegraf must retry to
startup the failed plugin in each gather or write cycle, for inputs or for
outputs respectively, for an unlimited number of times. Neither the
plugin's `Gather()` nor `Write()` method is called as long as the startup did
not succeed. Metrics sent to an output plugin will be buffered until the plugin
is actually started. If the metric-buffer limit is reached **metrics might be
dropped**!

In case a plugin signals a partially successful startup, e.g. a subset of the
given endpoints are reachable, Telegraf must try to fully startup the remaining
endpoints by calling `Start()` or `Connect()`, respectively, until full startup
is reached **and** trigger the plugin's `Gather()` nor `Write()` methods.

### `ignore` behavior

When using the `ignore` setting for the `startup_error_behavior` option Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
plugin must be completely removed from processing.

## Plugin Requirements

Plugins participating in handling startup errors must implement the `Start()`
or `Connect()` method for inputs and outputs respectively. Those methods must be
safe to be called multiple times during retries without leaking resources or
causing issues in the service used.

Furthermore, the `Close()` method of the plugins must be safe to be called for
cases where the startup failed without causing panics.

The plugins should return a `nil` error during startup to indicate a successful
startup or a retriable error (via predefined error type) to enable the defined
startup error behaviors. A non-retriable error (via predefined error type) or
a generic error will bypass the startup error behaviors and Telegraf must fail
and exit in the startup phase.

## Related Issues

- [#8586](https://github.com/influxdata/telegraf/issues/8586) `inputs.postgresql`
- [#9778](https://github.com/influxdata/telegraf/issues/9778) `outputs.kafka`
- [#13278](https://github.com/influxdata/telegraf/issues/13278) `outputs.cratedb`
- [#13746](https://github.com/influxdata/telegraf/issues/13746) `inputs.amqp_consumer`
- [#14365](https://github.com/influxdata/telegraf/issues/14365) `outputs.postgresql`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.nvidia-smi`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.rocm-smi`

0 comments on commit 66bd72b

Please sign in to comment.