
Retry on bad dogstatsd connection #13091

Merged: 11 commits into main from ignore-bad-dogstatsd-connection on May 19, 2022

Conversation

@huikang (Contributor) commented May 16, 2022

Description

  • Introduce a new configurable telemetry parameter, dogstatsd_exit_bad_connection. Users can set it to false to let the Consul agent continue its startup process when the connection to the Datadog server fails. The default is true, which exits the agent on a failed connection.

  • An error message is emitted when dogstatsd_exit_bad_connection=false and the DNS name of the dogstatsd server can't be resolved:

2022-05-16T09:18:14.638-0400 [WARN]  agent: bootstrap = true: do not enable unless necessary
2022-05-16T09:18:14.698-0400 [ERROR] agent: failed connection to datadog sink: error="lookup dogstatss on 192.168.1.1:53: no such host"

Why not retry? When connecting to the Datadog server over UDP, it seems the only place the connection can be verified is during Consul's initialization phase, at https://github.com/DataDog/datadog-go/blob/8bfdc335936a79b55b3e2c1080930bc5a3eb57f2/statsd/udp.go#L23
Since the Datadog agent doesn't send any ack packet, it is hard to detect whether the connection is lost. Therefore, retrying may not be a good solution or do much to mitigate the situation.
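
As an aside, a minimal standalone sketch of this behaviour (standard library only; the hostname is a hypothetical placeholder): DNS resolution fails at dial time, but once the UDP socket exists, writes generally succeed even if nothing is listening, so there is no later failure to react to.

package main

import (
	"fmt"
	"net"
)

func main() {
	// Hostname resolution happens here; an unresolvable name fails immediately,
	// which is the only point where the problem is visible.
	conn, err := net.Dial("udp", "bad.dogstatsd.name:8125")
	if err != nil {
		fmt.Println("dial error:", err) // e.g. "lookup bad.dogstatsd.name: no such host"
		return
	}
	defer conn.Close()

	// UDP is connectionless: this write typically succeeds even if no server is
	// listening on the other end, so a "lost connection" cannot be detected here.
	_, err = conn.Write([]byte("consul.example.metric:1|c"))
	fmt.Println("write error:", err) // usually nil
}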

Follow-up: To overcome the issue of a lost connection and inform the user of the status, shall we add a dogstatsd_connection check to Consul's health checks?

Testing & Reproduction steps

Start the Consul agent with the following telemetry configuration:

  "telemetry": {
    "dogstatsd_exit_bad_connection": false,
    "dogstatsd_addr": "bad.dogstatsd.name:8125"
  }

Links

close #3419

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern
  • checklist folder consulted

@github-actions bot added the theme/config, theme/connect, theme/contributing, and theme/internals labels on May 16, 2022
@huikang huikang requested review from boxofrad and eculver May 16, 2022 13:11
@boxofrad (Contributor)

Hey @huikang 👋🏻

Thanks for taking this on!

Although (because this is UDP) we can't be sure that the connection remains good, the problem we're trying to solve here is specifically what happens when the given hostname cannot be resolved but may be resolved in the future.

To that end, I think we should aim to retry with backoff. Otherwise, a transient DNS problem could leave an agent in a state of never emitting metrics.

We could target this specific failure-mode like so:

var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
	go retryWithBackoff()
}

What do you think?

@eculver (Contributor) left a comment

I'm most curious about @boxofrad's question too. My feedback is pretty small.

@@ -3,6 +3,7 @@ on:
branches:
- main
types: [labeled]
workflow_dispatch: {}
Contributor

I think this was inadvertently included from another PR?

Contributor Author

Thanks for catching this. Will remove this line in the updated commit.

lib/telemetry.go Outdated
@@ -153,6 +155,11 @@ type TelemetryConfig struct {
// hcl: telemetry { dogstatsd_tags = []string }
DogstatsdTags []string `json:"dogstatsd_tags,omitempty" mapstructure:"dogstatsd_tags"`

// DogstatsdExitBadConnection verifies the connection to the dogstatsd server
//
// hcl: telemetry { dogstatsd_exit_bad_connection = (true|false) }
Contributor

This is not a blocker, but I am curious what you think about calling this dogstatsd_verify_connection if it's not too much trouble to change? I saw mention of a splunk_verify_connection from the external ticket and thought it was simple and descriptive and wanted to get your thoughts.

)

func TestInitTelemetry(t *testing.T) {
// TODO: add test cases for init telemetry sink
Contributor

It might make sense to just cover the case for the datadog config.

Contributor Author

Sounds good. Will add the test case.

@huikang (Contributor, Author) commented May 16, 2022

We could target this specific failure-mode like so:

var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
	go retryWithBackoff()
}

What do you think?

Hi @boxofrad, thanks for the feedback and the code snippet. I will change the error assertion to if errors.As(err, &dnsError) && dnsError.IsNotFound.

Regarding the go retryWithBackoff(), my understanding is the following:

Since the following code (consul/lib/telemetry.go, lines 334 to 340 in b9e0b14):

if len(sinks) > 0 {
	sinks = append(sinks, memSink)
	metrics.NewGlobal(metricsConf, sinks)
} else {
	metricsConf.EnableHostname = false
	metrics.NewGlobal(metricsConf, memSink)
}

adds all metrics sinks to a slice, the retryWithBackoff method would dial the Datadog server indefinitely (?) with some backoff algorithm (exponential?).
If the connection succeeds, retryWithBackoff appends the dogstatsd sink to the sinks slice and exits. Another caveat: should we add a mutex to guard the sinks slice against a race condition?

Is my understanding correct? Thanks.

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 002686e to 9b27d64 Compare May 17, 2022 04:04
@huikang huikang changed the title Ignore bad dogstatsd connection Retry on bad dogstatsd connection May 17, 2022
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch 2 times, most recently from 1d64a69 to 374a6b5 Compare May 17, 2022 04:19
@boxofrad (Contributor)

Good question @huikang, and sorry that I'd only briefly looked at the code before commenting 😅

On reflection, I'm wondering if it'd be easier to understand if we wrapped the entire sink-building process in a retry loop, such that if we fail to configure any of the sinks we'll use whichever we were able to configure and keep trying the others.

I'm imagining a method (called configureSinks in the snippet below) that always builds the sinks slice from scratch; that way you wouldn't need to worry about guarding it with a mutex in order to mutate it in a retry goroutine.

func InitTelemetry(cfg TelemetryConfig, logger hclog.Logger) (*metrics.InmemSink, error) {
	// ...
	errCh := make(chan error, 1)
	go func() {
		waiter := &retry.Waiter{}

		for {
			sinks, err := configureSinks(cfg, metricsConf.HostName, memSink)
			metrics.NewGlobal(metricsConf, sinks)

			select {
			case errCh <- err:
			default:
			}

			if err == nil || !cfg.RetryFailedConfiguration {
				return
			}

			logger.Warn("failed to configure metric sinks", "retries", waiter.Failures())
			waiter.Wait(context.Background())
		}
	}()

	var err error
	if !cfg.RetryFailedConfiguration {
		err = <-errCh
	}
	return memSink, err
}

func configureSinks(cfg TelemetryConfig, hostName string, memSink metrics.MetricSink) (metrics.MetricSink, error) {
	var (
		sinks  metrics.FanoutSink
		errors error
	)
	addSink := func(fn func(TelemetryConfig, string) (metrics.MetricSink, error)) {
		s, err := fn(cfg, hostName)
		if err != nil {
			errors = multierror.Append(errors, err)
		}
		if s != nil {
			sinks = append(sinks, s)
		}
	}

	addSink(statsiteSink)
	addSink(statsdSink)
	addSink(dogstatdSink)
	addSink(circonusSink)
	addSink(prometheusSink)

	if len(sinks) == 0 {
		return memSink, errors
	}

	return append(sinks, memSink), errors
}

We may still want to target specific (i.e. transient) errors to retry, rather than configuration-level errors. Ideally these would be caught during config validation, but I'm not sure if they are or not.

lib/telemetry.go Outdated
}

for {
logger.Warn("failed to configure metric sinks", "retries", waiter.Failures())
Contributor

It looks like this "failed to configure ..." message will get printed twice for every failure (here and on line 374). Should this first one say something like "retrying.." or something that follows the failure message below?

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch 2 times, most recently from 0d76525 to b86d4a4 Compare May 17, 2022 18:38

retryWithBackoff := func() {
waiter := &retry.Waiter{
MaxWait: 5 * time.Minute,
Contributor

We talked about potentially making this configurable, at least on the TelemetryConfig and not necessarily externally configurable by operators. This could allow us to set it to a short wait in tests and assert that it reached the timeout, maybe? I just wanted to note it here. It may not actually work out without larger changes.

Contributor Author

Agreed.

The PR still doesn't handle the case where the agent process exits while the retry routine is still running. My understanding is that, to catch the agent process shutdown signal, we would have to extract InitTelemetry from the BaseOps method and move it after agent.New. With that, we could pass agent.ShutdownCh to the retry routine.

Contributor Author

@eculver, when I revisited your question, I realized that Waiter doesn't have a timeout field:

type Waiter struct {
	// MinFailures before exponential backoff starts. Any failures before
	// MinFailures is reached will wait MinWait time.
	MinFailures uint
	// MinWait time. Returned after the first failure.
	MinWait time.Duration
	// MaxWait time applied before Jitter. Note that the actual maximum wait time
	// is MaxWait + MaxWait * Jitter.
	MaxWait time.Duration
	// Jitter to add to each wait time. The Jitter is applied after MaxWait, which
	// may cause the actual wait time to exceed MaxWait.
	Jitter Jitter
	// Factor is the multiplier to use when calculating the delay. Defaults to
	// 1 second.
	Factor time.Duration

	failures uint
}

In the latest version, a cancel function is added to stop the retry routine on agent exit. Please take a look. Thanks.
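
For anyone following along, a rough sketch of that cancel pattern (names are illustrative rather than the exact ones in the diff; assumes the context, time, go-metrics, and lib/retry imports):

ctx, cancel := context.WithCancel(context.Background())
// Store cancel on the returned metrics config (hypothetical field) so the
// agent can stop the retry goroutine during shutdown.
metricsCfg.Cancel = cancel

go func() {
	waiter := &retry.Waiter{MaxWait: 5 * time.Minute}
	for {
		sinks, err := configureSinks(cfg, metricsConf.HostName, memSink)
		metrics.NewGlobal(metricsConf, sinks)
		if err == nil {
			return
		}
		logger.Warn("failed to configure metric sinks, retrying", "retries", waiter.Failures())
		if err := waiter.Wait(ctx); err != nil {
			// Wait only returns an error when ctx is cancelled, i.e. the
			// agent is shutting down, so stop retrying.
			return
		}
	}
}()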

Contributor

Thanks for revisiting. I am wondering if rather than morphing it into a "timeout" we would just expose the configuration on the TelemetryConfig (eg. MinFailures, MaxWait, MinWait, Jitter) so that the callers could configure the behavior if necessary. It is certainly not a blocking concern. I will leave it up to you.
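
A sketch of what that could look like (RetryMinWait and RetryMaxWait are purely hypothetical fields, not part of this PR):

// Hypothetical knobs on TelemetryConfig so tests can use a very short wait:
//   RetryFailedConfiguration bool
//   RetryMinWait             time.Duration
//   RetryMaxWait             time.Duration
waiter := &retry.Waiter{
	MinWait: cfg.RetryMinWait,
	MaxWait: cfg.RetryMaxWait, // e.g. a few milliseconds in tests, 5 minutes in production
}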

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 7542c9b to 66f6b45 Compare May 18, 2022 04:44
@huikang (Contributor, Author) commented May 18, 2022

@boxofrad, I implemented two approaches that stop the retry routine on agent exit:

  1. Pass the agent's shutdownCh to InitTelemetry: 5aa172e
  2. Add a Cancel function to the metricsHandler struct: https://github.com/hashicorp/consul/pull/13091/files#diff-78adaa1dd41819d1ec7f10c4a7ca9497675cf1ef56515cfb49988283c526a412R220-R224
    and call the cancel method in agent shutdown: https://github.com/hashicorp/consul/pull/13091/files#diff-62b6ad581fe3a3059ae8c85ef0f31dde4092bfdecfa7d6857c470bcacaa8cc8b

The first one seems less complex, but moves InitTelemetry out of the Setup method. The second follows how autoconfig is stopped:

consul/agent/agent.go

Lines 1429 to 1431 in 15b6494

// this would be cancelled anyways (by the closing of the shutdown ch) but
// this should help them to be stopped more quickly
a.baseDeps.AutoConfig.Stop()

Please let me know which one is preferred, or whether there is another solution. Thanks.

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch 2 times, most recently from 1569295 to 9197afb Compare May 18, 2022 14:24
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 9197afb to 9a36a9d Compare May 18, 2022 17:43
@boxofrad (Contributor) left a comment

This is looking great, nice work @huikang 👏🏻 🎉

I've a handful of comments inline. I think the only real blocker is the IsRetrying thread-safety, though.

In answer to your question, I much prefer the pattern of having the agent responsible for shutting the goroutine down (by calling Cancel).

agent/agent.go Outdated
Comment on lines 1432 to 1434
if a.baseDeps.MetricsConfig.IsRetrying {
a.baseDeps.MetricsConfig.Cancel()
}
Contributor

I'm wondering if this would be a good place to adopt the "Tell Don't Ask" principle, and rely on Cancel doing the right thing, rather than exposing the IsRetrying state.

On a related note, I think checking IsRetrying like this is thread-unsafe because it may be written to by the retry goroutine.
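
A sketch of the "tell, don't ask" shape this could take (field defaults are illustrative): if Cancel defaults to a no-op and is only swapped for a real context.CancelFunc when the retry goroutine starts, the agent can call it unconditionally and no shared IsRetrying flag is needed.

// lib/telemetry (sketch): Cancel is always safe to call.
metricsCfg := MetricsConfig{
	Cancel: func() {}, // replaced with the real context.CancelFunc once retrying starts
}

// agent/agent.go (sketch): tell, don't ask; no thread-unsafe state check.
a.baseDeps.MetricsConfig.Cancel()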

lib/telemetry.go Outdated
Comment on lines 383 to 390
if err := waiter.Wait(ctx); err != nil {
if errors.Is(err, context.Canceled) {
logger.Info("stop retrying configure metrics sinks")
} else {
logger.Error("waiting for retry", "error", err)
}
return
}
Contributor

I think context.Canceled is the only error that could be returned from Wait here. That said, I'm a bit confused about the log message we're emitting in the else branch (as it will exit/not retry).

lib/telemetry.go Outdated
Comment on lines 411 to 416
for _, err := range errs.WrappedErrors() {
var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
return true
}
}
Contributor

I think you can call errors.As with the multierror directly, rather than iterating over the wrapped errors like this.

From the docs:

The resulting error supports errors.As/Is/Unwrap so you can continue to use the stdlib errors package to introspect further.
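
In other words, a sketch of isRetriableError that leans on that behaviour (assuming errs is the multierror built up in configureSinks):

func isRetriableError(errs error) bool {
	// errors.As walks the wrapped errors for us, so no manual loop is needed.
	var dnsError *net.DNSError
	return errors.As(errs, &dnsError) && dnsError.IsNotFound
}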

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from d7caf78 to 5f96b63 Compare May 19, 2022 02:55
Co-authored-by: Dan Upton <daniel@floppy.co>
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 5f96b63 to 52e265a Compare May 19, 2022 17:35
@huikang huikang requested review from eculver and boxofrad May 19, 2022 17:38
@boxofrad (Contributor) left a comment

Awesome work! Thanks again for taking this on 🙇🏻

@eculver (Contributor) left a comment

I really like how this has shaped up! Good stuff. I just have a few questions, most are small.



lib/telemetry.go Outdated

if _, errs := configureSinks(cfg, metricsConf.HostName, memSink); errs != nil {
if isRetriableError(errs) && cfg.RetryFailedConfiguration {
logger.Error("failed configure sinks", "error", multierror.Flatten(errs))
Contributor

Would a warning make more sense here since we are going to retry?

Contributor Author

Fixed

@@ -216,7 +217,9 @@ func (a *TestAgent) Start(t *testing.T) error {
bd.Logger = logger
// if we are not testing telemetry things, let's use a "mock" sink for metrics
if bd.RuntimeConfig.Telemetry.Disable {
bd.MetricsHandler = metrics.NewInmemSink(1*time.Second, time.Minute)
Contributor

This is interesting: I would think that we should be able to configure multiple handlers, but this line leads me to believe that, technically, we only allow a single metrics handler. It may be rare for users to set more than one, but I'm wondering if this is just a deficiency in our docs. If I'm reading them right, I would assume that any configured backend would be used, including when I configure more than one.
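
For what it's worth, go-metrics does support multiple backends at once: configureSinks builds a metrics.FanoutSink, so every write is forwarded to each configured sink, and the line above only swaps in a single in-memory sink for tests. A tiny sketch (the sink variables are hypothetical):

// Each metric is fanned out to every sink in the slice.
fanout := metrics.FanoutSink{dogstatsdSink, prometheusSink, memSink}
metrics.NewGlobal(metricsConf, fanout)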

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 354f840 to 60c2bfb Compare May 19, 2022 19:21
Co-authored-by: Dan Upton <daniel@floppy.co>
Co-authored-by: Evan Culver <eculver@users.noreply.github.com>
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 60c2bfb to a097c23 Compare May 19, 2022 19:22
@huikang huikang merged commit 364d4f5 into main May 19, 2022
@huikang huikang deleted the ignore-bad-dogstatsd-connection branch May 19, 2022 20:03
@hashicorp-ci (Contributor)

After merging, confirm that you see the linked PRs and check them for CI errors.
