
Retry on bad dogstatsd connection #13091

Merged: 11 commits into main from ignore-bad-dogstatsd-connection on May 19, 2022

Conversation

@huikang (Contributor) commented May 16, 2022

Description

  • Introduce a new configurable telemetry parameter, dogstatsd_exit_bad_connection. Users can set it to false to let the Consul agent continue its startup process when the connection to the Datadog server fails. The default is true, which exits the agent on a failed connection.

  • An error message is emitted when dogstatsd_exit_bad_connection=false and the DNS name of the dogstatsd server can't be resolved:

2022-05-16T09:18:14.638-0400 [WARN]  agent: bootstrap = true: do not enable unless necessary
2022-05-16T09:18:14.698-0400 [ERROR] agent: failed connection to datadog sink: error="lookup dogstatss on 192.168.1.1:53: no such host"

Why not retry? When connecting to the Datadog server over UDP, it seems the only place the connection can be verified is during Consul's initialization phase, at https://github.com/DataDog/datadog-go/blob/8bfdc335936a79b55b3e2c1080930bc5a3eb57f2/statsd/udp.go#L23
Since the Datadog agent doesn't send any ack packet, it is hard to detect whether the connection is lost. Therefore, retrying may not be a good solution or do much to mitigate the situation.
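
As an aside, a minimal standalone sketch of this behaviour (standard library only; the hostname is a hypothetical placeholder): DNS resolution fails at dial time, but once the UDP socket exists, writes generally succeed even if nothing is listening, so there is no later failure to react to.

package main

import (
	"fmt"
	"net"
)

func main() {
	// Hostname resolution happens here; an unresolvable name fails immediately,
	// which is the only point where the problem is visible.
	conn, err := net.Dial("udp", "bad.dogstatsd.name:8125")
	if err != nil {
		fmt.Println("dial error:", err) // e.g. "lookup bad.dogstatsd.name: no such host"
		return
	}
	defer conn.Close()

	// UDP is connectionless: this write typically succeeds even if no server is
	// listening on the other end, so a "lost connection" cannot be detected here.
	_, err = conn.Write([]byte("consul.example.metric:1|c"))
	fmt.Println("write error:", err) // usually nil
}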

Follow-up: To overcome the issue of a lost connection and inform the user of the status, shall we add a dogstatsd_connection check to Consul's health checks?

Testing & Reproduction steps

Start the Consul agent with the following telemetry configuration:

  "telemetry": {
    "dogstatsd_exit_bad_connection": false,
    "dogstatsd_addr": "bad.dogstatsd.name:8125"
  }

Links

close #3419

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern
  • checklist folder consulted

@github-actions bot added the theme/config, theme/connect, theme/contributing, and theme/internals labels on May 16, 2022
@huikang huikang requested review from boxofrad and eculver May 16, 2022 13:11
@boxofrad (Contributor)

Hey @huikang 👋🏻

Thanks for taking this on!

Although (because this is UDP) we can't be sure that the connection remains good, the problem we're trying to solve here is specifically what happens when the given hostname cannot be resolved but may be resolved in the future.

To that end, I think we should aim to retry with backoff. Otherwise, a transient DNS problem could leave an agent in a state of never emitting metrics.

We could target this specific failure-mode like so:

var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
	go retryWithBackoff()
}

What do you think?

@eculver (Contributor) left a comment

I'm most curious about @boxofrad's question too. My feedback is pretty small.

@@ -3,6 +3,7 @@ on:
branches:
- main
types: [labeled]
workflow_dispatch: {}
Contributor

I think this was inadvertently included from another PR?

Contributor Author

Thanks for catching this. Will remove this line in the updated commit.

lib/telemetry.go Outdated
@@ -153,6 +155,11 @@ type TelemetryConfig struct {
// hcl: telemetry { dogstatsd_tags = []string }
DogstatsdTags []string `json:"dogstatsd_tags,omitempty" mapstructure:"dogstatsd_tags"`

// DogstatsdExitBadConnection verifies the connection to the dogstatsd server
//
// hcl: telemetry { dogstatsd_exit_bad_connection = (true|false) }
Contributor

This is not a blocker, but I am curious what you think about calling this dogstatsd_verify_connection if it's not too much trouble to change? I saw mention of a splunk_verify_connection from the external ticket and thought it was simple and descriptive and wanted to get your thoughts.

)

func TestInitTelemetry(t *testing.T) {
// TODO: add test cases for init telemetry sink
Contributor

It might make sense to just cover the case for the datadog config.

Contributor Author

Sounds good. Will add the test case.

@huikang (Contributor, Author) commented May 16, 2022

We could target this specific failure-mode like so:

var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
	go retryWithBackoff()
}

What do you think?

Hi @boxofrad, thanks for the feedback and the code snippet. I will change the error assertion to if errors.As(err, &dnsError) && dnsError.IsNotFound.

Regarding the go retryWithBackoff(), my understanding is the following:

Since the following code (consul/lib/telemetry.go, lines 334 to 340 in b9e0b14):

if len(sinks) > 0 {
	sinks = append(sinks, memSink)
	metrics.NewGlobal(metricsConf, sinks)
} else {
	metricsConf.EnableHostname = false
	metrics.NewGlobal(metricsConf, memSink)
}

adds all metrics sinks to a slice, the retryWithBackoff method would dial the Datadog server indefinitely (?) with some backoff algorithm (exponential?).
If the connection succeeds, retryWithBackoff appends the dogstatsd sink to the sinks slice and exits. Another caveat: should we add a mutex to guard the sinks slice against a race condition?

Is my understanding correct? Thanks.

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 002686e to 9b27d64 Compare May 17, 2022 04:04
@huikang huikang changed the title Ignore bad dogstatsd connection Retry on bad dogstatsd connection May 17, 2022
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch 2 times, most recently from 1d64a69 to 374a6b5 Compare May 17, 2022 04:19
@boxofrad (Contributor)

Good question @huikang, and sorry that I'd only briefly looked at the code before commenting 😅

On reflection, I'm wondering if it'd be easier to understand if we wrapped the entire sink-building process in a retry loop, such that if we fail to configure any of the sinks we'll use whichever we were able to configure and keep trying the others.

I'm imagining a method (called configureSinks in the snippet below) that always builds the sinks slice from scratch; that way you wouldn't need to worry about guarding it with a mutex in order to mutate it in a retry goroutine.

func InitTelemetry(cfg TelemetryConfig, logger hclog.Logger) (*metrics.InmemSink, error) {
	// ...
	errCh := make(chan error, 1)
	go func() {
		waiter := &retry.Waiter{}

		for {
			sinks, err := configureSinks(cfg, metricsConf.HostName, memSink)
			metrics.NewGlobal(metricsConf, sinks)

			select {
			case errCh <- err:
			default:
			}

			if err == nil || !cfg.RetryFailedConfiguration {
				return
			}

			logger.Warn("failed to configure metric sinks", "retries", waiter.Failures())
			waiter.Wait(context.Background())
		}
	}()

	var err error
	if !cfg.RetryFailedConfiguration {
		err = <-errCh
	}
	return memSink, err
}

func configureSinks(cfg TelemetryConfig, hostName string, memSink metrics.MetricSink) (metrics.MetricSink, error) {
	var (
		sinks  metrics.FanoutSink
		errors error
	)
	addSink := func(fn func(TelemetryConfig, string) (metrics.MetricSink, error)) {
		s, err := fn(cfg, hostName)
		if err != nil {
			errors = multierror.Append(errors, err)
		}
		if s != nil {
			sinks = append(sinks, s)
		}
	}

	addSink(statsiteSink)
	addSink(statsdSink)
	addSink(dogstatdSink)
	addSink(circonusSink)
	addSink(prometheusSink)

	if len(sinks) == 0 {
		return memSink, errors
	}

	return append(sinks, memSink), errors
}

We may still want to target specific (i.e. transient) errors to retry, rather than configuration-level errors. Ideally these would be caught during config validation, but I'm not sure if they are or not.

lib/telemetry.go Outdated
}

for {
logger.Warn("failed to configure metric sinks", "retries", waiter.Failures())
Contributor

It looks like this "failed to configure ..." message will get printed twice for every failure (here and on line 374). Should this first one say something like "retrying.." or something that follows the failure message below?

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch 2 times, most recently from 0d76525 to b86d4a4 Compare May 17, 2022 18:38

retryWithBackoff := func() {
waiter := &retry.Waiter{
MaxWait: 5 * time.Minute,
Contributor

We talked about potentially making this configurable, at least on the TelemetryConfig and not necessarily externally configurable by operators. This could allow us to set it to a short wait in tests and assert that it reached the timeout, maybe? I just wanted to note it here. It may not actually work out without larger changes.

Contributor Author

Agreed.

The PR still doesn't handle the case where the agent process exits while the retry routine is still running. My understanding is that, to catch the agent process shutdown signal, we would have to extract InitTelemetry from the BaseOps method and move it after agent.New. With that, we could pass agent.ShutdownCh to the retry routine.

Contributor Author

@eculver, when I revisited your question, I realized that Waiter doesn't have a timeout field:

type Waiter struct {
	// MinFailures before exponential backoff starts. Any failures before
	// MinFailures is reached will wait MinWait time.
	MinFailures uint
	// MinWait time. Returned after the first failure.
	MinWait time.Duration
	// MaxWait time applied before Jitter. Note that the actual maximum wait time
	// is MaxWait + MaxWait * Jitter.
	MaxWait time.Duration
	// Jitter to add to each wait time. The Jitter is applied after MaxWait, which
	// may cause the actual wait time to exceed MaxWait.
	Jitter Jitter
	// Factor is the multiplier to use when calculating the delay. Defaults to
	// 1 second.
	Factor time.Duration

	failures uint
}

In the latest version, a cancel function is added to stop the retry routine on agent exit. Please take a look. Thanks.
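
For anyone following along, a rough sketch of that cancel pattern (names are illustrative rather than the exact ones in the diff; assumes the context, time, go-metrics, and lib/retry imports):

ctx, cancel := context.WithCancel(context.Background())
// Store cancel on the returned metrics config (hypothetical field) so the
// agent can stop the retry goroutine during shutdown.
metricsCfg.Cancel = cancel

go func() {
	waiter := &retry.Waiter{MaxWait: 5 * time.Minute}
	for {
		sinks, err := configureSinks(cfg, metricsConf.HostName, memSink)
		metrics.NewGlobal(metricsConf, sinks)
		if err == nil {
			return
		}
		logger.Warn("failed to configure metric sinks, retrying", "retries", waiter.Failures())
		if err := waiter.Wait(ctx); err != nil {
			// Wait only returns an error when ctx is cancelled, i.e. the
			// agent is shutting down, so stop retrying.
			return
		}
	}
}()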

Contributor

Thanks for revisiting. I am wondering if rather than morphing it into a "timeout" we would just expose the configuration on the TelemetryConfig (eg. MinFailures, MaxWait, MinWait, Jitter) so that the callers could configure the behavior if necessary. It is certainly not a blocking concern. I will leave it up to you.
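
A sketch of what that could look like (RetryMinWait and RetryMaxWait are purely hypothetical fields, not part of this PR):

// Hypothetical knobs on TelemetryConfig so tests can use a very short wait:
//   RetryFailedConfiguration bool
//   RetryMinWait             time.Duration
//   RetryMaxWait             time.Duration
waiter := &retry.Waiter{
	MinWait: cfg.RetryMinWait,
	MaxWait: cfg.RetryMaxWait, // e.g. a few milliseconds in tests, 5 minutes in production
}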

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 7542c9b to 66f6b45 Compare May 18, 2022 04:44
@huikang (Contributor, Author) commented May 18, 2022

@boxofrad, I implemented two approaches that stop the retry routine on agent exit:

  1. Pass the agent's shutdownCh to InitTelemetry: 5aa172e
  2. Add a Cancel function to the metricsHandler struct: https://github.com/hashicorp/consul/pull/13091/files#diff-78adaa1dd41819d1ec7f10c4a7ca9497675cf1ef56515cfb49988283c526a412R220-R224
    and call the cancel method in agent shutdown: https://github.com/hashicorp/consul/pull/13091/files#diff-62b6ad581fe3a3059ae8c85ef0f31dde4092bfdecfa7d6857c470bcacaa8cc8b

The first one seems less complex, but moves InitTelemetry out of the Setup method. The second follows how autoconfig is stopped:

consul/agent/agent.go

Lines 1429 to 1431 in 15b6494

// this would be cancelled anyways (by the closing of the shutdown ch) but
// this should help them to be stopped more quickly
a.baseDeps.AutoConfig.Stop()

Please let me know which one is preferred, or whether there is another solution. Thanks.

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch 2 times, most recently from 1569295 to 9197afb Compare May 18, 2022 14:24
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 9197afb to 9a36a9d Compare May 18, 2022 17:43
@boxofrad (Contributor) left a comment

This is looking great, nice work @huikang 👏🏻 🎉

I've a handful of comments inline. I think the only real blocker is the IsRetrying thread-safety, though.

In answer to your question, I much prefer the pattern of having the agent responsible for shutting the goroutine down (by calling Cancel).

agent/agent.go Outdated
Comment on lines 1432 to 1434
if a.baseDeps.MetricsConfig.IsRetrying {
a.baseDeps.MetricsConfig.Cancel()
}
Contributor

I'm wondering if this would be a good place to adopt the "Tell Don't Ask" principle, and rely on Cancel doing the right thing, rather than exposing the IsRetrying state.

On a related note, I think checking IsRetrying like this is thread-unsafe because it may be written to by the retry goroutine.
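
A sketch of the "tell, don't ask" shape this could take (field defaults are illustrative): if Cancel defaults to a no-op and is only swapped for a real context.CancelFunc when the retry goroutine starts, the agent can call it unconditionally and no shared IsRetrying flag is needed.

// lib/telemetry (sketch): Cancel is always safe to call.
metricsCfg := MetricsConfig{
	Cancel: func() {}, // replaced with the real context.CancelFunc once retrying starts
}

// agent/agent.go (sketch): tell, don't ask; no thread-unsafe state check.
a.baseDeps.MetricsConfig.Cancel()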

lib/telemetry.go Outdated
Comment on lines 383 to 390
if err := waiter.Wait(ctx); err != nil {
if errors.Is(err, context.Canceled) {
logger.Info("stop retrying configure metrics sinks")
} else {
logger.Error("waiting for retry", "error", err)
}
return
}
Contributor

I think context.Canceled is the only error that could be returned from Wait here. That said, I'm a bit confused about the log message we're emitting in the else branch (as it will exit/not retry).

lib/telemetry.go Outdated
Comment on lines 411 to 416
for _, err := range errs.WrappedErrors() {
var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
return true
}
}
Contributor

I think you can call errors.As with the multierror directly, rather than iterating over the wrapped errors like this.

From the docs:

The resulting error supports errors.As/Is/Unwrap so you can continue to use the stdlib errors package to introspect further.
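
In other words, a sketch of isRetriableError that leans on that behaviour (assuming errs is the multierror built up in configureSinks):

func isRetriableError(errs error) bool {
	// errors.As walks the wrapped errors for us, so no manual loop is needed.
	var dnsError *net.DNSError
	return errors.As(errs, &dnsError) && dnsError.IsNotFound
}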

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from d7caf78 to 5f96b63 Compare May 19, 2022 02:55
Co-authored-by: Dan Upton <daniel@floppy.co>
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 5f96b63 to 52e265a Compare May 19, 2022 17:35
@huikang huikang requested review from eculver and boxofrad May 19, 2022 17:38
@boxofrad (Contributor) left a comment

Awesome work! Thanks again for taking this on 🙇🏻

@eculver (Contributor) left a comment

I really like how this has shaped up! Good stuff. I just have a few questions, most are small.



lib/telemetry.go Outdated

if _, errs := configureSinks(cfg, metricsConf.HostName, memSink); errs != nil {
if isRetriableError(errs) && cfg.RetryFailedConfiguration {
logger.Error("failed configure sinks", "error", multierror.Flatten(errs))
Contributor

Would a warning make more sense here since we are going to retry?

Contributor Author

Fixed

@@ -216,7 +217,9 @@ func (a *TestAgent) Start(t *testing.T) error {
bd.Logger = logger
// if we are not testing telemetry things, let's use a "mock" sink for metrics
if bd.RuntimeConfig.Telemetry.Disable {
bd.MetricsHandler = metrics.NewInmemSink(1*time.Second, time.Minute)
Contributor

This is interesting: I would think that we should be able to configure multiple handlers, but this line leads me to believe that, technically, we only allow a single metrics handler. It may be rare for users to set more than one, but I'm wondering if this is just a deficiency in our docs. If I'm reading them right, I would assume that any configured backend would be used, including when I configure more than one.
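
For what it's worth, go-metrics does support multiple backends at once: configureSinks builds a metrics.FanoutSink, so every write is forwarded to each configured sink, and the line above only swaps in a single in-memory sink for tests. A tiny sketch (the sink variables are hypothetical):

// Each metric is fanned out to every sink in the slice.
fanout := metrics.FanoutSink{dogstatsdSink, prometheusSink, memSink}
metrics.NewGlobal(metricsConf, fanout)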

@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 354f840 to 60c2bfb Compare May 19, 2022 19:21
Co-authored-by: Dan Upton <daniel@floppy.co>
Co-authored-by: Evan Culver <eculver@users.noreply.github.com>
@huikang huikang force-pushed the ignore-bad-dogstatsd-connection branch from 60c2bfb to a097c23 Compare May 19, 2022 19:22
@huikang huikang merged commit 364d4f5 into main May 19, 2022
@huikang huikang deleted the ignore-bad-dogstatsd-connection branch May 19, 2022 20:03
@hashicorp-ci (Contributor)

After merging, confirm that you see the linked PRs and check them for CI errors.
