Retry on bad dogstatsd connection #13091
Conversation
Hey @huikang 👋🏻 Thanks for taking this on! Although (because this is UDP) we can't be sure that the connection remains good, the problem we're trying to solve here is specifically what happens when the given hostname cannot be resolved but may be resolved in the future. To that end, I think we should aim to retry with backoff. Otherwise, a transient DNS problem could leave an agent in a state of never emitting metrics. We could target this specific failure mode like so:

var dnsError *net.DNSError
if errors.As(err, &dnsError) && dnsError.IsNotFound {
	go retryWithBackoff()
}

What do you think?
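A minimal sketch of what such a retryWithBackoff might look like, assuming Consul's lib/retry.Waiter (quoted later in this thread) plus a cfg and logger in scope; tryConfigureDogstatsd is a hypothetical stand-in, and this is illustrative rather than the PR's code:

retryWithBackoff := func(ctx context.Context) {
	waiter := &retry.Waiter{MaxWait: 5 * time.Minute}
	for {
		// tryConfigureDogstatsd is a placeholder for the sink-configuration call.
		if err := tryConfigureDogstatsd(cfg); err == nil {
			return
		}
		logger.Warn("failed to configure dogstatsd sink, will retry", "retries", waiter.Failures())
		// Wait applies exponential backoff; it returns an error only when ctx is done.
		if err := waiter.Wait(ctx); err != nil {
			return // cancelled, e.g. on agent shutdown
		}
	}
}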
I'm most curious about @boxofrad's question too. My feedback is pretty small.
.github/workflows/load-test.yml
@@ -3,6 +3,7 @@ on:
branches:
- main
types: [labeled]
workflow_dispatch: {}
I think this was inadvertently included from another PR?
Thanks for catching this. Will remove this line in an updated commit.
lib/telemetry.go
@@ -153,6 +155,11 @@ type TelemetryConfig struct { | |||
// hcl: telemetry { dogstatsd_tags = []string } | |||
DogstatsdTags []string `json:"dogstatsd_tags,omitempty" mapstructure:"dogstatsd_tags"` | |||
|
|||
// DogstatsdExitBadConnection verify connection to dogstatsd server | |||
// | |||
// hcl: telemetry { dogstatsd_exit_bad_connection = (true|false) |
This is not a blocker, but I am curious what you think about calling this dogstatsd_verify_connection, if it's not too much trouble to change? I saw mention of a splunk_verify_connection from the external ticket and thought it was simple and descriptive and wanted to get your thoughts.
lib/telemetry_test.go
)

func TestInitTelemetry(t *testing.T) {
	// TODO: add test cases for init telemetry sink
It might make sense to just cover the case for the datadog config.
Sounds good. Will add the test case.
Hi @boxofrad, thanks for the feedback and the code snippet. I will change the error assertion as suggested. Regarding the retry: since lines 334 to 340 in b9e0b14 add all metrics sinks to an array, is my understanding correct? Thanks.
Good question @huikang, and sorry that I'd only briefly looked at the code before commenting 😅 On reflection, I'm wondering if it'd be easier to understand if we wrapped the entire sink-building process in a retry loop, such that if we fail to configure any of the sinks we'll use whichever we were able to configure and keep trying the others. I'm imagining a method (called configureSinks in the sketch below) along these lines:

func InitTelemetry(cfg TelemetryConfig, logger hclog.Logger) (*metrics.InmemSink, error) {
	// ...
	errCh := make(chan error, 1)
	go func() {
		waiter := &retry.Waiter{}
		for {
			sinks, err := configureSinks(cfg, metricsConf.HostName, memSink)
			metrics.NewGlobal(metricsConf, sinks)
			select {
			case errCh <- err:
			default:
			}
			if err == nil || !cfg.RetryFailedConfiguration {
				return
			}
			logger.Warn("failed to configure metric sinks", "retries", waiter.Failures())
			waiter.Wait(context.Background())
		}
	}()

	var err error
	if !cfg.RetryFailedConfiguration {
		err = <-errCh
	}
	return memSink, err
}
func configureSinks(cfg TelemetryConfig, hostName string, memSink metrics.MetricSink) (metrics.MetricSink, error) {
	var (
		sinks  metrics.FanoutSink
		errors error
	)
	addSink := func(fn func(TelemetryConfig, string) (metrics.MetricSink, error)) {
		s, err := fn(cfg, hostName)
		if err != nil {
			errors = multierror.Append(errors, err)
		}
		if s != nil {
			sinks = append(sinks, s)
		}
	}

	addSink(statsiteSink)
	addSink(statsdSink)
	addSink(dogstatdSink)
	addSink(circonusSink)
	addSink(prometheusSink)

	if len(sinks) == 0 {
		return memSink, errors
	}
	return append(sinks, memSink), errors
}

We may still want to target specific (i.e. transient) errors to retry, rather than configuration-level errors. Ideally these would be caught during config validation, but I'm not sure if they are or not.
lib/telemetry.go
}

for {
	logger.Warn("failed to configure metric sinks", "retries", waiter.Failures())
It looks like this "failed to configure ..." message will get printed twice for every failure (here and on line 374). Should this first one say something like "retrying.." or something that follows the failure message below?
retryWithBackoff := func() {
	waiter := &retry.Waiter{
		MaxWait: 5 * time.Minute,
We talked about potentially making this configurable, at least on the TelemetryConfig and not necessarily externally configurable by operators. This could allow us to set it to a short wait in tests and assert that it reached the timeout, maybe? I just wanted to note it here. It may not actually work out without larger changes.
Agreed.
The PR is still missing handling for the case where the agent process exits while the retry routine is still running. My understanding is that to catch the agent process shutdown signal, we have to extract InitTelemetry from the BaseOps method and move it after agent.New. With that, we can pass the agent.ShutdownCh to the retry routine.
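A rough sketch of that wiring, assuming a shutdown channel such as agent.ShutdownCh is available (names here are illustrative): derive a cancellable context from the channel and hand it to the retry routine.

// Hypothetical glue code: cancel the retry routine when the agent exits.
ctx, cancel := context.WithCancel(context.Background())
go func() {
	<-shutdownCh // e.g. the agent's ShutdownCh
	cancel()
}()
go retryWithBackoff(ctx)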
@eculver, when I revisited your question, I realized that Waiter doesn't have a timeout field (see lines 34 to 50 in 5f96b63):
type Waiter struct {
	// MinFailures before exponential backoff starts. Any failures before
	// MinFailures is reached will wait MinWait time.
	MinFailures uint
	// MinWait time. Returned after the first failure.
	MinWait time.Duration
	// MaxWait time applied before Jitter. Note that the actual maximum wait time
	// is MaxWait + MaxWait * Jitter.
	MaxWait time.Duration
	// Jitter to add to each wait time. The Jitter is applied after MaxWait, which
	// may cause the actual wait time to exceed MaxWait.
	Jitter Jitter
	// Factor is the multiplier to use when calculating the delay. Defaults to
	// 1 second.
	Factor time.Duration
	failures uint
}
In the latest version, a cancel function is added to stop the retry routine on agent exit. Please take a look. Thanks.
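Since Waiter has no timeout field, one workaround (a minimal sketch, not code from this PR) is to bound the total retry time externally with the context passed to Wait:

// Hypothetical example: the Waiter controls per-attempt backoff, while the
// context deadline bounds the overall time spent retrying.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
waiter := &retry.Waiter{MinWait: time.Second, MaxWait: 5 * time.Minute}
for {
	if err := doSomething(); err == nil { // doSomething is a placeholder operation
		break
	}
	if err := waiter.Wait(ctx); err != nil {
		break // deadline exceeded or cancelled
	}
}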
Thanks for revisiting. I am wondering if rather than morphing it into a "timeout" we would just expose the configuration on the TelemetryConfig (e.g. MinFailures, MaxWait, MinWait, Jitter) so that the callers could configure the behavior if necessary. It is certainly not a blocking concern. I will leave it up to you.
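For illustration only, exposing those knobs might look roughly like this on TelemetryConfig (field names other than RetryFailedConfiguration are hypothetical, not from the PR):

type TelemetryConfig struct {
	// ...existing fields...

	// RetryFailedConfiguration enables the retry loop for failed sink setup.
	RetryFailedConfiguration bool
	// Hypothetical knobs mirroring retry.Waiter, so tests could shorten the backoff.
	RetryMinFailures uint
	RetryMinWait     time.Duration
	RetryMaxWait     time.Duration
}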
@boxofrad, I implemented two approaches that stop the retry routine on the agent's exit:
The first one seems less complex, but moves the InitTelemetry out of the Setup approach. The second follows how autoconfig is stopped (lines 1429 to 1431 in 15b6494).
Please let me know which one is preferred, or whether there is another solution. Thanks.
- Add a new telemetry configuration dogstatsd_exit_bad_connection: when set to false, the consul agent continues on a failed connection to the datadog server. Default is true, which exits the agent process.
This is looking great, nice work @huikang 👏🏻 🎉
I've a handful of comments inline. I think the only real blocker is the IsRetrying thread-safety, though.
In answer to your question, I much prefer the pattern of having the agent responsible for shutting the goroutine down (by calling Cancel).
agent/agent.go
if a.baseDeps.MetricsConfig.IsRetrying {
	a.baseDeps.MetricsConfig.Cancel()
}
I'm wondering if this would be a good place to adopt the "Tell Don't Ask" principle, and rely on Cancel doing the right thing, rather than exposing the IsRetrying state.
On a related note, I think checking IsRetrying like this is thread-unsafe because it may be written to by the retry goroutine.
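One possible shape for that (type and field names are illustrative, not the PR's): create the cancel function in InitTelemetry before the goroutine starts, and make Cancel a no-op when nothing is running, so the agent never inspects IsRetrying.

// Hypothetical sketch: cancelFn is assigned before the retry goroutine is
// started, so there is no data race, and context.CancelFunc is idempotent
// and safe for concurrent use.
type MetricsHandle struct {
	cancelFn context.CancelFunc
}

func (h *MetricsHandle) Cancel() {
	if h.cancelFn != nil {
		h.cancelFn()
	}
}

The agent can then simply call Cancel() on shutdown without any state check.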
lib/telemetry.go
if err := waiter.Wait(ctx); err != nil {
	if errors.Is(err, context.Canceled) {
		logger.Info("stop retrying configure metrics sinks")
	} else {
		logger.Error("waiting for retry", "error", err)
	}
	return
}
I think context.Canceled is the only error that could be returned from Wait here. That said, I'm a bit confused about the log message we're emitting in the else branch (as it will exit/not retry).
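If that's right, the branch could collapse to something like this (a sketch, assuming Wait only fails when its context is done):

// Hypothetical simplification: a non-nil error here means the context was
// cancelled (e.g. agent shutdown), so log once and stop retrying.
if err := waiter.Wait(ctx); err != nil {
	logger.Info("stopping retries for metrics sink configuration")
	return
}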
lib/telemetry.go
for _, err := range errs.WrappedErrors() {
	var dnsError *net.DNSError
	if errors.As(err, &dnsError) && dnsError.IsNotFound {
		return true
	}
}
I think you can call errors.As with the multierror directly, rather than iterating over the wrapped errors like this.
From the docs:
The resulting error supports errors.As/Is/Unwrap so you can continue to use the stdlib errors package to introspect further.
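For example, something like this should work (a sketch assuming hashicorp/go-multierror v1.1.0+, which wires up Unwrap for the stdlib errors helpers):

// errors.As traverses the multierror's wrapped errors itself, so no manual
// loop over WrappedErrors() is needed.
func isRetriableError(errs error) bool {
	var dnsError *net.DNSError
	return errors.As(errs, &dnsError) && dnsError.IsNotFound
}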
Co-authored-by: Dan Upton <daniel@floppy.co>
Awesome work! Thanks again for taking this on 🙇🏻
I really like how this has shaped up! Good stuff. I just have a few questions, most are small.
lib/telemetry.go
if _, errs := configureSinks(cfg, metricsConf.HostName, memSink); errs != nil {
	if isRetriableError(errs) && cfg.RetryFailedConfiguration {
		logger.Error("failed configure sinks", "error", multierror.Flatten(errs))
Would a warning make more sense here since we are going to retry?
Fixed
@@ -216,7 +217,9 @@ func (a *TestAgent) Start(t *testing.T) error {
	bd.Logger = logger
	// if we are not testing telemetry things, let's use a "mock" sink for metrics
	if bd.RuntimeConfig.Telemetry.Disable {
		bd.MetricsHandler = metrics.NewInmemSink(1*time.Second, time.Minute)
This is interesting because I would think that we should be able to configure multiple handlers, but this would lead me to believe that technically, we only allow a single metrics handler? It may be rare that users set more than one, but I'm wondering if this is just a deficiency in our docs. If I'm reading them right, I would assume that any configured backend would be used including if I configure more than one.
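For what it's worth, the configureSinks sketch earlier in the thread combines backends with go-metrics' FanoutSink, roughly like this (illustrative snippet; the sink variables are hypothetical):

// FanoutSink is itself a MetricSink that forwards every metric to each member,
// so several configured backends can be active at once.
var fan metrics.FanoutSink
fan = append(fan, statsdSink, dogstatsdSink, memSink) // hypothetical sink values
metrics.NewGlobal(metricsConf, fan)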
Co-authored-by: Dan Upton <daniel@floppy.co> Co-authored-by: Evan Culver <eculver@users.noreply.github.com>
After merging, confirm that you see linked PRs AND check them for CI errors.
Description
Introduce a new telemetry configurable parameter dogstatsd_exit_bad_connection. Users can set the value to false to let the consul agent continue its start process on a failed connection to the datadog server. The default behavior is true, which exits the agent. An error message is emitted in case dogstatsd_exit_bad_connection=false and the DNS name of the dogstatsd server can't be resolved.
Why not retry?
When connecting to the datadog server over UDP, it seems the only place to verify the connection is during consul's initialization phase at https://github.com/DataDog/datadog-go/blob/8bfdc335936a79b55b3e2c1080930bc5a3eb57f2/statsd/udp.go#L23
Since the datadog agent doesn't send any ack packet, it is hard to detect whether the connection is lost. Therefore, retrying may not be a good solution or help in mitigating the situation.
Follow up
To overcome the issue of lost connections and inform users of the status, shall we add dogstatsd_connection to consul's health check?
Testing & Reproduction steps
Start the consul agent with the following telemetry configuration:
Links
close #3419
PR Checklist