New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HTTP timeout to Splunk logger, fixes #43000 #43001
base: master
Are you sure you want to change the base?
Conversation
5de3fc9
to
cf3fa90
Compare
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
cf3fa90
to
af249f9
Compare
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
daemon/logger/splunk/splunk.go
Outdated
value, err := strconv.ParseInt(timeoutStr, 10, 64) | ||
|
||
if err != nil { | ||
return nil, err | ||
} | ||
timeout = time.Duration(value) * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating. Erm, I thought I commented, but looks like I never did 😬
Similar to my comment on #43100 (comment), I'm wondering if we should make the splunk-timeout
option accept a "duration" string, instead of millisecond. Let me quote that comment here, as it may clarify why I think that would be good to do;
Perhaps we should consider to parse the value as a
time.Duration
(time.ParseDuration()
) to make it easier (and less error-prone) for the user to set the right interval. Having to specify the value inms
works "ok(ish)" if you want to, say, specify a second (1000 ms
), but can become a bit awkward after that. It's also error-prone, for example, if a user wants to specify1 second
, but specifies1
and ends up with1 ms
.And looking at the code, I see we already use duration strings for
fluentd-retry-wait
, so it's a bit more consistent.So I would suggest to;
- use
time.ParseDuration()
- validate the no negative values are specified
- perhaps also a minimum / maximum (not sure what reasonable values are? between 100ms and 10s ? (open to suggestions)
Would be good to have a minimal unit test (I see there's currently no tests in this package, which is a bit unfortunate, but let's not continue that "tradition" 😅)
Happy to hear your thoughts!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, that sounds like a good suggestion. Sorry for missing it. I will get to it asap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the duration string support and am testing for negative values. I didn't enforce min/max, given that the default is no timeout, it doesn't make sense to bound max value.
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
e58d106
to
23c8bf2
Compare
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent a bit more time looking at the changes in this PR, and on closer look (I missed it at first), I don't think the current changes are correct; let me try to describe it (sorry, it's lengthy! 😅)
Looking at TestCustomHttpTimeout()
, it seems like we're not actually testing this new feature; the test is setting the timeout to 100ms
;
info := logger.Info{
Config: map[string]string{
splunkURLKey: "http://" + tcpListener.Addr().String(),
splunkTokenKey: "1234",
splunkHTTPTimeout: "100ms",
},
...
But then continues to set batchSendTimeout
to 1 second
, and verify that the test fails if it takes longer than 1 second (not 100ms
). batchSendTimeout
itself doesn't appear to be user-configurable, so this only works in the test.
Digging a bit more into the batchSendTimeout
variable, I see that was added by @cpuguy83 in #35496. The variable is used to set a timeout on the context used in postMessages()
moby/daemon/logger/splunk/splunk.go
Lines 437 to 441 in 085c6a9
func (l *splunkLogger) postMessages(messages []*splunkMessage, lastChance bool) []*splunkMessage { | |
messagesLen := len(messages) | |
ctx, cancel := context.WithTimeout(context.Background(), batchSendTimeout) | |
defer cancel() |
So why does the driver hang? (as described in #43000 and the StackOverflow question ?).
Could the problem be in verifySplunkConnection()
? Looking at the code, I see that the driver enables verifyConnection
by default;
moby/daemon/logger/splunk/splunk.go
Line 287 in 085c6a9
verifyConnection := true |
And if it's enabled, then verifySplunkConnection()
is called;
moby/daemon/logger/splunk/splunk.go
Line 296 in 085c6a9
err = verifySplunkConnection(logger) |
However, unlike postMessages()
, that function does not set a timeout on the context;
moby/daemon/logger/splunk/splunk.go
Lines 624 to 631 in 085c6a9
req, err := http.NewRequest(http.MethodOptions, l.url, nil) | |
if err != nil { | |
return err | |
} | |
resp, err := l.client.Do(req) | |
if err != nil { | |
return err | |
} |
Which likely means that currently, it waits forever to establish a connection, and never times out. To check that, I modified the code a bit;
if verifyConnection {
start := time.Now()
fmt.Println("checking connection")
err = verifySplunkConnection(logger)
if err != nil {
return nil, err
}
duration := time.Since(start)
fmt.Printf("connection ok (%v)\n", duration)
}
And updated the TestCustomHttpTimeout
test to always apply the 5 second sleep (not just for POST
);
go http.Serve(tcpListener, http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) {
if request.Method == http.MethodPost {
counter++
}
// block so connection will timeout.
time.Sleep(5000 * time.Millisecond)
writer.WriteHeader(200)
}))
When running the test with the splunk-timeout
removed, I saw that (indeed) it doesn't time out until the sleep ends;
=== RUN TestCustomHTTPTimeout
checking connection
connection ok (5.000969035s)
And with the splunk-timeout
enabled;
=== RUN TestCustomHTTPTimeout
checking connection
splunk_test.go:1429: Options "http://127.0.0.1:44205/services/collector/event/1.0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
So although this PR "fixes" the infinite wait on verifySplunkConnection()
, the current changes (setting a timeout on the client) also override the existing batchSendTimeout
, which means that all connections will now be limited to splunk-timeout
. To verify that, I again modified the test; I reverted the "sleep" changes, re-enabled the splunk-timeout
option, picked a shorter duration for the check, and modified the t.Fatal()
to print the actual duration;
duration := time.Since(start)
if duration > 100*time.Millisecond {
t.Fatalf("Sending didn't use the custom timeout: took %v", duration)
}
Now, running the test showed;
=== RUN TestCustomHTTPTimeout
checking connection
connection ok (1.012728ms)
time="2021-12-24T12:45:07Z" level=warning msg="Error while sending logs" error="Post \"http://127.0.0.1:35291/services/collector/event/1.0\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" module=logger/splunk
time="2021-12-24T12:45:07Z" level=error msg="Failed to send a message '{\"event\":{\"line\":\"message1\",\"source\":\"stdout\",\"tag\":\"containeriid\"},\"time\":\"1640349907.281913\",\"host\":\"66d62dcab3f5\"}'"
splunk_test.go:1445: Sending didn't use the custom timeout: took 101.066177ms
So, the above confirms that;
- setting
splunk-timeout
effectively overridesbatchSendTimeout
- the test passed because it was checking for the longer (
batchSendTimeout
) timeout, which would never be hit because of the above - the test would indeed be "flaky" if it would check for the configured
splunk-timeout
(101.066177ms
>100ms
, so it would likely fail in all cases)
So, how to solve?
Summarizing the current situation:
- the splunk driver has a 30 second timeout for sending logs (defined by
batchSendTimeout
) - there's no timeout for
verifySplunkConnection()
, so it will hang indefinitely if it's unable to connect (this likely is the problem that's described in the ticket)
I think there's some options here;
- we could pass a
context.Context
toverifySplunkConnection()
, and set acontext.WithTimeout()
, re-using the existingbatchSendTimeout
(30 seconds) - if 30 seconds is too long for the initial check, we could define a different timeout for it (what's a reasonable timeout to verify if the remote logger is "up"?)
(Perhaps a shorter than 30 seconds timeout for the initial check would make sense)
Does it have to be configurable?
That's the next question; so the original report was about the driver hanging forever, which would be solved by the above. Question is; do we need an config option for this?
And, if we do, that likely would require two separate options; one for the initial check, and one for the timeout to send the logs. Based on everything in this thread, I'm a bit on the fence if it needs to be configurable, and to first fix the verifySplunkConnection()
issue. If that turns out to be insufficient, we can add options to make it configurable (for the initial check we could also consider a retry
approach?).
@@ -1392,3 +1393,80 @@ func TestDeadlockOnBlockedEndpoint(t *testing.T) { | |||
case <-done: | |||
} | |||
} | |||
|
|||
func TestCustomHttpTimeout(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this should be named TestCustomHTTPTimeout
to follow Go's naming conventions
@@ -1392,3 +1393,80 @@ func TestDeadlockOnBlockedEndpoint(t *testing.T) { | |||
case <-done: | |||
} | |||
} | |||
|
|||
func TestCustomHttpTimeout(t *testing.T) { | |||
tcpAddr := &net.TCPAddr{IP: []byte{127, 0, 0, 1}, Port: 0, Zone: ""} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Port
and Zone
fields are redundant here (0
is the default value for Port
(int
), and ""
the default for Zone
(string
)). Perhaps a more idiomatic way to write this is to use net.IPv4
instead of []byte{}
;
tcpAddr := &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1)}
} | ||
counter := 0 | ||
go http.Serve(tcpListener, http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) { | ||
if request.Method == "POST" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we're already importing the net/http
package, let's use the http.MethodPost
const here;
if request.Method == http.MethodPost {
} | ||
|
||
duration := time.Since(start) | ||
if duration > 1*time.Second { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should likely take the value we set above, to prevent this check from accidentally being out of sync if that value would ever be changed;
if duration > batchSendTimeout {
I'm a bit worried that this could become a bit "flaky"; time taken could possibly be "slightly more than 1 second" and make the test randomly fail. For such cases it may be good to give it a "bit" more time (and leave a comment that describes we're doing so). Something like;
// check that the timeout is used, but give it a bit more time to prevent
// the test to become flaky.
if duration > batchSendTimeout+75*time.Millisecond {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.... so looking at this... I'm now wondering if this PR is doing the right thing (I'll leave another comment describing what I mean)
value, err := time.ParseDuration(timeoutStr) | ||
|
||
if err != nil { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should wrap (errors.Wrap()
) the error to give the error a bit more context. Currently it would produce something like time: unknown unit "msec" in duration "100msec"
, what describes what's wrong with the value, but still makes it difficult for the user to find "what value was this?". Something like:
return nil, errors.Wrapf(err, "invalid value for %s", splunkHTTPTimeout)
Which would produce:
invalid value for splunk-timeout: time: unknown unit "msec" in duration "100msec"
if timeoutStr, ok := info.Config[splunkHTTPTimeout]; ok { | ||
value, err := time.ParseDuration(timeoutStr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise other bits of this function do the same, but given that timeoutStr
is only used on these lines, it's more idiomatic to use a short variable name. Using a short name helps someone reading the code understand that it's a short-lived variable, reducing mental overhead ("is this variable used in other places?");
if v, ok := info.Config[splunkHTTPTimeout]; ok {
And now that we parse as a time.Duration
(and given that we return early if there's an error), we can get rid of the intermediate value
variable (looks like err
is already declared earlier in the function, so we can reuse that as well);
if v, ok := info.Config[splunkHTTPTimeout]; ok {
timeout, err = time.ParseDuration(v)
return nil, err | ||
} | ||
if value < 0 { | ||
return nil, errors.Errorf("negative timeout provided: %v", value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here; would be good to add a bit of context, e.g.:
return nil, errors.Errorf("invalid value for %s: value (%v) must be a positive duration", splunkHTTPTimeout, timeout)
if value < 0 { | ||
return nil, errors.Errorf("negative timeout provided: %v", value) | ||
} | ||
timeout = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment about the intermediate value
variable. With my suggestions applied the whole block would look like;
var timeout time.Duration
if v, ok := info.Config[splunkHTTPTimeout]; ok {
timeout, err = time.ParseDuration(v)
if err != nil {
return nil, errors.Wrapf(err, "invalid value for %s", splunkHTTPTimeout)
}
if timeout < 0 {
return nil, errors.Errorf("invalid value for %s: value (%v) must be a positive duration", splunkHTTPTimeout, timeout)
}
}
} | ||
counter := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps would be slightly cleaner to move the tcpListener.Close()
from the end of the test to defer
here
}
defer tcpListener.Close()
conter := 0
(I think it's ok to ignore error handling)
|
||
duration := time.Since(start) | ||
if duration > 1*time.Second { | ||
t.Fatal("Sending didn't use the custom timeout") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps include the actual time in the error;
t.Fatalf("Sending didn't use the custom timeout: took %v", duration)
Thanks again, @atoulme! I left some comments inline, and a long comment as review. It took me some time to get a bit more familiar with this part of the code, but looks like I overlooked some issues in my initial reviews (see the long comment). Let me know what you think (happy to hear / get input / suggestions), and if you need help with making the changes 👍 🤗 |
Signed-off-by: Antoine Toulme antoine@lunar-ocean.com
fixes #43000
- What I did
Made sure Splunk logging would timeout rather than block indefinitely if HTTP errors occur.
**- How I did **
Added a default HTTP timeout and a config key to set the HTTP timeout for the Splunk HEC logger.
- How to verify it
Unit test covers the use case.
- Description for the changelog
Add HTTP timeout to Splunk logger with a default of 1 second. Users can override with the
--splunk-timeout
flag.- A picture of a cute animal (not mandatory but encouraged)