Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTTP timeout to Splunk logger, fixes #43000 #43001

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

atoulme
Copy link

@atoulme atoulme commented Nov 8, 2021

Signed-off-by: Antoine Toulme antoine@lunar-ocean.com

fixes #43000

- What I did
Made sure Splunk logging would timeout rather than block indefinitely if HTTP errors occur.

**- How I did **
Added a default HTTP timeout and a config key to set the HTTP timeout for the Splunk HEC logger.

- How to verify it
Unit test covers the use case.

- Description for the changelog
Add HTTP timeout to Splunk logger with a default of 1 second. Users can override with the --splunk-timeout flag.

- A picture of a cute animal (not mandatory but encouraged)
image

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
daemon/logger/splunk/splunk.go Outdated Show resolved Hide resolved
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
Comment on lines 234 to 239
value, err := strconv.ParseInt(timeoutStr, 10, 64)

if err != nil {
return nil, err
}
timeout = time.Duration(value) * time.Millisecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating. Erm, I thought I commented, but looks like I never did 😬

Similar to my comment on #43100 (comment), I'm wondering if we should make the splunk-timeout option accept a "duration" string, instead of millisecond. Let me quote that comment here, as it may clarify why I think that would be good to do;

Perhaps we should consider to parse the value as a time.Duration (time.ParseDuration()) to make it easier (and less error-prone) for the user to set the right interval. Having to specify the value in ms works "ok(ish)" if you want to, say, specify a second (1000 ms), but can become a bit awkward after that. It's also error-prone, for example, if a user wants to specify 1 second, but specifies 1 and ends up with 1 ms.

And looking at the code, I see we already use duration strings for fluentd-retry-wait, so it's a bit more consistent.

So I would suggest to;

  • use time.ParseDuration()
  • validate the no negative values are specified
  • perhaps also a minimum / maximum (not sure what reasonable values are? between 100ms and 10s ? (open to suggestions)

Would be good to have a minimal unit test (I see there's currently no tests in this package, which is a bit unfortunate, but let's not continue that "tradition" 😅)

Happy to hear your thoughts!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, that sounds like a good suggestion. Sorry for missing it. I will get to it asap.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the duration string support and am testing for negative values. I didn't enforce min/max, given that the default is no timeout, it doesn't make sense to bound max value.

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent a bit more time looking at the changes in this PR, and on closer look (I missed it at first), I don't think the current changes are correct; let me try to describe it (sorry, it's lengthy! 😅)

Looking at TestCustomHttpTimeout(), it seems like we're not actually testing this new feature; the test is setting the timeout to 100ms;

info := logger.Info{
    Config: map[string]string{
        splunkURLKey:      "http://" + tcpListener.Addr().String(),
        splunkTokenKey:    "1234",
        splunkHTTPTimeout: "100ms",
    },
    ...

But then continues to set batchSendTimeout to 1 second, and verify that the test fails if it takes longer than 1 second (not 100ms). batchSendTimeout itself doesn't appear to be user-configurable, so this only works in the test.

Digging a bit more into the batchSendTimeout variable, I see that was added by @cpuguy83 in #35496. The variable is used to set a timeout on the context used in postMessages()

func (l *splunkLogger) postMessages(messages []*splunkMessage, lastChance bool) []*splunkMessage {
messagesLen := len(messages)
ctx, cancel := context.WithTimeout(context.Background(), batchSendTimeout)
defer cancel()

So why does the driver hang? (as described in #43000 and the StackOverflow question ?).

Could the problem be in verifySplunkConnection() ? Looking at the code, I see that the driver enables verifyConnection by default;

verifyConnection := true

And if it's enabled, then verifySplunkConnection() is called;

err = verifySplunkConnection(logger)

However, unlike postMessages(), that function does not set a timeout on the context;

req, err := http.NewRequest(http.MethodOptions, l.url, nil)
if err != nil {
return err
}
resp, err := l.client.Do(req)
if err != nil {
return err
}

Which likely means that currently, it waits forever to establish a connection, and never times out. To check that, I modified the code a bit;

if verifyConnection {
	start := time.Now()
	fmt.Println("checking connection")
	err = verifySplunkConnection(logger)
	if err != nil {
		return nil, err
	}
	duration := time.Since(start)
	fmt.Printf("connection ok (%v)\n", duration)
}

And updated the TestCustomHttpTimeout test to always apply the 5 second sleep (not just for POST);

go http.Serve(tcpListener, http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) {
	if request.Method == http.MethodPost {
		counter++
	}
	// block so connection will timeout.
	time.Sleep(5000 * time.Millisecond)
	writer.WriteHeader(200)
}))

When running the test with the splunk-timeout removed, I saw that (indeed) it doesn't time out until the sleep ends;

=== RUN   TestCustomHTTPTimeout
checking connection
connection ok (5.000969035s)

And with the splunk-timeout enabled;

=== RUN   TestCustomHTTPTimeout
checking connection
    splunk_test.go:1429: Options "http://127.0.0.1:44205/services/collector/event/1.0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

So although this PR "fixes" the infinite wait on verifySplunkConnection(), the current changes (setting a timeout on the client) also override the existing batchSendTimeout, which means that all connections will now be limited to splunk-timeout. To verify that, I again modified the test; I reverted the "sleep" changes, re-enabled the splunk-timeout option, picked a shorter duration for the check, and modified the t.Fatal() to print the actual duration;

duration := time.Since(start)
if duration > 100*time.Millisecond {
	t.Fatalf("Sending didn't use the custom timeout: took %v", duration)
}

Now, running the test showed;

=== RUN   TestCustomHTTPTimeout
checking connection
connection ok (1.012728ms)
time="2021-12-24T12:45:07Z" level=warning msg="Error while sending logs" error="Post \"http://127.0.0.1:35291/services/collector/event/1.0\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" module=logger/splunk
time="2021-12-24T12:45:07Z" level=error msg="Failed to send a message '{\"event\":{\"line\":\"message1\",\"source\":\"stdout\",\"tag\":\"containeriid\"},\"time\":\"1640349907.281913\",\"host\":\"66d62dcab3f5\"}'"
    splunk_test.go:1445: Sending didn't use the custom timeout: took 101.066177ms

So, the above confirms that;

  • setting splunk-timeout effectively overrides batchSendTimeout
  • the test passed because it was checking for the longer (batchSendTimeout) timeout, which would never be hit because of the above
  • the test would indeed be "flaky" if it would check for the configured splunk-timeout (101.066177ms > 100ms, so it would likely fail in all cases)

So, how to solve?

Summarizing the current situation:

  • the splunk driver has a 30 second timeout for sending logs (defined by batchSendTimeout)
  • there's no timeout for verifySplunkConnection(), so it will hang indefinitely if it's unable to connect (this likely is the problem that's described in the ticket)

I think there's some options here;

  • we could pass a context.Context to verifySplunkConnection(), and set a context.WithTimeout(), re-using the existing batchSendTimeout (30 seconds)
  • if 30 seconds is too long for the initial check, we could define a different timeout for it (what's a reasonable timeout to verify if the remote logger is "up"?)

(Perhaps a shorter than 30 seconds timeout for the initial check would make sense)

Does it have to be configurable?

That's the next question; so the original report was about the driver hanging forever, which would be solved by the above. Question is; do we need an config option for this?

And, if we do, that likely would require two separate options; one for the initial check, and one for the timeout to send the logs. Based on everything in this thread, I'm a bit on the fence if it needs to be configurable, and to first fix the verifySplunkConnection() issue. If that turns out to be insufficient, we can add options to make it configurable (for the initial check we could also consider a retry approach?).

@@ -1392,3 +1393,80 @@ func TestDeadlockOnBlockedEndpoint(t *testing.T) {
case <-done:
}
}

func TestCustomHttpTimeout(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should be named TestCustomHTTPTimeout to follow Go's naming conventions

@@ -1392,3 +1393,80 @@ func TestDeadlockOnBlockedEndpoint(t *testing.T) {
case <-done:
}
}

func TestCustomHttpTimeout(t *testing.T) {
tcpAddr := &net.TCPAddr{IP: []byte{127, 0, 0, 1}, Port: 0, Zone: ""}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Port and Zone fields are redundant here (0 is the default value for Port (int), and "" the default for Zone (string)). Perhaps a more idiomatic way to write this is to use net.IPv4 instead of []byte{};

tcpAddr := &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1)}

}
counter := 0
go http.Serve(tcpListener, http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) {
if request.Method == "POST" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're already importing the net/http package, let's use the http.MethodPost const here;

if request.Method == http.MethodPost {

}

duration := time.Since(start)
if duration > 1*time.Second {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely take the value we set above, to prevent this check from accidentally being out of sync if that value would ever be changed;

if duration > batchSendTimeout {

I'm a bit worried that this could become a bit "flaky"; time taken could possibly be "slightly more than 1 second" and make the test randomly fail. For such cases it may be good to give it a "bit" more time (and leave a comment that describes we're doing so). Something like;

// check that the timeout is used, but give it a bit more time to prevent
// the test to become flaky.
if duration > batchSendTimeout+75*time.Millisecond {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.... so looking at this... I'm now wondering if this PR is doing the right thing (I'll leave another comment describing what I mean)

value, err := time.ParseDuration(timeoutStr)

if err != nil {
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should wrap (errors.Wrap()) the error to give the error a bit more context. Currently it would produce something like time: unknown unit "msec" in duration "100msec", what describes what's wrong with the value, but still makes it difficult for the user to find "what value was this?". Something like:

return nil, errors.Wrapf(err, "invalid value for %s", splunkHTTPTimeout)

Which would produce:

invalid value for splunk-timeout: time: unknown unit "msec" in duration "100msec"

Comment on lines +234 to +235
if timeoutStr, ok := info.Config[splunkHTTPTimeout]; ok {
value, err := time.ParseDuration(timeoutStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise other bits of this function do the same, but given that timeoutStr is only used on these lines, it's more idiomatic to use a short variable name. Using a short name helps someone reading the code understand that it's a short-lived variable, reducing mental overhead ("is this variable used in other places?");

if v, ok := info.Config[splunkHTTPTimeout]; ok {

And now that we parse as a time.Duration (and given that we return early if there's an error), we can get rid of the intermediate value variable (looks like err is already declared earlier in the function, so we can reuse that as well);

if v, ok := info.Config[splunkHTTPTimeout]; ok {
	timeout, err = time.ParseDuration(v)

return nil, err
}
if value < 0 {
return nil, errors.Errorf("negative timeout provided: %v", value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here; would be good to add a bit of context, e.g.:

return nil, errors.Errorf("invalid value for %s: value (%v) must be a positive duration", splunkHTTPTimeout, timeout)

if value < 0 {
return nil, errors.Errorf("negative timeout provided: %v", value)
}
timeout = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment about the intermediate value variable. With my suggestions applied the whole block would look like;

var timeout time.Duration
if v, ok := info.Config[splunkHTTPTimeout]; ok {
	timeout, err = time.ParseDuration(v)
	if err != nil {
		return nil, errors.Wrapf(err, "invalid value for %s", splunkHTTPTimeout)
	}
	if timeout < 0 {
		return nil, errors.Errorf("invalid value for %s: value (%v) must be a positive duration", splunkHTTPTimeout, timeout)
	}
}

Comment on lines +1402 to +1403
}
counter := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps would be slightly cleaner to move the tcpListener.Close() from the end of the test to defer here

}
defer tcpListener.Close()
conter := 0

(I think it's ok to ignore error handling)


duration := time.Since(start)
if duration > 1*time.Second {
t.Fatal("Sending didn't use the custom timeout")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps include the actual time in the error;

t.Fatalf("Sending didn't use the custom timeout: took %v", duration)

@thaJeztah
Copy link
Member

Thanks again, @atoulme! I left some comments inline, and a long comment as review. It took me some time to get a bit more familiar with this part of the code, but looks like I overlooked some issues in my initial reviews (see the long comment).

Let me know what you think (happy to hear / get input / suggestions), and if you need help with making the changes 👍 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add HTTP timeout to Splunk logging
3 participants