
File descriptor / HTTP client connection leak on telegraf reload #15177

Closed
nick-kentik opened this issue Apr 17, 2024 · 4 comments · Fixed by #15213
Labels
bug (unexpected problem or unintended behavior)

Comments

@nick-kentik
Contributor

Relevant telegraf.conf

[[inputs.prometheus]]
alias = "foo"
urls = ["http://localhost/metrics"]

Logs from Telegraf

E! [inputs.prometheus::foo] Error in plugin: error making HTTP request to "http://localhost/metrics": Get "http://localhost/metrics": dial tcp 127.0.0.1:80: socket: too many open files

System info

Telegraf 1.27.4

Docker

No response

Steps to reproduce

  1. Start telegraf with a minimal prometheus input config
  2. Issue a large number of config reloads (e.g. systemctl reload telegraf)
  3. Watch the number of open file descriptors climb over time (lsof -np <telegraf-pid>)
  4. Once open file descriptor limits are exceeded, metrics stop flowing

Expected behavior

Config reloads should not cause file descriptors to accumulate

Actual behavior

File descriptors accumulate across reloads until the process hits EMFILE (too many open files) and metrics stop flowing.

Additional info

We're pushing telegraf 1.27.4 everywhere, with a nofile=16384 limit. Just after a restart, we hover around 200-300 file descriptors open, but within 24 hours we're seeing the 16384 limit exceeded on busier hosts.

Our infrastructure automatically issues systemctl reload telegraf commands quite a lot.

Digging into the code a bit, I see new http.Transport instances (wrapped in http.Client) created by the prometheus input plugin each time it's instantiated, but CloseIdleConnections() is never called on them - so when the prometheus.Prometheus struct is discarded at reload, its idle connections hang around forever.

There is a configuration workaround - set the idle_conn_timeout config value to something non-zero (a sketch of this is shown right after the diff below) - but at minimum I think telegraf needs:

diff --git a/plugins/inputs/prometheus/prometheus.go b/plugins/inputs/prometheus/prometheus.go
index 82805ae86..74f55a973 100644
--- a/plugins/inputs/prometheus/prometheus.go
+++ b/plugins/inputs/prometheus/prometheus.go
@@ -584,6 +584,9 @@ func (p *Prometheus) Start(_ telegraf.Accumulator) error {
 func (p *Prometheus) Stop() {
        p.cancel()
        p.wg.Wait()
+       if p.client != nil {
+               p.client.CloseIdleConnections()
+       }
 }
 
 func init() {
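
For reference, the configuration workaround looks roughly like this. The placement under [[inputs.prometheus]] and the "90s" value are illustrative; any non-zero duration should behave the same way:

[[inputs.prometheus]]
alias = "foo"
urls = ["http://localhost/metrics"]
# A non-zero timeout lets the transport reap idle connections on its own,
# even for clients abandoned by a reload.
idle_conn_timeout = "90s"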

This is an issue rather than a PR because the http client comes from the shared common HTTP client config, so it is presumably used elsewhere - all call sites should be audited.

There's also a uClient with a custom transport that is created per-gather(!) in the prometheus input when hitting a UNIX socket URL. I've not dug into that at all, but would suggest some eyes on it to make sure it's OK.
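
I haven't read the uClient code, so purely as a generic illustration of why a per-gather transport is risky (a hypothetical stdlib-only sketch, not Telegraf code; gatherOnce and the URL are made up):

package main

import (
	"io"
	"log"
	"net/http"
)

// gatherOnce creates a fresh transport for a single request. If the server
// supports keep-alive and the body is fully read, the connection goes into
// this transport's idle pool; the blocked read loop keeps it (and its file
// descriptor) alive even after the transport goes out of scope, unless
// CloseIdleConnections() is called or keep-alives are disabled.
func gatherOnce(url string) error {
	tr := &http.Transport{}
	client := &http.Client{Transport: tr}

	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}

	// Without this, the connection opened for this gather can outlive the
	// transport itself.
	tr.CloseIdleConnections()
	return nil
}

func main() {
	// Hypothetical target; any HTTP endpoint works for the illustration.
	if err := gatherOnce("http://localhost/metrics"); err != nil {
		log.Println(err)
	}
}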

@nick-kentik added the bug (unexpected problem or unintended behavior) label on Apr 17, 2024
@nick-kentik changed the title from "file descriptor / HTTP client connection leak on telegraf reload" to "File descriptor / HTTP client connection leak on telegraf reload" on Apr 17, 2024
@srebhan
Member

srebhan commented Apr 17, 2024

@nick-kentik please submit your change as a PR!

@nick-kentik
Contributor Author

@srebhan as I said, I don't think it's a complete fix

@nick-kentik
Contributor Author

Yeah, picking one at random, it looks like the kibana input plugin has the same issue, for instance.

A complete fix would need to audit each of these files and make sure CloseIdleConnections() is called, or perhaps consider a different approach (a non-zero default for idle_conn_timeout, perhaps? see the sketch after the list below):

plugins/outputs/http/http.go:94:	client, err := h.HTTPClientConfig.CreateClient(ctx, h.Log)
plugins/outputs/cloudwatch/cloudwatch.go:167:	client, err := c.HTTPClientConfig.CreateClient(ctx, c.Log)
plugins/inputs/vault/vault.go:64:	client, err := n.HTTPClientConfig.CreateClient(ctx, n.Log)
plugins/inputs/prometheus/prometheus.go:216:	client, err := p.HTTPClientConfig.CreateClient(ctx, p.Log)
plugins/inputs/elasticsearch_query/elasticsearch_query.go:202:	return e.HTTPClientConfig.CreateClient(ctx, e.Log)
plugins/inputs/logstash/logstash.go:139:	return logstash.HTTPClientConfig.CreateClient(ctx, logstash.Log)
plugins/inputs/http/http.go:70:	client, err := h.HTTPClientConfig.CreateClient(ctx, h.Log)
plugins/inputs/elasticsearch/elasticsearch.go:291:	return e.HTTPClientConfig.CreateClient(ctx, e.Log)
plugins/inputs/ctrlx_datalayer/ctrlx_datalayer.go:280:	c.connection, err = c.HTTPClientConfig.CreateClient(ctx, c.Log)
plugins/inputs/kibana/kibana.go:155:	return k.HTTPClientConfig.CreateClient(ctx, k.Log)
plugins/inputs/openstack/openstack.go:151:	client, err := o.HTTPClientConfig.CreateClient(ctx, o.Log)
plugins/secretstores/http/http.go:50:	client, err := h.HTTPClientConfig.CreateClient(ctx, h.Log)
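
To illustrate the idle_conn_timeout alternative: a minimal stdlib-only sketch (not Telegraf code; newClient and the 90s value are made up) of why a non-zero idle timeout avoids the leak even when nobody calls CloseIdleConnections() on a discarded client. Telegraf's idle_conn_timeout setting presumably ends up on this same net/http transport field.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient builds a client whose transport closes idle connections on its
// own once IdleConnTimeout elapses. With the zero value (no timeout), idle
// connections stay open until CloseIdleConnections() is called explicitly,
// which is exactly what a client discarded on reload never does.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 90 * time.Second, // example value only
		},
	}
}

func main() {
	client := newClient()
	// Even if this client is later abandoned without CloseIdleConnections(),
	// any idle keep-alive connections it opened close themselves once the
	// idle timeout elapses.
	fmt.Println(client.Transport.(*http.Transport).IdleConnTimeout)
}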

The config workaround doesn't work for me, since our config files need to work across both 1.21.4 and 1.27.4 at the moment. I'm also unlikely to have time to dedicate to upstream work for a bit.

@nick-kentik
Contributor Author

PR: #15213
