
File descriptor / HTTP client connection leak on telegraf reload #15177

Closed
nick-kentik opened this issue Apr 17, 2024 · 4 comments · Fixed by #15213
Labels
bug (unexpected problem or unintended behavior)

Comments

@nick-kentik
Contributor

Relevant telegraf.conf

[[inputs.prometheus]]
alias = "foo"
urls = ["http://localhost/metrics"]

Logs from Telegraf

E! [inputs.prometheus::foo] Error in plugin: error making HTTP request to "http://localhost/metrics": Get "http://localhost/metrics": dial tcp 127.0.0.1:80: socket: too many open files

System info

Telegraf 1.27.4

Docker

No response

Steps to reproduce

  1. Start telegraf with a minimal prometheus input config
  2. Issue a large number of config reloads (e.g. systemctl reload telegraf)
  3. Watch the number of open file descriptors climb over time (lsof -np <telegraf-pid>)
  4. Once open file descriptor limits are exceeded, metrics stop flowing

Expected behavior

Config reloads should not cause file descriptors to accumulate

Actual behavior

File descriptors accumulate across reloads until the process hits EMFILE (too many open files) and metrics stop flowing.

Additional info

We're pushing telegraf 1.27.4 everywhere, with a nofile=16384 limit. Just after a restart, we hover around 200-300 file descriptors open, but within 24 hours we're seeing the 16384 limit exceeded on busier hosts.

Our infrastructure automatically issues systemctl reload telegraf commands quite a lot.

Digging into the code a bit, I see new http.Transport instances (wrapped in http.Client) created by the prometheus input plugin each time it's instantiated, but CloseIdleConnections() is never called on them - so when the prometheus.Prometheus struct is discarded at reload, its idle connections hang around forever.

There is a configuration workaround - set the idle_conn_timeout config value to something non-zero (a sketch of this is shown right after the diff below) - but at minimum I think telegraf needs:

diff --git a/plugins/inputs/prometheus/prometheus.go b/plugins/inputs/prometheus/prometheus.go
index 82805ae86..74f55a973 100644
--- a/plugins/inputs/prometheus/prometheus.go
+++ b/plugins/inputs/prometheus/prometheus.go
@@ -584,6 +584,9 @@ func (p *Prometheus) Start(_ telegraf.Accumulator) error {
 func (p *Prometheus) Stop() {
        p.cancel()
        p.wg.Wait()
+       if p.client != nil {
+               p.client.CloseIdleConnections()
+       }
 }
 
 func init() {
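
For reference, the configuration workaround looks roughly like this. The placement under [[inputs.prometheus]] and the "90s" value are illustrative; any non-zero duration should behave the same way:

[[inputs.prometheus]]
alias = "foo"
urls = ["http://localhost/metrics"]
# A non-zero timeout lets the transport reap idle connections on its own,
# even for clients abandoned by a reload.
idle_conn_timeout = "90s"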

This is an issue rather than a PR because the http client comes from the shared common HTTP client config, so it is presumably used elsewhere - all call sites should be audited.

There's also a uClient with a custom transport that is created per-gather(!) in the prometheus input when hitting a UNIX socket URL. I've not dug into that at all, but would suggest some eyes on it to make sure it's OK.
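
I haven't read the uClient code, so purely as a generic illustration of why a per-gather transport is risky (a hypothetical stdlib-only sketch, not Telegraf code; gatherOnce and the URL are made up):

package main

import (
	"io"
	"log"
	"net/http"
)

// gatherOnce creates a fresh transport for a single request. If the server
// supports keep-alive and the body is fully read, the connection goes into
// this transport's idle pool; the blocked read loop keeps it (and its file
// descriptor) alive even after the transport goes out of scope, unless
// CloseIdleConnections() is called or keep-alives are disabled.
func gatherOnce(url string) error {
	tr := &http.Transport{}
	client := &http.Client{Transport: tr}

	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}

	// Without this, the connection opened for this gather can outlive the
	// transport itself.
	tr.CloseIdleConnections()
	return nil
}

func main() {
	// Hypothetical target; any HTTP endpoint works for the illustration.
	if err := gatherOnce("http://localhost/metrics"); err != nil {
		log.Println(err)
	}
}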

@nick-kentik added the bug (unexpected problem or unintended behavior) label on Apr 17, 2024
@nick-kentik changed the title from "file descriptor / HTTP client connection leak on telegraf reload" to "File descriptor / HTTP client connection leak on telegraf reload" on Apr 17, 2024
@srebhan
Member

srebhan commented Apr 17, 2024

@nick-kentik please submit your change as a PR!

@nick-kentik
Contributor Author

@srebhan as I said, I don't think it's a complete fix

@nick-kentik
Contributor Author

Yeah, picking one at random, it looks like the kibana input plugin has the same issue, for instance.

A complete fix would need to audit each of these files and make sure CloseIdleConnections() is called, or perhaps consider a different approach (a non-zero default for idle_conn_timeout, perhaps? see the sketch after the list below):

plugins/outputs/http/http.go:94:	client, err := h.HTTPClientConfig.CreateClient(ctx, h.Log)
plugins/outputs/cloudwatch/cloudwatch.go:167:	client, err := c.HTTPClientConfig.CreateClient(ctx, c.Log)
plugins/inputs/vault/vault.go:64:	client, err := n.HTTPClientConfig.CreateClient(ctx, n.Log)
plugins/inputs/prometheus/prometheus.go:216:	client, err := p.HTTPClientConfig.CreateClient(ctx, p.Log)
plugins/inputs/elasticsearch_query/elasticsearch_query.go:202:	return e.HTTPClientConfig.CreateClient(ctx, e.Log)
plugins/inputs/logstash/logstash.go:139:	return logstash.HTTPClientConfig.CreateClient(ctx, logstash.Log)
plugins/inputs/http/http.go:70:	client, err := h.HTTPClientConfig.CreateClient(ctx, h.Log)
plugins/inputs/elasticsearch/elasticsearch.go:291:	return e.HTTPClientConfig.CreateClient(ctx, e.Log)
plugins/inputs/ctrlx_datalayer/ctrlx_datalayer.go:280:	c.connection, err = c.HTTPClientConfig.CreateClient(ctx, c.Log)
plugins/inputs/kibana/kibana.go:155:	return k.HTTPClientConfig.CreateClient(ctx, k.Log)
plugins/inputs/openstack/openstack.go:151:	client, err := o.HTTPClientConfig.CreateClient(ctx, o.Log)
plugins/secretstores/http/http.go:50:	client, err := h.HTTPClientConfig.CreateClient(ctx, h.Log)
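
To illustrate the idle_conn_timeout alternative: a minimal stdlib-only sketch (not Telegraf code; newClient and the 90s value are made up) of why a non-zero idle timeout avoids the leak even when nobody calls CloseIdleConnections() on a discarded client. Telegraf's idle_conn_timeout setting presumably ends up on this same net/http transport field.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient builds a client whose transport closes idle connections on its
// own once IdleConnTimeout elapses. With the zero value (no timeout), idle
// connections stay open until CloseIdleConnections() is called explicitly,
// which is exactly what a client discarded on reload never does.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 90 * time.Second, // example value only
		},
	}
}

func main() {
	client := newClient()
	// Even if this client is later abandoned without CloseIdleConnections(),
	// any idle keep-alive connections it opened close themselves once the
	// idle timeout elapses.
	fmt.Println(client.Transport.(*http.Transport).IdleConnTimeout)
}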

The config workaround doesn't work for me, since our config files need to work across both 1.21.4 and 1.27.4 at the moment. I'm also unlikely to have time to dedicate to upstream work for a bit.

@nick-kentik
Contributor Author

PR: #15213
