Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close running outputs when reloading #8769

Merged
merged 10 commits into from
Mar 16, 2021
Merged

Close running outputs when reloading #8769

merged 10 commits into from
Mar 16, 2021

Conversation

viperstars
Copy link
Contributor

@viperstars viperstars commented Jan 28, 2021

closes #8726, closes #8185

As we know, telegraf will reload when getting a SIGHUP signal. If we go deeper into the reload loop, we can see "runAgent" is called without closing running outputs.

This will cause some issues like reload failure and connection leaking.

  • Reload failure
  1. Add prometheus configuration
[[outputs.prometheus_client]]
  listen = ":9273"
  path = "/metrics"
  metric_version = 2
  1. Send a SIGHUP to the process

  2. We will get the "address already in use" error:

2021-01-28T08:47:59Z I! Starting Telegraf 
2021-01-28T08:47:59Z I! Loaded inputs: x509_cert
2021-01-28T08:47:59Z I! Loaded aggregators: 
2021-01-28T08:47:59Z I! Loaded processors: 
2021-01-28T08:47:59Z I! Loaded outputs: prometheus_client
2021-01-28T08:47:59Z I! Tags enabled: host=Zachary-MBP.local
2021-01-28T08:47:59Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"Zachary-MBP.local", Flush Interval:20s
2021-01-28T08:47:59Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2021-01-28T08:48:12Z I! Reloading Telegraf config
2021-01-28T08:48:12Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-01-28T08:48:12Z I! Starting Telegraf 
2021-01-28T08:48:12Z I! Loaded inputs: x509_cert
2021-01-28T08:48:12Z I! Loaded aggregators: 
2021-01-28T08:48:12Z I! Loaded processors: 
2021-01-28T08:48:12Z I! Loaded outputs: prometheus_client
2021-01-28T08:48:12Z I! Tags enabled: host=Zachary-MBP.local
2021-01-28T08:48:12Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"Zachary-MBP.local", Flush Interval:20s
2021-01-28T08:48:12Z E! [agent] Failed to connect to [outputs.prometheus_client], retrying in 15s, error was 'listen tcp :9273: bind: address already in use'
2021-01-28T08:48:27Z E! [telegraf] Error running agent: connecting output outputs.prometheus_client: Error connecting to output "outputs.prometheus_client": listen tcp :9273: bind: address already in use
  • Connection leaking
  1. Add influxdb output
[[outputs.influxdb_v2]]
urls = ["http://127.0.0.1:8086"]
  1. Sent SIGHUP several times.

  2. We can find more than one tcp connection to influxdb server:

Zachary-MBP:telegraf zachary$ lsof -p 34680
COMMAND    PID    USER   FD     TYPE             DEVICE  SIZE/OFF                NODE NAME
telegraf 34680 zachary  cwd      DIR                1,4      1408            26952554 /Users/zachary/Projects/golang/telegraf
telegraf 34680 zachary  txt      REG                1,4 117061680            27081706 /Users/zachary/Projects/golang/telegraf/telegraf
telegraf 34680 zachary  txt      REG                1,4     34032            24991620 /Library/Preferences/Logging/.plist-cache.eRVHsOgp
telegraf 34680 zachary  txt      REG                1,4     62747            26934247 /private/var/db/analyticsd/events.whitelist
telegraf 34680 zachary  txt      REG                1,4   2528384 1152921500312783762 /usr/lib/dyld
telegraf 34680 zachary  txt      REG                1,4  30148944 1152921500312795328 /usr/share/icu/icudt66l.dat
telegraf 34680 zachary    0u     CHR               16,1   0t14589                1483 /dev/ttys001
telegraf 34680 zachary    1u     CHR               16,1   0t14589                1483 /dev/ttys001
telegraf 34680 zachary    2u     CHR               16,1   0t14589                1483 /dev/ttys001
telegraf 34680 zachary    3     PIPE 0xdbb151f6d2284bc1     16384                     ->0x500c3504a5abf149
telegraf 34680 zachary    4     PIPE 0x500c3504a5abf149     16384                     ->0xdbb151f6d2284bc1
telegraf 34680 zachary    5u  KQUEUE                                                  count=0, state=0xa
telegraf 34680 zachary    6     PIPE 0x7060bc4282b87de5     16384                     ->0x58c782c11bf250d
telegraf 34680 zachary    7     PIPE  0x58c782c11bf250d     16384                     ->0x7060bc4282b87de5
telegraf 34680 zachary    8u    unix 0xba652578907083cf       0t0                     ->0xba6525787cd608e7
telegraf 34680 zachary    9r     CHR               17,1    0t4096                 593 /dev/urandom
telegraf 34680 zachary   10u    IPv4 0xba6525789e21f5d7       0t0                 TCP localhost:64693->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   11u    IPv4 0xba65257897816267       0t0                 TCP localhost:64841->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   12u    IPv4 0xba65257888979267       0t0                 TCP localhost:64847->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   13u    IPv4 0xba6525789ad5b267       0t0                 TCP localhost:64853->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   14u    IPv4 0xba65257898eb0fef       0t0                 TCP localhost:64859->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   15u    IPv4 0xba6525789e129697       0t0                 TCP localhost:64865->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   16u    IPv4 0xba6525789ad5d0af       0t0                 TCP localhost:64870->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   17u    IPv4 0xba6525788bc6f267       0t0                 TCP localhost:64876->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   18u    IPv4 0xba6525789e13fc7f       0t0                 TCP localhost:64884->localhost:8086 (ESTABLISHED)
  1. This issue was mentioned before. "client.Close()" was add in the influxdb output, but "(i *InfluxDB) Close()" method is never called during the reload.

I add some code to fix this.

Required for all PRs:

  • Associated README.md updated.
  • Has appropriate unit tests.

Copy link
Contributor

@telegraf-tiger telegraf-tiger bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤝 ✅ CLA has been signed. Thank you!

@helenosheaa helenosheaa added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Feb 4, 2021
cmd/telegraf/telegraf.go Outdated Show resolved Hide resolved
cmd/telegraf/telegraf.go Outdated Show resolved Hide resolved
cmd/telegraf/telegraf.go Outdated Show resolved Hide resolved
Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set the global agent variable directly from withinrunAgent() and please also think about a better name for the global variable. How about runningAgent or something.

@srebhan srebhan self-assigned this Feb 5, 2021
@viperstars
Copy link
Contributor Author

Please set the global agent variable directly from withinrunAgent() and please also think about a better name for the global variable. How about runningAgent or something.

I initalize a global variable named "runningAgent". Move closing outputs into the runAgent() before call agent.NewAgent(c)

cmd/telegraf/telegraf.go Outdated Show resolved Hide resolved
Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good already. Just one more minor request, please set the agent to a defined state in case of agent.NewAgent() fails. I know this call cannot fail currently, but it might fail in the future and I don't want to introduce later hard-to-find issues by assuming the above function to return something sensible in case of error.

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Feb 9, 2021
@viperstars
Copy link
Contributor Author

related issue: #8726 #8185

@srebhan
Copy link
Contributor

srebhan commented Feb 10, 2021

@viperstars can you please add closes #8726, #8185 to your PR description to allow automatic closing of the tickets!?

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me.

Copy link
Contributor

@ssoroka ssoroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right approach. The change has to be internal to the agent after the output has finished processing its writes.

@srebhan srebhan removed the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Mar 2, 2021
@viperstars viperstars requested a review from ssoroka March 5, 2021 06:07
agent/agent.go Show resolved Hide resolved
agent/agent.go Show resolved Hide resolved
agent/agent.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ssoroka ssoroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better, thank you!

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me if you remove the extra log-message as @ssoroka requested/suggested.

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for one more request, but can you move the body of stopRunningOutputs() to the location of the function call. That is dissolve stopRunningOutputs() and fold into the gatherLoop() function. It's only a for loop that survived, wrapping it in a function is not worth it.

@viperstars
Copy link
Contributor Author

Sorry for one more request, but can you move the body of stopRunningOutputs() to the location of the function call. That is dissolve stopRunningOutputs() and fold into the gatherLoop() function. It's only a for loop that survived, wrapping it in a function is not worth it.

I think we should keep it. I add "stopRunningOutputs" based on the existing function "stopServiceInputs". These two functions work in the same way, so the definition and call should be same.

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Looks good to me. @ssoroka any comments?

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Mar 16, 2021
@reimda reimda changed the title Add closing running outputs while reloading Close running outputs when reloading Mar 16, 2021
@reimda reimda merged commit 71757e8 into influxdata:master Mar 16, 2021
reimda pushed a commit that referenced this pull request Apr 7, 2021
jblesener pushed a commit to jblesener/telegraf that referenced this pull request Apr 18, 2021
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
6 participants