[bug] prometheus_client stops gathering metrics after receiving HUP #3049

Closed
freeseacher opened this issue Jul 24, 2017 · 10 comments · Fixed by #3053

@freeseacher
Contributor

Bug report

Relevant telegraf.conf:


[global_tags]
dc = "DC" # will tag all metrics with dc=DC
env = "prod"

[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 10000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  quiet = true
  hostname = "ddiscovery02"

[[inputs.mem]]

[[outputs.prometheus_client]]
  ## Address to listen on
  listen = "0.0.0.0:9126"
  expiration_interval = "10s"

System info:

Telegraf v1.3.4 (git: release-1.3 7bbd3da)
RHEL 7.3

Steps to reproduce:

  1. Add the config above.
  2. Start telegraf.
  3. Run curl http://0.0.0.0:9126/metrics and confirm the mem metrics are present.
  4. Run systemctl reload telegraf.
  5. Run curl http://0.0.0.0:9126/metrics again and check whether the mem metrics are still present.
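
The steps above can be scripted roughly as follows. This is only a sketch: `count_mem` and `check_reload` are hypothetical helper names, the `mem_` prefix is an assumption about how the mem input's series are named in the exposition output, and the 15 second default wait is chosen only so that it exceeds the `expiration_interval = "10s"` from the config.

```shell
#!/bin/sh
# Listener address taken from the config above; override via the environment.
URL="${URL:-http://0.0.0.0:9126/metrics}"
# Seconds to wait after the reload (assumed: longer than expiration_interval).
WAIT="${WAIT:-15}"

# Hypothetical helper: read Prometheus text exposition from stdin and print
# the number of sample lines whose metric name starts with "mem_".
# "# HELP" / "# TYPE" comment lines do not match the anchored pattern.
count_mem() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that.
  grep -c '^mem_' || true
}

# Hypothetical helper: compare the mem series count before and after a reload.
check_reload() {
  before=$(curl -s "$URL" | count_mem)
  systemctl reload telegraf
  sleep "$WAIT"
  after=$(curl -s "$URL" | count_mem)
  echo "mem series before reload: $before, after: $after"
}
```

Calling `check_reload` on a host that shows the behavior described in this report should print a nonzero count before the reload and 0 after it.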

Expected behavior:

After a reload, the mem metrics are still present.

Actual behavior:

Only prometheus_client's own metrics appear in the output.

Additional info:

https://gist.github.com/freeseacher/e8647e49591f21b348accd8201bb6173

@freeseacher
Contributor Author

I get the same behaviour on Ubuntu with 1.2.1.
Reload appears to work as expected in 1.1.1.

@freeseacher
Contributor Author

It seems there were already problems with this in 54c9a38 and #2309.

@danielnelson
Contributor

Are the metrics restored after the next collection interval?

@freeseacher
Contributor Author

freeseacher commented Jul 25, 2017 via email

@danielnelson
Contributor

Do you get the same behavior if you send a SIGHUP? I just tried on the master branch and it recovered with the cpu input.

@freeseacher
Contributor Author

Yep, the same behavior:

[stage] 23:14:05 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6735  100  6735    0     0  1781k      0 --:--:-- --:--:-- --:--:-- 2192k
123
[stage] 23:14:09 /etc/docker-compose/ch # systemctl restart telegraf 
[stage] 23:14:16 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6390  100  6390    0     0  1727k      0 --:--:-- --:--:-- --:--:-- 2080k
120
[stage] 23:14:19 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6624  100  6624    0     0   824k      0 --:--:-- --:--:-- --:--:--  924k
123
[stage] 23:14:21 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6625  100  6625    0     0  2459k      0 --:--:-- --:--:-- --:--:-- 3234k
123
[stage] 23:14:23 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  507k  100  507k    0     0  16.2M      0 --:--:-- --:--:-- --:--:-- 16.5M
4769
[stage] 23:14:36 /etc/docker-compose/ch # ps wax | grep telegraf                
 9917 ?        Ssl    0:05 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
[stage] 23:14:46 /etc/docker-compose/ch # kill -HUP 9917
[stage] 23:14:52 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  517k  100  517k    0     0  16.0M      0 --:--:-- --:--:-- --:--:-- 16.2M
4898
[stage] 23:14:53 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  517k  100  517k    0     0  15.3M      0 --:--:-- --:--:-- --:--:-- 15.7M
4898
[stage] 23:14:57 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6655  100  6655    0     0  1315k      0 --:--:-- --:--:-- --:--:-- 1624k
123
[stage] 23:15:03 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6666  100  6666    0     0  3539k      0 --:--:-- --:--:-- --:--:-- 6509k
123
[stage] 23:15:43 /etc/docker-compose/ch # curl http://0.0.0.0:9126/metrics|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6666  100  6666    0     0  2040k      0 --:--:-- --:--:-- --:--:-- 3254k
123
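
Sampling the endpoint in a loop makes the timing of the drop in the session above easier to see than repeated manual curl calls. A minimal sketch, assuming a POSIX shell; `sample_counts` is a hypothetical helper, and passing `-s` to curl only suppresses the progress meter.

```shell
#!/bin/sh
# Hypothetical helper: print one "<HH:MM:SS> <line count>" row per sample.
#   $1 = number of samples
#   $2 = seconds to sleep between samples
#   remaining arguments = command that prints the /metrics body on stdout
sample_counts() {
  n=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$n" ]; do
    # tr strips the padding that some wc implementations add.
    count=$("$@" | wc -l | tr -d ' ')
    echo "$(date +%H:%M:%S) $count"
    i=$((i + 1))
    # Sleep only between samples, not after the last one.
    if [ "$i" -lt "$n" ]; then
      sleep "$delay"
    fi
  done
}
```

For example, `sample_counts 12 5 curl -s http://0.0.0.0:9126/metrics` takes 12 samples 5 seconds apart; sending the HUP from another terminal while it runs shows when the line count falls.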

@danielnelson
Contributor

I can reproduce it; I must not have waited long enough for the metrics to expire. Will work on it now.

@lastsky

lastsky commented Jul 25, 2017

Same on a fresh CentOS 7 install (and Ubuntu!).

[root@virtualbox-srv-ubuntu ~]# date
Tue Jul 25 23:40:20 MSK 2017
[root@virtualbox-srv-ubuntu ~]# curl -s virtualbox-srv-ubuntu:9126/metrics | wc -l
517
[root@virtualbox-srv-ubuntu ~]# systemctl reload telegraf
[root@virtualbox-srv-ubuntu ~]# date
Tue Jul 25 23:40:29 MSK 2017
[root@virtualbox-srv-ubuntu ~]# curl -s virtualbox-srv-ubuntu:9126/metrics | wc -l
517
[root@virtualbox-srv-ubuntu ~]# systemctl restart telegraf
[root@virtualbox-srv-ubuntu ~]# curl -s virtualbox-srv-ubuntu:9126/metrics | wc -l
120

telegraf version

Telegraf v1.3.4 (git: release-1.3 7bbd3da)

danielnelson mentioned this issue Jul 25, 2017
danielnelson added this to the 1.3.5 milestone Jul 25, 2017
@lastsky

lastsky commented Jul 25, 2017

@danielnelson Thanks!!!!!

@danielnelson
Contributor

Small warning: I expect you will now notice issue #2839.
