Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU after config change and SIGHUP #1114

Closed
dan-cleinmark opened this Issue Sep 24, 2015 · 14 comments

Comments

Projects
None yet
3 participants
@dan-cleinmark
Copy link

dan-cleinmark commented Sep 24, 2015

We're seeing very high CPU use on all cores after a SIGHUP if the prometheus config file changes. Originally we were on 0.15.1 rev. 7a6d12a and the issue still persists on 0.16.0rc1. CPU performance is as expected when the service first starts but consistently drops off after a config change & SIGHUP. A screenshot showing CPU load after a config change & HUP is below.

The green line shows CPU cores idle. The first drop shows an addition in the global labels followed by a HUP, the next 2 are HUPing prometheus after removing target groups - the increases correlate to restarts of the prometheus service. When CPU spikes after a config reload, we see ~ 4/4 cores in 'user' state and on a 16 core box we see ~25% user, ~25% system and ~50% iowait - interestingly, disk load was always constantly stable, even though CPU iowait was very high.

image

Our prometheus config (200 total file_sd_config groups):

global:
  evaluation_interval: 15s
  labels:
    env: qa1
  scrape_interval: 15s
  scrape_timeout: 5s
rule_files: [/etc/prometheus/prometheus.rules, /etc/prometheus/common.rules, /etc/prometheus/elastic-search.rules,
  /etc/prometheus/test.rules, /etc/prometheus/third-party-services.rules, /etc/prometheus/jenkins.rules,
  /etc/prometheus/team-abc.rules, /etc/prometheus/team-123.rules]
scrape_configs:
- job_name: services_health
  target_groups:
  - targets: ['localhost:8180']
- file_sd_configs:
  - names: [/etc/prometheus/sd/service1.node.yml]
  job_name: service1_node
  metrics_path: /9100/metrics
  scrape_interval: 5s
- file_sd_configs:
  - names: [/etc/prometheus/sd/service1.app.yml]
  job_name: service1
  metrics_path: /9100/metrics
  scrape_interval: 5s
  - file_sd_configs:
    - names: [/etc/prometheus/sd/service2.node.yml]
    job_name: service1_node
    metrics_path: /9100/metrics
    scrape_interval: 5s
  - file_sd_configs:
    - names: [/etc/prometheus/sd/service2.app.yml]
    job_name: service1
    metrics_path: /9100/metrics
    scrape_interval: 5s
...
@dan-cleinmark

This comment has been minimized.

Copy link
Author

dan-cleinmark commented Sep 24, 2015

I'm not entirely sure how to interpret these, but here's an strace before / after a config reload:

[pid 22451] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
[pid 22461] futex(0xc855551310, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22451] <... socket resumed> )      = 852
[pid 22386] <... futex resumed> )       = 0
[pid 22451] setsockopt(852, SOL_SOCKET, SO_BROADCAST, [1], 4 <unfinished ...>
[pid 22386] futex(0xc854f6c110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22451] <... setsockopt resumed> )  = 0
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22451] connect(852, {sa_family=AF_INET, sin_port=htons(9900), sin_addr=inet_addr("172.16.131.220")}, 16 <unfinished ...>
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22451] <... connect resumed> )     = -1 EINPROGRESS (Operation now in progress)
[pid 22359] <... clock_gettime resumed> {193998, 2266363}) = 0
[pid 22451] epoll_ctl(23, EPOLL_CTL_ADD, 852, {EPOLLIN|EPOLLOUT|EPOLLET|0x2000, {u32=491268928, u64=139698597538624}} <unfinished ...>
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22451] <... epoll_ctl resumed> )   = 0
[pid 22359] <... clock_gettime resumed> {1443129826, 2205126}) = 0
[pid 22451] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22451] <... clock_gettime resumed> {193998, 2512865}) = 0
[pid 22451] futex(0xc854f6c110, FUTEX_WAKE, 1) = 1
[pid 22386] <... futex resumed> )       = 0
[pid 22451] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22386] futex(0xc854f6c110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22451] <... clock_gettime resumed> {1443129826, 2551053}) = 0
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22451] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22387] <... epoll_wait resumed> {{EPOLLOUT, {u32=491268928, u64=139698597538624}}}, 128, 4294967295) = 1
[pid 22451] <... clock_gettime resumed> {193998, 2905215}) = 0
[pid 22387] epoll_wait(23,  <unfinished ...>
[pid 22451] getsockopt(852, SOL_SOCKET, SO_ERROR <unfinished ...>
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22451] <... getsockopt resumed> , [20679004809854976], [4]) = 0
[pid 22359] <... clock_gettime resumed> {193998, 3097844}) = 0
[pid 22451] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22451] <... clock_gettime resumed> {193998, 3206386}) = 0
[pid 22359] <... clock_gettime resumed> {1443129826, 3032185}) = 0
[pid 22451] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22451] <... clock_gettime resumed> {1443129826, 3132092}) = 0
[pid 22451] getsockname(852, {sa_family=AF_INET, sin_port=htons(33912), sin_addr=inet_addr("172.16.130.126")}, [16]) = 0
[pid 22451] getpeername(852,  <unfinished ...>
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22451] <... getpeername resumed> {sa_family=AF_INET, sin_port=htons(9900), sin_addr=inet_addr("172.16.131.220")}, [16]) = 0
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22451] setsockopt(852, SOL_TCP, TCP_NODELAY, [1], 4 <unfinished ...>
[pid 22359] <... clock_gettime resumed> {193998, 3686713}) = 0
[pid 22451] <... setsockopt resumed> )  = 0
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22451] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22359] <... clock_gettime resumed> {1443129826, 3626346}) = 0
[pid 22451] <... clock_gettime resumed> {193998, 3875555}) = 0
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22451] clock_gettime(CLOCK_REALTIME, {1443129826, 3815747}) = 0
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22451] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22451] <... clock_gettime resumed> {193998, 4129848}) = 0
[pid 22359] <... clock_gettime resumed> {193998, 4165092}) = 0
[pid 22451] futex(0xc854f6c110, FUTEX_WAKE, 1 <unfinished ...>
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22451] <... futex resumed> )       = 1
[pid 22386] <... futex resumed> )       = 0
[pid 22451] read(852,  <unfinished ...>
[pid 22386] futex(0xc855551310, FUTEX_WAKE, 1 <unfinished ...>
[pid 22451] <... read resumed> 0xc8675c6000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 22461] <... futex resumed> )       = 0
[pid 22451] futex(0xc854f5f790, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22461] futex(0xc855551310, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22386] <... futex resumed> )       = 1
[pid 22359] <... clock_gettime resumed> {1443129826, 4094299}) = 0
[pid 22386] write(852, "GET /8080/stark/prometheuz HTTP/"..., 354 <unfinished ...>
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22386] <... write resumed> )       = 354
[pid 22386] futex(0xc855551310, FUTEX_WAKE, 1 <unfinished ...>
[pid 22461] <... futex resumed> )       = 0
[pid 22386] <... futex resumed> )       = 1
[pid 22461] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 22386] futex(0xc854f6c110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22359] clock_gettime(CLOCK_MONOTONIC, {193998, 5065977}) = 0
[pid 22359] clock_gettime(CLOCK_REALTIME, {1443129826, 4918634}) = 0
[pid 22461] <... select resumed> )      = 0 (Timeout)
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22461] futex(0xc855551310, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22359] clock_gettime(CLOCK_MONOTONIC, {193998, 5475129}) = 0
[pid 22359] futex(0x104d680, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 22549] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 22549] clock_gettime(CLOCK_MONOTONIC, {193998, 7695381}) = 0
[pid 22549] futex(0x104d680, FUTEX_WAKE, 1) = 1
[pid 22359] <... futex resumed> )       = 0
[pid 22549] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22549] <... clock_gettime resumed> {193998, 7888909}) = 0
[pid 22359] <... clock_gettime resumed> {193998, 7927299}) = 0
[pid 22549] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22549] <... clock_gettime resumed> {1443129826, 7810706}) = 0
[pid 22359] <... clock_gettime resumed> {1443129826, 7860176}) = 0
[pid 22549] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22549] <... clock_gettime resumed> {1443129826, 7957580}) = 0
[pid 22549] futex(0xc855551310, FUTEX_WAKE, 1) = 1
[pid 22461] <... futex resumed> )       = 0
[pid 22549] futex(0xc854f6c110, FUTEX_WAKE, 1 <unfinished ...>
[pid 22461] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 22549] <... futex resumed> )       = 1
[pid 22386] <... futex resumed> )       = 0
[pid 22549] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22549] <... clock_gettime resumed> {193998, 8543039}) = 0
[pid 22387] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT, {u32=491268928, u64=139698597538624}}}, 128, 4294967295) = 1
[pid 22549] futex(0x104d2f8, FUTEX_WAIT, 0, {0, 5027259} <unfinished ...>
[pid 22461] <... select resumed> )      = 0 (Timeout)
[pid 22387] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22461] futex(0xc855551310, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22387] <... clock_gettime resumed> {193998, 8755959}) = 0
[pid 22386] <... clock_gettime resumed> {1443129826, 8377270}) = 0
[pid 22387] futex(0xc855551310, FUTEX_WAKE, 1 <unfinished ...>
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22461] <... futex resumed> )       = 0
[pid 22387] <... futex resumed> )       = 1
[pid 22461] epoll_wait(23,  <unfinished ...>
[pid 22387] read(852,  <unfinished ...>
[pid 22461] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|0x2000, {u32=491268928, u64=139698597538624}}}, 128, 0) = 1
[pid 22387] <... read resumed> "HTTP/1.1 200 OK\r\nContent-Type: t"..., 4096) = 333
[pid 22461] epoll_wait(23,  <unfinished ...>
[pid 22387] futex(0xc854f5f790, FUTEX_WAKE, 1 <unfinished ...>
[pid 22386] <... clock_gettime resumed> {1443129826, 8733782}) = 0
[pid 22451] <... futex resumed> )       = 0
[pid 22387] <... futex resumed> )       = 1
[pid 22451] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 22387] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22387] <... clock_gettime resumed> {1443129826, 9186659}) = 0
[pid 22386] <... clock_gettime resumed> {1443129826, 9227549}) = 0
[pid 22387] epoll_ctl(23, EPOLL_CTL_DEL, 852, {0, {u32=0, u64=0}} <unfinished ...>
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22451] <... select resumed> )      = 0 (Timeout)
[pid 22387] <... epoll_ctl resumed> )   = 0
[pid 22451] futex(0xc854f5f790, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22387] close(852 <unfinished ...>
[pid 22386] <... clock_gettime resumed> {1443129826, 9376135}) = 0
[pid 22387] <... close resumed> )       = 0
[pid 22386] futex(0xc854f5f790, FUTEX_WAKE, 1 <unfinished ...>
[pid 22387] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 22451] <... futex resumed> )       = 0
[pid 22386] <... futex resumed> )       = 1
[pid 22451] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22386] <... clock_gettime resumed> {1443129826, 9793254}) = 0
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22387] <... select resumed> )      = 0 (Timeout)
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22387] futex(0xc854f82110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22451] <... select resumed> )      = 0 (Timeout)
[pid 22386] <... clock_gettime resumed> {1443129826, 9957817}) = 0
[pid 22451] futex(0xc854f5f790, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22359] <... clock_gettime resumed> {193998, 10103146}) = 0
[pid 22386] <... clock_gettime resumed> {1443129826, 10136974}) = 0
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22386] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
[pid 22359] <... clock_gettime resumed> {1443129826, 10231244}) = 0
[pid 22386] <... socket resumed> )      = 852
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22386] setsockopt(852, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 22386] connect(852, {sa_family=AF_INET, sin_port=htons(9900), sin_addr=inet_addr("172.16.130.80")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22386] epoll_ctl(23, EPOLL_CTL_ADD, 852, {EPOLLIN|EPOLLOUT|EPOLLET|0x2000, {u32=491268928, u64=139698597538624}} <unfinished ...>
[pid 22359] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22386] <... epoll_ctl resumed> )   = 0
[pid 22359] <... clock_gettime resumed> {193998, 10902908}) = 0
[pid 22386] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22359] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22386] <... clock_gettime resumed> {193998, 11004195}) = 0
[pid 22359] <... clock_gettime resumed> {1443129826, 10827035}) = 0
[pid 22386] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 22359] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 22386] <... clock_gettime resumed> {1443129826, 10927914}) = 0
[pid 22386] clock_gettime(CLOCK_MONOTONIC, {193998, 11280571}) = 0
[pid 22386] futex(0xc854f6c110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22359] <... select resumed> )      = 0 (Timeout)
[pid 22359] clock_gettime(CLOCK_MONOTONIC, {193998, 11455658}) = 0
[pid 22461] <... epoll_wait resumed> {{EPOLLOUT, {u32=491268928, u64=139698597538624}}}, 128, 4294967295) = 1
[pid 22359] futex(0x104d680, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 22461] clock_gettime(CLOCK_MONOTONIC, {193998, 11594809}) = 0

After:

[pid 22491] sched_yield( <unfinished ...>
[pid 22560] <... inotify_add_watch resumed> ) = 1
[pid 22491] <... sched_yield resumed> ) = 0
[pid 22560] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22560] <... clock_gettime resumed> {194294, 653808870}) = 0
[pid 22552] <... futex resumed> )       = 0
[pid 22560] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22491] sched_yield( <unfinished ...>
[pid 22560] <... clock_gettime resumed> {194294, 653933588}) = 0
[pid 22491] <... sched_yield resumed> ) = 0
[pid 22560] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22560] <... clock_gettime resumed> {194294, 654055574}) = 0
[pid 22552] <... futex resumed> )       = 0
[pid 22560] epoll_wait(30,  <unfinished ...>
[pid 22476] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22469] read(775,  <unfinished ...>
[pid 22476] <... futex resumed> )       = 0
[pid 22469] <... read resumed> "\0", 100) = 1
[pid 22552] sched_yield( <unfinished ...>
[pid 22476] sched_yield( <unfinished ...>
[pid 22552] <... sched_yield resumed> ) = 0
[pid 22476] <... sched_yield resumed> ) = 0
[pid 22491] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22491] <... futex resumed> )       = 0
[pid 22552] <... futex resumed> )       = 0
[pid 22476] futex(0xc88a2d3318, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 22469] close(776 <unfinished ...>
[pid 22476] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22552] sched_yield( <unfinished ...>
[pid 22491] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22552] <... sched_yield resumed> ) = 0
[pid 22491] <... futex resumed> )       = 0
[pid 22476] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22491] sched_yield( <unfinished ...>
[pid 22476] <... futex resumed> )       = 0
[pid 22491] <... sched_yield resumed> ) = 0
[pid 22469] <... close resumed> )       = 0
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22469] close(775 <unfinished ...>
[pid 22552] <... futex resumed> )       = 0
[pid 22469] <... close resumed> )       = 0
[pid 22491] sched_yield( <unfinished ...>
[pid 22469] close(774 <unfinished ...>
[pid 22491] <... sched_yield resumed> ) = 0
[pid 22469] <... close resumed> )       = 0
[pid 22476] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22469] close(773 <unfinished ...>
[pid 22476] <... futex resumed> )       = 0
[pid 22466] <... close resumed> )       = 0
[pid 22476] sched_yield( <unfinished ...>
[pid 22466] close(306 <unfinished ...>
[pid 22476] <... sched_yield resumed> ) = 0
[pid 22466] <... close resumed> )       = 0
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22476] futex(0xc88a2d3318, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 22552] <... futex resumed> )       = 0
[pid 22476] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22466] close(305 <unfinished ...>
[pid 22476] sched_yield( <unfinished ...>
[pid 22552] sched_yield( <unfinished ...>
[pid 22476] <... sched_yield resumed> ) = 0
[pid 22552] <... sched_yield resumed> ) = 0
[pid 22491] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22476] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22491] <... futex resumed> )       = 0
[pid 22476] <... futex resumed> )       = 0
[pid 22466] <... close resumed> )       = 0
[pid 22491] sched_yield( <unfinished ...>
[pid 22466] close(302 <unfinished ...>
[pid 22491] <... sched_yield resumed> ) = 0
[pid 22465] write(391, "\0", 1 <unfinished ...>
[pid 22476] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22531] <... epoll_wait resumed> {{EPOLLIN, {u32=390, u64=390}}}, 7, -1) = 1
[pid 22476] <... futex resumed> )       = 0
[pid 22531] futex(0xc856892590, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 22552] sched_yield( <unfinished ...>
[pid 22465] <... write resumed> )       = 1
[pid 22552] <... sched_yield resumed> ) = 0
[pid 22476] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22552] futex(0xc88a2d3318, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 22476] <... futex resumed> )       = 0
[pid 22552] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22465] write(671, "\0", 1 <unfinished ...>
[pid 22552] sched_yield( <unfinished ...>
[pid 22526] <... epoll_wait resumed> {{EPOLLIN, {u32=670, u64=670}}}, 7, -1) = 1
[pid 22552] <... sched_yield resumed> ) = 0
[pid 22526] read(670,  <unfinished ...>
[pid 22476] sched_yield( <unfinished ...>
[pid 22526] <... read resumed> "\0", 100) = 1
[pid 22476] <... sched_yield resumed> ) = 0
[pid 22526] close(671 <unfinished ...>
[pid 22491] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22526] <... close resumed> )       = 0
[pid 22491] <... futex resumed> )       = 0
[pid 22526] close(670 <unfinished ...>
[pid 22476] sched_yield( <unfinished ...>
[pid 22526] <... close resumed> )       = 0
[pid 22476] <... sched_yield resumed> ) = 0
[pid 22526] close(669 <unfinished ...>
[pid 22491] sched_yield( <unfinished ...>
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22526] <... close resumed> )       = 0
[pid 22552] <... futex resumed> )       = 0
[pid 22526] close(668 <unfinished ...>
[pid 22491] <... sched_yield resumed> ) = 0
[pid 22476] futex(0xc88a2d3318, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22491] futex(0xc88a2d3318, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 22552] <... futex resumed> )       = 0
[pid 22491] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22476] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 22552] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22491] futex(0xc88a2d3318, FUTEX_WAKE, 1 <unfinished ...>
[pid 22552] <... futex resumed> )       = 0
[pid 22491] <... futex resumed> )       = 0
[pid 22465] <... write resumed> )       = 1
[pid 22463] <... epoll_wait resumed> {{EPOLLIN, {u32=547, u64=547}}}, 7, -1) = 1
[pid 22552] sched_yield( <unfinished ...>
[pid 22465] write(187, "\0", 1 <unfinished ...>
[pid 22552] <... sched_yield resumed> ) = 0

@fabxc fabxc added the Critical label Sep 24, 2015

@fabxc fabxc added this to the v0.16.0 milestone Sep 24, 2015

@fabxc fabxc added the bug label Sep 25, 2015

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

Thanks for the details! I managed to locally reproduce the bug now and got this pprof graph (via go tool pprof http://localhost:9090/debug/pprof/profile). Looking into that now.

pprof

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

So it it looks like when a FileSD shuts down, it closes its target groups channel, causing the nil check here to become true and break out of the select, but enter it right away again because of the for loop:

https://github.com/prometheus/prometheus/blob/master/retrieval/targetmanager.go#L142-L146

Even though tm.done is also closed already by that time and its select case should succeed as well, it looks like the select sometimes always chooses the first case and never the second one, causing a busy-loop.

@fabxc How about instead of breaking, we simply return also in the case where the tgc channel is closed? Then we have two signals for being "done", which is a bit dirty (they should always get closed at the same time anyways already though). Since we are going to supersede this entire scraping architecture with your simplified rewrite in 0.17.0, we probably shouldn't overengineer it.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

Actually, no. It looks like sometimes the tm.done channel is not being closed, although the tgc one is.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Sep 25, 2015

The outstanding PR for 0.17.0 is not rewrite of this section.
It separates the scraping logic form target management, which is a different area. This pipeline will remain.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

@fabxc Hmm, ok. But looking at https://github.com/prometheus/prometheus/pull/1064/files, you actually did change exactly this code to always return in case of a closed target group channel. I still have to understand why the done channel wouldn't always be closed at the same time though.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Sep 25, 2015

Yes, that's a line I touched (and probably should have backported in this way as we decided to defer this PR).
But it's not a fundamental rewrite of this section, so fixing this in master will not be overwritten by entirely different semantics later.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

Yeah, good to know and keep in mind.

Still trying to figure out why done isn't closed in this piece of code, even though logging clearly shows that done being closed earlier is what causes tgc to get closed. I must be missing something obvious...

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Sep 25, 2015

Tell me when you found out – I feel like this is related to the mysterious out-of-order appending in #1064.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

Seems I found it out: since the target manager is reused and simply stopped and restarted, it gets a new done channel created when it restarts, and it can happen that the goroutine in https://github.com/prometheus/prometheus/blob/master/retrieval/targetmanager.go#L138-L151 is still running and now refers to a newly-created (not closed) done channel.

juliusv added a commit that referenced this issue Sep 25, 2015

Fix target manager CPU busyloop caused by bad done-channel handling.
Unfortunately this isn't nicely testable, as it's timing-dependent and
one would have to detect a stray goroutine doing a CPU busyloop...

Fixes #1114
@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 25, 2015

@dan-cleinmark Could you take a look whether #1116 fixes this issue for you? It fixes it for me.

juliusv added a commit that referenced this issue Sep 28, 2015

Fix target manager CPU busyloop caused by bad done-channel handling.
Unfortunately this isn't nicely testable, as it's timing-dependent and
one would have to detect a stray goroutine doing a CPU busyloop...

Fixes #1114
@dan-cleinmark

This comment has been minimized.

Copy link
Author

dan-cleinmark commented Sep 28, 2015

@juliusv belatedly, this looks great! Thanks for the quick turn around.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 28, 2015

@dan-cleinmark Great to hear! No, thank you for catching this pre-0.16.0!

fabxc added a commit that referenced this issue Jan 11, 2016

Fix target manager CPU busyloop caused by bad done-channel handling.
Unfortunately this isn't nicely testable, as it's timing-dependent and
one would have to detect a stray goroutine doing a CPU busyloop...

Fixes #1114
@lock

This comment has been minimized.

Copy link

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.