
One scrape target is still marked as UP even though it's not been scraped for over a week #1776

Closed
banks opened this Issue Jun 30, 2016 · 12 comments

banks commented Jun 30, 2016

Note that restarting the host "fixed" this issue, but I thought I'd report it in case it's useful for debugging, as there is clearly a bug somewhere for Prometheus to get into this state.

What did you do?

We have 2 prometheus servers with the same scrape config, scraping several hundred nodes, each with a telegraf instance exposing metrics.

What did you expect to see?

The same number of "UP" scrape targets on each prometheus server.

What did you see instead? Under which circumstances?

One server had one fewer target for a week.
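
Finding the odd one out just means diffing the up vector on the two servers. A rough Python sketch of that comparison (the server URLs are the ones from the config below; any HTTP client would do):

import json
import urllib.parse
import urllib.request

SERVERS = [
    "http://prometheus.0.dblayer.com:9090",
    "http://prometheus.1.dblayer.com:9090",
]

def up_instances(base_url):
    # Instant query for all targets currently reporting up == 1.
    url = base_url + "/api/v1/query?query=" + urllib.parse.quote("up == 1")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return {r["metric"]["instance"] for r in data["data"]["result"]}

a, b = (up_instances(s) for s in SERVERS)
print("only UP on prometheus.0:", sorted(a - b))
print("only UP on prometheus.1:", sorted(b - a))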

After diffing the up query responses I found the target in question and then looked at its status on the /status page:

[screenshot: target status on the /status page]

It's not been scraped for over a week but is still "UP".

Indeed, if I look at any of its metrics on a graph, they just stopped being collected 10 days ago:
[screenshot: graph of a metric from this target stopping 10 days ago]

Note that the target in question has been UP and correctly scraped by the secondary prometheus host the whole time.

I can also curl the scrape endpoint from the prometheus host just fine:

paul@prometheus.0 ~ $curl -v -o /dev/null http://172.18.12.185:9126/metrics
* Hostname was NOT found in DNS cache
*   Trying 172.18.12.185...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 172.18.12.185 (172.18.12.185) port 9126 (#0)
> GET /metrics HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 172.18.12.185:9126
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Length: 1682046
< Content-Type: text/plain; version=0.0.4
< Date: Thu, 30 Jun 2016 14:26:45 GMT
<
{ [data not shown]
100 1642k  100 1642k    0     0  14.2M      0 --:--:-- --:--:-- --:--:-- 14.3M
* Connection #0 to host 172.18.12.185 left intact

So the prometheus instance seems to just be stuck on this scrape and not retrying at all. To confirm, restarting the prometheus process in question "fixed" it.

Environment

  • System information:

    Linux 3.13.0-74-generic x86_64

  • Prometheus version:

prometheus, version 0.18.0 (branch: release-0.18, revision: f12ebd6)
  build user:       root@ebaf628123aa
  build date:       20160418-08:20:43
  go version:       go1.5.4
  • Prometheus configuration file:
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.

scrape_configs:
  - job_name: 'prometheus'
    target_groups:
      - targets:
          - prometheus.0.dblayer.com:9090
          - prometheus.1.dblayer.com:9090
        labels:
          scrape_host: prometheus.0

  - job_name: 'discovered'
    file_sd_configs:
      - names:
        - /opt/prometheus/conf.d/*.yml

Files in /opt/prometheus/conf.d just list hosts, updated by a cron script that queries the Chef server (a rough sketch of the generator is included below).

  • Logs:
    I didn't see anything obviously relevant; there were a bunch of messages.
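
For completeness, the conf.d files are plain file_sd target lists. The cron job boils down to something like this Python sketch (the Chef lookup is stubbed out, the example hostnames are made up, and the port and scrape_host label just mirror the rest of this report):

import yaml  # PyYAML

def telegraf_hosts_from_chef():
    # Stub for the real Chef server query.
    return ["db-node-1.dblayer.com", "db-node-2.dblayer.com"]

# file_sd format: a list of target groups, each with targets and optional labels.
target_groups = [{
    "targets": ["%s:9126" % h for h in telegraf_hosts_from_chef()],
    "labels": {"scrape_host": "prometheus.0"},  # label is illustrative
}]

with open("/opt/prometheus/conf.d/telegraf_dblayer.yml", "w") as f:
    yaml.safe_dump(target_groups, f, default_flow_style=False)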
banks commented Jun 30, 2016

FWIW this is not a one-off; after the restart I noticed the other server now had fewer targets up...

[screenshot: stuck target on the other server]

So in fact there were 2 "stuck" scrapes on prometheus.0 and one on prometheus.1 originally. Restarting fixed both servers. We'll see how long for.
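
If it does recur, grabbing a goroutine dump from the stuck server before restarting might show where the scrape loop is blocked. Prometheus exposes Go's pprof handlers under /debug/pprof (at least in the builds I've run), so a sketch like this should do it; the server address is from the config above, and the "retrieval"/"scrape" filter is just a guess at which stacks are interesting:

import urllib.request

# Full goroutine stacks as plain text (debug=2).
url = "http://prometheus.0.dblayer.com:9090/debug/pprof/goroutine?debug=2"
with urllib.request.urlopen(url) as resp:
    dump = resp.read().decode("utf-8", errors="replace")

# Print only goroutines whose stacks mention the scraping code, to cut the noise.
for stack in dump.split("\n\n"):
    if "retrieval" in stack or "scrape" in stack:
        print(stack)
        print()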

brian-brazil commented Jun 30, 2016

Can you confirm the contents of the file sd configs are identical with respect to that host?

brian-brazil commented Jun 30, 2016

We've also had a few fixes to code in this area recently, could you see if 0.20.0 fixes this?

banks commented Jun 30, 2016

@brian-brazil Thanks for the prompt response.

Can you confirm the contents of the file sd configs are identical with respect to that host?

Yep, they are generated by the same script from the same Chef data. To be sure, I copied them both to one host:

$shasum prom.0/telegraf_dblayer.yml prom.1/telegraf_dblayer.yml
6fb0ed19167f29901f79c41a7aba6440d7ab305a  prom.0/telegraf.yml
6fb0ed19167f29901f79c41a7aba6440d7ab305a  prom.1/telegraf.yml

We've also had a few fixes to code in this area recently, could you see if 0.20.0 fixes this?

Yeah, I'll wait a week or so and see if we get another case before I upgrade, as that might make the result more conclusive!

fabxc modified the milestone: v1.0.0 Jul 3, 2016

juliusv commented Jul 23, 2016

@banks Any news here? Do you still get this with 1.0.1?

banks commented Jul 25, 2016

@juliusv Thanks for following up. I was intentionally waiting a little while to be more confident that an improvement in a newer version isn't just chance. In other words, I've not upgraded yet.

That said I still do see live examples of this happening:

[screenshot: another target stuck but still shown as UP]

So I will upgrade as soon as I get a chance, and will be somewhat confident that if it doesn't recur within a few weeks it is "solved".

juliusv commented Jul 25, 2016

@banks Thanks!

banks commented Jul 25, 2016

FYI, I just upgraded. Everything is good so far; I'll report back in a week and again in 2 weeks. If no more stalls are found, let's call this fixed.

fabxc commented Aug 8, 2016

@banks Did the problem reappear?

banks commented Aug 8, 2016

Thanks for the reminder - I even set myself a reminder to report back a week ago and then got sidetracked before I did...

But I've not observed any more cases of this in 2 weeks, so I'm going to assume it's fixed.

Thanks for the help. Great job fixing bugs before they are even reported :)

banks closed this Aug 8, 2016

fabxc commented Aug 8, 2016

🎉

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
