
DNS SD: Old targets do not disappear #1610

Closed
beorn7 opened this Issue May 3, 2016 · 13 comments

@beorn7
Member

beorn7 commented May 3, 2016

I have just run into an issue where targets disappear from the DNS SRV record but stick around in Prometheus. The new targets have been added, but the old ones don't disappear. I couldn't detect any hung goroutine or anything like that, but perhaps I didn't look carefully enough.

I have the full goroutine dump if needed.

I'll report here if I see the issue happening more often. (So far this was the only incident.)

This has happened with the stock 0.18.0 binary.

@fabxc Perhaps this is related to the occasional K8s SD hiccups reported elsewhere, i.e. something in the retrieval layer might get stuck that is not specific to K8s.
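
As a quick cross-check against what Prometheus still lists as targets, something like the following standalone lookup shows what the SRV record currently resolves to. This is only a minimal sketch; the record name is a placeholder, not the one from this incident:

```
// srvcheck.go: print the targets the SRV record currently resolves to,
// for comparison with the targets Prometheus keeps scraping.
// The record name below is a placeholder.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// With empty service and proto, LookupSRV queries the name as given.
	_, addrs, err := net.LookupSRV("", "", "srv.name.example.org")
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```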

@fabxc

Member

fabxc commented May 3, 2016

Goroutine dump would help, yes.

@beorn7

Member Author

beorn7 commented May 3, 2016

Sent via personal mail.

@beorn7

Member Author

beorn7 commented Jul 5, 2016

Just FTR: this happened again just now. But it's definitely a rare event; targets at SC are coming and going all the time. (Perhaps it's triggered by a specific pattern, e.g. all targets being replaced in one go?)

@mweiden This is the issue I was talking about. And to be fair, we never considered it fixed.

@beorn7

Member Author

beorn7 commented Aug 15, 2016

Happened again, on v0.20... Is there any difference between 1.0 and 0.20 that could affect this?

(We should release 1.1 so that I can upgrade all of SC, and then I can create goroutine dumps for the current version for easier troubleshooting.)

@matthiasr

Contributor

matthiasr commented Aug 15, 2016

Replacing all the addresses in a job, without renaming the job, does the trick. Here is a script to reproduce this reliably.

@beorn7

Member Author

beorn7 commented Aug 18, 2016

This sounds pretty relevant for normal operations.
@fabxc do you think you'll find time to fix this soonish? Otherwise, I'll give it a spin (distracting me from client_golang work. ;)

@beorn7 beorn7 added the priority/P1 label Aug 18, 2016

@beorn7 beorn7 self-assigned this Aug 22, 2016

@beorn7

Member Author

beorn7 commented Aug 22, 2016

OK, I'm working on this one now. Dropping client_golang work in the meantime.

@fabxc

Member

fabxc commented Aug 22, 2016

Thanks, lacking bandwidth for this right now.

@espenkm


espenkm commented Aug 22, 2016

This may or may not be useful information: what I also see is that scraping gets confused, as it now scrapes some other pod that has gotten the recycled IP from the deleted pod(s). This (also?) leads to the new pod being scraped with the wrong labels, as Prometheus thinks it is the old pod.

We're on 1.0.1.

@beorn7

Member Author

beorn7 commented Aug 22, 2016

Bug is understood. Now I have to figure out a solution…

@fabxc

Member

fabxc commented Aug 22, 2016

What is the bug?


beorn7 added a commit that referenced this issue Aug 22, 2016

retrieval: Clean up target group map on config reload
Also, remove unused `providers` field in targetSet.

If the config file changes, we recreate all providers (by calling
`providersFromConfig`) and retrieve all targets anew from the newly
created providers. From that perspective, it cannot harm to clean up
the target group map in the targetSet. Not doing so (as was the
case so far) keeps stale targets around. This mattered if an existing
key in the target group map was not overwritten in the initial fetch
of all targets from the providers. Examples where that mattered:

```
scrape_configs:
- job_name: "foo"
  static_configs:
  - targets: ["foo:9090"]
  - targets: ["bar:9090"]
```
updated to:
```
scrape_configs:
- job_name: "foo"
  static_configs:
  - targets: ["foo:9090"]
```

`bar:9090` would still be monitored. (The static provider just
enumerates the target groups. If the number of target groups
decreases, the old ones stay around.)

```
scrape_configs:
- job_name: "foo"
  dns_sd_configs:
  - names:
    - "srv.name.one.example.org"
```
updated to:
```
scrape_configs:
- job_name: "foo"
  dns_sd_configs:
  - names:
    - "srv.name.two.example.org"
```

Now both SRV records are still monitored. The SRV name is part of the
key in the target group map, thus the new one is just added and the
old one stays around.

Obviously, this should have tests, and should have had tests before,
not only for this case. This is the quick fix. I have created
#1906 to track test
creation.

Fixes #1610 .
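
To make the mechanism concrete, here is a minimal sketch of the idea (not the actual Prometheus code; keys and targets are made up): a target group map that is only ever merged into keeps stale keys across a reload, while clearing it first lets the new providers repopulate it.

```
// tgroups stands in for the per-job target group map; for DNS SD the
// SRV name is part of the key ("source" of the group).
package main

import "fmt"

func main() {
	// Groups as they were before the config reload.
	tgroups := map[string][]string{
		"dns/srv.name.one.example.org": {"old-a:9090", "old-b:9090"},
	}

	// Groups delivered by the providers created from the new config.
	fresh := map[string][]string{
		"dns/srv.name.two.example.org": {"new-a:9090"},
	}

	// Buggy behaviour: merge without cleanup. The old SRV name's key is
	// never overwritten, so its targets keep being scraped.
	for k, v := range fresh {
		tgroups[k] = v
	}
	fmt.Println(tgroups) // both SRV names are present

	// The fix, in spirit: start from an empty map on config reload and
	// let the new providers repopulate it.
	tgroups = map[string][]string{}
	for k, v := range fresh {
		tgroups[k] = v
	}
	fmt.Println(tgroups) // only the new SRV name remains
}
```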
@beorn7

Member Author

beorn7 commented Aug 22, 2016

See PR.

@beorn7 beorn7 closed this Aug 24, 2016
