Kubernetes SD: Unrelated new pod recycling old IP shows up as old pod of configured job #2266
beorn7 added the component/service discovery, kind/bug, and priority/P1 labels on Dec 8, 2016
beorn7 assigned fabxc on Dec 8, 2016
@fabxc Assigning to you as you probably know best what might be happening here. Something with the way targets are hashed/identified, and if the new pod is seen as the old one, the labels are never updated?
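For illustration, a minimal Go sketch of that hypothesis. This is not the actual Prometheus code; the names `target`, `targetPool`, and `upsert` are made up. It only shows that, if a target's identity is derived from its scrape address alone, an update carrying new labels for a reused IP looks like the already-known target and its labels are never refreshed.

```go
package main

import "fmt"

// target is a scrape target together with its discovered labels.
type target struct {
	addr   string
	labels map[string]string
}

// targetPool identifies targets by scrape address only: the suspected flaw,
// since a new pod reusing an old pod's IP maps to the same key.
type targetPool struct {
	targets map[string]*target
}

// upsert treats an already-known address as "nothing changed" and never
// refreshes its labels, which matches the reported symptom.
func (p *targetPool) upsert(t *target) {
	if _, ok := p.targets[t.addr]; ok {
		return // same key, so the stale labels survive
	}
	p.targets[t.addr] = t
}

func main() {
	pool := &targetPool{targets: map[string]*target{}}

	// Pod of job A is discovered.
	pool.upsert(&target{addr: "10.2.3.4:8080", labels: map[string]string{"app": "api-mobile"}})
	// A pod of job B later reuses the same IP; its new labels are ignored.
	pool.upsert(&target{addr: "10.2.3.4:8080", labels: map[string]string{"app": "job-b"}})

	fmt.Println(pool.targets["10.2.3.4:8080"].labels) // still map[app:api-mobile]
}
```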
#2262 smells like a related issue, for a completely different SD mechanism.
This has happened at least twice. It is kind of worrying because it can go unnoticed, so we might have severe underreporting. (In the two detected cases, we had a per-endpoint alert, and that one pod from a completely different service happened to have a similarly named metric that tickled the alert.)
beorn7 added a commit that referenced this issue on Jan 6, 2017
beorn7 assigned beorn7 and unassigned fabxc on Feb 16, 2017
I'm investigating whether #2323 fixed it (by detecting re-used pod IPs in SoundCloud production and checking whether any Prometheus server still monitors them under their old identity).
Investigated a number of cases, and they all look good. Closing until proven otherwise.
beorn7 closed this on Feb 16, 2017
lock bot commented on Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
beorn7 commented on Dec 8, 2016
What did you do?
We are running a large-scale K8s cluster with many Prometheis monitoring individual jobs.
What did you expect to see?
If a pod goes away and another pod with the same IP address but completely different labels gets created, Prometheus SD should show the new labels for that IP, not the old ones.
What did you see instead? Under which circumstances?
What likely happened: A pod of job A was removed. Shortly after, a pod for job B was created that happened to get the same IP address as the just-removed pod. On the Prometheus server for job B, all was fine. On the Prometheus server for job A, the old pod stayed around, with exactly the same __meta_kubernetes_… labels as before. So the Prometheus for job A was now scraping that pod as if it belonged to job A, although the pod was actually exposing metrics for job B. That created huge confusion in the resulting metrics. A restart of the job-A Prometheus fixed the issue.
This is either a problem with the caching in the K8s client code within Prometheus, or it is a problem with the way Prometheus identifies a change in the labels for a pod IP.
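To make the second possibility concrete, here is a hedged sketch (again not the actual Prometheus implementation) of how identifying a target by its address together with its full discovered label set, rather than by the IP alone, would distinguish the recycled IP: the new pod hashes to a different identity, so the stale entry can be detected and replaced.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// fingerprint hashes the scrape address together with every discovered
// label, so two pods sharing an IP but carrying different labels get
// different identities. Illustrative only; key order is fixed by sorting.
func fingerprint(addr string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	h.Write([]byte(addr))
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(labels[k]))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	oldPod := map[string]string{"app": "api-mobile"}
	newPod := map[string]string{"app": "job-b"}

	// Different fingerprints despite the identical IP, so the stale
	// target can be recognized and replaced instead of kept as-is.
	fmt.Println(fingerprint("10.2.3.4:8080", oldPod) == fingerprint("10.2.3.4:8080", newPod)) // false
}
```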
Environment
Linux 4.4.10+soundcloud x86_64
Relevant portions of the scrape configuration:
Many more relabel configs to follow… But the important one is here: if the pod is not labeled as api-mobile, it is dropped. The new pod recycling the IP address is labeled differently, but the api-mobile Prometheus sees it with the labels the old pod had, as can be seen in the tooltip on the targets list.
Only checkpointing messages in the logs, no error messages at all.
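For illustration, the effect of that keep rule expressed as plain Go rather than the original relabel config (which is not reproduced above); the label key "app" is an assumed placeholder for whatever key the real config matches on. It shows why the unrelated pod keeps being scraped: the api-mobile Prometheus evaluates the old pod's cached labels, so the rule still matches.

```go
package main

import "fmt"

// keepTarget mimics the described keep rule: keep the target only if its
// pod label marks it as api-mobile, otherwise drop it. The key "app" is
// an assumption; the real config and label names are not shown above.
func keepTarget(discovered map[string]string) bool {
	return discovered["app"] == "api-mobile"
}

func main() {
	// Labels the api-mobile Prometheus should see for the new pod.
	fresh := map[string]string{"app": "job-b"}
	// Labels it actually sees, because the old pod's entry was never updated.
	stale := map[string]string{"app": "api-mobile"}

	fmt.Println(keepTarget(fresh)) // false: the target would correctly be dropped
	fmt.Println(keepTarget(stale)) // true: the unrelated pod keeps being scraped
}
```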