
Old Kubernetes SD endpoints are still "discovered" and scraped despite no longer existing #10257

Open
jutley opened this issue Feb 3, 2022 · 24 comments


@jutley

jutley commented Feb 3, 2022

What did you see instead? Under which circumstances?

We have an alert that fires when a target cannot be scraped. It began firing, and upon inspection, the target did not actually exist. The target was for a Kubernetes Pod that had been replaced. It no longer appeared in the Kubernetes apiserver, and its IP address was not in the relevant Services' corresponding Endpoints resources.

We run redundant, identical Prometheus instances, and this only happened in one of them.

To better understand this state, we manually deleted another Pod of the same Deployment. The affected Prometheus successfully removed the old Pod from its targets and added the new Pod. The original false Pod, however, was still in the target list.

Additionally, it's worth noting that this group of Pods is discovered a few times. We accidentally had two Services pointing to these Pods' metrics endpoint, and both were discovered via a single ServiceMonitor resource from the prometheus-operator. Only one of these Services shows the problem (four targets, including the false one). The other has the appropriate targets (three targets). We also probe these Pods using the blackbox-exporter, which shows the exact same issues (seven targets, with three pairs of accidental duplicates, and one false target).

I have absolutely no idea how to reproduce this.

What did you do?

Nothing. The bug occurred while we were hands off.

What did you expect to see?

When the old Pod was removed and replaced, the Prometheus instance should have updated its targets accordingly by removing the old Pod.

Environment

  • System information:
Linux 5.4.155-flatcar x86_64
  • Prometheus version:
prometheus, version 2.32.1 (branch: HEAD, revision: 41f1a8125e664985dd30674e5bdf6b683eff5d32)
  build user:       root@54b6dbd48b97
  build date:       20211217-22:08:06
  go version:       go1.17.5
  platform:         linux/amd64
  • Prometheus configuration file:

Here is all the job configuration that targets/probes the relevant Pods/Services.

- job_name: serviceMonitor/kube-dns/core-dns/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (.*)
    target_label: __tmp_prometheus_job_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-dns
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: metrics
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_container_name]
    separator: ;
    regex: (.*)
    target_label: container
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: metrics
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    modulus: 1
    target_label: __tmp_hash
    replacement: $1
    action: hashmod
  - source_labels: [__tmp_hash]
    separator: ;
    regex: "0"
    replacement: $1
    action: keep
  kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    namespaces:
      names:
      - kube-dns
- job_name: serviceMonitor/kube-dns/dns-server-health/0
  honor_timestamps: true
  params:
    module:
    - dns_kubernetes_svc
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (.*)
    target_label: __tmp_prometheus_job_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-dns
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: dns
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_container_name]
    separator: ;
    regex: (.*)
    target_label: container
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: dns
    action: replace
  - source_labels: [__meta_kubernetes_pod_ip]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox-exporter.prometheus:9115
    action: replace
  - separator: ;
    regex: (.*)
    target_label: job
    replacement: dns-kubernetes-svc-blackbox
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    modulus: 1
    target_label: __tmp_hash
    replacement: $1
    action: hashmod
  - source_labels: [__tmp_hash]
    separator: ;
    regex: "0"
    replacement: $1
    action: keep
  kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    namespaces:
      names:
      - kube-dns
- job_name: serviceMonitor/kube-dns/dns-server-health/1
  honor_timestamps: true
  params:
    module:
    - dns_route53_record
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (.*)
    target_label: __tmp_prometheus_job_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-dns
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: dns
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_container_name]
    separator: ;
    regex: (.*)
    target_label: container
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: dns
    action: replace
  - source_labels: [__meta_kubernetes_pod_ip]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox-exporter.prometheus:9115
    action: replace
  - separator: ;
    regex: (.*)
    target_label: job
    replacement: dns-route53-record-blackbox
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    modulus: 1
    target_label: __tmp_hash
    replacement: $1
    action: hashmod
  - source_labels: [__tmp_hash]
    separator: ;
    regex: "0"
    replacement: $1
    action: keep
  kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    namespaces:
      names:
      - kube-dns
  • Logs:

We did not find any interesting logs, but here are the logs surrounding the start of the bug:

ts=2022-02-03T09:00:09.464Z caller=compact.go:518 level=info component=tsdb msg="write block" mint=1643868000025 maxt=1643875200000 ulid=01FTZCZPTXY02H4J1RRGFC133J duration=9.242232921s
ts=2022-02-03T09:00:09.935Z caller=head.go:812 level=info component=tsdb msg="Head GC completed" duration=468.087779ms
ts=2022-02-03T09:00:09.999Z caller=checkpoint.go:98 level=info component=tsdb msg="Creating checkpoint" from_segment=747 to_segment=749 mint=1643875200000
ts=2022-02-03T09:00:17.901Z caller=head.go:981 level=info component=tsdb msg="WAL checkpoint complete" first=747 last=749 duration=7.902867284s
ts=2022-02-03T11:00:09.509Z caller=compact.go:518 level=info component=tsdb msg="write block" mint=1643875200145 maxt=1643882400000 ulid=01FTZKVE2Y697HQ7XP60J31F04 duration=9.286627633s
ts=2022-02-03T11:00:10.009Z caller=head.go:812 level=info component=tsdb msg="Head GC completed" duration=497.591114ms
ts=2022-02-03T11:00:10.062Z caller=checkpoint.go:98 level=info component=tsdb msg="Creating checkpoint" from_segment=750 to_segment=752 mint=1643882400000
ts=2022-02-03T11:00:18.421Z caller=head.go:981 level=info component=tsdb msg="WAL checkpoint complete" first=750 last=752 duration=8.359750705s
ts=2022-02-03T11:00:48.763Z caller=compact.go:459 level=info component=tsdb msg="compact blocks" count=3 mint=1643846400080 maxt=1643868000000 ulid=01FTZKVZVPMF7D3E9543S6MH6H sources="[01FTYRCH2Z443XHBNB2FF41DNN 01FTYZ88AYXA5K7VAPCMXBT7BB 01FTZ63ZJYAK50QS1GHT7DCXJE]" duration=30.340985514s
ts=2022-02-03T11:00:48.782Z caller=db.go:1279 level=info component=tsdb msg="Deleting obsolete block" block=01FTYRCH2Z443XHBNB2FF41DNN
ts=2022-02-03T11:00:48.797Z caller=db.go:1279 level=info component=tsdb msg="Deleting obsolete block" block=01FTYZ88AYXA5K7VAPCMXBT7BB
ts=2022-02-03T11:00:48.814Z caller=db.go:1279 level=info component=tsdb msg="Deleting obsolete block" block=01FTZ63ZJYAK50QS1GHT7DCXJE
ts=2022-02-03T13:00:14.248Z caller=compact.go:518 level=info component=tsdb msg="write block" mint=1643882400041 maxt=1643889600000 ulid=01FTZTQ5AYKX9TKQJ0STSR6VG1 duration=14.026654896s
ts=2022-02-03T13:00:14.737Z caller=head.go:812 level=info component=tsdb msg="Head GC completed" duration=485.328335ms
ts=2022-02-03T13:00:14.806Z caller=checkpoint.go:98 level=info component=tsdb msg="Creating checkpoint" from_segment=753 to_segment=755 mint=1643889600000
ts=2022-02-03T13:00:19.997Z caller=head.go:981 level=info component=tsdb msg="WAL checkpoint complete" first=753 last=755 duration=5.190818906s
ts=2022-02-03T15:00:15.284Z caller=compact.go:518 level=info component=tsdb msg="write block" mint=1643889600052 maxt=1643896800000 ulid=01FV01JWJYNX4MDBA0Q4RFZ5YR duration=15.061840545s
ts=2022-02-03T15:00:15.935Z caller=head.go:812 level=info component=tsdb msg="Head GC completed" duration=647.428299ms
ts=2022-02-03T15:00:15.998Z caller=checkpoint.go:98 level=info component=tsdb msg="Creating checkpoint" from_segment=756 to_segment=758 mint=1643896800000
ts=2022-02-03T15:00:21.498Z caller=head.go:981 level=info component=tsdb msg="WAL checkpoint complete" first=756 last=758 duration=5.499604249s
@brancz
Member

brancz commented Feb 6, 2022

What is the Kubernetes version involved here?

cc @simonpasquier @fpetkovski @philipgough

@jutley
Author

jutley commented Feb 7, 2022

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

@fpetkovski
Contributor

I am not completely convinced this is strictly caused by kubernetes discovery. Target groups for the endpoints role are always generated from scratch, and if Prometheus is getting stale reads from the informer, it wouldn't see the new pods at all. It is also strange that there's only one stale pod in the target group while new deletions are properly picked up.

Could the problem be somewhere in the scraping pool instead?

@fpetkovski
Contributor

fpetkovski commented Feb 10, 2022

@jutley if you still have the faulty Kubernetes Endpoints object, you can try checking whether there are any notReadyAddresses in the subsets field. These IPs are not shown in the regular kubectl get endpoints output (you will need to add -o yaml), but Prometheus will still pick them up and try to scrape them.
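
For example, something like this one-liner should print them directly (a sketch with placeholder names, assuming kubectl access to the namespace):

kubectl get endpoints <endpoint-name> -n <namespace> -o jsonpath='{.subsets[*].notReadyAddresses[*].ip}'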

@jan--f
Contributor

jan--f commented Feb 10, 2022

We're tracking similar symptoms as well in multiple cases (https://bugzilla.redhat.com/show_bug.cgi?id=1943860).
I agree with @fpetkovski, the k8s SD and Informer side looks unsuspicious.
The cases we have seen seem to coincide with either heavy load on the apiserver (i.e. request throttling is in effect) or a temporary outage/churn in the apiserver (due to, say, a node shutting down). Sometimes both of these factors coincide with the bug reported here.
The bugzilla link also further links to other bugzillas and an open issue in kube-state-metrics: kubernetes/kube-state-metrics#1569

@metalmatze
Member

We've now discovered this on our PolarSignals GKE cluster too.
I'll take a closer look at it next week.

In the meantime I've deleted the old Pods with:

kubectl delete pod -n observability --field-selector=status.phase==Succeeded
kubectl delete pod -n observability --field-selector=status.phase==Failed

@metalmatze
Member

Alright, I looked around a bit since the issue started showing up again over the weekend.
It seems like Prometheus itself is doing just fine. I don't see any errors, and the Prometheus Kubernetes SD looks fine too.

[Screenshots of the Kubernetes SD status for prom1, prom2, and prom3]

Looking at the Service of one of our Deployments, I could see too many endpoints, and looking at the Endpoints object itself, I can see exactly one old Pod still showing up under notReadyAddresses:

apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2022-02-21T10:11:17Z"
  creationTimestamp: "2021-01-04T13:41:25Z"
  labels:
    api.polarsignals.com/groupcache: profile-sharing
    app.kubernetes.io/component: api
    app.kubernetes.io/instance: api
    app.kubernetes.io/name: polarsignals-api
    app.kubernetes.io/version: 514d7d2c4818546255f2ba10beba7c4e348c27264c580116a87a936f0afee08
  name: api
  namespace: api
  resourceVersion: "289624038"
  uid: 918bdc6c-5d22-4ee9-9ce9-f280db4865b2
subsets:
- addresses:
  - ip: 10.0.0.47
    nodeName: gke-europe-west3-0-e2-medium-b3fa18ca-hg90
    targetRef:
      kind: Pod
      name: api-86d654d6b8-zkpr2
      namespace: api
      resourceVersion: "289623730"
      uid: fadfd43e-1648-45f8-bcb7-accd58f0780e
  - ip: 10.0.0.48
    nodeName: gke-europe-west3-0-e2-medium-b3fa18ca-hg90
    targetRef:
      kind: Pod
      name: api-86d654d6b8-gg7lf
      namespace: api
      resourceVersion: "289624035"
      uid: 83ad9041-0ea9-4d03-965a-2eabe05f9a33
  - ip: 10.0.1.48
    nodeName: gke-europe-west3-0-e2-medium-716d4dc9-l67q
    targetRef:
      kind: Pod
      name: api-86d654d6b8-6xsc8
      namespace: api
      resourceVersion: "289623693"
      uid: f5d3c0b7-6626-4d09-b1b1-7481da3f01d5
  - ip: 10.0.2.65
    nodeName: gke-europe-west3-0-e2-medium-54ff9fe7-pz7j
    targetRef:
      kind: Pod
      name: api-86d654d6b8-jdbr9
      namespace: api
      resourceVersion: "289623942"
      uid: 57144d01-39d5-4d0c-ab60-d7c47518f8e1
  - ip: 10.0.3.13
    nodeName: gke-europe-west3-0-preemptible-e2-hig-e392ee80-6twh
    targetRef:
      kind: Pod
      name: api-86d654d6b8-bp5gj
      namespace: api
      resourceVersion: "289623662"
      uid: 5a47f714-84d2-4aba-a7e9-21337f349c32
  - ip: 10.0.3.15
    nodeName: gke-europe-west3-0-preemptible-e2-hig-e392ee80-6twh
    targetRef:
      kind: Pod
      name: api-86d654d6b8-rmwfr
      namespace: api
      resourceVersion: "289623907"
      uid: 9e2e9920-a6f5-4099-86af-9fb307ffc704
  - ip: 10.0.4.46
    nodeName: gke-europe-west3-0-preemptible-e2-hig-17a4163d-rwsq
    targetRef:
      kind: Pod
      name: api-86d654d6b8-dxl9b
      namespace: api
      resourceVersion: "289623861"
      uid: 186c285e-1c50-4f4a-a432-6aba5fccd494
  - ip: 10.0.4.47
    nodeName: gke-europe-west3-0-preemptible-e2-hig-17a4163d-rwsq
    targetRef:
      kind: Pod
      name: api-86d654d6b8-bbhq7
      namespace: api
      resourceVersion: "289623886"
      uid: af3aa13e-7a2c-41d9-a096-e3d4a42ed52a
  - ip: 10.0.5.21
    nodeName: gke-europe-west3-0-preemptible-e2-hig-4e3c6d35-5xkh
    targetRef:
      kind: Pod
      name: api-86d654d6b8-t68c4
      namespace: api
      resourceVersion: "289623769"
      uid: cfbc1849-f472-41e8-93a6-0f94723b7435
  - ip: 10.0.5.22
    nodeName: gke-europe-west3-0-preemptible-e2-hig-4e3c6d35-5xkh
    targetRef:
      kind: Pod
      name: api-86d654d6b8-8lm6q
      namespace: api
      resourceVersion: "289624016"
      uid: 292823ae-ea56-4150-87bb-8717dc414313
  notReadyAddresses:
  - ip: 10.0.0.34
    nodeName: gke-europe-west3-0-e2-medium-b3fa18ca-hg90
    targetRef:
      kind: Pod
      name: api-674477959-pkrf2
      namespace: api
      resourceVersion: "288605097"
      uid: 1c46c3a7-7665-4997-b180-59bf4bbc8115
  ports:
  - name: grpc
    port: 10901
    protocol: TCP
  - name: http
    port: 8080
    protocol: TCP

[Screenshot of the infra.polarsignals.com Prometheus targets page]

This was visible on

Prometheus v2.33.1
Kubernetes: v1.22.4-gke.1501

Does it seem as if Prometheus isn't properly filtering the notReadyAddresses if some condition is met?

@fpetkovski
Contributor

To me it looks like notReadyAddresses are treated the same way as regular addresses, with the exception that the __meta_kubernetes_endpoint_ready label is set to "false":

for _, addr := range ss.NotReadyAddresses {
	add(addr, port, "false")
}

Not sure if this is intended or not though.

A solution could be to exclude Pods in the Succeeded phase from being discovered, but this changes the assumption that all Pods backing an endpoint are expected to be running at all times.
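
As a purely user-side workaround (not the SD change discussed above), a relabel rule along these lines could drop targets whose backing Pod is in a terminal phase, assuming the __meta_kubernetes_pod_phase meta label is attached for endpoints backed by Pods:

- source_labels: [__meta_kubernetes_pod_phase]
  separator: ;
  regex: Succeeded|Failed
  action: drop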

@metalmatze
Member

In the service discovery page I can see __meta_kubernetes_endpoint_ready="false" being set correctly.
Reading through prometheus-operator/prometheus-operator#3965, I'm of the impression that it is a semantic decision about what makes sense for a given environment.

On my personal Scaleway cluster these unready Pods are cleaned up quite quickly by the API server and don't stay in the namespace for long, whereas it seems that GKE, for example, keeps the Status: Failed Pods around so they can be inspected later on (at least that's what I imagine the reasoning to be).

Knowing that, I'm inclined to exclude all non-ready Pods in the Prometheus SD in our environment.
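
For reference, a minimal relabeling sketch for that (assuming the standard __meta_kubernetes_endpoint_ready meta label; with the prometheus-operator this would go into the ServiceMonitor's relabelings):

- source_labels: [__meta_kubernetes_endpoint_ready]
  separator: ;
  regex: "true"
  action: keep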

@philipgough
Contributor

whereas it seems that GKE, for example, keeps the Status: Failed Pods around so they can be inspected later on

Reading kubernetes/kubernetes#99986 your assumption seems correct.

GC is configurable with a flag in kube-controller-manager
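
For reference, the flag in question is kube-controller-manager's --terminated-pod-gc-threshold, which sets how many terminated Pods may exist before the Pod garbage collector starts deleting them (the upstream default is 12500, if I remember correctly), e.g.:

kube-controller-manager --terminated-pod-gc-threshold=100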

@metalmatze
Member

This doesn't seem to be configurable on GKE, so at least there we'll probably have to filter out all unready endpoints 🤷‍♂️
https://issuetracker.google.com/issues/172663707?pli=1

@nathan-vp

In the service discovery I can see the __meta_kubernetes_endpoint_ready="false" being set to false correctly.

This happens for us as well, but in our case __meta_kubernetes_endpoint_ready is incorrectly set to "true" (the Pod is not there anymore).

@pharaujo
Contributor

pharaujo commented Jul 6, 2022

I just want to add another data point here; I'm seeing the same issue as the original post (Prometheus 2.32.1, only one instance in the HA pair showing the issue), and I can confirm it's not unready endpoints (same as @nathan-vp, the stale targets all have __meta_kubernetes_endpoint_ready="true"). Of the 55 scrape jobs in the affected Prometheus instance, 5 are showing the issue. This started happening after we rotated the nodes in the cluster (upgrading EKS from 1.21 to 1.22); both pods had a restart.

SD seems to keep working though (adding and removing targets), as can be seen in the screenshot, apparently with the stale targets intact:
[Screenshot from 2022-07-06 12:17]

@roidelapluie
Member

That is super interesting. Can we see:

  • prometheus_target_scrape_pools_failed_total
  • prometheus_target_scrape_pool_sync_total

Thanks!
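
For example, queries along these lines should do (assuming the scrape_job label is present on the sync metric):

sum by (scrape_job) (prometheus_target_scrape_pool_sync_total)
rate(prometheus_target_scrape_pools_failed_total[5m])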

@pharaujo
Contributor

pharaujo commented Jul 6, 2022

Let me know if you want different aggregations!

[Screenshots from 2022-07-06 14:27]

@roidelapluie
Member

I think the next step would be to get a goroutine dump.

https://prometheus/debug/pprof/goroutine?debug=2 (You can put the output in e.g. a gist).
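
For example (assuming the default port 9090 and no auth proxy in front of Prometheus):

curl -s 'http://<prometheus-host>:9090/debug/pprof/goroutine?debug=2' > goroutines.txt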

@pharaujo
Contributor

pharaujo commented Jul 6, 2022

Goroutine dump for prometheus-k8s-1 (the instance that shows problems) here: https://gist.github.com/pharaujo/eb2d4697e8352883ce8c050f8666003c

@roidelapluie
Member

Do you have an idea of how long this instance has not been updating its targets properly?

@pharaujo
Contributor

pharaujo commented Jul 6, 2022

As far as I can tell, since the restart roughly 2 days ago (visible in the first screenshot I sent).

@roidelapluie
Member

What comes to mind is that we could get to this specific return:

without sending an empty target group for those endpoints. I'll need to dig further. I do wonder if there is an easily reproducible way to trigger this.

@TBeijen

TBeijen commented Aug 1, 2022

Prometheus: v2.32.1
EKS 1.22

Occasionally running into this as well, so another data point.

Most recent occurrence:

  • Started when the Prometheus pod itself was replaced (node replacement)
  • It looks like the target got replaced at that moment as well (in this case the prometheus-operator pod)
  • kubectl get endpoints holds just the new pod, not the old pod (which no longer exists); the old pod is also not in notReadyAddresses via -o yaml
  • The old target is visible in /service-discovery, having __meta_kubernetes_endpoint_ready="true"
  • It is the only target present for that particular node, which is to be expected since the node no longer exists (via __meta_kubernetes_pod_node_name)
  • Persists for ~14h at this moment.

Trying to trigger refresh:

  • Restarting the target pod that incorrectly has its old target in the list: no effect; the 'up' target gets replaced as it should, but the stale target persists.
  • Restarting Prometheus itself: this 'fixes' the problem. The target list now correctly shows the single pod.

Fwiw (n=2): Occurrences so far have been single-pod targets.

Dump: https://gist.github.com/TBeijen/d99b88d0123d43a6e3b8d191bf4e34a6

In the screenshots below, the stale target situation started around 17:30.

[Screenshots from 2022-08-01 07:55]

@pharaujo
Contributor

pharaujo commented Aug 9, 2022

Happened again in another Kubernetes cluster, right after the Prometheus pod was moved to another node. Both pods in the HA pair restarted, but only one shows the problem. Only 1 of 57 scrape jobs has a "phantom" target.

@hervenicol

hervenicol commented Dec 9, 2022

Experienced it too.

From my tests, to force "forgetting" outdated targets:

@Bhargavram3468

Hi,

I have a similar issue. When we migrate from one version of our node-exporter to another, we have observed in kubectl get endpoints that the old version's pod IP is tagged with the new version's port number, and only after some time is the new pod IP updated successfully. Do you have any idea how to resolve this?

For clear understanding:
old pod IP starts with 10.xx.xx.xx:9690
new pod IP starts with 192.XX.XX.XX:9100
but I am getting 10.XX.XX.XX:9100 for some time.
