
K8s SD: Targets are not updated #4124

Closed
beorn7 opened this Issue Apr 30, 2018 · 38 comments


beorn7 commented Apr 30, 2018

Bug Report

What did you do?

Scaled new K8s pods up, scaled old ones down.

What did you expect to see?

Old pods disappear from targets, new ones show up.

What did you see instead? Under which circumstances?

Prometheus doesn't see the new pods and keeps trying to scrape the old ones.

Note that a proper target update can be triggered by sending a SIGHUP.

Environment

  • System information:

    Linux 4.4.10+soundcloud x86_64

  • Prometheus version:

    The most recently investigated version is revision 2be543e (on the release-2.2 branch). However, the same happened in all 2.2 releases, notably also before the grand K8s SD refactoring of #4117.

  • Prometheus configuration file:

See the various comments below.


beorn7 commented Apr 30, 2018

This issue is seen in production at SoundCloud. I'm still working on reproducing it with a minimal setup.

In particular, this issue cannot be reproduced with the following setup (cf. #4117 (comment)):

@krasi-georgiev wrote: "using the configs below seems to work as expected with minikube.
When scaling up and down the deployment pods, the targets list gets updated properly in the Prometheus GUI."

global:
  scrape_interval:     500ms

scrape_configs:
- job_name: 'kubernetes-pods'
  tls_config:
    insecure_skip_verify: true
  
  kubernetes_sd_configs:
  - api_server: https://192.168.99.101:8443
    role: pod
    namespaces:
      names:
      - default
    tls_config:
      ca_file: /home/krasi/.minikube/ca.crt
      cert_file: /home/krasi/.minikube/client.crt
      key_file: /home/krasi/.minikube/client.key

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 
  template: 
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

beorn7 commented Apr 30, 2018

Here is the job config that creates the problem. Note that the Prometheus server is deployed on bare metal, outside of the K8s cluster. The K8s API server is v1.10.1. The very same config works on Prometheus 1.8.2 (and IIRC on 2.0.0, although I might need to double-check that).

- job_name: myjob
  kubernetes_sd_configs:
  - api_server: https://api.mycluster.k8s.example.org
    role: pod
    tls_config:
      ca_file: "/srv/prometheus/k8s-certificates/mycluster/ca.crt"
      cert_file: "/srv/prometheus/k8s-certificates/mycluster/client.crt"
      key_file: "/srv/prometheus/k8s-certificates/mycluster/client.key"
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    regex: ".+;(?:[0-9]+,?)+|;"
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_label_system
    regex: ".+;myjob|;"
  - source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scheme
    target_label: __scheme__
    regex: "(https?)"
  - source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
    regex: "(.+)"
  - source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
    regex: "(.+?)(?::\\d+)?;(\\d+)"
    replacement: "$1:$2"
  - target_label: cluster
    replacement: mycluster
  - source_labels:
    - __meta_kubernetes_pod_label_system
    target_label: system
  - source_labels:
    - __meta_kubernetes_pod_label_env
    target_label: env
  - source_labels:
    - __meta_kubernetes_pod_label_component
    target_label: component
  - source_labels:
    - __meta_kubernetes_pod_label_version
    target_label: version
  - source_labels:
    - __meta_kubernetes_pod_label_track
    target_label: track
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: instance
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace

beorn7 commented Apr 30, 2018

My general thoughts:

  • The config per se seems sane, as it works on 1.8/2.0.
  • The problem seems to be a missed update, as a SIGHUP gets it unstuck.
  • Perhaps it makes a difference whether the Prometheus server is in-cluster or not (assuming that @krasi-georgiev's test config uses an in-cluster Prometheus).
  • Another possible explanation for the difficulty in reproducing: perhaps this is a racy thing (not as in a data race detectable by the Go race detector, but in the more general sense: it only happens now and then if the timing of events is "right"). @krasi-georgiev, what happens if you let your test setup run for a longer time with continuous scaling up and down?

krasi-georgiev commented Apr 30, 2018

My Prometheus is running on the local machine (bare metal) and the k8s minikube runs in a VM, so it is the same as your setup.
I'm now trying with your config and will try scaling up and down continuously.


krasi-georgiev commented Apr 30, 2018

Here is the minimal deployment that works with this config. It still works as expected on my setup, so I will keep digging.

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: nginx-deployment

spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 
  template: 
    metadata:
      labels:
        app: nginx
        system: myjob
      annotations:
        prometheus.io/port: "8080"
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9

beorn7 commented May 24, 2018

Just to update the current state: this only seems to happen with large K8s clusters. The initial full load works (i.e. after starting Prometheus or after a SIGHUP, all is good), but later changes are (partially?) lost.

We couldn't recreate this in a smaller setup. It would be interesting to know if other users of large clusters (~10k pods or more) see similar problems.

I will be off work for the summer, so I cannot work on this from inside SoundCloud.

beorn7 added the priority/P1 label May 24, 2018


beorn7 commented May 24, 2018

This is a total blocker for anything involving K8s SD for us. I mark it "only" as P1 on the assumption that only large-scale K8s users are affected (although those might be the most important users ;-).

On the other hand, as long as nobody else sees the same, this might be some weird corner case we have accidentally created at SoundCloud.

FTR: Prometheus 2.0.0 works just fine in this regard.


simonpasquier commented May 25, 2018

@beorn7 is there anything interesting if you look at the kubernetes_sd metrics? Do the prometheus_sd_kubernetes_events_total counters increase when the service discovery is "stuck"?


beorn7 commented May 25, 2018

@simonpasquier hmm, I sometimes see the missing targets appearing, and then I see "add" events, but only after the pods have been up for quite a while.

Interestingly, we removed some jobs from the server yesterday, and now even the job that was most problematic in terms of SD seems fine. But I'll wait a bit longer before making a call. It just adds to the idea that the SD starts to become flaky above a certain update load.


redbaron commented May 28, 2018

Could it be related to the broken watch calls in the kube client? kubernetes/kubernetes#50102


beorn7 commented May 31, 2018

Update (sorry for still being very high level here, no time for a proper analysis):
As reported above, about a week ago we removed a bunch of jobs from our large Prometheus server, which had been encountering the K8s SD problems regularly. Ever since we did that, no further problems have happened. (The most problematic K8s job is still on the server.) This supports the vague idea that it is some kind of overload that happens with large clusters and many monitored jobs or targets. (Our cluster has 375 bare metal nodes with 10k+ pods.)

Dimensions that changed a week ago (before → after):
  • Jobs: 20 → 10
  • Targets: 640 → 500
  • Samples/sec: 40k → 16k
  • prometheus_sd_kubernetes_events_total per second (event="update"): 300 → 140


beorn7 commented May 31, 2018

In other news: I'll be off work for the next three months. While I plan some Prometheus project work, I will focus on those parts that are not so much related to my day job. So I definitely won't debug production issues at SoundCloud. @grobie is probably the best SoundClouder to take it from here, but he's probably even busier than I have been.


sta-szek commented Jun 7, 2018

Hi, I have a similar issue: when I create / delete pods, Prometheus still keeps the old ones.
It takes about 40s for a pod to be fully ready.

I have:

global:
      scrape_interval: '30s'
      scrape_timeout: '30s'

and it takes about 3h 25m for Prometheus to update the targets.

I have Prometheus deployed to a k8s instance using the Helm chart (from the official repository).
We have 203 pods across 16 nodes.

This is misleading, as we then see 0 healthy pods in Grafana.

Any ideas what I can check?


krasi-georgiev commented Jun 7, 2018

@sta-szek is it the same as @beorn7 reported, where reducing the total number of pods fixes the problem?

At the moment, without a way to reproduce this, the only real clue is this bug:
kubernetes/kubernetes#50102


sta-szek commented Jun 7, 2018

Unfortunately I cannot reduce the number of pods, but this:

curl -X POST http://prometheus.et-1.coreinfra.io/-/reload

solves the issue.

A somewhat strange workaround.
Do you think I can add some container to invoke this every 2m? Will it have any side effects?
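For illustration, such a periodic reloader could be as small as the following Go sketch (hypothetical and untested; it assumes the lifecycle endpoint is enabled with --web.enable-lifecycle and that the reload URL matches your deployment):

// Hypothetical reload sidecar: POST to the Prometheus lifecycle endpoint
// every two minutes. Assumes Prometheus runs with --web.enable-lifecycle.
package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    const reloadURL = "http://localhost:9090/-/reload" // adjust to your setup

    for range time.Tick(2 * time.Minute) {
        resp, err := http.Post(reloadURL, "", nil)
        if err != nil {
            log.Printf("reload failed: %v", err)
            continue
        }
        resp.Body.Close()
        log.Printf("reload triggered, status %s", resp.Status)
    }
}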


krasi-georgiev commented Jun 7, 2018

Reloading touches many different components, so I expect it to have some side effects. You can try it as a temporary workaround, but we want to find the real culprit.

Can you think of any way to help us replicate this?


simonpasquier commented Jun 8, 2018

@sta-szek "203 pods across 16 nodes" is a scale that the k8s SD should handle. Can you share your Prometheus configuration file? Which version do you run?

It would be helpful also if you could share the same metrics as in #4124 (comment):

  • number of jobs, sum(count(up) by (job))
  • number of targets, count(up)
  • samples/sec, sum(rate(scrape_samples_scraped[2m]))
  • k8s events/sec, rate(prometheus_sd_kubernetes_events[2m])

sta-szek commented Jun 11, 2018

sum(count(up) by (job)): 166
count(up): 166
sum(rate(scrape_samples_scraped[2m])): 11.135617427976747
rate(prometheus_sd_kubernetes_events[2m]): no data
(I'm not sure about these metrics as I am not familiar with Prometheus.)

Prometheus info:
  Version:   2.1.0
  Revision:  85f23d82a045d103ea7f3c89a91fba4a93e6367a
  Branch:    HEAD
  BuildUser: root@6e784304d3ff
  BuildDate: 20180119-12:01:23
  GoVersion: go1.9.2

What I can say in addition is that we perform deployments using the Helm tool (2.8.2).
Our k8s provider is AWS.

BTW, reloading works for us for now. No side effects.


simonpasquier commented Jun 11, 2018

@sta-szek thanks for the info! I made a typo and meant rate(prometheus_sd_kubernetes_events_total[2m]).

You may want to try with the latest version of Prometheus (v2.3.0) because it includes an update of the k8s client that could help with your issue (although it didn't work for Björn).


sta-szek commented Jun 11, 2018

Thanks
rate(prometheus_sd_kubernetes_events_total[2m]):

{event="add",instance="localhost:9090",job="prometheus",role="endpoints"} | 0.5000166672222408
{event="add",instance="localhost:9090",job="prometheus",role="node"} | 0.022222962987655144
{event="add",instance="localhost:9090",job="prometheus",role="pod"} | 6.833561118703956
{event="add",instance="localhost:9090",job="prometheus",role="service"} | 0.2444525928642066
{event="delete",instance="localhost:9090",job="prometheus",role="endpoints"} | 0
{event="delete",instance="localhost:9090",job="prometheus",role="node"} | 0
{event="delete",instance="localhost:9090",job="prometheus",role="pod"} | 0.009683656121870728
{event="delete",instance="localhost:9090",job="prometheus",role="service"} | 0
{event="update",instance="localhost:9090",job="prometheus",role="endpoints"} | 0
{event="update",instance="localhost:9090",job="prometheus",role="node"} | 0.46668222274075805
{event="update",instance="localhost:9090",job="prometheus",role="pod"} | 0.05555740746913786
{event="update",instance="localhost:9090",job="prometheus",role="service"} | 0

I'll create an issue for the update and will let you know whether it helped us or not.


cwolfinger commented Jul 18, 2018

Does anyone think there is an issue with just issuing a reload every 30 seconds or so? I have noticed numerous times that the K8s SD misses pods.


grobie commented Jul 18, 2018

Does anyone think there is an issue with just issuing a reload every 30 seconds or so?

This would be very expensive on big k8s clusters (tens of thousands of pods).


grobie commented Jul 18, 2018

After updating another two dozen of our servers to v2.3.2 (from v2.0.0), we're experiencing this issue every day, on some servers with every deploy. There is a high correlation between the number of kubernetes_sd_config jobs per server and the likelihood of the discovery getting stuck. We haven't seen the issue on any server with a single kubernetes_sd_config job so far; those work without issues.

On the other hand, on one of our servers with 14 kubernetes_sd_config jobs we can reliably reproduce the issue every time we change a single pod.


brian-brazil commented Jul 18, 2018

To narrow things down a bit: if you were to start 14 Prometheus servers with one SD each, how does the problem manifest (if at all)?


simonpasquier commented Jul 19, 2018

@grobie are the 14 jobs very similar, or do they select different namespaces?


grobie commented Jul 19, 2018

All 14 jobs use identical kubernetes_sd_configs blocks; the jobs differ only in the relabel_configs applied to them. The cluster has ~10k pods.


sta-szek commented Jul 19, 2018

Just an update:
I reduced the number of jobs by merging them into one kubernetes-common-pods job, and in our case it helped.
I went from 40 jobs down to only 17.


simonpasquier commented Jul 19, 2018

Interesting. I've got #3912 to avoid creating multiple SD mechanisms when they are identical. Maybe it can help here?

(Three further comments — brian-brazil on Jul 19, grobie on Jul 19, and brian-brazil on Jul 20, 2018 — were minimized and their content is not shown here.)


brancz commented Aug 13, 2018

I think I may have finally been able to reproduce this to some degree. I created 200 Kubernetes SD entries in my config, each discovering one service in a Kubernetes cluster, and I scaled the monitored service up and down a couple of times, from 25 to 12 to 35 pods. I can see large numbers of timeout, "no route to host", etc. errors. I have yet to debug this further, but here are the manifests I used:

prom-sd-load.tar.gz

At first sight this could be something about lock contention, but that's just intuition; we should investigate.
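For anyone who prefers not to unpack the tarball, a throwaway generator roughly like the sketch below (hypothetical, not the attached setup; the job names and the endpoints role are made up for illustration) can emit a config with many identical Kubernetes SD jobs:

// Hypothetical generator for a Prometheus config with many identical
// kubernetes_sd_configs jobs, to put load on service discovery.
package main

import "fmt"

func main() {
    const jobs = 200
    fmt.Println("scrape_configs:")
    for i := 0; i < jobs; i++ {
        fmt.Printf("- job_name: sd-load-%d\n", i)
        fmt.Println("  kubernetes_sd_configs:")
        fmt.Println("  - role: endpoints")
        fmt.Println("    namespaces:")
        fmt.Println("      names:")
        fmt.Println("      - default")
    }
}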


krasi-georgiev commented Aug 13, 2018

Great, I will deploy it and troubleshoot.


simonpasquier commented Aug 13, 2018

I deployed @brancz's configuration to my local machine (the equivalent of minikube). Even without scaling the pods up/down, I can't get Prometheus to discover all the pods if I start with 4 replicas (x 200 jobs = 800 targets in theory; I reach about 600 in practice). After a few minutes, the number of targets increases slowly, but I haven't waited long enough to see if it reaches the expected number.
With #3912, all the targets are there immediately. Will test further tomorrow.


krasi-georgiev commented Aug 15, 2018

@brancz in my case it gets stuck at some point when I change the replicas, but 2-3 minutes later it picks up again. Can you please confirm that in your case it stays stuck even when you leave it for 5-10 min?

The SD manager in Prometheus is OK and the delay comes either from the k8s provider or the k8s API server, so I will keep digging.


brancz commented Aug 15, 2018

I think the back-pressure could come just as easily from the SD infrastructure itself; only profiling will tell. Yes, I did see it catch up eventually, but only when I actually stopped scaling up and down. The report from @grobie was from a production system that simply doesn't stop, so I can see how the lag would just become larger and larger and it would seem like it never changes.


krasi-georgiev commented Aug 19, 2018

After a long weekend of troubleshooting I finally figured it out.

When scaling down or removing targets, the receiving channel is blocked until all old scrape loops are stopped.

prometheus/scrape/scrape.go

Lines 197 to 208 in 8fbe1b5

for fp, l := range sp.loops {
    wg.Add(1)
    go func(l loop) {
        l.stop()
        wg.Done()
    }(l)
    delete(sp.loops, fp)
    delete(sp.targets, fp)
}
wg.Wait()

This in turn blocks the send to the sync channel here:

case <-ticker.C:
    m.recentlyUpdatedMtx.Lock()
    if m.recentlyUpdated {
        m.syncCh <- m.allGroups()
        m.recentlyUpdated = false
    }
    m.recentlyUpdatedMtx.Unlock()

And since we keep holding the m.recentlyUpdatedMtx lock during that send, the target update from each provider is also blocked:

case tgs, ok := <-updates:
    // Handle the case that a target provider exits and closes the channel
    // before the context is done.
    if !ok {
        return
    }
    m.updateGroup(poolKey, tgs)
    m.recentlyUpdatedMtx.Lock()
    m.recentlyUpdated = true
    m.recentlyUpdatedMtx.Unlock()
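To make the chain of blocking concrete, here is a minimal, self-contained Go sketch (not Prometheus code; the names syncCh and recentlyUpdated are only borrowed for illustration): a slow receiver, plus a send performed while holding a mutex, stalls every provider update that needs the same mutex.

// Standalone illustration (not Prometheus code) of the blocking chain:
// a slow consumer of syncCh delays the ticker's send, the ticker holds the
// mutex across that send, and provider updates queue up behind the mutex.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mtx sync.Mutex
    recentlyUpdated := true
    syncCh := make(chan string) // unbuffered, like m.syncCh

    // Slow consumer: stands in for the scrape manager, which first stops all
    // old scrape loops (wg.Wait()) before it can receive the next update.
    go func() {
        for groups := range syncCh {
            time.Sleep(2 * time.Second) // simulate stopping old scrape loops
            fmt.Println("applied", groups)
        }
    }()

    // Ticker loop: sends on syncCh while holding the mutex, so it keeps the
    // lock for as long as the consumer is busy.
    go func() {
        for range time.Tick(500 * time.Millisecond) {
            mtx.Lock()
            if recentlyUpdated {
                syncCh <- "all groups" // blocks while the consumer is slow
                recentlyUpdated = false
            }
            mtx.Unlock()
        }
    }()

    // Provider updates: each needs the same mutex, so they queue up behind
    // the blocked send above instead of being recorded immediately.
    for i := 0; i < 5; i++ {
        start := time.Now()
        mtx.Lock()
        recentlyUpdated = true
        mtx.Unlock()
        fmt.Printf("provider update %d recorded after %v\n", i, time.Since(start))
        time.Sleep(300 * time.Millisecond)
    }
}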

The second problem is that when we process the changes in all targetSets/jobs, we read all targets for all jobs and then try to process each one. If new updates arrive in the middle of this processing, we are already working on old entries, so Prometheus needs to run yet another pass to stop/start scrape loops.

I have managed to solve the first problem, but am still trying a few things for the second one.
My idea is to break the scrape loop updates into batches so that the scrape manager can quickly catch up. Instead of reading all job updates at once, I am thinking of reading and processing one update, then reading the next, so that if any changes happened in between, it will already see the most up-to-date data.

I will also look at the old implementation from before the refactoring and figure out why it didn't suffer from this problem.
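As a purely illustrative aside (one possible reading of the batching idea, not the actual patch that came out of this issue): a generic "skip stale snapshots" pattern lets the consumer always act on the freshest data by draining the channel before each apply.

// Sketch of a coalescing consumer: drain any queued snapshots and apply only
// the most recent one. The types and names here are invented for the sketch.
package main

import "fmt"

func applyLatest(updates <-chan map[string][]string, apply func(map[string][]string)) {
    for tgs := range updates {
    drain:
        for {
            select {
            case newer, ok := <-updates:
                if !ok {
                    break drain
                }
                tgs = newer // skip ahead to the newest queued snapshot
            default:
                break drain
            }
        }
        apply(tgs)
    }
}

func main() {
    updates := make(chan map[string][]string, 8)
    // Queue three snapshots; only the last one should be applied.
    updates <- map[string][]string{"job1": {"t1"}}
    updates <- map[string][]string{"job1": {"t1", "t2"}}
    updates <- map[string][]string{"job1": {"t1", "t2", "t3"}}
    close(updates)
    applyLatest(updates, func(tgs map[string][]string) {
        fmt.Println("applied", tgs)
    })
}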


lock bot commented Mar 25, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 25, 2019
