Prometheus not scraping metrics `adding stale sample failed` #4249

Open · anandsinghkunwar opened this issue Jun 11, 2018 · 20 comments

anandsinghkunwar commented Jun 11, 2018

Bug Report

What did you do?
I have set up Prometheus v2.2.1 using the Prometheus Operator on my Kubernetes cluster, and I have deployed node-exporter and kube-state-metrics for cluster monitoring. Sample ingestion has been failing on and off for hours at a time. I use a 30s evaluation interval. I suspect this is somehow related to #2894, but that appears to be fixed. I run 2 Prometheus instances for HA, and only one of them is facing this issue. It doesn't seem to be a memory, CPU, or network problem, as other pods on that node are working fine.

What did you expect to see?
I expected Prometheus to scrape every 30s as usual.

What did you see instead? Under which circumstances?
Scraping has been going on and off every few hours: off for hours, then on for an hour, and so on all weekend.

  • System information:

    Linux 3.10.0-693.el7.x86_64 x86_64

  • Prometheus version:

    v2.2.1

  • Logs:

736851:level=warn ts=2018-06-11T06:33:33.91862231Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"instance:node_cpu:ratio\", endpoint=\"https\", instance=\"172.16.117.2:9100\", job=\"node-exporter\", namespace=\"monitoring\", pod=\"node-exporter-d9cww\", service=\"node-exporter\"}" err="out of bounds"
736852:level=warn ts=2018-06-11T06:33:33.918633374Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"instance:node_cpu:ratio\", endpoint=\"https\", instance=\"172.16.73.85:9100\", job=\"node-exporter\", namespace=\"monitoring\", pod=\"node-exporter-zcvfq\", service=\"node-exporter\"}" err="out of bounds"
736853:level=warn ts=2018-06-11T06:33:33.918644683Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"instance:node_cpu:ratio\", endpoint=\"https\", instance=\"172.16.143.2:9100\", job=\"node-exporter\", namespace=\"monitoring\", pod=\"node-exporter-fbjs6\", service=\"node-exporter\"}" err="out of bounds"
736854:level=warn ts=2018-06-11T06:33:33.918655824Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"instance:node_cpu:ratio\", endpoint=\"https\", instance=\"172.16.102.9:9100\", job=\"node-exporter\", namespace=\"monitoring\", pod=\"node-exporter-28q2r\", service=\"node-exporter\"}" err="out of bounds"
736855:level=warn ts=2018-06-11T06:33:33.918666712Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"instance:node_cpu:ratio\", endpoint=\"https\", instance=\"172.16.41.48:9100\", job=\"node-exporter\", namespace=\"monitoring\", pod=\"node-exporter-94nh9\", service=\"node-exporter\"}" err="out of bounds"
736856:level=warn ts=2018-06-11T06:33:33.918684142Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"instance:node_cpu:ratio\", endpoint=\"https\", instance=\"172.16.19.43:9100\", job=\"node-exporter\", namespace=\"monitoring\", pod=\"node-exporter-r9ss6\", service=\"node-exporter\"}" err="out of bounds"
736859:level=warn ts=2018-06-11T06:33:34.573197797Z caller=manager.go:393 component="rule manager" group=node.rules msg="adding stale sample failed" sample="{__name__=\"cluster:node_cpu:sum_rate5m\"}" err="out of bounds"
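
For reference, a minimal sketch of an alerting rule that the healthy HA replica could evaluate to catch a partner that has stopped ingesting samples. prometheus_tsdb_head_samples_appended_total is a standard Prometheus 2.x self-metric; the job="prometheus" selector is an assumption that matches the relabelling in the configuration shared further down.

groups:
- name: prometheus-ingestion
  rules:
  - alert: PrometheusNotIngestingSamples
    # Fires when a scraped Prometheus replica reports that its head block
    # has appended no samples at all over the last 10 minutes.
    expr: rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[10m]) == 0
    for: 10m
    labels:
      severity: warning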

brian-brazil commented Jun 11, 2018

Is the time on your machine correct? If it's turning on and off for an hour at a time, I'm smelling something weird happening on your machine with timezones.


anandsinghkunwar commented Jun 11, 2018

@brian-brazil I checked; the time seems correct on all the nodes of the cluster. However, the timezone of the cluster is EDT while that of my pods is UTC. Can that be an issue? It had been working fine for a while; I'm not sure why only one of the two Prometheus instances suddenly started exhibiting this.


brian-brazil commented Jun 11, 2018

No, Prometheus itself ignores timezones completely. If only one of them has the issue, it's likely something up with that machine.


anandsinghkunwar commented Jun 11, 2018

I verified the time on all the machines; it doesn't seem to be wrong.


brian-brazil commented Jun 11, 2018

Can you share your full configuration?


anandsinghkunwar commented Jun 11, 2018

It's quite large. I'm also attaching a graph indicative of how Prometheus was responding; the straight lines indicate when it was up. Link Here

global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: monitoring/k8s
    prometheus_replica: $(POD_NAME)
rule_files:
- /etc/prometheus/config_out/rules/rules-0/*
- /etc/prometheus/config_out/rules/rules-1/*
- /etc/prometheus/config_out/rules/rules-2/*
- /etc/prometheus/config_out/rules/rules-3/*
- /etc/prometheus/config_out/rules/rules-4/*
- /etc/prometheus/config_out/rules/rules-5/*
- /etc/prometheus/config_out/rules/rules-6/*
- /etc/prometheus/config_out/rules/rules-7/*
- /etc/prometheus/config_out/rules/rules-8/*
- /etc/prometheus/config_out/rules/rules-9/*
scrape_configs:
- job_name: monitoring/apiserver/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  scrape_interval: 30s
  scheme: https
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_component
    regex: apiserver
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_provider
    regex: kubernetes
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_component
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https
- job_name: monitoring/atlas-api/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - atlas-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: atlas-api
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-api
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: http-api
- job_name: monitoring/kube-state-metrics/0
  honor_labels: true
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: kube-state-metrics
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https-main
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https-main
- job_name: monitoring/kube-state-metrics/1
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: kube-state-metrics
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https-self
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https-self
- job_name: monitoring/node-exporter/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: node-exporter
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https
- job_name: monitoring/etcd-k8s/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scheme: http
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: etcd
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: api
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: api
- job_name: monitoring/kube-controller-manager/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: kube-controller-manager
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/kube-dns/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: kube-dns
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics-skydns
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics-skydns
- job_name: monitoring/kube-dns/1
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: kube-dns
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics-dnsmasq
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics-dnsmasq
- job_name: monitoring/kube-scheduler/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: kube-scheduler
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/kubelet/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kubelet
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https-metrics
- job_name: monitoring/kubelet/1
  honor_labels: true
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  metrics_path: /metrics/cadvisor
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kubelet
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https-metrics
- job_name: monitoring/metallb/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - metallb-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: metallb
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: monitoring
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: monitoring
- job_name: monitoring/prometheus/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: prometheus
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: web
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: web
- job_name: monitoring/prometheus-operator/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: prometheus-operator
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http
- job_name: monitoring/pushgateway/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: pushgateway
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: web
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: web
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers:
  - path_prefix: /
    scheme: http
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    relabel_configs:
    - action: keep
      source_labels:
      - __meta_kubernetes_service_name
      regex: alertmanager-main
    - action: keep
      source_labels:
      - __meta_kubernetes_endpoint_port_name
      regex: web

brian-brazil commented Jun 11, 2018

What does up look like? Is this only for one instance? Can you share a graph of node_time_seconds for this instance?

As an aside, the pushgateway job should have honor_labels: true, but nothing else should.
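
For illustration, a minimal sketch of how the pushgateway job from the configuration above could look with that change; apart from the honor_labels line (and the relabel rules trimmed for brevity), everything is taken from the posted config.

- job_name: monitoring/pushgateway/0
  honor_labels: true   # changed from false so job/instance labels pushed to the gateway win over the scrape's own labels
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: pushgateway
  # ... remaining relabel rules unchanged from the original job ...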


anandsinghkunwar commented Jun 11, 2018

It is up constantly. I couldn't find node_time_seconds, but there is a node_time, so I'm sharing that here.
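
(node_time was renamed to node_time_seconds in node_exporter 0.16.0, so on an older node_exporter only node_time exists.) A rough rules-file sketch for spotting clock skew with the older metric name; the 60s threshold is an arbitrary choice meant to sit well above the 30s scrape interval:

groups:
- name: node-clock
  rules:
  - alert: NodeClockSkew
    # node_time is the node's wall clock at scrape time, while time() is the
    # rule evaluation timestamp, so some offset (scrape interval plus
    # evaluation lag) is expected even when clocks are healthy.
    expr: abs(time() - node_time) > 60
    for: 5m
    labels:
      severity: warning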


brian-brazil commented Jun 11, 2018

That doesn't make sense; stale markers won't be produced if up=1 all the time. Are you sure it's always 1?


anandsinghkunwar commented Jun 11, 2018

You mean up from one instance (the Prometheus that is scraping) to the other (the Prometheus that is not working), right? Graph here


brian-brazil commented Jun 11, 2018

I mean the up of the node exporter mentioned in the log.


anandsinghkunwar commented Jun 11, 2018

It isn't just my node_exporter metrics going on and off; it is Prometheus itself that is either scraping all targets or none. The second Prometheus is able to scrape the node_exporter, so that target is indeed up. Interestingly, there is high load on the particular node hosting the Prometheus that is going on and off. FYI, there are 20 cores on that node. Graph here


brian-brazil commented Jun 11, 2018

At this point I'm suspecting a dodgy CPU in the Prometheus machine that's messing up timestamps. Does this still happen if the Prometheus is moved to another machine?


anandsinghkunwar commented Jun 11, 2018

I believe the problem will stop, but we still won't be able to pinpoint the issue. Why can't Prometheus just wait in the queue like the other processes if the CPU is the bottleneck (which I don't think it is, as we have enough cores)?


brian-brazil commented Jun 11, 2018

The information you have provided does not support that conclusion. The gaps would have to be much larger for that to make sense.


anandsinghkunwar commented Jun 18, 2018

Another observation: when I killed the pod so it would restart somewhere else, it took about 40 minutes to die.


vears91 commented Feb 8, 2019

I'm also running into this, running Prometheus 2.7.1 on Kubernetes 1.13.2 in AWS. Prometheus stops scraping all targets, I get a heartbeat-lost alert from my alerting system, and the container cannot be killed while this is happening.

When this happens, I see in the logs that a configuration reload begins and takes far longer than usual to complete, and I see the out-of-bounds errors described here for my recording rules. In this particular example the reload begins at 22:24 and completes at 22:36, but it can also take hours. Scraping can likewise stop for a few minutes or for hours when this happens.

level=info ts=2019-02-07T22:24:31.045611017Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=error ts=2019-02-07T22:36:23.772159922Z caller=scrape.go:913 component="scrape manager" scrape_pool=cassandra target=http://10.40.57.59:7073/metrics msg="stale report failed" err="out of bounds"
level=error ts=2019-02-07T22:36:23.77249718Z caller=scrape.go:913 component="scrape manager" scrape_pool=cassandra target=http://10.40.84.193:7073/metrics msg="stale report failed" err="out of bounds"
level=error ts=2019-02-07T22:36:23.833443455Z caller=scrape.go:913 component="scrape manager" scrape_pool=cassandra target=http://10.40.71.250:7073/metrics msg="stale report failed" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.842644688Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/service-level-prometheus msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:objective:ratio\", name=\"Prometheus Availability\", owner=\"infrastructure\", service=\"prometheus\", type=\"uptime\"} => 0.999 @[1549577470398]"
level=warn ts=2019-02-07T22:36:23.842827779Z caller=manager.go:527 component="rule manager" group=monitoring-prometheus/service-level-prometheus msg="adding stale sample failed" sample="{__name__=\"service_level:objective:ratio\", name=\"Prometheus Availability\", owner=\"infrastructure\", service=\"prometheus\", type=\"uptime\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.856548193Z caller=manager.go:504 component="rule manager" group=main/service-level-1d-main msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:agreement:ratio\", name=\"Search API latency\", owner=\"data\", service=\"main\", type=\"latency\"} => 0.95 @[1549577502459]"
level=warn ts=2019-02-07T22:36:23.856614682Z caller=manager.go:527 component="rule manager" group=main/service-level-1d-main msg="adding stale sample failed" sample="{__name__=\"service_level:agreement:ratio\", name=\"Search API latency\", owner=\"data\", service=\"main\", type=\"latency\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.864852678Z caller=manager.go:504 component="rule manager" group=main/service-level-1d-main msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:indicator:ratio\", name=\"Search API availability\", owner=\"data\", range=\"1d\", service=\"main\", type=\"availability\"} => 1 @[1549577502459]"
level=warn ts=2019-02-07T22:36:23.86491161Z caller=manager.go:527 component="rule manager" group=main/service-level-1d-main msg="adding stale sample failed" sample="{__name__=\"service_level:indicator:ratio\", name=\"Search API availability\", owner=\"data\", range=\"1d\", service=\"main\", type=\"availability\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.86499189Z caller=manager.go:504 component="rule manager" group=main/service-level-1d-main msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:objective:ratio\", name=\"Search API availability\", owner=\"data\", service=\"main\", type=\"availability\"} => 0.99 @[1549577502459]"
level=warn ts=2019-02-07T22:36:23.865019359Z caller=manager.go:527 component="rule manager" group=main/service-level-1d-main msg="adding stale sample failed" sample="{__name__=\"service_level:objective:ratio\", name=\"Search API availability\", owner=\"data\", service=\"main\", type=\"availability\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.86506904Z caller=manager.go:504 component="rule manager" group=main/service-level-1d-main msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:agreement:ratio\", name=\"Search API availability\", owner=\"data\", service=\"main\", type=\"availability\"} => 0.95 @[1549577502459]"
level=warn ts=2019-02-07T22:36:23.865092392Z caller=manager.go:527 component="rule manager" group=main/service-level-1d-main msg="adding stale sample failed" sample="{__name__=\"service_level:agreement:ratio\", name=\"Search API availability\", owner=\"data\", service=\"main\", type=\"availability\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.865699145Z caller=manager.go:504 component="rule manager" group=main/service-level-1d-main msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:indicator:ratio\", latency=\"0.5\", name=\"Main API success minute\", owner=\"redacted\", percentile=\"95\", range=\"1d\", service=\"main\", type=\"latency\"} => 1 @[1549577502459]"
level=warn ts=2019-02-07T22:36:23.865748873Z caller=manager.go:527 component="rule manager" group=main/service-level-1d-main msg="adding stale sample failed" sample="{__name__=\"service_level:indicator:ratio\", latency=\"0.5\", name=\"Main API success minute\", owner=\"redacted\", percentile=\"95\", range=\"1d\", service=\"main\", type=\"latency\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.865815277Z caller=manager.go:504 component="rule manager" group=main/service-level-1d-main msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:objective:ratio\", name=\"Main API success minute\", owner=\"redacted\", service=\"main\", type=\"latency\"} => 0.99 @[1549577502459]"
level=warn ts=2019-02-07T22:36:23.865843342Z caller=manager.go:527 component="rule manager" group=main/service-level-1d-main msg="adding stale sample failed" sample="{__name__=\"service_level:objective:ratio\", name=\"Main API success minute\", owner=\"redacted\", service=\"main\", type=\"latency\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.867832043Z caller=manager.go:504 component="rule manager" group=ingress-nginx/service-level-ingress msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:indicator:ratio\", alias=\"kubernetes-internal-ingress\", instance=\"https://ingress\", job=\"probe-kubernetes-internal-ingress\", maintainer=\"infrastructure\", monitoring=\"up,latency\", name=\"Internal Ingress Availability\", owner=\"infrastructure\", prober=\"blackbox-exporter:9115\", range=\"1d\", service=\"nginx-ingress\", type=\"uptime\"} => 0.9994411846884604 @[1549577506973]"
level=warn ts=2019-02-07T22:36:23.86793513Z caller=manager.go:527 component="rule manager" group=ingress-nginx/service-level-ingress msg="adding stale sample failed" sample="{__name__=\"service_level:indicator:ratio\", alias=\"kubernetes-internal-ingress\", instance=\"https://ingress\", job=\"probe-kubernetes-internal-ingress\", maintainer=\"infrastructure\", monitoring=\"up,latency\", name=\"Internal Ingress Availability\", owner=\"infrastructure\", prober=\"blackbox-exporter:9115\", range=\"1d\", service=\"nginx-ingress\", type=\"uptime\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.870676391Z caller=manager.go:504 component="rule manager" group=ingress-nginx/service-level-ingress msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:indicator:ratio\", alias=\"kubernetes-internal-ingress\", instance=\"https://ingress\", job=\"probe-kubernetes-internal-ingress\", maintainer=\"infrastructure\", monitoring=\"up,latency\", name=\"Internal Ingress Availability\", owner=\"infrastructure\", prober=\"blackbox-exporter:9115\", range=\"28d\", service=\"nginx-ingress\", type=\"uptime\"} => 0.9994411846884604 @[1549577506973]"
level=warn ts=2019-02-07T22:36:23.870734693Z caller=manager.go:527 component="rule manager" group=ingress-nginx/service-level-ingress msg="adding stale sample failed" sample="{__name__=\"service_level:indicator:ratio\", alias=\"kubernetes-internal-ingress\", instance=\"https://ingress\", job=\"probe-kubernetes-internal-ingress\", maintainer=\"infrastructure\", monitoring=\"up,latency\", name=\"Internal Ingress Availability\", owner=\"infrastructure\", prober=\"blackbox-exporter:9115\", range=\"28d\", service=\"nginx-ingress\", type=\"uptime\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.939007439Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/infrastructure-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"ALERTS\", alertname=\"K8sControlPlaneControllerMgrUnavailable\", alertstate=\"pending\", container=\"kube-controller-manager\", severity=\"urgent\", who=\"infrastructure\"} => 1 @[1549577800274]"
level=warn ts=2019-02-07T22:36:23.939240955Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/infrastructure-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"K8sControlPlaneControllerMgrUnavailable\", container=\"kube-controller-manager\", severity=\"urgent\", who=\"infrastructure\"} => 1549577800 @[1549577800274]"
level=warn ts=2019-02-07T22:36:23.939750852Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/infrastructure-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"ALERTS\", alertname=\"K8sControlPlaneSchedulerUnavailable\", alertstate=\"pending\", container=\"kube-scheduler\", severity=\"urgent\", who=\"infrastructure\"} => 1 @[1549577800274]"
level=warn ts=2019-02-07T22:36:23.939961872Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/infrastructure-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"K8sControlPlaneSchedulerUnavailable\", container=\"kube-scheduler\", severity=\"urgent\", who=\"infrastructure\"} => 1549577800 @[1549577800274]"
level=warn ts=2019-02-07T22:36:23.940167938Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/service-level-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:objective:ratio\", name=\"Kubernetes API latency\", owner=\"infrastructure\", service=\"k8s-api-server\", type=\"latency\"} => 0.999 @[1549577480392]"
level=warn ts=2019-02-07T22:36:23.940333748Z caller=manager.go:527 component="rule manager" group=monitoring-prometheus/service-level-kubernetes msg="adding stale sample failed" sample="{__name__=\"service_level:objective:ratio\", name=\"Kubernetes API latency\", owner=\"infrastructure\", service=\"k8s-api-server\", type=\"latency\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.940890755Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/infrastructure-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"ALERTS\", alertname=\"KubeStateMetricsMissing\", alertstate=\"pending\", severity=\"urgent\", who=\"infrastructure\"} => 1 @[1549577800274]"
level=warn ts=2019-02-07T22:36:23.941219121Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/infrastructure-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"ALERTS_FOR_STATE\", alertname=\"KubeStateMetricsMissing\", severity=\"urgent\", who=\"infrastructure\"} => 1549577800 @[1549577800274]"
level=warn ts=2019-02-07T22:36:23.941397854Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/service-level-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:indicator:ratio\", name=\"Kubernetes API latency\", owner=\"infrastructure\", range=\"1d\", service=\"k8s-api-server\", type=\"latency\"} => 0.9996115669175366 @[1549577480392]"
level=warn ts=2019-02-07T22:36:23.941557173Z caller=manager.go:527 component="rule manager" group=monitoring-prometheus/service-level-kubernetes msg="adding stale sample failed" sample="{__name__=\"service_level:indicator:ratio\", name=\"Kubernetes API latency\", owner=\"infrastructure\", range=\"1d\", service=\"k8s-api-server\", type=\"latency\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.944663053Z caller=manager.go:504 component="rule manager" group=monitoring-prometheus/service-level-kubernetes msg="Rule evaluation result discarded" err="out of bounds" sample="{__name__=\"service_level:indicator:ratio\", name=\"Kubernetes API latency\", owner=\"infrastructure\", range=\"28d\", service=\"k8s-api-server\", type=\"latency\"} => 0.9996115669175366 @[1549577480392]"
level=warn ts=2019-02-07T22:36:23.944825493Z caller=manager.go:527 component="rule manager" group=monitoring-prometheus/service-level-kubernetes msg="adding stale sample failed" sample="{__name__=\"service_level:indicator:ratio\", name=\"Kubernetes API latency\", owner=\"infrastructure\", range=\"28d\", service=\"k8s-api-server\", type=\"latency\"}" err="out of bounds"
level=warn ts=2019-02-07T22:36:23.95601373Z caller=manager.go:511 component="rule manager" group=ingress-nginx/service-level-ingress msg="Error on ingesting out-of-order result from rule evaluation" numDropped=1
level=warn ts=2019-02-07T22:36:25.107700089Z caller=manager.go:511 component="rule manager" group=monitoring-prometheus/service-level-1d-kubernetes msg="Error on ingesting out-of-order result from rule evaluation" numDropped=1
level=warn ts=2019-02-07T22:36:25.107815677Z caller=manager.go:511 component="rule manager" group=monitoring-prometheus/service-level-1d-kubernetes msg="Error on ingesting out-of-order result from rule evaluation" numDropped=1
level=info ts=2019-02-07T22:36:26.510687304Z caller=compact.go:443 component=tsdb msg="write block" mint=1549575000000 maxt=1549576800000 ulid=01D351NPAQBKAVE78QTFFZT0PD
level=info ts=2019-02-07T22:36:27.665599636Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-07T22:36:27.666314806Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-07T22:36:27.667005222Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-07T22:36:27.66763384Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-07T22:36:27.668344484Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-07T22:36:27.669120786Z caller=kubernetes.go:201 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-07T22:36:27.698992571Z caller=main.go:722 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
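
The "heartbeat lost" alert mentioned above is commonly implemented as an always-firing rule whose absence is what actually pages; a minimal sketch of that pattern (the Watchdog name and severity label are arbitrary):

groups:
- name: meta
  rules:
  - alert: Watchdog
    # Always firing; an external system alerts when this stops arriving,
    # which is how a Prometheus that has wedged completely still gets noticed.
    expr: vector(1)
    labels:
      severity: none
    annotations:
      summary: Alerting-pipeline heartbeat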

krasi-georgiev commented Feb 27, 2019

@vears91 did you find the culprit for your issue?


vears91 commented Feb 28, 2019

@krasi-georgiev I still see the issue with scraping stopping. It was happening very often, up to once a day, after we upgraded our Kubernetes networking component (kube-router). We downgraded to a previous version, but it still happens from time to time, maybe once a week. As described in #4736, scraping stops but the web UI is still accessible. The container can't be killed while this is happening.


krasi-georgiev commented Feb 28, 2019

Would you mind starting the container with the env var DEBUG=1? This enables some extra profiling, so that when it stops scraping you can run

promtool debug all SERVER-IP

promtool is included in the release tar archive.

The profile doesn't include any sensitive data, so you can attach it here.

It might also be worth starting Prometheus with the extra flag --log.level=debug, which enables debug logs.

DEBUG=1 ./prometheus --log.level=debug ...
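
For anyone running Prometheus as a plain Kubernetes workload rather than through the operator, a hypothetical container-spec fragment with both of those settings applied (the image tag and config path are illustrative only):

containers:
- name: prometheus
  image: prom/prometheus:v2.7.1
  env:
  - name: DEBUG          # enables the extra runtime profiling mentioned above
    value: "1"
  args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --log.level=debug    # verbose logging, as suggested above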
