Prometheus hangs and crashes - without any error #5205

Closed
arielb135 opened this Issue Feb 12, 2019 · 2 comments

@arielb135

arielb135 commented Feb 12, 2019

Bug Report

What did you do?
Started Prometheus in a Dockerized EKS environment.

What did you expect to see?
Prometheus working

What did you see instead? Under which circumstances?
After a while, Prometheus stopped responding and the readiness probe started returning 503.
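
For context, the readiness probe here is the standard HTTP check against Prometheus' /-/ready endpoint, which returns 503 until (or unless) the server is ready. A minimal sketch of the probe as prometheus-operator typically renders it; the port name and timing values below are assumptions, not copied from this cluster:

readinessProbe:
  httpGet:
    path: /-/ready   # Prometheus' built-in readiness endpoint
    port: web        # assumed port name
  periodSeconds: 5   # assumed timing values
  failureThreshold: 6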

Environment
EKS 1.10 environment, m5.large Linux EKS-optimized machines.

  • System information:

    Linux 4.14.72-73.55.amzn2.x86_64 x86_64

  • Prometheus version:

    prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4)
    build user: root@1e42b46043e9
    build date: 20181004-08:42:02
    go version: go1.11.1

  • Prometheus configuration file:

global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: default/prometheus-prometheus-oper-prometheus
    prometheus_replica: $(POD_NAME)
rule_files:
- /etc/prometheus/rules/prometheus-prometheus-prometheus-oper-prometheus-rulefiles-0/*.yaml
scrape_configs:
- job_name: .... [list of jobs] ...
...
...
...
- job_name: default/prometheus-prometheus-oper-prometheus/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  scrape_interval: 30s
  metrics_path: /metrics
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: prometheus-operator-prometheus
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_release
    regex: prometheus
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: web
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Node;(.*)
    replacement: ${1}
    target_label: node
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: web
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers:
  - path_prefix: /
    scheme: http
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - default
    relabel_configs:
    - action: keep
      source_labels:
      - __meta_kubernetes_service_name
      regex: prometheus-prometheus-oper-alertmanager
    - action: keep
      source_labels:
      - __meta_kubernetes_endpoint_port_name
      regex: web

  • Logs:
level=info ts=2019-02-12T14:17:52.927449311Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2019-02-12T14:17:52.927519895Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2019-02-12T14:17:52.927543989Z caller=main.go:240 host_details="(Linux 4.14.72-73.55.amzn2.x86_64 #1 SMP Thu Sep 27 23:37:24 UTC 2018 x86_64 prometheus-prometheus-prometheus-oper-prometheus-0 (none))"
level=info ts=2019-02-12T14:17:52.927570408Z caller=main.go:241 fd_limits="(soft=65536, hard=65536)"
level=info ts=2019-02-12T14:17:52.927586039Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-02-12T14:17:52.928830818Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2019-02-12T14:17:52.929064882Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-02-12T14:17:52.929596536Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1549951200000 maxt=1549958400000 ulid=01D3GF3Y0HJG9FFBTY2VQZVKTZ
level=info ts=2019-02-12T14:17:52.930139904Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1549958400000 maxt=1549965600000 ulid=01D3GNXCGB808Y3CJHSCMWT412
level=info ts=2019-02-12T14:17:52.930659145Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1549900800000 maxt=1549951200000 ulid=01D3GP6DNAS46EWTBFHN02SMJF
level=info ts=2019-02-12T14:17:52.936288664Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1549965600000 maxt=1549972800000 ulid=01D3GWNWTX35AP7T5JNCXHT2E3
level=warn ts=2019-02-12T14:20:49.3252457Z caller=head.go:371 component=tsdb msg="unknown series references" count=38
level=info ts=2019-02-12T14:20:49.825530024Z caller=main.go:564 msg="TSDB started"
level=info ts=2019-02-12T14:20:49.825592839Z caller=main.go:624 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-02-12T14:20:50.020945292Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-12T14:20:50.022330046Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-02-12T14:20:50.023358726Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=debug ts=2019-02-12T14:20:50.024323072Z caller=manager.go:158 component="discovery manager scrape" msg="Starting provider" provider=*kubernetes.SDConfig/0 subs="[default/prometheus-prometheus-oper-kubelet/0 default/prometheus-prometheus-oper-kubelet/1 default/prometheus-prometheus-oper-kube-controller-manager/0 default/prometheus-prometheus-oper-coredns/0 default/prometheus-prometheus-oper-kube-etcd/0 default/prometheus-prometheus-oper-kube-scheduler/0]"
level=debug ts=2019-02-12T14:20:50.024380726Z caller=manager.go:158 component="discovery manager scrape" msg="Starting provider" provider=*kubernetes.SDConfig/1 subs="[default/handler-hash-musl-tihandler/0 default/hash-high-priority-hashtihandler/0 default/carabbitmq/0 default/handler-hash-high-priority-tihandler/0 default/handler-hash-low-priority-tihandler/0]"
level=debug ts=2019-02-12T14:20:50.024440117Z caller=manager.go:158 component="discovery manager scrape" msg="Starting provider" provider=*kubernetes.SDConfig/2 subs="[default/prometheus-prometheus-oper-node-exporter/0 default/prometheus-prometheus-oper-operator/0 default/prometheus-prometheus-oper-alertmanager/0 default/prometheus-prometheus-oper-kube-state-metrics/0 default/prometheus-prometheus-oper-apiserver/0 default/prometheus-prometheus-oper-prometheus/0]"
level=info ts=2019-02-12T14:20:50.024635445Z caller=kubernetes.go:187 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=debug ts=2019-02-12T14:20:50.025445818Z caller=manager.go:158 component="discovery manager notify" msg="Starting provider" provider=*kubernetes.SDConfig/0 subs=[1750fb4d52d6873959dd9d03a4ad00c2]
level=info ts=2019-02-12T14:20:50.42231141Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-02-12T14:20:50.422354785Z caller=main.go:523 msg="Server is ready to receive web requests."
level=debug ts=2019-02-12T14:20:55.128589704Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:20:55.128814777Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:00.12858519Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:21:05.128583327Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:10.621834266Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:16.428125437Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:20.518182244Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:30.51821032Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:35.021527076Z caller=manager.go:568 component="rule manager" group=kubernetes-apps msg="'for' state restored" alertname=KubePodCrashLooping restored_time="Tuesday, 12-Feb-19 14:14:05 UTC" labels="{alertname=\"KubePodCrashLooping\", container=\"prometheus\", endpoint=\"http\", instance=\"172.31.38.209:8080\", job=\"kube-state-metrics\", namespace=\"default\", pod=\"prometheus-prometheus-prometheus-oper-prometheus-0\", service=\"prometheus-kube-state-metrics\", severity=\"critical\"}"
level=debug ts=2019-02-12T14:21:35.128566381Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:21:40.128552371Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:21:45.128526776Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:21:50.128550194Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:21:55.128585434Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:00.128576219Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:05.128642968Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:10.128540954Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:15.128626726Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:20.128569998Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:22:25.128632028Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:22:30.128577993Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:22:35.128555106Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:40.128596027Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/0
level=debug ts=2019-02-12T14:22:45.718399735Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1
level=debug ts=2019-02-12T14:22:50.618122499Z caller=manager.go:204 component="discovery manager scrape" msg="discovery receiver's channel was full so will retry the next cycle" provider=*kubernetes.SDConfig/1

Also, in Kubernetes I see a crash, but only the above logs exist.
[screenshot: the Prometheus pod crash as shown in Kubernetes]

@simonpasquier

Member

simonpasquier commented Feb 12, 2019

Can you try again with the latest version (v2.7.1)? #4526 might help here.
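
For anyone hitting this, a minimal sketch of how such an upgrade could be applied when the instance is managed by prometheus-operator, using the resource name visible in the external_labels above (the field layout is an assumption about this setup, not taken from the reporter's manifests):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-prometheus-oper-prometheus   # name taken from the external_labels above
  namespace: default
spec:
  version: v2.7.1   # the version suggested above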

@arielb135

Author

arielb135 commented Feb 14, 2019

Actually, my bad: the container's memory limit was 1 GB and Prometheus actually used the full 1 GB, so it was OOM-killed. Closing.
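
A minimal sketch of the kind of change that addresses this when the instance is managed by prometheus-operator (resource names from the config above; the 2Gi figure is a hypothetical value, size it to observed usage plus headroom):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-prometheus-oper-prometheus
  namespace: default
spec:
  resources:
    requests:
      memory: 2Gi   # hypothetical value
    limits:
      memory: 2Gi   # keep above actual working-set usage to avoid OOM kills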

@arielb135 arielb135 closed this Feb 14, 2019
