
K8s Prometheus container gets OOMKilled every 5-10 minutes #5019

Closed
alanh0vx opened this Issue Dec 20, 2018 · 7 comments


alanh0vx commented Dec 20, 2018


Bug Report

What did you do?
Running a Prometheus container (v2.5.0), deployed with Kubernetes.

The cluster has 5 nodes, with around 80 pods running.

What did you expect to see?
Prometheus running smoothly

What did you see instead? Under which circumstances?
The Prometheus container gets OOMKilled every 5-10 minutes.

Environment
prometheus v2.5.0

k8s: v1.11.0

cluster node: 32G RAM

cluster node: Linux 4.4.0-139-generic x86_64

  • Prometheus version:
prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
  build user:       root@578ab108d0b9
  build date:       20181106-11:40:44
  go version:       go1.11.1
  • Alertmanager version:
alertmanager, version 0.12.0 (branch: HEAD, revision: fc33cc78036f82ef8d4734c197a96f7cb6c952a3)
  build user:       root@c9169eb10d06
  build date:       20171215-14:13:20
  go version:       go1.9.2
  • Prometheus configuration file:
global:
  evaluation_interval: 60s
  scrape_interval: 60s
  external_labels: {}
rule_files:
- /etc/prometheus/rules/rules-0/*
scrape_configs:
- job_name: monitoring/alertmanager/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_alertmanager
    regex: main
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: web
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: web
- job_name: monitoring/kube-apiserver/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  scrape_interval: 30s
  scheme: https
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_component
    regex: apiserver
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_provider
    regex: kubernetes
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: https
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_component
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: https
- job_name: monitoring/kube-controller-manager/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 120s
  scrape_timeout: 120s
  metrics_path: /metrics
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kube-controller-manager
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/kube-dns/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kube-dns
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics-coredns
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics-coredns
- job_name: monitoring/kube-scheduler/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 30s
  metrics_path: /metrics
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kube-scheduler
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/kube-state-metrics/0
  honor_labels: true
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 120s
  scrape_timeout: 120s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kube-state-metrics
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/kubelet/0
  honor_labels: true
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
  scrape_interval: 120s
  scrape_timeout: 120s
  metrics_path: /metrics
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kubelet
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/node-exporter/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: node-exporter
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http-metrics
- job_name: monitoring/prometheus/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_prometheus
    regex: k8s
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: web
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: web
- job_name: monitoring/prometheus-operator/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: prometheus-operator
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: http
- job_name: monitoring/snowdrop-sidekiq-mon/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - snowdrop
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: snowdrop
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_tier
    regex: sidekiq
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: sd-skiq-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: sd-skiq-metrics
- job_name: monitoring/snowdrop-web-mon/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - snowdrop
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: snowdrop
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_tier
    regex: web
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: sd-web-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: sd-web-metrics
- job_name: monitoring/snowdrop-crypto-mon/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - snowdrop
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: snowdrop
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_tier
    regex: crypto-bal-metrics
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: sd-cr-metrics
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: sd-cr-metrics
alerting:
  alertmanagers:
  - path_prefix: /
    scheme: http
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    relabel_configs:
    - action: keep
      source_labels:
      - __meta_kubernetes_service_name
      regex: alertmanager-main
    - action: keep
      source_labels:
      - __meta_kubernetes_endpoint_port_name
      regex: web
  • Alertmanager configuration file: not provided (not relevant to the issue)
  • Logs:
level=info ts=2018-12-20T04:05:18.460806871Z caller=main.go:244 msg="Starting Prometheus" version="(version=2.5.0, branch=HEAD, revision=67dc912ac8b24f94a1fc478f352d25179c94ab9b)"
level=info ts=2018-12-20T04:05:18.460865833Z caller=main.go:245 build_context="(go=go1.11.1, user=root@578ab108d0b9, date=20181106-11:40:44)"
level=info ts=2018-12-20T04:05:18.460886303Z caller=main.go:246 host_details="(Linux 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 prometheus-k8s-0 (none))"
level=info ts=2018-12-20T04:05:18.460900782Z caller=main.go:247 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-12-20T04:05:18.460912444Z caller=main.go:248 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-12-20T04:05:18.461410451Z caller=main.go:562 msg="Starting TSDB ..."
level=info ts=2018-12-20T04:05:18.4614568Z caller=web.go:399 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-12-20T04:05:21.894760153Z caller=main.go:572 msg="TSDB started"
level=info ts=2018-12-20T04:05:21.894850892Z caller=main.go:632 msg="Loading configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-12-20T04:05:21.897884806Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-20T04:05:21.898665051Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-20T04:05:21.899353058Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-20T04:05:21.900058361Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-20T04:05:21.900907356Z caller=kubernetes.go:201 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-20T04:05:21.912435404Z caller=main.go:658 msg="Completed loading of configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-12-20T04:05:21.912471125Z caller=main.go:531 msg="Server is ready to receive web requests."

pod description:

Controlled By:  StatefulSet/prometheus-k8s
Containers:
  prometheus:
    Container ID:  docker://21e889edd7115b639a1b2dd957b47c33515180b28a9a9e6aaa92790927e3ca53
    Image:         quay.io/prometheus/prometheus:v2.5.0
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:478d0b68432ea289a2e8455cbc30ee38b7ade6d13b4f73877203184c64914d9b
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --config.file=/etc/prometheus/config/prometheus.yaml
      --storage.tsdb.path=/var/prometheus/data
      --storage.tsdb.retention=72h
      --web.enable-lifecycle
      --web.external-url=http://127.0.0.1:9090
      --web.route-prefix=/
    State:          Running
      Started:      Thu, 20 Dec 2018 12:05:18 +0800
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 20 Dec 2018 11:56:50 +0800
      Finished:     Thu, 20 Dec 2018 12:04:59 +0800
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     500m
      memory:  24Gi
    Requests:
      cpu:        100m
      memory:     1Gi
    Liveness:     http-get http://:web/-/healthy delay=30s timeout=3s period=5s #success=1 #failure=10
    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=6
    Environment:  <none>
    Mounts:
      /etc/prometheus/config from config (ro)
      /etc/prometheus/rules from rules (ro)
      /var/prometheus/data from prometheus-k8s-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-ggnmz (ro)
alanh0vx commented Dec 20, 2018

Trying to check for the heavy queries (https://www.robustperception.io/which-are-my-biggest-metrics),

but eventually it just returns "context deadline exceeded".

Tried dropping metrics like apiserver_request_latencies and related apiserver latency metrics, but it still gets OOMKilled regularly.

The logs only show info-level messages; I cannot find any useful information in them.
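For reference, the cardinality queries from the linked post are along these lines (assuming the instance can stay up long enough to answer them; the per-job breakdown is an additional sketch, not from the post):

```promql
# Top 10 metric names by series count (expensive to evaluate on a large head):
topk(10, count by (__name__)({__name__=~".+"}))

# Samples scraped per job, to narrow down which scrape config is the culprit:
sum by (job) (scrape_samples_scraped)
```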

alanh0vx commented Dec 20, 2018

Grafana chart of memory usage over the past 3 hours:

[screenshot taken 2018-12-20, 6:03 PM]

simonpasquier commented Dec 20, 2018

It feels like you have metrics with high cardinality. What does the graph of prometheus_tsdb_head_series look like?

alanh0vx commented Dec 20, 2018

[screenshot: prometheus_tsdb_head_series graph, taken 2018-12-20, 6:10 PM]

simonpasquier commented Dec 20, 2018

Yep, you seem to have metrics with unbounded cardinality. 8M time series is a lot for Prometheus and would require roughly 64GB just to keep them in memory. Maybe try enabling the scrape configurations one after the other to find the culprit.
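As a back-of-the-envelope check, a common rule of thumb is a few KiB of RAM per in-memory head series. Assuming roughly 8 KiB per series, the expected footprint can be estimated directly from the metric in question:

```promql
# Rough head-memory estimate in bytes, assuming ~8 KiB per series
# (a rule of thumb, not an exact figure):
prometheus_tsdb_head_series * 8 * 1024
```

At 8M series that works out to about 64GB, which far exceeds the container's 24Gi memory limit, so repeated OOMKills are expected.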

alanh0vx commented Dec 20, 2018

Thanks! I am checking the configurations one by one; hopefully I can find the one causing the heavy memory usage. Will let you know the result.

alanh0vx commented Dec 21, 2018

Turns out the job job_name: monitoring/kube-controller-manager/0 failed to get metrics, which resulted in huge memory usage. I removed this job and everything runs smoothly again.

Thanks for pointing me in the right direction.
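If removing a job entirely is too drastic, an alternative sketch (not what was done in this thread) is to keep the job but drop known high-cardinality metrics at ingestion time with metric_relabel_configs, attached to whichever scrape job exposes them; the regex below assumes the apiserver latency metrics mentioned earlier:

```yaml
# Sketch: drop high-cardinality metrics after the scrape, before storage.
metric_relabel_configs:
- action: drop
  source_labels:
  - __name__
  regex: apiserver_request_latencies.*
```

Note that metric_relabel_configs applies after the scrape, so it saves TSDB memory but the samples are still pulled over the wire on every scrape.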

@alanh0vx alanh0vx closed this Dec 21, 2018
