
Prometheus starts and hangs, returns 503 #5814

@danking

Description

What did you do?

Start a Prometheus server with X GB of accessible RAM and on-disk data of size X+epsilon GB. (I suspect this also occurs if you start with less than X GB of data and add enough data to surpass X GB.)

What did you expect to see?

I expect Prometheus to return a non-503 status code for GET /.

What did you see instead? Under which circumstances?

Prometheus returns a 503 status code. Moreover, top indicates that it slowly claims more memory until it has used all the available memory on the underlying node. In the one case I monitored, it took approximately ten minutes to claim 31.5 GB of RAM. It does not, however, crash once it has claimed all the memory, nor does it start successfully serving requests.

What's the fix?

We increased available memory and Prometheus became available. We observed it using 47.5 GB of RAM with a data directory 27.8 GB in size.

Two existing issues (#5727, #4324) appear to report the same problem, but discussion on them is locked, so I cannot comment with my solution. I created this issue so that others searching for this problem will find it and this solution. I will close this issue immediately, though it would be nice if Prometheus reported a memory issue instead of hanging.
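For anyone hitting this on Kubernetes: a partial mitigation (untested here; values are illustrative, not a recommendation) is to set an explicit memory limit on the container, so that instead of hanging indefinitely the pod is OOM-killed with a visible `OOMKilled` status, and to bound the TSDB with `--storage.tsdb.retention.size` (available since Prometheus 2.7, experimental at the time) so the data directory cannot outgrow available memory indefinitely. A sketch against the StatefulSet spec in the Environment section below:

```yaml
# Sketch only: memory and retention values are assumptions; tune for your workload.
containers:
 - name: prometheus
   image: prom/prometheus:v2.10.0
   resources:
     requests:
       memory: "8Gi"
     limits:
       memory: "8Gi"   # exceeding this OOM-kills the pod, which surfaces in `kubectl describe pod`
   command:
    - "/bin/prometheus"
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.path=/prometheus"
    - "--storage.tsdb.retention.size=40GB"   # keep on-disk data below the 50Gi PVC
```

With a limit set, the restart loop at least makes the failure mode observable, rather than a server that is up but serves only 503s.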

Environment

  • System information:

We're running in a pod on k8s. The container is prom/prometheus:v2.10.0. The YAML spec is:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    name: prometheus
  name: prometheus
  namespace: monitoring
spec:
  serviceName: "prometheus"
  selector:
    matchLabels:
      app: prometheus
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        fsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus
      containers:
       - name: prometheus
         image: prom/prometheus:v2.10.0
         imagePullPolicy: Always
         command:
          - "/bin/prometheus"
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--web.console.libraries=/usr/share/prometheus/console_libraries"
          - "--web.console.templates=/usr/share/prometheus/consoles"
          - "--web.external-url=https://internal.hail.is/monitoring/prometheus"
         ports:
          - containerPort: 9090
            protocol: TCP
         volumeMounts:
          - mountPath: "/etc/prometheus"
            name: etc-prometheus
          - mountPath: "/prometheus"
            name: prometheus-storage
      volumes:
       - name: etc-prometheus
         configMap:
           name: etc-prometheus
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
        namespace: monitoring
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
/prometheus $ uname -srm
Linux 4.14.127+ x86_64
  • Prometheus version:


/prometheus $ prometheus --version
prometheus, version 2.10.0 (branch: HEAD, revision: d20e84d0fb64aff2f62a977adc8cfb656da4e286)
  build user:       root@a49185acd9b0
  build date:       20190525-12:28:13
  go version:       go1.12.5
  • Prometheus configuration file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-prometheus
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
     - job_name: 'kubernetes-kubelet'
       scheme: https
       tls_config:
         ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         insecure_skip_verify: true
       bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
       kubernetes_sd_configs:
        - role: node
       relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc.cluster.local:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics
     - job_name: 'kubernetes-cadvisor'
       scheme: https
       tls_config:
         ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         insecure_skip_verify: true
       bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
       kubernetes_sd_configs:
        - role: node
       relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc.cluster.local:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
     - job_name: 'kubernetes-kube-state'
       kubernetes_sd_configs:
        - role: pod
       relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
        - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
          regex: .*true.*
          action: keep
        - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
          regex: 'node-exporter;(.*)'
          action: replace
          target_label: nodename
     - job_name: 'kubernetes-apiservers'
       scheme: https
       tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
       bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
       kubernetes_sd_configs:
        - api_server: null
          role: endpoints
          namespaces:
            names: []
       relabel_configs:
       - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
         separator: ;
         regex: default;kubernetes;https
         replacement: $1
         action: keep
  • Logs:
level=info ts=2019-07-31T15:45:51.990Z caller=main.go:286 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:322 msg="Starting Prometheus" version="(version=2.10.0, branch=HEAD, revision=d20e84d0fb64aff2f62a977adc8cfb656da4e286)"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:323 build_context="(go=go1.12.5, user=root@a49185acd9b0, date=20190525-12:28:13)"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:324 host_details="(Linux 4.14.127+ #1 SMP Tue Jun 18 18:32:10 PDT 2019 x86_64 prometheus-0 (none))"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:325 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:326 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-07-31T15:45:51.993Z caller=main.go:645 msg="Starting TSDB ..."
level=info ts=2019-07-31T15:45:51.994Z caller=web.go:417 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-07-31T15:45:51.996Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563105600000 maxt=1563170400000 ulid=01DFTDRJHCX1S9B0KPJTG8CRGW
level=info ts=2019-07-31T15:45:51.997Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563170400000 maxt=1563235200000 ulid=01DFWBK0336Z71ZCRRKS79T18P
level=info ts=2019-07-31T15:45:51.997Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563235200000 maxt=1563300000000 ulid=01DFY9C92NRA1S7FDVHFRFMFPF
level=info ts=2019-07-31T15:45:51.998Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563300000000 maxt=1563364800000 ulid=01DG075GN2MME91GM1DA5G3H07
level=info ts=2019-07-31T15:45:51.999Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563364800000 maxt=1563429600000 ulid=01DG24Z1SDJ7VXW96YYSY1FC8Y
level=info ts=2019-07-31T15:45:51.999Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563429600000 maxt=1563494400000 ulid=01DG42SDMFEK1AJPRJ5YWKZFJ8
level=info ts=2019-07-31T15:45:52.000Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563494400000 maxt=1563559200000 ulid=01DG60K1ADH2GGZ6ZHYVRQA7PQ
level=info ts=2019-07-31T15:45:52.001Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563559200000 maxt=1563624000000 ulid=01DG7YBCA5FFBKYXX7EADE91TP
level=info ts=2019-07-31T15:45:52.001Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563624000000 maxt=1563688800000 ulid=01DG9W4WYEDBQ32Q112S7EPMEP
level=info ts=2019-07-31T15:45:52.002Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563688800000 maxt=1563753600000 ulid=01DGBSYJDGQ8NY58106XGFT7CS
level=info ts=2019-07-31T15:45:52.002Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563753600000 maxt=1563818400000 ulid=01DGDQRCZ949B46BNYWP2S5F02
level=info ts=2019-07-31T15:45:52.003Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563818400000 maxt=1563883200000 ulid=01DGFNHWFPXDCFKMJ45R22VST6
level=info ts=2019-07-31T15:45:52.004Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563883200000 maxt=1563948000000 ulid=01DGHKFKSD0THQ0VWGY9MM01GG
level=info ts=2019-07-31T15:45:52.004Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563948000000 maxt=1564012800000 ulid=01DGKH5AFC7KQ1CN0JE7AA3G6Y
level=info ts=2019-07-31T15:45:52.005Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564012800000 maxt=1564077600000 ulid=01DGNEZMRJM9XKV1N0SA1Y9S3F
level=info ts=2019-07-31T15:45:52.005Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564077600000 maxt=1564142400000 ulid=01DGQCQYYCJDT3DQTVTBHS7N6G
level=info ts=2019-07-31T15:45:52.006Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564142400000 maxt=1564207200000 ulid=01DGSAYGB5QEMACHZK7ZE89H70
level=info ts=2019-07-31T15:45:52.006Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564207200000 maxt=1564272000000 ulid=01DGV8AWDDHKPSN407Z08FBHVZ
level=info ts=2019-07-31T15:45:52.007Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564272000000 maxt=1564336800000 ulid=01DGX64B6GVNNF4P09GB4YM3TV
level=info ts=2019-07-31T15:45:52.007Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564401600000 maxt=1564408800000 ulid=01DGZ3XMWQ04MNAJTWJNXGHNZ5
level=info ts=2019-07-31T15:45:52.008Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564336800000 maxt=1564401600000 ulid=01DGZ426ED23NR5759BDAQM0H6
level=info ts=2019-07-31T15:45:52.008Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564408800000 maxt=1564416000000 ulid=01DGZASC8TRTFRM61J3MX4PHX4
level=info ts=2019-07-31T15:45:52.009Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564416000000 maxt=1564423200000 ulid=01DGZHN38ENTPENE3MM35HVS42
level=info ts=2019-07-31T15:45:52.020Z caller=web.go:461 component=web msg="router prefix" prefix=/monitoring/prometheus
