
Prometheus starts and hangs, returns 503 #5814

@danking

Description

What did you do?

Start a Prometheus server with X GB of accessible RAM and on-disk data of size X+epsilon GB. (I suspect this also occurs if you start with less than X GB of data and add enough data to surpass X GB.)

What did you expect to see?

I expect Prometheus to return a non-503 status code for GET /.

What did you see instead? Under which circumstances?

Prometheus returns a 503 status code. Moreover, top indicates that it slowly claims more memory until it has used all the available memory on the underlying node. In the one case I monitored, it took approximately ten minutes to claim 31.5 GB of RAM. It does not, however, crash once it has claimed all the memory, nor does it start successfully serving requests.

What's the fix?

We increased available memory and Prometheus became available. We observed it using 47.5 GB of RAM with a data directory 27.8 GB in size.

Two existing issues (#5727, #4324) appear to report the same problem, but discussion on them is locked, so I cannot comment with my solution. I created this issue so that others searching for this problem will find it and this solution. I will close this issue immediately, though it would be nice if Prometheus reported a memory issue instead of hanging.
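For anyone hitting this on Kubernetes: a partial mitigation (untested here; values are illustrative, not a recommendation) is to set an explicit memory limit on the container, so that instead of hanging indefinitely the pod is OOM-killed with a visible `OOMKilled` status, and to bound the TSDB with `--storage.tsdb.retention.size` (available since Prometheus 2.7, experimental at the time) so the data directory cannot outgrow available memory indefinitely. A sketch against the StatefulSet spec in the Environment section below:

```yaml
# Sketch only: memory and retention values are assumptions; tune for your workload.
containers:
 - name: prometheus
   image: prom/prometheus:v2.10.0
   resources:
     requests:
       memory: "8Gi"
     limits:
       memory: "8Gi"   # exceeding this OOM-kills the pod, which surfaces in `kubectl describe pod`
   command:
    - "/bin/prometheus"
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.path=/prometheus"
    - "--storage.tsdb.retention.size=40GB"   # keep on-disk data below the 50Gi PVC
```

With a limit set, the restart loop at least makes the failure mode observable, rather than a server that is up but serves only 503s.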

Environment

  • System information:

We're running in a pod on k8s. The container is prom/prometheus:v2.10.0. The YAML spec is:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    name: prometheus
  name: prometheus
  namespace: monitoring
spec:
  serviceName: "prometheus"
  selector:
    matchLabels:
      app: prometheus
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        fsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus
      containers:
       - name: prometheus
         image: prom/prometheus:v2.10.0
         imagePullPolicy: Always
         command:
          - "/bin/prometheus"
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--web.console.libraries=/usr/share/prometheus/console_libraries"
          - "--web.console.templates=/usr/share/prometheus/consoles"
          - "--web.external-url=https://internal.hail.is/monitoring/prometheus"
         ports:
          - containerPort: 9090
            protocol: TCP
         volumeMounts:
          - mountPath: "/etc/prometheus"
            name: etc-prometheus
          - mountPath: "/prometheus"
            name: prometheus-storage
      volumes:
       - name: etc-prometheus
         configMap:
           name: etc-prometheus
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
        namespace: monitoring
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
/prometheus $ uname -srm
Linux 4.14.127+ x86_64
  • Prometheus version:


/prometheus $ prometheus --version
prometheus, version 2.10.0 (branch: HEAD, revision: d20e84d0fb64aff2f62a977adc8cfb656da4e286)
  build user:       root@a49185acd9b0
  build date:       20190525-12:28:13
  go version:       go1.12.5
  • Prometheus configuration file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-prometheus
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
     - job_name: 'kubernetes-kubelet'
       scheme: https
       tls_config:
         ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         insecure_skip_verify: true
       bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
       kubernetes_sd_configs:
        - role: node
       relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc.cluster.local:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics
     - job_name: 'kubernetes-cadvisor'
       scheme: https
       tls_config:
         ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         insecure_skip_verify: true
       bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
       kubernetes_sd_configs:
        - role: node
       relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc.cluster.local:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
     - job_name: 'kubernetes-kube-state'
       kubernetes_sd_configs:
        - role: pod
       relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
        - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
          regex: .*true.*
          action: keep
        - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
          regex: 'node-exporter;(.*)'
          action: replace
          target_label: nodename
     - job_name: 'kubernetes-apiservers'
       scheme: https
       tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
       bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
       kubernetes_sd_configs:
        - api_server: null
          role: endpoints
          namespaces:
            names: []
       relabel_configs:
       - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
         separator: ;
         regex: default;kubernetes;https
         replacement: $1
         action: keep
  • Logs:
level=info ts=2019-07-31T15:45:51.990Z caller=main.go:286 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:322 msg="Starting Prometheus" version="(version=2.10.0, branch=HEAD, revision=d20e84d0fb64aff2f62a977adc8cfb656da4e286)"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:323 build_context="(go=go1.12.5, user=root@a49185acd9b0, date=20190525-12:28:13)"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:324 host_details="(Linux 4.14.127+ #1 SMP Tue Jun 18 18:32:10 PDT 2019 x86_64 prometheus-0 (none))"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:325 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-07-31T15:45:51.991Z caller=main.go:326 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-07-31T15:45:51.993Z caller=main.go:645 msg="Starting TSDB ..."
level=info ts=2019-07-31T15:45:51.994Z caller=web.go:417 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-07-31T15:45:51.996Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563105600000 maxt=1563170400000 ulid=01DFTDRJHCX1S9B0KPJTG8CRGW
level=info ts=2019-07-31T15:45:51.997Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563170400000 maxt=1563235200000 ulid=01DFWBK0336Z71ZCRRKS79T18P
level=info ts=2019-07-31T15:45:51.997Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563235200000 maxt=1563300000000 ulid=01DFY9C92NRA1S7FDVHFRFMFPF
level=info ts=2019-07-31T15:45:51.998Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563300000000 maxt=1563364800000 ulid=01DG075GN2MME91GM1DA5G3H07
level=info ts=2019-07-31T15:45:51.999Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563364800000 maxt=1563429600000 ulid=01DG24Z1SDJ7VXW96YYSY1FC8Y
level=info ts=2019-07-31T15:45:51.999Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563429600000 maxt=1563494400000 ulid=01DG42SDMFEK1AJPRJ5YWKZFJ8
level=info ts=2019-07-31T15:45:52.000Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563494400000 maxt=1563559200000 ulid=01DG60K1ADH2GGZ6ZHYVRQA7PQ
level=info ts=2019-07-31T15:45:52.001Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563559200000 maxt=1563624000000 ulid=01DG7YBCA5FFBKYXX7EADE91TP
level=info ts=2019-07-31T15:45:52.001Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563624000000 maxt=1563688800000 ulid=01DG9W4WYEDBQ32Q112S7EPMEP
level=info ts=2019-07-31T15:45:52.002Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563688800000 maxt=1563753600000 ulid=01DGBSYJDGQ8NY58106XGFT7CS
level=info ts=2019-07-31T15:45:52.002Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563753600000 maxt=1563818400000 ulid=01DGDQRCZ949B46BNYWP2S5F02
level=info ts=2019-07-31T15:45:52.003Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563818400000 maxt=1563883200000 ulid=01DGFNHWFPXDCFKMJ45R22VST6
level=info ts=2019-07-31T15:45:52.004Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563883200000 maxt=1563948000000 ulid=01DGHKFKSD0THQ0VWGY9MM01GG
level=info ts=2019-07-31T15:45:52.004Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1563948000000 maxt=1564012800000 ulid=01DGKH5AFC7KQ1CN0JE7AA3G6Y
level=info ts=2019-07-31T15:45:52.005Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564012800000 maxt=1564077600000 ulid=01DGNEZMRJM9XKV1N0SA1Y9S3F
level=info ts=2019-07-31T15:45:52.005Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564077600000 maxt=1564142400000 ulid=01DGQCQYYCJDT3DQTVTBHS7N6G
level=info ts=2019-07-31T15:45:52.006Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564142400000 maxt=1564207200000 ulid=01DGSAYGB5QEMACHZK7ZE89H70
level=info ts=2019-07-31T15:45:52.006Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564207200000 maxt=1564272000000 ulid=01DGV8AWDDHKPSN407Z08FBHVZ
level=info ts=2019-07-31T15:45:52.007Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564272000000 maxt=1564336800000 ulid=01DGX64B6GVNNF4P09GB4YM3TV
level=info ts=2019-07-31T15:45:52.007Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564401600000 maxt=1564408800000 ulid=01DGZ3XMWQ04MNAJTWJNXGHNZ5
level=info ts=2019-07-31T15:45:52.008Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564336800000 maxt=1564401600000 ulid=01DGZ426ED23NR5759BDAQM0H6
level=info ts=2019-07-31T15:45:52.008Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564408800000 maxt=1564416000000 ulid=01DGZASC8TRTFRM61J3MX4PHX4
level=info ts=2019-07-31T15:45:52.009Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1564416000000 maxt=1564423200000 ulid=01DGZHN38ENTPENE3MM35HVS42
level=info ts=2019-07-31T15:45:52.020Z caller=web.go:461 component=web msg="router prefix" prefix=/monitoring/prometheus
