Prometheus not responding after a time period #5353
Comments
The RSS includes files that are memory-mapped, and the kernel should reclaim that memory whenever needed. Regarding the startup time, Prometheus needs to sanity-check the data (blocks + WAL) before being ready. A long start time plus long compaction times would usually mean that your storage isn't fast enough and/or you have too many time series/samples. What's the value of prometheus_tsdb_head_series?
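(For anyone wanting to check that value quickly: a minimal sketch using the Prometheus HTTP API, assuming the server listens on localhost:9090.)

```sh
# Ask Prometheus for its current number of in-memory (head) series.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'
```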
Hi @simonpasquier: prometheus_tsdb_head_series is 10.35 million.
allangood commented Mar 19, 2019
I'm experiencing the same issue. It started to happen when I migrated to 2.7.1. I tried downgrading to 2.6.0, but the problem persists. I noticed a huge spike in memory usage every 2 hours (TSDB-related jobs) after the upgrade. In my case, Prometheus has been completely offline since the upgrade.
A Prometheus server with 10M series is a large setup. You may consider sharding your targets across multiple servers. @allangood if you see the same behavior after rolling back to v2.6, it seems unrelated to the version change. Most probably the dataset has grown, causing higher load. Every 2 hours Prometheus writes the WAL out to a block, so a memory usage increase is expected there.
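(For readers wondering how the sharding suggestion could look in practice: a minimal sketch of hashmod-based sharding in prometheus.yml, assuming two servers each keeping half of the targets; the job name and shard number below are placeholders.)

```yaml
scrape_configs:
  - job_name: 'node'                 # hypothetical job name
    relabel_configs:
      # Hash each target address into one of 2 buckets (one per server)...
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # ...and keep only the targets that belong to this server's shard (0 here).
      - source_labels: [__tmp_hash]
        regex: '0'
        action: keep
```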
simonpasquier added the kind/more-info-needed label Mar 20, 2019
allangood commented Mar 21, 2019
Hi @simonpasquier, I would like to agree, but currently my Prometheus server doesn't have any targets defined (I've disabled everything in the hope of getting it back online, without luck). I don't know whether the upgrade changed something in the TSDB, but even without any targets defined, my Prometheus just can't come online. It starts consuming all available RAM, then hits swap hard, the OOM killer kills the process, systemd restarts it, and the cycle repeats. Sadly, my server is down and I can't show Grafana graphs from before and after the upgrade. Before, I had a flat line of memory usage.
@allangood Anything in the logs? Have you tried starting Prometheus with --log.level=debug?
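(For reference, debug logging is enabled with a startup flag; the config and data paths in this sketch are placeholders for your own setup.)

```sh
# Start Prometheus with verbose logging (paths are illustrative).
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --log.level=debug
```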
allangood commented Mar 21, 2019
Hi @simonpasquier, nothing really useful in the logs. I will post the output shortly. Thank you very much!
allangood commented Mar 21, 2019
Here are the outputs:
TSDB:
Debug:
Then the service gets stuck there forever and is killed by the OOM killer. I'm in the process of migrating the TSDB files to another server with more resources; when I'm done, I will come back with the graphs. Thank you again @simonpasquier
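(On migrating the data directory: a minimal sketch of what that move could look like, assuming a systemd unit named prometheus and default data paths; stop the server first so the directory is consistent.)

```sh
# Stop Prometheus so no new blocks/WAL segments are written mid-copy.
systemctl stop prometheus
# Copy the whole data directory to the new host (paths and host are illustrative).
rsync -a /var/lib/prometheus/ newserver:/var/lib/prometheus/
```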
That's a cardinality issue that happens to coincide with the upgrade.
…On Fri 22 Mar 2019, 01:21, Allan Gomes GooD wrote:
Some more information: these graphs came from the Prometheus Benchmark 2.x dashboard. The blue line marks the upgrade from 2.6.0 to 2.7.2. This is the strangest graph: the GC climbed from an average of 4 GB to 10 GB, then to 14 GB:
[image] <https://user-images.githubusercontent.com/757086/54794454-920f2100-4c15-11e9-977e-26ae4ba6eee3.png>
[image] <https://user-images.githubusercontent.com/757086/54794477-ac48ff00-4c15-11e9-9913-fc31cc44ebb9.png>
[image] <https://user-images.githubusercontent.com/757086/54794585-342f0900-4c16-11e9-92ea-8b44cd93de7a.png>
The samples appended didn't change during that time:
[image] <https://user-images.githubusercontent.com/757086/54794602-4446e880-4c16-11e9-9938-438c3be4de5e.png>
allangood commented Mar 22, 2019
Hi @brian-brazil, can you help me understand this issue? What piece of information made you spot it? And, most importantly for me, how can I prevent this from happening again? Another problem was the troubleshooting process: the Prometheus server itself didn't come back with any useful information in the logs, and the database was completely offline. Thank you very much for your time and information.
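(For readers asking themselves the same question: a few PromQL queries that are commonly used to spot which metrics or jobs drive series cardinality; they can be heavy on a server with millions of series, so run them with care.)

```promql
# Which metric names contribute the most series?
topk(10, count by (__name__) ({__name__=~".+"}))

# Which scrape jobs contribute the most series?
topk(10, count by (job) ({job=~".+"}))

# Track how the number of in-memory series evolves over time.
prometheus_tsdb_head_series
```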
Natalique commented Mar 26, 2019
Having the same issue. It takes 20 minutes for the TSDB to start, and about 2 minutes after that Prometheus gets OOM-killed.
vishksaj commented Mar 13, 2019 (edited)
Prometheus is not responding after a time period. There are no errors in the logs. The TSDB is taking a long time to load (6 to 8 minutes).
#################################
level=info ts=2019-03-13T16:46:22.724044759Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-03-13T16:46:22.724121743Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-03-13T16:46:22.724147937Z caller=main.go:304 host_details="(Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 prometheus-56b988f5fd-nsp4x (none))"
level=info ts=2019-03-13T16:46:22.724171144Z caller=main.go:305 fd_limits="(soft=65536, hard=65536)"
level=info ts=2019-03-13T16:46:22.724191949Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-13T16:46:22.72506834Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-03-13T16:46:22.725143201Z caller=web.go:416 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-03-13T16:46:22.725780061Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1544313600000 maxt=1544896800000 ulid=01CYSTPRKKYREEAHGAKY6GBZ2H
level=info ts=2019-03-13T16:46:22.725912535Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1544896800000 maxt=1545480000000 ulid=01CZB6WJSKS3WPPG7QZSJFDE2J
level=info ts=2019-03-13T16:46:22.726020136Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1545480000000 maxt=1546063200000 ulid=01CZWK2JQE5YFAKXD6F0K4RVSK
level=info ts=2019-03-13T16:46:22.727592765Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546063200000 maxt=1546646400000 ulid=01D0DZ8DP7PEQFP5KQVEB1X1RP
level=info ts=2019-03-13T16:46:22.727753373Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546646400000 maxt=1547229600000 ulid=01D0ZBE7PNFH3KA0H3AVG94FHA
level=info ts=2019-03-13T16:46:22.727868524Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547229600000 maxt=1547812800000 ulid=01D1GQM4KNZAVRWYGRD26QQBS4
level=info ts=2019-03-13T16:46:22.727973518Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547812800000 maxt=1548396000000 ulid=01D223SP3Z3TRAJS0B6C5BWYCA
level=info ts=2019-03-13T16:46:22.728075739Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548396000000 maxt=1548979200000 ulid=01D2KFZYKQGBCH31KYYP26F5RP
level=info ts=2019-03-13T16:46:22.728176019Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548979200000 maxt=1549562400000 ulid=01D34W6BWAC1C0NG9QSGF0NCDA
level=info ts=2019-03-13T16:46:22.728276802Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1549562400000 maxt=1550145600000 ulid=01D3P8CFZ3RHDHB3K9E2AQHSBY
level=info ts=2019-03-13T16:46:22.728409327Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550145600000 maxt=1550728800000 ulid=01D47MHV86ZTZ900QB8P49CCHT
level=info ts=2019-03-13T16:46:22.728515501Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550728800000 maxt=1551312000000 ulid=01D4S1FYMJPQ04ZDJKJXW9YCT2
level=info ts=2019-03-13T16:46:22.72859195Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551312000000 maxt=1551506400000 ulid=01D4YTZ4T7XWRXT1QZ1ES4X1SX
level=info ts=2019-03-13T16:46:22.728661721Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54M16FF80DNP2NYYD7WVFPE
level=info ts=2019-03-13T16:46:22.728728353Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5AEBVHHKX9TMQ60YK863QTX
level=info ts=2019-03-13T16:46:22.728794867Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5G7MJGZ0JSRYK7W2HHP98DD
level=info ts=2019-03-13T16:46:22.728860682Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552284000000 ulid=01D5P6E1H2Z3ARDHD9MZF7J5A6
level=info ts=2019-03-13T16:46:22.728918109Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552348800000 ulid=01D5QZPRX8QTED41V247Z33BHA
level=info ts=2019-03-13T16:46:22.728971869Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552348800000 maxt=1552413600000 ulid=01D5SW38M2EAZ17GHMCRGMJD6N
level=info ts=2019-03-13T16:46:22.729011805Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552478400000 maxt=1552485600000 ulid=01D5VS2ED2K6GPSHEG9RMQ60BP
level=info ts=2019-03-13T16:46:22.729064622Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552413600000 maxt=1552478400000 ulid=01D5VSN059YMBNVQ6DH7YR9G08
level=warn ts=2019-03-13T16:53:51.738147446Z caller=head.go:440 component=tsdb msg="unknown series references" count=1328
level=info ts=2019-03-13T16:54:06.072122896Z caller=main.go:635 msg="TSDB started"
level=info ts=2019-03-13T16:54:06.07272842Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-03-13T16:54:06.076760211Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T16:54:06.078139656Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T16:54:06.079844733Z caller=main.go:722 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-03-13T16:54:06.079869225Z caller=main.go:589 msg="Server is ready to receive web requests."
#################################
Environment
System: Production
OS: Linux 3.10.0-957.1.3.el7.x86_64 x86_64
Prometheus version: 2.7.1
We have allocated 10 CPU cores and 180 GB of memory for the container, as we are processing 400K samples/s (prometheus_tsdb_head_samples_appended_total = 400K).
I can see there is a huge difference between the allocated memory and the RSS.
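(If it helps when comparing those two numbers: the gap between Go heap usage and RSS usually comes from memory-mapped block files and from heap the Go runtime has released but the kernel hasn't reclaimed yet, as noted at the top of the thread. A few illustrative queries, assuming Prometheus scrapes itself under a job label called prometheus.)

```promql
# Memory the kernel accounts to the Prometheus process (RSS).
process_resident_memory_bytes{job="prometheus"}

# Heap memory currently allocated / in use by the Go runtime.
go_memstats_alloc_bytes{job="prometheus"}
go_memstats_heap_inuse_bytes{job="prometheus"}

# Heap memory the Go runtime has released back to the OS.
go_memstats_heap_released_bytes{job="prometheus"}
```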