
Prometheus not responding after a time period #5353

vishksaj opened this Issue Mar 13, 2019 · 13 comments

vishksaj commented Mar 13, 2019

Prometheus is not responding after a period of time. There are no errors in the logs. The TSDB takes 6 to 8 minutes to load.
#################################
level=info ts=2019-03-13T16:46:22.724044759Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-03-13T16:46:22.724121743Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-03-13T16:46:22.724147937Z caller=main.go:304 host_details="(Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 prometheus-56b988f5fd-nsp4x (none))"
level=info ts=2019-03-13T16:46:22.724171144Z caller=main.go:305 fd_limits="(soft=65536, hard=65536)"
level=info ts=2019-03-13T16:46:22.724191949Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-13T16:46:22.72506834Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-03-13T16:46:22.725143201Z caller=web.go:416 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-03-13T16:46:22.725780061Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1544313600000 maxt=1544896800000 ulid=01CYSTPRKKYREEAHGAKY6GBZ2H
level=info ts=2019-03-13T16:46:22.725912535Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1544896800000 maxt=1545480000000 ulid=01CZB6WJSKS3WPPG7QZSJFDE2J
level=info ts=2019-03-13T16:46:22.726020136Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1545480000000 maxt=1546063200000 ulid=01CZWK2JQE5YFAKXD6F0K4RVSK
level=info ts=2019-03-13T16:46:22.727592765Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546063200000 maxt=1546646400000 ulid=01D0DZ8DP7PEQFP5KQVEB1X1RP
level=info ts=2019-03-13T16:46:22.727753373Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546646400000 maxt=1547229600000 ulid=01D0ZBE7PNFH3KA0H3AVG94FHA
level=info ts=2019-03-13T16:46:22.727868524Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547229600000 maxt=1547812800000 ulid=01D1GQM4KNZAVRWYGRD26QQBS4
level=info ts=2019-03-13T16:46:22.727973518Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547812800000 maxt=1548396000000 ulid=01D223SP3Z3TRAJS0B6C5BWYCA
level=info ts=2019-03-13T16:46:22.728075739Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548396000000 maxt=1548979200000 ulid=01D2KFZYKQGBCH31KYYP26F5RP
level=info ts=2019-03-13T16:46:22.728176019Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548979200000 maxt=1549562400000 ulid=01D34W6BWAC1C0NG9QSGF0NCDA
level=info ts=2019-03-13T16:46:22.728276802Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1549562400000 maxt=1550145600000 ulid=01D3P8CFZ3RHDHB3K9E2AQHSBY
level=info ts=2019-03-13T16:46:22.728409327Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550145600000 maxt=1550728800000 ulid=01D47MHV86ZTZ900QB8P49CCHT
level=info ts=2019-03-13T16:46:22.728515501Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550728800000 maxt=1551312000000 ulid=01D4S1FYMJPQ04ZDJKJXW9YCT2
level=info ts=2019-03-13T16:46:22.72859195Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551312000000 maxt=1551506400000 ulid=01D4YTZ4T7XWRXT1QZ1ES4X1SX
level=info ts=2019-03-13T16:46:22.728661721Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54M16FF80DNP2NYYD7WVFPE
level=info ts=2019-03-13T16:46:22.728728353Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5AEBVHHKX9TMQ60YK863QTX
level=info ts=2019-03-13T16:46:22.728794867Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5G7MJGZ0JSRYK7W2HHP98DD
level=info ts=2019-03-13T16:46:22.728860682Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552284000000 ulid=01D5P6E1H2Z3ARDHD9MZF7J5A6
level=info ts=2019-03-13T16:46:22.728918109Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552348800000 ulid=01D5QZPRX8QTED41V247Z33BHA
level=info ts=2019-03-13T16:46:22.728971869Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552348800000 maxt=1552413600000 ulid=01D5SW38M2EAZ17GHMCRGMJD6N
level=info ts=2019-03-13T16:46:22.729011805Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552478400000 maxt=1552485600000 ulid=01D5VS2ED2K6GPSHEG9RMQ60BP
level=info ts=2019-03-13T16:46:22.729064622Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552413600000 maxt=1552478400000 ulid=01D5VSN059YMBNVQ6DH7YR9G08
level=warn ts=2019-03-13T16:53:51.738147446Z caller=head.go:440 component=tsdb msg="unknown series references" count=1328
level=info ts=2019-03-13T16:54:06.072122896Z caller=main.go:635 msg="TSDB started"
level=info ts=2019-03-13T16:54:06.07272842Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-03-13T16:54:06.076760211Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T16:54:06.078139656Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T16:54:06.079844733Z caller=main.go:722 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-03-13T16:54:06.079869225Z caller=main.go:589 msg="Server is ready to receive web requests."
#################################

Environment

Production

  • System information:

Linux 3.10.0-957.1.3.el7.x86_64 x86_64

  • Prometheus version:

2.7.1

  • Logs:
level=info ts=2019-03-13T11:03:58.318187579Z caller=compact.go:398 component=tsdb msg="write block" mint=1552464000000 maxt=1552471200000 ulid=01D5VBB04RACFBS5GV9AEWE3KX
level=info ts=2019-03-13T11:04:13.696640067Z caller=head.go:488 component=tsdb msg="head GC completed" duration=10.04167455s
level=info ts=2019-03-13T11:06:12.978569659Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=21213 high=21385 duration=1m59.281863593s
level=info ts=2019-03-13T13:04:15.507270512Z caller=compact.go:398 component=tsdb msg="write block" mint=1552471200000 maxt=1552478400000 ulid=01D5VJ6QCS3CE4RDMAN7D2CGPV
level=info ts=2019-03-13T13:04:31.808217786Z caller=head.go:488 component=tsdb msg="head GC completed" duration=11.028815577s
level=info ts=2019-03-13T13:06:34.248671945Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=21386 high=21559 duration=2m2.440389499s
level=info ts=2019-03-13T15:04:29.414211092Z caller=compact.go:398 component=tsdb msg="write block" mint=1552478400000 maxt=1552485600000 ulid=01D5VS2EMR46MZNPZQT2TTZMBE
level=info ts=2019-03-13T15:04:52.223551702Z caller=head.go:488 component=tsdb msg="head GC completed" duration=17.153899949s
level=info ts=2019-03-13T15:06:55.439356735Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=21560 high=21735 duration=2m3.215750709s
level=info ts=2019-03-13T15:10:16.512350727Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1552456800000 maxt=1552478400000 ulid=01D5VSF998PPKF5NMBCPH8YRHX sources="[01D5V4F8WRC81F5MQX60ZYHW1C 01D5VBB04RACFBS5GV9AEWE3KX 01D5VJ6QCS3CE4RDMAN7D2CGPV]"
level=info ts=2019-03-13T15:17:06.405744Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1552413600000 maxt=1552478400000 ulid=01D5VSNKTP8R0P3CFYHTJ555ZK sources="[01D5THV5QV6J9BY3TW4RRAWWS4 01D5V4TDBP7C32NWZ8D439QTEF 01D5VSF998PPKF5NMBCPH8YRHX]"
level=info ts=2019-03-13T15:52:06.945656904Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1552284000000 maxt=1552478400000 ulid=01D5VT28M3HQTPFF1D264S4M31 sources="[01D5QZ2H71B2B82YX6ST1A0QC4 01D5SVZ1ADFT2MXZXKTGH8RPJS 01D5VSNKTP8R0P3CFYHTJ555ZK]"

We have allocated 10 CPU cores and 180 GB of memory to the container, as we are processing 400K samples/s (prometheus_tsdb_head_samples_appended_total ≈ 400k).
I can see a huge difference between the allocated memory and the RSS.
[image: container memory graph, allocated memory vs RSS]
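For reference, assuming this Prometheus scrapes its own /metrics endpoint (an assumption, since the scrape config isn't shown), the ingestion rate and the gap between RSS and the Go heap can be sanity-checked with queries like these:

rate(prometheus_tsdb_head_samples_appended_total[5m])   # ingestion rate, expected around 400K/s here
process_resident_memory_bytes                           # RSS as seen by the kernel
go_memstats_alloc_bytes                                 # bytes currently allocated on the Go heap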

simonpasquier commented Mar 15, 2019

The RSS includes memory-mapped files, and the kernel will reclaim that memory whenever it is needed. Regarding the startup time, Prometheus needs to sanity-check the data (blocks + WAL) before it is ready to serve. Long startup times combined with long compaction times usually mean that your storage isn't fast enough and/or you have too many time series/samples.

What's the value of the prometheus_tsdb_head_series metric? What kind of storage are you using?
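As a side note, if cAdvisor/kubelet metrics are available for this pod (an assumption; the monitoring setup isn't shown here), comparing the cgroup counters against the Go heap makes the memory-mapped-file effect visible:

container_memory_rss                  # anonymous memory charged to the container
container_memory_cache                # page cache, including mmapped TSDB blocks
container_memory_working_set_bytes
go_memstats_heap_inuse_bytes          # what the Prometheus process itself keeps on its heap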

vishksaj commented Mar 19, 2019

Hi @simonpasquier:

prometheus_tsdb_head_series: 10.35 million
I am using an EBS gp2 volume with 1500 GiB of storage and 4500 IOPS.
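For reference, two quick ways to check whether the volume keeps up (hedged: iostat requires the sysstat package, and the summary metric below is exposed by Prometheus 2.x itself, though names can vary slightly between versions):

iostat -x 5   # watch await/%util on the EBS device while a compaction runs
rate(prometheus_tsdb_wal_fsync_duration_seconds_sum[5m]) / rate(prometheus_tsdb_wal_fsync_duration_seconds_count[5m])   # average WAL fsync latency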

allangood commented Mar 19, 2019

I'm experiencing the same issue. It started happening when I migrated to 2.7.1. I tried downgrading to 2.6.0, but the problem persists. After the upgrade I noticed a huge memory usage spike every 2 hours (TSDB-related jobs). In my case, Prometheus has been completely offline since the upgrade.

simonpasquier commented Mar 20, 2019

A Prometheus server with 10M series is a large setup. You may want to consider sharding your targets across multiple servers (a relabeling sketch is below).

@allangood if you see the same behavior after rolling back to v2.6, it seems unrelated to the version change. Most probably the dataset has grown, causing higher load. Every 2 hours, Prometheus writes the in-memory head data out to a block, so a memory usage increase is expected around that time.
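A minimal sketch of the sharding mentioned above, using the documented hashmod relabeling pattern (the job name is purely illustrative; each server keeps a different shard number):

scrape_configs:
  - job_name: 'node'              # hypothetical job, for illustration only
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2                # total number of shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: '0'                # this server keeps shard 0; use '1' on the second server
        action: keep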

allangood commented Mar 21, 2019

Hi @simonpasquier, I would like to agree, but my Prometheus server currently doesn't have any targets defined (I've disabled everything in the hope of getting it back online, without luck). I don't know if the upgrade changed something in the TSDB, but even without any targets defined, my Prometheus just can't come online. It starts consuming all available RAM, then hits swap hard, the process gets OOM-killed, systemd restarts it, and the cycle repeats. Sadly, my server is down and I can't show Grafana graphs from before and after the upgrade. Before, memory usage was a flat line below 9 GB (like this: "-"); after, it is an ever-growing line (like this: "/") with huge spikes every 2 hours.
I'm still trying to get the server back online so I can fill in this post with some useful information.
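One way to confirm the OOM kills from the host side (standard kernel logging, nothing Prometheus-specific):

dmesg -T | grep -i 'killed process'
journalctl -k | grep -i oom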

simonpasquier commented Mar 21, 2019

@allangood Anything in the logs? Have you tried starting Prometheus with --log.level=debug?

allangood commented Mar 21, 2019

Hi @simonpasquier,

Nothing really useful in the logs:

Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.575835541Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552672800000 maxt=1552737600000 ulid=01D63GR3FESME164T55FCJK0F3
Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.577263756Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552737600000 maxt=1552802400000 ulid=01D65EA7QEPZ7GQ3QNDVD3KZ41
Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.57868187Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552867200000 maxt=1552874400000 ulid=01D67BVP183WNC8SRJ8QEB5S6V
Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.58071069Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552802400000 maxt=1552867200000 ulid=01D67CBSQ51NMMEQ9R6C5YXSRW
Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.582475008Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552874400000 maxt=1552881600000 ulid=01D6F65YETXDCYYCDARZY7S366
Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.583817322Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552881600000 maxt=1552888800000 ulid=01D6FQ2DPC27B2V6XB7H7B030N
Mar 21 09:41:08 grafana prometheus[28770]: level=info ts=2019-03-21T14:41:08.595289537Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552888800000 maxt=1552896000000 ulid=01D6FX0A57Q464GE1P4N0RF6VV
Mar 21 09:41:08 grafana prometheus[28770]: level=warn ts=2019-03-21T14:41:08.811363705Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/var/lib/databases/prometheus/wal/00017776

I will post the output of tsdb analyze /var/lib/prometheus to give you more information.
I haven't run with --log.level=debug yet; I will post that output together with the tsdb output.
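For what it's worth, the size of the WAL that has to be replayed on startup can be checked directly (path taken from the log above); a WAL of many gigabytes would help explain both the long startup and the memory spike during replay:

du -sh /var/lib/databases/prometheus/wal
ls /var/lib/databases/prometheus/wal | wc -l   # number of segments and checkpoints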

Thank you very much!

allangood commented Mar 21, 2019

Here are the outputs:

TSDB

tsdb analyze /var/lib/prometheus/
Block path: /var/lib/prometheus/01D6FX0A57Q464GE1P4N0RF6VV
Duration: 2h0m0s
Series: 232523
Label names: 153
Postings (unique label pairs): 10848
Postings entries (total label pairs): 3020635

Label pairs most involved in churning:
160196 collector=telegraf
160196 job=prometheus
156610 exported_collector=telegraf
105062 exported_location=wpg
94247 ostype=linux
81032 sla=production
76856 sla=internal
74868 pattern=.*
65767 user=root
62362 ostype=windows
53231 location=wpg
38684 location=azure
28416 location=eu
21523 exported_location=azure
20116 location=raspberry_pi
19720 tv_mode=NOC
19720 system_identifier=raspberry_pi
17642 location=na
16815 exported_location=eu
13117 objectname=Process
9994 collector=snmp

Label names most involved in churning:
175707 __name__
175680 instance
175360 job
174896 collector
164358 sla
161410 location
157731 host
156610 ostype
156610 exported_location
156610 exported_collector
156610 fqdn
75462 pattern
75457 user
75457 process_name
32030 exported_instance
30031 objectname
19720 tv_mode
19720 system_identifier
15411 service_name
15411 display_name
8425 ifIndex

Most common label pairs:
214475 collector=telegraf
214475 job=prometheus
209305 exported_collector=telegraf
145451 exported_location=wpg
113875 ostype=linux
111829 sla=internal
99735 sla=production
95430 ostype=windows
90372 pattern=.*
83339 location=wpg
79450 user=root
48267 location=azure
31312 location=eu
27657 exported_location=azure
25512 location=na
24148 location=raspberry_pi
23627 tv_mode=NOC
23627 system_identifier=raspberry_pi
19083 objectname=Process
17997 exported_location=eu
12594 exported_location=na

Highest cardinality labels:
2814 exported_instance
2180 __name__
517 physical_filename
436 display_name
428 service_name
410 process_name
398 instance
305 database_name
289 exchange
258 logical_filename
239 pid
215 ifName
173 ifDescr
158 ifIndex
143 host
129 fqdn
119 index
112 ifAlias
111 wait_type
108 message_type
74 version_configstring

Highest cardinality metric names:
12184 win_services_startup_mode
12184 win_services_state
4302 procstat_memory_rss
4301 procstat_cpu_time_user
4272 procstat_cpu_usage
4055 procstat_pid
4051 procstat_memory_locked
4051 procstat_cpu_time_stolen
4051 procstat_cpu_time_idle
4051 procstat_memory_data
4051 procstat_cpu_time_nice
4051 procstat_memory_stack
4051 procstat_num_threads
4051 procstat_cpu_time_irq
4050 procstat_cpu_time_guest
4050 procstat_involuntary_context_switches
4050 procstat_memory_swap
4050 procstat_cpu_time_guest_nice
4050 procstat_cpu_time_soft_irq
4050 procstat_memory_vms
4050 procstat_cpu_time_steal
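For comparison with the block-level numbers above, the same cardinality breakdown can be pulled live from a running server with a query like the following (it is expensive, so use it sparingly):

topk(20, count by (__name__)({__name__=~".+"}))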

Debug

./prometheus --config.file=/etc/prometheus/prometheus.yaml --storage.tsdb.path=/var/lib/prometheus --log.level=debug --storage.tsdb.retention.size="140GB"
level=info ts=2019-03-21T18:12:56.884239761Z caller=main.go:321 msg="Starting Prometheus" version="(version=2.8.0, branch=HEAD, revision=59369491cfdfe8dcb325723d6d28a837887a07b9)"
level=info ts=2019-03-21T18:12:56.884323651Z caller=main.go:322 build_context="(go=go1.11.5, user=root@4c4d5c29b71f, date=20190312-07:46:58)"
level=info ts=2019-03-21T18:12:56.884359294Z caller=main.go:323 host_details="(Linux 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 prometheus (none))"
level=info ts=2019-03-21T18:12:56.884393024Z caller=main.go:324 fd_limits="(soft=1024, hard=1048576)"
level=info ts=2019-03-21T18:12:56.884421931Z caller=main.go:325 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-21T18:12:56.885306915Z caller=main.go:640 msg="Starting TSDB ..."
level=info ts=2019-03-21T18:12:56.885391948Z caller=web.go:418 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-03-21T18:12:56.890933332Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54KN0RE4H1Z1M44H9QHFK71
level=info ts=2019-03-21T18:12:56.903193246Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5ADJMQNZYAY87G6GN49AN75
level=info ts=2019-03-21T18:12:56.922783573Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5KC6KV53PXFBAGGMKA3FR56
level=info ts=2019-03-21T18:12:56.941118165Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552176000000 maxt=1552284000000 ulid=01D5P75V1PJP96W658WZ3TJY4W
level=info ts=2019-03-21T18:12:56.960027187Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552478400000 ulid=01D5VS8ZX6T24HNZYE7BENAP1S
level=info ts=2019-03-21T18:12:56.963473354Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552478400000 maxt=1552672800000 ulid=01D61K18N5A7P7VKPB2D5JNM3K
level=info ts=2019-03-21T18:12:56.968559887Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552672800000 maxt=1552737600000 ulid=01D63GR3FESME164T55FCJK0F3
level=info ts=2019-03-21T18:12:56.972487403Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552737600000 maxt=1552802400000 ulid=01D65EA7QEPZ7GQ3QNDVD3KZ41
level=info ts=2019-03-21T18:12:56.975940521Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552867200000 maxt=1552874400000 ulid=01D67BVP183WNC8SRJ8QEB5S6V
level=info ts=2019-03-21T18:12:56.978950026Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552802400000 maxt=1552867200000 ulid=01D67CBSQ51NMMEQ9R6C5YXSRW
level=info ts=2019-03-21T18:12:56.983586059Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552874400000 maxt=1552881600000 ulid=01D6F65YETXDCYYCDARZY7S366
level=info ts=2019-03-21T18:12:56.988529977Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552881600000 maxt=1552888800000 ulid=01D6FQ2DPC27B2V6XB7H7B030N
level=info ts=2019-03-21T18:12:56.992330002Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552888800000 maxt=1552896000000 ulid=01D6FX0A57Q464GE1P4N0RF6VV
level=info ts=2019-03-21T18:12:56.994809201Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552896000000 maxt=1552903200000 ulid=01D6GQ3177J430W10RVA03T5F8

Then the service gets stuck here forever and is eventually OOM-killed.

I'm in the process of migrating the TSDB files to another server with more resources; when I'm done, I will come back with the graphs.

Thank you again @simonpasquier

allangood commented Mar 21, 2019

The TSDB is back online.

This image shows the memory consumption before and after the upgrades.
The first blue line (around 3/6) marks the upgrade from 2.6.0 to 2.7.2, and the second marks the upgrade to 2.8.0:
[image: memory consumption before and after the upgrades]

This is what is happening now:
[image: current memory consumption]

And this is the behaviour with Prometheus 2.6.0:
[image: memory consumption on 2.6.0]

This is another server, in another datacenter, with the same behaviour (the blue line marks the upgrade from 2.6.0 to 2.7.2):
[image: memory consumption on the second server]

Nothing was changed other than the Prometheus version.

allangood commented Mar 22, 2019

Some more information:

These graphs come from the Prometheus Benchmark 2.x dashboard.
The blue line marks the upgrade from 2.6.0 to 2.7.2.

This is the strangest graph: the GC target climbed from an average of 4 GB to 10 GB, and then to 14 GB:
[image: GC graph]

[image]

[image]

The rate of samples appended didn't change over that period:
[image: samples appended]
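(The dashboard's exact panel queries aren't reproduced here, but a GC-target graph like the one described is typically built from the Go runtime metrics Prometheus exports, for example:)

go_memstats_next_gc_bytes              # heap size at which the next GC will trigger
go_memstats_heap_inuse_bytes
rate(go_gc_duration_seconds_sum[5m])   # time spent in garbage collection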

brian-brazil commented Mar 22, 2019

allangood commented Mar 22, 2019

Hi @brian-brazil,

Can you help me understand this issue? What piece of information helped you find it? And, most importantly for me, how can I prevent this from happening again?
This issue made Prometheus behave very erratically, consuming all available memory and making the whole machine unavailable (I couldn't even log in to these machines because of the OOM situation).
As you can see, the problem started out of the blue (from my point of view; obviously something happened).

Another problem was the troubleshooting process: the Prometheus server itself didn't log any useful information, and the database was completely offline. The tsdb tool didn't work either. How can I deal with this problem if it happens again?

Thank you very much for your time and information.
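If it does happen again, one thing that usually helps (hedged: it requires the Go toolchain, and the server has to stay up long enough to answer) is grabbing a heap profile from Prometheus's built-in pprof endpoint and attaching it to the report:

curl -s http://localhost:9090/debug/pprof/heap > heap.pprof
go tool pprof heap.pprof   # then e.g. 'top' inside the interactive prompt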

Natalique commented Mar 26, 2019

I'm having the same issue. It takes 20 minutes for the TSDB to start, and about 2 minutes later Prometheus gets OOM-killed.
