Consul issues causing Prometheus panic #1206

nikgrok commented Nov 10, 2015

Hey guys,

Our Prometheus servers are all fine except for the machine that handles a few thousand servers. Check out the log output. The service crashes every ~12 hours.

https://gist.github.com/nikgrok/ef6309e83bd79db74211

Comments
Hey, I'm wondering a bit about that stack trace. Is it complete? It looks cut off at the top. Normally a panic stack trace starts with a "panic [...]" message and then right after that lists the goroutine in which it panicked. Also, there are still log lines after the stack trace, which means that the panic that generated that stack dump must have happened in a component that catches panics and the program kept running (like we do with panics during queries or other HTTP requests).
I can give you a longer log if you need.
Yes, please. Without the very beginning of the stack trace, the trace isn't very useful, unfortunately. Question: does the Prometheus server really fully crash/stop after this? Because the continuing log lines after the stack trace indicate that it continues running (but not working properly anymore?).
You can still hit the graph page, but it stops collecting, the status page breaks, and all queries show no data.
That's intriguing. We don't …
Here's the output. Weirdly, it actually fully crashed this time.
That still doesn't look like the full crash dump though. Those always start with a line that contains the word "panic". Do your logs provide more backwards history?
It's Supervisord log output to stderr. I think that's what you get to …
gfliker commented Nov 10, 2015

Hi, …
A panic is independent of our logging library and always goes to stderr (but our logging also logs to stderr by default). However, it is clear from your latest gist (https://gist.github.com/nikgrok/252180ffa1613ee4a803) that it starts even in the middle of a line, so it's really just cut off at the beginning. So if you could run your Prometheus in a way that enables you to capture more stderr history, that would be very helpful.
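For what it's worth, a minimal sketch of one way to do that (the config path, log path, and flag value below are placeholders, not the setup from this thread) is to append the process's stderr, where Go writes panic stack traces, to a file that isn't size-capped:

    # Sketch only: keep all of stderr so the start of a crash dump is never truncated.
    # Paths and the config flag are illustrative placeholders.
    prometheus -config.file=/etc/prometheus/prometheus.yml \
      2>> /var/log/prometheus/stderr.log

Under supervisord, raising stderr_logfile_maxbytes for the program section (or setting it to 0 to disable rotation) achieves much the same thing.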
gfliker commented Nov 10, 2015

I gave the log files more space, so let's hope we can catch it now.
Hey Julius, here is the gist. You should see the whole stack trace. Around line 70293, the service was restarted by supervisord, and you should be able to see the standard error output I mentioned before. I had to attach it because Gist was melting.
Looks like you need to tune your memory options. |
While I noticed that as well, I think the question is more in line with why …
Can you do a heap and growth profile to see where the memory is coming from? See http://blog.golang.org/profiling-go-programs
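For reference, the kind of commands involved look roughly like this (assuming the default listen address of localhost:9090; adjust as needed). -inuse_space shows what is currently on the heap, while -alloc_space shows cumulative allocations, which helps spot where the growth comes from:

    # Heap profile of the running server (the pprof endpoint is enabled by default).
    go tool pprof -inuse_space http://localhost:9090/debug/pprof/heap
    # Cumulative allocations, useful for seeing growth over time.
    go tool pprof -alloc_space http://localhost:9090/debug/pprof/heap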
And read http://prometheus.io/docs/operating/storage/#memory-usage . The default memory settings are meant for a 4GiB machine. With 2GiB of RAM, you will run out of memory after some time of collecting metrics.
I think the memory issues are a symptom of the overall problem. We tuned past the default settings. Do you think these values should be …

-storage.local.memory-chunks=3600000
Here's the memory profile:

Mem: 62 61 1 0 5 10
That's GiB, I assume? (Did you run free -g?)

Then something went mental, indeed. You could take a heap profile to check. See https://golang.org/pkg/net/http/pprof/ . It's enabled on your Prometheus server by default.

go tool pprof http://your-prometheus-server:9090/debug/pprof/heap

And then, at the pprof prompt:

svg > heap.svg

Then you can look at the heap.svg file or send it here to the list.
Here is the heap.svg.
Looks not horribly off to me. The largest chunk of memory is used by …

Could you check the prometheus_local_storage_memory_chunks metric on …

But there is definitely not a massive amount of memory disappearing in …

I've just realized that your …

ps -eo comm,rss | grep prometheus

Or look at the process_resident_memory_bytes{instance=~"localhost"} …

In your case, it should be around 25GiB. (In my experience, RSS is …)
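Concretely, a quick way to compare the two views is shown below (default port assumed; note that ps reports RSS in KiB while the metric is in bytes):

    # Resident set size as seen by the kernel (KiB).
    ps -eo comm,rss | grep prometheus
    # The same figure as reported by Prometheus itself (bytes).
    curl -s http://localhost:9090/metrics | grep '^process_resident_memory_bytes'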
So you have 22GiB free at the moment. You are not close to running out of memory.

However, you have way more chunks in memory than configured (5.6M vs. the configured 3.6M). And your Prometheus server is using even more memory than I'd expect from the (already increased) number of chunks in memory.

That sounds like you either run a lot of queries that touch many chunks (so the server has to keep more chunks in memory than configured, and it would also explain additional memory usage), or you have too many time series in memory (like multiple millions; obviously, every time series needs at least one chunk in memory...), or your server has trouble persisting chunks to disk and therefore has to keep them in memory instead of evicting them.

Please check the following metrics:
- prometheus_local_storage_memory_series
- prometheus_local_storage_chunks_to_persist
- prometheus_local_storage_memory_chunkdescs

Their values will tell us which of my assumptions above might be true...
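One quick way to read those off the running server (default port and an unauthenticated /metrics endpoint assumed):

    # Current values of the three storage metrics mentioned above.
    curl -s http://localhost:9090/metrics | grep -v '^#' \
      | grep -E 'prometheus_local_storage_(memory_series|chunks_to_persist|memory_chunkdescs)'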
prometheus_local_storage_memory_series 3.376371e+06
OK, so it looks like variant 2. You have a lot of time series in memory (3.4M). As a rule of thumb, I'd set …

Of course, running a server so close to the memory limit is risky; the limit might easily be reached upon any kind of hiccup.

I'm pretty confident that this issue has nothing to do with the Consul integration. I'm closing the issue, but feel free to discuss more here (or on the mailing list, where it's perhaps more useful for others with the same memory tweaking problems). #455 is also kind of relevant here.
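Whatever the exact numbers, the shape of the tuning follows from the discussion: with ~3.4M series in memory, -storage.local.memory-chunks has to sit comfortably above the series count (every series needs at least one chunk), and the machine needs enough RAM to back it. Purely as an illustration, with hypothetical values:

    # Hypothetical values, not a recommendation from this thread: raise the chunk
    # budget well above the ~3.4M in-memory series and size RAM accordingly
    # (see the storage documentation linked earlier for how to reason about this).
    prometheus -config.file=/etc/prometheus/prometheus.yml \
      -storage.local.memory-chunks=10000000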
beorn7 closed this Nov 12, 2015
And BTW, also increase …
lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.