Consul issues causing Prometheus panic #1206

Closed
nikgrok opened this Issue Nov 10, 2015 · 27 comments

nikgrok commented Nov 10, 2015

Hey guys,

Our Prometheus servers are all fine except for the machine that handles a few thousand servers. Check out the log output. The service crashes every ~12 hours.

https://gist.github.com/nikgrok/ef6309e83bd79db74211

juliusv commented Nov 10, 2015

Hey, I'm wondering a bit about that stack trace. Is it complete? It looks cut off at the top. Normally a panic stack trace starts with a "panic [...]" message and then immediately lists the goroutine in which it panicked. Also, there are still log lines after the stack trace, which means the panic that generated the stack dump must have happened in a component that catches panics, so the program kept running (as we do with panics during queries or other HTTP requests).

nikgrok commented Nov 10, 2015

I can give you a longer log if you need.

juliusv commented Nov 10, 2015

Yes, please. Without the very beginning of the stack trace, the trace isn't very useful, unfortunately. Question: does the Prometheus server really fully crash/stop after this? Because the continuing log lines after the stack trace indicate that it continues running (but not working properly anymore?).

nikgrok commented Nov 10, 2015

You can still hit the graph page, but it stops collecting, the status page breaks, and all queries show no data. I'll need to wait for the next failure, and then I'll send the whole log.

juliusv commented Nov 10, 2015

That's intriguing. We don't recover() any panics in the SD+scraping parts of Prometheus, so I would expect any panic in those parts to not only stop the ingestion, but take down the entire server. So let's see where the panic is really coming from. Will await the logs.

nikgrok commented Nov 10, 2015

Here's the output. Weirdly, it actually fully crashed this time.

https://gist.github.com/nikgrok/252180ffa1613ee4a803

juliusv commented Nov 10, 2015

That still doesn't look like the full crash dump though. Those always start with a line that contains the word "panic". Do your logs provide more backwards history?

nikgrok commented Nov 10, 2015

It's supervisord's capture of the stderr output. I think that's all that went to stderr.

gfliker commented Nov 10, 2015

Hi,
That's the complete log file that we got.
We're running with "-log.level error"; maybe that masks a few things?

juliusv commented Nov 10, 2015

A panic is independent of our logging library and always goes to stderr (but our logging also logs to stderr by default). However, it is clear from your latest gist (https://gist.github.com/nikgrok/252180ffa1613ee4a803) that it starts even in the middle of a line, so it's really just cut off at the beginning. So if you could run your Prometheus in a way that enables you to capture more stderr history, that would be very helpful.
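
For example, if you run Prometheus under supervisord, something along these lines should keep enough stderr history around (this is just a sketch; the program name, command path, flags, and size limits below are placeholders, not your actual config):

[program:prometheus]
; assumed command path and flags
command=/usr/local/bin/prometheus -config.file=/etc/prometheus/prometheus.yml
redirect_stderr=false
stderr_logfile=/var/log/supervisor/prometheus-stderr.log
; keep enough rotated history to capture a full panic dump
stderr_logfile_maxbytes=200MB
stderr_logfile_backups=5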

gfliker commented Nov 10, 2015

I gave the log files more space; let's hope we can catch it now.

nikgrok commented Nov 11, 2015

Hey Julius,

Here is the full log. You should see the whole stack trace. Around line 70293, the service was restarted by supervisord, and you should be able to see the standard error output I mentioned before. I had to attach it as a file because Gist was melting.

last-crashed-prom-log.txt

brian-brazil commented Nov 11, 2015

fatal error: runtime: out of memory

Looks like you need to tune your memory options.

nikgrok commented Nov 11, 2015

While I noticed that as well, I think the real question is why the Consul errors are causing Prometheus to consume so much memory.

brian-brazil commented Nov 11, 2015

Can you do a heap and growth profile to see where the memory is coming from? See http://blog.golang.org/profiling-go-programs

beorn7 commented Nov 11, 2015

And read http://prometheus.io/docs/operating/storage/#memory-usage . The default memory settings are meant for a 4GiB machine. With 2GiB of RAM, you will run out of memory after some time of collecting metrics.

nikgrok commented Nov 11, 2015

I think the memory issues are a symptom of the overall problem.

We tuned past the default settings. Do you think these values should be tuned further?

-storage.local.memory-chunks=3600000
-storage.local.max-chunks-to-persist=2880000
-query.max-concurrency=20
-query.staleness-delta=5m0s

nikgrok commented Nov 11, 2015

Here's the memory profile

             total       used       free     shared    buffers     cached
Mem:            62         61          1          0          5         10

beorn7 commented Nov 12, 2015

That's GiB, I assume? (Did you run free -g?)

Then something went mental, indeed.

You could take a heap profile to check.

See https://golang.org/pkg/net/http/pprof/ .

It's enabled on your Prometheus server by default.
Run the following:

go tool pprof http://your-prometheus-server:9090/debug/pprof/heap

And then, at the pprof prompt, run

svg > heap.svg

Then you can look at the heap.svg file or send it here to the list.

nikgrok commented Nov 12, 2015

Here is the heap.svg:

https://gist.github.com/nikgrok/b596ef15725e6045378b

beorn7 commented Nov 12, 2015

Looks not horribly off to me. The largest share of memory is used by the time series chunks: ~5GiB, which is a bit on the high side. In my experience, the heap profile shows pretty much exactly the memory the chunks need, i.e. it should be ~3.6GiB.

Could you check the prometheus_local_storage_memory_chunks metric on your Prometheus to see how many chunks are actually in memory? (Grep for it in the /metrics output, or look at it in the expression browser if your server is monitoring itself.)
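
For example (assuming the server is listening on localhost:9090; adjust the host and port to your setup):

curl -s http://localhost:9090/metrics | grep prometheus_local_storage_memory_chunks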

But there is definitely not a massive amount of memory disappearing in Consul-related code.

I've just realized that your free -g output is incomplete. The second row is way more interesting. Also, check what the actual RSS memory size of your Prometheus process is. Run

ps -eo comm,rss | grep prometheus

Or look at the process_resident_memory_bytes{instance=~"localhost"} metric in the expression browser if your Prometheus server is monitoring itself.

In your case, it should be around 25GiB. (In my experience, RSS is about twice the heap size reported by the heap profiling. Don't ask me why...)

nikgrok commented Nov 12, 2015

prometheus_local_storage_memory_chunks 5.690125e+06

# free -g
             total       used       free     shared    buffers     cached
Mem:            62         58          4          0          7         10
-/+ buffers/cache:         40         22
Swap:            7          0          7

# ps -eo comm,rss | grep prometheus
prometheus      36948532

beorn7 commented Nov 12, 2015

nikgrok commented Nov 12, 2015

prometheus_local_storage_memory_series 3.376371e+06
prometheus_local_storage_chunks_to_persist 2.304041e+06
prometheus_local_storage_memory_chunkdescs 4.3417645e+07

beorn7 commented Nov 12, 2015

OK, so it looks like variant 2. You have a lot of time series in memory (3.4M). As a rule of thumb, I'd set -storage.local.memory-chunks to ~3x the number of time series in memory; see the storage-tuning documentation. In your case, that means ~10M memory chunks. You will then regularly use most of your 64GiB of RAM, but I could imagine your memory usage will be less spiky. With your current configuration, the server frantically tries to evict chunks from memory and might do more harm than good by doing so.

Of course, when a server runs that close to its memory limit, the limit might easily be reached upon any kind of hiccup.

I'm pretty confident that this issue has nothing to do with the Consul integration. I'm closing the issue, but feel free to discuss more here (or on the mailing list, where it's perhaps more useful for others with the same memory tweaking problems).

#455 is also kind of relevant here.

beorn7 closed this Nov 12, 2015

beorn7 commented Nov 12, 2015

And BTW, also increase -storage.local.max-chunks-to-persist to at least half the value of -storage.local.memory-chunks.
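
Concretely, based on the numbers above (~3.4M series x 3 ≈ 10M memory chunks, and at least half of that for chunks to persist), the flags might look something like this (the exact values are a judgment call, not a prescription):

-storage.local.memory-chunks=10000000
-storage.local.max-chunks-to-persist=5000000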

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
