Errors observed in logs #4177
Comments
@krasi-georgiev I did not do anything unusual to cause this. The command line is: PATH/prometheus/prometheus --config.file /config/prometheus.yml --web.listen-address=127.0.0.1:8180 --storage.tsdb.path=/data/ --storage.tsdb.retention=90d --storage.tsdb.no-lockfile --log.level=debug --web.external-url=/prometheus --web.console.libraries=PATH/prometheus/console_libraries --web.console.templates=PATH/prometheus/consoles --web.enable-admin-api --web.enable-lifecycle
dhbarman changed the title from "Error observed in logs" to "Errors observed in logs" on May 23, 2018
@simonpasquier do you have any ideas? I looked briefly but can't see what is causing this second issue.
It looks like a bug in a third-party package causing a panic which is then caught by the HTTP handler. @dhbarman you're not able to relate the error message to a specific URL?
@simonpasquier Which third-party package might be causing this panic? I am not able to relate it to a specific URL. Is it possible to log that URL in the message?
It happens here: prometheus/vendor/github.com/go-kit/kit/log/stdlib.go, lines 89 to 91 (at 18e6fa7).
What I find strange is why the panic shows random ports: panic serving 127.0.0.1:58206....
Is it possible to improve the logging when err != nil so that more information can be captured?
I did a bit of research and found go-kit/kit#233. Basically the go-kit log package mangles the log message and stack trace when something panics in an HTTP handler. I've hacked a change that catches the panic, logs the error and stack trace, and then re-raises the panic. @dhbarman can you build it and try it in your environment? It is at https://github.com/simonpasquier/prometheus/tree/log-stack-trace. If you prefer, I can build the binary.
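For illustration only, a minimal sketch of the idea described above (not the actual patch in that branch): wrap an HTTP handler so a panic is logged with its stack trace as plain strings before being re-raised. The handler, route, and addresses here are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"runtime/debug"

	"github.com/go-kit/kit/log"
)

// withPanicLogging logs the panic value and stack trace, then re-raises the
// panic so net/http still handles it as before.
func withPanicLogging(logger log.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if p := recover(); p != nil {
				// Pass plain strings so the logfmt encoder cannot mangle them.
				logger.Log("msg", "panic in HTTP handler", "url", r.URL.String(),
					"err", fmt.Sprint(p), "stack", string(debug.Stack()))
				panic(p)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	http.Handle("/boom", withPanicLogging(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var m map[string]int
		m["x"] = 1 // deliberate panic: assignment to entry in nil map
	})))
	http.ListenAndServe("127.0.0.1:8180", nil)
}
```

The re-raised panic is then recovered by net/http's own connection handler, which prints the familiar `http: panic serving 127.0.0.1:<port>` lines; the varying port is simply the client connection's ephemeral source port.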
@simonpasquier You may build the binary. Last time I tried to build prometheus from source, I ran into many dependency issues.
I could see the panic with the instrumented image:
Interesting, this smells like a data race. I'm going to compile a prometheus binary with race detection enabled. It would be super cool if you could give it a try!
Looks like we are getting somewhere. The race looks like it is somewhere in the promql package.
@krasi-georgiev I do not have an easy way to replicate the issue. I deploy the binary on test servers and wait for the issue to recur. Sometimes it happens immediately, sometimes it takes more time. Let me know if you have a new binary from the latest master.
Although one stack trace is incomplete, it seems that the race is between the query handler and the rule evaluator.
hm I compiled it with
Thanks @dhbarman. Given the stack traces, I suspect that too many queries may be running at the same time. Can you post the following metrics?
Regarding your configuration, I notice that you partition the scrapes across several jobs. Is the reason that a single scrape takes too long to complete?
No matter how many queries are running at once, we shouldn't be having races.
I was just in the middle of writing this, @simonpasquier. Let me know if you find out how to replicate it and I will be interested to look into this as well. Races are interesting.
Ah, I know what this is. This was incorrect management of the matrices that are reused across queries. I noticed this, and removed that code as part of #3966. This is fixed in 2.3.0.
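Purely as an illustration of that bug class (a hypothetical sketch, not the removed promql code): a result buffer that is reused across queries running concurrently, e.g. from the query handler and the rule evaluator, is exactly the kind of data race a race-enabled build flags.

```go
package main

import "sync"

// reusedMatrix stands in for a result buffer that is incorrectly shared and
// reused across queries instead of being owned by a single evaluation.
var reusedMatrix []float64

func evalQuery(wg *sync.WaitGroup) {
	defer wg.Done()
	reusedMatrix = reusedMatrix[:0]           // concurrent write to the shared slice header
	reusedMatrix = append(reusedMatrix, 1, 2) // concurrent write to the backing array
	_ = len(reusedMatrix)                     // concurrent read
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go evalQuery(&wg) // queries running at the same time
	}
	wg.Wait() // a race-enabled build reports a data race on reusedMatrix here
}
```

Giving each evaluation its own buffer (or guarding the shared one) removes the race, which matches the fix described above: the reuse was removed in #3966.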
Sure, I was just wondering why it happens with @dhbarman's setup and hasn't been reported by other users. |
Hm, I attached a new binary, just in case, for another test.
@dhbarman not sure if it will help with the
@krasi-georgiev It can't have; the stack trace mentions functions that were removed in my PR.
I have seen the following errors so far:
The initial report here is fixed. Please open a new issue for anything else so that we can keep things organised.
brian-brazil closed this on Jun 12, 2018
brian-brazil added the kind/bug, component/ui, component/promql and component/api labels and removed the component/ui label on Jun 12, 2018
@brian-brazil the PR to fix the original issue hasn't been merged yet. @dhbarman thanks for testing it again. It seems all races are fixed.
Oh sorry, I confused this with a different bug.
brian-brazil reopened this on Jun 12, 2018
Wait, actually I didn't. There are 3 distinct issues in this bug now; this is confusing.
Yeah, we thought it was all part of the same bug, but it seems it is all good now.
Okay, now both of those are fixed. @dhbarman can you file the HELP log as a new bug, please?
brian-brazil closed this on Jun 12, 2018
lock bot commented on Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
dhbarman commented May 21, 2018 (edited)
Bug Report
What did you do?
We see these messages in our prometheus logs.
What did you expect to see?
No errors or warnings in the log messages.
What did you see instead? Under which circumstances?
Environment
System information:
Linux 4.1.12-124.14.3.el7uek.x86_64 x86_64
Prometheus version:
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c)
build user: root@149e5b3f0829
build date: 20180314-14:15:45
go version: go1.10
Prometheus configuration file:
Logs:
For the logging statement below:
level=debug ts=2018-04-26T05:08:32.100997431Z caller=file.go:289 component="discovery manager scrape" discovery=file msg="Stopping file discovery..." paths="unsupported value type"
The following stop() function gets called when the file watcher needs to be stopped.
github.com/prometheus/prometheus/discovery/file/file.go
The logging function does not handle arrays/slices.
github.com/go-logfmt/logfmt/encode.go
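To illustrate (a minimal sketch, not the Prometheus code; the path value here is hypothetical): passing a slice directly to the go-kit logfmt logger makes go-logfmt substitute the literal string "unsupported value type" for the value, which is exactly what the debug line above shows. Formatting the slice into a string first keeps the information.

```go
package main

import (
	"fmt"
	"os"

	"github.com/go-kit/kit/log"
)

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	paths := []string{"/config/targets/*.yml"}

	// A slice is not an encodable logfmt value, so the output becomes:
	//   msg="Stopping file discovery..." paths="unsupported value type"
	logger.Log("msg", "Stopping file discovery...", "paths", paths)

	// Pre-formatting the slice keeps the actual paths in the log line.
	logger.Log("msg", "Stopping file discovery...", "paths", fmt.Sprintf("%v", paths))
}
```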