
Errors observed in logs #4177

Closed
dhbarman opened this Issue May 21, 2018 · 38 comments


dhbarman commented May 21, 2018

Bug Report

What did you do?
We see these messages in our Prometheus logs.

What did you expect to see?
No error or warnings in the log messages.

What did you see instead? Under which circumstances?

Environment

  • System information:

    Linux 4.1.12-124.14.3.el7uek.x86_64 x86_64

  • Prometheus version:

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c)
build user: root@149e5b3f0829
build date: 20180314-14:15:45
go version: go1.10

  • Prometheus configuration file:

  • Logs:

level=error ts=2018-04-26T05:28:00.428178689Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:40071" msg="reflect: slice index out of range"
level=error ts=2018-04-26T06:47:56.498513884Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:47619" msg="reflect: slice index out of range"

level=debug ts=2018-04-26T05:08:32.100997431Z caller=file.go:289 component="discovery manager scrape" discovery=file msg="Stopping file discovery..." paths="unsupported value type"
level=debug ts=2018-04-26T05:08:32.101002955Z caller=file.go:289 component="discovery manager scrape" discovery=file msg="Stopping file discovery..." paths="unsupported value type"


For the logging statement below:
level=debug ts=2018-04-26T05:08:32.100997431Z caller=file.go:289 component="discovery manager scrape" discovery=file msg="Stopping file discovery..." paths="unsupported value type"

The following stop() function gets called when the file watcher needs to be stopped:
github.com/prometheus/prometheus/discovery/file/file.go

// stop shuts down the file watcher.
func (d *Discovery) stop() {
        level.Debug(d.logger).Log("msg", "Stopping file discovery...", "paths", d.paths)
        // ...
}

The logging function does not handle arrays/slices.
github.com/go-logfmt/logfmt/encode.go

func writeValue(w io.Writer, value interface{}) error {
        switch v := value.(type) {
        // ...
        default:
                rvalue := reflect.ValueOf(value)
                switch rvalue.Kind() {
                case reflect.Array, reflect.Chan, reflect.Func, reflect.Map, reflect.Slice, reflect.Struct:
                        return ErrUnsupportedValueType
                }
        }
        // ...
}

Am I looking at the correct code?
krasi-georgiev (Member) commented May 22, 2018

The unsupported value type was easy to replicate and @simonpasquier fixed it in the referenced PR, but how did you cause the reflect: slice index out of range? I am unable to reproduce that one.

dhbarman (Author) commented May 23, 2018

@krasi-georgiev I did not do anything unusual to cause the reflect: slice index out of range. I am still seeing the errors on different servers, though Prometheus keeps running:

PATH/prometheus/prometheus --config.file /config/prometheus.yml --web.listen-address=127.0.0.1:8180 --storage.tsdb.path=/data/ --storage.tsdb.retention=90d --storage.tsdb.no-lockfile --log.level=debug --web.external-url=/prometheus --web.console.libraries=PATH/prometheus/console_libraries --web.console.templates=PATH/prometheus/consoles --web.enable-admin-api --web.enable-lifecycle

level=error ts=2018-05-23T13:43:52.944992938Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:59468" msg="reflect: slice index out of range"
level=error ts=2018-05-23T14:43:55.105230013Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:41311" msg="reflect: slice index out of range"
level=error ts=2018-05-23T16:53:55.236837682Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:43166" msg="reflect: slice index out of range"
level=error ts=2018-05-23T17:33:53.527636553Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:58206" msg="reflect: slice index out of range"

@dhbarman dhbarman changed the title Error observed in logs Errors observed in logs May 23, 2018

krasi-georgiev (Member) commented May 24, 2018

@simonpasquier do you have any ideas? I looked briefly but can't see what is causing this second issue.

simonpasquier (Member) commented May 24, 2018

It looks like a bug in a third-party package causing a panic which is then caught by the HTTP handler. @dhbarman, are you able to relate the error message to a specific URL?

dhbarman (Author) commented May 24, 2018

@simonpasquier Which third-party package might be causing this panic? I am not able to relate it to a specific URL. Is it possible to log that URL?

krasi-georgiev (Member) commented May 24, 2018

It happens here:

if err := a.Logger.Log(keyvals...); err != nil {
        return 0, err
}

krasi-georgiev (Member) commented May 24, 2018

What I find strange is why the panic shows random ports:

panic serving 127.0.0.1:58206....

dhbarman (Author) commented May 25, 2018

Is it possible to improve the logging when err != nil so that more information can be captured?
I tried to find out which queries could be causing such errors, although the errors are not seen all the time. One example query is:
http://:8080/prometheus/api/v1/query?query=ifInErrors[300s]

simonpasquier (Member) commented May 25, 2018

I did a bit of research and found go-kit/kit#233. Basically the go-kit log package mangles the log message and stack trace when something panics in an HTTP handler.

I've hacked a change that catches the panic, logs the error and stack trace, and then re-raises the panic. @dhbarman, can you build it and try it in your environment? It is at https://github.com/simonpasquier/prometheus/tree/log-stack-trace. If you prefer, I can build the binary.

dhbarman (Author) commented May 30, 2018

@simonpasquier You may build the binary. Last time I tried to build Prometheus from source, I ran into many dependency issues.


dhbarman (Author) commented Jun 5, 2018

Could see the panic with the instrumented image:

level=info ts=2018-06-02T13:00:19.341638989Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=4.723177095s
level=error ts=2018-06-02T14:52:04.887413762Z caller=web.go:79 component=web msg="panic while serving request" client=127.0.0.1:59069 url="/prometheus/api/v1/query?query=ifInLinkUtilizationPercentage%5B300s%5D" err="reflect: slice inde
x out of range" stack="goroutine 1358725 [running]:\ngithub.com/prometheus/prometheus/web.logStackTrace.func1.1(0x1c36f40, 0xc4203f2240, 0xc4a16edc00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:78 +0xbc\np
anic(0x1744aa0, 0xc46e0e24b0)\n\t/home/simon/go/src/runtime/panic.go:502 +0x229\nencoding/json.(*encodeState).marshal.func1(0xc4ac969680)\n\t/home/simon/go/src/encoding/json/encode.go:293 +0x124\npanic(0x1744aa0, 0x1c348b0)\n\t/home/si
mon/go/src/runtime/panic.go:502 +0x229\nreflect.Value.Index(0x1717720, 0xc4bb135c38, 0x197, 0x3, 0x1920100, 0xc4afa00730, 0x199)\n\t/home/simon/go/src/reflect/value.go:894 +0x201\nencoding/json.(*arrayEncoder).encode(0xc44f810250, 0xc4
626ee0b0, 0x1717720, 0xc4bb135c38, 0x197, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:761 +0x90\nencoding/json.(*arrayEncoder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x1717720, 0xc4bb135c38, 0x197, 0xc4626e0100)\n\t/home/simo
n/go/src/encoding/json/encode.go:768 +0x64\nencoding/json.(*sliceEncoder).encode(0xc44f810260, 0xc4626ee0b0, 0x1717720, 0xc4bb135c38, 0x197, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:735 +0xc1\nencoding/json.(*sliceEncoder).
(encoding/json.encode)-fm(0xc4626ee0b0, 0x1717720, 0xc4bb135c38, 0x197, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:747 +0x64\nencoding/json.(*structEncoder).encode(0xc456dbf7a0, 0xc4626ee0b0, 0x19007e0, 0xc4bb135c20, 0x199, 0
x100)\n\t/home/simon/go/src/encoding/json/encode.go:639 +0x255\nencoding/json.(*structEncoder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x19007e0, 0xc4bb135c20, 0x199, 0x1900100)\n\t/home/simon/go/src/encoding/json/encode.go:653 +0x64\n
encoding/json.(*arrayEncoder).encode(0xc44f810270, 0xc4626ee0b0, 0x18f4c80, 0xc4ac92c160, 0x97, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:761 +0xdc\nencoding/json.(*arrayEncoder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x18f
4c80, 0xc4ac92c160, 0x97, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:768 +0x64\nencoding/json.(*sliceEncoder).encode(0xc44f810280, 0xc4626ee0b0, 0x18f4c80, 0xc4ac92c160, 0x97, 0x18f0100)\n\t/home/simon/go/src/encoding/json/en
code.go:735 +0xc1\nencoding/json.(*sliceEncoder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x18f4c80, 0xc4ac92c160, 0x97, 0xc4ac920100)\n\t/home/simon/go/src/encoding/json/encode.go:747 +0x64\nencoding/json.(*encodeState).reflectValue(0x
c4626ee0b0, 0x18f4c80, 0xc4ac92c160, 0x97, 0xc4ac920100)\n\t/home/simon/go/src/encoding/json/encode.go:325 +0x82\nencoding/json.interfaceEncoder(0xc4626ee0b0, 0x187b3c0, 0xc4878777e0, 0x194, 0x100)\n\t/home/simon/go/src/encoding/json/e
ncode.go:611 +0xdb\nencoding/json.(*structEncoder).encode(0xc49db48750, 0xc4626ee0b0, 0x19240c0, 0xc4878777d0, 0x199, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:639 +0x255\nencoding/json.(*structEncoder).(encoding/json.encode
)-fm(0xc4626ee0b0, 0x19240c0, 0xc4878777d0, 0x199, 0xc487870100)\n\t/home/simon/go/src/encoding/json/encode.go:653 +0x64\nencoding/json.(*ptrEncoder).encode(0xc44f998b58, 0xc4626ee0b0, 0x16eebc0, 0xc4878777d0, 0x16, 0x16e0100)\n\t/home
/simon/go/src/encoding/json/encode.go:780 +0xe3\nencoding/json.(*ptrEncoder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x16eebc0, 0xc4878777d0, 0x16, 0xc487870100)\n\t/home/simon/go/src/encoding/json/encode.go:785 +0x64\nencoding/json.(*
encodeState).reflectValue(0xc4626ee0b0, 0x16eebc0, 0xc4878777d0, 0x16, 0xc487870100)\n\t/home/simon/go/src/encoding/json/encode.go:325 +0x82\nencoding/json.interfaceEncoder(0xc4626ee0b0, 0x17f2d20, 0xc4ac92a890, 0x194, 0x100)\n\t/home/
simon/go/src/encoding/json/encode.go:611 +0xdb\nencoding/json.(*structEncoder).encode(0xc49db48690, 0xc4626ee0b0, 0x196ad60, 0xc4ac92a880, 0x199, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:639 +0x255\nencoding/json.(*structEn
coder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x196ad60, 0xc4ac92a880, 0x199, 0xc4ac920100)\n\t/home/simon/go/src/encoding/json/encode.go:653 +0x64\nencoding/json.(*ptrEncoder).encode(0xc44f998a28, 0xc4626ee0b0, 0x16eec00, 0xc4ac92a88
0, 0x16, 0x16e0100)\n\t/home/simon/go/src/encoding/json/encode.go:780 +0xe3\nencoding/json.(*ptrEncoder).(encoding/json.encode)-fm(0xc4626ee0b0, 0x16eec00, 0xc4ac92a880, 0x16, 0xc4ac920100)\n\t/home/simon/go/src/encoding/json/encode.go
:785 +0x64\nencoding/json.(*encodeState).reflectValue(0xc4626ee0b0, 0x16eec00, 0xc4ac92a880, 0x16, 0x100)\n\t/home/simon/go/src/encoding/json/encode.go:325 +0x82\nencoding/json.(*encodeState).marshal(0xc4626ee0b0, 0x16eec00, 0xc4ac92a8
80, 0x100, 0x0, 0x0)\n\t/home/simon/go/src/encoding/json/encode.go:298 +0xa5\nencoding/json.Marshal(0x16eec00, 0xc4ac92a880, 0xc, 0x1aea926, 0x10, 0xc4878777d0, 0x0)\n\t/home/simon/go/src/encoding/json/encode.go:161 +0x5f\ngithub.com/p
rometheus/prometheus/web/api/v1.respond(0x1c45b00, 0xc467b58c00, 0x16eebc0, 0xc4878777d0)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/web/api/v1/api.go:758 +0xf4\ngithub.com/prometheus/prometheus/web/api/v1.(*API).Reg
ister.func1.1(0x1c45b00, 0xc467b58c00, 0xc4a16edf00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/web/api/v1/api.go:162 +0x90\nnet/http.HandlerFunc.ServeHTTP(0xc4206a2440, 0x1c45b00, 0xc467b58c00, 0xc4a16edf00)\n\t/hom
e/simon/go/src/net/http/server.go:1947 +0x44\ngithub.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP(0x1c3b640, 0xc4206a2440, 0x1c47340, 0xc497ebc960, 0xc4a16edf00)\n\t/home/simon/workspace/src/github.com/prometheu
s/prometheus/util/httputil/compression.go:90 +0x7c\nnet/http.(Handler).ServeHTTP-fm(0x1c47340, 0xc497ebc960, 0xc4a16edf00)\n\t/home/simon/go/src/net/http/h2_bundle.go:5475 +0x4d\ngithub.com/prometheus/prometheus/vendor/github.com/prome
theus/client_golang/prometheus.InstrumentHandlerFuncWithOpts.func1(0x1c47280, 0xc467b58bc0, 0xc4a16edf00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/http.go:309 +
0x229\ngithub.com/prometheus/prometheus/web.(*Handler).testReady.func1(0x1c47280, 0xc467b58bc0, 0xc4a16edf00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:373 +0x55\ngithub.com/prometheus/prometheus/vendor/g
ithub.com/prometheus/common/route.(*Router).handle.func1(0x1c47280, 0xc467b58bc0, 0xc4a16ede00, 0x0, 0x0, 0x0)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route/route.go:50 +0x26a\n
github.com/prometheus/prometheus/vendor/github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc4200daf80, 0x1c47280, 0xc467b58bc0, 0xc4a16ede00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/vendor/github.com/julien
schmidt/httprouter/router.go:299 +0x6d1\ngithub.com/prometheus/prometheus/vendor/github.com/prometheus/common/route.(*Router).ServeHTTP(0xc4202e5360, 0x1c47280, 0xc467b58bc0, 0xc4a16ede00)\n\t/home/simon/workspace/src/github.com/promet
heus/prometheus/vendor/github.com/prometheus/common/route/route.go:88 +0x4c\nnet/http.StripPrefix.func1(0x1c47280, 0xc467b58bc0, 0xc4a16edd00)\n\t/home/simon/go/src/net/http/server.go:1986 +0x19a\nnet/http.HandlerFunc.ServeHTTP(0xc4203
411d0, 0x1c47280, 0xc467b58bc0, 0xc4a16edd00)\n\t/home/simon/go/src/net/http/server.go:1947 +0x44\nnet/http.(*ServeMux).ServeHTTP(0xc42021cc60, 0x1c47280, 0xc467b58bc0, 0xc4a16edd00)\n\t/home/simon/go/src/net/http/server.go:2337 +0x130
\ngithub.com/prometheus/prometheus/vendor/github.com/opentracing-contrib/go-stdlib/nethttp.Middleware.func2(0x1c4ebc0, 0xc4760da2a0, 0xc4a16edc00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/vendor/github.com/opentrac
ing-contrib/go-stdlib/nethttp/server.go:74 +0x3f3\nnet/http.HandlerFunc.ServeHTTP(0xc420341290, 0x1c4ebc0, 0xc4760da2a0, 0xc4a16edc00)\n\t/home/simon/go/src/net/http/server.go:1947 +0x44\ngithub.com/prometheus/prometheus/web.logStackTr
ace.func1(0x1c4ebc0, 0xc4760da2a0, 0xc4a16edc00)\n\t/home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:83 +0x9d\nnet/http.HandlerFunc.ServeHTTP(0xc4203412c0, 0x1c4ebc0, 0xc4760da2a0, 0xc4a16edc00)\n\t/home/simon/go/s
rc/net/http/server.go:1947 +0x44\nnet/http.serverHandler.ServeHTTP(0xc4200b8ea0, 0x1c4ebc0, 0xc4760da2a0, 0xc4a16edc00)\n\t/home/simon/go/src/net/http/server.go:2694 +0xbc\nnet/http.(*conn).serve(0xc459870aa0, 0x1c50a80, 0xc46a902880)\
n\t/home/simon/go/src/net/http/server.go:1830 +0x651\ncreated by net/http.(*Server).Serve\n\t/home/simon/go/src/net/http/server.go:2795 +0x27b\n"
level=error ts=2018-06-02T14:52:04.888371944Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:59069" msg="reflect: slice index out of range"
simonpasquier (Member) commented Jun 5, 2018

Interesting, this smells like a data race. I'm going to compile a Prometheus binary with race detection enabled. It would be super cool if you could give it a try!

simonpasquier (Member) commented Jun 5, 2018

@dhbarman can you try again with this new binary? Thanks for the testing!

dhbarman (Author) commented Jun 5, 2018

==================
WARNING: DATA RACE
Write at 0x00c462b4e000 by goroutine 270:
  github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:892 +0x9bf
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:738 +0x7ff
  github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:659 +0x5f
  github.com/prometheus/prometheus/promql.extrapolatedRate()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:54 +0x7d
  github.com/prometheus/prometheus/promql.funcIncrease()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:141 +0x71
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:735 +0x19b7
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:706 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:689 +0xb7
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:370 +0x1d39
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:327 +0x4f1
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:130 +0xdd
  github.com/prometheus/prometheus/rules.EngineQueryFunc.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:115 +0xf0
  github.com/prometheus/prometheus/rules.(*AlertingRule).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/alerting.go:177 +0x127
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:334 +0x1b5
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:402 +0xcf
  github.com/prometheus/prometheus/rules.(*Group).run.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:204 +0xce
  github.com/prometheus/prometheus/rules.(*Group).run()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:230 +0x30d
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1.1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:505 +0xa6

Previous read at 0x00c462b4e000 by goroutine 324:
  [failed to restore the stack]

Goroutine 270 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:500 +0x63

Goroutine 324 (running) created at:
  net/http.(*Server).Serve()
      /home/simon/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:471 +0x53
==================
==================
WARNING: DATA RACE
Write at 0x00c488886000 by goroutine 270:
  github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:892 +0x9bf
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:738 +0x7ff
  github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:659 +0x5f
  github.com/prometheus/prometheus/promql.extrapolatedRate()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:54 +0x7d
  github.com/prometheus/prometheus/promql.funcRate()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:136 +0x71
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:735 +0x19b7
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:706 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:706 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:689 +0xb7
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:370 +0x1d39
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:327 +0x4f1
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:130 +0xdd
  github.com/prometheus/prometheus/rules.EngineQueryFunc.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:115 +0xf0
  github.com/prometheus/prometheus/rules.(*AlertingRule).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/alerting.go:177 +0x127
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:334 +0x1b5
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:402 +0xcf
  github.com/prometheus/prometheus/rules.(*Group).run.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:204 +0xce
  github.com/prometheus/prometheus/rules.(*Group).run()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:230 +0x30d
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1.1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:505 +0xa6

Previous read at 0x00c488886000 by goroutine 290:
  [failed to restore the stack]

Goroutine 270 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:500 +0x63

Goroutine 290 (running) created at:
  net/http.(*Server).Serve()
      /home/simon/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:471 +0x53
==================

==================
==================
WARNING: DATA RACE
Write at 0x00c49f226000 by goroutine 273:
  github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:892 +0x9bf
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:738 +0x7ff
  github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:659 +0x5f
  github.com/prometheus/prometheus/promql.aggrOverTime()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:368 +0x8e
  github.com/prometheus/prometheus/promql.funcMaxOverTime()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:416 +0x66
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:735 +0x19b7
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:706 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:707 +0x9f5
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:689 +0xb7
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:370 +0x1d39
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:327 +0x4f1
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:130 +0xdd
  github.com/prometheus/prometheus/rules.EngineQueryFunc.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:115 +0xf0
  github.com/prometheus/prometheus/rules.(*RecordingRule).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/recording.go:57 +0xf4
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:334 +0x1b5
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:402 +0xcf
  github.com/prometheus/prometheus/rules.(*Group).run.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:204 +0xce
  github.com/prometheus/prometheus/rules.(*Group).run()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:230 +0x30d
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1.1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:505 +0xa6

Previous read at 0x00c49f226000 by goroutine 475:
  [failed to restore the stack]

Goroutine 273 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:500 +0x63


Goroutine 475 (finished) created at:
  net/http.(*Server).Serve()
      /home/simon/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:471 +0x53
==================
==================
WARNING: DATA RACE
Write at 0x00c49cfbbfe0 by goroutine 273:
  github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:878 +0x5f3
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:738 +0x7ff
  github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:659 +0x5f
  github.com/prometheus/prometheus/promql.aggrOverTime()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:368 +0x8e
  github.com/prometheus/prometheus/promql.funcMaxOverTime()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/functions.go:416 +0x66
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:735 +0x19b7
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:706 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:679 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:707 +0x9f5
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:689 +0xb7
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:370 +0x1d39
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:327 +0x4f1
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/simon/workspace/src/github.com/prometheus/prometheus/promql/engine.go:130 +0xdd
  github.com/prometheus/prometheus/rules.EngineQueryFunc.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:115 +0xf0
  github.com/prometheus/prometheus/rules.(*RecordingRule).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/recording.go:57 +0xf4
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:334 +0x1b5
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:402 +0xcf
  github.com/prometheus/prometheus/rules.(*Group).run.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:204 +0xce
  github.com/prometheus/prometheus/rules.(*Group).run()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:230 +0x30d
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1.1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:505 +0xa6

Previous read at 0x00c49cfbbfe0 by goroutine 475:
  [failed to restore the stack]


Goroutine 273 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /home/simon/workspace/src/github.com/prometheus/prometheus/rules/manager.go:500 +0x63

Goroutine 475 (finished) created at:
  net/http.(*Server).Serve()
      /home/simon/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/simon/workspace/src/github.com/prometheus/prometheus/web/web.go:471 +0x53
==================
krasi-georgiev (Member) commented Jun 5, 2018

Looks like we are getting somewhere.
@dhbarman, did you manage to come up with an easy way to replicate this, so we don't have to ask you to test every change?

The race looks to be somewhere in the promql package.
The promql package recently had a big refactoring, so it may be worth testing the latest master to see if it still replicates.

dhbarman (Author) commented Jun 5, 2018

@krasi-georgiev I do not have an easy way to replicate the issue. I deploy the binary on test servers and wait for the issue to recur; sometimes it happens immediately, sometimes it takes more time. Let me know when you have a new binary from the latest master.

krasi-georgiev (Member) commented Jun 5, 2018

dhbarman (Author) commented Jun 6, 2018

WARNING: DATA RACE
Write at 0x00c478d3a000 by goroutine 214:
  github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:902 +0x9bf
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:748 +0x7ff
  github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:669 +0x5f
  github.com/prometheus/prometheus/promql.extrapolatedRate()
      /home/krasi/src/github.com/prometheus/prometheus/promql/functions.go:54 +0x7d
  github.com/prometheus/prometheus/promql.funcRate()
      /home/krasi/src/github.com/prometheus/prometheus/promql/functions.go:136 +0x71
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:745 +0x19b7
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:689 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:716 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).evalOneOf()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:689 +0x6a
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:716 +0x971
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:699 +0xb7
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:379 +0x1d39
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:336 +0x5a7
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:131 +0xb9
  github.com/prometheus/prometheus/rules.EngineQueryFunc.func1()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:116 +0xf0
  github.com/prometheus/prometheus/rules.(*AlertingRule).Eval()
      /home/krasi/src/github.com/prometheus/prometheus/rules/alerting.go:177 +0x127
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:338 +0x2eb
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:406 +0xcf
  github.com/prometheus/prometheus/rules.(*Group).run.func1()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:206 +0x104
  github.com/prometheus/prometheus/rules.(*Group).run()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:235 +0x413
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1.1()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:509 +0xa6

Previous read at 0x00c478d3a000 by goroutine 75:
  [failed to restore the stack]

Goroutine 214 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /home/krasi/src/github.com/prometheus/prometheus/rules/manager.go:504 +0x63

Goroutine 75 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/krasi/src/github.com/prometheus/prometheus/web/web.go:502 +0x4c
==================

==================
WARNING: DATA RACE
Write at 0x00c4708e6000 by goroutine 304:
  github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:902 +0x9bf
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:748 +0x7ff
  github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:669 +0x5f
  github.com/prometheus/prometheus/promql.extrapolatedRate()
      /home/krasi/src/github.com/prometheus/prometheus/promql/functions.go:54 +0x7d
  github.com/prometheus/prometheus/promql.funcDelta()
      /home/krasi/src/github.com/prometheus/prometheus/promql/functions.go:131 +0x71
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:745 +0x19b7
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:699 +0xb7
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:379 +0x1d39
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:336 +0x5a7
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/krasi/src/github.com/prometheus/prometheus/promql/engine.go:131 +0xb9
  github.com/prometheus/prometheus/web/api/v1.(*API).query()
      /home/krasi/src/github.com/prometheus/prometheus/web/api/v1/api.go:236 +0x387
  github.com/prometheus/prometheus/web/api/v1.(*API).(github.com/prometheus/prometheus/web/api/v1.query)-fm()
      /home/krasi/src/github.com/prometheus/prometheus/web/api/v1/api.go:174 +0x4f
  github.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1()
      /home/krasi/src/github.com/prometheus/prometheus/web/api/v1/api.go:159 +0x6c
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:1947 +0x51
  github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP()
      /home/krasi/src/github.com/prometheus/prometheus/util/httputil/compression.go:90 +0x89
  github.com/prometheus/prometheus/util/httputil.(CompressionHandler).ServeHTTP-fm()
      /home/krasi/src/github.com/prometheus/prometheus/web/web.go:273 +0x8a
  github.com/prometheus/prometheus/web.(*Handler).testReady.func1()
      /home/krasi/src/github.com/prometheus/prometheus/web/web.go:401 +0x81
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:1947 +0x51
  github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1()
      /home/krasi/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:196 +0x112
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:1947 +0x51
  github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2()
      /home/krasi/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:76 +0xcf
  github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route.(*Router).handle.func1()
      /home/krasi/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route/route.go:60 +0x293
  github.com/prometheus/prometheus/vendor/github.com/julienschmidt/httprouter.(*Router).ServeHTTP()
      /home/krasi/src/github.com/prometheus/prometheus/vendor/github.com/julienschmidt/httprouter/router.go:299 +0xbad
  github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route.(*Router).ServeHTTP()
      /home/krasi/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route/route.go:98 +0x67
  net/http.StripPrefix.func1()
      /usr/local/go/src/net/http/server.go:1986 +0x31c
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:1947 +0x51
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/net/http/server.go:2337 +0x9f
  github.com/prometheus/prometheus/web.secureHeadersMiddleware.func1()
      /home/krasi/src/github.com/prometheus/prometheus/web/web.go:81 +0x1ac
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:1947 +0x51
  github.com/prometheus/prometheus/vendor/github.com/opentracing-contrib/go-stdlib/nethttp.Middleware.func2()

net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:1947 +0x51
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2694 +0xb9
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1830 +0x7dc
Previous read at 0x00c4708e6000 by goroutine 45:
  [failed to restore the stack]
Goroutine 304 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/krasi/src/github.com/prometheus/prometheus/web/web.go:502 +0x4c
Goroutine 45 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:2795 +0x364
  github.com/prometheus/prometheus/web.(*Handler).Run.func5()
      /home/krasi/src/github.com/prometheus/prometheus/web/web.go:502 +0x4c
==================
simonpasquier (Member) commented Jun 6, 2018

Although one stack trace is incomplete, it seems that the race is between the query handler and the rule evaluator.
@dhbarman can you share your rules and Prometheus configurations?

krasi-georgiev (Member) commented Jun 6, 2018

Hmm, I compiled it with GORACE=history_size=7 go build --race, but I am not sure whether that works only for go test and go run, or whether it affects the size constant size = 64 << 10.
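For what it's worth, GORACE is read by the race runtime when the instrumented binary starts, not at build time, so setting it on go build has no effect. Per the Go race detector documentation, history_size=7 raises the per-goroutine memory-access history from the default 32K*2^1 to 32K*2^7 events, which is exactly what avoids "[failed to restore the stack]". A sketch (repository paths are illustrative):

```shell
# GORACE has no effect at build time; instrument the binary once, then
# set the variable in the environment when launching it, e.g.:
#
#   go build -race ./cmd/prometheus
#   GORACE="history_size=7" ./prometheus --config.file=prometheus.yml
#
# GORACE itself is an ordinary environment variable read at process start:
export GORACE="history_size=7"
echo "GORACE=$GORACE"
```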

dhbarman (Author) commented Jun 7, 2018

global:
  scrape_interval: 60s  # How often to scrape each target
  scrape_timeout: 30s   # When to timeout if the target does not respond quick enough
  evaluation_interval: 1m  # How often to evaluate alarm and recording rules

rule_files:
  # List of locations to look for  alarm rules and recording rules
  - "/rules/specific-region-name/prometheus/alarms/*.yml"
  - "/rules/specific-region-name/prometheus/recordings/*.yml"

scrape_configs:
  # Job for scraping all prometheus servers, this allows us to cross
  # monitor/alarm on Prometheus health
  - job_name: prometheus
    scrape_interval: 30s
    metrics_path: /prometheus/metrics
    scheme: https
    tls_config:
      ca_file: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
      cert_file: /certs/cert_chain.pem
      key_file: /certs/key.pem
    file_sd_configs:
      - files:
        - /config/specific-region-name/prometheus-targets.yml


  # Job for scraping all AlertManager servers, this allows us to cross
  # monitor/alarm on Prometheus health
  - job_name: alertmanager
    scrape_interval: 30s
    metrics_path: /alertmanager/metrics
    scheme: https
    tls_config:
      ca_file: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
      cert_file: /certs/cert_chain.pem
      key_file: /certs/key.pem
    file_sd_configs:
      - files:
        - /config/specific-region-name/prometheus-targets.yml

# Job for scraping alertmanager webhook metrics
  - job_name: alertmanager_webhook
    scrape_interval: 30s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ["localhost:8091"]
    relabel_configs:
      # Relabel the instance name to be the same no matter the source of the
      # metric
      - target_label: instance
        replacement: local_alertmanager_webhook
        action: replace

 # Job for scraping server1
  - job_name: server1
    scrape_interval: 1200s
    metrics_path: /
    scheme: http
    static_configs:
      - targets: ["server_name:8095"]
    relabel_configs:
      # Relabel the instance name to be the same no matter the source of the
      # metric to account for redundancy model of the poller.
      # Also relabel the job name to maintain consistency after the change to
      # partition metrics
      - target_label: instance
        replacement: server1
        action: replace


# Jobs to scrape a specific partition from the application pollers
  # Poller hashes device name and puts the metric on a specific partition.  We
  # use 20 total_partitions for the hash ring and have a separate job for each
  # partition
  - job_name: application_poller_partition0
    metrics_path: /metrics
    params: # url parameters: partition and total partition
        partition: [0]
        total_partitions: [20]
    scheme: http
    file_sd_configs:
      - files:
        - /config/specific-region-name/poller-targets.yml
    relabel_configs:
       # Relabel the instance name to be the same no matter the source of the
       # metric to account for redundancy model of the poller.
       # Also relabel the job name to maintain consistency after the change to
       # partition metrics
        - target_label: instance
          replacement: application_poller
          action: replace
        - target_label: job
          replacement: application_poller
          action: replace


  - job_name: application_poller_partition1
    metrics_path: /metrics
    params:
        partition: [1]
        total_partitions: [20]
    scheme: http
    file_sd_configs:
      - files:
        - /config/specific-region-name/poller-targets.yml
    relabel_configs:
        - target_label: instance
          replacement: application_poller
          action: replace
        - target_label: job
          replacement: application_poller
          action: replace

  - job_name: application_poller_partition2
    metrics_path: /metrics
    params:
        partition: [2]
        total_partitions: [20]
    scheme: http
    file_sd_configs:
      - files:
        - /config/specific-region-name/poller-targets.yml
    relabel_configs:
        - target_label: instance
          replacement: application_poller
          action: replace
        - target_label: job
          replacement: application_poller
          action: replace
  # (jobs for partitions 3 through 17 omitted; same pattern)

  - job_name: application_poller_partition18
    metrics_path: /metrics
    params:
        partition: [18]
        total_partitions: [20]
    scheme: http
    file_sd_configs:
      - files:
        - /config/specific-region-name/poller-targets.yml
    relabel_configs:
        - target_label: instance
          replacement: application_poller
          action: replace
        - target_label: job
          replacement: application_poller
          action: replace

  - job_name: application_poller_partition19
    metrics_path: /metrics
    params:
        partition: [19]
        total_partitions: [20]
    scheme: http
    file_sd_configs:
      - files:
        - /config/specific-region-name/poller-targets.yml
    relabel_configs:
        - target_label: instance
          replacement: application_poller
          action: replace
        - target_label: job
          replacement: application_poller
          action: replace

alerting:
  alertmanagers:
    - scheme: https
      timeout: 30s
      path_prefix: /alertmanager/
      tls_config:
        ca_file: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
        cert_file: /certs/cert_chain.pem
        key_file: /certs/key.pem
      file_sd_configs:
        - files:
          - /config/specific-region-name/prometheus-targets.yml
simonpasquier (Member) commented Jun 8, 2018

Thanks @dhbarman. Given the stack traces, I suspect that too many queries may be running at the same time. Can you post the following metrics?

  • prometheus_engine_queries
  • prometheus_engine_queries_concurrent_max
  • sum(rate(prometheus_engine_query_duration_seconds_sum[5m]))
  • sum(rate(prometheus_engine_query_duration_seconds_count[5m]))

Regarding your configuration, I notice that you partition the scrapes across several jobs. Is the reason that a single scrape takes too much time to complete?
As you rewrite the job and instance labels to be the same for all scrape configurations, are you sure that other labels still differentiate the time series?

brian-brazil (Member) commented Jun 8, 2018

No matter how many queries are running at once, we shouldn't be having races.

krasi-georgiev (Member) commented Jun 8, 2018

was just in the middle of writing this 😜

@simonpasquier let me know if you find out how to replicate and I will be interested to look into this as well. Races are interesting 🤔

brian-brazil (Member) commented Jun 8, 2018

Ah, I know what this is. This was incorrect management of the matrices that are reused across queries. I noticed this and removed that code as part of #3966. It is fixed in 2.3.0.

simonpasquier (Member) commented Jun 8, 2018

No matter how many queries are running at once, we shouldn't be having races.

Sure, I was just wondering why it happens with @dhbarman's setup and hasn't been reported by other users.

krasi-georgiev (Member) commented Jun 8, 2018

Hmm 🤔, the last binary I attached should have already included the promql refactoring PR.

I have attached a new binary, just in case, for another test.

prometheus.zip

krasi-georgiev (Member) commented Jun 8, 2018

@dhbarman not sure if it will help with the "failed to restore the stack" error, but can you try running it with GORACE=history_size=7 ./prometheus?

brian-brazil (Member) commented Jun 8, 2018

@krasi-georgiev It can't have; the stack trace mentions functions that were removed in my PR.

dhbarman (Author) commented Jun 12, 2018

I have seen the following errors so far:

level=info ts=2018-06-12T05:31:52.274877181Z caller=main.go:603 msg="Loading configuration file" filename=/config/<region-host>/prometheus.yml
level=info ts=2018-06-12T05:31:52.588785058Z caller=main.go:500 msg="Server is ready to receive web requests."
level=warn ts=2018-06-12T05:36:02.003899701Z caller=scrape.go:802 component="scrape manager" scrape_pool=<app> target=http://<hostname>:8095/ msg="append failed" err="expected text in HELP, got \"INVALID\""
level=warn ts=2018-06-12T05:56:02.004419649Z caller=scrape.go:802 component="scrape manager" scrape_pool=<app> target=http://<hostname>:8095/ msg="append failed" err="expected text in HELP, got \"INVALID\""
level=warn ts=2018-06-12T06:16:02.202608886Z caller=scrape.go:802 component="scrape manager" scrape_pool=<app> target=http://<hostname>:8095/ msg="append failed" err="expected text in HELP, got \"INVALID\""
level=warn ts=2018-06-12T06:36:01.987557506Z caller=scrape.go:802 component="scrape manager" scrape_pool=<app> target=http://<hostname>:8095/ msg="append failed" err="expected text in HELP, got \"INVALID\""
level=warn ts=2018-06-12T06:56:02.003599422Z caller=scrape.go:802 component="scrape manager" scrape_pool=<app> target=http://<hostname>:8095/ msg="append failed" err="expected text in HELP, got \"INVALID\""
level=info ts=2018-06-12T07:01:02.228744878Z caller=compact.go:387 component=tsdb msg="write block" mint=1528776000000 maxt=1528783200000 ulid=01CFSCPP6WEGYBX3RWGGWRYPV7
level=info ts=2018-06-12T07:01:07.989694212Z caller=head.go:348 component=tsdb msg="head GC completed" duration=4.441075873s
level=info ts=2018-06-12T07:01:22.579550843Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=14.589585177s
brian-brazil (Member) commented Jun 12, 2018

The initial report here is fixed. Please open a new issue for anything else so that we can keep things organised.

krasi-georgiev (Member) commented Jun 12, 2018

@brian-brazil the PR to fix the original issue hasn't been merged yet:
#4178

@dhbarman thanks for testing it again. It seems all the races are fixed.
Regarding the last error, it seems that the target you are scraping doesn't output the correct format. If you are convinced that this is another bug, please open a new issue with details on how to replicate it. Thanks.
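For context, the scrape warning above ('expected text in HELP, got "INVALID"') means the target's metrics payload is not valid Prometheus text exposition format: each # HELP line must be the metric name followed by free-text help. A well-formed exposition looks like the following (the metric name and labels here are illustrative, not from the reporter's target):

```text
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
```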

brian-brazil (Member) commented Jun 12, 2018

Oh sorry, confused this with a different bug.

brian-brazil reopened this Jun 12, 2018

brian-brazil (Member) commented Jun 12, 2018

Wait, actually I didn't. There are now three distinct issues in this bug; this is confusing.

krasi-georgiev (Member) commented Jun 12, 2018

Yeah, we thought it was all part of the same bug, but it seems it is all good now.

brian-brazil (Member) commented Jun 12, 2018

Okay, now both of those are fixed.

@dhbarman Can you file the HELP log as a new bug please?

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
