Investigate spurious target-manager exit after startup #374

Closed
juliusv opened this issue Jan 23, 2014 · 3 comments
juliusv commented Jan 23, 2014

Prometheus was restarted and stopped scraping soon after startup. The relevant log section said:

2014-01-23_14:46:24.30597 I0123 14:46:24.305391 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32035 I0123 14:46:24.317359 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32093 I0123 14:46:24.317389 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32147 I0123 14:46:24.317416 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32197 I0123 14:46:24.317436 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32205 I0123 14:46:24.317463 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32268 I0123 14:46:24.317496 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32499 I0123 14:46:24.319706 15094 web.go:80] listening on :9090
2014-01-23_14:49:49.43797 I0123 14:49:49.437400 15094 query.go:133] Range query: haproxy_current_sessions{service="XXXXX"}
2014-01-23_14:49:49.68786 Query stats:
2014-01-23_14:49:49.68788 Total eval time: 2.795053ms
2014-01-23_14:49:49.68789 Total view building time: 2.401951ms
2014-01-23_14:49:49.68789 Query analysis time: 2.323989ms
2014-01-23_14:49:49.68790 View request building time: 2.224us
2014-01-23_14:49:49.68791 Inner view building time: 66.164us
2014-01-23_14:49:49.68792 View queue wait time: 23.022us
2014-01-23_14:49:49.68793 View building job scanning time: 2.456us
2014-01-23_14:49:49.68794 Total view data extraction time: 959ns
2014-01-23_14:49:49.68795 Inner eval time: 381.827us
2014-01-23_14:49:49.68796 GetValueAtTime() time: 173.564us
2014-01-23_14:49:49.68796 Result append time: 840ns
2014-01-23_14:49:49.68797 Result sorting time: 2.158us
2014-01-23_14:49:49.68798 JSON encoding time: 61.154152ms
2014-01-23_14:49:49.68799
2014-01-23_14:49:55.09421 I0123 14:49:55.094138 15094 targetmanager.go:120] Target manager exiting...
2014-01-23_14:49:55.10095 I0123 14:49:55.100903 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73047 I0123 14:49:55.730072 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73205 I0123 14:49:55.731796 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73238 I0123 14:49:55.732089 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73269 I0123 14:49:55.732416 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:56.44305 I0123 14:49:56.442549 15094 query.go:133] Range query: haproxy_current_sessions{service="XXXXXXXX"}
2014-01-23_14:49:56.44460 Query stats:
2014-01-23_14:49:56.44461 Total eval time: 13.33909ms
2014-01-23_14:49:56.44462 Total view building time: 12.900073ms
2014-01-23_14:49:56.44463 Query analysis time: 12.849138ms
2014-01-23_14:49:56.44463 View request building time: 1.795us
2014-01-23_14:49:56.44464 Inner view building time: 41.516us
2014-01-23_14:49:56.44465 View queue wait time: 9.074us
2014-01-23_14:49:56.44465 View building job scanning time: 1.818us
2014-01-23_14:49:56.44466 Total view data extraction time: 742ns
2014-01-23_14:49:56.44467 Inner eval time: 427.893us
2014-01-23_14:49:56.44467 GetValueAtTime() time: 187.04us
2014-01-23_14:49:56.44468 Result append time: 837ns
2014-01-23_14:49:56.44516 Result sorting time: 1.382us
2014-01-23_14:49:56.44518 JSON encoding time: 16.822us

After that, the web interface still worked, but since the scrapers (target pools) were down, it returned no fresh data for any query.

Possibly (and maybe due to the long shutdown time beforehand) Prometheus received another SIGTERM/SIGINT from its supervisor. But then why did it not shut down completely? The same thing happened to a group of Prometheus instances at the same time, when each of them got a new target configured. After another restart, everything is working again.

ghost assigned juliusv Jan 23, 2014
juliusv commented Jan 24, 2014

Found out that this was due to a panic, which triggered the deferred p.close() in main. The close got stuck somehow, so we never got to see the panic output:

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/prometheus/client_golang/model.(*Fingerprint).LoadFromMetric(0xc20afc21e0, 0xc210b51940)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/client_golang/model/fingerprinting.go:135 +0x5bb
github.com/prometheus/prometheus/storage/metric.(*memorySeriesStorage).AppendSample(0xc20028b680, 0xc210b53320, 0x0, 0x0)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/storage/metric/memory.go:216 +0x9d
github.com/prometheus/prometheus/storage/metric.(*memorySeriesStorage).AppendSamples(0xc20028b680, 0xc217edb940, 0x2, 0x2, 0x7f32e574cc18, ...)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/storage/metric/memory.go:205 +0x5f
github.com/prometheus/prometheus/storage/metric.(*TieredStorage).AppendSamples(0xc200284120, 0xc217edb940, 0x2, 0x2, 0x0, ...)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/storage/metric/tiered.go:171 +0xd4
main.main()
    /home/julius/prom/prometheus/main.go:391 +0x1112
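
For illustration, here is a minimal, self-contained Go sketch (not the actual Prometheus code) of how a blocking deferred cleanup can hide a panic: the runtime only prints the panic after all deferred functions in main have returned, so a close() that never returns leaves the process hanging without ever emitting the panic message. The goroutine stands in for the web handler that kept the runtime from reporting a deadlock.

package main

import "time"

func main() {
    // Stand-in for the web handler goroutine that keeps running, so the
    // runtime does not report a global deadlock once main blocks below.
    go func() {
        for {
            time.Sleep(time.Second)
        }
    }()

    // Stand-in for the stuck deferred p.close(): it blocks forever on a
    // channel that nothing ever sends to, so the panic below is never
    // printed and the process just hangs.
    defer func() {
        <-make(chan struct{})
    }()

    var s []int
    _ = s[1:2] // panic: runtime error: slice bounds out of range
}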

juliusv commented Jan 24, 2014

This was triggered by two bugs:

  1. The fingerprinting code (LoadFromMetric) crashes for a metric with an empty label value.
  2. When doing sum(foo) by (non_existent_label), non_existent_label is propagated into the result with an empty label value whenever the label doesn't exist in the time series being aggregated. Not sure if this is a bug or desired behavior, but in any case, it currently triggers bug 1 (see the sketch below).
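
As a rough sketch of one possible guard for bug 1 (using hypothetical stand-in types, not the real client_golang model package or the actual fix), dropping empty-valued labels before fingerprinting would make the empty label propagated by sum(foo) by (non_existent_label) harmless:

package main

import "fmt"

// Hypothetical stand-ins for the client_golang model types; this is only a
// sketch of the idea, not the change that went into Prometheus.
type (
    LabelName  string
    LabelValue string
    Metric     map[LabelName]LabelValue
)

// withoutEmptyLabels returns a copy of m with empty-valued labels removed,
// so a label propagated by an aggregation over a non-existent label never
// reaches the fingerprinting code with an empty value.
func withoutEmptyLabels(m Metric) Metric {
    out := make(Metric, len(m))
    for name, value := range m {
        if value == "" {
            continue // treat an empty-valued label like a missing one
        }
        out[name] = value
    }
    return out
}

func main() {
    m := Metric{"__name__": "foo", "non_existent_label": ""}
    fmt.Println(withoutEmptyLabels(m)) // map[__name__:foo]
}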

juliusv added a commit that referenced this issue Jan 24, 2014
This fixes bug 2. of #374

Change-Id: Ia4a13153616bafce5bf10597966b071434422d09
juliusv closed this as completed Mar 9, 2014
simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017
Add cachebusting fingerprints to asset files.
lock bot commented Mar 25, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 25, 2019