Investigate spurious target-manager exit after startup #374

Closed
juliusv opened this issue Jan 23, 2014 · 3 comments
juliusv commented Jan 23, 2014

Prometheus was restarted and stopped scraping soon after startup. The relevant log section said:

2014-01-23_14:46:24.30597 I0123 14:46:24.305391 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32035 I0123 14:46:24.317359 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32093 I0123 14:46:24.317389 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32147 I0123 14:46:24.317416 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32197 I0123 14:46:24.317436 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32205 I0123 14:46:24.317463 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32268 I0123 14:46:24.317496 15094 targetmanager.go:68] Pool for job XXXXXXX does not exist; creating and starting...
2014-01-23_14:46:24.32499 I0123 14:46:24.319706 15094 web.go:80] listening on :9090
2014-01-23_14:49:49.43797 I0123 14:49:49.437400 15094 query.go:133] Range query: haproxy_current_sessions{service="XXXXX"}
2014-01-23_14:49:49.68786 Query stats:
2014-01-23_14:49:49.68788 Total eval time: 2.795053ms
2014-01-23_14:49:49.68789 Total view building time: 2.401951ms
2014-01-23_14:49:49.68789 Query analysis time: 2.323989ms
2014-01-23_14:49:49.68790 View request building time: 2.224us
2014-01-23_14:49:49.68791 Inner view building time: 66.164us
2014-01-23_14:49:49.68792 View queue wait time: 23.022us
2014-01-23_14:49:49.68793 View building job scanning time: 2.456us
2014-01-23_14:49:49.68794 Total view data extraction time: 959ns
2014-01-23_14:49:49.68795 Inner eval time: 381.827us
2014-01-23_14:49:49.68796 GetValueAtTime() time: 173.564us
2014-01-23_14:49:49.68796 Result append time: 840ns
2014-01-23_14:49:49.68797 Result sorting time: 2.158us
2014-01-23_14:49:49.68798 JSON encoding time: 61.154152ms
2014-01-23_14:49:49.68799
2014-01-23_14:49:55.09421 I0123 14:49:55.094138 15094 targetmanager.go:120] Target manager exiting...
2014-01-23_14:49:55.10095 I0123 14:49:55.100903 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73047 I0123 14:49:55.730072 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73205 I0123 14:49:55.731796 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73238 I0123 14:49:55.732089 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:55.73269 I0123 14:49:55.732416 15094 targetpool.go:67] TargetPool exiting...
2014-01-23_14:49:56.44305 I0123 14:49:56.442549 15094 query.go:133] Range query: haproxy_current_sessions{service="XXXXXXXX"}
2014-01-23_14:49:56.44460 Query stats:
2014-01-23_14:49:56.44461 Total eval time: 13.33909ms
2014-01-23_14:49:56.44462 Total view building time: 12.900073ms
2014-01-23_14:49:56.44463 Query analysis time: 12.849138ms
2014-01-23_14:49:56.44463 View request building time: 1.795us
2014-01-23_14:49:56.44464 Inner view building time: 41.516us
2014-01-23_14:49:56.44465 View queue wait time: 9.074us
2014-01-23_14:49:56.44465 View building job scanning time: 1.818us
2014-01-23_14:49:56.44466 Total view data extraction time: 742ns
2014-01-23_14:49:56.44467 Inner eval time: 427.893us
2014-01-23_14:49:56.44467 GetValueAtTime() time: 187.04us
2014-01-23_14:49:56.44468 Result append time: 837ns
2014-01-23_14:49:56.44516 Result sorting time: 1.382us
2014-01-23_14:49:56.44518 JSON encoding time: 16.822us

After that, the web interface still worked, but since the scrapers (target pools) were down, it returned no fresh data for any query.

Possibly (and maybe due to the long shutdown time beforehand) Prometheus received another SIGTERM/SIGINT from its supervisor. But then why did it not shut down completely? The same thing happened to a group of Prometheus instances at the same time, when each of them got a new target configured. After another restart, everything is working again.

ghost assigned juliusv Jan 23, 2014
juliusv commented Jan 24, 2014

Found out that this was due to a panic, which triggered the deferred p.close() in main. The close got stuck somehow, so we never got to see the panic output:

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/prometheus/client_golang/model.(*Fingerprint).LoadFromMetric(0xc20afc21e0, 0xc210b51940)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/client_golang/model/fingerprinting.go:135 +0x5bb
github.com/prometheus/prometheus/storage/metric.(*memorySeriesStorage).AppendSample(0xc20028b680, 0xc210b53320, 0x0, 0x0)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/storage/metric/memory.go:216 +0x9d
github.com/prometheus/prometheus/storage/metric.(*memorySeriesStorage).AppendSamples(0xc20028b680, 0xc217edb940, 0x2, 0x2, 0x7f32e574cc18, ...)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/storage/metric/memory.go:205 +0x5f
github.com/prometheus/prometheus/storage/metric.(*TieredStorage).AppendSamples(0xc200284120, 0xc217edb940, 0x2, 0x2, 0x0, ...)
    /home/julius/prom/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/storage/metric/tiered.go:171 +0xd4
main.main()
    /home/julius/prom/prometheus/main.go:391 +0x1112
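
For illustration, here is a minimal, self-contained Go sketch (not the actual Prometheus code) of how a blocking deferred cleanup can hide a panic: the runtime only prints the panic after all deferred functions in main have returned, so a close() that never returns leaves the process hanging without ever emitting the panic message. The goroutine stands in for the web handler that kept the runtime from reporting a deadlock.

package main

import "time"

func main() {
    // Stand-in for the web handler goroutine that keeps running, so the
    // runtime does not report a global deadlock once main blocks below.
    go func() {
        for {
            time.Sleep(time.Second)
        }
    }()

    // Stand-in for the stuck deferred p.close(): it blocks forever on a
    // channel that nothing ever sends to, so the panic below is never
    // printed and the process just hangs.
    defer func() {
        <-make(chan struct{})
    }()

    var s []int
    _ = s[1:2] // panic: runtime error: slice bounds out of range
}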

juliusv commented Jan 24, 2014

This was triggered by two bugs:

  1. The fingerprinting code (LoadFromMetric) crashes for a metric with an empty label value.
  2. When doing sum(foo) by (non_existent_label), non_existent_label is propagated into the result with an empty label value whenever the label doesn't exist in the time series being aggregated. Not sure if this is a bug or desired behavior, but in any case, it currently triggers bug 1 (see the sketch below).
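
As a rough sketch of one possible guard for bug 1 (using hypothetical stand-in types, not the real client_golang model package or the actual fix), dropping empty-valued labels before fingerprinting would make the empty label propagated by sum(foo) by (non_existent_label) harmless:

package main

import "fmt"

// Hypothetical stand-ins for the client_golang model types; this is only a
// sketch of the idea, not the change that went into Prometheus.
type (
    LabelName  string
    LabelValue string
    Metric     map[LabelName]LabelValue
)

// withoutEmptyLabels returns a copy of m with empty-valued labels removed,
// so a label propagated by an aggregation over a non-existent label never
// reaches the fingerprinting code with an empty value.
func withoutEmptyLabels(m Metric) Metric {
    out := make(Metric, len(m))
    for name, value := range m {
        if value == "" {
            continue // treat an empty-valued label like a missing one
        }
        out[name] = value
    }
    return out
}

func main() {
    m := Metric{"__name__": "foo", "non_existent_label": ""}
    fmt.Println(withoutEmptyLabels(m)) // map[__name__:foo]
}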

juliusv added a commit that referenced this issue Jan 24, 2014
This fixes bug 2. of #374

Change-Id: Ia4a13153616bafce5bf10597966b071434422d09
juliusv closed this as completed Mar 9, 2014
simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017
Add cachebusting fingerprints to asset files.
lock bot commented Mar 25, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 25, 2019