Series quarantined on query #2478

Closed
funkelnd opened this Issue Mar 7, 2017 · 5 comments

funkelnd commented Mar 7, 2017

What did you do?
Queried for a metric using the query_range endpoint (a rough sketch of such a request is shown below the questions). On the first try, I got time series data.

What did you expect to see?
I expected data to be returned on consecutive queries.

What did you see instead? Under which circumstances?
Got no data. Later, the only data available had timestamps newer than the second query.
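
For illustration, a minimal sketch of the kind of query_range request involved against the Prometheus HTTP API. The server URL is taken from web.external-url in the flag list below and the metric name from the log excerpt; both are placeholders, not the exact query used:

```python
# Sketch only: the URL, metric name, and time window are illustrative assumptions.
import requests

PROM_URL = "http://metrics:9090"

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'GET_STATE_OneMinuteRate{instance="localhost:9108"}',
        "start": "2017-03-06T15:00:00Z",  # arbitrary example window
        "end": "2017-03-06T17:00:00Z",
        "step": "60s",                    # matches the 1 minute interval
    },
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
# First run: a list of series with samples. Consecutive runs: an empty result
# once the series has been quarantined.
print(f"{len(result)} series returned")
```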

Environment
There are about 300k metrics from various sources, queried at a 1 minute interval. Retention is set to 6 months and data has been gathered for 9 months, so retention kicked in long ago. A large number of labels is used, which are basically slightly modified versions of the metric names, so each individual metric has a unique label. I am aware that this is an anti-pattern, and the approach is being reworked. The same issue was present at least in 1.5.1. There have been ~250 occurrences in about 1.5 months, but none before.

  • System information:

Linux 2.6.32-573.22.1.el6.x86_64 x86_64

  • Prometheus version:

1.5.2

  • Command-Line Flags:
alertmanager.notification-queue-capacity	10000
alertmanager.timeout	10s
alertmanager.url	
config.file	/opt/prometheus/conf/prometheus.yml
log.format	"logger:stderr"
log.level	"info"
query.max-concurrency	64
query.staleness-delta	5m0s
query.timeout	2m0s
storage.local.checkpoint-dirty-series-limit	7000000
storage.local.checkpoint-interval	5m0s
storage.local.chunk-encoding-version	1
storage.local.dirty	false
storage.local.engine	persisted
storage.local.index-cache-size.fingerprint-to-metric	209715200
storage.local.index-cache-size.fingerprint-to-timerange	209715200
storage.local.index-cache-size.label-name-to-label-values	209715200
storage.local.index-cache-size.label-pair-to-fingerprints	209715200
storage.local.max-chunks-to-persist	3500000
storage.local.memory-chunks	7000000
storage.local.num-fingerprint-mutexes	131072
storage.local.path	/data/db/prometheus
storage.local.pedantic-checks	false
storage.local.retention	4392h0m0s
storage.local.series-file-shrink-ratio	0.3
storage.local.series-sync-strategy	adaptive
storage.remote.graphite-address	
storage.remote.graphite-prefix	
storage.remote.graphite-transport	tcp
storage.remote.influxdb-url	
storage.remote.influxdb.database	prometheus
storage.remote.influxdb.retention-policy	default
storage.remote.influxdb.username	
storage.remote.opentsdb-url	
storage.remote.timeout	30s
version	false
web.console.libraries	console_libraries
web.console.templates	consoles
web.enable-remote-shutdown	false
web.external-url	http://metrics:9090/
web.listen-address	:9090
web.max-connections	512
web.read-timeout	30s
web.route-prefix	/
web.telemetry-path	/metrics
web.user-assets
  • Logs:
time="2017-03-06T16:45:59Z" level=warning msg="Series quarantined." fingerprint=36bc2c2ad185767b metric=GET_STATE_OneMinuteRate{escaped="GET::STATE_OneMinuteRate", instance="localhost:9108", job="dev"} reason="unexpected number of chunk descs loaded for fingerprint 36bc2c2ad185767b: expected 263, got 762" source="storage.go:1699
beorn7 commented Mar 7, 2017

Sadly, 1.5.0 and 1.5.1 both corrupted time series data. Even after upgrading to 1.5.2, those corruptions will occasionally lead to series quarantining (once they are hit while loading data from or persisting data to disk). The corruptions especially affected short-lived and/or sparse series, so you are right in the line of fire here.

funkelnd commented Mar 7, 2017

Thank you for the quick response. Are there any patterns or methods by which it would be possible to determine or predict which time series are affected?

beorn7 commented Mar 7, 2017

The corruption happened under an unlucky coincidence of conditions, the main one being that the series was still in memory despite having all of its chunks persisted. That only happens to a series that no longer receives samples but has not yet been archived. With a lot of series churn, you hit that case more often.

funkelnd commented Mar 7, 2017

Thanks once again.

funkelnd closed this Mar 7, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
