Prometheus head series memory issues with high-cardinality ephemeral metrics #10598
dweinshenker started this conversation in General
Overview
When Prometheus rapidly ingests a large number of high-cardinality, ephemeral metrics, sample limits do not seem to prevent the head block from filling with these stale series. This appears to cause Prometheus to crash from memory exhaustion and then fail to replay the WAL on restart.
We deleted the WAL and set up relabeling rules to drop the offending metrics, but we want to know whether there is anything else we can do to prevent this situation from recurring. We would also like advice on handling cases where high-cardinality metrics are spread across many targets: per-target sample limits may then fail to keep the head block from filling with high-cardinality series.
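For anyone hitting the same symptom, one way to locate which metric is exploding the head is a cardinality query against Prometheus itself. This is a sketch, not from the original report; the query is expensive on a large TSDB, so run it sparingly or use the TSDB status page instead:

```promql
# Top 10 metric names by number of active series in the head.
# Expensive: it matches every series, so prefer the /tsdb-status page
# (Status -> TSDB Stats in the UI) on heavily loaded servers.
topk(10, count by (__name__) ({__name__=~".+"}))
```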
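The relabeling we applied can be sketched as a `metric_relabel_configs` rule. The job name and target below are hypothetical stand-ins, not our actual config; the rule drops any scraped sample that carries a non-empty `request_id` label before it is ingested:

```yaml
scrape_configs:
  - job_name: problem-service        # hypothetical job name
    static_configs:
      - targets: ['app:9100']        # hypothetical target
    metric_relabel_configs:
      # Drop every sample that has a request_id label, so each UUID
      # value no longer creates a distinct series in the head block.
      - source_labels: [request_id]
        regex: .+
        action: drop
```

Note that `metric_relabel_configs` runs after the scrape but before ingestion, so dropped samples still count toward `scrape_samples_scraped` but not toward `scrape_samples_post_metric_relabeling` or the stored series.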
Config
Prometheus version: v2.33.4
Scrape config: (attached as a collapsed snippet, not reproduced here)
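Since the original scrape config was collapsed, here is a minimal sketch of how `sample_limit` was applied on the job. The job name, target, and limit value are illustrative assumptions, not the real config:

```yaml
scrape_configs:
  - job_name: problem-service     # hypothetical job name
    sample_limit: 10000           # illustrative value; a scrape exceeding
                                  # this is failed and its samples discarded
    static_configs:
      - targets: ['app:9100']     # hypothetical target
```

When a scrape exceeds `sample_limit`, the entire scrape is rejected and `up` for that target goes to 0, which matches the gaps described below.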
We encountered an issue where Prometheus was ingesting a histogram metric from a target that had a `request_id` label with UUID values. This caused Prometheus to crash after its head series filled up.

Below is a graph showing `prometheus_tsdb_head_series` increasing for this particular Prometheus HA pair, even while the `up > 0` series below it shows gaps where the sample limits were being exceeded.

We had a `sample_limit` in place on this particular job, but it did not protect Prometheus from crashing due to having too many series in the head block.

Below is a graph showing two metrics for the problematic target: `scrape_sample_limit` and `scrape_samples_post_metric_relabeling`. While the target is emitting these problematic metrics, `scrape_samples_post_metric_relabeling` grows as expected until it hits the scrape limit, then falls back down for a while before scraping resumes, and the cycle repeats. Another curious question here is why the sample limit is exceeded on this target for a couple of scrapes before samples stop being ingested from the target.