Add a scrape_job label to scrape_samples_* metrics #9285

Closed
prymitive opened this issue Sep 1, 2021 · 9 comments

@prymitive
Contributor

Proposal

Use case. Why is this important?

Context: cardinality is a big issue for large deployments, and it often requires setting scrape limits via the sample_limit parameter to avoid accidentally exporting a huge number of samples per target. This works well, but it's pretty binary: either the target exposes fewer samples than the limit and everything works, or it exposes more than the limit allows and all samples are rejected, so it's a bit of a last-resort tool.
One of the problems with using sample_limit is that it is not always easy to monitor scrape jobs and alert when they get too close to the limit. This is due to the flexibility Prometheus allows in scrape configuration: the job_name in a scrape config block doesn't have to be the same as the job label on the scrape_samples_post_metric_relabeling metric that describes the per-target result of a scrape.
Some other scrape-related metrics (like prometheus_target_scrape_pool_targets) have a scrape_job label that is identical to the job_name key in the scrape configuration block, which allows mapping them 1:1 to the configuration itself.

Request: would it be possible to add a scrape_job label to at least scrape_samples_scraped & scrape_samples_post_metric_relabeling, so it's possible to link these series to a scrape configuration block? That would help with auditing the TSDB when Prometheus is ingesting too many samples.

@roidelapluie
Member

I do not see a way to do this without a breaking change; I expect some users to already have a scrape_job label in their configuration. It would also not benefit most users, who have a single limit or always have job equal to scrape_job.

However, what would you think about something similar to #9247 but for sample_limit, as an alternative (behind the same feature flag)?

@prymitive
Contributor Author

Yes, it could be a breaking change for some users, but I'm not sure what the promise/policy is on label stability for internal metrics; maybe a note in the changelog would be enough?

I'm happy with a new metric and a feature flag; I don't really mind how this is surfaced, and it would solve my problem.
I was exporting sample_limit myself, but it sounds like that could also be exported under a feature flag, similar to scrape_timeout_seconds?

@roidelapluie
Member

> Yes, it could be a breaking change for some users, but I'm not sure what the promise/policy is on label stability for internal metrics; maybe a note in the changelog would be enough?

In Prometheus 2.x users can use any label. We reserve the __ prefix for internal use in relabeling, with the exception of __tmp.

> I'm happy with a new metric and a feature flag; I don't really mind how this is surfaced, and it would solve my problem.
> I was exporting sample_limit myself, but it sounds like that could also be exported under a feature flag, similar to scrape_timeout_seconds?

It would be behind the same feature flag; that's why it is called extra-scrape-metrics. The metric would be inserted for each target after each scrape.

@roidelapluie
Member

Do you want to give adding a scrape_sample_limit metric a go? #9247 is in.

@prymitive
Contributor Author

Yes, I'll try to work on this today, or if not, next week. Thanks

@prymitive
Contributor Author

From a quick reading of the scrape code I see that if I wanted to have a metric with a scrape_job label, it wouldn't be straightforward to add it the same way as scrape_timeout_seconds.
It's all reported in scrapeLoop->report():

```go
func (sl *scrapeLoop) report(app storage.Appender, start time.Time, timeout, duration time.Duration, scraped, added, seriesAdded int, scrapeErr error) (err error) {
```

which calls scrapeLoop->addReportSample():

```go
func (sl *scrapeLoop) addReportSample(app storage.Appender, s string, t int64, v float64) error {
```

There are two problems here:
a) scrapeLoop doesn't have access to the scrape config struct, so it doesn't know the value of the job_name field; it's only accessible from scrapePool, which is one level up. We could pass it down, though.
b) addReportSample doesn't deal with any labels other than __name__, so at the moment it cannot be used to expose metrics with a scrape_job label.

This really means that addReportSample doesn't feel like the ideal place for adding any metric that is exposed per config.ScrapeConfig. Plus, the newly added scrape_timeout_seconds will share the same problem I'm facing right now: it's really exposed per job label, so it's not always easy to map it back to a config.ScrapeConfig, as there's no explicit link (like a scrape_job label).
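
(For illustration only, here is a minimal sketch of what the extended report path discussed above could look like if job_name were passed down from scrapePool. This is not the actual Prometheus implementation: the function name and the scrape_job label are assumptions, and the import paths follow recent releases, where the labels package lives under model/labels rather than pkg/labels.)

```go
// Hypothetical sketch only: appends a report-style sample that carries a
// scrape_job label on top of the target's own labels. In the real code this
// logic would live inside scrapeLoop, with jobName handed down from
// scrapePool, which is where config.ScrapeConfig is available.
package sketch

import (
	"github.com/prometheus/prometheus/model/labels" // pkg/labels in older source trees
	"github.com/prometheus/prometheus/storage"
)

// appendReportSampleWithScrapeJob is a stand-in for an addReportSample
// variant that can set labels other than just __name__.
func appendReportSampleWithScrapeJob(app storage.Appender, name string, targetLabels labels.Labels, jobName string, ts int64, v float64) error {
	b := labels.NewBuilder(targetLabels)
	b.Set(labels.MetricName, name) // e.g. "scrape_samples_post_metric_relabeling"
	b.Set("scrape_job", jobName)   // explicit 1:1 link back to the scrape config's job_name
	_, err := app.Append(0, b.Labels(), ts, v)
	return err
}
```

Passing jobName down would keep the change local to the scrape package; whether the extra label is acceptable for existing dashboards is exactly the compatibility question raised earlier in the thread.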

@roidelapluie
Member

It is not needed to add scrape_job. Your issue is having the sample limit per target; you can add a sample limit metric with the target's labels.

@prymitive
Contributor Author

Well, maybe that's indeed enough; let's start with that.

One of the problems is that the scrape config allows so much flexibility, thanks to relabelling, that it's sometimes difficult to figure out which scrape config was involved.
For example, one can add N scrape configs that all rewrite the job label to the same value. Once you query for scrape_timeout_seconds, all you get is the job and instance labels; since job is the same across those N scrape configs, it's only half useful (it points to a list rather than a single scrape config), and instance can be difficult to track down since it comes from service discovery. Having a scrape_job label that points to a specific config.ScrapeConfig would allow a 1:1 mapping between metrics and scrape configs and avoid this confusion.

But I think you're right: if I get the full set of labels, it will be enough.

Previously I exposed scrape limits via a custom exporter that queries Prometheus for its runtime config and exports per-job metrics describing the limit. When I discovered that the job label itself isn't always enough, I did try to export limit metrics with all the labels (as the PR I'll raise would do), but realised that I would basically have to reimplement service discovery to get this right: I can parse the relabel config and figure out which labels are forced that way, but I cannot tell which labels each target will end up with, hence this ticket.
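
(As a reference point, here is a minimal sketch of the kind of side-car exporter described above, assuming Prometheus's /api/v1/status/config endpoint. The metric name scrape_config_sample_limit, the listen port, and the one-shot refresh are made up for illustration; as noted, this approach only yields per-scrape_job values and cannot reproduce the per-target label sets.)

```go
// Hypothetical sketch only: a tiny exporter that reads Prometheus's runtime
// configuration and exposes the configured sample_limit per scrape config.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"gopkg.in/yaml.v3"
)

// scrape_config_sample_limit is an illustrative metric name; the scrape_job
// label carries the job_name key from each scrape config block.
var sampleLimit = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "scrape_config_sample_limit",
		Help: "sample_limit configured for each scrape config block (0 = no limit).",
	},
	[]string{"scrape_job"},
)

type configResponse struct {
	Data struct {
		YAML string `json:"yaml"` // /api/v1/status/config returns the config as a YAML string
	} `json:"data"`
}

type promConfig struct {
	ScrapeConfigs []struct {
		JobName     string  `yaml:"job_name"`
		SampleLimit float64 `yaml:"sample_limit"`
	} `yaml:"scrape_configs"`
}

// refresh fetches the runtime config from Prometheus and updates the gauges.
func refresh(promURL string) error {
	resp, err := http.Get(promURL + "/api/v1/status/config")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var cr configResponse
	if err := json.NewDecoder(resp.Body).Decode(&cr); err != nil {
		return err
	}
	var cfg promConfig
	if err := yaml.Unmarshal([]byte(cr.Data.YAML), &cfg); err != nil {
		return err
	}
	for _, sc := range cfg.ScrapeConfigs {
		sampleLimit.WithLabelValues(sc.JobName).Set(sc.SampleLimit)
	}
	return nil
}

func main() {
	prometheus.MustRegister(sampleLimit)
	if err := refresh("http://localhost:9090"); err != nil {
		log.Fatal(err)
	}
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```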

@prymitive
Contributor Author

Raised #9295

The prometheus organisation locked this issue as resolved and limited the conversation to collaborators on Mar 3, 2022.