Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional label to cortex_discarded_samples_total with the label value defined by a config flag #3439

Merged
merged 51 commits into from
Jan 10, 2023

Conversation

fayzal-g
Copy link
Contributor

@fayzal-g fayzal-g commented Nov 11, 2022

What this PR does

Gives the ability to specify a config flag: validation.separate-metrics-label which will add an additional label group to cortex_discarded_samples_total to take the value of a label on an incoming timeseries. This works similar to HATracker and the cluster label. Default value of this config flag is "" which means when this flag is not set, it should get dropped when scraped by Prometheus.

There is a limit to the number of active groups which is set by by the -max-groups-per-user flag. If a new group is received whilst the group limit has been reached, the value will be changed to "other". Inactive groups are cleaned up every minute, and a group is considered inactive if it has not been updated in 20 minutes.

The counters that this affects are as follows:

Distributor

  • discardedSamplesTooManyHaClusters
  • discardedSamplesRateLimited

Ingester Metrics

  • discardedSamplesSampleOutOfBounds
  • discardedSamplesSampleOutOfOrder
  • discardedSamplesSampleTooOld
  • discardedSamplesNewValueForTimestamp
  • discardedSamplesPerUserSeriesLimit
  • discardedSamplesPerMetricSeriesLimit

Validate

  • missingMetricName
  • invalidMetricName
  • maxLabelNamesPerSeries
  • invalidLabel
  • labelNameTooLong
  • labelValueTooLong
  • duplicateLabelNames
  • labelsNotSorted
  • tooFarInFuture

For the above counters, the deletion of these metrics has been updated to discardedCounter.DeletePartialMatch(userID)

This builds upon the work started by @LeviHarrison (https://github.com/LeviHarrison) in: #2702

Which issue(s) this PR fixes or relates to

#2420

Fixes #2420

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@colega
Copy link
Contributor

colega commented Nov 12, 2022

Sorry for not being constructive in this feedback, but1 I had to read the code to understand what "split metrics by a further label" means. I'm not sure that "Split" is the best word, although I can't come up with a better name. "Label" as a verb would be a proper term, but it also wouldn't help understanding the feature IMO.

Footnotes

  1. Everyone knows that the important part always comes after the "but".

@fayzal-g fayzal-g changed the title Split metrics by a further label specified by a config flag Add additional label to certain metrics defined by a config flag Nov 21, 2022
@fayzal-g fayzal-g marked this pull request as ready for review November 21, 2022 12:16
@fayzal-g fayzal-g requested a review from a team as a code owner November 21, 2022 12:16
@osg-grafana
Copy link
Contributor

Do we need a documentation issue for this?

Copy link
Contributor

@osg-grafana osg-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s get closer to clarity together. Blocking for now so we can throw some more ideas back and forth.

@osg-grafana osg-grafana added the type/docs Improvements or additions to documentation label Nov 21, 2022
@fayzal-g fayzal-g marked this pull request as draft November 21, 2022 12:23
@fayzal-g fayzal-g changed the title Add additional label to certain metrics defined by a config flag Add additional label to cortex_discarded_samples_total with the value defined by a config flag Nov 23, 2022
@fayzal-g fayzal-g changed the title Add additional label to cortex_discarded_samples_total with the value defined by a config flag Add additional label to cortex_discarded_samples_total with the label value defined by a config flag Nov 23, 2022
@fayzal-g fayzal-g force-pushed the separate-metrics branch 2 times, most recently from 8520037 to 6a415da Compare November 24, 2022 10:48
@fayzal-g fayzal-g marked this pull request as ready for review November 24, 2022 11:11
@fayzal-g fayzal-g requested a review from a team as a code owner November 24, 2022 11:11
@fayzal-g fayzal-g requested a review from a team November 24, 2022 11:11
osg-grafana
osg-grafana previously approved these changes Nov 24, 2022
Copy link
Contributor

@osg-grafana osg-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc is fine

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please list this feature as experimental in the list at docs/sources/operators-guide/configure/about-versioning.md and also add a CHANGELOG entry for this [FEATURE] mentioning it's experimental.

pkg/util/push/otel.go Outdated Show resolved Hide resolved
pkg/api/api.go Outdated Show resolved Hide resolved
@replay
Copy link
Contributor

replay commented Nov 29, 2022

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

Copy link
Contributor

@colega colega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great job! Thank you for addressing all my feedback.

CHANGELOG.md Outdated Show resolved Hide resolved
@pracucci pracucci self-requested a review January 9, 2023 13:16
@pracucci
Copy link
Collaborator

pracucci commented Jan 9, 2023

Please list this feature as experimental in the list at docs/sources/operators-guide/configure/about-versioning.md

I think this is still missing.

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! Overall LGTM. I left many minor comments. I think I found an issue around locking in ActiveGroupsCleanupService.iteration() (see comment there). In few cases I left comments about stuff that I would like to see in a follow up PR.

pkg/mimir/mimir.go Outdated Show resolved Hide resolved
@@ -98,6 +98,7 @@ type Config struct {
MultitenancyEnabled bool `yaml:"multitenancy_enabled"`
NoAuthTenant string `yaml:"no_auth_tenant" category:"advanced"`
ShutdownDelay time.Duration `yaml:"shutdown_delay" category:"experimental"`
MaxGroupsPerUser int `yaml:"max_groups_per_user" category:"experimental"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non blocking because marked as experimental and we can move it around] It's a bit an anti-pattern defining it here. Thinking about a more actionable feedback to give you.

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
pkg/util/validation/separate_metrics.go Outdated Show resolved Hide resolved
pkg/util/active_groups.go Outdated Show resolved Hide resolved
Comment on lines +135 to +138
if s.activeGroups.ActiveGroupLimitExceeded(user, group) {
group = "other"
}
s.activeGroups.UpdateGroupTimestampForUser(user, group, now)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[not this PR, but for a follow up PR] We need to take the lock and lookup the map twice. It would be reduced to 1 if we moved the limit logic to UpdateGroupTimestampForUser() (e.g. passing the limit in input). If the limit is reached, it could return a specific error which we then check here.

pkg/util/active_groups.go Outdated Show resolved Hide resolved
pkg/util/active_groups.go Show resolved Hide resolved
pkg/util/active_groups_test.go Outdated Show resolved Hide resolved
@@ -91,11 +84,31 @@ func (ag *ActiveGroups) PurgeInactiveGroupsForUser(userID string, deadline int64
return deletedGroups
}

func (ag *ActiveGroups) PurgeInactiveGroups(inactiveTimeout time.Duration, cleanupFuncs ...func(string, string)) {
userIDs := make([]string, len(ag.timestampsPerUser))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't access ag.timestampsPerUser outside the lock. Please move it after ag.mu.RLock().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to simplify the code, initialise the capacity, not the length, and then just use userIDs = append(userIDs, userID). This removes the need of keeping i.

ag.mu.RUnlock()

for _, userID := range userIDs {
inactiveGroups := ag.PurgeInactiveGroupsForUser(userID, time.Now().Add(-inactiveTimeout).UnixNano())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please compute time.Now().Add(-inactiveTimeout).UnixNano() just once, before the for loop and then just reuse it.

Copy link
Collaborator

@pracucci pracucci left a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release/notified-changelog-cut type/docs Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subdivide some per-tenant metrics by another label
5 participants