Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade upstream Prometheus: add keep_firing_for support to alerting rules and track 499 on canceled requests #4099

Merged
merged 3 commits into from
Jan 31, 2023

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Jan 27, 2023

What this PR does

This PR updates the upstream Prometheus. The user facing changes are:

  • Add keep_firing_for support to alerting rules
  • Track 499 in cortex_request_duration_seconds_count for canceled queries

Note to reviewers

What has changed in updated vendored libs

Alertmanager changes:

  • Migrated to github.com/hashicorp/golang-lru/v2
  • PushoverConfig introduced UserKeyFile and TokenFile
  • Refactored nflog.Log options and maintenance (commit)
-       github.com/prometheus/alertmanager v0.25.0
+       github.com/prometheus/alertmanager v0.25.1-0.20230119163903-f59460bfd4bf

Prometheus changes:

  • Added KeepFiringFor to Rule and AlertingRule, and KeepFiringSince to Alert. The 'keep_firing_for' field specifies the minimum amount of time that an alert should remain firing, even if the expression does not return any results.
  • 499 on canceled request
  • TSDB in our fork now requires ShardFunc
  • More changes to native histograms (not yet supported by Mimir main, so didn't review changes in depth)
-replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20230125082610-38af345deab1
+replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20230127083412-f1986bc584b6

Checked the diff and AWS SDK just added new endpoints:

-       github.com/aws/aws-sdk-go v1.44.159
+       github.com/aws/aws-sdk-go v1.44.187

Which issue(s) this PR fixes or relates to

Fixes #4071

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Makefile Outdated
@@ -327,7 +327,14 @@ lint: check-makefiles
# Note that we don't automatically suggest replacing sort.Float64s() with slices.Sort() as the documentation for slices.Sort()
# at the time of writing warns that slices.Sort() may not correctly handle NaN values.
faillint -paths \
"sort.{Strings,Ints}=golang.org/x/exp/slices.Sort" \
"sort.{Strings,Ints}=golang.org/x/exp/slices.{Sort}" \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: it wasn't working as expected. Functions must be wrapped by {}.

@@ -188,19 +188,25 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) {
am.state = newReplicatedStates(cfg.UserID, cfg.ReplicationFactor, cfg.Replicator, cfg.Store, am.logger, am.registry)
am.persister = newStatePersister(cfg.PersisterConfig, cfg.UserID, am.state, cfg.Store, am.logger, am.registry)

am.wg.Add(1)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: moved below because this was used to track the maintenance goroutine which we're not responsible to start.

am.nflog, err = nflog.New(
nflog.WithRetention(cfg.Retention),
nflog.WithSnapshot(filepath.Join(cfg.TenantDataDir, notificationLogSnapshot)),
nflog.WithMaintenance(maintenancePeriod, am.stop, am.wg.Done, nil),
Copy link
Collaborator Author

@pracucci pracucci Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: this option doesn't exist anymore because now you're expected to start the maintenance goroutine (done below).

Labels labels.Labels `json:"labels"`
Annotations labels.Labels `json:"annotations"`
State string `json:"state"`
ActiveAt *time.Time `json:"activeAt,omitempty"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: added omitempty to keep parity with Prometheus.

Copy link
Contributor

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with comments.
Is there any chance to get the shardfunc onto Prometheus main? I feel we need to review any update to tsdb from now to check if it hardcodes using labels.Hash

@@ -69,4 +69,6 @@ message AlertStateDesc {
[(gogoproto.nullable) = false, (gogoproto.stdtime) = true];
google.protobuf.Timestamp valid_until = 9
[(gogoproto.nullable) = false, (gogoproto.stdtime) = true];
google.protobuf.Timestamp keep_firing_since = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] not sure if it's a problem but the generated definition does not have the omittempty flag in the json tag, due nullable=false I think. For both this and active_at

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use JSON here. The exported JSON is based on the other manually defined struct.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pracucci
Copy link
Collaborator Author

Is there any chance to get the shardfunc onto Prometheus main? I feel we need to review any update to tsdb from now to check if it hardcodes using labels.Hash

We should talk to @bboreham about it.

@pracucci pracucci marked this pull request as ready for review January 27, 2023 14:47
@pracucci pracucci requested review from a team as code owners January 27, 2023 14:47
pkg/alertmanager/api.go Show resolved Hide resolved
pkg/storage/sharding/label.go Outdated Show resolved Hide resolved
pkg/storage/sharding/label_test.go Outdated Show resolved Hide resolved
pkg/storage/sharding/label.go Outdated Show resolved Hide resolved
pkg/storage/sharding/label.go Outdated Show resolved Hide resolved
@pracucci
Copy link
Collaborator Author

Comment on Slack from @colega:

I think that we need a test in compactor and other features that rely on the stability of the labels hash that checks that given list of series falls into the expected list of shards.

I remember we have for distributors but I don't remember for compactor. I will check it.

Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for doing the lord's work on implementing keep_firing_for so quickly.

pkg/alertmanager/alertmanager.go Show resolved Hide resolved
pkg/ruler/api.go Outdated Show resolved Hide resolved
@pracucci pracucci marked this pull request as draft January 30, 2023 11:05
@pracucci pracucci force-pushed the update-upstream-prometheus branch 2 times, most recently from 38b7586 to 7647c66 Compare January 30, 2023 11:18
@pracucci
Copy link
Collaborator Author

I think that we need a test in compactor and other features that rely on the stability of the labels hash that checks that given list of series falls into the expected list of shards.

I remember we have for distributors but I don't remember for compactor. I will check it.

I confirm we have one for distributors:

func TestDistributor_Push_ShouldGuaranteeShardingTokenConsistencyOverTheTime(t *testing.T) {

@pracucci pracucci force-pushed the update-upstream-prometheus branch 5 times, most recently from ebfce8b to 304b184 Compare January 30, 2023 15:10
@pracucci pracucci marked this pull request as ready for review January 30, 2023 15:13
Copy link
Contributor

@colega colega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for the tests.

Please fix this comment before merging:

// promql.ErrQueryCanceled, mapped to 503
// promql.ErrQueryTimeout, mapped to 503

…rules and track 499 on canceled requests

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci enabled auto-merge (squash) January 31, 2023 15:25
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci merged commit 5ac2d26 into main Jan 31, 2023
@pracucci pracucci deleted the update-upstream-prometheus branch January 31, 2023 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Querier tracks cortex_request_duration_seconds_count tracks status_code=503 on "context canceled"
5 participants