-
Notifications
You must be signed in to change notification settings - Fork 11.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alerting: Attach hash of instance labels to state history log lines #65968
Conversation
@@ -2,6 +2,7 @@ package historian | |||
|
|||
import ( | |||
"context" | |||
"crypto/md5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the algorithm is mostly irrelevant, can we use one of the SHA families instead? Those are supported in the browser's webcrypto API and might allow us to use the same fingerprinting on the front-end for areas where we don't have access to data from the Alert State History API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check @JohnnyQQQQ's comment: #65968 (comment)
I'm thinking of instead just directly lifting prometheus's hash function. It's just a basic FNV hash and we can just use its exact same base and offset. This gives us some stronger alignment with prometheus, and the UI can also easily implement this and get the same values - WDYT? There's no real reason to use any crypto hash here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up going this route. The hash function we used is implemented here: https://github.com/prometheus/common/blob/main/model/fnv.go
And it serializes the label set using: https://github.com/prometheus/common/blob/2f04d2ec94b96df44aa4f8a60ae9fb29d8512636/model/signature.go#L33
This should translate pretty easily, it's faster, and more universal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one nit.
@@ -291,6 +295,7 @@ type lokiEntry struct { | |||
Values *simplejson.Json `json:"values"` | |||
DashboardUID string `json:"dashboardUID"` | |||
PanelID int64 `json:"panelID"` | |||
InstanceID string `json:"instanceID"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Prometheus it's called fingerprint
. Perhaps we want to pick that up, so the naming is kind of the same, as it's the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also pick up the implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code LGTM, but please see my comment.
@@ -242,14 +243,17 @@ func statesToStream(rule history_model.RuleMeta, states []state.StateTransition, | |||
continue | |||
} | |||
|
|||
sanitizedLabels := removePrivateLabels(state.Labels) | |||
instFingerprint, _ := json.Marshal(sanitizedLabels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct.
There's no sorting of the labels so so the same labels might produce different results.
On top of that, why do we need the json marshalling? (which is very expensive - probably more than the hashing itself).
I'd replace the current approach with:
md5.Sum([]byte(sanitizedLabels.String()))
Which a) would remove the need for JSON marshalling and then have a guaranteed order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The json package sorts the map keys automatically - this is actually why I originally chose json, because it's stable. String()
felt more unreliable to me.
I instead decided to lift the exact same fingerprint calculation that Prometheus implements. It's a super simple FNV hash with no intermediate step, so it's faster than all other suggestions - and it also sorts keys, and has the side effect of producing the same fingerprint across alertmanagers 😄 . There was no need for a crypto hash here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lifting FNV is something I can def get behind - historically, JSON computations are pretty expensive in Go and I'd to shy away from it when possible.
Out of curiosity though, why does String()
feels unreliable to you? In my eyes, this looks pretty standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
d15eed8
to
1cdd0bc
Compare
@@ -263,7 +264,8 @@ func statesToStream(rule history_model.RuleMeta, states []state.StateTransition, | |||
Condition: rule.Condition, | |||
DashboardUID: rule.DashboardUID, | |||
PanelID: rule.PanelID, | |||
InstanceLabels: removePrivateLabels(state.Labels), | |||
Fingerprint: labelFingerprint(sanitizedLabels), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
state contains cacheID, why don't you use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternative, state has a method GetAlertInstanceKey that returns result that contains a hash of the labels. The only difference is that it include private labels. Why do you remove private labels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments above, I think it's a much more portable solution to match with Prometheus's fingerprint calculation.
State history has always removed private labels for a very long time, even back when it wrote only annotations. The fingerprint should definitely be based off the labels that actually get written - otherwise, the result is confusing and won't align with fingerprints the UI might be calculating.
Whether we strip private labels at all is another issue, making this behavior change everywhere is something we can consider - but such a change should be made separately from this PR. In this scope of this PR, the fingerprint should follow the labels we actually store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand, private labels contain information such as alert rule UID, by removing this label and calculating a hash based on only public ones you may encounter a situation when there are two rules that do not have any labels and produce the exact same set of labels would end up with exact same fingerprints.
…65968) * Add instanceID which is hash of labels * Rename field to fingerprint * Move to prometheus style signature * Appease linter
What is this feature?
Adds a new field in the log line,
instanceID
, which will be unique based on the alert instance for which the log line is attached. If two lines are associated with the same alert instance, they will have the same instance ID.Why do we need this feature?
This allows for new types of LogQL queries on the data - queries are now able to group by alert instance which allows for new types of analysis.
Which issue(s) does this PR fix?:
n/a
Special notes for your reviewer:
Please check that: