Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NaN checks in [Float]Histogram.Equals method #12891

Merged
merged 5 commits into from
Oct 17, 2023

Conversation

linasm
Copy link
Contributor

@linasm linasm commented Sep 26, 2023

This PR modifies [Float]Histogram.Equals method to compare Sum fields at binary level (using math.Float64bits) which would recognize identical NaN (among them, staleness markers) values as equal, to make native histogram equality checks consistent with plain (non-histogram) sample equality check:

if t == msMaxt {
// We are allowing exact duplicates as we can encounter them in valid cases
// like federation and erroring out at that time would be extremely noisy.
// This only checks against the latest in-order sample.
// The OOO headchunk has its own method to detect these duplicates.
if math.Float64bits(s.lastValue) != math.Float64bits(v) {
return false, 0, storage.ErrDuplicateSampleForTimestamp
}
// Sample is identical (ts + value) with most current (highest ts) sample in sampleBuf.
return false, 0, nil

Also (for FloatHistogram) Count, ZeroCount and bucket values are now compared at binary level.

Signed-off-by: Linas Medziunas <linas.medziunas@gmail.com>
@linasm
Copy link
Contributor Author

linasm commented Sep 26, 2023

Not really sure about this test failure on Windows in CI, I cannot reproduce it locally (on Mac), could it be a flaky test?

=== RUN   TestDeletedSamplesAndSeriesStillInWALAfterCheckpoint
    head_test.go:1364: 
        	Error Trace:	D:/a/prometheus/prometheus/tsdb/head_test.go:1364
        	Error:      	Not equal: 
        	            	expected: 9999
        	            	actual  : 12888
        	Test:       	TestDeletedSamplesAndSeriesStillInWALAfterCheckpoint
--- FAIL: TestDeletedSamplesAndSeriesStillInWALAfterCheckpoint (0.39s)

@beorn7
Copy link
Member

beorn7 commented Sep 26, 2023

That's a good point. The sum can actually be a non-stale NaN (after an observation of NaN), and the question is if we then want to consider those equal to other non-stale NaNs… I'll think about it.

And yeah, MS Windows tests have a history of being flaky.

@beorn7
Copy link
Member

beorn7 commented Sep 27, 2023

OK, so currently we only use the Equals method in tests and in head_append.go. The latter is the starting point of your argument, so that's fine.

We do not yet implement the == operator between histograms (and we might never do), but if we did, it would be tempting to use the Equals method. Stale-NaNs will never show up in the PromQL evaluation, but "normal" NaNs might. And for the sake of a PromQL ==, an NaN should never equal any other NaN. So we should make clear in the doc comment, that the Equals method is not about mathematical equality.

Another point is if we need to give the same treatment to other floats in the histograms. We have implemented the division operator by now, so you could divide a histogram by zero, and then a count of zero could become NaN. I believe we have to somehow special-case what happens if a histogram is divided by zero (probably removing all buckets, setting the zero bucket count to +Inf if it was non-zero and to NaN otherwise). And then there is the case of NaN's coming in from the outside. But we could mandate that this is against the exposition standard and do some validation during ingestion.

Soooo, let's say the same treatment should also apply to fh.ZeroCount and fh.Count of a FloatHistogram, and the doc comment should describe the precise behavior. @linasm could you add these changes?

Signed-off-by: Linas Medziunas <linas.medziunas@gmail.com>
Signed-off-by: Linas Medziunas <linas.medziunas@gmail.com>
@linasm
Copy link
Contributor Author

linasm commented Sep 28, 2023

@beorn7 I have made the changes based on your suggestion:

  1. Also comparing FloatHistogram.[Zero]Count using math.Float64bits.
  2. Updated doc comments (but I'm really open to some better formulation).

Also, what about comparing FloatHistogram bucket values?

@linasm linasm changed the title Fix NaN sum check in [Float]Histogram.Equals method Fix NaN checks in [Float]Histogram.Equals method Sep 28, 2023
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long review backlog.

I have thought about the buckets. Initially, I thought we can avoid the comparison by making sure that buckets get removed after a division by zero. However, there are weird edge cases where a NaN bucket value is more or less legitimate. (The one I could come up with: A bucket count could overflow, resulting in +Inf. Then we could multiply this histogram by -1, so the bucket count is -Inf, and finally add the result to the original histogram, resulting in a bucket count of NaN. Obviously, these cases are all pretty academic because a histogram with such an overflow will be mostly useless, but we still have to handle it correctly.)

So I guess we have to do the following: For a float histogram, we also have to compare bucket counts based on their bit pattern. For an integer histogram, nothing changes. Sadly, that mean that the generic bucketsMatch helper function has to be manually specialized for float and ints.

Does this make sense?

@@ -307,13 +307,16 @@ func (h *FloatHistogram) Sub(other *FloatHistogram) *FloatHistogram {
// Exact match is when there are no new buckets (even empty) and no missing buckets,
// and all the bucket values match. Spans can have different empty length spans in between,
// but they must represent the same bucket layout to match.
// Non-metadata float fields (Sum, Count, ZeroCount) are compared as binary (using math.Float64bits).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Metadata" is a highly overloaded word.

Suggestion: "Sum, Count, and ZeroCount are compared based on their bit patterns because this method is about data equality rather than mathematical equality."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comments (agree on "metadata").

Signed-off-by: Linas Medziunas <linas.medziunas@gmail.com>
@linasm linasm force-pushed the fix-gaps-in-histogram-equals branch from d7e2818 to ec823d9 Compare October 9, 2023 13:10
@linasm
Copy link
Contributor Author

linasm commented Oct 9, 2023

So I guess we have to do the following: For a float histogram, we also have to compare bucket counts based on their bit pattern. For an integer histogram, nothing changes. Sadly, that mean that the generic bucketsMatch helper function has to be manually specialized for float and ints.

Makes sense. Updated float bucket comparisons to bitwise. Only had to sepcialize bucketsMatch into its float flavour. For integer buckets, slices.Equal from stdlib does the trick. It comes from golang.org/x/exp/slices, but I see it being used in multiple places already, so I hope this is not an issue.

@linasm linasm requested a review from beorn7 October 9, 2023 13:16
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. Only one tiny comment amendment left.

@@ -307,13 +307,17 @@ func (h *FloatHistogram) Sub(other *FloatHistogram) *FloatHistogram {
// Exact match is when there are no new buckets (even empty) and no missing buckets,
// and all the bucket values match. Spans can have different empty length spans in between,
// but they must represent the same bucket layout to match.
// Sum, Count, and ZeroCount are compared based on their bit patterns because this method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buckets need to be mentioned here, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I always forget those comments. Updated.

Signed-off-by: Linas Medziunas <linas.medziunas@gmail.com>
@linasm linasm force-pushed the fix-gaps-in-histogram-equals branch from f6b3390 to 62bbb81 Compare October 14, 2023 18:30
@linasm
Copy link
Contributor Author

linasm commented Oct 14, 2023

@beorn7 one more thing - at some point I noticed that Equals does not check for CounterResetHint equality. Is this intentional for some reason, or should we add that check?

@linasm linasm requested a review from beorn7 October 14, 2023 19:22
@beorn7
Copy link
Member

beorn7 commented Oct 17, 2023

You are right. The CounterResetHint should also be checked. Could you add that to this PR, too (including updating the doc comments once more)?

@linasm linasm changed the title Fix NaN checks in [Float]Histogram.Equals method Fix gaps in [Float]Histogram.Equals method Oct 17, 2023
@linasm
Copy link
Contributor Author

linasm commented Oct 17, 2023

You are right. The CounterResetHint should also be checked. Could you add that to this PR, too (including updating the doc comments once more)?

Added a check for CounterResetHint, also updated PR title and description to reflect the expanded scope of changes.

I believe doc comment update is not necessary for CounterResetHint, because other trivial field comparisons are not explicitly documented, either (such as Schama, integer Count, etc.). Only non-trivial fields are documented (buckets/spans and math.Float64bits-compared fields). Everything else is covered by blanket Equals returns true if the given histogram matches exactly. statement.
Let me know if you still think the docs comment needs update.

@linasm
Copy link
Contributor Author

linasm commented Oct 17, 2023

Oh, now we are getting this failure, and apparently there is no infrastructure (parser?) to set up an expected value for CounterResetHint. Not sure how much this would extend the scope, maybe it is better to revert my last commit and deal with CounterResetHint separately?

    --- FAIL: TestEvaluations/testdata/native_histograms.test (0.10s)
        test.go:105: 
            	Error Trace:	/__w/prometheus/prometheus/promql/test.go:105
            	            				/__w/prometheus/prometheus/promql/test.go:83
            	Error:      	Received unexpected error:
            	            	error in eval incr_histogram (line 100): expected {{count:14 sum:24 buckets:[1 12 1]}} for {__name__="incr_histogram"} but got {{count:14 sum:24 buckets:[1 12 1]}}

@beorn7
Copy link
Member

beorn7 commented Oct 17, 2023

Hmm, yeah, good point. We are really getting deep into the rabbit hole here.

Here are my (rabbit hole deep) thoughts: This Equals method is essentially used to allow duplicate samples (same timestamp, same sample value) to get ingested without an error. Maybe the reset hint doesn't really matter here. Or maybe it even have to ignore it because a duplicate histogram will lead to "no reset", even if the first instance of the same sample was happening directly after a counter reset.

So after all, let's just ignore the reset hint for now.

@linasm linasm changed the title Fix gaps in [Float]Histogram.Equals method Fix NaN checks in [Float]Histogram.Equals method Oct 17, 2023
@linasm linasm force-pushed the fix-gaps-in-histogram-equals branch from 22240ce to 62bbb81 Compare October 17, 2023 17:33
@linasm
Copy link
Contributor Author

linasm commented Oct 17, 2023

So after all, let's just ignore the reset hint for now.

Agreed, reverted CounterResetHint related changes.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@beorn7 beorn7 merged commit f33bffa into prometheus:main Oct 17, 2023
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants