Querier: batch merge iterator handle histograms #4286
Conversation
The batch merge iterator is the default iterator used to let the querier merge the chunks received from store-gateways. The chunk.Batch and stream merge code is optimized to the extreme. Adding histograms into the Batch with any additional indirection destroys the performance, hence the solution proposed in this PR, which adds zero additional allocations when the code deals with float series. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Noticed two small things as I was reading through this (I don't have enough context to comment on the overall change)
I have only looked at non-test files for now, LGTM in general
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Small comments on the tests
Small fix in mockIterator Co-authored-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
There's too much panic() in this PR. The general rule is that we should try to avoid using panic() as much as possible. Can we do anything about it?
// Since Batch is expected to be passed by value, the array needs to be constant sized,
// however increasing the size of the Batch also adds memory management overhead. Using the unsafe.Pointers
// combined with the ValueType implements a kind of "union" type to keep the memory use down.
PointerValues [BatchSize]unsafe.Pointer
Could we use an interface{} instead? I don't understand why we need to use unsafe.Pointer.
I am guessing converting from interface{} to a static type is more expensive than using an unsafe.Pointer? (if not, using interface{} makes sense).
I'm going to try it and measure.
Not too bad: a bit more time (type checks, plus extra allocation time) and memory (extra type info pointer):
cpu: AMD Ryzen 7 4700U with Radeon Graphics
│ benchmark-batch-pointers.txt │ benchmark-batch-interface.txt │
│ sec/op │ sec/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 4.731m ± 1% 4.861m ± 1% +2.75% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 14.35m ± 1% 14.59m ± 1% +1.72% (p=0.002 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 484.3µ ± 1% 496.1µ ± 1% +2.44% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 1.455m ± 1% 1.485m ± 2% +2.09% (p=0.001 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 12.24µ ± 3% 13.37µ ± 3% +9.25% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 32.18µ ± 2% 36.22µ ± 3% +12.56% (p=0.000 n=10)
geomean 515.8µ 541.9µ +5.05%
│ benchmark-batch-pointers.txt │ benchmark-batch-interface.txt │
│ B/op │ B/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 73.43Ki ± 0% 73.93Ki ± 0% +0.68% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 282.9Ki ± 0% 283.9Ki ± 0% +0.35% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 9.586Ki ± 0% 10.086Ki ± 0% +5.22% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 34.24Ki ± 0% 35.24Ki ± 0% +2.92% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 2.062Ki ± 0% 2.562Ki ± 0% +24.24% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 4.859Ki ± 0% 5.859Ki ± 0% +20.58% (p=0.000 n=10)
geomean 20.22Ki 21.96Ki +8.58%
│ benchmark-batch-pointers.txt │ benchmark-batch-interface.txt │
│ allocs/op │ allocs/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 1.013k ± 0% 1.013k ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 3.023k ± 0% 3.023k ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 113.0 ± 0% 113.0 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 323.0 ± 0% 323.0 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 14.00 ± 0% 14.00 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 26.00 ± 0% 26.00 ± 0% ~ (p=1.000 n=10) ¹
geomean 185.5 185.5 +0.00%
And using type inference to get rid of the ValueType field when using interface is even worse:
cpu: AMD Ryzen 7 4700U with Radeon Graphics
│ benchmark-batch-pointers.txt │ benchmark-batch-interface2.txt │
│ sec/op │ sec/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 4.731m ± 1% 10.635m ± 1% +124.78% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 14.35m ± 1% 30.27m ± 2% +110.98% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 484.3µ ± 1% 1074.9µ ± 1% +121.94% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 1.455m ± 1% 3.055m ± 1% +110.02% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 12.24µ ± 3% 19.60µ ± 4% +60.14% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 32.18µ ± 2% 50.80µ ± 3% +57.87% (p=0.000 n=10)
geomean 515.8µ 1.009m +95.53%
│ benchmark-batch-pointers.txt │ benchmark-batch-interface2.txt │
│ B/op │ B/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 73.43Ki ± 0% 73.87Ki ± 0% +0.60% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 282.9Ki ± 0% 283.8Ki ± 0% +0.33% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 9.586Ki ± 0% 10.023Ki ± 0% +4.56% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 34.24Ki ± 0% 35.18Ki ± 0% +2.74% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 2.062Ki ± 0% 2.500Ki ± 0% +21.21% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 4.859Ki ± 0% 5.797Ki ± 0% +19.29% (p=0.000 n=10)
geomean 20.22Ki 21.79Ki +7.78%
│ benchmark-batch-pointers.txt │ benchmark-batch-interface2.txt │
│ allocs/op │ allocs/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 1.013k ± 0% 1.013k ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 3.023k ± 0% 3.023k ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 113.0 ± 0% 113.0 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 323.0 ± 0% 323.0 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 14.00 ± 0% 14.00 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 26.00 ± 0% 26.00 ± 0% ~ (p=1.000 n=10) ¹
geomean 185.5 185.5 +0.00%
}
if a.curr.ValueType == chunkenc.ValHistogram {
	h := (*histogram.Histogram)(a.curr.PointerValues[a.curr.Index])
	return a.curr.Timestamps[a.curr.Index], h.ToFloat()
ToFloat() returns a deep copy, but is it required here?
I'm going to refer to @beorn7 on this one, because:
https://github.com/prometheus/prometheus/blob/5ec1b4baaf01dae0b22567db53886c83b74e0e76/promql/engine.go#L1779
https://github.com/prometheus/prometheus/blob/5ec1b4baaf01dae0b22567db53886c83b74e0e76/promql/engine.go#L1894
https://github.com/prometheus/prometheus/blob/5ec1b4baaf01dae0b22567db53886c83b74e0e76/promql/engine.go#L1924
I will note however that in this experimental phase, we aim for correctness, not optimizations for histograms, so I'd defer this to later. Open an issue?
I've double-checked, and our integration tests fail (on the sparsehistogram branch) if I remove this automatic upconvert.
If the histogram was an integer histogram, and the caller asks for ToFloat(), then we do need to return the converted float histogram. The conversion ends up requiring a deep copy because it's a different struct and the slice type changes (int64 to float). The only thing in common is the histogram spans, which are generally small compared to the bucket slices, so we can just copy those spans for simplicity.
For context to Marco: The difference between integer and float histogram is that integer histogram stores buckets with delta encoding, while float histograms have absolute float values for buckets, which is required by the query engine to work on.
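The conversion described above can be sketched in a few lines (this is an illustrative simplification, not Prometheus's actual code): delta-encoded integer bucket counts must be cumulatively summed and materialized into a fresh float64 slice, which is why the conversion is inherently a deep copy.

```go
package main

import "fmt"

// deltasToAbsoluteFloats reconstructs absolute float bucket counts from
// delta-encoded integer counts. A new slice must be allocated because the
// element type changes from int64 to float64.
func deltasToAbsoluteFloats(deltas []int64) []float64 {
	out := make([]float64, len(deltas))
	var cur int64
	for i, d := range deltas {
		cur += d // cumulative sum reconstructs the absolute count
		out[i] = float64(cur)
	}
	return out
}

func main() {
	// Buckets with absolute counts 3, 5, 4, 4 are stored as deltas 3, +2, -1, 0.
	fmt.Println(deltasToAbsoluteFloats([]int64{3, 2, -1, 0})) // [3 5 4 4]
}
```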
Thanks for answering, @codesome .
Note that, in Prometheus, by calling it.AtFloatHistogram(), we avoid the additional copy, because the implementation of the AtFloatHistogram method will create a FloatHistogram directly from the encoded form of an integer histogram (rather than first materializing an integer Histogram and then calling ToFloat on it).
Relevant code in https://github.com/prometheus/prometheus/blob/main/tsdb/chunkenc/histogram.go , in particular https://github.com/prometheus/prometheus/blob/d33eb3ab17616a54b97d9f7791c791a79823f279/tsdb/chunkenc/histogram.go#L684-L701
Good job Krajo! I haven't found any issues. I left a few minor comments. My only real comment is about the usage of panic(), which I discourage.
Ref: #4286 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Ref: #4286 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
}
return chunkenc.ValNone
}

// At implements chunkenc.Iterator.
func (a *iteratorAdapter) At() (int64, float64) {
	if a.curr.ValueType != chunkenc.ValFloat {
		panic(fmt.Sprintf("Cannot read float from batch %v", a.curr.ValueType))
@codesome and I think panic() in the iterator is justified, as it would indicate a coding error (the user not using a switch to select the right function) and not a data error; these should be caught in testing. Adding an error return value is costly. Changing to returning a zero value would violate our rule to either return correct results or none at all.
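The calling convention this argument relies on can be sketched as follows (names here are illustrative, not the actual iterator interface): callers switch on the value type returned by the iterator's next function and only invoke the matching accessor, so reaching the panic inside At() can only mean a programming error rather than bad input data.

```go
package main

import "fmt"

type valueType int

const (
	valNone valueType = iota
	valFloat
	valHistogram
)

// iterator is a toy stand-in yielding a pre-set sequence of value types.
type iterator struct {
	types []valueType
	i     int
}

func (it *iterator) next() valueType {
	if it.i >= len(it.types) {
		return valNone
	}
	t := it.types[it.i]
	it.i++
	return t
}

func (it *iterator) at() string          { return "float sample" }
func (it *iterator) atHistogram() string { return "histogram sample" }

func main() {
	it := &iterator{types: []valueType{valFloat, valHistogram}}
	// The switch guarantees the accessor always matches the current type,
	// so a type-mismatch panic inside an accessor indicates a coding error.
	for t := it.next(); t != valNone; t = it.next() {
		switch t {
		case valFloat:
			fmt.Println(it.at())
		case valHistogram:
			fmt.Println(it.atHistogram())
		}
	}
}
```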
	batch.PointerValues[j] = unsafe.Pointer(fh)
}
default:
	panic(fmt.Sprintf("invalid chunk encoding %v", valueType))
I can add an error return instead here, with marginal impact on performance:
cpu: AMD Ryzen 7 4700U with Radeon Graphics
│ benchmark-batch-pointers.txt │ benchmark-batch-error.txt │
│ sec/op │ sec/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 4.731m ± 1% 4.856m ± 1% +2.63% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 14.35m ± 1% 14.47m ± 2% ~ (p=0.165 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 484.3µ ± 1% 500.6µ ± 1% +3.35% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 1.455m ± 1% 1.471m ± 1% +1.16% (p=0.023 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 12.24µ ± 3% 12.30µ ± 3% ~ (p=0.592 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 32.18µ ± 2% 32.74µ ± 2% +1.72% (p=0.015 n=10)
geomean 515.8µ 524.6µ +1.70%
│ benchmark-batch-pointers.txt │ benchmark-batch-error.txt │
│ B/op │ B/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 73.43Ki ± 0% 73.46Ki ± 0% +0.04% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 282.9Ki ± 0% 283.0Ki ± 0% +0.03% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 9.586Ki ± 0% 9.617Ki ± 0% +0.33% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 34.24Ki ± 0% 34.34Ki ± 0% +0.27% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 2.062Ki ± 0% 2.094Ki ± 0% +1.52% (p=0.000 n=10)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 4.859Ki ± 0% 4.953Ki ± 0% +1.93% (p=0.000 n=10)
geomean 20.22Ki 20.36Ki +0.68%
│ benchmark-batch-pointers.txt │ benchmark-batch-error.txt │
│ allocs/op │ allocs/op vs base │
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1-8 1.013k ± 0% 1.013k ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3-8 3.023k ± 0% 3.023k ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1-8 113.0 ± 0% 113.0 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3-8 323.0 ± 0% 323.0 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1-8 14.00 ± 0% 14.00 ± 0% ~ (p=1.000 n=10) ¹
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3-8 26.00 ± 0% 26.00 ± 0% ~ (p=1.000 n=10) ¹
geomean 185.5 185.5 +0.00%
On the other hand, we're testing whether we've covered all enumeration values, which I think should be done by a linter.
	break
}
if vt != valueType {
	panic(fmt.Sprintf("chunk encoding expected to be consistent in chunk start %v now %v", valueType, vt))
Looking at Prometheus chunk iterator implementations: https://github.com/prometheus/prometheus/tree/main/tsdb/chunkenc . Those either return None or a hardcoded chunk specific value, so this should really never happen.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
LGTM. I just left a final question.
pkg/querier/batch/stream.go
b.Timestamps[b.Index], b.Values[b.Index] = bs.at()
b.Index++
for t := bs.hasNext(); t != chunkenc.ValNone; t = bs.hasNext() {
	populate(bs, t)
	b.Length++
Why do we need to increase b.Length here? It should be unnecessary because (a) what we want to increase is b.Index, but that's already done by populate(), and (b) b.Length is overridden later by:

// The Index is the place at which new sample
// has to be appended, hence it tells the length.
b.Length = b.Index
All checks pass if I remove it. Not sure why it was there in the first place.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
What this PR does
The batch merge iterator is the default iterator used to let the querier merge the chunks received from store-gateways.
The chunk.Batch and stream merge code is optimized to the extreme. Adding histograms into the Batch with any additional indirection destroys the performance, hence the solution proposed in this PR, which adds zero additional allocations when the code deals with float series.
Which issue(s) this PR fixes or relates to
Related to #3478
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]