-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alternate implementation of index-header reader that does not use mmap #3639
Conversation
The CI build for this PR has been stuck in "Expected - waiting for status to be reported" for two days now. However, there's no CI build queued according to GitHub Actions. The internet suggests closing and reopening the PR might get things going again, so I'm going to try that... |
63ed836
to
1a25bbd
Compare
1a25bbd
to
b214245
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting a (very) partial review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed all the logic except the tests. You did an impressive work! 👏 The code is well documented, and you kept the Prometheus code as much as possible so doing a side-by-side diff of the logic has been trivial. Very good job!
I haven't found any issue (in a couple of cases actually depend on the answer to some questions). I left few comments and questions. Thanks!
"github.com/pkg/errors" | ||
) | ||
|
||
const bufferSize = 4096 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very small. Thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any justification for this number. Since they're being pooled, I suppose it doesn't hurt to make them larger. My only concern would be if increasing the size of this buffer means os.File.Read()
reads more data every time it's invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the questions we should answer running a benchmark on a real TSDB block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment saying that this number was chosen without any justification, so future readers or maintainers would not hesitate to tune it if benchmarks show that it can be improved. (Otherwise anyone reading this might think "oh, there probably was a good reason to have this number")
// by the contents and the expected checksum. This method checks the CRC of the content and will | ||
// return an error Decbuf if it does not match the expected CRC. | ||
func (df *DecbufFactory) NewDecbufAtChecked(offset int, table *crc32.Table) Decbuf { | ||
f, err := os.Open(df.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[non blocking] My biggest concern is that we open a file description each time. It's called in called in hot path, like Symbols.Lookup()
and Symbols.ReverseLookup()
. Not a blocker for this PR, but I would like to hear your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been worrying about this but after deploying a test image in a dev cluster, opening the file doesn't seem to account for much CPU time (according to phlare). The zone with this index-header reader enabled seems to be about as fast as the existing mmap implementation as far as I can tell.
If it turned out to be an issue, I was thinking about maintaining a (small) pool of file handles for each index-header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, maybe it is significant. Looking more closely at the profile: Symbols.Lookup()
is about 0.27% of total time and os.OpenFile
called by Symbols.Lookup()
is 0.12% of total time (based on the load test that we've been running). So I guess 1/2 the time spent by the symbol reader is opening the file even if it's not a large part of time spent overall. I'll add pooling of file handles to the list of future performance improvements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add pooling of file handles to the list of future performance improvements.
👍
return []byte{} | ||
} | ||
|
||
return b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to directly return b
? Could the underlying buffer be reused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safe to return and use until the next read of the underlying FileReader
. After that, the bytes are invalidated. FileReader.Skip()
(via bufio.Reader.Discard()
) is guaranteed not to read from disk as long as n
is less than the amount of data buffered. Because of the Peek(n)
call above, we know that at least n
bytes are buffered.
I was wondering about this today and was in the process of writing a unit test to verify. I'll push that test along with other code review changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed an instance where we were doing this efea391
All other uses of UvarintBytes
are in order to skip bytes or immediately yoloString
d and compared with another string (such as the binary search Symbols
does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safe to return and use until the next read of the underlying FileReader
I would add a comment to this function to explain that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs for this method mention it but I'll make it more prominent.
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
… initial position in ResetAt.
fadvise is not available on other platforms (including Darwin). fadvise is only used in benchmarks, so it's safe to ignore it on other platforms (on the assumption that we'll only use results from Linux hosts for the affected benchmarks).
This uncovered a small issue in Uvarint64(): it was not able to decode values that encode to more than 8 bytes.
These didn't uncover any issues, so I'm not sure if there's value in keeping them or not.
… between the two.
…capture errors that occur during cleanup. This is inspired by dskit's runutil.CloseWithErrCapture().
This improves the performance of LabelNames() by 10-17%. name old time/op new time/op delta LabelNames/20Names100Values-10 1.71µs ± 1% 1.50µs ± 1% -11.93% (p=0.008 n=5+5) LabelNames/20Names500Values-10 1.70µs ± 2% 1.50µs ± 2% -12.22% (p=0.008 n=5+5) LabelNames/20Names1000Values-10 1.69µs ± 1% 1.48µs ± 2% -12.55% (p=0.008 n=5+5) LabelNames/50Names100Values-10 4.42µs ± 1% 3.77µs ± 1% -14.60% (p=0.008 n=5+5) LabelNames/50Names500Values-10 4.55µs ± 3% 3.75µs ± 0% -17.50% (p=0.008 n=5+5) LabelNames/50Names1000Values-10 4.50µs ± 0% 3.72µs ± 2% -17.18% (p=0.008 n=5+5) LabelNames/100Names100Values-10 9.77µs ± 1% 8.22µs ± 1% -15.93% (p=0.008 n=5+5) LabelNames/100Names500Values-10 9.72µs ± 1% 8.19µs ± 1% -15.69% (p=0.008 n=5+5) LabelNames/100Names1000Values-10 9.72µs ± 1% 8.35µs ± 2% -14.11% (p=0.008 n=5+5) LabelNames/200Names100Values-10 21.0µs ± 0% 17.6µs ± 1% -16.16% (p=0.008 n=5+5) LabelNames/200Names500Values-10 21.0µs ± 0% 17.7µs ± 2% -15.91% (p=0.008 n=5+5) LabelNames/200Names1000Values-10 21.0µs ± 0% 17.6µs ± 1% -16.00% (p=0.008 n=5+5)
b219c44
to
66e9a92
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing job! I'm approving it. The logic looks very solid to me and, in the spirit outlined in the PR description, I would merge it by EOW so that code will be rolled out (disabled) next week to dev. However, I would wait for @colega review too and block the merge if he has any concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic looks solid, I think we can merge this.
I've left some nitpicks, mostly style-related and also comments to revisit with performance-improvements.
There's a bunch of hard to follow code that was brought from Prometheus where it was already hard to follow, so I'm not commenting on that.
func (s *Symbols) reverseLookup(sym string, d stream_encoding.Decbuf) (uint32, error) { | ||
i := sort.Search(len(s.offsets), func(i int) bool { | ||
d.ResetAt(s.offsets[i]) | ||
return yoloString(d.UvarintBytes()) > sym |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to YOLO just for a comparison: golang/go@69cd91a
return yoloString(d.UvarintBytes()) > sym | |
return string(d.UvarintBytes()) > sym |
Although that test only checks equality, I tested with comparison and it also doesn't allocate:
func TestCompareTempString(t *testing.T) {
s := "foo"
b := []byte(s)
n := testing.AllocsPerRun(1000, func() {
if string(b) > s {
t.Fatalf("strings are not equal: '%v' and '%v'", string(b), s)
}
if string(b) < s {
t.Fatalf("strings are not equal: '%v' and '%v'", string(b), s)
}
})
if n != 0 {
t.Fatalf("want 0 allocs, got %v", n)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I'll make the change to use string
everywhere we're just doing a comparison.
stream_encoding "github.com/grafana/mimir/pkg/storegateway/indexheader/encoding" | ||
stream_index "github.com/grafana/mimir/pkg/storegateway/indexheader/index" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, this package aliases should be streamencoding
and streamindex
.
if r.version != BinaryFormatV1 { | ||
return nil, fmt.Errorf("unknown index-header file version %d", r.version) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but this should be checked right after reading the version. We can't assume that in future versions the index version would be the next thing to read.
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
#3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly.
#3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly.
* Use slices.Sort instead of sort.Strings or sort.Ints. #3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly. * Remove unnecessary comment. * Rename benchmarks file. * Add benchmark for BinaryReader.LabelNames(). * Add benchmark for LabelValues(). * Modify mergeExemplarQueryResponses benchmark to exercise merging exemplars with different sets of labels.
… mmap (grafana#3639) ## What this PR does This adds an alternate implementation of the store gateway's index-header reader that does not use mmap. The motivations for this change and references are covered in grafana#3465. ## Changes The organization of the changes is described below. Some knowledge of the current architecture of the store-gateway is assumed. All paths are relative to `pkg/storegateway`. ### Encoding components * `indexheader/encoding/reader.go/FileReader`: this is the lowest level part of this change. This is a buffering wrapper around file operations that provides some convenience methods for seeking, reading, peeking. * `indexheader/encoding/encoding.go/Decbuf`: this is a copy / modification of the equivalent Prometheus file. It includes methods for reading integers, strings, and bytes from binary files using a `FileReader` instance. This is the important part: it uses standard go file operations that the goruntime knows how to schedule if they block. Compare this to mmap which causes invisible blocking operations that hang the entire store-gateway. * `indexheader/encoding/factory.go/DecbufFactory`: this code is responsible for creating new `Decbuf` instances in a variety of ways. It is also responsible for cleaning up after `Decbuf` instances are no longer needed. This is because it pools the underlying `bufio.Reader`s used. ### TSDB index components * `indexheader/index/symbols.go/Symbols`: this is a copy/modification of the equivalent Prometheus file. It includes a modified `Symbol` reader that works with our `DecbufFactory` (which in turn uses standard go file operations). It also includes a bulk API for reverse lookups since this is required by the `StreamBinaryReader` * `indexheader/index/positings.go/PostingOffsetTable`: an interface abstracting the differences between how index v1 postings are looked up vs index v2 postings. The existing `BinaryReader` used a bunch of `if` statements to pick how to work on postings. We feel this is cleaner. This file contains the v1 and v2 implementations of this interface using largely the same logic as the existing `BinaryReader`. ### New index-header `Reader` * `indexheader/stream_binary_reader.go/StreamBinaryReader`: the alternate implementation of the `Reader` interface that uses all the previously described file-based logic for reading index-headers from disk. The interesting part of this class is the logic run when instances are first created: the index-header table-of-contents is loaded, symbols are loaded and validated, postings are loaded and validated. * `indexheader/reader_pool.go/ReaderPool`: there are some limited changes here to allow the `StreamBinaryReader` to be used when index-header `Reader`s are lazy loaded or eagerly loaded. Co-authored-by: Steve Simpson <steve.simpson@grafana.com> Co-authored-by: Charles Korn <charles.korn@grafana.com> Co-authored-by: Nick Pillitteri <nick.pillitteri@grafana.com>
…fana#3690) * Use slices.Sort instead of sort.Strings or sort.Ints. grafana#3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly. * Remove unnecessary comment. * Rename benchmarks file. * Add benchmark for BinaryReader.LabelNames(). * Add benchmark for LabelValues(). * Modify mergeExemplarQueryResponses benchmark to exercise merging exemplars with different sets of labels.
What this PR does
This PR adds an alternate implementation of the store gateway's index-header reader that does not use mmap. The motivations for this change and references are covered in #3465.
cc: @56quarters, @stevesg
Which issue(s) this PR fixes or relates to
#3465
Notes to reviewers
After review is finished, DO NOT MERGE as-is. @charleskorn and @56quarters will handle fixing up the squashed message.
The most important parts of this PR to examine:
We expect the performance of this implementation to not be equivalent to the existing mmap implementation at first. The goal of this PR is not to merge a perfect replacement that we'll immediately be able to switch to in production. Instead, we'd like to get things into main as they are so that we can iterate on performance, making small changes, based on further testing.
Changes
There's a lot of commits here and it doesn't make sense to review them individually. However, there's also a lot of code here. The organization of our changes is described below. Some knowledge of the current architecture of the store-gateway is assumed.
(note that all paths are relative to
pkg/storegateway
)Encoding components
indexheader/encoding/reader.go/FileReader
: this is the lowest level part of this change. This is a buffering wrapper around file operations that provides some convenience methods for seeking, reading, peeking.indexheader/encoding/encoding.go/Decbuf
: this is a copy / modification of the equivalent Prometheus file. It includes methods for reading integers, strings, and bytes from binary files using aFileReader
instance. This is the important part: it uses standard go file operations that the goruntime knows how to schedule if they block. Compare this to mmap which causes invisible blocking operations that hang the entire store-gateway.indexheader/encoding/factory.go/DecbufFactory
: this code is responsible for creating newDecbuf
instances in a variety of ways. It is also responsible for cleaning up afterDecbuf
instances are no longer needed. This is because it pools the underlyingbufio.Reader
s used.TSDB index components
indexheader/index/symbols.go/Symbols
: this is a copy/modification of the equivalent Prometheus file. It includes a modifiedSymbol
reader that works with ourDecbufFactory
(which in turn uses standard go file operations). It also includes a bulk API for reverse lookups since this is required by theStreamBinaryReader
indexheader/index/positings.go/PostingOffsetTable
: an interface abstracting the differences between how index v1 postings are looked up vs index v2 postings. The existingBinaryReader
used a bunch ofif
statements to pick how to work on postings. We feel this is cleaner. This file contains the v1 and v2 implementations of this interface using largely the same logic as the existingBinaryReader
.New index-header
Reader
indexheader/stream_binary_reader.go/StreamBinaryReader
: the alternate implementation of theReader
interface that uses all the previously described file-based logic for reading index-headers from disk. The interesting part of this class is the logic run when instances are first created: the index-header table-of-contents is loaded, symbols are loaded and validated, postings are loaded and validated.indexheader/reader_pool.go/ReaderPool
: there are some limited changes here to allow theStreamBinaryReader
to be used when index-headerReader
s are lazy loaded or eagerly loaded.Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]