Add alternate implementation of index-header reader that does not use mmap #3639

charleskorn · 2022-12-04T22:56:50Z

What this PR does

This PR adds an alternate implementation of the store gateway's index-header reader that does not use mmap. The motivations for this change and references are covered in #3465.

cc: @56quarters, @stevesg

Which issue(s) this PR fixes or relates to

#3465

Notes to reviewers

After review is finished, DO NOT MERGE as-is. @charleskorn and @56quarters will handle fixing up the squashed message.

The most important parts of this PR to examine:

When the experimental streaming index-header reader is disabled, everything should work exactly as it does today.
There should be nothing fundamentally wrong with this architecture that will prevent us from switching over to it entirely in the future.

We expect the performance of this implementation to not be equivalent to the existing mmap implementation at first. The goal of this PR is not to merge a perfect replacement that we'll immediately be able to switch to in production. Instead, we'd like to get things into main as they are so that we can iterate on performance, making small changes, based on further testing.

Changes

There's a lot of commits here and it doesn't make sense to review them individually. However, there's also a lot of code here. The organization of our changes is described below. Some knowledge of the current architecture of the store-gateway is assumed.

(note that all paths are relative to pkg/storegateway)

Encoding components

indexheader/encoding/reader.go/FileReader: this is the lowest level part of this change. This is a buffering wrapper around file operations that provides some convenience methods for seeking, reading, peeking.
indexheader/encoding/encoding.go/Decbuf: this is a copy / modification of the equivalent Prometheus file. It includes methods for reading integers, strings, and bytes from binary files using a FileReader instance. This is the important part: it uses standard go file operations that the goruntime knows how to schedule if they block. Compare this to mmap which causes invisible blocking operations that hang the entire store-gateway.
indexheader/encoding/factory.go/DecbufFactory: this code is responsible for creating new Decbuf instances in a variety of ways. It is also responsible for cleaning up after Decbuf instances are no longer needed. This is because it pools the underlying bufio.Readers used.

TSDB index components

indexheader/index/symbols.go/Symbols: this is a copy/modification of the equivalent Prometheus file. It includes a modified Symbol reader that works with our DecbufFactory (which in turn uses standard go file operations). It also includes a bulk API for reverse lookups since this is required by the StreamBinaryReader
indexheader/index/positings.go/PostingOffsetTable: an interface abstracting the differences between how index v1 postings are looked up vs index v2 postings. The existing BinaryReader used a bunch of if statements to pick how to work on postings. We feel this is cleaner. This file contains the v1 and v2 implementations of this interface using largely the same logic as the existing BinaryReader.

New index-header `Reader`

indexheader/stream_binary_reader.go/StreamBinaryReader: the alternate implementation of the Reader interface that uses all the previously described file-based logic for reading index-headers from disk. The interesting part of this class is the logic run when instances are first created: the index-header table-of-contents is loaded, symbols are loaded and validated, postings are loaded and validated.
indexheader/reader_pool.go/ReaderPool: there are some limited changes here to allow the StreamBinaryReader to be used when index-header Readers are lazy loaded or eagerly loaded.

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

charleskorn · 2022-12-07T06:10:04Z

The CI build for this PR has been stuck in "Expected - waiting for status to be reported" for two days now. However, there's no CI build queued according to GitHub Actions.

The internet suggests closing and reopening the PR might get things going again, so I'm going to try that...

colega

Submitting a (very) partial review

pkg/storegateway/indexheader/encoding/reader.go

pkg/storegateway/indexheader/index/postings.go

pracucci

I reviewed all the logic except the tests. You did an impressive work! 👏 The code is well documented, and you kept the Prometheus code as much as possible so doing a side-by-side diff of the logic has been trivial. Very good job!

I haven't found any issue (in a couple of cases actually depend on the answer to some questions). I left few comments and questions. Thanks!

CHANGELOG.md

pkg/storegateway/indexheader/stream_binary_reader.go

pkg/storegateway/indexheader/encoding/encoding.go

pkg/storegateway/indexheader/index/postings.go

pracucci · 2022-12-07T18:19:48Z

pkg/storegateway/indexheader/encoding/factory.go

+	"github.com/pkg/errors"
+)
+
+const bufferSize = 4096


This looks very small. Thoughts on this?

I don't have any justification for this number. Since they're being pooled, I suppose it doesn't hurt to make them larger. My only concern would be if increasing the size of this buffer means os.File.Read() reads more data every time it's invoked.

These are the questions we should answer running a benchmark on a real TSDB block.

I would add a comment saying that this number was chosen without any justification, so future readers or maintainers would not hesitate to tune it if benchmarks show that it can be improved. (Otherwise anyone reading this might think "oh, there probably was a good reason to have this number")

pracucci · 2022-12-07T18:30:50Z

pkg/storegateway/indexheader/encoding/factory.go

+// by the contents and the expected checksum. This method checks the CRC of the content and will
+// return an error Decbuf if it does not match the expected CRC.
+func (df *DecbufFactory) NewDecbufAtChecked(offset int, table *crc32.Table) Decbuf {
+	f, err := os.Open(df.path)


[non blocking] My biggest concern is that we open a file description each time. It's called in called in hot path, like Symbols.Lookup() and Symbols.ReverseLookup(). Not a blocker for this PR, but I would like to hear your thoughts on this.

I've been worrying about this but after deploying a test image in a dev cluster, opening the file doesn't seem to account for much CPU time (according to phlare). The zone with this index-header reader enabled seems to be about as fast as the existing mmap implementation as far as I can tell.

If it turned out to be an issue, I was thinking about maintaining a (small) pool of file handles for each index-header.

Actually, maybe it is significant. Looking more closely at the profile: Symbols.Lookup() is about 0.27% of total time and os.OpenFile called by Symbols.Lookup() is 0.12% of total time (based on the load test that we've been running). So I guess 1/2 the time spent by the symbol reader is opening the file even if it's not a large part of time spent overall. I'll add pooling of file handles to the list of future performance improvements.

I'll add pooling of file handles to the list of future performance improvements.

👍

pracucci · 2022-12-07T18:39:37Z

pkg/storegateway/indexheader/encoding/encoding.go

+		return []byte{}
+	}
+
+	return b


Is it safe to directly return b? Could the underlying buffer be reused?

It's safe to return and use until the next read of the underlying FileReader. After that, the bytes are invalidated. FileReader.Skip() (via bufio.Reader.Discard()) is guaranteed not to read from disk as long as n is less than the amount of data buffered. Because of the Peek(n) call above, we know that at least n bytes are buffered.

I was wondering about this today and was in the process of writing a unit test to verify. I'll push that test along with other code review changes.

Fixed an instance where we were doing this efea391

All other uses of UvarintBytes are in order to skip bytes or immediately yoloStringd and compared with another string (such as the binary search Symbols does).

It's safe to return and use until the next read of the underlying FileReader

I would add a comment to this function to explain that.

The docs for this method mention it but I'll make it more prominent.

pkg/storegateway/indexheader/encoding/reader.go

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

… initial position in ResetAt.

fadvise is not available on other platforms (including Darwin). fadvise is only used in benchmarks, so it's safe to ignore it on other platforms (on the assumption that we'll only use results from Linux hosts for the affected benchmarks).

This uncovered a small issue in Uvarint64(): it was not able to decode values that encode to more than 8 bytes.

🤦

These didn't uncover any issues, so I'm not sure if there's value in keeping them or not.

… between the two.

…capture errors that occur during cleanup. This is inspired by dskit's runutil.CloseWithErrCapture().

… place.

This improves the performance of LabelNames() by 10-17%. name old time/op new time/op delta LabelNames/20Names100Values-10 1.71µs ± 1% 1.50µs ± 1% -11.93% (p=0.008 n=5+5) LabelNames/20Names500Values-10 1.70µs ± 2% 1.50µs ± 2% -12.22% (p=0.008 n=5+5) LabelNames/20Names1000Values-10 1.69µs ± 1% 1.48µs ± 2% -12.55% (p=0.008 n=5+5) LabelNames/50Names100Values-10 4.42µs ± 1% 3.77µs ± 1% -14.60% (p=0.008 n=5+5) LabelNames/50Names500Values-10 4.55µs ± 3% 3.75µs ± 0% -17.50% (p=0.008 n=5+5) LabelNames/50Names1000Values-10 4.50µs ± 0% 3.72µs ± 2% -17.18% (p=0.008 n=5+5) LabelNames/100Names100Values-10 9.77µs ± 1% 8.22µs ± 1% -15.93% (p=0.008 n=5+5) LabelNames/100Names500Values-10 9.72µs ± 1% 8.19µs ± 1% -15.69% (p=0.008 n=5+5) LabelNames/100Names1000Values-10 9.72µs ± 1% 8.35µs ± 2% -14.11% (p=0.008 n=5+5) LabelNames/200Names100Values-10 21.0µs ± 0% 17.6µs ± 1% -16.16% (p=0.008 n=5+5) LabelNames/200Names500Values-10 21.0µs ± 0% 17.7µs ± 2% -15.91% (p=0.008 n=5+5) LabelNames/200Names1000Values-10 21.0µs ± 0% 17.6µs ± 1% -16.00% (p=0.008 n=5+5)

…Reader.

…rwards

pracucci

Amazing job! I'm approving it. The logic looks very solid to me and, in the spirit outlined in the PR description, I would merge it by EOW so that code will be rolled out (disabled) next week to dev. However, I would wait for @colega review too and block the merge if he has any concern.

colega

Logic looks solid, I think we can merge this.

I've left some nitpicks, mostly style-related and also comments to revisit with performance-improvements.

There's a bunch of hard to follow code that was brought from Prometheus where it was already hard to follow, so I'm not commenting on that.

pkg/storegateway/indexheader/encoding/encoding.go

pkg/storegateway/indexheader/index/symbols.go

colega · 2022-12-09T10:20:32Z

pkg/storegateway/indexheader/index/symbols.go

+func (s *Symbols) reverseLookup(sym string, d stream_encoding.Decbuf) (uint32, error) {
+	i := sort.Search(len(s.offsets), func(i int) bool {
+		d.ResetAt(s.offsets[i])
+		return yoloString(d.UvarintBytes()) > sym


There's no need to YOLO just for a comparison: golang/go@69cd91a

Suggested change

return yoloString(d.UvarintBytes()) > sym

return string(d.UvarintBytes()) > sym

Although that test only checks equality, I tested with comparison and it also doesn't allocate:

func TestCompareTempString(t *testing.T) { s := "foo" b := []byte(s) n := testing.AllocsPerRun(1000, func() { if string(b) > s { t.Fatalf("strings are not equal: '%v' and '%v'", string(b), s) } if string(b) < s { t.Fatalf("strings are not equal: '%v' and '%v'", string(b), s) } }) if n != 0 { t.Fatalf("want 0 allocs, got %v", n) } }

Nice. I'll make the change to use string everywhere we're just doing a comparison.

colega · 2022-12-09T10:25:26Z

pkg/storegateway/indexheader/stream_binary_reader.go

+	stream_encoding "github.com/grafana/mimir/pkg/storegateway/indexheader/encoding"
+	stream_index "github.com/grafana/mimir/pkg/storegateway/indexheader/index"


Same as above, this package aliases should be streamencoding and streamindex.

colega · 2022-12-09T10:27:10Z

pkg/storegateway/indexheader/stream_binary_reader.go

+	if r.version != BinaryFormatV1 {
+		return nil, fmt.Errorf("unknown index-header file version %d", r.version)
+	}


Nit, but this should be checked right after reading the version. We can't assume that in future versions the index version would be the next thing to read.

pkg/storegateway/indexheader/stream_binary_reader.go

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

#3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly.

* Use slices.Sort instead of sort.Strings or sort.Ints. #3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly. * Remove unnecessary comment. * Rename benchmarks file. * Add benchmark for BinaryReader.LabelNames(). * Add benchmark for LabelValues(). * Modify mergeExemplarQueryResponses benchmark to exercise merging exemplars with different sets of labels.

… mmap (grafana#3639) ## What this PR does This adds an alternate implementation of the store gateway's index-header reader that does not use mmap. The motivations for this change and references are covered in grafana#3465. ## Changes The organization of the changes is described below. Some knowledge of the current architecture of the store-gateway is assumed. All paths are relative to `pkg/storegateway`. ### Encoding components * `indexheader/encoding/reader.go/FileReader`: this is the lowest level part of this change. This is a buffering wrapper around file operations that provides some convenience methods for seeking, reading, peeking. * `indexheader/encoding/encoding.go/Decbuf`: this is a copy / modification of the equivalent Prometheus file. It includes methods for reading integers, strings, and bytes from binary files using a `FileReader` instance. This is the important part: it uses standard go file operations that the goruntime knows how to schedule if they block. Compare this to mmap which causes invisible blocking operations that hang the entire store-gateway. * `indexheader/encoding/factory.go/DecbufFactory`: this code is responsible for creating new `Decbuf` instances in a variety of ways. It is also responsible for cleaning up after `Decbuf` instances are no longer needed. This is because it pools the underlying `bufio.Reader`s used. ### TSDB index components * `indexheader/index/symbols.go/Symbols`: this is a copy/modification of the equivalent Prometheus file. It includes a modified `Symbol` reader that works with our `DecbufFactory` (which in turn uses standard go file operations). It also includes a bulk API for reverse lookups since this is required by the `StreamBinaryReader` * `indexheader/index/positings.go/PostingOffsetTable`: an interface abstracting the differences between how index v1 postings are looked up vs index v2 postings. The existing `BinaryReader` used a bunch of `if` statements to pick how to work on postings. We feel this is cleaner. This file contains the v1 and v2 implementations of this interface using largely the same logic as the existing `BinaryReader`. ### New index-header `Reader` * `indexheader/stream_binary_reader.go/StreamBinaryReader`: the alternate implementation of the `Reader` interface that uses all the previously described file-based logic for reading index-headers from disk. The interesting part of this class is the logic run when instances are first created: the index-header table-of-contents is loaded, symbols are loaded and validated, postings are loaded and validated. * `indexheader/reader_pool.go/ReaderPool`: there are some limited changes here to allow the `StreamBinaryReader` to be used when index-header `Reader`s are lazy loaded or eagerly loaded. Co-authored-by: Steve Simpson <steve.simpson@grafana.com> Co-authored-by: Charles Korn <charles.korn@grafana.com> Co-authored-by: Nick Pillitteri <nick.pillitteri@grafana.com>

…fana#3690) * Use slices.Sort instead of sort.Strings or sort.Ints. grafana#3639 (comment), prometheus/prometheus#11054, and prometheus/prometheus#11318 all show substantial performance improvements when using slices.Sort over sort.Strings. I've also added a linting rule to catch usage of sort.Strings and sort.Ints in the future. (We don't currently use sort.Ints anywhere.) We could also replace sort.Float64s with slices.Sort, however, this requires more careful consideration as the documentation for slices.Sort warns that it may not handle NaN values correctly. * Remove unnecessary comment. * Rename benchmarks file. * Add benchmark for BinaryReader.LabelNames(). * Add benchmark for LabelValues(). * Modify mergeExemplarQueryResponses benchmark to exercise merging exemplars with different sets of labels.

charleskorn mentioned this pull request Dec 4, 2022

Stop using mmap in the store-gateway #3465

Closed

19 tasks

charleskorn closed this Dec 7, 2022

charleskorn reopened this Dec 7, 2022

charleskorn force-pushed the 56quarters/indexheader-mmap-removal branch from 63ed836 to 1a25bbd Compare December 7, 2022 06:17

56quarters force-pushed the 56quarters/indexheader-mmap-removal branch from 1a25bbd to b214245 Compare December 7, 2022 14:10

56quarters marked this pull request as ready for review December 7, 2022 14:47

56quarters requested review from a team as code owners December 7, 2022 14:47

colega reviewed Dec 7, 2022

View reviewed changes

pracucci self-requested a review December 7, 2022 16:33

pracucci reviewed Dec 7, 2022

View reviewed changes

stevesg and others added 18 commits December 8, 2022 16:12

WIP

56f388b

Skip tests meant to test mmap hangs

eb4b714

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Fix errant import from rebase

9092d9b

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Move free functions to streaming binary reader

63cdcda

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Remove PopulateBinaryReader Reader implementation

121334d

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Remove ReadBinaryReader Reader implementation

110376e

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Rough tests for StreamBinaryReader + notes about Symbols bug

82aa9fc

Use constants rather than magic values in Reset / ResetAt and respect…

ff2273d

… initial position in ResetAt.

Only use fadvise on Linux.

c5b9342

fadvise is not available on other platforms (including Darwin). fadvise is only used in benchmarks, so it's safe to ignore it on other platforms (on the assumption that we'll only use results from Linux hosts for the affected benchmarks).

Add tests for streaming Decbuf implementation.

3ef6dae

This uncovered a small issue in Uvarint64(): it was not able to decode values that encode to more than 8 bytes.

Call assertions with expected value first.

000e68e

🤦

Add fuzz tests for Decbuf.

eccae10

These didn't uncover any issues, so I'm not sure if there's value in keeping them or not.

Clarify variable names in test.

c8aea0b

Fix issue where symbols are read incorrectly from v1 index files.

9901fb8

Rename test methods to clarify their purpose.

57d8801

Move Reader and implementations to a separate file.

cba6034

Add tests for FileReader and BufReader, and make behaviour consistent…

9b16b87

… between the two.

Fix test broken in ce19084.

62cb8da

charleskorn added 12 commits December 8, 2022 16:14

Update comment to address PR feedback.

74bfb13

Introduce DecbufFactory.CloseWithErrCapture and use it everywhere to …

c716f3e

…capture errors that occur during cleanup. This is inspired by dskit's runutil.CloseWithErrCapture().

Don't bother trimming posting offsets if there's nothing to trim.

ed3fa17

Don't bother exporting LabelNameCount method that is only used in one…

a68fe25

… place.

Simplify logic in FileReader.ResetAt().

2054dbb

Add benchmark for LabelNames.

e064787

Update go.mod to reflect direct use of golang.org/x/exp.

7ac5979

Remove outdated comment.

b1e58e7

Clarify documentation for LabelNames.

26437b1

Add benchmark for loading an index-header from disk with StreamBinary…

b710f2f

…Reader.

Add explanation about offset calculation.

66e9a92

charleskorn force-pushed the 56quarters/indexheader-mmap-removal branch from b219c44 to 66e9a92 Compare December 8, 2022 05:18

aknuds1 added the component/store-gateway label Dec 8, 2022

Use UvarintStr instead of bytes since Decbuf is read immediately afte…

efea391

…rwards

pracucci approved these changes Dec 9, 2022

View reviewed changes

colega approved these changes Dec 9, 2022

View reviewed changes

Code review changes

5e880de

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

56quarters merged commit 0ae7fe7 into main Dec 9, 2022

56quarters deleted the 56quarters/indexheader-mmap-removal branch December 9, 2022 16:51

charleskorn mentioned this pull request Dec 12, 2022

Use slices.Sort() instead of sort.Strings() or sort.Ints() #3690

Merged

56quarters restored the 56quarters/indexheader-mmap-removal branch December 13, 2022 15:34

56quarters deleted the 56quarters/indexheader-mmap-removal branch December 13, 2022 15:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add alternate implementation of index-header reader that does not use mmap #3639

Add alternate implementation of index-header reader that does not use mmap #3639

charleskorn commented Dec 4, 2022 •

edited by 56quarters

Loading

charleskorn commented Dec 7, 2022

colega left a comment

pracucci left a comment

pracucci Dec 7, 2022

56quarters Dec 7, 2022

pracucci Dec 9, 2022

colega Dec 9, 2022 •

edited

Loading

pracucci Dec 7, 2022

56quarters Dec 7, 2022

56quarters Dec 7, 2022 •

edited

Loading

pracucci Dec 9, 2022

pracucci Dec 7, 2022

56quarters Dec 7, 2022

56quarters Dec 8, 2022

pracucci Dec 9, 2022

56quarters Dec 9, 2022

pracucci left a comment

colega left a comment

colega Dec 9, 2022

56quarters Dec 9, 2022 •

edited

Loading

colega Dec 9, 2022

colega Dec 9, 2022

	return yoloString(d.UvarintBytes()) > sym
	return string(d.UvarintBytes()) > sym

		stream_encoding "github.com/grafana/mimir/pkg/storegateway/indexheader/encoding"
		stream_index "github.com/grafana/mimir/pkg/storegateway/indexheader/index"

Add alternate implementation of index-header reader that does not use mmap #3639

Add alternate implementation of index-header reader that does not use mmap #3639

Conversation

charleskorn commented Dec 4, 2022 • edited by 56quarters Loading

What this PR does

Which issue(s) this PR fixes or relates to

Notes to reviewers

Changes

Encoding components

TSDB index components

New index-header Reader

Checklist

charleskorn commented Dec 7, 2022

colega left a comment

Choose a reason for hiding this comment

pracucci left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colega Dec 9, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

56quarters Dec 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pracucci left a comment

Choose a reason for hiding this comment

colega left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

56quarters Dec 9, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

charleskorn commented Dec 4, 2022 •

edited by 56quarters

Loading

New index-header `Reader`

colega Dec 9, 2022 •

edited

Loading

56quarters Dec 7, 2022 •

edited

Loading

56quarters Dec 9, 2022 •

edited

Loading