Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add alternate implementation of index-header reader that does not use mmap #3639

Merged
merged 127 commits into from
Dec 9, 2022

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Dec 4, 2022

What this PR does

This PR adds an alternate implementation of the store gateway's index-header reader that does not use mmap. The motivations for this change and references are covered in #3465.

cc: @56quarters, @stevesg

Which issue(s) this PR fixes or relates to

#3465

Notes to reviewers

After review is finished, DO NOT MERGE as-is. @charleskorn and @56quarters will handle fixing up the squashed message.

The most important parts of this PR to examine:

  • When the experimental streaming index-header reader is disabled, everything should work exactly as it does today.
  • There should be nothing fundamentally wrong with this architecture that will prevent us from switching over to it entirely in the future.

We expect the performance of this implementation to not be equivalent to the existing mmap implementation at first. The goal of this PR is not to merge a perfect replacement that we'll immediately be able to switch to in production. Instead, we'd like to get things into main as they are so that we can iterate on performance, making small changes, based on further testing.

Changes

There's a lot of commits here and it doesn't make sense to review them individually. However, there's also a lot of code here. The organization of our changes is described below. Some knowledge of the current architecture of the store-gateway is assumed.

(note that all paths are relative to pkg/storegateway)

Encoding components

  • indexheader/encoding/reader.go/FileReader: this is the lowest level part of this change. This is a buffering wrapper around file operations that provides some convenience methods for seeking, reading, peeking.
  • indexheader/encoding/encoding.go/Decbuf: this is a copy / modification of the equivalent Prometheus file. It includes methods for reading integers, strings, and bytes from binary files using a FileReader instance. This is the important part: it uses standard go file operations that the goruntime knows how to schedule if they block. Compare this to mmap which causes invisible blocking operations that hang the entire store-gateway.
  • indexheader/encoding/factory.go/DecbufFactory: this code is responsible for creating new Decbuf instances in a variety of ways. It is also responsible for cleaning up after Decbuf instances are no longer needed. This is because it pools the underlying bufio.Readers used.

TSDB index components

  • indexheader/index/symbols.go/Symbols: this is a copy/modification of the equivalent Prometheus file. It includes a modified Symbol reader that works with our DecbufFactory (which in turn uses standard go file operations). It also includes a bulk API for reverse lookups since this is required by the StreamBinaryReader
  • indexheader/index/positings.go/PostingOffsetTable: an interface abstracting the differences between how index v1 postings are looked up vs index v2 postings. The existing BinaryReader used a bunch of if statements to pick how to work on postings. We feel this is cleaner. This file contains the v1 and v2 implementations of this interface using largely the same logic as the existing BinaryReader.

New index-header Reader

  • indexheader/stream_binary_reader.go/StreamBinaryReader: the alternate implementation of the Reader interface that uses all the previously described file-based logic for reading index-headers from disk. The interesting part of this class is the logic run when instances are first created: the index-header table-of-contents is loaded, symbols are loaded and validated, postings are loaded and validated.
  • indexheader/reader_pool.go/ReaderPool: there are some limited changes here to allow the StreamBinaryReader to be used when index-header Readers are lazy loaded or eagerly loaded.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@charleskorn
Copy link
Contributor Author

The CI build for this PR has been stuck in "Expected - waiting for status to be reported" for two days now. However, there's no CI build queued according to GitHub Actions.

The internet suggests closing and reopening the PR might get things going again, so I'm going to try that...

@charleskorn charleskorn closed this Dec 7, 2022
@charleskorn charleskorn reopened this Dec 7, 2022
@charleskorn charleskorn force-pushed the 56quarters/indexheader-mmap-removal branch from 63ed836 to 1a25bbd Compare December 7, 2022 06:17
@56quarters 56quarters force-pushed the 56quarters/indexheader-mmap-removal branch from 1a25bbd to b214245 Compare December 7, 2022 14:10
@56quarters 56quarters marked this pull request as ready for review December 7, 2022 14:47
@56quarters 56quarters requested review from a team as code owners December 7, 2022 14:47
Copy link
Contributor

@colega colega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting a (very) partial review

pkg/storegateway/indexheader/encoding/reader.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
@pracucci pracucci self-requested a review December 7, 2022 16:33
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed all the logic except the tests. You did an impressive work! 👏 The code is well documented, and you kept the Prometheus code as much as possible so doing a side-by-side diff of the logic has been trivial. Very good job!

I haven't found any issue (in a couple of cases actually depend on the answer to some questions). I left few comments and questions. Thanks!

CHANGELOG.md Show resolved Hide resolved
pkg/storegateway/indexheader/stream_binary_reader.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/stream_binary_reader.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/postings.go Outdated Show resolved Hide resolved
"github.com/pkg/errors"
)

const bufferSize = 4096
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very small. Thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any justification for this number. Since they're being pooled, I suppose it doesn't hurt to make them larger. My only concern would be if increasing the size of this buffer means os.File.Read() reads more data every time it's invoked.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the questions we should answer running a benchmark on a real TSDB block.

Copy link
Contributor

@colega colega Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment saying that this number was chosen without any justification, so future readers or maintainers would not hesitate to tune it if benchmarks show that it can be improved. (Otherwise anyone reading this might think "oh, there probably was a good reason to have this number")

// by the contents and the expected checksum. This method checks the CRC of the content and will
// return an error Decbuf if it does not match the expected CRC.
func (df *DecbufFactory) NewDecbufAtChecked(offset int, table *crc32.Table) Decbuf {
f, err := os.Open(df.path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non blocking] My biggest concern is that we open a file description each time. It's called in called in hot path, like Symbols.Lookup() and Symbols.ReverseLookup(). Not a blocker for this PR, but I would like to hear your thoughts on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been worrying about this but after deploying a test image in a dev cluster, opening the file doesn't seem to account for much CPU time (according to phlare). The zone with this index-header reader enabled seems to be about as fast as the existing mmap implementation as far as I can tell.

If it turned out to be an issue, I was thinking about maintaining a (small) pool of file handles for each index-header.

Copy link
Contributor

@56quarters 56quarters Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, maybe it is significant. Looking more closely at the profile: Symbols.Lookup() is about 0.27% of total time and os.OpenFile called by Symbols.Lookup() is 0.12% of total time (based on the load test that we've been running). So I guess 1/2 the time spent by the symbol reader is opening the file even if it's not a large part of time spent overall. I'll add pooling of file handles to the list of future performance improvements.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add pooling of file handles to the list of future performance improvements.

👍

return []byte{}
}

return b
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to directly return b? Could the underlying buffer be reused?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe to return and use until the next read of the underlying FileReader. After that, the bytes are invalidated. FileReader.Skip() (via bufio.Reader.Discard()) is guaranteed not to read from disk as long as n is less than the amount of data buffered. Because of the Peek(n) call above, we know that at least n bytes are buffered.

I was wondering about this today and was in the process of writing a unit test to verify. I'll push that test along with other code review changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed an instance where we were doing this efea391

All other uses of UvarintBytes are in order to skip bytes or immediately yoloStringd and compared with another string (such as the binary search Symbols does).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe to return and use until the next read of the underlying FileReader

I would add a comment to this function to explain that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for this method mention it but I'll make it more prominent.

pkg/storegateway/indexheader/encoding/reader.go Outdated Show resolved Hide resolved
stevesg and others added 18 commits December 8, 2022 16:12
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
fadvise is not available on other platforms (including Darwin).

fadvise is only used in benchmarks, so it's safe to ignore it on other
platforms (on the assumption that we'll only use results from Linux
hosts for the affected benchmarks).
This uncovered a small issue in Uvarint64(): it was not able to
decode values that encode to more than 8 bytes.
These didn't uncover any issues, so I'm not sure if there's value in
keeping them or not.
…capture errors that occur during cleanup.

This is inspired by dskit's runutil.CloseWithErrCapture().
This improves the performance of LabelNames() by 10-17%.

name                              old time/op  new time/op  delta
LabelNames/20Names100Values-10    1.71µs ± 1%  1.50µs ± 1%  -11.93%  (p=0.008 n=5+5)
LabelNames/20Names500Values-10    1.70µs ± 2%  1.50µs ± 2%  -12.22%  (p=0.008 n=5+5)
LabelNames/20Names1000Values-10   1.69µs ± 1%  1.48µs ± 2%  -12.55%  (p=0.008 n=5+5)
LabelNames/50Names100Values-10    4.42µs ± 1%  3.77µs ± 1%  -14.60%  (p=0.008 n=5+5)
LabelNames/50Names500Values-10    4.55µs ± 3%  3.75µs ± 0%  -17.50%  (p=0.008 n=5+5)
LabelNames/50Names1000Values-10   4.50µs ± 0%  3.72µs ± 2%  -17.18%  (p=0.008 n=5+5)
LabelNames/100Names100Values-10   9.77µs ± 1%  8.22µs ± 1%  -15.93%  (p=0.008 n=5+5)
LabelNames/100Names500Values-10   9.72µs ± 1%  8.19µs ± 1%  -15.69%  (p=0.008 n=5+5)
LabelNames/100Names1000Values-10  9.72µs ± 1%  8.35µs ± 2%  -14.11%  (p=0.008 n=5+5)
LabelNames/200Names100Values-10   21.0µs ± 0%  17.6µs ± 1%  -16.16%  (p=0.008 n=5+5)
LabelNames/200Names500Values-10   21.0µs ± 0%  17.7µs ± 2%  -15.91%  (p=0.008 n=5+5)
LabelNames/200Names1000Values-10  21.0µs ± 0%  17.6µs ± 1%  -16.00%  (p=0.008 n=5+5)
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing job! I'm approving it. The logic looks very solid to me and, in the spirit outlined in the PR description, I would merge it by EOW so that code will be rolled out (disabled) next week to dev. However, I would wait for @colega review too and block the merge if he has any concern.

Copy link
Contributor

@colega colega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks solid, I think we can merge this.

I've left some nitpicks, mostly style-related and also comments to revisit with performance-improvements.

There's a bunch of hard to follow code that was brought from Prometheus where it was already hard to follow, so I'm not commenting on that.

pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/index/symbols.go Show resolved Hide resolved
func (s *Symbols) reverseLookup(sym string, d stream_encoding.Decbuf) (uint32, error) {
i := sort.Search(len(s.offsets), func(i int) bool {
d.ResetAt(s.offsets[i])
return yoloString(d.UvarintBytes()) > sym
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to YOLO just for a comparison: golang/go@69cd91a

Suggested change
return yoloString(d.UvarintBytes()) > sym
return string(d.UvarintBytes()) > sym

Although that test only checks equality, I tested with comparison and it also doesn't allocate:

func TestCompareTempString(t *testing.T) {
	s := "foo"
	b := []byte(s)
	n := testing.AllocsPerRun(1000, func() {
		if string(b) > s {
			t.Fatalf("strings are not equal: '%v' and '%v'", string(b), s)
		}
		if string(b) < s {
			t.Fatalf("strings are not equal: '%v' and '%v'", string(b), s)
		}
	})
	if n != 0 {
		t.Fatalf("want 0 allocs, got %v", n)
	}
}

Copy link
Contributor

@56quarters 56quarters Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I'll make the change to use string everywhere we're just doing a comparison.

Comment on lines 22 to 23
stream_encoding "github.com/grafana/mimir/pkg/storegateway/indexheader/encoding"
stream_index "github.com/grafana/mimir/pkg/storegateway/indexheader/index"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, this package aliases should be streamencoding and streamindex.

Comment on lines +95 to +97
if r.version != BinaryFormatV1 {
return nil, fmt.Errorf("unknown index-header file version %d", r.version)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but this should be checked right after reading the version. We can't assume that in future versions the index version would be the next thing to read.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters merged commit 0ae7fe7 into main Dec 9, 2022
@56quarters 56quarters deleted the 56quarters/indexheader-mmap-removal branch December 9, 2022 16:51
charleskorn added a commit that referenced this pull request Dec 12, 2022
#3639 (comment),
prometheus/prometheus#11054, and
prometheus/prometheus#11318 all show
substantial performance improvements when using slices.Sort over
sort.Strings.

I've also added a linting rule to catch usage of sort.Strings and
sort.Ints in the future. (We don't currently use sort.Ints anywhere.)

We could also replace sort.Float64s with slices.Sort, however, this
requires more careful consideration as the documentation for
slices.Sort warns that it may not handle NaN values correctly.
@56quarters 56quarters restored the 56quarters/indexheader-mmap-removal branch December 13, 2022 15:34
@56quarters 56quarters deleted the 56quarters/indexheader-mmap-removal branch December 13, 2022 15:35
charleskorn added a commit that referenced this pull request Dec 15, 2022
#3639 (comment),
prometheus/prometheus#11054, and
prometheus/prometheus#11318 all show
substantial performance improvements when using slices.Sort over
sort.Strings.

I've also added a linting rule to catch usage of sort.Strings and
sort.Ints in the future. (We don't currently use sort.Ints anywhere.)

We could also replace sort.Float64s with slices.Sort, however, this
requires more careful consideration as the documentation for
slices.Sort warns that it may not handle NaN values correctly.
pracucci pushed a commit that referenced this pull request Dec 15, 2022
* Use slices.Sort instead of sort.Strings or sort.Ints.

#3639 (comment),
prometheus/prometheus#11054, and
prometheus/prometheus#11318 all show
substantial performance improvements when using slices.Sort over
sort.Strings.

I've also added a linting rule to catch usage of sort.Strings and
sort.Ints in the future. (We don't currently use sort.Ints anywhere.)

We could also replace sort.Float64s with slices.Sort, however, this
requires more careful consideration as the documentation for
slices.Sort warns that it may not handle NaN values correctly.

* Remove unnecessary comment.

* Rename benchmarks file.

* Add benchmark for BinaryReader.LabelNames().

* Add benchmark for LabelValues().

* Modify mergeExemplarQueryResponses benchmark to exercise merging exemplars with different sets of labels.
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
… mmap (grafana#3639)

## What this PR does

This adds an alternate implementation of the store gateway's index-header reader that does not use mmap. The motivations for this change and references are covered in grafana#3465.

## Changes

The organization of the changes is described below. Some knowledge of the current architecture of the store-gateway is assumed. All paths are relative to `pkg/storegateway`.

### Encoding components

* `indexheader/encoding/reader.go/FileReader`: this is the lowest level part of this change. This is a buffering wrapper around file operations that provides some convenience methods for seeking, reading, peeking.
* `indexheader/encoding/encoding.go/Decbuf`: this is a copy / modification of the equivalent Prometheus file. It includes methods for reading integers, strings, and bytes from binary files using a `FileReader` instance. This is the important part: it uses standard go file operations that the goruntime knows how to schedule if they block. Compare this to mmap which causes invisible blocking operations that hang the entire store-gateway.
* `indexheader/encoding/factory.go/DecbufFactory`: this code is responsible for creating new `Decbuf` instances in a variety of ways. It is also responsible for cleaning up after `Decbuf` instances are no longer needed. This is because it pools the underlying `bufio.Reader`s used.

### TSDB index components

* `indexheader/index/symbols.go/Symbols`: this is a copy/modification of the equivalent Prometheus file. It includes a modified `Symbol` reader that works with our `DecbufFactory` (which in turn uses standard go file operations). It also includes a bulk API for reverse lookups since this is required by the `StreamBinaryReader`
* `indexheader/index/positings.go/PostingOffsetTable`: an interface abstracting the differences between how index v1 postings are looked up vs index v2 postings. The existing `BinaryReader` used a bunch of `if` statements to pick how to work on postings. We feel this is cleaner. This file contains the v1 and v2 implementations of this interface using largely the same logic as the existing `BinaryReader`.

### New index-header `Reader`

* `indexheader/stream_binary_reader.go/StreamBinaryReader`: the alternate implementation of the `Reader` interface that uses all the previously described file-based logic for reading index-headers from disk. The interesting part of this class is the logic run when instances are first created: the index-header table-of-contents is loaded, symbols are loaded and validated, postings are loaded and validated.
* `indexheader/reader_pool.go/ReaderPool`: there are some limited changes here to allow the `StreamBinaryReader` to be used when index-header `Reader`s are lazy loaded or eagerly loaded.

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
Co-authored-by: Charles Korn <charles.korn@grafana.com>
Co-authored-by: Nick Pillitteri <nick.pillitteri@grafana.com>
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
…fana#3690)

* Use slices.Sort instead of sort.Strings or sort.Ints.

grafana#3639 (comment),
prometheus/prometheus#11054, and
prometheus/prometheus#11318 all show
substantial performance improvements when using slices.Sort over
sort.Strings.

I've also added a linting rule to catch usage of sort.Strings and
sort.Ints in the future. (We don't currently use sort.Ints anywhere.)

We could also replace sort.Float64s with slices.Sort, however, this
requires more careful consideration as the documentation for
slices.Sort warns that it may not handle NaN values correctly.

* Remove unnecessary comment.

* Rename benchmarks file.

* Add benchmark for BinaryReader.LabelNames().

* Add benchmark for LabelValues().

* Modify mergeExemplarQueryResponses benchmark to exercise merging exemplars with different sets of labels.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants