Update fileset index format to sorted string order with summaries and bloom filter #395
Conversation
Coverage decreased (-0.5%) to 77.591% when pulling 1c0864efbb7a757e1492ecff28cf51488934e160 on r/sorted-string-index-with-summaries-with-bloom-filter into c6b20ed on master.
Coverage decreased (-0.5%) to 77.532% when pulling 1c0864efbb7a757e1492ecff28cf51488934e160 on r/sorted-string-index-with-summaries-with-bloom-filter into c6b20ed on master.
persist/encoding/decoder.go
Outdated
@@ -42,3 +51,110 @@ type Decoder interface {
	// DecodeLogEntry decodes commit log entry
	DecodeLogEntry() (schema.LogEntry, error)
}

	// DecoderStream is a data stream that is read by the decoder,
This comment is a little confusing. It seems like it's saying that this interface is meant to be composable with other Reader interfaces, and I expected the constructor for DecoderStream to look like:
NewDecoderStream(b []byte, r io.Reader) DecoderStream {}
but instead it only accepts []byte, and the type of the internal reader is *bytes.Reader, which is not an interface.
How would you use this interface with a ReaderWithDigest interface?
NewDecoderStream(...) in this package is just a default offering for users who don't need to provide the reader themselves. There is a private newDecoderStream(...) in the persist/fs package for that use case. This way, users who already have just a byte slice can still easily pass a buffer into the msgpack decoder.
@@ -141,13 +153,45 @@ func (dec *decoder) decodeIndexInfo() schema.IndexInfo {
	indexInfo.Start = dec.decodeVarint()
	indexInfo.BlockSize = dec.decodeVarint()
	indexInfo.Entries = dec.decodeVarint()
	indexInfo.MajorVersion = dec.decodeVarint()
	indexInfo.Summaries = dec.decodeIndexSummariesInfo()
	indexInfo.BloomFilter = dec.decodeIndexBloomFilterInfo()
Index files are per shard/block, so we have a bloom filter per shard/block, right?
Correct yup.
@@ -372,9 +360,11 @@ func TestReadCommitLogMissingMetadata(t *testing.T) {
	for i := 0; i < 200; i++ {
		willNotHaveMetadata := !(i%2 == 0)
		allSeries = append(allSeries, testSeries(uint64(i), "hax", uint32(i%100)))
		mockBitSet.indexTestReturn[uint(i)] = willNotHaveMetadata
		if willNotHaveMetadata {
			bitSet.Set(uint(i))
good call, I probably should have realized this myself
persist/fs/commitlog/writer.go
Outdated
@@ -90,7 +91,7 @@ type writer struct {
	chunkReserveHeader []byte
	buffer             *bufio.Writer
	sizeBuffer         []byte
	seen               bitSet
	seen               *xsets.BitSet
were you worried about the performance impact of having an interface here, or it just wasn't necessary?
I was thinking that this might get moved to m3x at some point. If so, to help avoid breaking changes when we add methods, it might be worth keeping the pretty simple data structures that don't need to get mocked out as just structs. So this choice was mainly about reducing cross-repo breaking changes. If it were a more complicated, logic-based structure then we probably would've wanted to mock it.
type mmapFileDesc struct {
	// file is the *os.File ref to store
	file **os.File
why is this a double ref?
This is so you can pass in the ptr you wish to have set.
	// file is the *os.File ref to store
	file **os.File
	// bytes is the []byte slice ref to store the mmap'd address
	bytes *[]byte
same here, confused why we're using a pointer to a "pointer type"
Yeah, this is so that the API call to mmapFiles(...) can populate the refs for you.
persist/fs/mmap_linux.go
Outdated
func (b *bitSetImpl) clearAll() {
	for i := range b.values {
		b.values[i] = 0
if err := syscall.Madvise(b, syscall.MADV_HUGEPAGE); err != nil {
A comment explaining this would be nice. I found these two links helpful personally:
- http://man7.org/linux/man-pages/man2/madvise.2.html
- https://www.kernel.org/doc/Documentation/vm/transhuge.txt
Seems like the gist of it is that you can reduce TLB misses by making the pages larger, at the risk of wasting a little extra memory. Is that basically it?
Also just out of curiosity, did this make a big difference in your benchmarks or is it just kind of a "best practices" thing?
Seems like the gist of it is that you can reduce TLB misses by making the pages larger, at the risk of wasting a little extra memory. Is that basically it?
That's it, yeah. Also, we're mmap'ing really large data files, and in the case of the index file we might use the summary table to jump close to a record but still need to scan a reasonable way forward to find the exact entry; the number of pages that page-fault matters when we do these scans.
I can add a comment, sounds good.
persist/fs/options.go
Outdated
@@ -30,6 +31,12 @@ import (
)

const (
	// defaultIndexSummariesPercent is the default percent to save summaries for when writing summary files
This comment could be a little clearer: "defaultIndexSummariesPercent is the default percent of series for which an entry will be written into the metadata summary", or something like that. It took me a while to figure out what this was until you explained it in person in the meeting.
For sure, will update the comment. Thanks for the note.
@@ -54,25 +64,55 @@ type writer struct {
	err error
}

type indexEntry struct {
Is there any benefit in optimizing the size of this struct?
From what I can tell, offsetForSummary never stores a value that is different than offset, so you could probably replace it with a bool. That doesn't actually change the size of the struct, but if you then replace size (or offset but size seems safer) with uint32 and re-order the fields, you can save 8 bytes per struct making it 40 bytes instead of 48 (if this is meaningful at all):
https://play.golang.org/p/dXGawT5YxL
Might be premature, I don't know how hot this is
Although it might make your sort faster too just by virtue of being able to fit more of them in the cache lines
The two offset values are actually different. One is the offset into the data file; offsetForSummary is the offset of this record in the index file that the summary entry will point to. I'll update these to be dataOffset and indexOffset perhaps.
Yeah, size could be reduced to a uint32 for sure; the offsets probably shouldn't be, in case we ever have a file >4GB.
I'm pretty sure the size of the struct won't benefit from just a single field being a uint32 though, due to alignment. Although if I make checksum a uint32 as well, then it seems it can help if I place the two fields together, but not if they're apart (which makes sense due to alignment):
https://play.golang.org/p/7RRrGG6y10
https://github.com/dominikh/go-tools/tree/master/cmd/structlayout might be helpful if you wanna go down this road
This one is optimized now; agreed we can always look at it further in future.
@@ -21,14 +21,18 @@
package fs
There is a lot going on in this file; I count 6 different file descriptors. Seems like now (or soon, with the subsequent PRs) would be a good time to write a few paragraphs about our new file formats (kind of like what we have in commitlog.md) and then link to that documentation at the top of this file. Otherwise, as someone unfamiliar with the codebase, it's pretty difficult to decipher the what/why from code that is just manipulating bytes on disk, if that makes sense.
Yup, sounds good. I'll write a storage.md and reference it from this module. And maybe a short paragraph in the source file itself.
+1, I found this helpful while reviewing:
┌─────────────────────┐
┌─────────────────────┐ │ Index File │ ┌─────────────────────┐
│ Summaries File │ │ (sorted by ID) │ │ Data File │
│ (sorted by ID) │ ├─────────────────────┤ ├─────────────────────┤
├─────────────────────┤ ┌────────▶│- Idx │ │List of: │
┌─────────────────────┐ │- Idx │ │ │- ID │ ┌────▶│ - Marker (16 bytes)│
│ Info File │ │- ID │ │ │- Size │ │ │ - ID │
├─────────────────────┤ │- Index Entry Offset ├───────┘ │- Checksum │ │ │ - Data (size bytes)│
│- Block Start │ └─────────────────────┘ │- Data Entry Offset ├────┘ └─────────────────────┘
│- Block Size │ └─────────────────────┘
│- Entries (Num) │
│- Major Version │ ┌─────────────────────┐
│- Summaries (Num) │ │ Bloom_Filter File │
│- BloomFilter (K/M) │ ├─────────────────────┤
└─────────────────────┘ │- N: # elems │
│- P: P(false +ve) │
└─────────────────────┘
Ha, thanks for that. I removed the URL from your comment and replaced it with the actual content though, to avoid leaking non-public assets.
lol cheers
persist/fs/write.go
Outdated
// Write the index entries and calculate the bloom filter
n, p := uint(w.currIdx), w.bloomFilterFalsePositivePercent
m, k := xsets.BloomFilterEstimate(n, p)
bloomFilter := xsets.NewBloomFilter(m, k)
Ok, interesting, so we don't even create the bloom filter until we're ready to flush to disk? I guess I thought we'd be building it as we saw things, but we don't need it until data has been flushed to disk, because while it's still in memory we can just check directly.
Right yeah. We'll experiment with how much time this takes to do sequentially all during a flush vs amortizing it during series insertions.
for i := range w.indexEntries {
	id := w.indexEntries[i].id.Data().Get()

	entry := schema.IndexEntry{
Kind of annoying that we have to do this conversion because of a single field difference, but I don't know if there is much that can be done about it.
Yeah, not too big a deal; it'll just be a memcopy to a stack-allocated piece of memory.
persist/fs/write.go
Outdated
	w.currIdx++

	return nil
}

func (w *writer) close() error {
func (w *writer) writeIndexEntriesSummariesBloomFilter() error {
Seems like this function does more than the name implies. Maybe "writeIndexEntriesAndIndexSummaryAndBloomFilter" or something. I know that's verbose, but the existing name makes it seem like it's writing one file when it's writing three.
Yeah I might break some of this up into smaller fns and have a second attempt at naming this.
+1, breaking into smaller fns would help readability.
Done.
persist/fs/write.go
Outdated
}

w.encoder.Reset()
if err := w.encoder.EncodeIndexInfo(info); err != nil {
	return err
}

if _, err := w.infoFdWithDigest.WriteBytes(w.encoder.Bytes()); err != nil {
_, err = w.infoFdWithDigest.Write(w.encoder.Bytes())
So the info about the index is stored in a separate file than the index itself?
Correct. Right now all files are either just repeated records or a single record, i.e. always a single type in one file. Probably worth continuing this for simplicity; in the future, if it's important enough, we can combine them. The thing is, there's data in the info file that you don't have ahead of time when you write the index file (i.e. the number of summaries), although we could probably do some math to calculate that rather than keep a counter.
persist/fs/write.go
Outdated
if err != nil {
	return err
}

info := schema.IndexInfo{
Should we have constructors for these msgpack structs? Seems like it would be easy to leave out a field by accident and corrupt the whole thing
There's usually only a single place where we ever construct these, and the unit tests are pretty rigorous. Since there's only one piece of code constructing these msgpack structs, I'd prefer the readability of setting fields explicitly over the safety of a constructor (where you don't necessarily know what each arg is when reading the code, and might pass the wrong local variable to the wrong argument). If tons of different places constructed these types then I would agree with you.
x/sets/bloom_filter.go
Outdated
// NewBloomFilter creates a new bloom filter that can represent
// m elements with k hashes.
func NewBloomFilter(m uint, k uint) *BloomFilter {
	if m < 1 {
I wonder if it should return an error if these values are invalid...might be annoying to add the error value everywhere though
Possibly, although I'm definitely trying to keep the generic data structures as simple as possible so they're more ergonomic to use. I think in this case keeping both bitset and bloomfilter error-free is more ergonomic. If there were other args that could truly screw things up, then I'd lean towards an error here. Overall a good comment, but just for ergonomics I think we should avoid returning an error from this API.
Looks good to me for the most part. My main feedback is that it would be really nice to document these new file formats and how they interact (even a few paragraphs) and then link to them from the write.go file.
Coverage decreased (-0.6%) to 77.45% when pulling ed3a584d8732004124338720e08da2ab02956b0c on r/sorted-string-index-with-summaries-with-bloom-filter into c6b20ed on master.
persist/fs/commitlog/reader.go
Outdated
@@ -501,7 +509,7 @@ func (r *reader) readInfo() (schema.LogInfo, error) {
	return emptyLogInfo, err
}
data.IncRef()
r.logDecoder.Reset(data.Get())
r.logDecoder.Reset(encoding.NewDecoderStream(data.Get()))
same here, should we pool DecoderStream somehow?
I can reuse the decoderstream for the reader, sure thing.
persist/fs/mmap_linux.go
Outdated
	return nil, err
}

if length < hugePageSize {
Not sure I follow, why not use a huge page for 1.5MB index file?
Well, we definitely don't want to use huge pages for anything less than 4096 bytes, or even "a few" 4096-byte pages in length; otherwise you cause memory fragmentation (it's a known problem with huge pages). Maybe I'll use >32KB, which is 8 pages.
Done.
persist/fs/mmap.go
Outdated
// close the ones that have been opened.
for _, desc := range files {
	if *desc.file != nil {
		(*desc.file).Close()
do you also need to munmap here?
Good call.
r := readerPool.get()
r, err := readerPool.get()
if err != nil {
	s.log.Errorf("unable to get reader from pool")
should this be bubbled up so we know that we've bootstrapped from partial data?
Sure thing, restructured this to bubble it up.
	filesetPathFromTime(shardDir, blockStart, digestFileSuffix): &digestFd,
}); err != nil {
	return err
}

r.infoFdWithDigest.Reset(infoFd)
r.indexFdWithDigest.Reset(indexFd)
r.dataFdWithDigest.Reset(dataFd)
r.digestFdWithDigestContents.Reset(digestFd)

defer func() {
	// NB(r): We don't need to keep these FDs open as we use these up front
	r.infoFdWithDigest.Close()
Would it be better to not retain these fds in the struct?
We don't actually retain the fds in the struct, we just retain the fd with digest wrappers so that we don't need to create them every time we open the file again (i.e. pooling the fd with digest wrappers).
Ah, makes sense. Would you mind putting a note in the struct about the lifecycle of the wrappers?
persist/fs/read.go
Outdated
	indexEntriesByOffsetAsc []schema.IndexEntry
}

type indexEntriesByOffsetAsc []schema.IndexEntry
nit: move indexEntriesByOffsetAsc and decoderStream to the bottom of the file
Sure thing, I'll move decoderStream to its own file and indexEntriesByOffsetAsc to the bottom.
	return err
func (r *reader) readIndex() error {
	r.decoder.Reset(r.indexDecoderStream)
	for i := 0; i < r.entries; i++ {
Looks like we're still keeping all of the index file in memory, is the plan to do this lazily in subsequent PRs?
Correct, however we'll implement this in the seeker and keep the reader as is. We'll just never use the reader outside needing to do batch jobs.
	bytesPool pool.CheckedBytesPool,
	decodingOpts msgpack.DecodingOptions,
) fs.FileSetReader
	opts fs.Options,
massive +1 for this change
r, err := readerPool.get()
if err != nil {
	s.log.Errorf("unable to get reader from pool")
	readersCh <- shardReaders{err: err}
do you need a readerPool.put(r) after this?
Ah, no, because the reader is nil here (the err was from trying to get it from the pool, hehe).
)
bytesPool := s.opts.ResultOptions().DatabaseBlockOptions().BytesPool()
readerPool := newReaderPool(func() (fs.FileSetReader, error) {
	return s.newReaderFn(bytesPool, s.fsopts)
Any reason to have this here instead of Options?
Yeah, you don't really want to allocate and keep readers around outside of the bootstrapping process, hence why it's created on demand each time.
I'll add a comment about this.
persist/fs/commitlog/reader.go
Outdated
@@ -270,7 +272,9 @@ func (r *reader) shutdown() {
func (r *reader) decoderLoop(inBuf <-chan decoderArg, outBuf chan<- readResponse) {
	decodingOpts := r.opts.FilesystemOptions().DecodingOptions()
nit: could you put this in a var block
	}
}

func TestDecoderStreamUnreadByteMultiple(t *testing.T) {
+1 for keeping this test.
Feels like you're having to test a lot of weird combinations of conditions; how would you feel about adding a few property tests for the decoder?
I'll open an issue for it, in the interest of expediting this change we'll look at this later.
Opened #403
persist/fs/read_decode_stream.go
Outdated
var (
	// Ensure readerDecoderStream implements encoding.DecoderStream
	_ encoding.DecoderStream = &readerDecoderStream{}
could you avoid this by changing the ctor to the following signature: func newReaderDecoderStream() encoding.DecoderStream {...}
Sounds good, done
	break
}

*desc.file = fd
pretty soon we'll be writing desc->file = fd; 🤣
Ha, aye
LGTM w/ some minor nits
This is the beginning of a set of changes to remove all block references from memory except those that are being written to and those that have been recently loaded as cached blocks due to reads.
This will write out the new file format and support reading it.
For now the seeker still loads the ID and offsets into memory, the next change will update it to load the summaries file and resolve IDs by reading the index entries on disk after searching the summaries for a jumping off point.