storage: Refactor block compaction to allow shard-splitting #2366
Conversation
sr := symbolsRewriter{
	profiles:  it,
	rewriters: make(map[BlockReader]*symdb.Rewriter, len(blocks)),
func newSymbolsRewriter(path string) *symbolsRewriter {
We now create rewriters lazily, as soon as the first row for a given block is seen. Rewriters are not shared across multiple output blocks.
WDYT?
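A minimal sketch of what the lazy creation could look like, reusing the map type from the diff above; the constructor name `symdb.NewRewriter` and the `s.symdb` / `b.Symbols()` accessors are assumptions for illustration, not the actual code in this PR:

```go
// Sketch: a rewriter is created the first time a row from a given input block
// is seen, and belongs to exactly one output block's symbolsRewriter.
func (s *symbolsRewriter) rewriterFor(b BlockReader) *symdb.Rewriter {
	r, ok := s.rewriters[b]
	if !ok {
		// Assumed constructor and arguments; the real signature may differ.
		r = symdb.NewRewriter(s.symdb, b.Symbols())
		s.rewriters[b] = r
	}
	return r
}
```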
I like it :) Each block has to have its own symdb
LGTM
Just need to finish writing some more robust tests, then we should be good. I spotted a bug where the start/end time comes from the original block rather than from the actual profiles, so I'll have to fix that.
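One possible shape of that fix, as a sketch only (the type and field names are hypothetical): widen the output block's time range from the profiles actually written to it instead of copying the source block's metadata.

```go
// blockMeta holds an output block's time range; the fields are assumed to be
// millisecond timestamps, mirroring block metadata used elsewhere.
type blockMeta struct {
	MinTime, MaxTime int64
}

// observe widens the range based on a profile's timestamp, so MinTime/MaxTime
// reflect the written profiles rather than the original block's range.
func (m *blockMeta) observe(ts int64) {
	if m.MinTime == 0 || ts < m.MinTime {
		m.MinTime = ts
	}
	if ts > m.MaxTime {
		m.MaxTime = ts
	}
}
```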
This refactors the current compaction function Compact to create a new CompactWithSplitting(..., shardCount uint64) function. The new function compacts and deduplicates data from the input blocks while splitting it into shardCount new blocks. Some output blocks might be empty and are not returned.
Right now the shard key is the labels fingerprint (the hash of all label pairs). I suspect we might want to add different sharding strategies in the future, after some experimentation.
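For illustration, the fingerprint-based shard key could look roughly like this; the modulo mapping is an assumption, not necessarily what the PR implements:

```go
import "github.com/prometheus/common/model"

// shardFor maps a series to an output shard using its labels fingerprint
// (the hash of all label pairs), so every row of a series lands in the same
// output block. A plain modulo is shown; other strategies are possible.
func shardFor(fp model.Fingerprint, shardCount uint64) uint64 {
	return uint64(fp) % shardCount
}
```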
To split into multiple blocks we no longer hook into the reader; instead, we read row by row and push each row to the correct output block, which turns out to be much cleaner. In particular, we no longer have to write row groups ourselves: I discovered that if you configure the parquet writer correctly, it automatically handles flushing row groups.
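A rough sketch of that row-by-row flow; the iterator and writer interfaces here are placeholders, not the actual types used in the PR:

```go
// Placeholder types for the sketch only.
type Row interface{ Fingerprint() uint64 }

type RowIterator interface {
	Next() bool
	At() Row
	Err() error
}

type RowWriter interface{ WriteRow(Row) error }

// compactRows reads merged, deduplicated rows one by one and appends each row
// to the writer of the shard its series maps to. Row groups are not flushed
// manually: the parquet writer is assumed to be configured with a row-group
// size limit and flushes them itself.
func compactRows(it RowIterator, writers []RowWriter, shardCount uint64) error {
	for it.Next() {
		row := it.At()
		shard := row.Fingerprint() % shardCount
		if err := writers[shard].WriteRow(row); err != nil {
			return err
		}
	}
	return it.Err()
}
```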
I'm still working on adding more tests now.
PS: I suggest reviewing this in split view.