Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Refactor block compaction to allow shard-splitting #2366

Merged
merged 11 commits into from Sep 7, 2023

Conversation

cyriltovena
Copy link
Contributor

@cyriltovena cyriltovena commented Sep 6, 2023

This refactors the current compaction function Compact to create a new CompactWithSplitting(...,shardCount uint64) function.

The new function compacts and deduplicates data away from input blocks while splitting it into shardCount shards new blocks. Some output blocks might be emtpy and not returned.

Right now the shards key is the labels finguerprint (hash of all labels pair). I suspect we might want to add different sharding strategy in the future after some experiment.

To split into multiple blocks we don't hook into the reader anymore, we read row by row and push to the correct output block, turns out to be much cleaner. Specially since we don't have to write row groups ourselves anymore I discovered that if you configure correctly the parquet writer it will automatically handle flushing row groups.

I'm still working on adding more tests now.

PS: I suggest to review this in split view.

@cyriltovena cyriltovena requested a review from a team as a code owner September 6, 2023 08:39
sr := symbolsRewriter{
profiles: it,
rewriters: make(map[BlockReader]*symdb.Rewriter, len(blocks)),
func newSymbolsRewriter(path string) *symbolsRewriter {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolesnikovae

We now creates rewriters lazily as soon as I see the first row for a given block. Rewriters are not shared across multiple output blocks.

WDYT ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it :) Each block has to have its own symdb

Copy link
Collaborator

@kolesnikovae kolesnikovae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cyriltovena cyriltovena marked this pull request as draft September 6, 2023 13:00
@cyriltovena
Copy link
Contributor Author

Just need to finish writing some more robust tests then we should be good. I spotted a bug where start/end time is from original block and not actual profiles so I'll have to fix this.

@cyriltovena cyriltovena added the storage Low level storage matters label Sep 7, 2023
@cyriltovena cyriltovena marked this pull request as ready for review September 7, 2023 09:52
@cyriltovena cyriltovena enabled auto-merge (squash) September 7, 2023 12:37
@cyriltovena cyriltovena merged commit 63561d7 into main Sep 7, 2023
15 checks passed
@cyriltovena cyriltovena deleted the feat/split-compaction branch September 7, 2023 12:47
@cyriltovena cyriltovena mentioned this pull request Sep 29, 2023
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
storage Low level storage matters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants