This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Sort Rows by Series when flushing to disk #800

Closed
cyriltovena opened this issue Jun 26, 2023 · 2 comments · Fixed by #803

@cyriltovena
Collaborator

Currently, when we flush all profiles to disk, we don't sort rows across row groups.

I believe we should be able to stream all row groups from

func (s *profileStore) writeRowGroups(path string, rowGroups []parquet.RowGroup) (n uint64, numRowGroups uint64, err error) {
and reorder them by SeriesID then timestamp.
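
A minimal sketch of what that reordering could look like, assuming each row group is already sorted internally by SeriesID then timestamp. It uses plain slices and a hypothetical `row` type rather than `parquet.RowGroup`, and the names `cursor`, `cursorHeap` and `mergeRowGroups` are made up for illustration. The idea is a k-way merge over a min-heap, so rows can be streamed out in global order without loading and re-sorting everything at once:

```go
package main

import (
	"container/heap"
	"fmt"
)

// row is a hypothetical, simplified stand-in for a profile row.
type row struct {
	SeriesID  uint32
	Timestamp int64
}

// less orders rows by SeriesID first, then by Timestamp.
func less(a, b row) bool {
	if a.SeriesID != b.SeriesID {
		return a.SeriesID < b.SeriesID
	}
	return a.Timestamp < b.Timestamp
}

// cursor tracks the read position within one already-sorted row group.
type cursor struct {
	rows []row
	pos  int
}

// cursorHeap is a min-heap of cursors keyed on each cursor's current row.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return less(h[i].rows[h[i].pos], h[j].rows[h[j].pos]) }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeRowGroups emits the rows of all groups in global
// (SeriesID, Timestamp) order. Each input group must already be sorted.
func mergeRowGroups(groups [][]row) []row {
	h := &cursorHeap{}
	total := 0
	for _, g := range groups {
		if len(g) > 0 {
			*h = append(*h, &cursor{rows: g})
			total += len(g)
		}
	}
	heap.Init(h)

	out := make([]row, 0, total)
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current row
		out = append(out, c.rows[c.pos])
		c.pos++
		if c.pos < len(c.rows) {
			heap.Fix(h, 0) // current row changed, restore heap order
		} else {
			heap.Pop(h) // this row group is exhausted
		}
	}
	return out
}

func main() {
	groups := [][]row{
		{{1, 10}, {2, 10}, {3, 10}},
		{{1, 20}, {3, 20}},
		{{2, 30}, {3, 30}},
	}
	for _, r := range mergeRowGroups(groups) {
		fmt.Printf("series=%d ts=%d\n", r.SeriesID, r.Timestamp)
	}
}
```

In the real code the merged rows would be written out in new row groups instead of being collected into a slice.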

@simonswine
Collaborator

This will improve data locality a ton, but I am a bit unsure how this will impact querying, as the order of querying will be:

  • Timestamp first then SeriesID

And blocks will be strictly stored in

  • SeriesID first then Timestamp

Currently the sorting is more like:

  • Within a single row group strictly: Series first then Timestamp
  • Across row groups, loosely timestamp ordered

If we query a time range covering only part of the ~3 hours within a block, we can currently get away with reading only the pages that fall within that time range (based on the pages' timestamp min/max). With this change we can only select pages by their SeriesID min/max. I don't expect a major issue, I am just wondering.
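
To illustrate the concern, here is a rough sketch with made-up page statistics (the `page` type, both helper functions and all numbers are hypothetical, with timestamps as minutes into a 3 hour block). It only shows how min/max pruning behaves under the two layouts; it is not how the query path is actually implemented:

```go
package main

import "fmt"

// page is a hypothetical summary of one page's min/max statistics.
type page struct {
	MinSeries, MaxSeries uint32
	MinTS, MaxTS         int64
}

// pagesInTimeRange returns the indexes of pages whose timestamp
// range overlaps [from, to].
func pagesInTimeRange(pages []page, from, to int64) []int {
	var hit []int
	for i, p := range pages {
		if p.MaxTS >= from && p.MinTS <= to {
			hit = append(hit, i)
		}
	}
	return hit
}

// pagesForSeries returns the indexes of pages whose SeriesID range
// contains the given series.
func pagesForSeries(pages []page, series uint32) []int {
	var hit []int
	for i, p := range pages {
		if p.MinSeries <= series && series <= p.MaxSeries {
			hit = append(hit, i)
		}
	}
	return hit
}

func main() {
	// Loosely timestamp ordered across row groups: every page covers
	// all series but only a slice of the block's time range.
	timeOrdered := []page{
		{MinSeries: 1, MaxSeries: 9, MinTS: 0, MaxTS: 59},
		{MinSeries: 1, MaxSeries: 9, MinTS: 60, MaxTS: 119},
		{MinSeries: 1, MaxSeries: 9, MinTS: 120, MaxTS: 179},
	}
	// Strictly SeriesID ordered: every page covers few series but
	// the whole time range of the block.
	seriesOrdered := []page{
		{MinSeries: 1, MaxSeries: 3, MinTS: 0, MaxTS: 179},
		{MinSeries: 4, MaxSeries: 6, MinTS: 0, MaxTS: 179},
		{MinSeries: 7, MaxSeries: 9, MinTS: 0, MaxTS: 179},
	}

	fmt.Println(len(pagesInTimeRange(timeOrdered, 60, 100)))   // 1
	fmt.Println(len(pagesInTimeRange(seriesOrdered, 60, 100))) // 3
	fmt.Println(len(pagesForSeries(timeOrdered, 5)))           // 3
	fmt.Println(len(pagesForSeries(seriesOrdered, 5)))         // 1
}
```

With these numbers a time-range query prunes well only in the time-ordered layout, while a lookup by series prunes well only in the series-ordered one.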

Fairly relevant to the change in #799

@simonswine
Collaborator

Maybe we need to change our query behaviour so that it also follows the order in which data is stored in the blocks.
