This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Optimise in-memory parquet reader #848

Merged
merged 1 commit into from
Jul 13, 2023

@kolesnikovae kolesnikovae commented Jul 12, 2023

This change stops the in-memory parquet reader from using the stored schema to reconstruct rows via reflection; it uses hand-written reconstruction instead.
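To illustrate the difference between the two approaches, here is a minimal stdlib-only sketch. The `Label` type, the `[]any` row representation, and both function names are hypothetical and are not taken from this PR or from parquet-go; the sketch only contrasts generic, reflection-driven field assignment with reconstruction code written by hand for one known type.

```go
package main

import (
	"fmt"
	"reflect"
)

// Label is a hypothetical row type used only for this illustration.
type Label struct {
	Name  string
	Value string
	Count int64
}

// fromRowReflect fills dst (a pointer to a struct) from column values by
// walking the struct fields via reflection. It works for any struct whose
// exported fields match the row positionally, but pays the reflection cost
// on every field of every row.
func fromRowReflect(row []any, dst any) {
	v := reflect.ValueOf(dst).Elem()
	for i := 0; i < v.NumField(); i++ {
		v.Field(i).Set(reflect.ValueOf(row[i]))
	}
}

// fromRowManual is the hand-written equivalent for Label: the column layout
// is known up front, so each value is assigned directly.
func fromRowManual(row []any) Label {
	return Label{
		Name:  row[0].(string),
		Value: row[1].(string),
		Count: row[2].(int64),
	}
}

func main() {
	row := []any{"region", "eu", int64(3)}

	var viaReflect Label
	fromRowReflect(row, &viaReflect)
	viaManual := fromRowManual(row)

	// Both paths should produce the same value.
	fmt.Println(viaReflect == viaManual)
}
```

The hand-written path has to be maintained per type, which is the trade-off for avoiding reflection on the read path.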

err := func(rg parquet.RowGroup) error {
reader := parquet.NewGenericRowGroupReader[M](rg)
defer runutil.CloseWithLogOnErr(util.Logger, reader, "closing parquet generic row group reader")
err = func(rg parquet.RowGroup) error {
Collaborator

I suspect we should also run this in parallel?

Contributor Author

I'm not sure. The problem is that it would cost us a lot of memory: the page reader acquires a buffer (of ReadBufferSize, 2MB) per column chunk, and if RGs were read concurrently, the internal buffer pool would be useless.

Interestingly, I see no apparent reason why column chunks should not be read concurrently (https://github.com/segmentio/parquet-go/blob/5d42db8f0d4728c31759068f08da15df44c6cc7f/row_group.go#L320): since the buffers are already allocated, it should not be that wasteful.
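The memory concern can be put into rough numbers. This is a back-of-the-envelope sketch, assuming every column chunk of every in-flight row group holds its 2MB buffer at the same time; the column count of 10 and the function name are made up for illustration, and real usage depends on how the library pools and releases buffers.

```go
package main

import "fmt"

// readBufferSize mirrors the 2MB per-column-chunk figure quoted above.
const readBufferSize = 2 << 20

// peakBufferBytes estimates the buffer memory held at once, under the
// assumption (not verified against the library) that each column chunk of
// each concurrently read row group keeps one buffer for the whole read.
func peakBufferBytes(columns, concurrentRowGroups int) int {
	return columns * concurrentRowGroups * readBufferSize
}

func main() {
	const columns = 10 // hypothetical schema width
	for _, n := range []int{1, 4, 16} {
		fmt.Printf("%d row group(s) in flight: %d MB\n", n, peakBufferBytes(columns, n)>>20)
	}
}
```

Under these assumptions the cost grows linearly with the number of row groups read at once, whereas sequential reads can keep reusing the same pooled buffers.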

Collaborator

Yeah, columns are read in parallel in that case, but I thought it would be good to process multiple row groups at once. Not sure if it would help, as you say; let's see.

@kolesnikovae kolesnikovae force-pushed the perf/optimise-inmemory-parquet-reader branch from e475243 to 076cd28 July 13, 2023 05:04
Comment on lines -14 to -17
type StoredString struct {
	ID     uint64 `parquet:",delta"`
	String string `parquet:",dict"`
}
Contributor Author

I couldn't find where the ID field is used.

@kolesnikovae kolesnikovae marked this pull request as ready for review July 13, 2023 05:16
Collaborator

@cyriltovena cyriltovena left a comment

LGTM

@kolesnikovae kolesnikovae merged commit c60f418 into main Jul 13, 2023
17 checks passed
@kolesnikovae kolesnikovae deleted the perf/optimise-inmemory-parquet-reader branch July 13, 2023 09:16
simonswine pushed a commit to simonswine/pyroscope that referenced this pull request Jul 18, 2023