Optimise in-memory parquet reader #848

kolesnikovae · 2023-07-12T10:48:26Z

This change makes the in-memory parquet reader stop using the stored schema to reconstruct rows via reflection; it uses hand-written reconstruction instead.
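
As a rough illustration of the difference (not the code from this PR), the sketch below fills the same struct once through `reflect` and once with direct assignments. The `storedString` type, the `[]any` row representation, and both function names are assumptions made for the example; the real reader works on parquet rows.

```go
package main

import (
	"fmt"
	"reflect"
)

// storedString is a hypothetical row type used only for this sketch.
type storedString struct {
	ID     uint64
	String string
}

// reconstructReflect fills the struct field by field via reflection,
// roughly the kind of work a generic, schema-driven reader performs.
func reconstructReflect(dst any, values []any) {
	v := reflect.ValueOf(dst).Elem()
	for i := 0; i < v.NumField() && i < len(values); i++ {
		v.Field(i).Set(reflect.ValueOf(values[i]))
	}
}

// reconstructManual assigns each column value to its field directly,
// with no reflection and no schema lookup.
func reconstructManual(dst *storedString, values []any) {
	dst.ID = values[0].(uint64)
	dst.String = values[1].(string)
}

func main() {
	row := []any{uint64(7), "main.go"}
	var a, b storedString
	reconstructReflect(&a, row)
	reconstructManual(&b, row)
	fmt.Println(a == b) // both paths should produce the same struct
}
```

The hand-written path gives up generality: every row type needs its own function, and the column order is hard-coded.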

cyriltovena · 2023-07-12T15:04:22Z

pkg/phlaredb/block_querier.go

-		err := func(rg parquet.RowGroup) error {
-			reader := parquet.NewGenericRowGroupReader[M](rg)
-			defer runutil.CloseWithLogOnErr(util.Logger, reader, "closing parquet generic row group reader")
+		err = func(rg parquet.RowGroup) error {


I suspect we should also run this in parallel?

I'm not sure. The problem is that it would cost us a lot of memory, because the page reader acquires a buffer (of ReadBufferSize, 2MB) per column chunk, and if RGs were read concurrently, the internal buffer pool would be useless.

Interestingly, I see no apparent reason why column chunks should not be read concurrently: https://github.com/segmentio/parquet-go/blob/5d42db8f0d4728c31759068f08da15df44c6cc7f/row_group.go#L320. Since the buffers are already allocated, it should not be that wasteful.

Yeah, columns are read in parallel in that case, but I thought it would be good to process multiple row groups at once. Not sure if it would help; as you say, let's see.
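
A back-of-the-envelope sketch of the memory concern above, assuming one 2MB buffer is held per column chunk for every row group being read at the same time. The column count of 10 and the function name are hypothetical; only the 2MB buffer size comes from the comment.

```go
package main

import "fmt"

// readBufferSize is the 2 MiB buffer size quoted in the comment above.
const readBufferSize = 2 << 20

// peakBufferBytes estimates the buffer memory held at once, assuming one
// buffer per column chunk for each row group read concurrently.
func peakBufferBytes(columns, concurrentRowGroups int) int {
	return columns * concurrentRowGroups * readBufferSize
}

func main() {
	const columns = 10 // hypothetical column count
	for _, n := range []int{1, 4, 8} {
		fmt.Printf("row groups=%d buffers=%d MiB\n", n, peakBufferBytes(columns, n)>>20)
	}
}
```

Under these assumptions the buffer footprint grows linearly with the number of row groups read in parallel, which is why sequential reads that reuse pooled buffers look attractive.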

kolesnikovae · 2023-07-13T05:07:32Z

pkg/phlaredb/schemas/v1/strings.go

-type StoredString struct {
-	ID     uint64 `parquet:",delta"`
-	String string `parquet:",dict"`
-}


I didn't manage to find how the ID field is used

cyriltovena

LGTM

cyriltovena reviewed Jul 12, 2023

Optimise in-memory parquet reader

076cd28

kolesnikovae force-pushed the perf/optimise-inmemory-parquet-reader branch from e475243 to 076cd28 Compare July 13, 2023 05:04

kolesnikovae commented Jul 13, 2023

kolesnikovae marked this pull request as ready for review July 13, 2023 05:16

kolesnikovae mentioned this pull request Jul 19, 2023

Symbolic information memory usage grafana/pyroscope#2025

Closed

cyriltovena approved these changes Jul 13, 2023

kolesnikovae merged commit c60f418 into main Jul 13, 2023
17 checks passed

kolesnikovae deleted the perf/optimise-inmemory-parquet-reader branch July 13, 2023 09:16

simonswine pushed a commit to simonswine/pyroscope that referenced this pull request Jul 18, 2023

Optimise in-memory parquet reader (grafana/phlare#848)

470a13d
