This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support to read parquet row groups in chunks #789

Merged
merged 23 commits into from
Feb 4, 2022

Conversation

jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Jan 25, 2022

This PR allows reading of parquet columns in chunks, thereby allowing decompressing and deserializing pages on demand to reduce the memory footprint.

This is a draft as there is still some work to do, which I will continue

  • fix known issues in tracking our position within a given page (the offset is incorrect at the moment)
  • add tests for a non-None chunk size
  • add back reading dictionary
  • add back support to read structs
  • add back delta length encoding for binary
  • add back support for lists
  • triple-check usage of with_capacity and reserve to avoid unneeded reallocations
  • bench (performance should improve when the chunk size is None; there should be minimal difference with chunking)
  • DRY copy-pasted code between primitive, binary, fixed binary and boolean
  • remove old code

Design

This PR follows the design of this crate:

  • decompression/deserialization cannot cross column chunk boundaries, so that columns can be read (IO-bound) and deserialized (CPU-bound) independently, and the two operations are independent of each other
  • unsafe-free
  • generics to inline hot operations

The overall design of the changes enables pages to be deserialized to arrays of a different length, based on a new parameter chunk_size: usize:

parquet
[                             column chunk                       ]
[page][page][page][page][page][page][page][page][page][page][page]

arrow
[      array      ][      array      ][      array      ][ array ]
 <- chunk size  ->
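The mapping in the diagram can be sketched as follows. This is an illustrative helper, not the crate's API: it computes the lengths of the arrays produced from a column chunk of num_rows rows, assuming chunk_size > 0.

```rust
// Hypothetical helper mirroring the diagram above: split `num_rows` rows of a
// column chunk into array lengths of at most `chunk_size` rows each.
// Assumes `chunk_size > 0`.
fn array_lengths(num_rows: usize, chunk_size: usize) -> Vec<usize> {
    let mut lengths = Vec::with_capacity(num_rows / chunk_size + 1);
    let mut remaining = num_rows;
    while remaining > 0 {
        let len = remaining.min(chunk_size);
        lengths.push(len);
        remaining -= len;
    }
    lengths
}

fn main() {
    // a 100-row column chunk with chunk_size = 30 yields three full arrays
    // and one trailing shorter array, as in the diagram
    assert_eq!(array_lengths(100, 30), vec![30, 30, 30, 10]);
    println!("{:?}", array_lengths(100, 30));
}
```

Note that only the last array may be shorter than chunk_size; page boundaries do not influence the array lengths.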

Broadly, this PR does the following (for a single row group):

  • read N column chunks to memory (Vec<u8>)
  • for each column, compose an iterator by chaining iterator adapters:
    • Vec<u8> -> Iterator<CompressedDataPage>
    • Iterator<CompressedDataPage> -> Iterator<&DataPage> (decompression)
    • Iterator<&DataPage> -> Iterator<Arc<dyn Array>> (deserialization)
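The adapter chain above can be sketched with stand-in types (not arrow2's real types; decompression and deserialization are replaced by identity operations on bytes):

```rust
// Stand-in types for the sketch: a compressed page and a decompressed page
// are both just bytes here.
struct CompressedDataPage(Vec<u8>);
struct DataPage(Vec<u8>);

fn main() {
    // one column chunk read into memory with a single read_exact
    let column_chunk: Vec<u8> = (0u8..9).collect();

    // Vec<u8> -> Iterator<CompressedDataPage>: split at (fake) page boundaries
    let compressed = column_chunk
        .chunks(3)
        .map(|page| CompressedDataPage(page.to_vec()));

    // Iterator<CompressedDataPage> -> Iterator<DataPage>: "decompression"
    // (identity here; the real adapter decompresses the bytes)
    let pages = compressed.map(|CompressedDataPage(bytes)| DataPage(bytes));

    // Iterator<DataPage> -> Iterator<array>: "deserialization"
    // (a Vec<u8> stands in for Arc<dyn Array>)
    let arrays: Vec<Vec<u8>> = pages.map(|DataPage(bytes)| bytes).collect();

    assert_eq!(arrays.len(), 3);
    assert_eq!(arrays[0], vec![0, 1, 2]);
}
```

Because each stage is a lazy iterator adapter, no page is decompressed or deserialized until the consumer pulls it, which is what keeps the memory footprint bounded.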

All these iterators are CPU-bound. In the last step, we track our position within the DataPage (through PageState, see below) and keep a temporary mutable array, and either:

  • consume the page into the mutable array, spilling the remainder into a ring of mutables, when the chunk size is so small that one page spans multiple arrays, or
  • iterate over more pages until the temporary mutable is filled up to chunk_size, at which point we freeze the mutable and return it
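The fill-and-freeze loop can be sketched as follows. This is a simplification under stated assumptions: pages are plain Vec<i32> instead of decoded DataPages, and a single loop replaces the PageState/ring machinery.

```rust
// Sketch: consume decoded pages into a mutable buffer, freezing an
// "array" every `chunk_size` values. `pages` stands in for
// Iterator<&DataPage>; `Vec<i32>` stands in for a decoded page.
fn chunked_arrays(
    pages: impl Iterator<Item = Vec<i32>>,
    chunk_size: usize,
) -> Vec<Vec<i32>> {
    let mut arrays = vec![];
    let mut mutable = Vec::with_capacity(chunk_size);
    for page in pages {
        for value in page {
            mutable.push(value);
            if mutable.len() == chunk_size {
                // freeze the mutable into an immutable array and start a new one
                arrays.push(std::mem::take(&mut mutable));
                mutable.reserve(chunk_size);
            }
        }
    }
    if !mutable.is_empty() {
        // last array may be shorter than chunk_size
        arrays.push(mutable);
    }
    arrays
}

fn main() {
    let pages = vec![vec![1, 2, 3], vec![4, 5]].into_iter();
    assert_eq!(chunked_arrays(pages, 2), vec![vec![1, 2], vec![3, 4], vec![5]]);
}
```

Note how the second array, [3, 4], crosses a page boundary: this is exactly the "suspend mid-page and resume" behavior that PageState enables in the real implementation.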

In more detail:

  • column chunks are now read in a single read_exact, instead of page by page. This was a mistake in the previous design, since we should not mix reading column chunks with deserializing pages
  • a &'a DataPage is mapped to a PageState<'a> based on its parquet physical type and encoding. This is the input state of a page and allows us to "suspend" an iteration over a page, thereby allowing a page to "extend" an array without being completely consumed
  • a new concept, Decoder, knows how to initialize mutable arrays and how to extend them from a &mut PageState<'a>, advancing the page state accordingly. This is mostly a DRY trait
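A minimal sketch of these two concepts, with hypothetical shapes (not arrow2's exact definitions): a PageState that records how far into a page we have consumed, and a Decoder that extends a mutable array from it.

```rust
// Hypothetical shapes illustrating suspendable page iteration.
struct PageState<'a> {
    values: &'a [i32], // the page's decoded values (i32 for simplicity)
    offset: usize,     // how many values have been consumed so far
}

trait Decoder {
    type Mutable;
    /// initialize a mutable array with a given capacity
    fn init(&self, capacity: usize) -> Self::Mutable;
    /// move up to `additional` values from the page state into the mutable
    /// array, advancing the page state so iteration can resume later
    fn extend(&self, state: &mut PageState, mutable: &mut Self::Mutable, additional: usize);
}

struct I32Decoder;
impl Decoder for I32Decoder {
    type Mutable = Vec<i32>;
    fn init(&self, capacity: usize) -> Vec<i32> {
        Vec::with_capacity(capacity)
    }
    fn extend(&self, state: &mut PageState, mutable: &mut Vec<i32>, additional: usize) {
        let end = (state.offset + additional).min(state.values.len());
        mutable.extend_from_slice(&state.values[state.offset..end]);
        state.offset = end; // "suspend" here; a later call resumes from `end`
    }
}

fn main() {
    let values = [1, 2, 3, 4, 5];
    let mut state = PageState { values: &values, offset: 0 };
    let decoder = I32Decoder;
    let mut mutable = decoder.init(4);
    decoder.extend(&mut state, &mut mutable, 2); // partial consumption
    assert_eq!(mutable, vec![1, 2]);
    decoder.extend(&mut state, &mut mutable, 10); // resume and drain the page
    assert_eq!(mutable, vec![1, 2, 3, 4, 5]);
    assert_eq!(state.offset, 5);
}
```

Dispatching on physical type and encoding would pick a different PageState variant and Decoder impl per column; the trait keeps the extend logic shared (the DRY point above).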

Closes #768

@jorgecarleitao
Owner Author

jorgecarleitao commented Jan 29, 2022

This now reproduces all the existing functionality apart from:

  • read structs
  • read nested list (i.e. list of list)
  • delta length encoding

I found a bug in parquet2 in writing nested data while working on this. Thus, I plan to first release parquet2 with the fix and a simpler API to write, and then move back to this one again.

@codecov

codecov bot commented Jan 30, 2022

Codecov Report

Merging #789 (a84539b) into main (e577c9f) will decrease coverage by 0.09%.
The diff coverage is 69.67%.


@@            Coverage Diff             @@
##             main     #789      +/-   ##
==========================================
- Coverage   71.29%   71.19%   -0.10%     
==========================================
  Files         321      326       +5     
  Lines       16834    17471     +637     
==========================================
+ Hits        12001    12438     +437     
- Misses       4833     5033     +200     
Impacted Files Coverage Δ
src/io/parquet/read/null.rs 0.00% <0.00%> (ø)
src/io/parquet/read/record_batch.rs 0.00% <0.00%> (-79.75%) ⬇️
src/io/parquet/read/boolean/nested.rs 50.00% <50.90%> (-38.24%) ⬇️
src/io/parquet/read/binary/nested.rs 53.84% <54.90%> (-0.70%) ⬇️
src/io/parquet/read/primitive/dictionary.rs 60.00% <60.00%> (+4.73%) ⬆️
src/io/parquet/read/mod.rs 53.46% <61.22%> (+14.55%) ⬆️
src/io/parquet/read/primitive/nested.rs 62.26% <62.26%> (+4.36%) ⬆️
src/io/parquet/read/binary/dictionary.rs 63.38% <63.38%> (-5.72%) ⬇️
src/io/parquet/read/dictionary.rs 63.41% <63.41%> (ø)
...rc/io/parquet/read/fixed_size_binary/dictionary.rs 63.63% <63.63%> (ø)
... and 29 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e577c9f...a84539b.

@jorgecarleitao jorgecarleitao force-pushed the parquet_pages branch 2 times, most recently from fb6a54a to aef6394 on January 30, 2022 21:26
@jorgecarleitao jorgecarleitao marked this pull request as ready for review January 30, 2022 22:38
@jorgecarleitao
Owner Author

@houqp, this is ready for a spin.

There is a regression (20-40%) when reading a whole binary/utf8 column chunk at once (i.e. no chunk size). This is related to some tricks in pre-computed capacities of binary/utf8 that benefit from reading the whole column (we can recover this behavior).

I deactivated structs of structs and lists of lists for now, as I need to dig a bit into the Dremel encoding. The failing tests are just examples that I need to update. ^^

@jorgecarleitao jorgecarleitao force-pushed the parquet_pages branch 3 times, most recently from 2aaf8dd to 97fcf3b on February 2, 2022 16:49
@jorgecarleitao jorgecarleitao merged commit f35e02a into main Feb 4, 2022
@jorgecarleitao jorgecarleitao deleted the parquet_pages branch February 4, 2022 15:58
@jorgecarleitao jorgecarleitao changed the title Read parquet row groups in chunks Added support to read parquet row groups in chunks Mar 6, 2022
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow parquet column chunks to be deserialized to multiple arrays (e.g. iterator)
1 participant