This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support to write nested parquet #1007

Merged
merged 7 commits into from
May 27, 2022

Conversation

jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented May 24, 2022

This PR adds support for writing StructArray to parquet (and likely arbitrarily nested types, but more testing is required).

I expect the cost of the operation (per array) to be O(N * D * C) in the worst case, where:

  • D is the maximum nesting depth
  • N is the number of items
  • C is the number of leaf parquet columns

The main idea in this PR: given a (potentially nested) parquet field (ParquetType) and an Array,

  1. create a Vec<Nested> containing the validities and lengths of each nest level.
  2. map Vec<Nested> to iterators of rep and def levels.
  3. encode the iterators
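
Steps 1 and 2 can be illustrated with a minimal sketch restricted to struct nesting. It is not the code in this PR: the `Level` type and the `def_levels` and `rep_levels` functions are hypothetical names introduced only for illustration, and it materializes vectors rather than iterators. It assumes that a null at any level hides everything below it, and that structs alone never repeat, so every repetition level is 0. List levels would additionally need offsets (lengths) and non-zero repetition levels, which this sketch leaves out.

```rust
/// Hypothetical, simplified description of one nesting level (structs only).
/// This is NOT the crate's actual `Nested` type.
#[derive(Debug, Clone)]
pub struct Level {
    /// Whether the parquet field at this level is OPTIONAL.
    pub is_optional: bool,
    /// Per-row validity; `None` means every row is valid.
    pub validity: Option<Vec<bool>>,
}

/// Definition level per row: the number of optional levels that are
/// defined, walking from the root and stopping at the first null.
/// Cost is O(N * D) for one leaf column.
pub fn def_levels(nested: &[Level], num_rows: usize) -> Vec<u32> {
    (0..num_rows)
        .map(|row| {
            let mut def = 0u32;
            for level in nested {
                let is_valid = level.validity.as_ref().map_or(true, |v| v[row]);
                if !is_valid {
                    // A null here means deeper levels are undefined.
                    break;
                }
                if level.is_optional {
                    def += 1;
                }
            }
            def
        })
        .collect()
}

/// Repetition levels for struct-only nesting: nothing repeats,
/// so every row starts a new record at level 0.
pub fn rep_levels(num_rows: usize) -> Vec<u32> {
    vec![0; num_rows]
}
```

For an optional struct with validity `[true, false, true, true]` wrapping an optional leaf with validity `[true, true, false, true]`, `def_levels` yields `[2, 0, 1, 2]`: the second row is null at the struct level, the third at the leaf level. Running this once per leaf column gives the O(N * D * C) bound.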

@jorgecarleitao jorgecarleitao added the feature A new feature label May 24, 2022
@codecov

codecov bot commented May 24, 2022

Codecov Report

Merging #1007 (339668e) into main (a10db3a) will increase coverage by 0.23%.
The diff coverage is 77.74%.

@@            Coverage Diff             @@
##             main    #1007      +/-   ##
==========================================
+ Coverage   71.42%   71.65%   +0.23%     
==========================================
  Files         356      359       +3     
  Lines       19784    20037     +253     
==========================================
+ Hits        14131    14358     +227     
- Misses       5653     5679      +26     
| Impacted Files | Coverage Δ |
|---|---|
| src/io/parquet/write/sink.rs | 71.42% <ø> (ø) |
| src/io/parquet/write/mod.rs | 61.62% <53.22%> (+3.00%) ⬆️ |
| src/io/parquet/write/pages.rs | 65.62% <65.62%> (ø) |
| src/io/parquet/write/row_group.rs | 82.05% <76.92%> (-2.80%) ⬇️ |
| src/io/parquet/write/binary/nested.rs | 78.94% <80.00%> (-7.42%) ⬇️ |
| src/io/parquet/write/primitive/nested.rs | 76.19% <80.00%> (-7.15%) ⬇️ |
| src/io/parquet/write/utf8/nested.rs | 78.94% <80.00%> (-7.42%) ⬇️ |
| src/io/parquet/write/nested/def.rs | 86.95% <86.95%> (ø) |
| src/io/parquet/write/nested/rep.rs | 92.72% <92.72%> (ø) |
| src/io/parquet/write/nested/mod.rs | 96.61% <96.61%> (ø) |

... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a10db3a...339668e.

@jorgecarleitao jorgecarleitao changed the title [WIP] Write nested parquet Added support to write nested parquet May 26, 2022
@jorgecarleitao jorgecarleitao marked this pull request as ready for review May 26, 2022 11:53
@jorgecarleitao jorgecarleitao force-pushed the write_nest branch 3 times, most recently from dcfb9d8 to 22a1c84 Compare May 26, 2022 17:30
@jorgecarleitao
Owner Author

This is now ready: it has more tests, is generalized to arbitrary nesting, and the tests demonstrate that we can write StructArray.

@ritchie46
Collaborator

Awesome work @jorgecarleitao!

@ahmedriza

ahmedriza commented May 26, 2022

Many thanks for the quick fix @jorgecarleitao. Fantastic work.

@ahmedriza

ahmedriza commented May 26, 2022

I tested a different, two-level nested Parquet file against this branch. Unfortunately, reading the Parquet file fails. I have submitted a separate issue: #1014.
