Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fixed writing nested/sliced arrays to parquet #1326

Merged
merged 5 commits into from Dec 13, 2022

Conversation

ritchie46
Copy link
Collaborator

@ritchie46 ritchie46 commented Dec 11, 2022

fixes #1323

We wrote invalid arrays to parquet because we did not take the offset of the list arrays into account. This got worse as we now slice leaf arrays when we write to pages and the offsets don't belong to the pages written anymore.

Because we don't use offsets, but dremel during reading/writing of the parquet this still read values, but just way too many.

This PR fixes the huges memory usage/ huge files and incorrect files as observed in #1323.

@codecov
Copy link

codecov bot commented Dec 11, 2022

Codecov Report

Base: 83.12% // Head: 83.12% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (a22cce5) compared to base (1fcfd7c).
Patch coverage: 84.03% of modified lines in pull request are covered.

❗ Current head a22cce5 differs from pull request most recent head bbd2b3a. Consider uploading reports for the commit bbd2b3a to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1326   +/-   ##
=======================================
  Coverage   83.12%   83.12%           
=======================================
  Files         370      370           
  Lines       40169    40241   +72     
=======================================
+ Hits        33391    33451   +60     
- Misses       6778     6790   +12     
Impacted Files Coverage Δ
src/io/parquet/write/mod.rs 85.78% <77.64%> (-0.95%) ⬇️
src/io/parquet/write/binary/nested.rs 96.96% <100.00%> (+0.19%) ⬆️
src/io/parquet/write/boolean/nested.rs 96.66% <100.00%> (+0.23%) ⬆️
src/io/parquet/write/pages.rs 95.85% <100.00%> (ø)
src/io/parquet/write/primitive/nested.rs 97.36% <100.00%> (+0.14%) ⬆️
src/io/parquet/write/utf8/nested.rs 96.96% <100.00%> (+0.19%) ⬆️
src/io/ipc/read/stream_async.rs 75.34% <0.00%> (-1.37%) ⬇️
src/array/binary/mod.rs 90.07% <0.00%> (-1.15%) ⬇️
src/io/ipc/read/file_async.rs 60.44% <0.00%> (-0.75%) ⬇️
src/io/ipc/read/schema.rs 95.29% <0.00%> (-0.30%) ⬇️
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ritchie46
Copy link
Collaborator Author

O(n) is restored:
image

@ritchie46
Copy link
Collaborator Author

@jorgecarleitao I added a test list_nested_inner_required_required_i64 which successfully round trips, only the statistics are not equal to pyarrow.

I am not really sure what the proper statistics should be in such a case. The values are correct only the nesting differs. Pyarrow does not nest the null count, whilst we do.

---- io::parquet::write::list_nested_inner_required_required_i64 stdout ----
thread 'io::parquet::write::list_nested_inner_required_required_i64' panicked at 'assertion failed: `(left == right)`
  left: `Statistics { null_count: UInt64[0], distinct_count: UInt64[None], min_value: ListArray[[[0]]], max_value: ListArray[[[10]]] }`,
 right: `Statistics { null_count: ListArray[[[0]]], distinct_count: ListArray[[[None]]], min_value: ListArray[[[0]]], max_value: ListArray[[[10]]] }`', tests/it/io/parquet/write.rs:59:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@jorgecarleitao
Copy link
Owner

Oof, thank you so much, @ritchie46 !

@ritchie46
Copy link
Collaborator Author

@jorgecarleitao This should be good to go. The coverage failure is unrelated.

@jorgecarleitao jorgecarleitao changed the title fix/perf writing nested/sliced arrays to parquet Fixed writing nested/sliced arrays to parquet Dec 13, 2022
@jorgecarleitao jorgecarleitao merged commit 96cafcf into jorgecarleitao:main Dec 13, 2022
@jorgecarleitao jorgecarleitao added the bug Something isn't working label Dec 13, 2022
ritchie46 added a commit to ritchie46/arrow2 that referenced this pull request Mar 29, 2023
ritchie46 added a commit to ritchie46/arrow2 that referenced this pull request Apr 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet writes all values of sliced arrays?
2 participants