This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Delayed dict #1185

Merged
jorgecarleitao merged 4 commits into main from delayed_dict
Aug 11, 2022

Conversation

jorgecarleitao
Owner

This PR is a follow-up on jorgecarleitao/parquet2#160, bringing its design changes here.

The main idea of this PR is that we no longer use parquet2's in-memory format for dictionary pages. Instead, we deserialize the dictionary pages ourselves and specialize the deserialized representation to the target arrow2::datatypes::DataType (e.g. one representation for DataType::Dictionary and another for DataType::Utf8).

This is probably the (globally) optimal way to deserialize dictionary-encoded data to Arrow.
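A minimal sketch of the idea, not the code in this PR: the same raw dictionary page is decoded into a different in-memory shape depending on the requested target. Everything named here (`Target`, `DecodedDict`, `split_plain`, `deserialize_dict`, `plain_page`) is hypothetical and is not part of the arrow2 or parquet2 API. The only external fact assumed is that PLAIN-encoded byte arrays in Parquet are stored as a 4-byte little-endian length followed by the bytes.

```rust
// Hypothetical stand-ins for illustration; none of these are arrow2/parquet2 items.

/// What the caller wants the column decoded into (stand-in for a target DataType).
#[allow(dead_code)]
enum Target {
    Dictionary,
    Utf8,
}

/// Decoded form of a dictionary page, specialized per target.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DecodedDict {
    /// Arrow-style offsets plus contiguous values, usable as a dictionary's values array.
    Values { offsets: Vec<i32>, values: Vec<u8> },
    /// One owned entry per dictionary item, convenient when expanding indices into values.
    Lookup(Vec<Vec<u8>>),
}

/// Splits a PLAIN-encoded byte-array page (u32 LE length + bytes, repeated) into entries.
/// Stops at the first truncated entry instead of panicking.
#[allow(dead_code)]
fn split_plain(page: &[u8]) -> Vec<&[u8]> {
    let mut entries = Vec::new();
    let mut rest = page;
    while rest.len() >= 4 {
        let len = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]) as usize;
        rest = &rest[4..];
        if len > rest.len() {
            break;
        }
        let (entry, tail) = rest.split_at(len);
        entries.push(entry);
        rest = tail;
    }
    entries
}

/// Decodes the raw page once, directly into the shape the target needs.
#[allow(dead_code)]
fn deserialize_dict(page: &[u8], target: &Target) -> DecodedDict {
    let entries = split_plain(page);
    match target {
        Target::Dictionary => {
            let mut offsets = vec![0i32];
            let mut values = Vec::new();
            for entry in entries {
                values.extend_from_slice(entry);
                offsets.push(values.len() as i32);
            }
            DecodedDict::Values { offsets, values }
        }
        Target::Utf8 => {
            DecodedDict::Lookup(entries.into_iter().map(|e| e.to_vec()).collect())
        }
    }
}

/// Test helper: builds a PLAIN-encoded page from string items.
#[allow(dead_code)]
fn plain_page(items: &[&str]) -> Vec<u8> {
    let mut page = Vec::new();
    for item in items {
        page.extend_from_slice(&(item.len() as u32).to_le_bytes());
        page.extend_from_slice(item.as_bytes());
    }
    page
}
```

Calling `deserialize_dict(&plain_page(&["a", "bc"]), &Target::Dictionary)` in this sketch yields the offsets-and-values form, while `Target::Utf8` yields one entry per item. The point is only that the raw page is interpreted once, in the shape the consumer needs, rather than through a fixed intermediate format.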

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Jul 25, 2022

codecov bot commented Jul 25, 2022

Codecov Report

Merging #1185 (a1dd795) into main (838deca) will increase coverage by 0.13%.
The diff coverage is 88.54%.

@@            Coverage Diff             @@
##             main    #1185      +/-   ##
==========================================
+ Coverage   83.17%   83.30%   +0.13%     
==========================================
  Files         358      358              
  Lines       37255    37314      +59     
==========================================
+ Hits        30986    31085      +99     
+ Misses       6269     6229      -40     
Impacted Files Coverage Δ
src/io/parquet/read/mod.rs 100.00% <ø> (ø)
src/io/parquet/write/mod.rs 86.50% <ø> (ø)
src/io/parquet/write/utils.rs 93.61% <ø> (-0.07%) ⬇️
src/io/parquet/read/deserialize/mod.rs 73.33% <50.00%> (-1.43%) ⬇️
src/io/parquet/write/dictionary.rs 86.33% <60.00%> (ø)
src/io/parquet/read/deserialize/boolean/basic.rs 93.60% <66.66%> (-0.71%) ⬇️
src/io/parquet/read/deserialize/dictionary/mod.rs 77.09% <69.56%> (-0.88%) ⬇️
src/io/parquet/read/deserialize/binary/basic.rs 80.44% <80.00%> (+4.66%) ⬆️
...t/read/deserialize/fixed_size_binary/dictionary.rs 44.28% <83.33%> (-5.72%) ⬇️
...c/io/parquet/read/deserialize/dictionary/nested.rs 65.83% <84.21%> (+3.10%) ⬆️
... and 29 more


@jorgecarleitao jorgecarleitao force-pushed the delayed_dict branch 4 times, most recently from 77c7e7d to 41a40e7 Compare August 1, 2022 05:02
@jorgecarleitao jorgecarleitao marked this pull request as ready for review August 10, 2022 21:19
@jorgecarleitao jorgecarleitao merged commit 2a12d17 into main Aug 11, 2022
@jorgecarleitao jorgecarleitao deleted the delayed_dict branch August 11, 2022 04:00