
Fixed reading parquet binary dict page #791

Merged

Conversation

danburkert
Contributor

As described in #790, there is a bug in parquet binary dict page decoding whereby too many values are yielded: a mixup in the length arguments passed to various decoding functions ultimately leads to extra bitpacked values being included in the output array. This PR fixes the bug and adds debug_assert!s in key places so that tests will catch this class of issue in the future. I went ahead and added similar debug asserts to the other page type read functions, which exposed a few places where validity buffers were being allocated for non-nullable columns.

Unfortunately these new debug asserts fire for the nested page type tests, but I don't currently understand how the encoding works for those page types (it looks significantly more complicated than RLE & bitpacking). I suspect these test failures are true positives, but I can't be sure without understanding how nested encoding works.
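For readers following along, here is a minimal, self-contained sketch of the failure mode (a hedged illustration with made-up names, not arrow2's actual decoder, and using bit width 1 for brevity). Bitpacked runs always encode a multiple of 8 values, so a decoder that trusts the pack size instead of the number of values still remaining will emit padding bits as values:

```rust
/// Decode bitpacked values at bit width 1. The last pack of a page usually
/// carries padding bits that must not be emitted as values.
fn decode_bitpacked(packs: &[u8], num_values: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(num_values);
    for pack in packs {
        let pack_size = 8; // values per pack at bit width 1
        let remaining = num_values - out.len();
        // the fix: cap by `remaining`; trusting `pack_size` alone overshoots
        let additional = pack_size.min(remaining);
        out.extend((0..additional).map(|i| (pack >> i) & 1));
    }
    // the kind of post-condition this PR adds via debug_assert!
    debug_assert_eq!(out.len(), num_values);
    out
}

fn main() {
    // 10 values span two packs; the second pack carries 6 padding bits
    let values = decode_bitpacked(&[0b1010_1010, 0b0000_0011], 10);
    assert_eq!(values.len(), 10);
}
```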

src/io/parquet/read/binary/basic.rs (outdated review thread, resolved)
```diff
 let length = std::cmp::min(pack_size, pack_remaining);

-let additional = remaining.min(length);
+let additional = pack_size.min(remaining);
```
danburkert
Contributor Author

Using the full page length instead of the remaining length is one of the bugs.

```diff
 let values_iterator = values_iter(indices_buffer, dict, additional);

 let mut validity_iterator = hybrid_rle::Decoder::new(validity_buffer, 1);

 extend_from_decoder(
     validity,
     &mut validity_iterator,
-    length,
+    additional,
```
danburkert
Contributor Author

Parameter mixup: extend_from_decoder expects the additional length, but this was passing the total length.
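To make the contract concrete, here is a hedged, simplified stand-in (not arrow2's actual extend_from_decoder signature): the callee appends `additional` new values, so it must receive the count of values to add rather than the page total, and a debug_assert! of the kind added in this PR catches any mismatch:

```rust
/// Simplified stand-in for a validity-extending decoder helper: appends
/// exactly `additional` new bits to `validity`.
fn extend_validity(
    validity: &mut Vec<bool>,
    decoder: &mut impl Iterator<Item = bool>,
    additional: usize,
) {
    let before = validity.len();
    validity.extend(decoder.take(additional));
    // post-condition: exactly `additional` values were appended
    debug_assert_eq!(validity.len(), before + additional);
}
```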

@tustvold
Contributor

tustvold commented Jan 26, 2022

FWIW on the surface this sounds similar to apache/arrow-rs#1111

@jorgecarleitao
Owner

jorgecarleitao commented Jan 27, 2022

Great fix! Thanks!

My understanding is that column.num_values() can't reliably be used to estimate the capacity for nested types. IMO this is a consequence of an ambiguous spec.

This has shown up a few times before.

My understanding is that for nested types num_values is 15 for the following list array:

[[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]

which overshoots the number of values in the primitive array ([0, 1, 2, None, 3, 4, 5, 6, 7, 8, 9, 10].len() = 12). The equality in the debug_assert is only valid for non-nested types.
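A worked sketch of the counting argument (plain Rust, not the arrow2 API): parquet stores one repetition/definition level pair per leaf slot, and a null or empty list still occupies one slot, so num_values is 15 while the primitive array holds only 12 slots:

```rust
fn main() {
    // the list array from the comment above
    let list: Vec<Option<Vec<Option<i32>>>> = vec![
        Some(vec![Some(0), Some(1)]),
        None,
        Some(vec![Some(2), None, Some(3)]),
        Some(vec![Some(4), Some(5), Some(6)]),
        Some(vec![]),
        Some(vec![Some(7), Some(8), Some(9)]),
        None,
        Some(vec![Some(10)]),
    ];

    // one level pair per element; null and empty lists still contribute one
    let num_values: usize = list
        .iter()
        .map(|l| l.as_ref().map_or(1, |v| v.len().max(1)))
        .sum();

    // the primitive (leaf) array only stores the flattened slots
    let primitive_len: usize = list.iter().flatten().map(|v| v.len()).sum();

    assert_eq!(num_values, 15);
    assert_eq!(primitive_len, 12);
}
```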

I suggest we move the debug_assert inside the non-nested branch and document why it holds in that case (IMO it should raise an error, since data loss should be an error, but we can leave that for another PR; there is currently a mix of errors and panics in this code base).

@codecov

codecov bot commented Jan 27, 2022

Codecov Report

Merging #791 (bf675a2) into main (c80b39d) will increase coverage by 0.02%.
The diff coverage is 87.09%.


```diff
@@            Coverage Diff             @@
##             main     #791      +/-   ##
==========================================
+ Coverage   71.05%   71.07%   +0.02%
==========================================
  Files         319      319
  Lines       16667    16672       +5
==========================================
+ Hits        11843    11850       +7
+ Misses       4824     4822       -2
```
| Impacted Files | Coverage Δ |
|---|---|
| src/io/parquet/read/binary/basic.rs | 62.00% <25.00%> (-1.11%) ⬇️ |
| src/io/parquet/read/primitive/basic.rs | 56.06% <50.00%> (-0.66%) ⬇️ |
| src/io/parquet/read/binary/mod.rs | 48.14% <100.00%> (+4.14%) ⬆️ |
| src/io/parquet/read/boolean/mod.rs | 76.19% <100.00%> (+2.50%) ⬆️ |
| src/io/parquet/read/fixed_size_binary/mod.rs | 75.75% <100.00%> (+0.75%) ⬆️ |
| src/io/parquet/read/fixed_size_binary/utils.rs | 55.55% <100.00%> (+5.55%) ⬆️ |
| src/io/parquet/read/primitive/mod.rs | 43.90% <100.00%> (+2.87%) ⬆️ |
| src/io/parquet/read/utils.rs | 73.13% <100.00%> (-0.40%) ⬇️ |
| src/bitmap/mutable.rs | 90.56% <0.00%> (+0.75%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c80b39d...bf675a2.

@danburkert
Contributor Author

Got it, thanks for the explanation. I've gone ahead and moved the checks to within the non-nested branch.

@jorgecarleitao jorgecarleitao merged commit 601fa08 into jorgecarleitao:main Jan 27, 2022
@jorgecarleitao jorgecarleitao added the bug Something isn't working label Jan 27, 2022
@jorgecarleitao jorgecarleitao changed the title from "Fix parquet binary dict page reading" to "Fixed reading parquet binary dict page" on Jan 27, 2022
@danburkert
Contributor Author

@jorgecarleitao one thing that occurred to me after having some time to digest your explanation: if the length/capacity used in those methods isn't accurate for nested columns, then the capacity reservations for the data and validity buffers are going to be too large. Obviously that won't cause bugs or errors, but it's perhaps something that could be optimized.
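A small hedged illustration of that over-reservation, reusing the numbers from the list example above (the decoding step is a stand-in, not arrow2 code):

```rust
fn main() {
    let num_values = 15; // page-header count for the nested example above
    let primitive_len = 12; // leaf values actually decoded
    let mut values: Vec<i32> = Vec::with_capacity(num_values);
    values.extend(std::iter::repeat(0).take(primitive_len)); // stand-in for decoding
    // at least 3 reserved slots are allocated but never written
    assert!(values.capacity() - values.len() >= 3);
}
```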

@danburkert
Contributor Author

And thanks for the quick help on this! 🙏

@jorgecarleitao
Owner

I agree. The reading of parquet is about to get an overhaul (#789), so I think we should revisit this after #789 lands.
