Support Parquet v2 format dictionary filtering (RLE_DICTIONARY) #12248
Comments
Relevant Presto code here
Thank you for working on it, @ryanrupp. I will take a look at the PRs.
@ryanrupp the PR is in the wrong repo. Could you send out a PR on top of the prestodb repo?
Hi @ryanrupp, I left some comments in: I think to make it work in the prestodb repo, we need to upgrade Parquet first. Ping me if you have any questions.
Fixed by dfcc669
When Parquet files are generated with the v2 format, RLE_DICTIONARY is used (instead of PLAIN_DICTIONARY for v1). My understanding is that some pages can be dictionary encoded, but if there are too many unique values and the dictionary grows too large, the writer eventually falls back to plain encoding (see #4778). Previously, with the old version of the Parquet library there was no access to EncodingStats, which is what reveals whether this fallback occurred. Therefore, #4779 fixed correctness by only applying dictionary filtering to PLAIN_DICTIONARY until the Parquet library could be updated and EncodingStats could be used. In #12247 parquet-mr is being upgraded to 1.10.0, which means EncodingStats can be used once that's merged. Specifically, EncodingStats.hasNonDictionaryEncodedPages can be checked. Here's the equivalent filtering handling in parquet-mr, which also has fallback logic for when EncodingStats isn't available (e.g. on a V1 file).
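The two-path decision described above (use EncodingStats when present, otherwise fall back to inspecting the column chunk's encoding set) can be sketched roughly as follows. This is an illustrative, self-contained model, not the real parquet-mr or Presto API: the PageEncoding enum and the allPagesDictionaryEncoded helpers are hypothetical stand-ins, and the fallback check is a simplification of what parquet-mr's DictionaryFilter actually does.

```java
import java.util.EnumSet;
import java.util.Set;

public class DictionaryFilterSketch {
    // Hypothetical stand-in for the Parquet encodings relevant here.
    enum PageEncoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY }

    // Path 1: EncodingStats is available (metadata written by parquet-mr >= 1.10.0).
    // The stats record exactly which encodings the data pages used, so it is
    // enough to check that no non-dictionary-encoded pages exist; this covers
    // the case where the writer fell back to plain encoding mid-write.
    static boolean allPagesDictionaryEncoded(boolean hasNonDictionaryEncodedPages) {
        return !hasNonDictionaryEncodedPages;
    }

    // Path 2: no EncodingStats (e.g. a V1 file from an older writer). Fall back
    // to the column chunk's overall encoding set: dictionary filtering is only
    // safe if a dictionary encoding is present and PLAIN is absent, since PLAIN
    // among the data-page encodings implies the dictionary fallback occurred.
    static boolean allPagesDictionaryEncoded(Set<PageEncoding> chunkEncodings) {
        boolean hasDictionary = chunkEncodings.contains(PageEncoding.PLAIN_DICTIONARY)
                || chunkEncodings.contains(PageEncoding.RLE_DICTIONARY);
        return hasDictionary && !chunkEncodings.contains(PageEncoding.PLAIN);
    }

    public static void main(String[] args) {
        // V2 file with stats: all pages dictionary encoded.
        System.out.println(allPagesDictionaryEncoded(false));
        // V1 file without stats: dictionary encoding fell back to plain.
        System.out.println(allPagesDictionaryEncoded(
                EnumSet.of(PageEncoding.PLAIN_DICTIONARY, PageEncoding.PLAIN)));
    }
}
```

In real parquet-mr the fallback path additionally ignores encodings used only for repetition/definition levels before checking for PLAIN; the sketch omits that detail for brevity.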