Support Parquet v2 format dictionary filtering (RLE_DICTIONARY) #12248

Closed
ryanrupp opened this issue Jan 18, 2019 · 5 comments

Comments

@ryanrupp
Contributor

ryanrupp commented Jan 18, 2019

Parquet files generated with the v2 format use RLE_DICTIONARY for dictionary encoding (instead of PLAIN_DICTIONARY, which v1 uses). My understanding is that some pages can be dictionary encoded, but if there are too many unique values and the dictionary grows too large, the writer eventually falls back to plain encoding (see #4778). With the old version of the Parquet library there was no access to EncodingStats, so there was no way to detect whether this fallback had occurred. Therefore #4779 fixed correctness by restricting dictionary filtering to PLAIN_DICTIONARY until the Parquet library could be updated and EncodingStats could be used. #12247 upgrades parquet-mr to 1.10.0, which means EncodingStats can be used once it's merged. Specifically, EncodingStats.hasNonDictionaryEncodedPages can be checked.

Here's the equivalent filtering logic in parquet-mr, which also has fallback logic for when EncodingStats isn't available (e.g. on a V1 file).
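
For readers without the link handy, the check described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the parquet-mr or Presto code: the Encoding enum, the PageEncodingStats class, and the canUseDictionaryFiltering helper are hypothetical stand-ins (only the method name hasNonDictionaryEncodedPages mirrors the real EncodingStats API mentioned in this issue). It assumes that, without stats, only a v1-style chunk whose data encodings consist solely of PLAIN_DICTIONARY can be trusted as fully dictionary encoded.

```java
import java.util.EnumSet;
import java.util.Set;

class DictionaryFallbackSketch {
    // Hypothetical stand-in for the Parquet encodings relevant to this check.
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY, RLE, BIT_PACKED }

    // Hypothetical stand-in for per-column-chunk page encoding statistics.
    static final class PageEncodingStats {
        private final int dictionaryEncodedPages;
        private final int otherDataPages;

        PageEncodingStats(int dictionaryEncodedPages, int otherDataPages) {
            this.dictionaryEncodedPages = dictionaryEncodedPages;
            this.otherDataPages = otherDataPages;
        }

        boolean hasNonDictionaryEncodedPages() {
            return otherDataPages > 0;
        }
    }

    // Returns true if some data page may not be dictionary encoded.
    static boolean hasNonDictionaryPages(PageEncodingStats stats, Set<Encoding> encodings) {
        if (stats != null) {
            // Stats are available: they say directly whether a fallback happened.
            return stats.hasNonDictionaryEncodedPages();
        }
        // No stats: inspect the chunk's encoding list instead.
        EnumSet<Encoding> remaining = EnumSet.noneOf(Encoding.class);
        remaining.addAll(encodings);
        if (remaining.remove(Encoding.PLAIN_DICTIONARY)) {
            // RLE and BIT_PACKED only encode repetition/definition levels here.
            remaining.remove(Encoding.RLE);
            remaining.remove(Encoding.BIT_PACKED);
            // Anything left over means some page fell back to another encoding.
            return !remaining.isEmpty();
        }
        // Either not dictionary encoded, or RLE_DICTIONARY without stats:
        // a fallback cannot be ruled out, so be conservative.
        return true;
    }

    // Dictionary filtering is only safe when every data page uses the dictionary.
    static boolean canUseDictionaryFiltering(PageEncodingStats stats, Set<Encoding> encodings) {
        return !hasNonDictionaryPages(stats, encodings);
    }

    public static void main(String[] args) {
        Set<Encoding> v2Chunk = EnumSet.of(Encoding.RLE_DICTIONARY, Encoding.RLE);
        System.out.println(canUseDictionaryFiltering(null, v2Chunk));
        System.out.println(canUseDictionaryFiltering(new PageEncodingStats(3, 0), v2Chunk));
    }
}
```

The point of the sketch is the ordering: stats are consulted first, and the encoding-list heuristic is only a fallback that stays conservative for RLE_DICTIONARY.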

@ryanrupp
Contributor Author

Relevant Presto code here

@zhenxiao
Collaborator

Thank you for working on it, @ryanrupp. I will take a look at the PRs.
CC @nezihyigitbasi @highker

@highker
Contributor

highker commented Feb 20, 2019

@ryanrupp the PR is in the wrong repo. Could you send out a PR on top of prestodb repo?

@zhenxiao
Collaborator

Hi @ryanrupp I left some comments in:
trinodb/trino#251

I think that to make it work in the prestodb repo, we need to upgrade Parquet first:
#12247

Ping me if you have any questions

@ryanrupp
Contributor Author

ryanrupp commented May 1, 2019

Fixed by dfcc669

@ryanrupp ryanrupp closed this as completed May 1, 2019