Support Parquet v2 format dictionary filtering (RLE_DICTIONARY) #12248
Comments
Relevant Presto code here
Thank you for working on it, @ryanrupp. I will take a look at the PRs.
@ryanrupp the PR is in the wrong repo. Could you send out a PR on top of the prestodb repo?
Hi @ryanrupp, I left some comments in: I think to make it work in the prestodb repo, we need to upgrade Parquet first. Ping me if you have any questions.
Fixed by dfcc669
When Parquet files are generated with the v2 format, RLE_DICTIONARY is used (instead of PLAIN_DICTIONARY for v1). My understanding is that some pages can be dictionary encoded, but if there are too many unique values and the dictionary grows too large, the writer eventually falls back to plain encoding (see #4778). Previously, with the old version of the Parquet library there was no access to EncodingStats, which is what reveals whether this fallback occurred. Therefore, #4779 fixed correctness by only applying dictionary filtering to PLAIN_DICTIONARY until the Parquet library could be updated and EncodingStats could be used. In #12247 parquet-mr is being upgraded to 1.10.0, which means EncodingStats can be used once that's merged. Specifically, EncodingStats.hasNonDictionaryEncodedPages can be checked. Here's the equivalent filtering handling in parquet-mr, which also has fallback logic for when EncodingStats isn't available (e.g. on a V1 file).
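The two-path decision described above (use EncodingStats when present, otherwise fall back to inspecting the column chunk's encoding set) can be sketched roughly as follows. This is an illustrative, self-contained model, not the real parquet-mr or Presto API: the PageEncoding enum and the allPagesDictionaryEncoded helpers are hypothetical stand-ins, and the fallback check is a simplification of what parquet-mr's DictionaryFilter actually does.

```java
import java.util.EnumSet;
import java.util.Set;

public class DictionaryFilterSketch {
    // Hypothetical stand-in for the Parquet encodings relevant here.
    enum PageEncoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY }

    // Path 1: EncodingStats is available (metadata written by parquet-mr >= 1.10.0).
    // The stats record exactly which encodings the data pages used, so it is
    // enough to check that no non-dictionary-encoded pages exist; this covers
    // the case where the writer fell back to plain encoding mid-write.
    static boolean allPagesDictionaryEncoded(boolean hasNonDictionaryEncodedPages) {
        return !hasNonDictionaryEncodedPages;
    }

    // Path 2: no EncodingStats (e.g. a V1 file from an older writer). Fall back
    // to the column chunk's overall encoding set: dictionary filtering is only
    // safe if a dictionary encoding is present and PLAIN is absent, since PLAIN
    // among the data-page encodings implies the dictionary fallback occurred.
    static boolean allPagesDictionaryEncoded(Set<PageEncoding> chunkEncodings) {
        boolean hasDictionary = chunkEncodings.contains(PageEncoding.PLAIN_DICTIONARY)
                || chunkEncodings.contains(PageEncoding.RLE_DICTIONARY);
        return hasDictionary && !chunkEncodings.contains(PageEncoding.PLAIN);
    }

    public static void main(String[] args) {
        // V2 file with stats: all pages dictionary encoded.
        System.out.println(allPagesDictionaryEncoded(false));
        // V1 file without stats: dictionary encoding fell back to plain.
        System.out.println(allPagesDictionaryEncoded(
                EnumSet.of(PageEncoding.PLAIN_DICTIONARY, PageEncoding.PLAIN)));
    }
}
```

In real parquet-mr the fallback path additionally ignores encodings used only for repetition/definition levels before checking for PLAIN; the sketch omits that detail for brevity.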