Incorrect results with corrupt binary statistics with the new Parquet reader #4732

Open
nezihyigitbasi opened this issue Mar 7, 2016 · 9 comments

7 participants
@nezihyigitbasi
Contributor

commented Mar 7, 2016

Corrupt statistics for the Parquet Binary type (PARQUET-251) affect the experimental Parquet reader when predicate pushdown is enabled. Apparently the existing change for handling incorrect binary statistics does not cover all cases. Maybe it's time to upgrade to the latest stable Parquet release (v1.8.1), which includes a proper fix.

cc @zhenxiao
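
To illustrate why wrong min/max statistics turn into wrong query results rather than just slow queries, here is a minimal Python sketch of statistics-based row-group pruning. It is not Presto's or Parquet's code: `RowGroup`, `may_contain`, and `scan_equals` are hypothetical names, and the corrupted maximum is invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    values: List[bytes]  # actual binary column values stored in the row group
    stat_min: bytes      # minimum recorded in the (possibly corrupt) statistics
    stat_max: bytes      # maximum recorded in the (possibly corrupt) statistics


def may_contain(group: RowGroup, target: bytes) -> bool:
    """Pushdown check: trust the recorded min/max instead of reading the data."""
    return group.stat_min <= target <= group.stat_max


def scan_equals(groups: List[RowGroup], target: bytes, use_pushdown: bool) -> List[bytes]:
    """Return all values equal to target, optionally skipping row groups by statistics."""
    hits = []
    for group in groups:
        if use_pushdown and not may_contain(group, target):
            continue  # row group is skipped without being read
        hits.extend(v for v in group.values if v == target)
    return hits


good = RowGroup([b"apple", b"kiwi"], stat_min=b"apple", stat_max=b"kiwi")
# Hypothetical corruption: the recorded max is smaller than a value actually stored.
bad = RowGroup([b"banana", b"zebra"], stat_min=b"banana", stat_max=b"cherry")

full = scan_equals([good, bad], b"zebra", use_pushdown=False)
pruned = scan_equals([good, bad], b"zebra", use_pushdown=True)
```

With pushdown disabled the scan finds the matching row; with pushdown enabled the row group holding it is skipped because its recorded range excludes the value, so the row is silently dropped from the result.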

@nezihyigitbasi changed the title from "Incorrect Binary Parquet statistics affect the new reader" to "Incorrect results with corrupt binary statistics with the new Parquet reader" on Mar 7, 2016

@zhenxiao

Contributor

commented Mar 8, 2016

Yes, exactly. We need to upgrade to a newer Parquet version, which uses the org.apache namespace instead of the old com.twitter namespace. presto-hive-apache needs to be updated to 1.8.1 first; then we can upgrade the Presto side. What do you think? @dain @electrum

@xubo245

Contributor

commented Apr 15, 2016

I am seeing these warnings in ADAM 0.18.3:
2016-4-15 14:24:31 WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?\(build ?(.*)\)
How can I solve this?
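
The parse failure in the warning above can be reproduced in isolation. The sketch below is Python, not the Parquet-MR code; it assumes the format quoted in the message is the regular expression `(.+) version ((.*) )?\(build ?(.*)\)`, and the well-formed `created_by` string is a made-up example.

```python
import re

# Assumed reconstruction of the format string quoted in the warning.
CREATED_BY_FORMAT = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")


def parse_created_by(created_by: str):
    """Return (application, version, build), or None if the whole string does not match."""
    match = CREATED_BY_FORMAT.fullmatch(created_by)
    if match is None:
        return None
    # Group 2 is the optional "version + space" wrapper; group 3 is the version itself.
    return match.group(1), match.group(3), match.group(4)


# The value from the warning: application name only, no version or build part.
from_warning = parse_created_by("parquet-mr")

# Hypothetical well-formed value, for comparison.
well_formed = parse_created_by("parquet-mr version 1.8.1 (build abc123)")
```

A bare `parquet-mr` has no ` version ... (build ...)` part, so the writer version cannot be determined and, per the warning, the statistics are ignored.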

@zhenxiao

Contributor

commented Apr 15, 2016

@xubo245 Are you using the new Parquet reader or the old Parquet reader in Presto? The stack trace you sent seems to come from a bug in Parquet-MR, which the old Parquet reader depends on. The new Parquet reader bypasses the messy Parquet-MR code and does all the reading in Presto, with lots of optimizations. You can enable it by setting these session properties:
set session parquet_optimized_reader_enabled=true
set session parquet_predicate_pushdown_enabled=true

@xubo245

Contributor

commented Apr 15, 2016

@zhenxiao Sorry, I don't use Presto. I use the ADAM system: https://github.com/bigdatagenomics/adam
I don't know which Parquet version it uses. In ADAM: <parquet.version>1.4.3</parquet.version>

@shardnit

commented Jun 29, 2016

@zhenxiao Were those Parquet session properties removed in newer versions of presto/presto-cli? I am running 0.149 and neither property is found:
Unknown session property parquet_optimized_reader_enabled
Unknown session property parquet_predicate_pushdown_enabled

@zhenxiao

Contributor

commented Jun 29, 2016

@shardnit These are session properties of the Hive connector. You can run show session to list all the available session properties:

presto:luoz> show session;

Then set them with the connector prefix:

set session hive.parquet_optimized_reader_enabled=true;
set session hive.parquet_predicate_pushdown_enabled=true;

@electrum

Contributor

commented Jun 29, 2016

The Adam project is not related to Presto. Please file an issue with that project.

@ducky427

commented Mar 2, 2017

+1

@ryanrupp

Contributor

commented Mar 15, 2018

If this issue ends up being the general issue for upgrading the Parquet library version, I will note that I ran into an exception here. The latest version of Parquet-MR (1.9.0), which I was using to generate Parquet files, added support for delta encoding on int64 (in addition to int32); see apache/parquet-mr#154.

In the meantime, I worked around it by disabling that encoding on the int64 column.
