Fix bugs in New Parquet Reader #4000

zhenxiao · 2015-11-20T02:41:56Z

Fix 3 things in the new Parquet reader:

1. ParquetReader reads incorrect row group metadata
2. LazyRead does not perform the necessary skips
3. ParquetReader and ParquetFileReader can be merged into one class
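
To illustrate the lazy-read item, here is a minimal sketch of why skips are needed: if a batch is advanced past without its lazy block ever being loaded, the underlying stream still has to move past those rows before the next load. The class, field, and method names (LazyColumnSketch, readOffset, nextBatchSize, prepareNextRead, load) are hypothetical and operate on an in-memory array; this is not the code in this PR.

```java
import java.util.Arrays;

// Hypothetical lazy column reader over an in-memory array (illustration only)
class LazyColumnSketch
{
    private final long[] values;
    private int position;      // index of the next value the stream would decode
    private int readOffset;    // rows from unloaded batches that still must be skipped
    private int nextBatchSize; // size of the batch that a load would return

    LazyColumnSketch(long[] values)
    {
        this.values = values.clone();
    }

    // Called once per batch, whether or not the column ends up being loaded
    void prepareNextRead(int batchSize)
    {
        // If the previous batch was never loaded, its rows become pending skips
        readOffset += nextBatchSize;
        nextBatchSize = batchSize;
    }

    // Called only when the lazy block is actually accessed
    long[] load()
    {
        // Apply the pending skips before decoding the current batch
        position += readOffset;
        readOffset = 0;
        long[] batch = Arrays.copyOfRange(values, position, position + nextBatchSize);
        position += nextBatchSize;
        nextBatchSize = 0;
        return batch;
    }
}
```

Without the readOffset bookkeeping, the second batch in a sequence like prepareNextRead(3), prepareNextRead(3), load() would return the rows of the first batch.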

dain · 2015-11-20T20:36:19Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/reader/ParquetFileReader.java

@@ -80,7 +68,9 @@ public ParquetColumnChunkPageReader readColumn(ColumnDescriptor columnDescriptor
    {
        checkArgument(currentBlockMetadata.getRowCount() > 0, "Row group having 0 rows");

-        ColumnChunkMetaData metadata = columnMetadata.get(columnDescriptor);
+        ColumnChunkMetaData metadata = getColumnChunkMetaData(columnDescriptor);
+        checkArgument(metadata != null, "Could not find column metadata in the parquet file");


Why not put this check in getColumnChunkMetaData, so that the method never returns null?

Should this be some kind of Presto corruption exception (like we have for ORC)?
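
A minimal sketch of the suggestion, assuming the lookup can be modeled as a map from a column path to its metadata. The names SketchCorruptionException and ColumnMetadataLookup are hypothetical stand-ins, and strings replace the real ColumnDescriptor and ColumnChunkMetaData types.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Parquet corruption exception; not the Presto class
class SketchCorruptionException extends IOException
{
    SketchCorruptionException(String message)
    {
        super(message);
    }
}

class ColumnMetadataLookup
{
    // Column paths and metadata are modeled as plain strings for illustration
    private final Map<String, String> columnMetadata;

    ColumnMetadataLookup(Map<String, String> columnMetadata)
    {
        this.columnMetadata = new HashMap<>(columnMetadata);
    }

    // Never returns null: a missing entry is reported as file corruption
    String getColumnChunkMetaData(String columnPath)
            throws SketchCorruptionException
    {
        String metadata = columnMetadata.get(columnPath);
        if (metadata == null) {
            throw new SketchCorruptionException("Could not find column metadata in the parquet file: " + columnPath);
        }
        return metadata;
    }
}
```

With the check inside the lookup, callers such as readColumn no longer need their own null check.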

zhenxiao · 2015-11-21T00:17:45Z

Thank you @dain, comments addressed.
I also found there is no need to index ParquetColumnReader by both rowGroupNumber and ColumnDescriptor; indexing by ColumnDescriptor alone is enough.

zhenxiao · 2015-11-24T00:55:20Z

@dain I updated this PR with more fixes in the new Parquet reader. Your comments and suggestions are appreciated.

zhenxiao · 2015-11-24T02:17:28Z

Also merged the test case PR into this one.

zhenxiao · 2015-12-03T01:25:49Z

Add another bug fix, following @electrum's handling of missing columns, where columns were added to the metastore after the table data was written:
e4b6ee5

Also fix the handling of missing columns in the new Parquet Reader

dain · 2016-02-09T20:33:34Z

I'm repeating this comment from one of the commits....

For all of the Parquet code, can you take a look at calls like checkArgument and, if they are actually checks for file corruption (as opposed to programming errors), switch them to ParquetCorruptionException?
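
A rough sketch of the distinction, assuming a row count read from the file footer is data that can be corrupt, while a batch size passed by a caller is a programming concern. SketchParquetCorruptionException and RowGroupChecks are hypothetical names, and a plain IllegalArgumentException stands in for what checkArgument would throw.

```java
import java.io.IOException;

// Hypothetical stand-in; the real ParquetCorruptionException may differ
class SketchParquetCorruptionException extends IOException
{
    SketchParquetCorruptionException(String messageFormat, Object... args)
    {
        super(String.format(messageFormat, args));
    }
}

class RowGroupChecks
{
    // Programming error: the value comes from the calling code
    static void checkBatchSize(int batchSize)
    {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
    }

    // File corruption: the value comes from the file being read
    static void validateRowCount(long rowCount)
            throws SketchParquetCorruptionException
    {
        if (rowCount <= 0) {
            throw new SketchParquetCorruptionException("Row group has %s rows", rowCount);
        }
    }
}
```

Separating the two lets callers report a bad file to the user as a data problem instead of as an internal error.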

dain · 2016-02-09T20:36:10Z

Additionally, look at all uses of the new ParquetCorruptionException and remove unnecessary calls to toString on the args.
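
A small sketch of why the toString calls are redundant, assuming the exception builds its message from a format string and arguments in the style of String.format. The helper formatMessage is hypothetical and only mimics that assumed behavior.

```java
class CorruptionMessageSketch
{
    // Assumed shape: the message is produced from a format string plus arguments
    static String formatMessage(String messageFormat, Object... args)
    {
        return String.format(messageFormat, args);
    }

    // Redundant: %s already converts the argument to a string
    static String verbose(StringBuilder column)
    {
        return formatMessage("Invalid column: %s", column.toString());
    }

    // Equivalent and shorter: pass the object itself
    static String concise(StringBuilder column)
    {
        return formatMessage("Invalid column: %s", column);
    }
}
```

Both methods produce the same message, so the explicit conversion only adds noise.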

dain · 2016-02-09T20:37:42Z

This PR seems to have this. on most uses of class fields. Generally, we don't add an unnecessary this. to field accesses unless it helps with readability (you see this in some constructors).
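
As an illustration of that convention, here is a sketch with a hypothetical class: the qualifier is kept in the constructor, where parameters shadow the fields, and dropped elsewhere.

```java
class RowGroupInfoSketch
{
    private final long rowCount;
    private final String path;

    RowGroupInfoSketch(long rowCount, String path)
    {
        // "this." is needed here to distinguish the fields from the parameters
        this.rowCount = rowCount;
        this.path = path;
    }

    long getRowCount()
    {
        // Nothing shadows the field, so plain access reads fine
        return rowCount;
    }

    String describe()
    {
        return path + " (" + rowCount + " rows)";
    }
}
```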

dain · 2016-02-09T20:46:44Z

Can you scan through all the commits and fix any place where the method arguments are aligned, instead of using two indents (8 spaces)?
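
The two layouts can be compared side by side in the sketch below. The class and method names are hypothetical, and both methods behave identically; only the wrapping of the parameter list differs.

```java
class ArgumentIndentSketch
{
    // Layout to fix: continuation lines aligned with the first argument
    static long alignedStyle(long rowGroupOffset,
                             long columnOffset,
                             long pageOffset)
    {
        return rowGroupOffset + columnOffset + pageOffset;
    }

    // Requested layout: continuation lines use two indents (8 spaces)
    static long doubleIndentStyle(
            long rowGroupOffset,
            long columnOffset,
            long pageOffset)
    {
        return rowGroupOffset + columnOffset + pageOffset;
    }
}
```

The double-indent form stays stable when the method is renamed, whereas aligned arguments have to be re-indented.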

dain · 2016-02-09T21:31:25Z

For the new AbstractTestParquetReader code, the goal should be to test data streams that cover all of the various encodings in Parquet. My guess is you don't need cases like testLongStrideDictionary and testLongPatchedBase, since these are specifically designed for funky ORC encodings. Additionally, it is possible you are missing some cases that would stress your code. I used the code coverage tool in IntelliJ to verify I had cases for all the ORC encodings.

dain · 2016-02-09T21:32:37Z

My comments are mostly style/formatting. Let me know when this is updated and I'll get it in.

zhenxiao · 2016-02-10T16:21:09Z

Thank you so much @dain.
I have addressed the comments.

dain · 2016-02-10T20:56:39Z

I will land this after the next release goes out (they are working on it now).

dain · 2016-02-16T19:51:12Z

Merged, thanks!

facebook-github-bot added the CLA Signed label Nov 20, 2015

dain self-assigned this Nov 20, 2015

dain reviewed Nov 20, 2015
dain added the changes-requested label Nov 20, 2015

dain removed their assignment Nov 20, 2015

zhenxiao force-pushed the parquet-row-group branch 2 times, most recently from 1f9f915 to 5b4f318 Compare November 21, 2015 00:16

zhenxiao changed the title ~~Fix reading row group in New Parquet Reader~~ Fix bugs in New Parquet Reader Nov 24, 2015

zhenxiao force-pushed the parquet-row-group branch from f0f650f to 779b0bb Compare November 24, 2015 02:11

zhenxiao mentioned this pull request Nov 24, 2015

Add testcases for reading Parquet #3716

Closed

dain self-assigned this Dec 3, 2015

zhenxiao mentioned this pull request Jan 20, 2016

Minor Parquet reader cleanup #4371

Closed

dain removed the changes-requested label Feb 9, 2016

dain added the changes-requested label Feb 9, 2016

zhenxiao added 3 commits February 9, 2016 15:01

Fix reading row group in New Parquet Reader

6de024d

Fix LazyRead in Parquet: Skip values

954a0c5

Merge ParquetFileReader into ParquetReader

5ee1d8e

zhenxiao added 5 commits February 9, 2016 15:33

Fix handling of missing statistics in TupleDomainParquetPredicate

26121ce

Add Testcases for New Parquet Reader

b957c99

Add ParquetDataSource, not to use HDFS FileSystem directly

3a67192

Fix handling of missing columns in Parquet Reader

d440d39

Not filtering column if there is any exception when reading dictionary

a9d2c18

zhenxiao force-pushed the parquet-row-group branch 2 times, most recently from 6dcd646 to 622fe5a Compare February 10, 2016 16:10

zhenxiao force-pushed the parquet-row-group branch 2 times, most recently from f08a1dc to ed843c0 Compare February 10, 2016 19:42

zhenxiao added 2 commits February 10, 2016 12:06

Ignore corrupted statistics in Parquet

617d073

Parquet code cleanup

eddb174

zhenxiao force-pushed the parquet-row-group branch from ed843c0 to eddb174 Compare February 10, 2016 20:09

dain added accepted and removed changes-requested labels Feb 10, 2016

dain closed this Feb 16, 2016

zhenxiao deleted the parquet-row-group branch February 17, 2016 18:50
