Fix bugs in New Parquet Reader #4000

Closed
wants to merge 10 commits into
from

Conversation

Projects
None yet
3 participants
@zhenxiao
Contributor

zhenxiao commented Nov 20, 2015

Fix 3 things in New Parquet Reader:

  1. ParquetReader reading incorrect row group metadata
  2. LazyRead does not do necessary skips
  3. ParquetReader and ParquetFileReader could be merged into one class

@dain dain self-assigned this Nov 20, 2015

@dain

View changes

...main/java/com/facebook/presto/hive/parquet/reader/ParquetFileReader.java
@@ -80,7 +68,9 @@ public ParquetColumnChunkPageReader readColumn(ColumnDescriptor columnDescriptor
{
checkArgument(currentBlockMetadata.getRowCount() > 0, "Row group having 0 rows");
- ColumnChunkMetaData metadata = columnMetadata.get(columnDescriptor);
+ ColumnChunkMetaData metadata = getColumnChunkMetaData(columnDescriptor);
+ checkArgument(metadata != null, "Could not find column metadata in the parquet file");

This comment has been minimized.

@dain

dain Nov 20, 2015

Contributor

why not put this in getColumnChunkMetaData and then the method would never return null

@dain

dain Nov 20, 2015

Contributor

why not put this in getColumnChunkMetaData and then the method would never return null

This comment has been minimized.

@dain

dain Nov 20, 2015

Contributor

Should this be some kind of Presto corruption exception (like we have for ORC)?

@dain

dain Nov 20, 2015

Contributor

Should this be some kind of Presto corruption exception (like we have for ORC)?

@dain

View changes

...src/main/java/com/facebook/presto/hive/parquet/reader/ParquetReader.java
@@ -45,7 +46,8 @@
private long currentPosition;
private long currentGroupRowCount;
private long nextRowInGroup;
- private Map<ColumnDescriptor, ParquetColumnReader> columnReadersMap = new HashMap<>();
+ private long currentRowGroup;
+ private Map<ColumnIndex, ParquetColumnReader> columnReadersMap = new HashMap<>();

This comment has been minimized.

@dain

dain Nov 20, 2015

Contributor

For a big file will this map become huge? If so, maybe clear the map in advanceToNextRowGroup

@dain

dain Nov 20, 2015

Contributor

For a big file will this map become huge? If so, maybe clear the map in advanceToNextRowGroup

@dain

View changes

...src/main/java/com/facebook/presto/hive/parquet/reader/ParquetReader.java
+ this.columnDescriptor = columnDescriptor;
+ }
+
+ public long getRowGroupNumber()

This comment has been minimized.

@dain

dain Nov 20, 2015

Contributor

If these aren't called, just remove them for now. We can always add them back later

@dain

dain Nov 20, 2015

Contributor

If these aren't called, just remove them for now. We can always add them back later

@dain

View changes

...src/main/java/com/facebook/presto/hive/parquet/reader/ParquetReader.java
+ }
+
+ ColumnIndex other = (ColumnIndex) obj;
+ return rowGroupNumber == other.getRowGroupNumber() &&

This comment has been minimized.

@dain

dain Nov 20, 2015

Contributor

use the direct fields of other

@dain

dain Nov 20, 2015

Contributor

use the direct fields of other

@dain dain removed their assignment Nov 20, 2015

@zhenxiao

This comment has been minimized.

Show comment
Hide comment
@zhenxiao

zhenxiao Nov 21, 2015

Contributor

thank you @dain , get comments addressed
Just found no need to make ParquetColumnReader indexed by both rowGroupNumber and ColumnDescriptor, just indexed by ColumnDescriptor is OK

Contributor

zhenxiao commented Nov 21, 2015

thank you @dain , get comments addressed
Just found no need to make ParquetColumnReader indexed by both rowGroupNumber and ColumnDescriptor, just indexed by ColumnDescriptor is OK

@zhenxiao zhenxiao changed the title from Fix reading row group in New Parquet Reader to Fix bugs in New Parquet Reader Nov 24, 2015

@zhenxiao

This comment has been minimized.

Show comment
Hide comment
@zhenxiao

zhenxiao Nov 24, 2015

Contributor

@dain I updated this PR with more fixes in the new Parquet Reader, your comments and suggestions are appreciated

Contributor

zhenxiao commented Nov 24, 2015

@dain I updated this PR with more fixes in the new Parquet Reader, your comments and suggestions are appreciated

@zhenxiao

This comment has been minimized.

Show comment
Hide comment
@zhenxiao

zhenxiao Nov 24, 2015

Contributor

also merge the testcase PR here

Contributor

zhenxiao commented Nov 24, 2015

also merge the testcase PR here

@zhenxiao

This comment has been minimized.

Show comment
Hide comment
@zhenxiao

zhenxiao Dec 3, 2015

Contributor

Add another bug fix, following @electrum 's handling of missing columns where columns were added to the metastore after table data was written:
e4b6ee5

Also fix the handling of missing columns in the new Parquet Reader

Contributor

zhenxiao commented Dec 3, 2015

Add another bug fix, following @electrum 's handling of missing columns where columns were added to the metastore after table data was written:
e4b6ee5

Also fix the handling of missing columns in the new Parquet Reader

@dain dain self-assigned this Dec 3, 2015

@dain dain removed the changes-requested label Feb 9, 2016

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 9, 2016

Contributor

I'm repeating this comment from one of the commits....

For all of the Parquet code, can you take a look at the calls like checkArgument and if they are actually checks for file corruption (as opposed to programming errors), switch them to ParquetCorruptionException.

Contributor

dain commented Feb 9, 2016

I'm repeating this comment from one of the commits....

For all of the Parquet code, can you take a look at the calls like checkArgument and if they are actually checks for file corruption (as opposed to programming errors), switch them to ParquetCorruptionException.

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 9, 2016

Contributor

Additionally, look at all uses of the new ParquetCorruptionException and remove unnecessary call to toString on the args

Contributor

dain commented Feb 9, 2016

Additionally, look at all uses of the new ParquetCorruptionException and remove unnecessary call to toString on the args

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 9, 2016

Contributor

This PR seems to have this. on most uses of class fields. Generally, we don't add an unnecessary this. to field accesses unless it helps with readability (you see this in some constructors).

Contributor

dain commented Feb 9, 2016

This PR seems to have this. on most uses of class fields. Generally, we don't add an unnecessary this. to field accesses unless it helps with readability (you see this in some constructors).

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 9, 2016

Contributor

Can you scan through all the commits and fix any place where the method arguments are aligned, instead of using two indents (8 spaces)?

Contributor

dain commented Feb 9, 2016

Can you scan through all the commits and fix any place where the method arguments are aligned, instead of using two indents (8 spaces)?

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 9, 2016

Contributor

For the new AbstractTestParquetReader code, the goal should be to test data streams that cover all of the various encodings in Parquet. My guess is you don't need cases like testLongStrideDictionary and testLongPatchedBase since these are specifically designed for funky ORC encodings. Additionally, it is possible you are missing some to stress your code. I'd used the code coverage tool in IntelliJ to verify I had cases for all the ORC encodings.

Contributor

dain commented Feb 9, 2016

For the new AbstractTestParquetReader code, the goal should be to test data streams that cover all of the various encodings in Parquet. My guess is you don't need cases like testLongStrideDictionary and testLongPatchedBase since these are specifically designed for funky ORC encodings. Additionally, it is possible you are missing some to stress your code. I'd used the code coverage tool in IntelliJ to verify I had cases for all the ORC encodings.

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 9, 2016

Contributor

My comments are mostly style/formatting. Let me know when this is updated and I'll get it in.

Contributor

dain commented Feb 9, 2016

My comments are mostly style/formatting. Let me know when this is updated and I'll get it in.

@zhenxiao

This comment has been minimized.

Show comment
Hide comment
@zhenxiao

zhenxiao Feb 10, 2016

Contributor

Thank u so much @dain
I get comments addressed

Contributor

zhenxiao commented Feb 10, 2016

Thank u so much @dain
I get comments addressed

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 10, 2016

Contributor

I will land this after the next release goes out (they are working on it now).

Contributor

dain commented Feb 10, 2016

I will land this after the next release goes out (they are working on it now).

@dain

This comment has been minimized.

Show comment
Hide comment
@dain

dain Feb 16, 2016

Contributor

Merged, thanks!

Contributor

dain commented Feb 16, 2016

Merged, thanks!

@dain dain closed this Feb 16, 2016

@zhenxiao zhenxiao deleted the zhenxiao:parquet-row-group branch Feb 17, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment