Fix bugs in New Parquet Reader #4000

Closed
wants to merge 10 commits into from

Collaborator

@zhenxiao zhenxiao commented Nov 20, 2015

Fix 3 things in New Parquet Reader:

  1. ParquetReader reading incorrect row group metadata
  2. LazyRead does not do necessary skips
  3. ParquetReader and ParquetFileReader could be merged into one class
@@ -80,7 +68,9 @@ public ParquetColumnChunkPageReader readColumn(ColumnDescriptor columnDescriptor
     {
         checkArgument(currentBlockMetadata.getRowCount() > 0, "Row group having 0 rows");
 
-        ColumnChunkMetaData metadata = columnMetadata.get(columnDescriptor);
+        ColumnChunkMetaData metadata = getColumnChunkMetaData(columnDescriptor);
         checkArgument(metadata != null, "Could not find column metadata in the parquet file");
Contributor

@dain dain Nov 20, 2015

Why not put this check in getColumnChunkMetaData? Then the method would never return null.

Contributor

@dain dain Nov 20, 2015

Should this be some kind of Presto corruption exception (like we have for ORC)?
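The two review suggestions above could be combined roughly as follows. This is only a sketch: `CorruptFileException` and `ColumnMetadataLookup` are hypothetical stand-ins, and plain strings replace the real `ColumnDescriptor` and `ColumnChunkMetaData` types.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a reader-specific corruption exception; not the actual Presto class.
class CorruptFileException
        extends IOException
{
    CorruptFileException(String message)
    {
        super(message);
    }
}

// Hypothetical holder; the real reader maps column descriptors to column chunk metadata.
class ColumnMetadataLookup
{
    private final Map<String, String> columnMetadata = new HashMap<>();

    void put(String columnPath, String metadata)
    {
        columnMetadata.put(columnPath, metadata);
    }

    // The null check lives inside the lookup, so callers never receive null.
    String getColumnChunkMetaData(String columnPath)
            throws CorruptFileException
    {
        String metadata = columnMetadata.get(columnPath);
        if (metadata == null) {
            // Missing metadata points at a bad file, not at a caller bug.
            throw new CorruptFileException("Could not find column metadata in the parquet file: " + columnPath);
        }
        return metadata;
    }
}
```

With this shape, callers such as `readColumn` no longer need their own `checkArgument(metadata != null, ...)`.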

@dain dain removed their assignment Nov 20, 2015
@zhenxiao zhenxiao force-pushed the parquet-row-group branch 2 times, most recently from 1f9f915 to 5b4f318 Nov 21, 2015
Collaborator Author

@zhenxiao zhenxiao commented Nov 21, 2015

Thank you @dain, comments addressed.
I also found there is no need to index ParquetColumnReader by both rowGroupNumber and ColumnDescriptor; indexing by ColumnDescriptor alone is enough.

@zhenxiao zhenxiao changed the title from "Fix reading row group in New Parquet Reader" to "Fix bugs in New Parquet Reader" Nov 24, 2015
Collaborator Author

@zhenxiao zhenxiao commented Nov 24, 2015

@dain I updated this PR with more fixes in the new Parquet Reader. Your comments and suggestions are appreciated.

@zhenxiao zhenxiao force-pushed the parquet-row-group branch from f0f650f to 779b0bb Nov 24, 2015
Collaborator Author

@zhenxiao zhenxiao commented Nov 24, 2015

I also merged the test case PR into this one.

Collaborator Author

@zhenxiao zhenxiao commented Dec 3, 2015

Added another bug fix, following @electrum's handling of missing columns (columns that were added to the metastore after the table data was written):
e4b6ee5

This also fixes the handling of missing columns in the new Parquet Reader.
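To illustrate the missing-column case, here is a minimal sketch. It assumes that a column present in the table schema but absent from an older data file should read as null. `MissingColumnExample` and `readRow` are hypothetical names, and this is not the code from the referenced commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the real reader works on column chunks, not on maps.
class MissingColumnExample
{
    // Returns one value per table column, in table order.
    static List<Object> readRow(List<String> tableColumns, Map<String, Object> fileRow)
    {
        List<Object> values = new ArrayList<>();
        for (String column : tableColumns) {
            if (fileRow.containsKey(column)) {
                values.add(fileRow.get(column));
            }
            else {
                // Column was added to the table after this file was written.
                values.add(null);
            }
        }
        return values;
    }
}
```

The point of the sketch is only that the reader is driven by the table's column list rather than by the file's, so a file written before the schema change can still be read.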

@dain dain self-assigned this Dec 3, 2015
Contributor

@dain dain commented Feb 9, 2016

I'm repeating this comment from one of the commits....

For all of the Parquet code, can you take a look at calls like checkArgument and, where they are actually checks for file corruption (as opposed to programming errors), switch them to ParquetCorruptionException?

Contributor

@dain dain commented Feb 9, 2016

Additionally, look at all uses of the new ParquetCorruptionException and remove unnecessary calls to toString on the arguments.
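A short sketch of why the explicit toString calls are redundant. It assumes the exception builds its message from a format string and arguments. `FormattedCorruptionException` is a hypothetical stand-in, so the real constructor may differ.

```java
import java.io.IOException;

// Hypothetical exception with a format-string constructor; an assumption about the real class.
class FormattedCorruptionException
        extends IOException
{
    FormattedCorruptionException(String messageFormat, Object... args)
    {
        super(String.format(messageFormat, args));
    }
}

class ToStringExample
{
    static String describe(Object column, long rowCount)
    {
        // %s already calls toString() on a non-null argument (and prints "null" for null),
        // so passing column.toString() adds nothing and can throw on a null column.
        return new FormattedCorruptionException("Column %s has invalid row count %s", column, rowCount).getMessage();
    }
}
```

Passing the object itself and passing `object.toString()` produce the same message for ordinary objects; only the first form is also safe when the argument is null.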

Contributor

@dain dain commented Feb 9, 2016

This PR seems to have this. on most uses of class fields. Generally, we don't add an unnecessary this. to field accesses unless it helps with readability (you see this in some constructors).
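As an illustration of that convention, a minimal sketch with a hypothetical class (`RowGroupCursor` is not part of the PR):

```java
// Hypothetical class used only to show where "this." helps and where it is noise.
class RowGroupCursor
{
    private final long rowCount;
    private long position;

    RowGroupCursor(long rowCount)
    {
        // "this." is needed here: the parameter shadows the field of the same name.
        this.rowCount = rowCount;
    }

    boolean advance()
    {
        // Nothing shadows the fields here, so plain names read better than this.position.
        if (position >= rowCount) {
            return false;
        }
        position++;
        return true;
    }
}
```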

Contributor

@dain dain commented Feb 9, 2016

Can you scan through all the commits and fix any place where the method arguments are aligned, instead of using two indents (8 spaces)?
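The difference between the two layouts, as one reading of the request above, sketched with hypothetical method names and parameters:

```java
// Hypothetical methods; both compute the same value and differ only in formatting.
class ArgumentWrapping
{
    // Style being removed: continuation lines aligned under the first argument.
    static long alignedStyle(long rowGroupStart,
                             long rowGroupLength,
                             long skipped)
    {
        return rowGroupStart + rowGroupLength - skipped;
    }

    // Requested style: break after the parenthesis and indent continuation lines by 8 spaces.
    static long twoIndentStyle(
            long rowGroupStart,
            long rowGroupLength,
            long skipped)
    {
        return rowGroupStart + rowGroupLength - skipped;
    }
}
```

One practical benefit of the second form is that renaming the method does not force every continuation line to be re-indented.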

Contributor

@dain dain commented Feb 9, 2016

For the new AbstractTestParquetReader code, the goal should be to test data streams that cover all of the various encodings in Parquet. My guess is you don't need cases like testLongStrideDictionary and testLongPatchedBase, since these are specifically designed for funky ORC encodings. Additionally, you may be missing some cases needed to stress your code. I used the code coverage tool in IntelliJ to verify I had cases for all the ORC encodings.

Contributor

@dain dain commented Feb 9, 2016

My comments are mostly style/formatting. Let me know when this is updated and I'll get it in.

@zhenxiao zhenxiao force-pushed the parquet-row-group branch 2 times, most recently from 6dcd646 to 622fe5a Feb 10, 2016
Collaborator Author

@zhenxiao zhenxiao commented Feb 10, 2016

Thank you so much @dain.
Comments addressed.

@zhenxiao zhenxiao force-pushed the parquet-row-group branch 2 times, most recently from f08a1dc to ed843c0 Feb 10, 2016
Contributor

@dain dain commented Feb 10, 2016

I will land this after the next release goes out (they are working on it now).

Contributor

@dain dain commented Feb 16, 2016

Merged, thanks!

@dain dain closed this Feb 16, 2016
@zhenxiao zhenxiao deleted the parquet-row-group branch Feb 17, 2016
3 participants