Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native parquet writer #2004

Open
wants to merge 19 commits into
base: master
from
Open

Conversation

@qqibrow
Copy link
Contributor

qqibrow commented Nov 12, 2019

fixes #1145

@cla-bot cla-bot bot added the cla-signed label Nov 12, 2019
@qqibrow qqibrow force-pushed the qqibrow:prestosql_parquet branch 2 times, most recently from bcfba52 to b3f2ad3 Nov 13, 2019
@dain dain requested review from dain and martint Nov 18, 2019
Copy link
Member

dain left a comment

Mostly minor comments. For future work:

  • I feel like there should be a simple way to do repetition and definition levels. The code it pretty abstract and likely not very fast, but I don't see an obvious simpler technique.
  • It would be good to have a simple after write validator like ORC has. It save a lot of problems when rolling out the ORC writer. The biggest thing it checks is a checksum of the columns to see if the data is the same. It also validates stats, which as we know has been a problem for the mainline parquet code.
}
}

private static BiConsumer<Block, Integer> getWriteFunction(ValuesWriter valuesWriter, Type type, Supplier<Statistics<?>> statisticsSupplier)

This comment has been minimized.

Copy link
@dain

dain Nov 29, 2019

Member

I would move the if statements to write and have three loops. It is faster to branch on the outside of the loop.

This comment has been minimized.

Copy link
@qqibrow

qqibrow Jan 3, 2020

Author Contributor

The if ... else if ... branch only runs once. This function returns BiConsumer<Block, Integer>.

This comment has been minimized.

Copy link
@dain

dain Jan 10, 2020

Member

I am referring to the caller of this in the write function above. In the function above, you have a for loop containing a call to a dynamic implementation. This will prevent the JVM from inlineing the implemetnation, which disables a bunch of JVM optimizations. Changing the write method to have three conditions, each containing the for loop should be much more efficient.

@qqibrow qqibrow force-pushed the qqibrow:prestosql_parquet branch from e0daeb6 to 818a4be Jan 3, 2020
@qqibrow qqibrow force-pushed the qqibrow:prestosql_parquet branch from 818a4be to f2c0a26 Jan 3, 2020
@dain dain self-requested a review Jan 7, 2020
Copy link
Member

dain left a comment

A couple of minor comments. This seems good enough for the first revision.

Can you squash the update commits?

}
}

private static BiConsumer<Block, Integer> getWriteFunction(ValuesWriter valuesWriter, Type type, Supplier<Statistics<?>> statisticsSupplier)

This comment has been minimized.

Copy link
@dain

dain Jan 10, 2020

Member

I am referring to the caller of this in the write function above. In the function above, you have a for loop containing a call to a dynamic implementation. This will prevent the JVM from inlineing the implemetnation, which disables a bunch of JVM optimizations. Changing the write method to have three conditions, each containing the for loop should be much more efficient.

return result;
}

private static void addToList(final List<SchemaElement> result, org.apache.parquet.schema.Type field)

This comment has been minimized.

Copy link
@dain

dain Jan 10, 2020

Member

I don't think the final is required here anymore. We don't put final on parameters (or local variables) unless necessary.

Util.writeFileMetaData(fileMetaData, dynamicSliceOutput);
}
catch (IOException e) {
throw new UncheckedIOException(e);

This comment has been minimized.

Copy link
@dain

dain Jan 10, 2020

Member

Actually we don't need to use UncheckedIOException here. The call chain is close -> writeFooter -> getFooter, and close throws IOEexception

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.