Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reading LZ4 and ZSTD-compresed parquet data #16960

Merged
merged 1 commit into from Nov 26, 2021

Conversation

shangxinli
Copy link
Contributor

@shangxinli shangxinli commented Nov 9, 2021

Add support for reading LZ4 and ZSTD when writing Hive tables.
Cherry-pick of trinodb/trino#910
Co-authored-by: Martin Traverso mtraverso@gmail.com

More details about ZSTD in Hive tables:
ZSTD is a more efficient compression mechanism, and in our production, we see ~7% reduction in storage size when converting from GZIP to ZSTD, and ~39% reduction from SNAPPY to ZSTD using default compression level (3).

Test Plan - Original commit in TrinoDB didn't have unit tests. Unit tests is added in this PR.
Tests of reading ZSTD files were done internally and also this change is used in production for a lot of tables.

== RELEASE NOTES ==

General Changes
* Add support for reading LZ4 and ZSTD-compressed parquet data. 

@shangxinli shangxinli force-pushed the zstd_new branch 2 times, most recently from 3b17c5c to ab195a2 Compare Nov 9, 2021
@shangxinli
Copy link
Contributor Author

@shangxinli shangxinli commented Nov 11, 2021

@beinan Can you have a look?

beinan
beinan approved these changes Nov 11, 2021
Copy link
Member

@beinan beinan left a comment

The code change looks good to me, could you update the release note part of this PR, thanks!

Just one clarification question, what if we upgraded presto-apache-hadoop2 to 2.9 and included the native zstd lib, do we need to update the decompressZstd in your PR?
@zhenxiao do you wanna take a second look?

@zhenxiao
Copy link
Collaborator

@zhenxiao zhenxiao commented Nov 12, 2021

looks good. @shangxinli is it a cherry-pick? if so, how about adding the cherry-pick original commit in commit message?

@shangxinli
Copy link
Contributor Author

@shangxinli shangxinli commented Nov 14, 2021

Thanks @beinan @zhenxiao! I just updated the release note and added a link to the original commit. Please let me know if that looks good.

@zhenxiao
Copy link
Collaborator

@zhenxiao zhenxiao commented Nov 15, 2021

thank you, @shangxinli
how about adding cherry-pick info in the commit message, e.g. 7ee814c

@shangxinli
Copy link
Contributor Author

@shangxinli shangxinli commented Nov 16, 2021

Got it @zhenxiao! Just updated it.

@beinan
Copy link
Member

@beinan beinan commented Nov 16, 2021

Got it @zhenxiao! Just updated it.

f.y.r the example of commit message

Fix OOM caused by foo in bar

Foo was pack ratting ByteBuffers causing OOM.

Cherry-pick of https://github.com/foo/bar/pull/123 (https://github.com/foo/bar/pull/123)

Co-authored-by: Foo Bar <foo@bar.com>

@shangxinli
Copy link
Contributor Author

@shangxinli shangxinli commented Nov 16, 2021

Just change it based on your suggestion @beinan

@beinan
Copy link
Member

@beinan beinan commented Nov 16, 2021

Just change it based on your suggestion @beinan

Hi @shangxinli , looks like you just change the message in this pull request, but you have not changed the commit message. You might need use git rebase or git commit --amend to update the commit message and push -f again. Thanks!

@shangxinli
Copy link
Contributor Author

@shangxinli shangxinli commented Nov 18, 2021

Just changed the commit message

@shangxinli shangxinli force-pushed the zstd_new branch 3 times, most recently from 5c74971 to 42bef53 Compare Nov 21, 2021
Cherry-pick of trinodb/trino#910
Co-authored-by: Martin Traverso mtraverso@gmail.com

More details about ZSTD in Hive tables:

ZSTD is a more efficient compression mechanism, and in our production, we see ~7% reduction in storage size when converting from GZIP to ZSTD, and ~39% reduction from SNAPPY to ZSTD using default compression level (3).

Test Plan - Original commit in TrinoDB didn't have unit tests. Unit tests is added in this PR.
Tests of reading ZSTD files were done internally and also this change is used in production for a lot of tables.
@shangxinli
Copy link
Contributor Author

@shangxinli shangxinli commented Nov 25, 2021

@beinan @zhenxiao Thanks for spending time on this PR. I just got all the builds/tests to succeed. Let me know if you have more comments.

@zhenxiao zhenxiao merged commit e17ad78 into prestodb:master Nov 26, 2021
40 checks passed
@ajaygeorge ajaygeorge mentioned this pull request Dec 13, 2021
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants