Add decryption functionality to presto. #17791

Conversation

shangxinli
Collaborator

@shangxinli shangxinli commented May 20, 2022

Co-authored-by: ggershinsky ggershinsky@users.noreply.github.com

Summary: This ports the parquet-mr decryption functionality. The main commits in parquet-mr for encryption/decryption are apache/parquet-java@65b95fb and several other fixes. This change ports only the decryption.

Test plan - (Please fill in how you tested your changes)

This feature was tested in the Uber environment and then rolled out to production for 2+ years.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

General Changes
* Add decryption functionality to Presto. When a Parquet file is encrypted following [Parquet Modular Encryption](https://github.com/apache/parquet-format/blob/master/Encryption.md), Presto can now decrypt it.

Hive Changes
* No flag is introduced. Presto-Hive now loads a DecryptionPropertiesFactory (implemented in parquet-mr), uses it to get the file decryptor, and passes the decryptor to presto-parquet.

@shangxinli
Collaborator Author

This is a rebased PR for #17728

@shangxinli
Collaborator Author

@zhenxiao @beinan After rebase, the PR is finally green now. Thanks.

Collaborator

@zhenxiao zhenxiao left a comment

@shangxinli I think mostly good. Only 2 minor issues.

@@ -140,4 +163,14 @@ public static long getFirstRowIndex(int pageIndex, OffsetIndex offsetIndex)
    {
        return offsetIndex == null ? -1 : offsetIndex.getFirstRowIndex(pageIndex);
    }

    // additional authenticated data for AES cipher
    private Slice decryptSliceIfNeeded(Slice slice, byte[] aad) throws IOException
Collaborator

throws IOException in new line

Collaborator Author

sure

private int pageIndex;
private byte[] dataPageAAD;
Collaborator

s/dataPageAAD/dataPageAdditionalAuthenticationData/g
s/dictionaryPageAAD/dictionaryPageAdditionalAuthenticationData/g
I feel AAD is kind of hard to understand. Shall we replace all AAD with Additional Authentication Data?

Collaborator Author

@shangxinli shangxinli May 23, 2022

done
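
For readers unfamiliar with the abbreviation: in AES-GCM, the AAD (additional authenticated data, as the code comment in the diff puts it) is input that is authenticated but not encrypted, so decryption fails if the reader supplies a different AAD than the writer did. The sketch below shows this with the JDK's `javax.crypto` API only. The class `AadSketch` and its method names are hypothetical and are not part of presto-parquet or parquet-mr.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;

// Hypothetical demo class; not the AesCipher used in the PR.
final class AadSketch
{
    private static final int TAG_LENGTH_BITS = 128;

    private AadSketch() {}

    static byte[] encrypt(byte[] key, byte[] nonce, byte[] additionalAuthenticatedData, byte[] plaintext)
    {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(TAG_LENGTH_BITS, nonce));
            // The AAD feeds into the authentication tag but is not part of the ciphertext.
            cipher.updateAAD(additionalAuthenticatedData);
            return cipher.doFinal(plaintext);
        }
        catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the plaintext, or null when the tag check fails (for example, on an AAD mismatch).
    static byte[] decryptOrNull(byte[] key, byte[] nonce, byte[] additionalAuthenticatedData, byte[] ciphertext)
    {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(TAG_LENGTH_BITS, nonce));
            cipher.updateAAD(additionalAuthenticatedData);
            return cipher.doFinal(ciphertext);
        }
        catch (AEADBadTagException e) {
            return null;
        }
        catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Decrypting with the same key, nonce, and AAD returns the plaintext; changing only the AAD makes `decryptOrNull` return null. A 16-byte key and a 12-byte nonce are assumed here, and a nonce must never be reused for encryption under the same key.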

from.setPosition(0);
from.read(serializedFooter, 0, serializedFooter.length);

byte[] signedFooterAAD = AesCipher.createFooterAAD(fileDecryptor.getFileAAD());
Collaborator

AAD, let's replace with Additional Authentication Data

Collaborator Author

sure

}
Decryptor footerDecryptor = null;
// additional authenticated data for AES cipher
byte[] aad = null;
Collaborator

s/aad/additionalAuthenticationData/g

Collaborator Author

sure

throws IOException
{
LinkedList<DataPage> pages = new LinkedList<>();
DictionaryPage dictionaryPage = null;
long valueCount = 0;
int dataPageCount = 0;
int pageOrdinal = 0;
byte[] dataPageHeaderAAD = null;
Collaborator

AAD

Collaborator Author

done

while (hasMorePages(valueCount, dataPageCount)) {
PageHeader pageHeader = readPageHeader();
byte[] pageHeaderAAD = dataPageHeaderAAD;
Collaborator

AAD

Collaborator Author

done

if (stats != null) {
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
}
else {
Collaborator

else not needed?

Collaborator Author

In case of stats == null, we will rely on encodings

Collaborator

yep, the logic is needed, shall we remove:
else {
since the if statement already returns
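
A minimal sketch of the shape the reviewer is asking for: the `if` branch returns, so the fallback can follow it directly without an `else`. The interface `EncodingStatsSketch`, the class `DictionaryCheckSketch`, and the encoding name checked in the fallback are assumptions for illustration; the diff does not show the actual body of the `else` branch.

```java
import java.util.Set;

// Hypothetical stand-in for the column chunk's encoding statistics.
interface EncodingStatsSketch
{
    boolean hasDictionaryPages();

    boolean hasDictionaryEncodedPages();
}

final class DictionaryCheckSketch
{
    private DictionaryCheckSketch() {}

    static boolean hasDictionary(EncodingStatsSketch stats, Set<String> encodings)
    {
        if (stats != null) {
            return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
        }
        // stats == null: rely on the encodings instead, as the author describes.
        // The encoding name below is an assumption, not taken from the PR.
        return encodings.contains("PLAIN_DICTIONARY");
    }
}
```

Behavior is unchanged compared with the `if`/`else` form; only one level of nesting goes away.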

import java.util.List;
import java.util.Map;

public class EncDecPropertiesHelper
Collaborator

s/EncDecPropertiesHelper/EncryptDecryptUtil/g

Collaborator Author

sure

@kewang1024 kewang1024 self-requested a review May 24, 2022 00:24
Collaborator

@kewang1024 kewang1024 left a comment

Haven't started reviewing the functionality, but can we

  1. Fix the release note part
  2. For the commit message, limit the subject line to 50 characters to keep it readable (https://cbea.ms/git-commit/)

Collaborator

@kewang1024 kewang1024 left a comment

Can you squash two commits?

@shangxinli
Collaborator Author

@kewang1024 Thanks for the review! Fixed the release notes and the commit message, and addressed the feedback.

@shangxinli shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch 2 times, most recently from 7c70f16 to b5804ad Compare May 24, 2022 16:18
@shangxinli
Collaborator Author

Can you squash two commits?

Just did. Thanks.

Collaborator

@kewang1024 kewang1024 left a comment

I'm still in the process of understanding the overall logic, so my comments are mainly about the tests.

  1. Can we also add an additional benchmark test for parquetReader when it has fileDecryptor?
  2. I have an improvement idea on the design of class EncryptDecryptUtil
    I think it would be better to be able to pass in key information (FOOTER_KEY, FOOTER_KEY_METADATA, COL_KEY, COL_KEY_METADATA) and create an EncryptDecryptGenerator instance, which can generate a pair of FileDecryptionProperties and FileEncryptionProperties
    With this design, in the future (or in this PR) we can also test edge cases for EncryptDecryptGenerator

FileSystem fileSystem = path.getFileSystem(conf);
FSDataInputStream inputStream = fileSystem.open(path);
long fileSize = fileSystem.getFileStatus(path).getLen();
Optional<InternalFileDecryptor> fileDecryptor = createFileDecryptor();
Collaborator

Can fileDecryptor be shared? Also, it looks like there is no need to make it an Optional.

Collaborator Author

Shareable? Yes, I made it a member variable.
Optional? From a test perspective we don't need it, but the signature of readFooter() requires an Optional.

Collaborator Author

Update: It turned out that I was wrong. The decryptor is not shareable. After sharing it, I got the error: 'ParquetCryptoRuntimeException: Decryptor re-use'.
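
The thread does not explain why re-use fails. Assuming the decryptor carries per-file state, a common way to avoid sharing is to hand the reader a factory and create a fresh instance per file. The sketch below illustrates only that pattern; `SingleUseDecryptorSketch` and `FooterReaderSketch` are hypothetical stand-ins and do not reproduce how InternalFileDecryptor actually detects re-use.

```java
import java.util.function.Supplier;

// Hypothetical object with per-file state, standing in for a file decryptor.
final class SingleUseDecryptorSketch
{
    private String boundFile;

    void bindTo(String file)
    {
        if (boundFile != null && !boundFile.equals(file)) {
            throw new IllegalStateException("Decryptor re-use: already bound to " + boundFile);
        }
        boundFile = file;
    }
}

final class FooterReaderSketch
{
    private final Supplier<SingleUseDecryptorSketch> decryptorFactory;

    FooterReaderSketch(Supplier<SingleUseDecryptorSketch> decryptorFactory)
    {
        this.decryptorFactory = decryptorFactory;
    }

    // Each call asks the factory for a decryptor instead of reading a shared field.
    String readFooter(String file)
    {
        SingleUseDecryptorSketch decryptor = decryptorFactory.get();
        decryptor.bindTo(file);
        return "footer of " + file;
    }
}
```

Constructing the reader with `SingleUseDecryptorSketch::new` gives every file its own instance, while a supplier that always returns the same object fails on the second file in this sketch.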

Comment on lines 55 to 65
private static final byte[] FOOTER_KEY_METADATA = "footkey".getBytes(StandardCharsets.UTF_8);
private static final byte[] COL_KEY = {0x02, 0x03, 0x4, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b,
        0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
private static final byte[] COL_KEY_METADATA = "col".getBytes(StandardCharsets.UTF_8);

public static FileDecryptionProperties getFileDecryptionProperties()
        throws IOException
{
    DecryptionKeyRetrieverMock keyRetriever = new DecryptionKeyRetrieverMock();
    keyRetriever.putKey("footkey", FOOTER_KEY);
    keyRetriever.putKey("col", COL_KEY);
Collaborator

Can we extract "footkey" and "col" into static final variables?

Correct me if I'm wrong: my understanding is that the names have to be the same for (L55, L64) and (L58, L65); otherwise the decryptor and encryptor won't match. So we should avoid typing them manually in tests.

Collaborator Author

It is already static final, right? I'm not able to understand the ask.

No, we don't need them to match. Encrypting and decrypting with the same key is fine.
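
One possible reading of the reviewer's request, sketched with JDK types only: declare each key ID once and derive both the key-metadata bytes and the retriever entry from it, so the two cannot drift apart through a typo. `TestKeysSketch` and `InMemoryKeyRetrieverSketch` are hypothetical; the lookup by UTF-8 key ID is an assumption about how the mock in the test behaves, and the sketch takes no position on whether the names must match.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TestKeysSketch
{
    // Each ID is written exactly once and reused below.
    static final String FOOTER_KEY_ID = "footkey";
    static final String COLUMN_KEY_ID = "col";
    static final byte[] FOOTER_KEY_METADATA = FOOTER_KEY_ID.getBytes(StandardCharsets.UTF_8);
    static final byte[] COLUMN_KEY_METADATA = COLUMN_KEY_ID.getBytes(StandardCharsets.UTF_8);

    private TestKeysSketch() {}
}

// Hypothetical in-memory retriever; assumed to look keys up by the UTF-8 key ID.
final class InMemoryKeyRetrieverSketch
{
    private final Map<String, byte[]> keys = new HashMap<>();

    void putKey(String keyId, byte[] key)
    {
        keys.put(keyId, key);
    }

    byte[] getKey(byte[] keyMetadata)
    {
        return keys.get(new String(keyMetadata, StandardCharsets.UTF_8));
    }
}
```

With this layout a test would call `putKey(TestKeysSketch.FOOTER_KEY_ID, footerKey)` and later resolve the key from `TestKeysSketch.FOOTER_KEY_METADATA`, without repeating the string literal.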

@shangxinli
Collaborator Author

  • Can we also add an additional benchmark test for parquetReader when it has fileDecryptor?
  • I have an improvement idea on the design of class EncryptDecryptUtil
    I think it would be better to be able to pass in key information (FOOTER_KEY, FOOTER_KEY_METADATA, COL_KEY, COL_KEY_METADATA) and create an EncryptDecryptGenerator instance, which can generate a pair of FileDecryptionProperties and FileEncryptionProperties
    With this design, in the future (or in this PR) we can also test edge cases for EncryptDecryptGenerator
  1. The benchmarking can be found in the blog.
  2. For now I think we can test what we need. Let me know what is missing and I can add.

@shangxinli shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch from 9f4e66d to 0b8bc6f Compare May 27, 2022 05:36
Collaborator

@kewang1024 kewang1024 left a comment

I have finished my review. Most of my comments are about coding style and misleading naming; resolving those would put us in a good state.

I'm not an expert in the Parquet encryption/decryption logic, so I will leave that to @zhenxiao for final approval.

FileDecryptionProperties fileDecryptionProperties = (cryptoFactory == null) ?
        null : cryptoFactory.getFileDecryptionProperties(configuration, path);
Optional<InternalFileDecryptor> fileDecryptor = (fileDecryptionProperties == null) ?
        Optional.empty() : Optional.of(new InternalFileDecryptor(fileDecryptionProperties));
Collaborator

same here

Collaborator Author

Sounds good

@@ -172,6 +176,7 @@ public ParquetReader(MessageColumnIO
}
this.currentBlock = -1;
this.columnIndexFilterEnabled = columnIndexFilterEnabled;
this.fileDecryptor = fileDecryptor;
Collaborator

NIT: requireNonNull

Collaborator Author

Good point

Comment on lines 226 to 232
// Lambda expression below requires final variable, so we define a new variable parquetDataSource.
final ParquetDataSource parquetDataSource = buildHdfsParquetDataSource(inputStream, path, stats);
dataSource = parquetDataSource;
DecryptionPropertiesFactory cryptoFactory = DecryptionPropertiesFactory.loadFactory(configuration);
FileDecryptionProperties fileDecryptionProperties = (cryptoFactory == null) ?
        null : cryptoFactory.getFileDecryptionProperties(configuration, path);
Optional<InternalFileDecryptor> fileDecryptor = (fileDecryptionProperties == null) ?
        Optional.empty() : Optional.of(new InternalFileDecryptor(fileDecryptionProperties));
ParquetMetadata parquetMetadata = hdfsEnvironment.doAs(user, () -> MetadataReader.readFooter(parquetDataSource, fileSize, fileDecryptor).getParquetMetadata());
Collaborator

This code block is shared across “DeltaPageSourceProvider”, “ParquetPageSourceFactory”, and “IcebergPageSourceProvider”. Can we extract it into a utility (static) function in presto-parquet?

Collaborator Author

I agree. Besides this new change, a few other existing places should do the same. I created issue #17835 to work on this after this PR is merged. This PR is already too large.
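
As a rough idea of what such a shared helper could look like, the two null checks in the repeated block collapse into one `Optional` chain. This is a sketch with generic stand-ins so that it compiles against the JDK alone: `DecryptionPropertiesSourceSketch` and `DecryptorsSketch` are hypothetical names, and the real helper would take the Hadoop configuration and path and construct an InternalFileDecryptor.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for the properties factory; may return null when the file is not encrypted.
interface DecryptionPropertiesSourceSketch<P>
{
    P getFileDecryptionProperties(String path);
}

final class DecryptorsSketch
{
    private DecryptorsSketch() {}

    // P is the decryption-properties type, D the decryptor type.
    static <P, D> Optional<D> createFileDecryptor(
            DecryptionPropertiesSourceSketch<P> factoryOrNull,
            String path,
            Function<P, D> newDecryptor)
    {
        // Optional.map yields an empty Optional when the mapping function returns null,
        // which covers both "no factory configured" and "no properties for this file".
        return Optional.ofNullable(factoryOrNull)
                .map(factory -> factory.getFileDecryptionProperties(path))
                .map(newDecryptor);
    }
}
```

Each call site would then reduce to a single call to the helper followed by the existing `readFooter` invocation.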

Comment on lines 109 to 111
if (!HiddenColumnChunkMetaData.isHiddenColumn(columnMetaData)) {
Statistics<?> columnStatistics = columnMetaData.getStatistics();
if (columnStatistics != null) {
Collaborator

NIT: Merge those two ifs, too many nested statements hurt code readability

Collaborator Author

Keeping the two ifs separate saves the .toArray() call when the first 'if' is not true. But fine, not a big deal anyway. I just merged them.

Collaborator

@kewang1024 kewang1024 left a comment

Please merge all the follow-up commits that addressed the comments back into the original commit. Overall looks good to me.
@zhenxiao, a final approval for the decryption and encryption logic?

Comment on lines 31 to 34
requireNonNull(path, "path should not be null");
this.path = path;
requireNonNull(filePath, "filePath should not be null");
this.filePath = filePath;
Collaborator

NIT:

this.path = requireNonNull(path, "path should not be null");
this.filePath = requireNonNull(filePath, "filePath should not be null");

Collaborator Author

Yeah, good catch!
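
The suggestion works because `Objects.requireNonNull` returns its argument, so the check and the assignment fit on one line. A self-contained sketch, where `PathHolderSketch` is a hypothetical class and `String` stands in for the actual field types:

```java
import static java.util.Objects.requireNonNull;

final class PathHolderSketch
{
    private final String path;
    private final String filePath;

    PathHolderSketch(String path, String filePath)
    {
        // requireNonNull throws NullPointerException with the given message, or returns the value.
        this.path = requireNonNull(path, "path should not be null");
        this.filePath = requireNonNull(filePath, "filePath should not be null");
    }

    String getPath()
    {
        return path;
    }

    String getFilePath()
    {
        return filePath;
    }
}
```

The same idiom applies to the earlier `this.fileDecryptor = fileDecryptor;` assignment that received the `requireNonNull` nit.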

Member

@beinan beinan left a comment

lgtm, many thanks for the contribution!

@shangxinli shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch 2 times, most recently from 0c00b2a to cbbe3c4 Compare June 6, 2022 23:44
@shangxinli
Collaborator Author

Thanks to all who provided comments! I just squashed all the commits and addressed the last feedback from @kewang1024.

Collaborator

@zhenxiao zhenxiao left a comment

looks nice, @shangxinli
I am good to merge this PR
could you please fix the 2 remaining issues, and squash into 1 commit?

private final FSDataInputStream inputStream;
private long readTimeNanos;
private long readBytes;
//private final ParquetReaderOptions options;
Collaborator

shall we remove this line? @shangxinli

if (stats != null) {
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
}
else {
Collaborator

yep, the logic is needed, shall we remove:
else {
since the if statement already returns

Co-authored-by: ggershinsky <ggershinsky@users.noreply.github.com>

Summary: This is to port parquet-mr decryption apache/parquet-java@65b95fb
@shangxinli shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch from 4c0ade5 to b1f3d03 Compare June 14, 2022 14:16
@shangxinli
Collaborator Author

looks nice, @shangxinli I am good to merge this PR could you please fix the 2 remaining issues, and squash into 1 commit?

Fixed

@shangxinli
Collaborator Author

Addressed the last two comments from @zhenxiao. Created a new PR #17881 for this change because resolving the conflict with HudiParquetPageSource failed. Please look at the new PR for review.

@shangxinli shangxinli closed this Jun 17, 2022
@shangxinli shangxinli deleted the column_indexes_dev_new_4_rebase.new branch June 17, 2022 17:57