Add decryption functionality to presto. #17791

shangxinli · 2022-05-20T23:17:49Z

Co-authored-by: ggershinsky ggershinsky@users.noreply.github.com

Summary: This ports the parquet-mr decryption functionality. The main commits in parquet-mr for encryption/decryption are apache/parquet-java@65b95fb and several other fixes. This change ports decryption only.

Test plan

This feature was tested in the Uber environment and has been running in production for 2+ years.

== RELEASE NOTES ==

General Changes
* Add decryption functionality to Presto. When a Parquet file is encrypted following [Parquet Modular Encryption](https://github.com/apache/parquet-format/blob/master/Encryption.md), this change enables Presto to decrypt it.

Hive Changes
* No flag is introduced. Presto-Hive now loads the DecryptionPropertiesFactory (implemented in parquet-mr), uses it to get the file decryptor, and passes the decryptor to presto-parquet.

shangxinli · 2022-05-20T23:20:52Z

This is a rebased PR for #17728

shangxinli · 2022-05-21T00:42:23Z

@zhenxiao @beinan After rebase, the PR is finally green now. Thanks.

zhenxiao

@shangxinli I think mostly good. Only 2 minor issues.

zhenxiao · 2022-05-23T12:40:21Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/PageReader.java

@@ -140,4 +163,14 @@ public static long getFirstRowIndex(int pageIndex, OffsetIndex offsetIndex)
    {
        return offsetIndex == null ? -1 : offsetIndex.getFirstRowIndex(pageIndex);
    }
+
+    // additional authenticated data for AES cipher
+    private Slice decryptSliceIfNeeded(Slice slice, byte[] aad) throws IOException


Put throws IOException on a new line.

zhenxiao · 2022-05-23T12:41:23Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/PageReader.java

    private int pageIndex;
+    private byte[] dataPageAAD;


s/dataPageAAD/dataPageAdditionalAuthenticationData/g
s/dictionaryPageAAD/dictionaryPageAdditionalAuthenticationData/g
I feel AAD is kind of hard to understand. Shall we replace all AAD with Additional Authentication Data?
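
For readers unfamiliar with the term: AAD (the diff's own comment spells it "additional authenticated data") is input that an authenticated cipher such as AES-GCM verifies but does not encrypt or store in the ciphertext. Decryption only succeeds when the reader supplies the same AAD the writer used. The sketch below illustrates this with the JDK's javax.crypto API only. It does not use the parquet-mr AesCipher class, and the class name AadSketch, the all-zero key and nonce, and the AAD strings are made up for illustration.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Optional;

public class AadSketch
{
    private static final int TAG_BITS = 128;

    // Encrypts the plaintext. The AAD is mixed into the authentication tag
    // but is neither encrypted nor included in the returned bytes.
    public static byte[] encrypt(byte[] key, byte[] nonce, byte[] aad, byte[] plaintext)
    {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(TAG_BITS, nonce));
            cipher.updateAAD(aad);
            return cipher.doFinal(plaintext);
        }
        catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the plaintext, or empty when the tag check fails
    // (wrong key, wrong AAD, or modified ciphertext).
    public static Optional<byte[]> decrypt(byte[] key, byte[] nonce, byte[] aad, byte[] ciphertext)
    {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(TAG_BITS, nonce));
            cipher.updateAAD(aad);
            return Optional.of(cipher.doFinal(ciphertext));
        }
        catch (AEADBadTagException e) {
            return Optional.empty();
        }
        catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        byte[] key = new byte[16];   // demo only: real keys must be random and secret
        byte[] nonce = new byte[12]; // demo only: a nonce must never repeat under one key
        byte[] page = "page bytes".getBytes(StandardCharsets.UTF_8);
        byte[] aad = "file-id|page-0".getBytes(StandardCharsets.UTF_8);
        byte[] wrongAad = "file-id|page-1".getBytes(StandardCharsets.UTF_8);

        byte[] encrypted = encrypt(key, nonce, aad, page);
        System.out.println("same AAD decrypts: " + decrypt(key, nonce, aad, encrypted).isPresent());
        System.out.println("other AAD decrypts: " + decrypt(key, nonce, wrongAad, encrypted).isPresent());
    }
}
```

Because the tag check fails for any AAD other than the one used at write time, a page encrypted for one position cannot be silently swapped into another, which is presumably why the diff tracks values such as pageOrdinal alongside the AAD fields.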

zhenxiao · 2022-05-23T12:42:00Z

presto-parquet/src/main/java/com/facebook/presto/parquet/cache/MetadataReader.java

+        from.setPosition(0);
+        from.read(serializedFooter, 0, serializedFooter.length);
+
+        byte[] signedFooterAAD = AesCipher.createFooterAAD(fileDecryptor.getFileAAD());


AAD, let's replace with Additional Authentication Data

zhenxiao · 2022-05-23T12:42:26Z

presto-parquet/src/main/java/com/facebook/presto/parquet/cache/MetadataReader.java

+        }
+        Decryptor footerDecryptor = null;
+        // additional authenticated data for AES cipher
+        byte[] aad = null;


s/aad/additionalAuthenticationData/g

zhenxiao · 2022-05-23T12:43:21Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetColumnChunk.java

            throws IOException
    {
        LinkedList<DataPage> pages = new LinkedList<>();
        DictionaryPage dictionaryPage = null;
        long valueCount = 0;
        int dataPageCount = 0;
+        int pageOrdinal = 0;
+        byte[] dataPageHeaderAAD = null;


zhenxiao · 2022-05-23T12:43:38Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetColumnChunk.java

        while (hasMorePages(valueCount, dataPageCount)) {
-            PageHeader pageHeader = readPageHeader();
+            byte[] pageHeaderAAD = dataPageHeaderAAD;


zhenxiao · 2022-05-23T12:43:56Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetColumnChunk.java

+        if (stats != null) {
+            return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
+        }
+        else {


else not needed?

In case of stats == null, we will rely on encodings

yep, the logic is needed, shall we remove:
else {
since the if statement already returns
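
The requested shape can be sketched as follows. Only the two stats calls come from the diff above. The Stats interface, the method name, and the fallback on a set of encoding names are hypothetical stand-ins so that the example compiles on its own.

```java
import java.util.Set;

public class EarlyReturnSketch
{
    // Hypothetical stand-in for the stats object used in the diff.
    public interface Stats
    {
        boolean hasDictionaryPages();

        boolean hasDictionaryEncodedPages();
    }

    public static boolean hasDictionaryPage(Stats stats, Set<String> encodings)
    {
        if (stats != null) {
            return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
        }
        // No "else" block is needed: this line is only reached when stats == null,
        // in which case the decision falls back to the encodings.
        return encodings.contains("PLAIN_DICTIONARY") || encodings.contains("RLE_DICTIONARY");
    }
}
```

The behaviour is identical to the if/else version; the fallback simply loses one level of nesting.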

zhenxiao · 2022-05-23T12:45:10Z

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/EncDecPropertiesHelper.java

+import java.util.List;
+import java.util.Map;
+
+public class EncDecPropertiesHelper


s/EncDecPropertiesHelper/EncryptDecryptUtil/g

kewang1024

Haven't started reviewing the functionality, but can we:

* Fix the release note part
* For the commit message, limit the subject line to 50 characters so that it stays readable (https://cbea.ms/git-commit/)

kewang1024

Can you squash two commits?

...to-parquet/src/main/java/com/facebook/presto/parquet/cache/CachingParquetMetadataSource.java

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetReader.java

shangxinli · 2022-05-24T16:13:46Z

@kewang1024 Thanks for the review! Fixed the release notes and the commit message, and addressed the feedback.

shangxinli · 2022-05-24T16:19:22Z

> Can you squash two commits?

Just did. Thanks.

kewang1024

I'm still in the process of understanding the overall logic, so my comments are mainly about the tests

Can we also add an additional benchmark test for parquetReader when it has fileDecryptor?
I have an improvement idea on the design of class EncryptDecryptUtil
I think it would be better to be able to pass in key information (FOOTER_KEY, FOOTER_KEY_METADATA, COL_KEY, COL_KEY_METADATA) and create an EncryptDecryptGenerator instance, which can generate a pair of FileDecryptionProperties and FileEncryptionProperties
With this design, in the future (or in this PR) we can also test edge cases for EncryptDecryptGenerator

presto-parquet/src/main/java/com/facebook/presto/parquet/cache/MetadataReader.java

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/EncryptionTestFile.java

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/TestEncryption.java

kewang1024 · 2022-05-25T23:11:01Z

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/TestEncryption.java

+        FileSystem fileSystem = path.getFileSystem(conf);
+        FSDataInputStream inputStream = fileSystem.open(path);
+        long fileSize = fileSystem.getFileStatus(path).getLen();
+        Optional<InternalFileDecryptor> fileDecryptor = createFileDecryptor();


Can fileDecryptor be shared? Also, it looks like there is no need to make it an Optional.

Shareable? Yes, I made it a member variable.
Optional? From a test perspective we don't need it, but the signature of readFooter() needs an Optional.

Update: It turned out that I was wrong. The decryptor is not shareable. After sharing it, I got the error: 'ParquetCryptoRuntimeException: Decryptor re-use'.

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/TestEncryption.java

kewang1024 · 2022-05-25T23:20:54Z

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/EncryptDecryptUtil.java

+    private static final byte[] FOOTER_KEY_METADATA = "footkey".getBytes(StandardCharsets.UTF_8);
+    private static final byte[] COL_KEY = {0x02, 0x03, 0x4, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b,
+            0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+    private static final byte[] COL_KEY_METADATA = "col".getBytes(StandardCharsets.UTF_8);
+
+    public static FileDecryptionProperties getFileDecryptionProperties()
+            throws IOException
+    {
+        DecryptionKeyRetrieverMock keyRetriever = new DecryptionKeyRetrieverMock();
+        keyRetriever.putKey("footkey", FOOTER_KEY);
+        keyRetriever.putKey("col", COL_KEY);


Can we extract "footkey" and "col" into static final variables?

Correct me if I'm wrong, but my understanding is that the names have to be the same for (L55, L64) and (L58, L65); otherwise, the decryptor and encryptor won't match. Thus we should avoid typing them manually in tests.

It is already final static, right? Not able to understand the ask.

No, we don't need them to match. Encrypting and decrypting with the same key is fine.

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/EncryptDecryptUtil.java

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/TestEncryption.java

shangxinli · 2022-05-27T02:48:41Z

> Can we also add an additional benchmark test for parquetReader when it has fileDecryptor?
>
> I have an improvement idea on the design of class EncryptDecryptUtil
> I think it would be better to be able to pass in key information (FOOTER_KEY, FOOTER_KEY_METADATA, COL_KEY, COL_KEY_METADATA) and create an EncryptDecryptGenerator instance, which can generate a pair of FileDecryptionProperties and FileEncryptionProperties
> With this design, in the future (or in this PR) we can also test edge cases for EncryptDecryptGenerator

The benchmarking can be found in the blog.
For now I think we can test what we need. Let me know what is missing and I can add.

kewang1024

I have finished my review and most of my comments are regarding coding style and misleading naming; Resolving those would put us in a good state.

I'm not an expert in the parquet encryption/decryption logic, so I will leave that to @zhenxiao for final approval.

presto-delta/src/main/java/com/facebook/presto/delta/DeltaPageSourceProvider.java

kewang1024 · 2022-05-31T18:42:23Z

presto-delta/src/main/java/com/facebook/presto/delta/DeltaPageSourceProvider.java

+            FileDecryptionProperties fileDecryptionProperties = (cryptoFactory == null) ?
+                    null : cryptoFactory.getFileDecryptionProperties(configuration, path);
+            Optional<InternalFileDecryptor> fileDecryptor = (fileDecryptionProperties == null) ?
+                    Optional.empty() : Optional.of(new InternalFileDecryptor(fileDecryptionProperties));


Sounds good

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetReader.java

kewang1024 · 2022-05-31T18:58:09Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetReader.java

@@ -172,6 +176,7 @@ public ParquetReader(MessageColumnIO
        }
        this.currentBlock = -1;
        this.columnIndexFilterEnabled = columnIndexFilterEnabled;
+        this.fileDecryptor = fileDecryptor;


NIT: requireNonNull
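
A minimal sketch of what this nit asks for, assuming fileDecryptor is passed as an Optional as in the other diffs in this thread. ReaderSketch and the Object placeholder type are hypothetical stand-ins for ParquetReader and InternalFileDecryptor, and the error message text is an assumption.

```java
import java.util.Optional;

import static java.util.Objects.requireNonNull;

public class ReaderSketch
{
    // Object is a placeholder for the real decryptor type.
    private final Optional<Object> fileDecryptor;

    public ReaderSketch(Optional<Object> fileDecryptor)
    {
        // Fails fast in the constructor instead of on first use.
        // "No decryptor" is expressed as Optional.empty(), never as null.
        this.fileDecryptor = requireNonNull(fileDecryptor, "fileDecryptor is null");
    }

    public boolean hasDecryptor()
    {
        return fileDecryptor.isPresent();
    }
}
```

With this check, passing null for the Optional itself surfaces as a NullPointerException at construction time with a descriptive message.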

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetReader.java

kewang1024 · 2022-06-01T05:27:27Z

presto-delta/src/main/java/com/facebook/presto/delta/DeltaPageSourceProvider.java

+            // Lambda expression below requires final variable, so we define a new variable parquetDataSource.
+            final ParquetDataSource parquetDataSource = buildHdfsParquetDataSource(inputStream, path, stats);
+            dataSource = parquetDataSource;
+            DecryptionPropertiesFactory cryptoFactory = DecryptionPropertiesFactory.loadFactory(configuration);
+            FileDecryptionProperties fileDecryptionProperties = (cryptoFactory == null) ?
+                    null : cryptoFactory.getFileDecryptionProperties(configuration, path);
+            Optional<InternalFileDecryptor> fileDecryptor = (fileDecryptionProperties == null) ?
+                    Optional.empty() : Optional.of(new InternalFileDecryptor(fileDecryptionProperties));
+            ParquetMetadata parquetMetadata = hdfsEnvironment.doAs(user, () -> MetadataReader.readFooter(parquetDataSource, fileSize, fileDecryptor).getParquetMetadata());


This code block is shared in all those three places: “DeltaPageSourceProvider”, “”, “ParquetPageSourceFactory”, “IcebergPageSourceProvider”. Can we extract it into a util function / static function in presto-parquet?

I agree. In addition to the new change, there are a few other existing places that should do the same. I created issue #17835 to work on this after this PR is merged. This PR is already too large.
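
A rough sketch of the kind of shared helper being discussed, not the actual change tracked in the follow-up issue. To keep it self-contained, the parquet-mr types (DecryptionPropertiesFactory, FileDecryptionProperties, InternalFileDecryptor) are replaced by a hypothetical PropertiesFactory interface and generic type parameters. The class and method names are invented.

```java
import java.util.Optional;
import java.util.function.Function;

public class DecryptorFactorySketch
{
    // Hypothetical stand-in for the factory that returns decryption properties for a file.
    public interface PropertiesFactory<P>
    {
        P getFileDecryptionProperties(String path);
    }

    // Mirrors the two null checks in the diff: a missing factory or missing
    // properties both mean "file is not encrypted", i.e. Optional.empty().
    public static <P, D> Optional<D> createDecryptor(
            PropertiesFactory<P> factory,
            String path,
            Function<P, D> newDecryptor)
    {
        return Optional.ofNullable(factory)
                .map(f -> f.getFileDecryptionProperties(path))
                .map(newDecryptor);
    }
}
```

Each page source provider could then call one method instead of repeating the two ternary expressions. A new decryptor is still created on every call, which matters because the thread above found that a decryptor cannot be reused.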

presto-delta/src/main/java/com/facebook/presto/delta/DeltaPageSourceProvider.java

kewang1024 · 2022-06-01T05:38:59Z

presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/PredicateUtils.java

+            if (!HiddenColumnChunkMetaData.isHiddenColumn(columnMetaData)) {
+                Statistics<?> columnStatistics = columnMetaData.getStatistics();
+                if (columnStatistics != null) {


NIT: Merge those two ifs, too many nested statements hurt code readability

Between these two diffs, we can save the .toArray() call if the first 'if' is not true. But fine, not a big deal anyway. I just merged it.

presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/PredicateUtils.java

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/EncryptDecryptUtil.java

kewang1024

Please merge all the follow-up commits that addressed the comments back into the original commit. Overall, this looks good to me.
@zhenxiao a final approval for the decryption and encryption logic?

kewang1024 · 2022-06-03T22:02:44Z

presto-parquet/src/main/java/org/apache/parquet/crypto/HiddenColumnChunkMetaData.java

+        requireNonNull(path, "path should not be null");
+        this.path = path;
+        requireNonNull(filePath, "filePath should not be null");
+        this.filePath = filePath;


NIT:

this.path = requireNonNull(path, "path should not be null");
this.filePath = requireNonNull(filePath, "filePath should not be null");

Yeah, good catch!

beinan

lgtm, many thanks for the contribution!

shangxinli · 2022-06-07T02:43:00Z

Thanks to all who provided comments! I just squashed all the commits and addressed the last feedback from @kewang1024.

zhenxiao

looks nice, @shangxinli
I am good to merge this PR
could you please fix the 2 remaining issues, and squash into 1 commit?

zhenxiao · 2022-06-14T07:13:02Z

presto-parquet/src/test/java/com/facebook/presto/parquet/reader/MockParquetDataSource.java

+    private final FSDataInputStream inputStream;
+    private long readTimeNanos;
+    private long readBytes;
+    //private final ParquetReaderOptions options;


shall we remove this line? @shangxinli

zhenxiao · 2022-06-14T07:27:44Z

presto-parquet/src/main/java/com/facebook/presto/parquet/reader/ParquetColumnChunk.java

+        if (stats != null) {
+            return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
+        }
+        else {


yep, the logic is needed, shall we remove:
else {
since the if statement already returns

Co-authored-by: ggershinsky <ggershinsky@users.noreply.github.com> Summary: This is to port parquet-mr decryption apache/parquet-java@65b95fb

shangxinli · 2022-06-14T22:13:56Z

> looks nice, @shangxinli. I am good to merge this PR. Could you please fix the 2 remaining issues, and squash into 1 commit?

Fixed

shangxinli · 2022-06-14T22:16:11Z

Addressed the last two comments from @zhenxiao. Created a new PR #17881 for this change because resolving the conflict with HudiParquetPageSource failed. Please look at the new PR for review.

shangxinli requested a review from a team as a code owner May 20, 2022 23:17

shangxinli requested a review from NikhilCollooru May 20, 2022 23:17

shangxinli mentioned this pull request May 20, 2022

Add decryption functionality to presto #17728

Closed

zhenxiao reviewed May 23, 2022

kewang1024 self-requested a review May 24, 2022 00:24

kewang1024 reviewed May 24, 2022

shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch 2 times, most recently from 7c70f16 to b5804ad Compare May 24, 2022 16:18

kewang1024 reviewed May 25, 2022

shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch from 9f4e66d to 0b8bc6f Compare May 27, 2022 05:36

kewang1024 reviewed Jun 1, 2022

kewang1024 approved these changes Jun 3, 2022

beinan approved these changes Jun 4, 2022

shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch 2 times, most recently from 0c00b2a to cbbe3c4 Compare June 6, 2022 23:44

zhenxiao approved these changes Jun 14, 2022

Add parquet ecryption functionality into presto

b1f3d03

Co-authored-by: ggershinsky <ggershinsky@users.noreply.github.com> Summary: This is to port parquet-mr decryption apache/parquet-java@65b95fb

shangxinli force-pushed the column_indexes_dev_new_4_rebase.new branch from 4c0ade5 to b1f3d03 Compare June 14, 2022 14:16

shangxinli mentioned this pull request Jun 14, 2022

Add parquet ecryption functionality into presto #17881

Merged

shangxinli closed this Jun 17, 2022

shangxinli deleted the column_indexes_dev_new_4_rebase.new branch June 17, 2022 17:57
