New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spill to disk compression and encryption #778
Conversation
@cla-bot check |
if (!compressor.isPresent()) { | ||
return new SerializedPage(serializationBuffer.slice(), PageCodecMarker.none(), page.getPositionCount(), serializationBuffer.size()); | ||
if (compressor.isPresent()) { | ||
ByteBuffer compressionBuffer = ByteBuffer.allocate(compressor.get().maxCompressedLength(uncompressedSize)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, we should avoid ByteBuffer
. Here is the code from the ORC reader that unwraps a Slice into the underlying byte[]
, and we can be sure this is heap based buffer since we are creating the stream above:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll use the code you linked here for compression as well and encryption. I think, however, that forcing callers of the Slice
API to use constants from sun.misc.Unsafe
and blindly casting is a pretty bad code smell. I trust you when you say it's safe to assume these slices are always backed by heap arrays, but obviously that may not always hold.
If these patterns of slice access are common (which it appears they are becoming at least semi-common), then maybe Slice needshasArray()
, arrayOffset()
and array()
methods to keep the caller from doing these casts. I'm not a huge fan of ByteBuffer
as an API design, but those methods do give you a sane way to accomplish this and handle non-heap buffers if they're encountered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of adding these APIs to Slice. @dain what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, it would be nice to add an optimal dependency on aircompressor and have a simple util to do the for you... maybe another helper for crypto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@electrum I've opened airlift/slice#118 to expose the byte array unwrapping methods we discussed here.
presto-main/src/main/java/io/prestosql/execution/buffer/PagesSerde.java
Outdated
Show resolved
Hide resolved
this.mask = (1 << (bit - 1)); | ||
} | ||
|
||
public boolean isSet(byte value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of exposing the bit orientedness of this to the callers, I suggest we use a Set< PageCodecMarker>
in the main code, and then we in SerializedPage
we convert to a byte
for memory efficiency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some tricky memory overhead considerations of going with a full Set. I think I can come up with a happy medium of making the interface a little less awkward by implementing a PageCodecMarkerSet class that works a little bit like a BitSet over the underlying byte so we don’t incur the per Page overhead associated with Set (even EnumSet) instances. I’ll put that together and see what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still in the process of cleaning up the code based on other comments, but here's a revised first commit with the bitset-like API we talked about. Let me know if you think it's an improvement over the raw byte based API (note: don't review it for anything other than style of API, I haven't even tried to compile it yet). IMO- it still creates some awkwardness where the choice is either to still expose the PageCodecMarker#isSet(byteValue)
or to force the code to box the byte
into a MarkerSet
just to do a MarkerSet#contains(PageCodecMarker)
check. I chose to leave that API public in the commit for that reason.
presto-main/src/main/java/io/prestosql/execution/buffer/PagesSerde.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pettyjamesm thank you for your contribution!
Major points (see my comments inline for details / discussion):
- Design decision -- should compression and encryption be done on page level or file stream level?
- Naming -- "spill cipher" or "page cipher"?
- API --
byte[]
,ByteBuffer
orSlice
?
presto-main/src/main/java/io/prestosql/execution/buffer/SerializedPage.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/AesSpillCipher.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/AesSpillCipher.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/AesSpillCipher.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/FileSingleStreamSpillerFactory.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/FileSingleStreamSpiller.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/SpillCipher.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/SpillCipher.java
Outdated
Show resolved
Hide resolved
07445f9
to
1e5e575
Compare
presto-main/src/test/java/io/prestosql/operator/spiller/BenchmarkBinaryFileSpiller.java
Outdated
Show resolved
Hide resolved
54546e3
to
3f1324f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Most of my comments are very minor style things. I think the most important ones are in the AesSpillCipher
, but they also readability issues.
|
||
private SecretKey key; | ||
// Instance used only for determining encrypted output lengths | ||
private Cipher encryptSizer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to keep this around after the constructor? It seems like it is only used for throwCipherClosedIfNull(encryptSizer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the encryptSizer
is necessary because of the encryptedMaxLength(int inputLength)
method. Creating a new encrypt cipher just to determine the output size is slow-ish and expensive since we know the immediate next operation is going to be an encrypt which will create a new cipher instance. Keeping that cipher around (until the instance is closed) to do sizing operations seemed preferable to duplicating the internal logic of the AES engine sizing calculation or creating a fresh instance just to compute the required buffer length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh missed encryptSizer
sizer there due to the chained assignment.
The code might be easier to read if you explicity checked for close in each public method and then just used the fields directly:
Failures.checkCondition(key != null, GENERIC_INTERNAL_ERROR, "Spill cipher already closed")); // with a static import
I believe it is safe to use the fields directly because this class isn't thread safe anyway.
public int encryptedMaxLength(int inputLength) | ||
{ | ||
checkArgument(inputLength >= 0, "inputLength must be >= 0, inputLength=%s", inputLength); | ||
return ivBytes + throwCipherClosedIfNull(encryptSizer).getOutputSize(inputLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more readable if you added an instance checkNotClosed()
method to this class and then call that after the checkArgument
s of each public method... specifically, I find the passthrough check in the middle of a more complex statement a bit hard to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that a checkNotClosed()
method would still make it possible for a concurrent close()
call to null the field between the check and subsequent usage. It seemed preferable to avoid adding synchronizing calls just for that case and instead do a null check + local variable usage pattern. I can separate the check from the call though if that's an improvement on readability, ie:
Cipher sizer = throwCipherClosedIfNull(encryptSizer);
return ivBytes + sizer.getOutputSize(inputLength);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this class thread safe? I didn't think it was. If so, can you mark it with @ThreadSafe
? BTW, I missed this reply, due to the "outdated" github thing, when writing my review comments below.
presto-main/src/main/java/io/prestosql/spiller/AesSpillCipher.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/FileSingleStreamSpiller.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/spiller/SpillCipher.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/spiller/TestFileSingleStreamSpiller.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/spiller/TestSpillCipherPagesSerde.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/execution/buffer/PageCodecMarker.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/execution/buffer/PageCodecMarker.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/execution/buffer/PageCodecMarker.java
Outdated
Show resolved
Hide resolved
PageCodecMarker supports up to 8 flags for page serialization encoded in a single byte. It will be used to specify encryption and compression.
Spill compression is disabled by default and can be enabled by setting experimental.spill-compression-enabled to true. It functions on top of PagesSerde just like exchange compression does, using the same algorithm.
Spill to disk encryption is disabled by default and controlled by setting experimental.spill-encryption-enabled to true. The implementation allows spilled pages to be encrypted using a randomly generated key per spiller instance which is destroyed when Spiller#close() is called.
3f1324f
to
ce2e119
Compare
@dain resolved your comments in places where I made the requested changes, left comments in the other places where I didn't. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I had one comment, but I think it is just a judgement call, so feel free to ignore.
We'll merge after the in-progress release goes out.
|
||
private SecretKey key; | ||
// Instance used only for determining encrypted output lengths | ||
private Cipher encryptSizer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh missed encryptSizer
sizer there due to the chained assignment.
The code might be easier to read if you explicity checked for close in each public method and then just used the fields directly:
Failures.checkCondition(key != null, GENERIC_INTERNAL_ERROR, "Spill cipher already closed")); // with a static import
I believe it is safe to use the fields directly because this class isn't thread safe anyway.
@pettyjamesm Merged. Thanks! For that last, minor, comment above, if you still want to change it just send a PR. |
Thanks @dain - to address your point, the check-closed thing would work but in the event of an concurrent call to |
Cross port of prestodb/presto#12617