Add spill to disk compression and encryption #778

Merged
merged 3 commits into trinodb:master from the spill-to-disk-encryption branch May 30, 2019

Conversation

pettyjamesm
Member

Cross port of prestodb/presto#12617

@martint
Member

martint commented May 15, 2019

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label May 15, 2019
@trinodb trinodb deleted a comment from cla-bot bot May 15, 2019
@trinodb trinodb deleted a comment from cla-bot bot May 15, 2019
presto-docs/src/main/sphinx/admin/spill.rst (outdated review comments, since resolved)
if (!compressor.isPresent()) {
    return new SerializedPage(serializationBuffer.slice(), PageCodecMarker.none(), page.getPositionCount(), serializationBuffer.size());
}
if (compressor.isPresent()) {
    ByteBuffer compressionBuffer = ByteBuffer.allocate(compressor.get().maxCompressedLength(uncompressedSize));
Member

Generally, we should avoid ByteBuffer. Here is the code from the ORC reader that unwraps a Slice into the underlying byte[], and we can be sure this is a heap-based buffer since we are creating the stream above:

https://github.com/prestosql/presto/blob/master/presto-orc/src/main/java/io/prestosql/orc/stream/CompressedOrcChunkLoader.java#L129-L133
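
For reference, a rough sketch of that unwrapping pattern (a sketch only, assuming the Slice is known to be heap-backed as in the linked ORC loader; variable names are illustrative, not the exact PR code):

import static java.lang.Math.toIntExact;
import static sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET;

// The slice was created over a heap byte[] above, so the cast is safe here
byte[] bytes = (byte[]) slice.getBase();
int offset = toIntExact(slice.getAddress() - ARRAY_BYTE_BASE_OFFSET);
int length = slice.length();
// bytes[offset .. offset + length) now holds the serialized page data and can be
// handed straight to the compressor, avoiding the intermediate ByteBuffer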

Member Author

OK, I'll use the code you linked here for compression as well as encryption. I think, however, that forcing callers of the Slice API to use constants from sun.misc.Unsafe and blindly cast is a pretty bad code smell. I trust you when you say it's safe to assume these slices are always backed by heap arrays, but obviously that may not always hold.

If these patterns of slice access are common (and it appears they are becoming at least semi-common), then maybe Slice needs hasArray(), arrayOffset(), and array() methods to keep callers from doing these casts. I'm not a huge fan of ByteBuffer as an API design, but those methods do give you a sane way to accomplish this and to handle non-heap buffers if they're encountered.
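
For illustration, caller code with such hypothetical accessors might look like this (hasArray(), array(), and arrayOffset() are proposed additions mirroring ByteBuffer, not existing Slice API, and compress() just stands in for whatever consumes the bytes):

if (slice.hasArray()) {
    // heap-backed: operate on the array directly, no casts or Unsafe constants needed
    byte[] bytes = slice.array();
    int offset = slice.arrayOffset();
    compress(bytes, offset, slice.length());
}
else {
    // non-heap slice: fall back to copying into a fresh array
    byte[] bytes = slice.getBytes();
    compress(bytes, 0, bytes.length);
}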

Member

I like the idea of adding these APIs to Slice. @dain what do you think?

Member

Ya, it would be nice to add an optional dependency on aircompressor and have a simple util to do this for you... maybe another helper for crypto.

Member Author

@electrum I've opened airlift/slice#118 to expose the byte array unwrapping methods we discussed here.

    this.mask = (1 << (bit - 1));
}

public boolean isSet(byte value)
Member

Instead of exposing the bit-orientedness of this to the callers, I suggest we use a Set<PageCodecMarker> in the main code, and then in SerializedPage we convert to a byte for memory efficiency.

Member Author

I think there are some tricky memory overhead considerations with going to a full Set. I think I can come up with a happy medium that makes the interface a little less awkward by implementing a PageCodecMarkerSet class that works a bit like a BitSet over the underlying byte, so we don't incur the per-page overhead associated with Set (even EnumSet) instances. I'll put that together and see what you think.
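
For illustration, a minimal sketch of such a byte-backed marker set (assuming PageCodecMarker exposes its single-bit mask via a getMask() accessor; names and details are illustrative, not the final PR code):

public final class PageCodecMarkerSet
{
    private byte markers;

    public static PageCodecMarkerSet fromByteValue(byte value)
    {
        PageCodecMarkerSet set = new PageCodecMarkerSet();
        set.markers = value;
        return set;
    }

    public void add(PageCodecMarker marker)
    {
        markers |= marker.getMask();
    }

    public boolean contains(PageCodecMarker marker)
    {
        return (markers & marker.getMask()) != 0;
    }

    public byte byteValue()
    {
        return markers;
    }
}

The underlying storage stays a single byte per SerializedPage, so there is no per-page Set or EnumSet allocation.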

Member Author

I'm still in the process of cleaning up the code based on other comments, but here's a revised first commit with the bitset-like API we talked about. Let me know if you think it's an improvement over the raw byte-based API (note: don't review it for anything other than the style of the API, I haven't even tried to compile it yet). IMO, it still creates some awkwardness where the choice is either to keep exposing PageCodecMarker#isSet(byteValue) or to force the code to box the byte into a MarkerSet just to do a MarkerSet#contains(PageCodecMarker) check. I chose to leave that API public in the commit for that reason.

Member

@findepi findepi left a comment

@pettyjamesm thank you for your contribution!

Major points (see my comments inline for details / discussion):

  1. Design decision -- should compression and encryption be done on page level or file stream level?
  2. Naming -- "spill cipher" or "page cipher"?
  3. API -- byte[], ByteBuffer or Slice?

@pettyjamesm pettyjamesm force-pushed the spill-to-disk-encryption branch 2 times, most recently from 07445f9 to 1e5e575 Compare May 17, 2019 19:39
@pettyjamesm
Member Author

@dain, @findepi - I've addressed your previous feedback on the PR. Ready for another review.

@pettyjamesm pettyjamesm force-pushed the spill-to-disk-encryption branch 3 times, most recently from 54546e3 to 3f1324f Compare May 24, 2019 15:30
Member

@dain dain left a comment

Looking good! Most of my comments are very minor style things. I think the most important ones are in the AesSpillCipher, but they are also readability issues.


private SecretKey key;
// Instance used only for determining encrypted output lengths
private Cipher encryptSizer;
Member

Is there any reason to keep this around after the constructor? It seems like it is only used for throwCipherClosedIfNull(encryptSizer)

Member Author

Yeah, the encryptSizer is necessary because of the encryptedMaxLength(int inputLength) method. Creating a new encrypt cipher just to determine the output size is slow-ish and expensive, especially since we know the immediate next operation is going to be an encrypt that creates a new cipher instance anyway. Keeping that cipher around (until the instance is closed) to do sizing operations seemed preferable to duplicating the internal logic of the AES engine's sizing calculation or creating a fresh instance just to compute the required buffer length.
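
For clarity, a rough sketch of that setup (assuming an AES/CBC/PKCS5Padding transformation; the exact transformation and names in the PR may differ):

import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;

// Kept for the lifetime of the spill cipher purely to answer getOutputSize() queries;
// every real encrypt/decrypt still creates its own Cipher with a fresh IV.
private static Cipher createSizerCipher(SecretKey key)
        throws GeneralSecurityException
{
    Cipher sizer = Cipher.getInstance("AES/CBC/PKCS5Padding");
    sizer.init(Cipher.ENCRYPT_MODE, key);
    return sizer;
}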

Member

Oh, I missed the encryptSizer there due to the chained assignment.

The code might be easier to read if you explicitly checked for close in each public method and then just used the fields directly:

Failures.checkCondition(key != null, GENERIC_INTERNAL_ERROR, "Spill cipher already closed"); // with a static import

I believe it is safe to use the fields directly because this class isn't thread safe anyway.

public int encryptedMaxLength(int inputLength)
{
    checkArgument(inputLength >= 0, "inputLength must be >= 0, inputLength=%s", inputLength);
    return ivBytes + throwCipherClosedIfNull(encryptSizer).getOutputSize(inputLength);
}
Member

I think it would be more readable if you added an instance checkNotClosed() method to this class and then called it after the checkArgument calls of each public method... specifically, I find the passthrough check in the middle of a more complex statement a bit hard to follow.

Member Author

The problem is that a checkNotClosed() method would still make it possible for a concurrent close() call to null out the field between the check and the subsequent usage. It seemed preferable to avoid adding synchronization just for that case and instead use a null check + local variable pattern. I can separate the check from the call, though, if that's an improvement in readability, i.e.:

    Cipher sizer = throwCipherClosedIfNull(encryptSizer);
    return ivBytes + sizer.getOutputSize(inputLength);
}
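
For completeness, a sketch of the helper being referenced (assuming a plain IllegalStateException here; the real code may use a Presto-specific exception and message):

private static <T> T throwCipherClosedIfNull(T value)
{
    if (value == null) {
        throw new IllegalStateException("Spill cipher already closed");
    }
    return value;
}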

Member

Is this class thread safe? I didn't think it was. If so, can you mark it with @ThreadSafe? BTW, I missed this reply due to the "outdated" GitHub thing when writing my review comments below.

PageCodecMarker supports up to 8 flags for page serialization, encoded in a
single byte. It will be used to specify encryption and compression.

Spill compression is disabled by default and can be enabled by setting
experimental.spill-compression-enabled to true. It functions on top of
PagesSerde just like exchange compression does, using the same algorithm.

Spill to disk encryption is disabled by default and controlled by setting
experimental.spill-encryption-enabled to true. The implementation
allows spilled pages to be encrypted using a randomly generated key
per spiller instance, which is destroyed when Spiller#close() is
called.
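
Taken together, enabling both features would look something like this in a worker's config.properties (property names as given above; enabling both is just an example, since each defaults to false):

# enable spill-to-disk compression and encryption
experimental.spill-compression-enabled=true
experimental.spill-encryption-enabled=true
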
@pettyjamesm
Member Author

@dain - I resolved your comments in the places where I made the requested changes and left replies in the places where I didn't.

Member

@dain dain left a comment

Looks good. I had one comment, but I think it is just a judgement call, so feel free to ignore.

We'll merge after the in-progress release goes out.


@dain dain merged commit ef67673 into trinodb:master May 30, 2019
@dain
Member

dain commented May 30, 2019

@pettyjamesm Merged. Thanks! For that last minor comment above, if you still want to change it, just send a PR.

@pettyjamesm
Member Author

Thanks @dain - to address your point: the check-closed approach would work, but in the event of a concurrent call to close() from another thread (which is not a good thing and would indicate a bug), the result would be a NullPointerException instead of "Spill cipher already closed". I chose the pattern that always guarantees you can either use the cipher or you see the right error, even though the class is not designed or intended to be safe for concurrent access from multiple threads. Thanks again for the merge!

@pettyjamesm pettyjamesm deleted the spill-to-disk-encryption branch May 30, 2019 18:51
@dain dain mentioned this pull request May 30, 2019
@dain dain added this to the 313 milestone May 30, 2019