[query] when sending large literals from Python to Scala, Query should use a compact binary format #13757

danking · 2023-10-02T18:28:08Z

What happened?

We encode literals from Python as JSON. This dramatically inflates the encoding of hl.tarray(hl.tstruct(...)) because the struct fields are repeated. We should instead use EncodedLiteral. There are some subtleties:

We need to implement an encoder in Python. We currently have a decoder (_from_encoding and _convert_from_encoding) which implements {"name":"StreamBufferSpec"}. We should write an encoder for that.
Our IR is plaintext, so we'll need to encode this binary encoding as base64 so it may be included in plaintext.
EncodedLiteral is not currently parsable (it throws UnsupportedOperationException in ir/Parser.scala). As mentioned in (2), we'll need to implement parsing.

For (3), just represent the AbstractTypedCodecSpec in terms of a virtual type ~~and a buffer spec~~ (assume the buffer spec is StreamBufferSpec). When parsing, we can construct an EncodedLiteral with the base64-decoded value and a codec spec we construct like this:

val codec = TypedCodecSpec(
    EType.fromTypeAllOptional(virtualType),
    virtualType,
    BufferSpec.parseOrDefault(bufferSpecString)
)

For (1), it should be simple enough to invert the logic from _convert_from_encoding to encode rather than decode. Write tests that round-trip in Python for lots of types.

Version

0.2.142

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

…13814) CHANGELOG: Pipelines that are memory-bound by copious use of `hl.literal`, such as `vds.filter_intervals`, require substantially less memory. Closes #13757

danking · 2023-11-02T16:25:41Z

GVS team confirms their pipeline containing interval literals went from >50 GB (crashing at that point) to less than 11GB! 👏

danking added the new-feature label Oct 2, 2023

danking mentioned this issue Oct 2, 2023

[qob] Since 0.2.117 vds.filter_intervals unnecessarily uses a lot of RAM #13748

Closed

danking assigned daniel-goldstein Oct 12, 2023

This was referenced Oct 12, 2023

Release 0.2.125 #13806

Closed

[query] Use EncodedLiteral instead of Literal from python to scala #13814

Merged

danking added the query label Oct 23, 2023

danking closed this as completed in #13814 Oct 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[query] when sending large literals from Python to Scala, Query should use a compact binary format #13757

[query] when sending large literals from Python to Scala, Query should use a compact binary format #13757

danking commented Oct 2, 2023 •

edited

danking commented Nov 2, 2023

[query] when sending large literals from Python to Scala, Query should use a compact binary format #13757

[query] when sending large literals from Python to Scala, Query should use a compact binary format #13757

Comments

danking commented Oct 2, 2023 • edited

What happened?

Version

Relevant log output

danking commented Nov 2, 2023

danking commented Oct 2, 2023 •

edited