Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] when sending large literals from Python to Scala, Query should use a compact binary format #13757

Closed
danking opened this issue Oct 2, 2023 · 1 comment · Fixed by #13814
Assignees

Comments

@danking
Copy link
Collaborator

danking commented Oct 2, 2023

What happened?

We encode literals from Python as JSON. This dramatically inflates the encoding of hl.tarray(hl.tstruct(...)) because the struct fields are repeated. We should instead use EncodedLiteral. There are some subtleties:

  1. We need to implement an encoder in Python. We currently have a decoder (_from_encoding and _convert_from_encoding) which implements {"name":"StreamBufferSpec"}. We should write an encoder for that.
  2. Our IR is plaintext, so we'll need to encode this binary encoding as base64 so it may be included in plaintext.
  3. EncodedLiteral is not currently parsable (it throws UnsupportedOperationException in ir/Parser.scala). As mentioned in (2), we'll need to implement parsing.

For (3), just represent the AbstractTypedCodecSpec in terms of a virtual type and a buffer spec (assume the buffer spec is StreamBufferSpec). When parsing, we can construct an EncodedLiteral with the base64-decoded value and a codec spec we construct like this:

val codec = TypedCodecSpec(
    EType.fromTypeAllOptional(virtualType),
    virtualType,
    BufferSpec.parseOrDefault(bufferSpecString)
)

For (1), it should be simple enough to invert the logic from _convert_from_encoding to encode rather than decode. Write tests that round-trip in Python for lots of types.


See also #13748

Version

0.2.142

Relevant log output

No response

@danking danking added the query label Oct 23, 2023
danking pushed a commit that referenced this issue Oct 25, 2023
…13814)

CHANGELOG: Pipelines that are memory-bound by copious use of
`hl.literal`, such as `vds.filter_intervals`, require substantially less
memory.

Closes #13757
@danking
Copy link
Collaborator Author

danking commented Nov 2, 2023

GVS team confirms their pipeline containing interval literals went from >50 GB (crashing at that point) to less than 11GB! 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants