[query] Don't use py4j for Backend operations #13756

daniel-goldstein · 2023-10-02T17:45:06Z

What happened?

The spark and local backends use py4j to execute methods on Java backends. py4j uses a TCP socket and a text-based protocol to communicate between Python and the JVM, and it handles marshaling of data between the two processes. Unfortunately, it has poor memory performance with large byte arrays: the text protocol requires base64-encoding byte arrays, and it uses Java Strings which, being UTF-16, more than double the size of the original data in memory.
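
To make the size inflation concrete, here is a rough sketch of the arithmetic in Python. It is not py4j code: the helper name `py4j_style_footprint` is hypothetical, and the two-bytes-per-character figure is an assumption taken from the UTF-16 description above.

```python
import base64


def py4j_style_footprint(payload: bytes) -> dict:
    """Estimate the size of `payload` once it is Base64 text in a UTF-16 string.

    Hypothetical helper for illustration only; it does not call py4j.
    """
    # Base64 emits 4 ASCII characters for every 3 input bytes (plus padding).
    encoded = base64.b64encode(payload)
    n_chars = len(encoded)
    # Assumption: each character occupies 2 bytes, as in a UTF-16-backed String.
    utf16_bytes = 2 * n_chars
    return {
        "raw_bytes": len(payload),
        "base64_chars": n_chars,
        "utf16_bytes": utf16_bytes,
        "ratio": utf16_bytes / len(payload),
    }
```

For a payload whose length is a multiple of 3, the ratio works out to exactly 8/3, so under these assumptions a 3 GB result would occupy about 8 GB as a string before any further copies are made.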

Hail should not use py4j for these operations and should instead open its own connection to the Java backend. This gives us the control to use no more memory than is necessary to ship bytes back and forth. It also provides an opportunity to deduplicate some code, as the ServiceBackend already writes its inputs over a socket instead of using py4j (there is no live JVM to communicate with in the ServiceBackend case, so it must serialize the requested operation to be run at a later time on a different machine).
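
As a sketch of what "just ship bytes back and forth" could look like, the following Python uses a length-prefixed binary frame over a socket. The framing (an 8-byte big-endian length followed by the raw payload) and the function names are assumptions for illustration, not the wire format Hail uses.

```python
import socket
import struct


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Send an 8-byte big-endian length, then the raw payload (no text encoding)."""
    sock.sendall(struct.pack(">Q", len(payload)))
    sock.sendall(payload)


def recv_exact(sock: socket.socket, n: int) -> bytearray:
    """Read exactly n bytes into one preallocated buffer."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        k = sock.recv_into(view[got:], n - got)
        if k == 0:
            raise ConnectionError("peer closed before the full frame arrived")
        got += k
    return buf


def recv_frame(sock: socket.socket) -> bytearray:
    """Read one length-prefixed frame and return its payload."""
    (length,) = struct.unpack(">Q", recv_exact(sock, 8))
    return recv_exact(sock, length)
```

Because the receiver learns the length up front, it can allocate a single buffer of the right size and fill it in place, so the payload is held once rather than as an encoded string plus a decoded copy.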

Version

0.2.124

Relevant log output

No response

daniel-goldstein · 2023-10-02T17:52:04Z

There remain a couple of questions that a solution should answer:

TCP or Unix Domain Socket? Current consensus seems to be that TCP is a reasonable and more portable way to go (it allows for backend deployment over the web or in K8s, for example).
Should we use just TCP or also use HTTP? If the Java backends can multiplex requests, HTTP sounds favorable; otherwise it's unclear to me what advantages it would give us over TCP + JSON. Ergonomically HTTP might be easier, but one tends to have certain default expectations of HTTP servers (I would assume an HTTP server should be able to serve requests concurrently, are we just going to use all POSTs?, etc.). Either way this feels like a minor adjustment.

danking · 2023-10-13T19:46:59Z

draft PR: #13797

CHANGELOG: Fixes #13756: operations that collect large results such as `to_pandas` may require up to 3x less memory.

This turns all "actions", i.e. backend methods supported by QoB, into HTTP endpoints on the spark and local backends. This intentionally avoids py4j because py4j was really designed to pass function names and references around and does not handle large payloads well (such as results from a `collect`). Specifically, py4j uses a text-based protocol on top of TCP that substantially inflates the memory requirement for communicating large byte arrays. On the Java side, py4j serializes every binary payload as a Base64-encoded `java.lang.String`. Between the Base64 encoding (a 4/3 expansion) and `String`'s use of UTF-16 (2 bytes per character), the memory footprint of the `String` is `4/3 * 2 = 8/3`, nearly three times, the size of the byte array on either side of the py4j pipe. py4j also appears to make an entire copy of this payload, which means nearly a 6x memory requirement for sending back bytes. Using our own socket means we can send the response bytes directly back to Python without any of this overhead, even going so far as to encode results directly into the TCP output stream. Formalizing the API between Python and Java also allows us to reuse the same payload schema across all three backends.
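
To illustrate the general shape of "actions as HTTP endpoints", here is a minimal, self-contained Python stand-in. It is not the code from #13797 (the real server lives on the JVM side); the `/execute` path, the handler, and the byte-reversing "action" are all hypothetical. The point it shows is that a POST body and its response can travel as raw bytes with no Base64 step.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ActionHandler(BaseHTTPRequestHandler):
    """Toy backend: each 'action' is a POST whose response body is raw bytes."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        if self.path != "/execute":  # hypothetical endpoint name
            self.send_error(404)
            return
        result = body[::-1]  # stand-in for actually running the action
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)  # bytes go straight onto the output stream

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server() -> ThreadingHTTPServer:
    """Start the toy backend on an OS-assigned local port in a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), ActionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_action(port: int, payload: bytes) -> bytes:
    """POST raw bytes to the toy backend and return the raw response body."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request("POST", "/execute", body=payload,
                     headers={"Content-Type": "application/octet-stream"})
        return conn.getresponse().read()
    finally:
        conn.close()
```

A client would call `start_server()`, read the port from `server.server_address[1]`, and pass it to `call_action`. A threading server is used here only so that concurrent requests are possible, which matches the default expectation of HTTP servers raised earlier in the thread.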

danking · 2023-11-02T16:25:37Z

GVS team confirms their pipeline containing interval literals went from >50 GB (crashing at that point) to less than 11GB! 👏

daniel-goldstein added the needs-triage label Oct 2, 2023

daniel-goldstein self-assigned this Oct 2, 2023

danking added the new-feature label and removed the needs-triage label Oct 2, 2023

danking mentioned this issue Oct 2, 2023

[qob] Since 0.2.117 vds.filter_intervals unnecessarily uses a lot of RAM #13748

Closed

danking mentioned this issue Oct 12, 2023

Release 0.2.125 #13806

Closed

danking mentioned this issue Oct 17, 2023

[query] Avoid py4j for python-backend interactions #13797

Merged

danking closed this as completed in #13797 Oct 20, 2023
