[query] Don't use py4j for Backend operations #13756
There remain a couple questions that a solution should answer:
danking added the new-feature label and removed the needs-triage label on Oct 2, 2023. The issue is now closed.
danking: draft PR: #13797

danking pushed a commit referencing this issue on Oct 20, 2023:
CHANGELOG: Fixes #13756: operations that collect large results, such as `to_pandas`, may require up to 3x less memory.

This turns all "actions", i.e. backend methods supported by QoB, into HTTP endpoints on the spark and local backends. It intentionally avoids py4j, because py4j was designed to pass function names and references around and does not handle large payloads well (such as results from a `collect`). Specifically, py4j uses a text-based protocol on top of TCP that substantially inflates the memory required to communicate large byte arrays. On the Java side, py4j serializes every binary payload as a Base64-encoded `java.lang.String`; between the Base64 encoding and `String`'s use of UTF-16, the memory footprint of the `String` is `4/3 * 2 = 8/3`, nearly three times the size of the byte array, on either side of the py4j pipe. py4j also appears to make an entire copy of this payload, which means nearly a 6x memory requirement for sending back bytes.

Using our own socket means we can send the response bytes directly to python without any of this overhead, even going so far as to encode results directly into the TCP output stream. Formalizing the API between python and java also allows us to reuse the same payload schema across all three backends.
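The `4/3 * 2 = 8/3` figure from the commit message can be checked directly. A minimal sketch (illustrative arithmetic only, not Hail or py4j code):

```python
# Estimate the memory needed to ship a byte payload through a
# Base64-over-text protocol like py4j's, per the commit message above.
import base64

payload = bytes(3_000_000)  # e.g. a 3 MB result from a collect

# Base64 emits 4 output characters for every 3 input bytes.
encoded = base64.b64encode(payload)
assert len(encoded) == len(payload) * 4 // 3

# On the JVM side the encoded text lives in a java.lang.String, which
# (per the issue) stores characters as UTF-16, i.e. 2 bytes each, so the
# String's in-memory footprint relative to the raw bytes is:
jvm_string_bytes = len(encoded) * 2
print(jvm_string_bytes / len(payload))  # 4/3 (Base64) * 2 (UTF-16) = 8/3 ≈ 2.67
```

An extra full copy of that payload, as the commit describes, roughly doubles this again, giving the "nearly 6x" figure.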
GVS team confirms their pipeline containing interval literals went from >50 GB (crashing at that point) to less than 11 GB! 👏
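The commit above turns backend actions into HTTP endpoints that return raw bytes. A hedged sketch of that shape (the `/execute` route and echo behavior are invented for illustration, not Hail's actual API):

```python
# Sketch: a backend "action" exposed as an HTTP endpoint that streams raw
# result bytes, with no Base64 or string re-encoding in the path.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ActionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical action: echo the request body back verbatim,
        # written directly to the response stream as bytes.
        n = int(self.headers["Content-Length"])
        body = self.rfile.read(n)
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), ActionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/execute"
result = urllib.request.urlopen(url, data=b"\x00\x01\x02raw-bytes").read()
server.shutdown()
```

Because the payload stays `bytes` end to end, the client's memory cost is bounded by the payload itself rather than inflated intermediate strings.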
What happened?
The spark and local backends use `py4j` to execute methods on java backends. `py4j` uses a TCP socket and a text-based protocol to communicate between python and the JVM, and handles marshaling of data between the two processes. Unfortunately, it has poor memory performance with large byte arrays: the text protocol requires Base64-encoding byte arrays, and it uses Java `String`s which, being UTF-16, more than double the size of the original data in memory.

Hail should not use `py4j` for these operations and should instead open its own connection to the java backend. This gives us the control to use no more memory than is necessary to ship bytes back and forth. It also provides an opportunity to deduplicate some code, as the `ServiceBackend` already writes its inputs over a socket instead of using `py4j` (there is no live JVM to communicate with in the `ServiceBackend` case, so it must serialize the requested operation to be run at a later time on a different machine).

Version
0.2.124
Relevant log output
No response
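The alternative the issue proposes, shipping bytes back and forth over Hail's own socket, amounts to simple length-prefixed binary framing. A minimal sketch under that assumption (the frame layout here is invented for illustration, not Hail's actual wire format):

```python
# Length-prefixed binary frames over a plain socket: the raw result bytes
# travel as-is, with no Base64 step and no UTF-16 string copies.
import socket
import struct

def send_result(sock: socket.socket, payload: bytes) -> None:
    # 8-byte big-endian length header, then the raw bytes.
    sock.sendall(struct.pack(">Q", len(payload)) + payload)

def recv_result(sock: socket.socket) -> bytes:
    (n,) = struct.unpack(">Q", _recv_exact(sock, 8))
    return _recv_exact(sock, n)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    # recv() may return fewer bytes than requested; loop until we have n.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)
```

With framing like this, the receiver's peak memory is the frame itself (plus a bounded read buffer), rather than the ~3x inflation of a Base64-over-text protocol.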