Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Don't use py4j for Backend operations #13756

Closed
daniel-goldstein opened this issue Oct 2, 2023 · 3 comments · Fixed by #13797
Closed

[query] Don't use py4j for Backend operations #13756

daniel-goldstein opened this issue Oct 2, 2023 · 3 comments · Fixed by #13797
Assignees

Comments

@daniel-goldstein
Copy link
Contributor

What happened?

The spark and local backends use py4j to execute methods on java backends. py4j uses a TCP socket and a text-based protocol to communicate between python and the jvm and handles marshaling of data between the two processes. Unfortunately it has poor memory performance with large byte arrays, as the text protocol requires base64 encoding byte arrays and it uses Java Strings which, being UTF-16, more than double the size of the original data in memory.

Hail should not use py4j for these operations and just open its own connection to the java backend. This gives us the control to not use more memory than is necessary to just ship bytes back and forth. This also provides an opportunity to deduplicate some code as the ServiceBackend already communicates writes its inputs over a socket instead of using py4j (there is no live JVM to communicate to in the ServiceBackend case, so it must serialize the requested operation to be run at a later time on a different machine).

Version

0.2.124

Relevant log output

No response

@daniel-goldstein daniel-goldstein added the needs-triage A brand new issue that needs triaging. label Oct 2, 2023
@daniel-goldstein daniel-goldstein self-assigned this Oct 2, 2023
@daniel-goldstein
Copy link
Contributor Author

daniel-goldstein commented Oct 2, 2023

There remain a couple questions that a solution should answer:

  1. TCP or Unix Domain Socket? Current consensus feels that TCP is a reasonable and more portable way to go (allows for backend deployment over the web/in K8s for example)
  2. Should we use just TCP or also use HTTP? If the Java backends can multiplex requests, HTTP sounds favorable, otherwise it's unclear to me what advantages it would give us over TCP + JSON. Ergonomically HTTP might be easier, but one tends have certain default expectations of HTTP servers (I would assume an HTTP server should be able to serve requests concurrently, are we just going to use all POSTs?, etc.). Either way this feels like a minor adjustment.

@danking
Copy link
Contributor

danking commented Oct 13, 2023

draft PR: #13797

danking pushed a commit that referenced this issue Oct 20, 2023
CHANGELOG: Fixes #13756: operations that collect large results such as
`to_pandas` may require up to 3x less memory.

This turns all "actions", i.e. backend methods supported by QoB into
HTTP endpoints on the spark and local backends. This intentionally
avoids py4j because py4j was really designed to pass function names and
references around and does not handle large payloads well (such as
results from a `collect`). Specifically, py4j uses a text-based protocol
on top of TCP that substantially inflates the memory requirement for
communicating large byte arrays. On the Java side, py4j serializes every
binary payload as a Base64-encoded `java.lang.String`, which between the
Base64 encoding and `String`'s use of UTF-16 results in a memory
footprint of the `String` being `4/3 * 2 = 8/3` nearly three times the
size of the byte array on either side of the py4j pipe. py4j also
appears to do an entire copy of this payload, which means nearly a 6x
memory requirement for sending back bytes. Using our own socket means we
can directly send back the response bytes to python without any of this
overhead, even going so far as to encode results directly into the TCP
output stream. Formalizing the API between python and java also allows
us to reuse the same payload schema across all three backends.
@danking
Copy link
Contributor

danking commented Nov 2, 2023

GVS team confirms their pipeline containing interval literals went from >50 GB (crashing at that point) to less than 11GB! 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants