[query] prevent sudden unceremonious death of driver JVM #14066
Merged
CHANGELOG: Since 0.2.110, hailctl dataproc set the heap size of the driver JVM dangerously high. It is now set to an appropriate level. This issue manifests in a variety of inscrutable ways including RemoteDisconnectedError and socket closed. See issue #13960 for details.
In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced "memory protection", which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.
From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, hailctl dataproc, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the spark:spark.driver.memory property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory.
Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2, which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster:
Notice:
The total memory available on the machine is less than 52 GiB (= 53,248 MiB). Indeed, it is a full 1,025 MiB below the advertised amount.
Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted), the total memory available is already depleted to 42,760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference, python3 -c 'import hail' needs 206 MiB.
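The arithmetic behind these observations can be checked with a short sketch. The function name `old_driver_heap_gib` is hypothetical and is not taken from the hailctl source; it only restates the "80% of advertised memory, rounded down" rule described above, and the 42,760 MiB figure is the one quoted from the syslog.

```python
MIB_PER_GIB = 1024


def old_driver_heap_gib(advertised_gib: int, fraction: float = 0.8) -> int:
    """Pre-PR rule as described in the text: a fraction of the
    advertised RAM, rounded down to whole GiB (hypothetical helper)."""
    return int(advertised_gib * fraction)


# n1-highmem-8: Google advertises 52 GiB.
advertised_mib = 52 * MIB_PER_GIB                # 53,248 MiB
heap_gib = old_driver_heap_gib(52)               # 41.6 rounds down to 41
heap_mib = heap_gib * MIB_PER_GIB                # 41,984 MiB

# Memory available once the Dataproc components have started,
# as quoted from the syslog in the description.
available_after_startup_mib = 42760

# What is left for Python and every other process on the node.
headroom_mib = available_after_startup_mib - heap_mib

print(f"heap: {heap_gib} GiB, headroom: {headroom_mib} MiB")
```

With only 776 MiB of headroom and `import hail` alone needing 206 MiB, the JVM can legitimately grow into memory the rest of the system also needs, long before it reaches its own heap limit.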
This PR modifies hailctl dataproc start and the meaning of --master-memory-fraction. Now, --master-memory-fraction is the percentage of the memory available to the master node after accounting for the missing 1 GiB and the system daemons. We also increase the default memory fraction to 90%.
For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An n1-highmem-16 is unchanged at 83 GiB.
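As a rough illustration of the new meaning of --master-memory-fraction, the sketch below applies the fraction to what remains after subtracting the missing 1 GiB and an allowance for system daemons. The function name `new_driver_heap_gib` is hypothetical, and the description does not state the exact daemon allowance: the 10 GiB used here is an assumption, picked only because it reproduces both figures quoted above. The actual hailctl code may compute this differently.

```python
def new_driver_heap_gib(
    advertised_gib: int,
    fraction: float = 0.9,   # new default stated in this PR
    missing_gib: int = 1,    # RAM advertised but not actually present
    daemons_gib: int = 10,   # ASSUMPTION: allowance for system daemons
) -> int:
    """Hypothetical sketch: take a fraction of the memory that is
    actually usable by the driver, rounded down to whole GiB."""
    usable_gib = advertised_gib - missing_gib - daemons_gib
    return int(usable_gib * fraction)


for machine, advertised in [("n1-highmem-8", 52), ("n1-highmem-16", 104)]:
    print(machine, new_driver_heap_gib(advertised), "GiB")
```

Under these assumptions the n1-highmem-8 driver gets 36 GiB and the n1-highmem-16 driver gets 83 GiB, matching the numbers in the description.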