New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] In 0.2.126, the Dataproc OOMKiller kills the Hail driver process unexpectedly. #13960
Comments
The script in question is located at: gs://danking/1_Generate_Variant_Stats_NVXvC_v1.py . Ping me if you need access. |
Whatever is failing here is likely different from the interval pipeline failures seen in #13748 and related tickets because GVS team has confirmed that 0.2.126 reduces peak RAM usage from >50GB to 11GB. |
Watching - AoU is seeing this error with 0.2.126 |
@daniel-goldstein, here's a much smaller test-case, though the VDS is quite large. You might try this on import hail as hl
hl.init(default_reference='GRCh38', idempotent=True)
vds = hl.vds.read_vds("gs://...")
test_intervals = ['chr13:32355250-32355251']
vds = hl.vds.filter_intervals(
vds,
[hl.parse_locus_interval(x,)
for x in test_intervals]) Log statements like
suggest to me that the master node is dying, then its removed from DNS, then workers are unable to communicate with it. |
This is a lot simpler, I'll try to run this tomorrow morning. |
Cannot reproduce in Dataproc or locally. @daniel-goldstein will contact Yonghao to set up a debug zoom where we can sort this out. Use slack thread to contact. |
An update. I'm working with debugging info from the AoU VDS creation cluster. A VDS creation was run using an n1-highmem-8 driver. The cluster is created by hailctl with no custom driver settings template for hailctl dataproc start
I have the driver node syslogs as well as the Hail log file. For some reason all logs other than the Hail logs are missing from this file. We separately need to determine why all the Spark logs etc. are missing. Based on the syslog, after system start up and just before the Jupyter notebook starts, the system is already using ~8,500MiB:
So, the effective maximum memory that Hail could possibly use is around 43808MiB. After the Notebook and Spark initialize we're down to 42,700 MiB (about ~1000MiB more in use).
At time of sigkill the total memory allocated by the JVM was about 2000MiB below the max heap size. Note that the heap is contained within all memory allocated by the JVM.
Indeed, the VmRSS is the memory in use from the kernel's perspective so it includes any off-heap memory created by Hail. The Hail log indicates the region pools are tiny, ~10s of MiB. Not a concern. After the JVM is killed, memory jumps back up to 40683MiB (which checks out, that's roughly what the killed process was using).
|
We ran the same pipeline with an n1-highmem-16 driver node and it made it through 50 sample groups (each sample group has ~4000 samples) before crashing. Unfortunately, we do not have the syslogs from this run. We also do not have the Hail log from this run. We do have the stdout/stderr from the Python process. There's not much of value there. The Python process exited with code 256. That doesn't make a lot of sense to me because exit codes should be an unsigned 8-bit integer. On a highmem-16, total RAM is 106,496 MiB. Hail's JVM will use 85,197 MiB. We establish above that the system uses about 9,500 MiB (unclear if it would use more on a larger VM). This all leaves 11,799 MiB for the Python process. That seems extremely generous, but apparently not? |
crossposting from a message I sent to the variants team. executive summaryExcess JVM memory use is almost certainly not the issue. I've taken a close look at the import_gvs.py loop and the related Hail Python code. No obvious accumulation of RAM use. AFAICT, the oomkiller keeps killing the pipelines. We need to stop this because the oomkiller (a) acts before the JVM GC can free things and (b) prevents us from getting JVM diagnostics on failure. We control the JVM's max heap with hailctl's --master-memory-fraction (default is 0.8 for 80% of the master machine type's advertised RAM). I suggest we set this down to 0.6 and continue using an n1-highmem-16 driver. To understand what's going on, we gotta see what is using RAM in the n1-highmem-16 case. If I could SSH to the cluster, a simple solution is a screen with top -s 300 -n 100 >memory.log (I'd guess no more than 500KiB per hour of logs) and retrieve that file if the cluster fails. If we could get Google Monitoring set up to retrieve process-level memory statistics from the driver node that should also work. Just to be clear, I don't anticipate any changes to Hail in the next week that would change the memory use of this pipeline. There could be a memory leak, but I have no clews that lead to it. I realize this is an unsatisfying answer. I'm pretty perplexed as to what could be the issue here. technical detailsWe'll call the second to most recent run Run A and the most recent run Run B. Run A (like all runs before it) only manages two sample groups before failing. Run B made it through 50 groups before failing on 51. Why did they fail? The syslog for Run A is clear: the oomkiller killed Run A. We lack syslogs for Run B, so we cannot be certain but the lack of a JVM stack trace suggests to me that (a) the driver failed and (b) the driver was killed by the system. I assume that an n1-highmem-16 uses the same amount of memory for system daemons, so I'd expect just over ten GiB that is used neither by system daemons nor the JVM. Assuming that's right, I can't explain why the oomkiller killed the JVM in Run B. |
I'm fairly certain I know understand this and the AoU VDS creation issue. In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced "memory protection" which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed. From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster:
Notice:
We must address this situation. It seems safe to assume that the system daemons will use a constant 9.5 GiB of RAM. Moreover the advertised RAM amount is at least 1 GiB larger than reality. I propose:
For what it's worth, I think the reason we didn't get an outcry from our local scientific community is that many of them have transitioned to Query-on-Batch where we have exact and total control over the memory available to the driver and the workers. |
This ticket is complete when:
|
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the driver JVM dangerously high. Large scale pipelines are likely to be killed by the oom killer. See issue hail-is#13960 for details. In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced ["memory protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection) which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed. From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the `spark:spark.driver.memory` property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory. Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster: ``` Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2 Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total: 0 MiB Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <= 0.12% and swap <= 1.00%, Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: SIGKILL when mem <= 0.06% and swap <= 0.50% ... Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done' Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%) ``` Notice: 1. The total memory available on the machine is less than 52 GiB (= 53,248 MiB), indeed it is a full 1025 MiB below the advertised amount. 2. Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted) the total memory available is already depleted to 42760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference python3 -c 'import hail' needs 206 MiB. This PR modifies `hailctl dataproc start` and the meaning of `--master-memory-fraction`. Now, `--master-memory-fraction` is the precentage of the memory available to the master node after accounting for the missing 1GiB and the system daemons. We also increase the default memory fraction to 90%. For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An n1-highmem-16 is unchanged at 83 GiB.
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the driver JVM dangerously high. Large scale pipelines are likely to be killed by the oom killer. See issue hail-is#13960 for details. In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced ["memory protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection) which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed. From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the `spark:spark.driver.memory` property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory. Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster: ``` Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2 Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total: 0 MiB Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <= 0.12% and swap <= 1.00%, Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: SIGKILL when mem <= 0.06% and swap <= 0.50% ... Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done' Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%) ``` Notice: 1. The total memory available on the machine is less than 52 GiB (= 53,248 MiB), indeed it is a full 1025 MiB below the advertised amount. 2. Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted) the total memory available is already depleted to 42760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference python3 -c 'import hail' needs 206 MiB. This PR modifies `hailctl dataproc start` and the meaning of `--master-memory-fraction`. Now, `--master-memory-fraction` is the precentage of the memory available to the master node after accounting for the missing 1GiB and the system daemons. We also increase the default memory fraction to 90%. For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An n1-highmem-16 is unchanged at 83 GiB.
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the driver JVM dangerously high. It is now set to an appropriate level. This issue manifests in a variety of inscrutable ways including RemoteDisconnectedError and socket closed. See issue hail-is#13960 for details. In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced ["memory protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection) which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed. From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the `spark:spark.driver.memory` property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory. Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster: ``` Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2 Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total: 0 MiB Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <= 0.12% and swap <= 1.00%, Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: SIGKILL when mem <= 0.06% and swap <= 0.50% ... Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done' Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%) ``` Notice: 1. The total memory available on the machine is less than 52 GiB (= 53,248 MiB), indeed it is a full 1025 MiB below the advertised amount. 2. Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted) the total memory available is already depleted to 42760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference python3 -c 'import hail' needs 206 MiB. This PR modifies `hailctl dataproc start` and the meaning of `--master-memory-fraction`. Now, `--master-memory-fraction` is the precentage of the memory available to the master node after accounting for the missing 1GiB and the system daemons. We also increase the default memory fraction to 90%. For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An n1-highmem-16 is unchanged at 83 GiB.
I'll leave assigned to you, Daniel, to finalize the conversation with Yonghao. |
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the driver JVM dangerously high. It is now set to an appropriate level. This issue manifests in a variety of inscrutable ways including RemoteDisconnectedError and socket closed. See issue #13960 for details. In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced ["memory protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection) which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed. From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the `spark:spark.driver.memory` property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory. Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster: ``` Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2 Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total: 0 MiB Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <= 0.12% and swap <= 1.00%, Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: SIGKILL when mem <= 0.06% and swap <= 0.50% ... Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done' Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%) ``` Notice: 1. The total memory available on the machine is less than 52 GiB (= 53,248 MiB), indeed it is a full 1025 MiB below the advertised amount. 2. Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted) the total memory available is already depleted to 42760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference python3 -c 'import hail' needs 206 MiB. This PR modifies `hailctl dataproc start` and the meaning of `--master-memory-fraction`. Now, `--master-memory-fraction` is the precentage of the memory available to the master node after accounting for the missing 1GiB and the system daemons. We also increase the default memory fraction to 90%. For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An n1-highmem-16 is unchanged at 83 GiB.
The AoU Researcher Workbench issue was, in the end, only superficially related to this issue. There is more explanation at DataBiosphere/leonardo#4034 and in these slack threads: |
What happened?
No hail log file is available.
Version
0.2.126
Relevant log output
The text was updated successfully, but these errors were encountered: