
[query] In Google, large pipelines which encounter transient errors often fail to cleanly restart. #13356

Closed
chrisvittal opened this issue Aug 1, 2023 · 5 comments · Fixed by #13730

chrisvittal (Collaborator) commented Aug 1, 2023

The VDS combiner is flaky on query on batch on GCP due to issues reading VCFs with intervals.

Errors observed:

  • BGZ validation errors
  • Unexpected end of input

Both of these point to issues in the interface between the FSSeekableInputStream that underpins GoogleFS and the BGZipInputStream that wraps it, at least when more than one seek occurs.

Unfortunately, the conditions that reproduce this are rare, and when our clusters are quieter (nighttime) the errors are even less frequent.
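The suspected failure mode can be illustrated with a small self-contained sketch. This is not Hail's actual code; it is a hypothetical simulation of what happens if a retry after a transient error reuses the same buffer while the remote stream's position has already advanced, which would produce exactly the kind of corruption ("unexpected end of input", failed BGZ validation) reported above.

```python
class FlakyStream:
    """Simulated remote stream: fails partway through the first read,
    leaving bytes already copied into the caller's buffer."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.failed_once = False

    def read_into(self, buf: bytearray, n: int) -> int:
        if not self.failed_once and self.pos == 0:
            half = n // 2
            buf[:half] = self.data[:half]  # partial write into caller's buffer
            self.pos = half                # server-side position has advanced
            self.failed_once = True
            raise ConnectionResetError("transient network error")
        chunk = self.data[self.pos:self.pos + n]
        buf[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

def read_block_buggy(stream: FlakyStream, n: int) -> bytes:
    # Buggy retry: reuse the same buffer and the same stream, as if
    # nothing had been written and the position had not moved.
    buf = bytearray(n)
    try:
        stream.read_into(buf, n)
    except ConnectionResetError:
        stream.read_into(buf, n)  # retries, but stream.pos already moved
    return bytes(buf)

data = bytes(range(16))
got = read_block_buggy(FlakyStream(data), 16)
# got now begins with the *second* half of the data: silent corruption.
```

Because the stream position moved during the failed attempt, the retry reads the wrong bytes into the start of the buffer, and downstream consumers like a BGZ block reader see a corrupted block rather than a clean I/O error.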

chrisvittal self-assigned this Aug 1, 2023
danking added the bug label Aug 2, 2023
danking (Collaborator) commented Aug 10, 2023

See also #13409

danking (Collaborator) commented Aug 14, 2023

We will attempt to reproduce this by reading the VDS with a new partitioning and force counting. That should trigger lots of seeks.

danking (Collaborator) commented Sep 14, 2023

Here's a clear instance of buffer corruption after a transient error (in this case an SSLException).

https://batch.hail.is/batches/7996481/jobs/182741

2023-09-13 16:37:36.612 JVMEntryway: INFO: is.hail.JVMEntryway received arguments:
2023-09-13 16:37:36.613 JVMEntryway: INFO: 0: /hail-jars/gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar
2023-09-13 16:37:36.613 JVMEntryway: INFO: 1: is.hail.backend.service.Main
2023-09-13 16:37:36.613 JVMEntryway: INFO: 2: /batch/1c00c7157d4d41bcbf508f12d75329b1
2023-09-13 16:37:36.613 JVMEntryway: INFO: 3: /batch/1c00c7157d4d41bcbf508f12d75329b1/log
2023-09-13 16:37:36.613 JVMEntryway: INFO: 4: gs://hail-query-ger0g/jars/be9d88a80695b04a2a9eb5826361e0897d94c042.jar
2023-09-13 16:37:36.613 JVMEntryway: INFO: 5: worker
2023-09-13 16:37:36.613 JVMEntryway: INFO: 6: gs://gnomad-tmp-4day/parallelizeAndComputeWithIndex/s_yyHm37RY7YTSWH29gP5SM0RwKxgs9EXbg9_YMf7ho=
2023-09-13 16:37:36.613 JVMEntryway: INFO: 7: 38854
2023-09-13 16:37:36.613 JVMEntryway: INFO: 8: 47960
2023-09-13 16:37:36.613 JVMEntryway: INFO: Yielding control to the QoB Job.
2023-09-13 16:37:36.614 Worker$: INFO: is.hail.backend.service.Worker be9d88a80695b04a2a9eb5826361e0897d94c042
2023-09-13 16:37:36.614 Worker$: INFO: running job 38854/47960 at root gs://gnomad-tmp-4day/parallelizeAndComputeWithIndex/s_yyHm37RY7YTSWH29gP5SM0RwKxgs9EXbg9_YMf7ho= with scratch directory '/batch/1c00c7157d4d41bcbf508f12d75329b1'
2023-09-13 16:37:36.617 GoogleStorageFS$: INFO: Initializing google storage client from service account key
2023-09-13 16:37:36.821 services: WARN: A limited retry error has occured. We will automatically retry 4 more times. Do not be alarmed. (next delay: 1938). The most recent error was javax.net.ssl.SSLException: Connection reset.
2023-09-13 16:37:38.893 WorkerTimer$: INFO: readInputs took 2278.496020 ms.
2023-09-13 16:37:38.893 : INFO: RegionPool: initialized for thread 9: pool-2-thread-1
2023-09-13 16:37:38.903 : INFO: TaskReport: stage=0, partition=38854, attempt=0, peakBytes=65536, peakBytesReadable=64.00 KiB, chunks requested=0, cache hits=0
2023-09-13 16:37:38.903 : INFO: RegionPool: FREE: 64.0K allocated (64.0K blocks / 0 chunks), regions.size = 1, 0 current java objects, thread 9: pool-2-thread-1
2023-09-13 16:37:38.903 JVMEntryway: ERROR: QoB Job threw an exception.
java.lang.reflect.InvocationTargetException: null
	at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119) ~[jvm-entryway.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208) ~[scala-library-2.12.15.jar:?]
	at is.hail.io.StreamBlockInputBuffer.readBlock(InputBuffers.scala:552) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:384) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.readInt(InputBuffers.scala:409) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at __C16collect_distributed_array_table_coerce_sortedness.__m20INPLACE_DECODE_r_int32_TO_r_int32(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.__m19INPLACE_DECODE_r_struct_of_r_int32ANDr_int32ANDr_binaryANDr_int64ANDr_int64ANDr_boolEND_TO_r_struct_of_r_int32ANDr_int32ANDr_stringANDr_int64ANDr_int64ANDr_boolEND(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.__m18DECODE_r_struct_of_r_struct_of_r_int32ANDr_int32ANDr_binaryANDr_int64ANDr_int64ANDr_boolENDEND_TO_SBaseStructPointer(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.apply(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.apply(Unknown Source) ~[?:?]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:52) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:635) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:51) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:635) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.main(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main$.main(Main.scala:14) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main.main(Main.scala) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	... 11 more

danking changed the title from "The VDS combiner is flaky on query on batch on GCP due to issues reading VCFs with intervals." to "[query] In Google, large pipelines which encounter transient errors often fail to cleanly restart." Sep 14, 2023
danking (Collaborator) commented Sep 27, 2023

I think I have a fix for this.

danking added a commit to danking/hail that referenced this issue Sep 27, 2023
CHANGELOG: Fix hail-is#13356 and fix hail-is#13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely.
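The retry strategy the commit describes can be sketched as follows. This is a minimal illustrative sketch, not the actual Scala implementation: `read_block_fixed`, `open_stream`, and `FlakyChannel` are hypothetical names, and the key property shown is that every attempt gets a fresh buffer and a fresh channel positioned at the recorded offset, so no state survives a failed read.

```python
def read_block_fixed(open_stream, offset: int, n: int, max_attempts: int = 5) -> bytes:
    """On any error, discard the partially-filled buffer and reopen the
    channel at the recorded offset, restarting the read entirely."""
    for attempt in range(max_attempts):
        buf = bytearray(n)            # fresh buffer on every attempt
        stream = open_stream(offset)  # fresh channel positioned at offset
        try:
            filled = 0
            while filled < n:
                got = stream.read_into(memoryview(buf)[filled:], n - filled)
                if got == 0:
                    break  # end of stream
                filled += got
            return bytes(buf[:filled])
        except OSError:
            if attempt == max_attempts - 1:
                raise
            # fall through: buf and stream are simply abandoned

class FlakyChannel:
    """Fake seekable channel: the first read writes a few bytes into the
    caller's buffer and then raises, mimicking a connection reset."""
    fail_next = True

    def __init__(self, data: bytes, offset: int):
        self.data, self.pos = data, offset

    def read_into(self, buf, n: int) -> int:
        if FlakyChannel.fail_next:
            FlakyChannel.fail_next = False
            buf[:4] = self.data[self.pos:self.pos + 4]  # partial write
            self.pos += 4
            raise ConnectionResetError("transient error mid-read")
        chunk = self.data[self.pos:self.pos + n]
        buf[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

data = bytes(range(32))
out = read_block_fixed(lambda off: FlakyChannel(data, off), offset=8, n=16)
# Despite the mid-read failure, the retry starts over and reads the
# correct 16 bytes beginning at offset 8.
```

The design choice is deliberately pessimistic: rather than reason about how much of the buffer a failed read may have filled, it treats the buffer and channel as poisoned after any exception, trading a re-read for correctness.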
danking added a second commit to danking/hail that referenced this issue Sep 27, 2023, with the same message.
danking (Collaborator) commented Sep 27, 2023

See details at #13409 (comment) and fix at #13730.

danking added a commit that referenced this issue Sep 28, 2023
…13730)

CHANGELOG: Fix #13356 and fix #13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely.

As I described in this comment on #13409, I have a 10K partition pipeline which was reliably producing this error but now reliably *does not* produce this error (it produces another one, #13721; a fix is forthcoming for that too).