
[query] In Google, large pipelines which encounter transient errors often fail to cleanly restart. #13356

Closed
chrisvittal opened this issue Aug 1, 2023 · 5 comments · Fixed by #13730

chrisvittal (Collaborator) commented Aug 1, 2023

The VDS combiner is flaky on query on batch on GCP due to issues reading VCFs with intervals.

Errors observed:

  • BGZ validation errors
  • Unexpected end of input

Both of these point to issues in the interface between the FSSeekableInputStream that underpins GoogleFS and the BGZipInputStream that wraps it, at least when more than one seek occurs.

Unfortunately, the conditions that reproduce this are rare, and when our clusters are quieter (nighttime) the errors are even less frequent.
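The suspected failure mode can be illustrated with a small self-contained sketch. This is not Hail's actual code; it is a hypothetical simulation of what happens if a retry after a transient error reuses the same buffer while the remote stream's position has already advanced, which would produce exactly the kind of corruption ("unexpected end of input", failed BGZ validation) reported above.

```python
class FlakyStream:
    """Simulated remote stream: fails partway through the first read,
    leaving bytes already copied into the caller's buffer."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.failed_once = False

    def read_into(self, buf: bytearray, n: int) -> int:
        if not self.failed_once and self.pos == 0:
            half = n // 2
            buf[:half] = self.data[:half]  # partial write into caller's buffer
            self.pos = half                # server-side position has advanced
            self.failed_once = True
            raise ConnectionResetError("transient network error")
        chunk = self.data[self.pos:self.pos + n]
        buf[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

def read_block_buggy(stream: FlakyStream, n: int) -> bytes:
    # Buggy retry: reuse the same buffer and the same stream, as if
    # nothing had been written and the position had not moved.
    buf = bytearray(n)
    try:
        stream.read_into(buf, n)
    except ConnectionResetError:
        stream.read_into(buf, n)  # retries, but stream.pos already moved
    return bytes(buf)

data = bytes(range(16))
got = read_block_buggy(FlakyStream(data), 16)
# got now begins with the *second* half of the data: silent corruption.
```

Because the stream position moved during the failed attempt, the retry reads the wrong bytes into the start of the buffer, and downstream consumers like a BGZ block reader see a corrupted block rather than a clean I/O error.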

chrisvittal self-assigned this Aug 1, 2023
danking added the bug label Aug 2, 2023
danking (Collaborator) commented Aug 10, 2023

See also #13409

danking (Collaborator) commented Aug 14, 2023

We will attempt to reproduce this by reading the VDS with a new partitioning and force counting. That should trigger lots of seeks.

danking (Collaborator) commented Sep 14, 2023

Here's a clear instance of buffer corruption after a transient error (in this case an SSLException).

https://batch.hail.is/batches/7996481/jobs/182741

2023-09-13 16:37:36.612 JVMEntryway: INFO: is.hail.JVMEntryway received arguments:
2023-09-13 16:37:36.613 JVMEntryway: INFO: 0: /hail-jars/gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar
2023-09-13 16:37:36.613 JVMEntryway: INFO: 1: is.hail.backend.service.Main
2023-09-13 16:37:36.613 JVMEntryway: INFO: 2: /batch/1c00c7157d4d41bcbf508f12d75329b1
2023-09-13 16:37:36.613 JVMEntryway: INFO: 3: /batch/1c00c7157d4d41bcbf508f12d75329b1/log
2023-09-13 16:37:36.613 JVMEntryway: INFO: 4: gs://hail-query-ger0g/jars/be9d88a80695b04a2a9eb5826361e0897d94c042.jar
2023-09-13 16:37:36.613 JVMEntryway: INFO: 5: worker
2023-09-13 16:37:36.613 JVMEntryway: INFO: 6: gs://gnomad-tmp-4day/parallelizeAndComputeWithIndex/s_yyHm37RY7YTSWH29gP5SM0RwKxgs9EXbg9_YMf7ho=
2023-09-13 16:37:36.613 JVMEntryway: INFO: 7: 38854
2023-09-13 16:37:36.613 JVMEntryway: INFO: 8: 47960
2023-09-13 16:37:36.613 JVMEntryway: INFO: Yielding control to the QoB Job.
2023-09-13 16:37:36.614 Worker$: INFO: is.hail.backend.service.Worker be9d88a80695b04a2a9eb5826361e0897d94c042
2023-09-13 16:37:36.614 Worker$: INFO: running job 38854/47960 at root gs://gnomad-tmp-4day/parallelizeAndComputeWithIndex/s_yyHm37RY7YTSWH29gP5SM0RwKxgs9EXbg9_YMf7ho= with scratch directory '/batch/1c00c7157d4d41bcbf508f12d75329b1'
2023-09-13 16:37:36.617 GoogleStorageFS$: INFO: Initializing google storage client from service account key
2023-09-13 16:37:36.821 services: WARN: A limited retry error has occured. We will automatically retry 4 more times. Do not be alarmed. (next delay: 1938). The most recent error was javax.net.ssl.SSLException: Connection reset.
2023-09-13 16:37:38.893 WorkerTimer$: INFO: readInputs took 2278.496020 ms.
2023-09-13 16:37:38.893 : INFO: RegionPool: initialized for thread 9: pool-2-thread-1
2023-09-13 16:37:38.903 : INFO: TaskReport: stage=0, partition=38854, attempt=0, peakBytes=65536, peakBytesReadable=64.00 KiB, chunks requested=0, cache hits=0
2023-09-13 16:37:38.903 : INFO: RegionPool: FREE: 64.0K allocated (64.0K blocks / 0 chunks), regions.size = 1, 0 current java objects, thread 9: pool-2-thread-1
2023-09-13 16:37:38.903 JVMEntryway: ERROR: QoB Job threw an exception.
java.lang.reflect.InvocationTargetException: null
	at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119) ~[jvm-entryway.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208) ~[scala-library-2.12.15.jar:?]
	at is.hail.io.StreamBlockInputBuffer.readBlock(InputBuffers.scala:552) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:384) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.readInt(InputBuffers.scala:409) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at __C16collect_distributed_array_table_coerce_sortedness.__m20INPLACE_DECODE_r_int32_TO_r_int32(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.__m19INPLACE_DECODE_r_struct_of_r_int32ANDr_int32ANDr_binaryANDr_int64ANDr_int64ANDr_boolEND_TO_r_struct_of_r_int32ANDr_int32ANDr_stringANDr_int64ANDr_int64ANDr_boolEND(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.__m18DECODE_r_struct_of_r_struct_of_r_int32ANDr_int32ANDr_binaryANDr_int64ANDr_int64ANDr_boolENDEND_TO_SBaseStructPointer(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.apply(Unknown Source) ~[?:?]
	at __C16collect_distributed_array_table_coerce_sortedness.apply(Unknown Source) ~[?:?]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:52) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:635) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:51) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:635) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.main(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main$.main(Main.scala:14) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main.main(Main.scala) ~[gs:__hail-query-ger0g_jars_be9d88a80695b04a2a9eb5826361e0897d94c042.jar.jar:0.0.1-SNAPSHOT]
	... 11 more

danking changed the title from "The VDS combiner is flaky on query on batch on GCP due to issues reading VCFs with intervals." to "[query] In Google, large pipelines which encounter transient errors often fail to cleanly restart." Sep 14, 2023
danking (Collaborator) commented Sep 27, 2023

I think I have a fix for this.

danking added a commit to danking/hail that referenced this issue Sep 27, 2023
CHANGELOG: Fix hail-is#13356 and fix hail-is#13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely.
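The retry strategy the commit describes can be sketched as follows. This is a minimal illustrative sketch, not the actual Scala implementation: `read_block_fixed`, `open_stream`, and `FlakyChannel` are hypothetical names, and the key property shown is that every attempt gets a fresh buffer and a fresh channel positioned at the recorded offset, so no state survives a failed read.

```python
def read_block_fixed(open_stream, offset: int, n: int, max_attempts: int = 5) -> bytes:
    """On any error, discard the partially-filled buffer and reopen the
    channel at the recorded offset, restarting the read entirely."""
    for attempt in range(max_attempts):
        buf = bytearray(n)            # fresh buffer on every attempt
        stream = open_stream(offset)  # fresh channel positioned at offset
        try:
            filled = 0
            while filled < n:
                got = stream.read_into(memoryview(buf)[filled:], n - filled)
                if got == 0:
                    break  # end of stream
                filled += got
            return bytes(buf[:filled])
        except OSError:
            if attempt == max_attempts - 1:
                raise
            # fall through: buf and stream are simply abandoned

class FlakyChannel:
    """Fake seekable channel: the first read writes a few bytes into the
    caller's buffer and then raises, mimicking a connection reset."""
    fail_next = True

    def __init__(self, data: bytes, offset: int):
        self.data, self.pos = data, offset

    def read_into(self, buf, n: int) -> int:
        if FlakyChannel.fail_next:
            FlakyChannel.fail_next = False
            buf[:4] = self.data[self.pos:self.pos + 4]  # partial write
            self.pos += 4
            raise ConnectionResetError("transient error mid-read")
        chunk = self.data[self.pos:self.pos + n]
        buf[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

data = bytes(range(32))
out = read_block_fixed(lambda off: FlakyChannel(data, off), offset=8, n=16)
# Despite the mid-read failure, the retry starts over and reads the
# correct 16 bytes beginning at offset 8.
```

The design choice is deliberately pessimistic: rather than reason about how much of the buffer a failed read may have filled, it treats the buffer and channel as poisoned after any exception, trading a re-read for correctness.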
danking added a second commit to danking/hail that referenced this issue Sep 27, 2023, with the same message.
danking (Collaborator) commented Sep 27, 2023

See details at #13409 (comment) and fix at #13730.

danking added a commit that referenced this issue Sep 28, 2023
…13730)

CHANGELOG: Fix #13356 and fix #13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely.

As I described in this comment on #13409, I have a 10K partition pipeline which was reliably producing this error but now reliably *does not* produce this error (it produces another one, #13721; a fix is forthcoming for that too).