[query] automatic retry code in Java is broken #13704

Closed
danking opened this issue Sep 25, 2023 · 1 comment · Fixed by #13713

@danking
Collaborator

danking commented Sep 25, 2023

What happened?

https://batch.hail.is/batches/8043502/jobs/43724

Caused by: java.lang.IllegalArgumentException: bound must be positive
	at java.util.Random.nextInt(Random.java:388)
	at scala.util.Random.nextInt(Random.scala:70)
	at is.hail.services.package$.delayMsForTry(package.scala:47)
	at is.hail.services.package$.retryTransientErrors(package.scala:186)
	at is.hail.io.fs.GoogleStorageFS$$anon$1.retryingRead(GoogleStorageFS.scala:220)
	at is.hail.io.fs.GoogleStorageFS$$anon$1.readHandlingRequesterPays(GoogleStorageFS.scala:226)
	at is.hail.io.fs.GoogleStorageFS$$anon$1.fill(GoogleStorageFS.scala:257)
	at is.hail.io.fs.FSSeekableInputStream.read(FS.scala:170)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at is.hail.utils.ByteTrackingInputStream.read(ByteTrackingInputStream.scala:28)
	at is.hail.utils.richUtils.RichInputStream$.readRepeatedly$extension0(RichInputStream.scala:21)
	at is.hail.utils.richUtils.RichInputStream$.readFully$extension1(RichInputStream.scala:12)
	at is.hail.io.StreamBlockInputBuffer.readBlock(InputBuffers.scala:549)
	at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:643)
	at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:384)
	at is.hail.io.BlockingInputBuffer.readByte(InputBuffers.scala:402)
	at is.hail.io.LEB128InputBuffer.readByte(InputBuffers.scala:219)
	at __C372collect_distributed_array_matrix_native_writer.__m478readLeafNode(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply_region16_290(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply_region4_318(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply_region2_501(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90)
	at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166)
	at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.backend.service.Worker$.main(Worker.scala:164)
	at is.hail.backend.service.Main$.main(Main.scala:14)
	at is.hail.backend.service.Main.main(Main.scala)
	... 11 more

Version

0.2.124

Relevant log output

No response

@danking added the needs-triage and bug labels, then removed the needs-triage label, on Sep 25, 2023
@danking
Collaborator Author

danking commented Sep 25, 2023

Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 2272, in run
    await self.jvm.execute(
  File "/usr/local/lib/python3.9/dist-packages/batch/worker/worker.py", line 2872, in execute
    raise JVMUserError(exception)
JVMUserError: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at is.hail.JVMEntryway.retrieveException(JVMEntryway.java:253)
	at is.hail.JVMEntryway.finishFutures(JVMEntryway.java:215)
	at is.hail.JVMEntryway.main(JVMEntryway.java:185)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:122)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119)
	... 7 more
Caused by: java.lang.IllegalArgumentException: bound must be positive
	at java.util.Random.nextInt(Random.java:388)
	at scala.util.Random.nextInt(Random.scala:70)
	at is.hail.services.package$.delayMsForTry(package.scala:47)
	at is.hail.services.package$.retryTransientErrors(package.scala:186)
	at is.hail.io.fs.GoogleStorageFS$$anon$1.retryingRead(GoogleStorageFS.scala:220)
	at is.hail.io.fs.GoogleStorageFS$$anon$1.readHandlingRequesterPays(GoogleStorageFS.scala:226)
	at is.hail.io.fs.GoogleStorageFS$$anon$1.fill(GoogleStorageFS.scala:257)
	at is.hail.io.fs.FSSeekableInputStream.read(FS.scala:170)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at is.hail.utils.ByteTrackingInputStream.read(ByteTrackingInputStream.scala:28)
	at is.hail.utils.richUtils.RichInputStream$.readRepeatedly$extension0(RichInputStream.scala:21)
	at is.hail.utils.richUtils.RichInputStream$.readFully$extension1(RichInputStream.scala:12)
	at is.hail.io.StreamBlockInputBuffer.readBlock(InputBuffers.scala:549)
	at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:643)
	at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:384)
	at is.hail.io.BlockingInputBuffer.readByte(InputBuffers.scala:402)
	at is.hail.io.LEB128InputBuffer.readByte(InputBuffers.scala:219)
	at __C372collect_distributed_array_matrix_native_writer.__m478readLeafNode(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply_region16_290(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply_region4_318(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply_region2_501(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at __C372collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90)
	at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166)
	at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.backend.service.Worker$.main(Worker.scala:164)
	at is.hail.backend.service.Main$.main(Main.scala:14)
	at is.hail.backend.service.Main.main(Main.scala)
	... 11 more

Logs
Main
Log 
2023-09-24 17:23:30.055 JVMEntryway: ERROR: Exception encountered in QoB cancel thread.
org.newsclub.net.unix.SocketClosedException: Not open
	at org.newsclub.net.unix.AFCore.validFdOrException(AFCore.java:90) ~[jvm-entryway.jar:?]
	at org.newsclub.net.unix.AFSocketImpl$AFInputStreamImpl.read(AFSocketImpl.java:510) ~[jvm-entryway.jar:?]
	at java.io.DataInputStream.readInt(DataInputStream.java:388) ~[?:1.8.0_382]
	at is.hail.JVMEntryway$2.run(JVMEntryway.java:136) ~[jvm-entryway.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]

@ehigham ehigham self-assigned this Sep 25, 2023
danking added a commit to danking/hail that referenced this issue Sep 26, 2023
CHANGELOG: Fixes hail-is#13704, in which Hail could encounter an IllegalArgumentException if there are too many transient errors.

I need to do the multiplication in 64-bits so that it does not wrap around to a large negative value. Then I can use `math.min` with the maxDelayMs to get us back into 32-bits.
danking added a commit that referenced this issue Sep 27, 2023
CHANGELOG: Fixes #13704, in which Hail could encounter an
IllegalArgumentException if there are too many transient errors.

I need to do the multiplication in 64-bits so that it does not wrap
around to a large negative value. Then I can use `math.min` with the
maxDelayMs to get us back into 32-bits.

I'm just pushing through a bunch of bugs to get Wenhan unblocked today.
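
The wraparound described in the commit message can be reproduced with a small sketch. This is not the Hail source: the class name, method names, the 1000 ms base delay, the 60000 ms cap, and the exponent limit of 30 are all assumptions chosen for illustration. It only shows how a 32-bit multiplication can hand a negative bound to `Random.nextInt`, and how doing the multiplication in 64 bits before taking the minimum avoids that.

```java
import java.util.Random;

// Hypothetical sketch; names and constants are illustrative, not taken from Hail.
final class RetryDelaySketch {
    static final int BASE_DELAY_MS = 1000;   // assumed base delay
    static final int MAX_DELAY_MS = 60_000;  // assumed cap on the delay
    static final int MAX_EXPONENT = 30;      // assumed limit on the shift

    // 32-bit version: BASE_DELAY_MS * 2^tries can wrap around to a negative int,
    // and Math.min then keeps the negative value instead of the cap.
    static int ceilingInt(int tries) {
        int multiplier = 1 << Math.min(tries, MAX_EXPONENT);
        return Math.min(BASE_DELAY_MS * multiplier, MAX_DELAY_MS);
    }

    // 64-bit version: the product cannot wrap for these inputs, and the min
    // with the cap guarantees the result fits back into an int.
    static int ceilingLong(int tries) {
        long multiplier = 1L << Math.min(tries, MAX_EXPONENT);
        return (int) Math.min(BASE_DELAY_MS * multiplier, (long) MAX_DELAY_MS);
    }

    // Jittered delay in [ceiling / 2, ceiling]; nextInt throws
    // IllegalArgumentException when its bound is not positive.
    static int jitter(Random rng, int ceilingMs) {
        return ceilingMs / 2 + rng.nextInt(ceilingMs / 2 + 1);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println("32-bit ceiling after 22 tries: " + ceilingInt(22));
        System.out.println("64-bit ceiling after 22 tries: " + ceilingLong(22));
        System.out.println("jittered delay: " + jitter(rng, ceilingLong(22)));
        try {
            jitter(rng, ceilingInt(22));
        } catch (IllegalArgumentException e) {
            System.out.println("32-bit path failed: " + e.getMessage());
        }
    }
}
```

Under these assumed constants the 32-bit product first goes negative at 22 tries (1000 * 2^22 exceeds `Integer.MAX_VALUE`), which is consistent with the failure only appearing after many transient errors.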