Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[qob] permissions error is not propagated back to driver job properly #13697

Closed
danking opened this issue Sep 22, 2023 · 1 comment · Fixed by #13715
Closed

[qob] permissions error is not propagated back to driver job properly #13697

danking opened this issue Sep 22, 2023 · 1 comment · Fixed by #13715
Assignees
Labels

Comments

@danking
Copy link
Collaborator

danking commented Sep 22, 2023

What happened?

Try writing to a bucket to which your service account has read-only access:

hl.utils.range_table(5,n_partitions=5).write('gs://neale-bge/foo.ht')

https://batch.hail.is/batches/8042383

The client gets an error like this:

Java stack trace:
is.hail.relocated.com.google.cloud.storage.StorageException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/1-day/o/parallelizeAndComputeWithIndex%2FO3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=%2Fresult.0?alt=media
No such object: 1-day/parallelizeAndComputeWithIndex/O3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=/result.0
	at is.hail.relocated.com.google.cloud.storage.StorageException.translate(StorageException.java:165)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:298)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:729)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.lambda$readAllBytes$20(StorageImpl.java:610)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:65)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.run(StorageImpl.java:1515)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:610)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:599)
	at is.hail.io.fs.GoogleStorageFS.$anonfun$readNoCompression$1(GoogleStorageFS.scala:280)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.io.fs.GoogleStorageFS.readNoCompression(GoogleStorageFS.scala:278)
	at is.hail.io.fs.RouterFS.readNoCompression(RouterFS.scala:25)
	at is.hail.backend.service.ServiceBackend.$anonfun$parallelizeAndComputeWithIndex$15(ServiceBackend.scala:225)
	at scala.util.Try$.apply(Try.scala:213)
	at is.hail.utils.package$.$anonfun$runAll$2(package.scala:995)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

The driver will have log output like this:

2023-09-22 19:11:13.051 Requester: INFO: request GET http://batch.hail/api/v1alpha/batches/8042383 response 200
2023-09-22 19:11:13.052 ServiceBackend$: INFO: parallelizeAndComputeWithIndex: O3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=: reading results
2023-09-22 19:11:13.125 ServiceBackend$: INFO: all results read. 0.072746861 s. 0.0 result/s. 0.0 MiB/s.
2023-09-22 19:11:13.125 : INFO: [collectDArray|table_native_writer]: executed 5 tasks in 1.822s
2023-09-22 19:11:13.126 : INFO: RegionPool: initialized for thread 10: pool-2-thread-2
2023-09-22 19:11:13.126 : INFO: TaskReport: stage=0, partition=0, attempt=0, peakBytes=0, peakBytesReadable=0.00 B, chunks requested=0, cache hits=0
2023-09-22 19:11:13.126 : INFO: RegionPool: FREE: 0 allocated (0 blocks / 0 chunks), regions.size = 0, 0 current java objects, thread 10: pool-2-thread-2
2023-09-22 19:11:13.127 : INFO: RegionPool: initialized for thread 10: pool-2-thread-2
2023-09-22 19:11:13.127 : INFO: TaskReport: stage=0, partition=0, attempt=0, peakBytes=0, peakBytesReadable=0.00 B, chunks requested=0, cache hits=0
2023-09-22 19:11:13.127 : INFO: RegionPool: FREE: 0 allocated (0 blocks / 0 chunks), regions.size = 0, 0 current java objects, thread 10: pool-2-thread-2
2023-09-22 19:11:13.127 : INFO: RegionPool: FREE: 128.0K allocated (128.0K blocks / 0 chunks), regions.size = 2, 0 current java objects, thread 10: pool-2-thread-2
2023-09-22 19:11:13.138 : ERROR: GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/1-day/o/parallelizeAndComputeWithIndex%2FO3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=%2Fresult.0?alt=media
No such object: 1-day/parallelizeAndComputeWithIndex/O3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=/result.0
From is.hail.relocated.com.google.cloud.storage.StorageException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/1-day/o/parallelizeAndComputeWithIndex%2FO3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=%2Fresult.0?alt=media
No such object: 1-day/parallelizeAndComputeWithIndex/O3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=/result.0
	at is.hail.relocated.com.google.cloud.storage.StorageException.translate(StorageException.java:165)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:298)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:729)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.lambda$readAllBytes$20(StorageImpl.java:610)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:65)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.run(StorageImpl.java:1515)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:610)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:599)
	at is.hail.io.fs.GoogleStorageFS.$anonfun$readNoCompression$1(GoogleStorageFS.scala:280)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.io.fs.GoogleStorageFS.readNoCompression(GoogleStorageFS.scala:278)
	at is.hail.io.fs.RouterFS.readNoCompression(RouterFS.scala:25)
	at is.hail.backend.service.ServiceBackend.$anonfun$parallelizeAndComputeWithIndex$15(ServiceBackend.scala:225)
	at scala.util.Try$.apply(Try.scala:213)
	at is.hail.utils.package$.$anonfun$runAll$2(package.scala:995)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

but the worker looks like this:

2023-09-22 19:11:12.125 JVMEntryway: INFO: is.hail.JVMEntryway received arguments:
2023-09-22 19:11:12.125 JVMEntryway: INFO: 0: /hail-jars/gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar
2023-09-22 19:11:12.125 JVMEntryway: INFO: 1: is.hail.backend.service.Main
2023-09-22 19:11:12.125 JVMEntryway: INFO: 2: /batch/fe537a243a3046d29d76861ffee94b92
2023-09-22 19:11:12.125 JVMEntryway: INFO: 3: /batch/fe537a243a3046d29d76861ffee94b92/log
2023-09-22 19:11:12.126 JVMEntryway: INFO: 4: gs://hail-query-ger0g/jars/b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar
2023-09-22 19:11:12.126 JVMEntryway: INFO: 5: worker
2023-09-22 19:11:12.126 JVMEntryway: INFO: 6: gs://1-day/parallelizeAndComputeWithIndex/O3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss=
2023-09-22 19:11:12.126 JVMEntryway: INFO: 7: 0
2023-09-22 19:11:12.126 JVMEntryway: INFO: 8: 5
2023-09-22 19:11:12.126 JVMEntryway: INFO: Yielding control to the QoB Job.
2023-09-22 19:11:12.131 Worker$: INFO: is.hail.backend.service.Worker b115f6a6ec23f111a4512b562b52d9f8a52ec41c
2023-09-22 19:11:12.131 Worker$: INFO: running job 0/5 at root gs://1-day/parallelizeAndComputeWithIndex/O3mcL5QoyfBBMz0eSi7uIjGQkV3FuD5vCON_8i4r0ss= with scratch directory '/batch/fe537a243a3046d29d76861ffee94b92'
2023-09-22 19:11:12.143 GoogleStorageFS$: INFO: Initializing google storage client from service account key
2023-09-22 19:11:12.456 WorkerTimer$: INFO: readInputs took 325.065544 ms.
2023-09-22 19:11:12.456 : INFO: RegionPool: initialized for thread 10: pool-2-thread-2
2023-09-22 19:11:12.481 GoogleStorageFS$: INFO: createNoCompression: gs://neale-bge/foo.ht/index/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79.idx/index
2023-09-22 19:11:12.486 : INFO: RegionPool: REPORT_THRESHOLD: 257.0K allocated (129.0K blocks / 128.0K chunks), regions.size = 3, 0 current java objects, thread 10: pool-2-thread-2
2023-09-22 19:11:12.486 : INFO: RegionPool: REPORT_THRESHOLD: 577.0K allocated (193.0K blocks / 384.0K chunks), regions.size = 4, 0 current java objects, thread 10: pool-2-thread-2
2023-09-22 19:11:12.486 GoogleStorageFS$: INFO: createNoCompression: gs://neale-bge/foo.ht/rows/parts/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79
2023-09-22 19:11:12.625 GoogleStorageFS$: INFO: close: gs://neale-bge/foo.ht/index/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79.idx/index
2023-09-22 19:11:12.656 : INFO: TaskReport: stage=0, partition=0, attempt=0, peakBytes=656384, peakBytesReadable=641.00 KiB, chunks requested=4, cache hits=2
2023-09-22 19:11:12.656 : INFO: RegionPool: FREE: 641.0K allocated (257.0K blocks / 384.0K chunks), regions.size = 5, 0 current java objects, thread 10: pool-2-thread-2
2023-09-22 19:11:12.656 JVMEntryway: ERROR: QoB Job threw an exception.
java.lang.reflect.InvocationTargetException: null
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_382]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_382]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119) ~[jvm-entryway.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: is.hail.relocated.com.google.cloud.storage.StorageException: 403 Forbidden
POST https://storage.googleapis.com/upload/storage/v1/b/neale-bge/o?name=foo.ht/index/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79.idx/index&uploadType=resumable
{
  "error": {
    "code": 403,
    "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
    "errors": [
      {
        "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}

	at is.hail.relocated.com.google.cloud.storage.StorageException.translate(StorageException.java:165) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:298) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1029) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$0(ResumableMedia.java:40) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:65) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:674) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:95) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$2(GoogleStorageFS.scala:300) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$1(GoogleStorageFS.scala:300) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$1$adapted(GoogleStorageFS.scala:299) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS.is$hail$io$fs$GoogleStorageFS$$handleRequesterPays(GoogleStorageFS.scala:181) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.doHandlingRequesterPays(GoogleStorageFS.scala:304) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.flush(GoogleStorageFS.scala:314) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at java.io.DataOutputStream.flush(DataOutputStream.java:123) ~[?:1.8.0_382]
	at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_382]
	at is.hail.utils.richUtils.ByteTrackingOutputStream.close(ByteTrackingOutputStream.scala:23) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.index.IndexWriterUtils.close(IndexWriter.scala:225) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at __C30collect_distributed_array_table_native_writer.apply(Unknown Source) ~[?:?]
	at __C30collect_distributed_array_table_native_writer.apply(Unknown Source) ~[?:?]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:637) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:637) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.main(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main$.main(Main.scala:14) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main.main(Main.scala) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	... 12 more
	Suppressed: is.hail.relocated.com.google.cloud.storage.StorageException: 403 Forbidden
POST https://storage.googleapis.com/upload/storage/v1/b/neale-bge/o?name=foo.ht/index/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79.idx/index&uploadType=resumable
{
  "error": {
    "code": 403,
    "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
    "errors": [
      {
        "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}

		at is.hail.relocated.com.google.cloud.storage.StorageException.translate(StorageException.java:165) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:298) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1029) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$0(ResumableMedia.java:40) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:65) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:674) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:95) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$2(GoogleStorageFS.scala:300) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$1(GoogleStorageFS.scala:300) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$1$adapted(GoogleStorageFS.scala:299) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS.is$hail$io$fs$GoogleStorageFS$$handleRequesterPays(GoogleStorageFS.scala:181) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS$$anon$2.doHandlingRequesterPays(GoogleStorageFS.scala:304) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$close$1(GoogleStorageFS.scala:326) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
		at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.fs.GoogleStorageFS$$anon$2.close(GoogleStorageFS.scala:324) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at java.io.FilterOutputStream.close(FilterOutputStream.java:159) ~[?:1.8.0_382]
		at is.hail.utils.richUtils.ByteTrackingOutputStream.close(ByteTrackingOutputStream.scala:23) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.io.index.IndexWriterUtils.close(IndexWriter.scala:225) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at __C30collect_distributed_array_table_native_writer.apply(Unknown Source) ~[?:?]
		at __C30collect_distributed_array_table_native_writer.apply(Unknown Source) ~[?:?]
		at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.utils.package$.using(package.scala:637) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
		at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.utils.package$.using(package.scala:637) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.service.Worker$.main(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.service.Main$.main(Main.scala:14) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.backend.service.Main.main(Main.scala) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_382]
		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_382]
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
		at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
		at is.hail.JVMEntryway$1.run(JVMEntryway.java:119) ~[jvm-entryway.jar:?]
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
		at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
		at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_382]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
		at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
	Caused by: com.google.api.client.http.HttpResponseException: 403 Forbidden
POST https://storage.googleapis.com/upload/storage/v1/b/neale-bge/o?name=foo.ht/index/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79.idx/index&uploadType=resumable
{
  "error": {
    "code": 403,
    "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
    "errors": [
      {
        "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}

		at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1022) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
		... 48 more
Caused by: com.google.api.client.http.HttpResponseException: 403 Forbidden
POST https://storage.googleapis.com/upload/storage/v1/b/neale-bge/o?name=foo.ht/index/part-0-c7ba7549-bf68-42db-a8ef-0f1b13721c79.idx/index&uploadType=resumable
{
  "error": {
    "code": 403,
    "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
    "errors": [
      {
        "message": "dking-ae4q6@hail-vdc.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist).",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}

	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1022) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$0(ResumableMedia.java:40) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:65) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:674) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:95) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$2(GoogleStorageFS.scala:300) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$1(GoogleStorageFS.scala:300) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$doHandlingRequesterPays$1$adapted(GoogleStorageFS.scala:299) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS.is$hail$io$fs$GoogleStorageFS$$handleRequesterPays(GoogleStorageFS.scala:181) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.doHandlingRequesterPays(GoogleStorageFS.scala:304) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.fs.GoogleStorageFS$$anon$2.flush(GoogleStorageFS.scala:314) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at java.io.DataOutputStream.flush(DataOutputStream.java:123) ~[?:1.8.0_382]
	at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_382]
	at is.hail.utils.richUtils.ByteTrackingOutputStream.close(ByteTrackingOutputStream.scala:23) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.index.IndexWriterUtils.close(IndexWriter.scala:225) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at __C30collect_distributed_array_table_native_writer.apply(Unknown Source) ~[?:?]
	at __C30collect_distributed_array_table_native_writer.apply(Unknown Source) ~[?:?]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:637) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$12(Worker.scala:167) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at is.hail.services.package$.retryTransientErrors(package.scala:182) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11(Worker.scala:166) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.$anonfun$main$11$adapted(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.utils.package$.using(package.scala:637) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Worker$.main(Worker.scala:164) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main$.main(Main.scala:14) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.backend.service.Main.main(Main.scala) ~[gs:__hail-query-ger0g_jars_b115f6a6ec23f111a4512b562b52d9f8a52ec41c.jar.jar:0.0.1-SNAPSHOT]
	... 12 more

Version

0.2.124

Relevant log output

No response

@danking danking added the needs-triage A brand new issue that needs triaging. label Sep 22, 2023
@danking
Copy link
Collaborator Author

danking commented Sep 22, 2023

The root cause is that we only catch HailException, not all exceptions. I suppose we should catch all exceptions?

@danking danking added bug and removed needs-triage A brand new issue that needs triaging. labels Sep 22, 2023
danking added a commit to danking/hail that referenced this issue Sep 26, 2023
CHANGELOG: Fixes hail-is#13697, a long standing issue with QoB, in which a failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do this this way in the first place.
danking added a commit to danking/hail that referenced this issue Sep 26, 2023
CHANGELOG: Fixes hail-is#13697, a long standing issue with QoB, in which a failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do this this way in the first place.
danking added a commit to danking/hail that referenced this issue Sep 26, 2023
CHANGELOG: Fixes hail-is#13697, a long standing issue with QoB, in which a failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do this this way in the first place.
danking added a commit that referenced this issue Oct 16, 2023
…13715)

CHANGELOG: Fixes #13697, a long standing issue with QoB, in which a
failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do this this way in the first place. If a
JVMJob raises an exception, Batch will mark the job as failed. Ergo, we
should raise an exception when a driver or a worker fails!

Here's an example: I used a simple pipeline that write to a bucket to
which I have read-only access. You can see an example Batch (where every
partition fails): https://batch.hail.is/batches/8046901. [1]

```python3
import hail as hl
hl.utils.range_table(3, n_partitions=3).write('gs://neale-bge/foo.ht')
```

NB: I removed the `log.error` in `handleForPython` because that log is
never necessary. That function converts a stack of exceptions into a
triplet of the short message, the full exception with stack trace, and a
Hail error id (if present). That triplet is always passed along to
someone else who logs the exception.

(FWIW, the error id indicates a Python source location that is
associated with the error. On the Python-side, we can look up that error
id and provide a better stack trace.)

[1] You'll notice the logs are missing. I noticed this as well, it's a
new bug. I fixed it in #13729.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant