[query][qob] Zstandard corrupted block detected when reading VDS out of GCS #13409

Closed
danking opened this issue Aug 10, 2023 · 3 comments · Fixed by #13730


danking commented Aug 10, 2023

What happened?

Notify these threads on completion:

Using QoB to read out of GCS, we encounter corrupted blocks in this simple pipeline.

Caused by: com.github.luben.zstd.ZstdException: Corrupted block detected
    at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
    at com.github.luben.zstd.Zstd.decompressByteArray(Zstd.java:409) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
    at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:649) ~[gs:__hail-query-ger0g_jars_f00f916faf783b89cc2fc00bfc3e39df5485d8b0.jar.jar:0.0.1-SNAPSHOT]
    at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:384) ~[gs:__hail-query-ger0g_jars_f00f916faf783b89cc2fc00bfc3e39df5485d8b0.jar.jar:0.0.1-SNAPSHOT]
    at is.hail.io.BlockingInputBuffer.readByte(InputBuffers.scala:402) ~[gs:__hail-query-ger0g_jars_f00f916faf783b89cc2fc00bfc3e39df5485d8b0.jar.jar:0.0.1-SNAPSHOT]
....

A simplified version of the script:

import hail as hl
import gnomad.utils.sparse_mt


tmp_dir = 'gs://bucket/'
vds_file = 'gs://neale-bge/bge-wave-1.vds'
out = 'gs://bucket/foo.vcf.bgz'

hl.init(default_reference = 'GRCh38',
        tmp_dir = tmp_dir)

vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
t = gnomad.utils.sparse_mt.default_compute_info(mt)
t = t.annotate(info=t.info.drop('AS_SB_TABLE'))
t = t.annotate(info = t.info.drop(
    'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')

hl.methods.export_vcf(dataset = t, output = out, tabix = True)

batch-7751958-2713-main.log

Version

0.2.120

Relevant log output

Traceback (most recent call last):
  File "/Users/rye/Projects/VQSR/formatting-VQSR-vcf.py", line 102, in <module>
    main(args)
  File "/Users/rye/Projects/VQSR/formatting-VQSR-vcf.py", line 66, in main
    hl.methods.export_vcf(dataset = t, output = args.out, tabix = False)
  File "<decorator-gen-1440>", line 2, in export_vcf
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/methods/impex.py", line 592, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 535, in execute
    return self._cancel_on_ctrl_c(self._async_execute(ir, timed=timed, **kwargs))
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 526, in _cancel_on_ctrl_c
    return async_to_blocking(coro)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hailtop/utils/utils.py", line 150, in async_to_blocking
    return loop.run_until_complete(task)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/Users/rye/opt/anaconda3/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/Users/rye/opt/anaconda3/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 555, in _async_execute
    _, resp, timings = await self._rpc(
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 487, in _rpc
    result_bytes = await retry_transient_errors(self._read_output, ir, iodir + '/out', iodir + '/in')
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hailtop/utils/utils.py", line 779, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hailtop/utils/utils.py", line 792, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 515, in _read_output
    raise reconstructed_error.maybe_user_error(ir)
hail.utils.java.FatalError: GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706

Java stack trace:
is.hail.relocated.com.google.cloud.storage.StorageException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706
	at is.hail.relocated.com.google.cloud.storage.StorageException.translate(StorageException.java:163)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:297)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:730)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.lambda$readAllBytes$24(StorageImpl.java:574)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:60)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.run(StorageImpl.java:1476)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:574)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:563)
	at is.hail.io.fs.GoogleStorageFS.$anonfun$readNoCompression$1(GoogleStorageFS.scala:288)
	at is.hail.services.package$.retryTransientErrors(package.scala:163)
	at is.hail.io.fs.GoogleStorageFS.readNoCompression(GoogleStorageFS.scala:286)
	at is.hail.io.fs.RouterFS.readNoCompression(RouterFS.scala:26)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:239)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:235)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:490)
	at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6523)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:726)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.lambda$readAllBytes$24(StorageImpl.java:574)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:60)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.run(StorageImpl.java:1476)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:574)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:563)
	at is.hail.io.fs.GoogleStorageFS.$anonfun$readNoCompression$1(GoogleStorageFS.scala:288)
	at is.hail.services.package$.retryTransientErrors(package.scala:163)
	at is.hail.io.fs.GoogleStorageFS.readNoCompression(GoogleStorageFS.scala:286)
	at is.hail.io.fs.RouterFS.readNoCompression(RouterFS.scala:26)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:239)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:235)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)




Hail version: 0.2.120-f00f916faf78
Error summary: GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706

danking commented Sep 27, 2023

I think I have a fix for this.


danking commented Sep 27, 2023

Another example: https://batch.hail.is/batches/7653388/jobs/3332

https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/exporting.20sites.20only.20VCF/near/376801844

print("make sites level ht")
from gnomad.utils.sparse_mt import default_compute_info

vds = hl.vds.read_vds('gs://schema_jsealock/combined/combined_non_bge_data.vds')
mt = vds.variant_data

# run default compute_info on non-ref sites
sites_only_ht = default_compute_info(mt, site_annotations=True)

# convert int64 to float
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(AS_SB_TABLE=sites_only_ht.info.AS_SB_TABLE.map(lambda x: hl.delimit(x, '|'))))
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(QUALapprox = hl.float64(sites_only_ht.info.QUALapprox)))
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(AS_QUALapprox = sites_only_ht.info.AS_QUALapprox.map(hl.float64)))

# save
hl.export_vcf(sites_only_ht, 'gs://schema_jsealock/combined/combine_non_bge_data_sites_only.vcf.bgz')

danking added a commit to danking/hail that referenced this issue Sep 27, 2023
CHANGELOG: Fix hail-is#13356 and fix hail-is#13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not
assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if
we encounter any error, we blow away the ByteBuffer and restart our read entirely.
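For illustration, here is a minimal sketch of that restart-from-scratch strategy, assuming the google-cloud-storage Java client's Storage.reader / ReadChannel API. The object and method names are hypothetical and this is not Hail's actual GoogleStorageFS code:

import com.google.cloud.storage.{BlobId, StorageOptions}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

object RestartingGcsRead {
  // Read an entire GCS object. On any failure, assume neither the ReadChannel nor the
  // ByteBuffer is in a usable state: discard both and re-read from byte 0.
  def readFully(bucket: String, name: String, maxAttempts: Int = 5): Array[Byte] = {
    val storage = StorageOptions.getDefaultInstance.getService
    var attempt = 0
    while (attempt < maxAttempts) {
      attempt += 1
      val reader = storage.reader(BlobId.of(bucket, name))
      try {
        val out = new ByteArrayOutputStream()
        val buf = ByteBuffer.allocate(64 * 1024)
        var n = reader.read(buf)
        while (n >= 0) {
          buf.flip()
          out.write(buf.array(), 0, buf.limit())
          buf.clear()
          n = reader.read(buf)
        }
        return out.toByteArray
      } catch {
        case _: Exception if attempt < maxAttempts =>
          // Fall through: reopen a fresh channel and restart the read entirely,
          // rather than trying to resume with the same channel or buffer.
      } finally {
        reader.close()
      }
    }
    throw new RuntimeException(s"read of gs://$bucket/$name failed after $maxAttempts attempts")
  }
}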

danking commented Sep 27, 2023

OK, I seem to have resolved this error, but now another transient error has dramatically increased
its frequency.

I included my test code which was reliably reproducing this error approximately once per run. I ran
this three times using a commit very similar to main [1]. All three runs failed:

  1. In run 1, three partitions had this error.
  2. In run 2, one partition had a different error (#13721, "[query] rare Google Cloud Storage error", to be exact).
  3. In run 3, two partitions had this error.

After my fix [2] for this issue's bug, the #13721 bug became super common! I saw it 50 times in my first run:

Caused by: is.hail.relocated.com.google.cloud.storage.StorageException: Missing Range header in response
	|> PUT https://storage.googleapis.com/upload/storage/v1/b/aou_tmp/o?name=tmp/hail/icullIwHC8dQXtq8JU2uDW/aggregate_intermediates/-ntpjdAQ9sKaR8lK26cV0p5790a4d87-9035-41ae-afc6-326f710d9a89&uploadType=resumable&upload_id=ADPycdtl5JSqwvftT4W190_-ueC032I_oZcwLAlVVMFkqp06W4eY8b-XMwf8DeT7If9I7uIgmI_PLCuFsExsT0aEh2b4FrHtAiUktumQbvgl1U0icw
	|> content-range: bytes */*
	|  
	|< HTTP/1.1 308 Resume Incomplete
	|< content-length: 0
	|< content-type: text/plain; charset=utf-8
	|< x-guploader-uploadid: ADPycdtl5JSqwvftT4W190_-ueC032I_oZcwLAlVVMFkqp06W4eY8b-XMwf8DeT7If9I7uIgmI_PLCuFsExsT0aEh2b4FrHtAiUktumQbvgl1U0icw
	|  

Luckily, that one is actually trivial to fix: we just need to update to the latest GCS client library.
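As a hedged sketch only (not Hail's actual build change): in an sbt build, upgrading the client is a one-line dependency bump. The version string below is a placeholder for whatever the current release is.

// build.sbt -- illustrative; substitute the current google-cloud-storage release
libraryDependencies += "com.google.cloud" % "google-cloud-storage" % "<latest-release>"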

Test Code

import hail as hl
import gnomad.utils.sparse_mt


tmp_dir = 'gs://danking/tmp/'
vds_file = 'gs://neale-bge/bge-wave-1.vds'
out = 'gs://danking/foo.vcf.bgz'

vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
t = gnomad.utils.sparse_mt.default_compute_info(mt)
t = t.annotate(info=t.info.drop('AS_SB_TABLE'))
t = t.annotate(info = t.info.drop(
    'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')

hl.methods.export_vcf(dataset = t, output = out, tabix = True)

Failing Batch (in my namespace)

https://internal.hail.is/dking/batch/batches/8?q=state%3Dbad

Footnotes

[1] I was using de009fdb89, which I pushed as fixes-sans-gcs-read-fix.
[2] 64c4c6e248, part of unblock-wenhan, specifically this commit.

danking added a commit that referenced this issue Sep 28, 2023
…13730)

CHANGELOG: Fix #13356 and fix #13409. In QoB pipelines with 10K or more
partitions, transient "Corrupted block detected" errors were common.
This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs
during read. We also do not assume that the ReadChannel "atomically", in
some sense, modifies the ByteBuffer. In particular, if we encounter any
error, we blow away the ByteBuffer and restart our read entirely.

As I described in a comment above on #13409, I have a 10K-partition pipeline which was reliably
producing this error but now reliably *does not* produce it (it produces another one,
#13721; a fix is forthcoming for that too).