[query][qob] Zstandard corrupted block detected when reading VDS out of GCS #13409

Closed
danking opened this issue Aug 10, 2023 · 3 comments · Fixed by #13730


danking commented Aug 10, 2023

What happened?

Notify these threads on completion:

Using QoB to read out of GCS, we encounter corrupted blocks in this simple pipeline.

Caused by: com.github.luben.zstd.ZstdException: Corrupted block detected
    at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
    at com.github.luben.zstd.Zstd.decompressByteArray(Zstd.java:409) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
    at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:649) ~[gs:__hail-query-ger0g_jars_f00f916faf783b89cc2fc00bfc3e39df5485d8b0.jar.jar:0.0.1-SNAPSHOT]
    at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:384) ~[gs:__hail-query-ger0g_jars_f00f916faf783b89cc2fc00bfc3e39df5485d8b0.jar.jar:0.0.1-SNAPSHOT]
    at is.hail.io.BlockingInputBuffer.readByte(InputBuffers.scala:402) ~[gs:__hail-query-ger0g_jars_f00f916faf783b89cc2fc00bfc3e39df5485d8b0.jar.jar:0.0.1-SNAPSHOT]
....

A simplified version of the script:

import hail as hl
import gnomad.utils.sparse_mt


tmp_dir = 'gs://bucket/'
vds_file = 'gs://neale-bge/bge-wave-1.vds'
out = 'gs://bucket/foo.vcf.bgz'

hl.init(default_reference = 'GRCh38',
        tmp_dir = tmp_dir)

vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
t = gnomad.utils.sparse_mt.default_compute_info(mt)
t = t.annotate(info=t.info.drop('AS_SB_TABLE'))
t = t.annotate(info = t.info.drop(
    'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')

hl.methods.export_vcf(dataset = t, output = out, tabix = True)

batch-7751958-2713-main.log

Version

0.2.120

Relevant log output

Traceback (most recent call last):
  File "/Users/rye/Projects/VQSR/formatting-VQSR-vcf.py", line 102, in <module>
    main(args)
  File "/Users/rye/Projects/VQSR/formatting-VQSR-vcf.py", line 66, in main
    hl.methods.export_vcf(dataset = t, output = args.out, tabix = False)
  File "<decorator-gen-1440>", line 2, in export_vcf
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/methods/impex.py", line 592, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 535, in execute
    return self._cancel_on_ctrl_c(self._async_execute(ir, timed=timed, **kwargs))
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 526, in _cancel_on_ctrl_c
    return async_to_blocking(coro)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hailtop/utils/utils.py", line 150, in async_to_blocking
    return loop.run_until_complete(task)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/Users/rye/opt/anaconda3/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/Users/rye/opt/anaconda3/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 555, in _async_execute
    _, resp, timings = await self._rpc(
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 487, in _rpc
    result_bytes = await retry_transient_errors(self._read_output, ir, iodir + '/out', iodir + '/in')
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hailtop/utils/utils.py", line 779, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hailtop/utils/utils.py", line 792, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/Users/rye/opt/anaconda3/lib/python3.9/site-packages/hail/backend/service_backend.py", line 515, in _read_output
    raise reconstructed_error.maybe_user_error(ir)
hail.utils.java.FatalError: GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706

Java stack trace:
is.hail.relocated.com.google.cloud.storage.StorageException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706
	at is.hail.relocated.com.google.cloud.storage.StorageException.translate(StorageException.java:163)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:297)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:730)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.lambda$readAllBytes$24(StorageImpl.java:574)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:60)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.run(StorageImpl.java:1476)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:574)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:563)
	at is.hail.io.fs.GoogleStorageFS.$anonfun$readNoCompression$1(GoogleStorageFS.scala:288)
	at is.hail.services.package$.retryTransientErrors(package.scala:163)
	at is.hail.io.fs.GoogleStorageFS.readNoCompression(GoogleStorageFS.scala:286)
	at is.hail.io.fs.RouterFS.readNoCompression(RouterFS.scala:26)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:239)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:235)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:490)
	at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6523)
	at is.hail.relocated.com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:726)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.lambda$readAllBytes$24(StorageImpl.java:574)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:60)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.run(StorageImpl.java:1476)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:574)
	at is.hail.relocated.com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:563)
	at is.hail.io.fs.GoogleStorageFS.$anonfun$readNoCompression$1(GoogleStorageFS.scala:288)
	at is.hail.services.package$.retryTransientErrors(package.scala:163)
	at is.hail.io.fs.GoogleStorageFS.readNoCompression(GoogleStorageFS.scala:286)
	at is.hail.io.fs.RouterFS.readNoCompression(RouterFS.scala:26)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:239)
	at is.hail.backend.service.ServiceBackend$$anon$4.call(ServiceBackend.scala:235)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)




Hail version: 0.2.120-f00f916faf78
Error summary: GoogleJsonResponseException: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/wes-bipolar-tmp-4day/o/bge-wave-1-VQSR%2FparallelizeAndComputeWithIndex%2FgCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=%2Fresult.2706?alt=media
No such object: wes-bipolar-tmp-4day/bge-wave-1-VQSR/parallelizeAndComputeWithIndex/gCyfD7XOt_MQrrCGc4Q-RrrWPb3cTAbhhcV28BCntiU=/result.2706

danking commented Sep 27, 2023

I think I have a fix for this.


danking commented Sep 27, 2023

Another example: https://batch.hail.is/batches/7653388/jobs/3332

https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/exporting.20sites.20only.20VCF/near/376801844

print("make sites level ht")
from gnomad.utils.sparse_mt import default_compute_info

vds = hl.vds.read_vds('gs://schema_jsealock/combined/combined_non_bge_data.vds')
mt = vds.variant_data

# run default compute_info on non-ref sites
sites_only_ht = default_compute_info(mt, site_annotations=True)

# convert int64 to float
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(AS_SB_TABLE=sites_only_ht.info.AS_SB_TABLE.map(lambda x: hl.delimit(x, '|'))))
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(QUALapprox = hl.float64(sites_only_ht.info.QUALapprox)))
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(AS_QUALapprox = sites_only_ht.info.AS_QUALapprox.map(hl.float64)))

# save
hl.export_vcf(sites_only_ht, 'gs://schema_jsealock/combined/combine_non_bge_data_sites_only.vcf.bgz')

danking added a commit to danking/hail that referenced this issue Sep 27, 2023
CHANGELOG: Fix hail-is#13356 and fix hail-is#13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not
assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if
we encounter any error, we blow away the ByteBuffer and restart our read entirely.
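For illustration, here is a minimal sketch of that restart-from-scratch strategy, assuming the google-cloud-storage Java client's Storage.reader / ReadChannel API. The object and method names are hypothetical and this is not Hail's actual GoogleStorageFS code:

import com.google.cloud.storage.{BlobId, StorageOptions}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

object RestartingGcsRead {
  // Read an entire GCS object. On any failure, assume neither the ReadChannel nor the
  // ByteBuffer is in a usable state: discard both and re-read from byte 0.
  def readFully(bucket: String, name: String, maxAttempts: Int = 5): Array[Byte] = {
    val storage = StorageOptions.getDefaultInstance.getService
    var attempt = 0
    while (attempt < maxAttempts) {
      attempt += 1
      val reader = storage.reader(BlobId.of(bucket, name))
      try {
        val out = new ByteArrayOutputStream()
        val buf = ByteBuffer.allocate(64 * 1024)
        var n = reader.read(buf)
        while (n >= 0) {
          buf.flip()
          out.write(buf.array(), 0, buf.limit())
          buf.clear()
          n = reader.read(buf)
        }
        return out.toByteArray
      } catch {
        case _: Exception if attempt < maxAttempts =>
          // Fall through: reopen a fresh channel and restart the read entirely,
          // rather than trying to resume with the same channel or buffer.
      } finally {
        reader.close()
      }
    }
    throw new RuntimeException(s"read of gs://$bucket/$name failed after $maxAttempts attempts")
  }
}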

danking commented Sep 27, 2023

OK, I seem to have resolved this error, but now another transient error has dramatically increased
its frequency.

I included my test code which was reliably reproducing this error approximately once per run. I ran
this three times using a commit very similar to main [1]. All three runs failed:

  1. In run 1, three partitions had this error.
  2. In run 2, one partition had a different error (#13721, "[query] rare Google Cloud Storage error", to be exact).
  3. In run 3, two partitions had this error.

After my fix [2] for this issue's bug, the #13721 bug became super common! I saw it 50 times in my first run:

Caused by: is.hail.relocated.com.google.cloud.storage.StorageException: Missing Range header in response
	|> PUT https://storage.googleapis.com/upload/storage/v1/b/aou_tmp/o?name=tmp/hail/icullIwHC8dQXtq8JU2uDW/aggregate_intermediates/-ntpjdAQ9sKaR8lK26cV0p5790a4d87-9035-41ae-afc6-326f710d9a89&uploadType=resumable&upload_id=ADPycdtl5JSqwvftT4W190_-ueC032I_oZcwLAlVVMFkqp06W4eY8b-XMwf8DeT7If9I7uIgmI_PLCuFsExsT0aEh2b4FrHtAiUktumQbvgl1U0icw
	|> content-range: bytes */*
	|  
	|< HTTP/1.1 308 Resume Incomplete
	|< content-length: 0
	|< content-type: text/plain; charset=utf-8
	|< x-guploader-uploadid: ADPycdtl5JSqwvftT4W190_-ueC032I_oZcwLAlVVMFkqp06W4eY8b-XMwf8DeT7If9I7uIgmI_PLCuFsExsT0aEh2b4FrHtAiUktumQbvgl1U0icw
	|  

Luckily, that one is actually trivial to fix: we just need to update to the latest GCS client library.
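As a hedged sketch only (not Hail's actual build change): in an sbt build, upgrading the client is a one-line dependency bump. The version string below is a placeholder for whatever the current release is.

// build.sbt -- illustrative; substitute the current google-cloud-storage release
libraryDependencies += "com.google.cloud" % "google-cloud-storage" % "<latest-release>"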

Test Code

import hail as hl
import gnomad.utils.sparse_mt


tmp_dir = 'gs://danking/tmp/'
vds_file = 'gs://neale-bge/bge-wave-1.vds'
out = 'gs://danking/foo.vcf.bgz'

vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
t = gnomad.utils.sparse_mt.default_compute_info(mt)
t = t.annotate(info=t.info.drop('AS_SB_TABLE'))
t = t.annotate(info = t.info.drop(
    'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')

hl.methods.export_vcf(dataset = t, output = out, tabix = True)

Failing Batch (in my namespace)

https://internal.hail.is/dking/batch/batches/8?q=state%3Dbad

Footnotes

[1] I was using de009fdb89, which I pushed as fixes-sans-gcs-read-fix.
[2] 64c4c6e248, part of unblock-wenhan, specifically this commit.

danking added a commit that referenced this issue Sep 28, 2023
…13730)

CHANGELOG: Fix #13356 and fix #13409. In QoB pipelines with 10K or more
partitions, transient "Corrupted block detected" errors were common.
This was caused by incorrect retry logic. That logic has been fixed.

I now assume we cannot reuse a ReadChannel after any exception occurs
during read. We also do not assume that the ReadChannel "atomically", in
some sense, modifies the ByteBuffer. In particular, if we encounter any
error, we blow away the ByteBuffer and restart our read entirely.

As I described in a comment above on #13409, I have a 10K-partition pipeline which was reliably
producing this error but now reliably *does not* produce it (it produces another one,
#13721; a fix is forthcoming for that too).