[query][qob] Zstandard corrupted block detected when reading VDS out of GCS #13409
Comments
I think I have a fix for this.
Another example: https://batch.hail.is/batches/7653388/jobs/3332
import hail as hl
from gnomad.utils.sparse_mt import default_compute_info
print("make sites level ht")
vds = hl.vds.read_vds('gs://schema_jsealock/combined/combined_non_bge_data.vds')
mt = vds.variant_data
# run default compute_info on non-ref sites
sites_only_ht = default_compute_info(mt, site_annotations=True)
# join each AS_SB_TABLE entry into a '|'-delimited string
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(AS_SB_TABLE=sites_only_ht.info.AS_SB_TABLE.map(lambda x: hl.delimit(x, '|'))))
# convert int64 to float
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(QUALapprox = hl.float64(sites_only_ht.info.QUALapprox)))
sites_only_ht = sites_only_ht.annotate(info=sites_only_ht.info.annotate(AS_QUALapprox = sites_only_ht.info.AS_QUALapprox.map(hl.float64)))
# save
hl.export_vcf(sites_only_ht, 'gs://schema_jsealock/combined/combine_non_bge_data_sites_only.vcf.bgz')
CHANGELOG: Fix hail-is#13356 and fix hail-is#13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed. I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely.
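The retry rule described in this changelog entry can be illustrated with a small sketch. This is not Hail's actual code (the changelog refers to the GCS client's ReadChannel and a ByteBuffer); it is a Python analogue to match the snippets in this thread. The names `read_block`, `open_channel`, `FlakyChannel`, and `TransientReadError` are hypothetical and exist only for illustration. The point it shows: after any read error, both the channel and the partially filled buffer are thrown away, and the read restarts from the original offset.

```python
import io
from typing import BinaryIO, Callable


class TransientReadError(IOError):
    """Hypothetical stand-in for a retryable storage error."""


def read_block(open_channel: Callable[[int], BinaryIO], offset: int,
               length: int, max_attempts: int = 5) -> bytes:
    """Read `length` bytes starting at `offset`, restarting from scratch on error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        buf = bytearray()               # fresh buffer: never trust partial contents
        channel = open_channel(offset)  # fresh channel: never reuse one that failed
        try:
            while len(buf) < length:
                chunk = channel.read(length - len(buf))
                if not chunk:
                    raise EOFError("stream ended before the block was complete")
                buf.extend(chunk)
            return bytes(buf)
        except TransientReadError as err:
            last_error = err            # buf and channel are both discarded
        finally:
            channel.close()
    raise last_error


class FlakyChannel:
    """Test double: delivers `fail_after` bytes, then raises (None = never fails)."""

    def __init__(self, data: bytes, offset: int, fail_after):
        self._stream = io.BytesIO(data)
        self._stream.seek(offset)
        self._fail_after = fail_after
        self._delivered = 0

    def read(self, n: int) -> bytes:
        if self._fail_after is not None:
            if self._delivered >= self._fail_after:
                raise TransientReadError("simulated transient failure")
            n = min(n, self._fail_after - self._delivered)
        chunk = self._stream.read(n)
        self._delivered += len(chunk)
        return chunk

    def close(self) -> None:
        self._stream.close()


DATA = bytes(range(16))
attempts = []


def open_channel(offset: int) -> FlakyChannel:
    # The first channel fails after 3 bytes; later channels are healthy.
    attempts.append(offset)
    return FlakyChannel(DATA, offset, 3 if len(attempts) == 1 else None)


result = read_block(open_channel, offset=4, length=8)
assert result == DATA[4:12]
```

In the demo, the first channel delivers three bytes and then fails. Those three bytes are dropped rather than kept, and the second attempt reopens at the same offset, so the returned block never mixes bytes from a failed read with bytes from a fresh one.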
OK, I seem to have resolved this error, but now another transient error has dramatically increased in frequency. I included my test code, which was reliably reproducing this error approximately once per run. I ran
After my fix [2] for this issue's bug, the #13721 bug became super common! I saw it 50 times in my first run:
Luckily, that one is actually trivial to fix: we just need to update to the latest GCS client.
Test Code
import hail as hl
import gnomad.utils.sparse_mt
tmp_dir = 'gs://danking/tmp/'
vds_file = 'gs://neale-bge/bge-wave-1.vds'
out = 'gs://danking/foo.vcf.bgz'
vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
t = gnomad.utils.sparse_mt.default_compute_info(mt)
t = t.annotate(info=t.info.drop('AS_SB_TABLE'))
t = t.annotate(info = t.info.drop(
'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')
hl.methods.export_vcf(dataset = t, output = out, tabix = True)
Failing Batch (in my namespace)
https://internal.hail.is/dking/batch/batches/8?q=state%3Dbad
Footnotes
[1] I was using
…13730) CHANGELOG: Fix #13356 and fix #13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed. I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely. As I described in [this comment to #13409](#13409 (comment)), I have a 10K partition pipeline which was reliably producing this error but now reliably *does not* produce this error (it produces another one, #13721, fix forthcoming for that too).
What happened?
Notify these threads on completion:
Using QoB, reading out of GCS, we encounter corrupted blocks on this simple pipeline.
A simplified version of the script:
batch-7751958-2713-main.log
Version
0.2.120
Relevant log output