[query] corrupted zstd block #13979

Closed
danking opened this issue Nov 6, 2023 · 10 comments · Fixed by #14080

@danking
Contributor

danking commented Nov 6, 2023

What happened?

https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/QoB.20Error.3A.20GoogleJsonResponseException.3A.20404.20Not.20Found/near/398355473

I was running hl.pca on the wheel you created for me (0.2.124-fcaafc533ec1), and there seems to be a transient error going on (https://batch.hail.is/batches/8069235?q=state+%3D+failed); I'm not sure whether this is the same as the previous ones. I just cancelled the job before the error summary appeared. Here is the code I am running:

        import hail as hl  # the get_*_path helpers, POPS, N_SAMPLES, MIN_CALL_RATE, and args are defined elsewhere in the user's pipeline

        vat_ht = hl.read_table(get_aou_util_path(name="vat"))
        vat_ht = vat_ht.collect_by_key()
        meta_ht = hl.read_table(get_sample_meta_path(annotation=True))
        meta_ht = meta_ht.filter(~meta_ht.related)
        pops = args.pops.split(",") if (args.pops is not None) else POPS
        for pop in pops:
            mt = get_filtered_mt(analysis_type='variant', filter_variants=True, filter_samples=False,
                                 adj_filter=True, pop=pop)
            variants_to_keep = vat_ht.filter(
                (vat_ht.locus.in_autosome()) &
                (hl.is_snp(vat_ht.alleles[0], vat_ht.alleles[1])) &
                (vat_ht['values'][f'gvs_{pop}_af'][0] >= 0.0001) &
                ((vat_ht.values[f"gvs_{pop}_an"][0] >= (N_SAMPLES[pop] * 2 * MIN_CALL_RATE[pop])))
            )
            print('Filtering Variants...')
            mt = mt.filter_rows(hl.is_defined(variants_to_keep[mt.row_key])) # filter to high quality variants
            print('Filtering Samples...')
            mt = mt.filter_cols(hl.is_defined(meta_ht[mt.col_key])) # filter to unrelated samples -> later to project
            print('Running PCA...')
            eigenvalues, scores, loadings = hl.pca(
                hl.int(hl.is_defined(mt.GT)),
                compute_loadings=True,
                k=50,
            )
            print('Writing tables...')
            eigenvalues.write(
                get_pca_ht_path(pop=pop, name='evals'),
                overwrite=args.overwrite,
            )
            scores.write(
                get_pca_ht_path(pop=pop, name='scores'),
                overwrite=args.overwrite,
            )
            loadings.write(
                get_pca_ht_path(pop=pop, name='loadings'),
                overwrite=args.overwrite,
            )

Version

0.2.126

Relevant log output

No response

@danking
Contributor Author

danking commented Nov 6, 2023

  1. Get a Python stack trace from Wenhan

@danking danking assigned jigold and danking and unassigned jigold Nov 6, 2023
@danking
Contributor Author

danking commented Nov 30, 2023

Summary

It has happened twice. The failing partition is different in each run.

  1. 49340 https://batch.hail.is/batches/8069235/jobs/51280
  2. 25997 https://batch.hail.is/batches/8083195/jobs/27937

The pipeline runs two table collects to get sample information, then converts the matrix table to a table of ndarrays of the value hl.int(hl.is_defined(mt.GT)).

The entry fields are being subsetted, so the decoder skips some fields while reading. In both cases, the corrupted block is discovered while decoding the entries array. In the first case, we are skipping an int (which must be RGQ, based on the EType and the virtual type). In the second case, we are decoding a string (which must be FT).

Since the error happens on a seemingly arbitrary partition, it seems likely this is related to our transient error handling. Both runs used a version of Hail from after we fixed the broken transient error handling in GoogleStorageFS (run 1 used fcaafc5, run 2 used 0.2.126 / ee77707).


Path forward

If it is a transient error, we need to fix how we handle transient errors. Maybe our position handling logic is wrong? If it is not a transient error, maybe our skipping logic is wrong? FT appears immediately after RGQ and we know RGQ is getting skipped.

Our implementation of seek for the compressed block buffers looks sketchy to me, but we're using PartitionNativeReader which does no seeking.

Action items:

  1. Log every transient error.
  2. Log the file name and the offset on failure (see the sketch below).
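
As a concrete illustration of action item 2, here is a minimal sketch of attaching the file name and offset to a failed block read. This is not Hail's actual ZstdInputBlockBuffer code; the names (path, blockOffset, readBlock) are hypothetical.

// Hypothetical sketch: wrap a block read so failures carry file/offset context.
def readBlockWithContext(path: String, blockOffset: Long)(readBlock: => Int): Int =
  try {
    readBlock
  } catch {
    case e: Exception =>
      throw new RuntimeException(s"block read failed: file=$path offset=$blockOffset", e)
  }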

Debugging information

EType:

+EBaseStruct{
  `the entries! [877f12a8827e18f61222c6c8c5fb04a8]`:+EArray[
    EBaseStruct{
      GT:EInt32,
      GQ:EInt32,
      RGQ:EInt32,
      FT:EBinary,
      AD:EArray[EInt32]}]}

(zipped) Type:

Struct{
  locus:Locus(GRCh38),
  alleles:Array[String],
  filters:Set[String],
  info:Struct{
    AC:Array[Int32]},
  `the entries! [877f12a8827e18f61222c6c8c5fb04a8]`:Array[
    Struct{
      GT:Call,
      GQ:Int32,
      FT:String,
      AD:Array[Int32]}]}

Source buffer spec:

{"name":"LEB128BufferSpec","child":
  {"name":"BlockingBufferSpec","blockSize":65536,"child":
    {"name":"ZstdBlockBufferSpec","blockSize":65536,"child":
      {"name":"StreamBlockBufferSpec"}}}}

Error for run 1.

Caused by: com.github.luben.zstd.ZstdException: Corrupted block detected
	at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
	at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:655) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:385) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.readByte(InputBuffers.scala:403) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.LEB128InputBuffer.readByte(InputBuffers.scala:220) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.LEB128InputBuffer.skipInt(InputBuffers.scala:260) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at __C796collect_distributed_array_table_native_writer.__m872SKIP_o_int32(Unknown Source) ~[?:?]
	at __C796collect_distributed_array_table_native_writer.__m869INPLACE_DECODE_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C796collect_distributed_array_table_native_writer.__m868INPLACE_DECODE_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_r_array_of_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C796collect_distributed_array_table_native_writer.__m867DECODE_r_struct_of_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32ENDEND_TO_SBaseStructPointer(Unknown Source) ~[?:?]

Error for run 2

Caused by: com.github.luben.zstd.ZstdException: Corrupted block detected
	at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
	at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:655) ~[gs:__hail-query-ger0g_jars_ee77707f4fab22b1c253321b082a70aff3f44d1c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.readBytes(InputBuffers.scala:444) ~[gs:__hail-query-ger0g_jars_ee77707f4fab22b1c253321b082a70aff3f44d1c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.LEB128InputBuffer.readBytes(InputBuffers.scala:253) ~[gs:__hail-query-ger0g_jars_ee77707f4fab22b1c253321b082a70aff3f44d1c.jar.jar:0.0.1-SNAPSHOT]
	at __C816collect_distributed_array_table_native_writer.__m893INPLACE_DECODE_o_binary_TO_o_string(Unknown Source) ~[?:?]
	at __C816collect_distributed_array_table_native_writer.__m889INPLACE_DECODE_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C816collect_distributed_array_table_native_writer.__m888INPLACE_DECODE_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_r_array_of_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C816collect_distributed_array_table_native_writer.__m887DECODE_r_struct_of_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32ENDEND_TO_SBaseStructPointer(Unknown Source) ~[?:?]

@danking
Contributor Author

danking commented Nov 30, 2023

I've asked Wenhan to run this pipeline with a JAR that has extra debugging information enabled: main...danking:hail:debug-13979.

@danking
Contributor Author

danking commented Nov 30, 2023

OK, here's the most recent failure https://batch.hail.is/batches/8090848/jobs/21993

Don't be duped by my bad log message! There were zero transient errors. I added a log statement that increments the number of errors and prints that message after every error, even if it's not transient.

This time it was partition 20053 (we keep moving earlier?). I forgot to catch and rethrow the error with the toString of the input buffer, but I'm not sure there is much to learn from that anyway.

FWIW, 20053 was successful in the two previous executions:

  1. https://batch.hail.is/batches/8069235/jobs/21993
  2. https://batch.hail.is/batches/8083195/jobs/21993

Interestingly the peak bytes are not consistent:

2023-10-24 19:59:47.756 : INFO: TaskReport: stage=0, partition=20053, attempt=0, peakBytes=58394624, peakBytesReadable=55.69 MiB, chunks requested=5513, cache hits=5501
2023-10-24 19:59:47.759 : INFO: RegionPool: FREE: 55.7M allocated (7.7M blocks / 48.0M chunks), regions.size = 21, 0 current java objects, thread 9: pool-2-thread-1
2023-11-08 19:42:40.000 : INFO: TaskReport: stage=0, partition=20053, attempt=0, peakBytes=61343744, peakBytesReadable=58.50 MiB, chunks requested=5513, cache hits=5501
2023-11-08 19:42:40.000 : INFO: RegionPool: FREE: 58.5M allocated (10.5M blocks / 48.0M chunks), regions.size = 21, 0 current java objects, thread 10: pool-2-thread-2

Whatever is causing this bug is rare: approximately once every 31,000 partitions.

The CDA IR is the same except for a couple of iruid names; the order of the aggregators in the aggregator array is also swapped (collect & take vs. take & collect).

AFAICT, the GCS Java library doesn't do any streaming verification of the hash. We could compute the CRC32c in a streaming manner and fail if/when we get to the end of the object, but this wouldn't work when we read intervals. I'm really mystified.
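
To make the streaming-verification idea concrete, here is a sketch using java.util.zip.CRC32C, assuming the object's expected checksum can be obtained from its metadata (the expectedCrc32c parameter is an assumption). As noted above, this only helps for full-object reads, not interval reads.

import java.io.InputStream
import java.util.zip.CRC32C

// Sketch: checksum the stream as we consume it and fail at end-of-object on mismatch.
def readAndVerify(in: InputStream, expectedCrc32c: Long)(consume: (Array[Byte], Int) => Unit): Unit = {
  val crc = new CRC32C()
  val buf = new Array[Byte](64 * 1024)
  var n = in.read(buf)
  while (n != -1) {
    crc.update(buf, 0, n)
    consume(buf, n)
    n = in.read(buf)
  }
  if (crc.getValue != expectedCrc32c)
    throw new java.io.IOException(s"CRC32C mismatch: expected $expectedCrc32c, got ${crc.getValue}")
}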

@danking
Contributor Author

danking commented Dec 6, 2023

Now five times: partitions 41869, 45088, 46688, 47294, and 59799.

@danking
Contributor Author

danking commented Dec 6, 2023

43809 execute(...)_stage2_table_native_writer_job41869 Failed 16s 647ms $0.0001
47028 execute(...)_stage2_table_native_writer_job45088 Failed 21s 132ms $0.0002
48628 execute(...)_stage2_table_native_writer_job46688 Failed 15s 179ms $0.0001
49234 execute(...)_stage2_table_native_writer_job47294 Failed 17s 6ms $0.0001
61739 execute(...)_stage2_table_native_writer_job59799 Failed 28s 941ms $0.0002

@danking
Contributor Author

danking commented Dec 6, 2023

We tried updating to zstd-jni 1.5.5-11 from 1.5.5-2. There were four failures:

6873 execute(...)_stage2_table_native_writer_job4933 Failed 13s 631ms $0.0001
7157 execute(...)_stage2_table_native_writer_job5217 Failed 15s 919ms $0.0001
8854 execute(...)_stage2_table_native_writer_job6914 Failed 1 minute 12s $0.0006
12795 execute(...)_stage2_table_native_writer_job10855 Failed 21s 305ms $0.0002

@danking
Contributor Author

danking commented Dec 6, 2023

In the latter two cases, the error does not come from zstd decompression. It comes later, during region allocation and when calling isHet on a Call with ploidy 3.

When zstd does notice a decompression issue, it's always immediately after a read. In this case, immediately after a read of the entries data, but in the past we've seen reads of other MTs/HTs. Note that the entries are the bulk of the bytes, so if there's something that's rare in terms of bytes processed, we're just much more likely to see it in the entries.

@danking
Contributor Author

danking commented Dec 6, 2023

Let's compare 8093951-8854 to 8093977-8854. The latter is a failed task (partition 6914); the former is successful. We'll download the logs and toss away some debug info that changed between the experiments:

cat log | rg StreamBlockInputBuffer: | sed 's/bytes.*//'  > newlog

Since the latter failed, the log obviously ends earlier, but there are no differences (besides timestamps) in the sizes of the blocks read from GCS. Since these block sizes are read from the input stream, this is pretty good evidence that the bytes are not corrupted up to that point.

# git diff --no-index --word-diff good bad 
...
2023-12-06 [-19:47:11.500-]{+21:39:00.885+} StreamBlockInputBuffer: INFO: reading 2081[-2023-12-06 19:47:11.531 StreamBlockInputBuffer: INFO: reading 2499-]

The decompressed data size is the same: 65536. It's worth noting this is a relatively small compressed buffer after a series of much larger compressed buffers: this one is 2081 bytes and the immediately previous one is 14675 bytes. Most of the ones before it are also in the 14k range.


The same experiment on job 7157 again shows no differences in the bytes read before the exception occurs.

2023-12-06 [-19:45:18.693-]{+21:36:52.116+} StreamBlockInputBuffer: INFO: reading 17923 
2023-12-06 [-19:45:18.809-]{+21:36:52.388+} StreamBlockInputBuffer: INFO: reading 17843[-2023-12-06 19:45:18.810 StreamBlockInputBuffer: INFO: reading 17657-]
[-2023-12-06 19:45:18.811 StreamBlockInputBuffer: INFO: reading 17646-]

The network reads are identical other than the size of the first read. That first read is the serialized function, so I'm not that surprised it differs in size between different commits of Hail. The byte counting is done in our code; if we're counting bytes correctly, then it seems like we're reading the same series of chunks from GCS.

GoogleStorageFS$: INFO: read 1755052 (0 of 1755052) oldbb(0, 8388608) newbb(0, 1755052)
GoogleStorageFS$: INFO: read 8388608 (62604 of 58870664) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 82569 (0 of 82569) oldbb(0, 8388608) newbb(0, 82569)
GoogleStorageFS$: INFO: read 8388608 (0 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (0 of 17833031) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (8388608 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 1753672 (0 of 1753672) oldbb(0, 8388608) newbb(0, 1753672)
GoogleStorageFS$: INFO: read 8388608 (62604 of 58870664) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 82569 (0 of 82569) oldbb(0, 8388608) newbb(0, 82569)
GoogleStorageFS$: INFO: read 8388608 (0 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (0 of 17833031) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (8388608 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
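
For context, the byte counting mentioned above is the kind of thing a thin wrapper around the read channel can do. A hypothetical sketch, not GoogleStorageFS's actual code:

import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Hypothetical sketch: count and log bytes as they come off the underlying channel.
final class CountingChannel(underlying: ReadableByteChannel) extends ReadableByteChannel {
  private var total: Long = 0L

  override def read(dst: ByteBuffer): Int = {
    val n = underlying.read(dst)
    if (n > 0) {
      total += n
      println(s"read $n bytes ($total total)")
    }
    n
  }

  def bytesRead: Long = total
  override def isOpen: Boolean = underlying.isOpen
  override def close(): Unit = underlying.close()
}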

@danking
Contributor Author

danking commented Dec 6, 2023

I'm really starting to grasp at straws here. Hail's code has been around the block several times, so I have some confidence in it. What's new here? Well, this is Query-on-Batch, meaning we're using the Google Cloud Storage Java API directly rather than whatever Hadoop does. A recently reported data-corruption bug was fixed in 2.30.0: googleapis/java-storage#2301.

It wouldn't be the first time we found bugs in the Google Cloud Storage Java API. #13937

danking pushed a commit to danking/hail that referenced this issue Dec 7, 2023
CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected".

This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage Java library
has a bug, present at least since 2.29.0, in which simply incorrect data was
returned: googleapis/java-storage#2301. The issue seems related to their
use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:

1. If there's no channel, open a new channel with the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the `position` hasn't been updated. When
we recreate the channel we will jump to the wrong position and re-read some data. Lucky for us,
between Zstd and our assertions, this usually crashes the program instead of silently returning bad
data.

This is the third bug we have found in Google's cloud storage java library. The previous two:

1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check if updating
google-cloud-storage fixes the problem.
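
For illustration only, here is a sketch of the failure mode in steps 1-4. This is not the java-storage implementation; the names and structure are hypothetical. It just shows how bytes already handed to the caller get re-read when a retry reopens the channel at a stale position.

import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Hypothetical sketch of the bug pattern described above, not the real library code.
def readWithRetry(open: Long => ReadableByteChannel, dst: ByteBuffer): Unit = {
  var position = 0L          // only advanced after a fully successful pass
  var done = false
  while (!done) {
    val ch = open(position)  // (1) open a channel at the recorded position
    try {
      val tmp = ByteBuffer.allocate(64 * 1024)
      ch.read(tmp)           // (2) some bytes land in an intermediate buffer...
      tmp.flip()
      dst.put(tmp)           //     ...and are handed to the caller
      ch.read(ByteBuffer.allocate(64 * 1024)) // (3) a retryable exception is thrown here
      position += dst.position()              // position is only updated on success
      done = true
    } catch {
      case _: java.io.IOException =>
        // (4) retry from the top: `position` was never advanced by the bytes
        // already copied into `dst`, so the next pass re-reads and re-delivers them.
        ()
    } finally ch.close()
  }
}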
danking added a commit that referenced this issue Dec 7, 2023
danking pushed a commit to danking/hail that referenced this issue Dec 16, 2023