[query] corrupted zstd block #13979

Closed
danking opened this issue Nov 6, 2023 · 10 comments · Fixed by #14080

@danking
Contributor

danking commented Nov 6, 2023

What happened?

https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/QoB.20Error.3A.20GoogleJsonResponseException.3A.20404.20Not.20Found/near/398355473

I was running hl.pca on the wheel you created for me (0.2.124-fcaafc533ec1), and there seems to be a transient error going on (https://batch.hail.is/batches/8069235?q=state+%3D+failed); I'm not sure whether this is the same as the previous ones. I just cancelled the job before the error summary appeared. Here is the code I am running:

        import hail as hl  # the get_*_path helpers, POPS, N_SAMPLES, MIN_CALL_RATE, and args are defined elsewhere in the user's pipeline

        vat_ht = hl.read_table(get_aou_util_path(name="vat"))
        vat_ht = vat_ht.collect_by_key()
        meta_ht = hl.read_table(get_sample_meta_path(annotation=True))
        meta_ht = meta_ht.filter(~meta_ht.related)
        pops = args.pops.split(",") if (args.pops is not None) else POPS
        for pop in pops:
            mt = get_filtered_mt(analysis_type='variant', filter_variants=True, filter_samples=False,
                                 adj_filter=True, pop=pop)
            variants_to_keep = vat_ht.filter(
                (vat_ht.locus.in_autosome()) &
                (hl.is_snp(vat_ht.alleles[0], vat_ht.alleles[1])) &
                (vat_ht['values'][f'gvs_{pop}_af'][0] >= 0.0001) &
                ((vat_ht.values[f"gvs_{pop}_an"][0] >= (N_SAMPLES[pop] * 2 * MIN_CALL_RATE[pop])))
            )
            print('Filtering Variants...')
            mt = mt.filter_rows(hl.is_defined(variants_to_keep[mt.row_key])) # filter to high quality variants
            print('Filtering Samples...')
            mt = mt.filter_cols(hl.is_defined(meta_ht[mt.col_key])) # filter to unrelated samples -> later to project
            print('Running PCA...')
            eigenvalues, scores, loadings = hl.pca(
                hl.int(hl.is_defined(mt.GT)),
                compute_loadings=True,
                k=50,
            )
            print('Writing tables...')
            eigenvalues.write(
                get_pca_ht_path(pop=pop, name='evals'),
                overwrite=args.overwrite,
            )
            scores.write(
                get_pca_ht_path(pop=pop, name='scores'),
                overwrite=args.overwrite,
            )
            loadings.write(
                get_pca_ht_path(pop=pop, name='loadings'),
                overwrite=args.overwrite,
            )

Version

0.2.126

Relevant log output

No response

@danking
Contributor Author

danking commented Nov 6, 2023

  1. Get a Python stack trace from Wenhan

@danking danking assigned jigold and danking and unassigned jigold Nov 6, 2023
@danking
Contributor Author

danking commented Nov 30, 2023

Summary

It has happened twice. The failing partition is different in each run.

  1. 49340 https://batch.hail.is/batches/8069235/jobs/51280
  2. 25997 https://batch.hail.is/batches/8083195/jobs/27937

The pipeline runs two table collects to get sample information, then converts the matrix table to a table of ndarrays of the value hl.int(hl.is_defined(mt.GT)).

The entry fields are being subsetted, so the decoder skips some fields while reading. In both cases, the corrupted block is discovered while decoding the entries array. In the first case, we are skipping an int (which must be RGQ, based on the EType and the virtual type). In the second case, we are decoding a string (which must be FT).

Since the error happens on a seemingly arbitrary partition, it seems likely this is related to our transient error handling. Both runs used a version of Hail from after we fixed the broken transient error handling in GoogleStorageFS (run 1 used fcaafc5, run 2 used 0.2.126 / ee77707).


Path forward

If it is a transient error, we need to fix how we handle transient errors. Maybe our position handling logic is wrong? If it is not a transient error, maybe our skipping logic is wrong? FT appears immediately after RGQ and we know RGQ is getting skipped.

Our implementation of seek for the compressed block buffers looks sketchy to me, but we're using PartitionNativeReader which does no seeking.

Action items:

  1. Log every transient error.
  2. Log the file name and the offset on failure (see the sketch below).
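
As a concrete illustration of action item 2, here is a minimal sketch of attaching the file name and offset to a failed block read. This is not Hail's actual ZstdInputBlockBuffer code; the names (path, blockOffset, readBlock) are hypothetical.

// Hypothetical sketch: wrap a block read so failures carry file/offset context.
def readBlockWithContext(path: String, blockOffset: Long)(readBlock: => Int): Int =
  try {
    readBlock
  } catch {
    case e: Exception =>
      throw new RuntimeException(s"block read failed: file=$path offset=$blockOffset", e)
  }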

Debugging information

EType:

+EBaseStruct{
  `the entries! [877f12a8827e18f61222c6c8c5fb04a8]`:+EArray[
    EBaseStruct{
      GT:EInt32,
      GQ:EInt32,
      RGQ:EInt32,
      FT:EBinary,
      AD:EArray[EInt32]}]}

(zipped) Type:

Struct{
  locus:Locus(GRCh38),
  alleles:Array[String],
  filters:Set[String],
  info:Struct{
    AC:Array[Int32]},
  `the entries! [877f12a8827e18f61222c6c8c5fb04a8]`:Array[
    Struct{
      GT:Call,
      GQ:Int32,
      FT:String,
      AD:Array[Int32]}]}

Source buffer spec:

{"name":"LEB128BufferSpec","child":
  {"name":"BlockingBufferSpec","blockSize":65536,"child":
    {"name":"ZstdBlockBufferSpec","blockSize":65536,"child":
      {"name":"StreamBlockBufferSpec"}}}}

Error for run 1.

Caused by: com.github.luben.zstd.ZstdException: Corrupted block detected
	at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
	at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:655) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.ensure(InputBuffers.scala:385) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.readByte(InputBuffers.scala:403) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.LEB128InputBuffer.readByte(InputBuffers.scala:220) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.LEB128InputBuffer.skipInt(InputBuffers.scala:260) ~[gs:__hail-query-ger0g_jars_fcaafc533ec130ae210b152afa81c0b5ac04592b.jar.jar:0.0.1-SNAPSHOT]
	at __C796collect_distributed_array_table_native_writer.__m872SKIP_o_int32(Unknown Source) ~[?:?]
	at __C796collect_distributed_array_table_native_writer.__m869INPLACE_DECODE_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C796collect_distributed_array_table_native_writer.__m868INPLACE_DECODE_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_r_array_of_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C796collect_distributed_array_table_native_writer.__m867DECODE_r_struct_of_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32ENDEND_TO_SBaseStructPointer(Unknown Source) ~[?:?]

Error for run 2

Caused by: com.github.luben.zstd.ZstdException: Corrupted block detected
	at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157) ~[zstd-jni-1.5.2-1.jar:1.5.2-1]
	at is.hail.io.ZstdInputBlockBuffer.readBlock(InputBuffers.scala:655) ~[gs:__hail-query-ger0g_jars_ee77707f4fab22b1c253321b082a70aff3f44d1c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.BlockingInputBuffer.readBytes(InputBuffers.scala:444) ~[gs:__hail-query-ger0g_jars_ee77707f4fab22b1c253321b082a70aff3f44d1c.jar.jar:0.0.1-SNAPSHOT]
	at is.hail.io.LEB128InputBuffer.readBytes(InputBuffers.scala:253) ~[gs:__hail-query-ger0g_jars_ee77707f4fab22b1c253321b082a70aff3f44d1c.jar.jar:0.0.1-SNAPSHOT]
	at __C816collect_distributed_array_table_native_writer.__m893INPLACE_DECODE_o_binary_TO_o_string(Unknown Source) ~[?:?]
	at __C816collect_distributed_array_table_native_writer.__m889INPLACE_DECODE_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C816collect_distributed_array_table_native_writer.__m888INPLACE_DECODE_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32END_TO_r_array_of_o_struct_of_o_callANDo_int32ANDo_stringANDo_array_of_o_int32END(Unknown Source) ~[?:?]
	at __C816collect_distributed_array_table_native_writer.__m887DECODE_r_struct_of_r_array_of_o_struct_of_o_int32ANDo_int32ANDo_int32ANDo_binaryANDo_array_of_o_int32ENDEND_TO_SBaseStructPointer(Unknown Source) ~[?:?]

@danking
Contributor Author

danking commented Nov 30, 2023

I've asked Wenhan to run this pipeline with a JAR that has extra debugging information enabled: main...danking:hail:debug-13979.

@danking
Contributor Author

danking commented Nov 30, 2023

OK, here's the most recent failure https://batch.hail.is/batches/8090848/jobs/21993

Don't be duped by my bad log message! There were zero transient errors. I added a log statement that increments the number of errors and prints that message after every error, even if it's not transient.

This time it was partition 20053 (we keep moving earlier?). I forgot to catch and rethrow the error with the toString of the input buffer, but I'm not sure there is much to learn from that anyway.

FWIW, 20053 was successful in the two previous executions:

  1. https://batch.hail.is/batches/8069235/jobs/21993
  2. https://batch.hail.is/batches/8083195/jobs/21993

Interestingly the peak bytes are not consistent:

2023-10-24 19:59:47.756 : INFO: TaskReport: stage=0, partition=20053, attempt=0, peakBytes=58394624, peakBytesReadable=55.69 MiB, chunks requested=5513, cache hits=5501
2023-10-24 19:59:47.759 : INFO: RegionPool: FREE: 55.7M allocated (7.7M blocks / 48.0M chunks), regions.size = 21, 0 current java objects, thread 9: pool-2-thread-1
2023-11-08 19:42:40.000 : INFO: TaskReport: stage=0, partition=20053, attempt=0, peakBytes=61343744, peakBytesReadable=58.50 MiB, chunks requested=5513, cache hits=5501
2023-11-08 19:42:40.000 : INFO: RegionPool: FREE: 58.5M allocated (10.5M blocks / 48.0M chunks), regions.size = 21, 0 current java objects, thread 10: pool-2-thread-2

Whatever is causing this bug is rare: approximately once every 31,000 partitions.

The CDA IR is the same except for a couple of iruid names; the order of the aggregators in the aggregator array is also swapped (collect & take vs. take & collect).

AFAICT, the GCS Java library doesn't do any streaming verification of the hash. We could compute the CRC32c in a streaming manner and fail if/when we get to the end of the object, but this wouldn't work when we read intervals. I'm really mystified.
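
To make the streaming-verification idea concrete, here is a sketch using java.util.zip.CRC32C, assuming the object's expected checksum can be obtained from its metadata (the expectedCrc32c parameter is an assumption). As noted above, this only helps for full-object reads, not interval reads.

import java.io.InputStream
import java.util.zip.CRC32C

// Sketch: checksum the stream as we consume it and fail at end-of-object on mismatch.
def readAndVerify(in: InputStream, expectedCrc32c: Long)(consume: (Array[Byte], Int) => Unit): Unit = {
  val crc = new CRC32C()
  val buf = new Array[Byte](64 * 1024)
  var n = in.read(buf)
  while (n != -1) {
    crc.update(buf, 0, n)
    consume(buf, n)
    n = in.read(buf)
  }
  if (crc.getValue != expectedCrc32c)
    throw new java.io.IOException(s"CRC32C mismatch: expected $expectedCrc32c, got ${crc.getValue}")
}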

@danking
Contributor Author

danking commented Dec 6, 2023

Now five times: partitions 41869, 45088, 46688, 47294, and 59799.

@danking
Contributor Author

danking commented Dec 6, 2023

43809 execute(...)_stage2_table_native_writer_job41869 Failed 16s 647ms $0.0001
47028 execute(...)_stage2_table_native_writer_job45088 Failed 21s 132ms $0.0002
48628 execute(...)_stage2_table_native_writer_job46688 Failed 15s 179ms $0.0001
49234 execute(...)_stage2_table_native_writer_job47294 Failed 17s 6ms $0.0001
61739 execute(...)_stage2_table_native_writer_job59799 Failed 28s 941ms $0.0002

@danking
Contributor Author

danking commented Dec 6, 2023

We tried updating to zstd-jni 1.5.5-11 from 1.5.5-2. There were four failures:

6873 execute(...)_stage2_table_native_writer_job4933 Failed 13s 631ms $0.0001
7157 execute(...)_stage2_table_native_writer_job5217 Failed 15s 919ms $0.0001
8854 execute(...)_stage2_table_native_writer_job6914 Failed 1 minute 12s $0.0006
12795 execute(...)_stage2_table_native_writer_job10855 Failed 21s 305ms $0.0002

@danking
Contributor Author

danking commented Dec 6, 2023

In the latter two cases, the error does not come from zstd decompression. It comes later, during region allocation and when calling isHet on a Call with ploidy 3.

When zstd does notice a decompression issue, it's always immediately after a read. In this case, immediately after a read of the entries data, but in the past we've seen reads of other MTs/HTs. Note that the entries are the bulk of the bytes, so if there's something that's rare in terms of bytes processed, we're just much more likely to see it in the entries.

@danking
Contributor Author

danking commented Dec 6, 2023

Let's compare 8093951-8854 to 8093977-8854. The latter is a failed task (partition 6914); the former is successful. We'll download the logs and toss away some debug info that changed between the experiments:

cat log | rg StreamBlockInputBuffer: | sed 's/bytes.*//'  > newlog

Since the latter failed, the log obviously ends earlier, but there are no differences (besides timestamps) in the sizes of the blocks read from GCS. Since these block sizes are read from the input stream, this is pretty good evidence that the bytes are not corrupted up to that point.

# git diff --no-index --word-diff good bad 
...
2023-12-06 [-19:47:11.500-]{+21:39:00.885+} StreamBlockInputBuffer: INFO: reading 2081[-2023-12-06 19:47:11.531 StreamBlockInputBuffer: INFO: reading 2499-]

The decompressed data size is the same: 65536. It's worth noting this is a relatively small compressed buffer after a series of much larger compressed buffers: this one is 2081 bytes and the immediately previous one is 14675 bytes. Most of the ones before it are also in the 14k range.


The same experiment on job 7157 again shows no differences in the bytes read before the exception occurs.

2023-12-06 [-19:45:18.693-]{+21:36:52.116+} StreamBlockInputBuffer: INFO: reading 17923 
2023-12-06 [-19:45:18.809-]{+21:36:52.388+} StreamBlockInputBuffer: INFO: reading 17843[-2023-12-06 19:45:18.810 StreamBlockInputBuffer: INFO: reading 17657-]
[-2023-12-06 19:45:18.811 StreamBlockInputBuffer: INFO: reading 17646-]

The network reads are identical other than the size of the first read. That first read is the serialized function, so I'm not that surprised it differs in size between different commits of Hail. The byte counting is done in our code; if we're counting bytes correctly, then it seems like we're reading the same series of chunks from GCS.

GoogleStorageFS$: INFO: read 1755052 (0 of 1755052) oldbb(0, 8388608) newbb(0, 1755052)
GoogleStorageFS$: INFO: read 8388608 (62604 of 58870664) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 82569 (0 of 82569) oldbb(0, 8388608) newbb(0, 82569)
GoogleStorageFS$: INFO: read 8388608 (0 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (0 of 17833031) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (8388608 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 1753672 (0 of 1753672) oldbb(0, 8388608) newbb(0, 1753672)
GoogleStorageFS$: INFO: read 8388608 (62604 of 58870664) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 82569 (0 of 82569) oldbb(0, 8388608) newbb(0, 82569)
GoogleStorageFS$: INFO: read 8388608 (0 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (0 of 17833031) oldbb(0, 8388608) newbb(0, 8388608)
GoogleStorageFS$: INFO: read 8388608 (8388608 of 214947046) oldbb(0, 8388608) newbb(0, 8388608)
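
For context, the byte counting mentioned above is the kind of thing a thin wrapper around the read channel can do. A hypothetical sketch, not GoogleStorageFS's actual code:

import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Hypothetical sketch: count and log bytes as they come off the underlying channel.
final class CountingChannel(underlying: ReadableByteChannel) extends ReadableByteChannel {
  private var total: Long = 0L

  override def read(dst: ByteBuffer): Int = {
    val n = underlying.read(dst)
    if (n > 0) {
      total += n
      println(s"read $n bytes ($total total)")
    }
    n
  }

  def bytesRead: Long = total
  override def isOpen: Boolean = underlying.isOpen
  override def close(): Unit = underlying.close()
}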

@danking
Contributor Author

danking commented Dec 6, 2023

I'm really starting to grasp at straws here. Hail's code has been around the block several times, so I have some confidence in it. What's new here? Well, this is Query-on-Batch, meaning we're using the Google Cloud Storage Java API directly rather than whatever Hadoop does. A recently reported data-corruption bug was fixed in 2.30.0: googleapis/java-storage#2301.

It wouldn't be the first time we found bugs in the Google Cloud Storage Java API. #13937

danking pushed a commit to danking/hail that referenced this issue Dec 7, 2023
CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected".

This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage Java library
has a bug, present at least since 2.29.0, in which simply incorrect data was
returned: googleapis/java-storage#2301. The issue seems related to their
use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:

1. If there's no channel, open a new channel with the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the `position` hasn't been updated. When
we recreate the channel we will jump to the wrong position and re-read some data. Lucky for us,
between Zstd and our assertions, this usually crashes the program instead of silently returning bad
data.

This is the third bug we have found in Google's cloud storage java library. The previous two:

1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check if updating
google-cloud-storage fixes the problem.
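
For illustration only, here is a sketch of the failure mode in steps 1-4. This is not the java-storage implementation; the names and structure are hypothetical. It just shows how bytes already handed to the caller get re-read when a retry reopens the channel at a stale position.

import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Hypothetical sketch of the bug pattern described above, not the real library code.
def readWithRetry(open: Long => ReadableByteChannel, dst: ByteBuffer): Unit = {
  var position = 0L          // only advanced after a fully successful pass
  var done = false
  while (!done) {
    val ch = open(position)  // (1) open a channel at the recorded position
    try {
      val tmp = ByteBuffer.allocate(64 * 1024)
      ch.read(tmp)           // (2) some bytes land in an intermediate buffer...
      tmp.flip()
      dst.put(tmp)           //     ...and are handed to the caller
      ch.read(ByteBuffer.allocate(64 * 1024)) // (3) a retryable exception is thrown here
      position += dst.position()              // position is only updated on success
      done = true
    } catch {
      case _: java.io.IOException =>
        // (4) retry from the top: `position` was never advanced by the bytes
        // already copied into `dst`, so the next pass re-reads and re-delivers them.
        ()
    } finally ch.close()
  }
}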
danking added a commit that referenced this issue Dec 7, 2023
danking pushed a commit to danking/hail that referenced this issue Dec 16, 2023