
Removed panics on read #150

Merged
merged 2 commits into from Jun 26, 2022

Conversation

jorgecarleitao (Owner) commented Jun 14, 2022

Removed all panics on invalid data:

  • reading schemas is now panic free (there was a bug in the thrift deserialization causing a panic on invalid union input)
  • invalid lengths of the rep and def level buffers in v1 pages
  • negative compressed and uncompressed sizes in pages
  • invalid i32 -> usize casts (a sketch of this error-returning pattern follows below the list)
  • invalid RLE-encoded data
  • invalid num_bits declarations
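
For illustration, here is a minimal sketch of the kind of error-returning validation described above; the Error type and validate_v1_page are hypothetical names for this sketch, not the crate's actual API:

use std::convert::TryFrom;

#[derive(Debug)]
pub enum Error {
    OutOfSpec(String),
}

/// Validates the declared rep/def level lengths of a (hypothetical) v1 page
/// instead of indexing past the end of the buffer and panicking.
fn validate_v1_page(buffer: &[u8], rep_levels_len: i32, def_levels_len: i32) -> Result<(), Error> {
    // `i32 -> usize` fails for negative values; `try_from` turns that into an
    // error instead of a panic on an invalid cast.
    let rep = usize::try_from(rep_levels_len)
        .map_err(|_| Error::OutOfSpec("negative repetition levels length".to_string()))?;
    let def = usize::try_from(def_levels_len)
        .map_err(|_| Error::OutOfSpec("negative definition levels length".to_string()))?;

    // the declared lengths must fit inside the page buffer
    let total = rep
        .checked_add(def)
        .ok_or_else(|| Error::OutOfSpec("levels length overflows usize".to_string()))?;
    if total > buffer.len() {
        return Err(Error::OutOfSpec(
            "declared levels length exceeds the page size".to_string(),
        ));
    }
    Ok(())
}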

It is still possible to abort on OOM when reading invalid pages: when we read pages, we should be aware of the total column size (or something similar) and avoid reading pages larger than the file size. We need to think a bit about how we want to do this.

jorgecarleitao added the bug (Something isn't working) label on Jun 14, 2022
codecov-commenter commented Jun 14, 2022

Codecov Report

Merging #150 (e108a10) into main (2cd218f) will decrease coverage by 0.60%.
The diff coverage is 68.91%.

@@            Coverage Diff             @@
##             main     #150      +/-   ##
==========================================
- Coverage   74.69%   74.08%   -0.61%     
==========================================
  Files          78       78              
  Lines        3639     3700      +61     
==========================================
+ Hits         2718     2741      +23     
- Misses        921      959      +38     
Impacted Files Coverage Δ
src/read/page/stream.rs 76.08% <0.00%> (ø)
src/page/mod.rs 70.16% <40.00%> (-11.75%) ⬇️
src/page/page_dict/fixed_len_binary.rs 57.14% <50.00%> (-7.57%) ⬇️
src/schema/io_thrift/from_thrift.rs 82.14% <50.00%> (-1.50%) ⬇️
src/page/page_dict/primitive.rs 60.00% <54.54%> (ø)
src/read/page/reader.rs 84.68% <60.00%> (-8.18%) ⬇️
src/encoding/bitpacking.rs 97.95% <75.00%> (+2.04%) ⬆️
src/metadata/column_chunk_metadata.rs 70.37% <75.00%> (+0.80%) ⬆️
src/page/page_dict/binary.rs 76.66% <75.00%> (-4.82%) ⬇️
src/deserialize/utils.rs 88.09% <81.81%> (-3.80%) ⬇️
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao changed the title from "Removed some panics" to "Removed panics" on Jun 15, 2022
jorgecarleitao changed the title from "Removed panics" to "Removed panics on read" on Jun 15, 2022
jorgecarleitao (Owner, Author) commented

For reference, the test I am running for this is

fn read_corrupted(data: Vec<u8>, field: &str) -> Result<()> {
    // reading corrupted bytes must return an error instead of panicking
    let mut reader = std::io::Cursor::new(data);
    read_column(&mut reader, 0, field)?;
    Ok(())
}

fn stress_test(valid: Vec<u8>) {
    use rand::Rng; // 0.8.0

    for field in [
        "int64",
        "float64",
        "string",
        "bool",
        "date",
        "uint32",
        "fixed_binary",
    ] {
        for _ in 0..10000 {
            // flip one random byte of an otherwise valid file and check that
            // reading it does not panic (the returned Result is ignored)
            let mut data = valid.clone();
            let position: usize = rand::thread_rng().gen_range(0..data.len());
            let new_byte: u8 = rand::thread_rng().gen_range(0..u8::MAX);
            data[position] = new_byte;
            let _ = read_corrupted(data, field);
        }
    }
}

#[test]
fn test_plain_does_not_panic() {
    let path = "fixtures/pyarrow3/v1/non_dict/basic_nullable_10.parquet";
    let valid = std::fs::read(path).unwrap();

    stress_test(valid);
}

#[test]
fn test_dict_does_not_panic() {
    let path = "fixtures/pyarrow3/v1/dict/basic_nullable_10.parquet";
    let valid = std::fs::read(path).unwrap();

    stress_test(valid);
}

This now only aborts due to an allocation request larger than the available memory, i.e. we still have an issue, but it is a different kind of issue than a panic caused by deserializing invalid data.
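
For illustration, a minimal sketch of the kind of guard that could bound such allocations; the names are hypothetical (reusing the Error type from the sketch above), and max_page_size would come from the column chunk's total byte size or the file length rather than being trusted from the page header:

fn read_page_body<R: std::io::Read>(
    reader: &mut R,
    compressed_size: usize,
    max_page_size: usize,
) -> Result<Vec<u8>, Error> {
    // refuse to allocate more than the column/file can possibly contain, so a
    // corrupted length field cannot trigger an out-of-memory abort
    if compressed_size > max_page_size {
        return Err(Error::OutOfSpec(format!(
            "page declares {} bytes but at most {} are available",
            compressed_size, max_page_size
        )));
    }
    let mut buffer = vec![0u8; compressed_size];
    reader
        .read_exact(&mut buffer)
        .map_err(|_| Error::OutOfSpec("page body is shorter than its declared size".to_string()))?;
    Ok(buffer)
}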

jorgecarleitao merged commit 077b02c into main on Jun 26, 2022
jorgecarleitao deleted the fix_panic branch on June 27, 2022 17:18