Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fixed error reading unbounded Avro list #1253

Merged
merged 2 commits into from
Sep 20, 2022
Merged

Fixed error reading unbounded Avro list #1253

merged 2 commits into from
Sep 20, 2022

Conversation

jorgecarleitao
Copy link
Owner

Fixed #1252 - Thanks to @shaeqahmed for the report!

@jorgecarleitao jorgecarleitao added the bug Something isn't working label Sep 19, 2022
@codecov
Copy link

codecov bot commented Sep 19, 2022

Codecov Report

Base: 83.15% // Head: 83.16% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (b2fbadf) compared to base (96df5e1).
Patch coverage: 70.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1253      +/-   ##
==========================================
+ Coverage   83.15%   83.16%   +0.01%     
==========================================
  Files         359      359              
  Lines       38063    38158      +95     
==========================================
+ Hits        31650    31735      +85     
- Misses       6413     6423      +10     
Impacted Files Coverage Δ
src/io/avro/read/deserialize.rs 81.09% <70.00%> (-0.02%) ⬇️
src/io/ipc/read/file.rs 97.32% <0.00%> (ø)
src/io/avro/read/nested.rs 66.66% <0.00%> (+4.87%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

-len
} else {
len
};
Copy link
Contributor

@shaeqahmed shaeqahmed Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks correct to me, but it actually fixes a different bug than the one I was referring to in #1252. The fix here addresses the handling the case when an avro list item block encodes an optional byte_size for data skipping as <-1*block_count><byte_size>....

The issue I was referring to, however, is the call to array.try_push_valid()?; on line 145 (136 original), which is called for each block deserialised. Shouldn't this call be moved right outside of the loop {...} onto line 147, so that we only push valid once per list item?

Thinking out loud, for an empty list, looks like the if len == 0 { break; } causes this loop to short circuit currently meaning we won't push a valid for that case too. I wonder if this is also an unintended bug, of having this call to try_push_valid in the block reading loop as opposed to outside.

To clarify, I am concerned about the handling of the following two cases:

  • List [1,2,3,4], encoded as two "blocks"

    • Currently we make 2 calls to try_push_valid whereas I would expect 1
  • List [], encoded as 0 blocks, indicated by a prefixed byte of len == 0

    • Currently we make 0 calls to try_push_valid whereas I would expect 1, as you mentioned before that empty lists/structs are still "valid" in Arrow so as to maintain O(1) validity checks.

Again, I am not well versed in the Avro format, so feel free to correct me if I have an incorrect understanding here. I came across these potential bugs while reading the deserialisation code thoroughly, as I was considering taking a stab at implementing Map type support for Avro, which will be very similar to List. Thank you

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Really good call.

Did another push with an extra fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, LGTM

@jorgecarleitao jorgecarleitao merged commit e972df0 into main Sep 20, 2022
@jorgecarleitao jorgecarleitao deleted the fix_avro_list branch September 20, 2022 19:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential bug in reading lists from avro?
2 participants