Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Add missing call to try_push_valid for nested avro deserialization #1248

Merged
merged 6 commits into from Sep 16, 2022

Conversation

shaeqahmed
Copy link
Contributor

The validity len becomes out of sync compared to len (values[0].len) because there is no call to append valid after sucessfully deserializing a struct item. Add this functionality for MutableSturctArray similar to how it is handled for MutableListArray

@jorgecarleitao
Copy link
Owner

Hi @shaeqahmed . Thank you for the PR!

I am trying to understand why this is needed. My understanding is that deserialize_value only handles values, not nulls. The branching on the Union of ["null", type] happens before this function is called, by deserialize_item, and there we add the validity accordingly.

I added #1250 on an attempt to argue that it works as is.

@codecov
Copy link

codecov bot commented Sep 13, 2022

Codecov Report

Merging #1248 (5ddb6ca) into main (96df5e1) will decrease coverage by 0.01%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##             main    #1248      +/-   ##
==========================================
- Coverage   83.15%   83.13%   -0.02%     
==========================================
  Files         359      359              
  Lines       38063    38069       +6     
==========================================
- Hits        31650    31649       -1     
- Misses       6413     6420       +7     
Impacted Files Coverage Δ
src/io/avro/read/nested.rs 62.21% <80.00%> (+0.41%) ⬆️
src/io/avro/read/deserialize.rs 81.16% <100.00%> (+0.05%) ⬆️
src/bitmap/utils/slice_iterator.rs 97.56% <0.00%> (-1.22%) ⬇️
src/array/utf8/mod.rs 82.71% <0.00%> (-0.93%) ⬇️
src/io/ipc/read/file.rs 96.42% <0.00%> (-0.90%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@shaeqahmed
Copy link
Contributor Author

shaeqahmed commented Sep 13, 2022

The logic for Structs's currently is to lazily init the validity array and pad it with true's followed by a false on the first appearance of a null. After this, whenever we deserialize an item, if it is the null variant of a union, we push_null(). However, if the struct is not null, meaning the struct is non-empty, we don't push a true to the validity after deserializing all the underlying fields. The List equivalent for this is in line 136, where after deserializing all the underlying values for a list, we make a call to try_push_valid to push a true for the list.

I believe similar call is needed for MutableStructArray, and that's what I tried to add in this PR. I have a nested avro file that in the current logic triggers this issue, and fails to deserialize because the length of validity is too short.

I will try share it as an example, in addition to trying to create a minimal reproducible example for you. Please let me know if this makes sense.

@shaeqahmed
Copy link
Contributor Author

shaeqahmed commented Sep 13, 2022

Update: this PR is incorrect, as a naive call to try_push_valid here is not correct since that will run even if all the fields in the struct were null (which means the whole struct is null in parquet/arrow). I'm gonna try and put up another revision with, but would appreciate a sanity check that a change is necessary here since this is my first time contributing to the Arrow format. Thanks!

@jorgecarleitao
Copy link
Owner

You are right - this is a bug and the fix is this PR 👍 Thanks for the explanation.

If all elements are null, we still need to append the "true" to the StructArray. In arrow:

  • the length of the struct's validity must be equal to the length of every field in the struct array.
  • the validity of the struct is independent of the validity of its elements. In particular, there can be valid elements on a null slot and vice-versa. This is so that setting the validity of an array is O(1) (instead of a O(N*D)) operation (where N is length and D the max depth of the type).

You have a very good understanding of Arrow and this crate :)

I think this PR would just benefit from a test to confirm our hypothesis.

@shaeqahmed
Copy link
Contributor Author

Awesome, that makes sense, working on the updated PR revision + test cases for this locally. Gonna put something up tonight. 👍 With this change, I can read in my deeply nested nullable structs to arrow2 now, but looks like the arrow2/parquet2 parquet writer is writing corrupt files.

I see that in the code there are some explicitly unimplemented branches such as reading nulls from parquet, and the docs do mention lack of proper support for deeply nested parquet types (related issue: #1222).

I think we should 1) better document what is currently supported and not for arrow2 parquet functionality, and 2) fix the above mentioned bug so that we don't at least write incorrect files. I've already opened an issue to track this here: #1249 so let's continue the discussion there. I'd appreciate if you could take a look. Thanks for the help

@jorgecarleitao jorgecarleitao merged commit c615095 into jorgecarleitao:main Sep 16, 2022
@jorgecarleitao jorgecarleitao added the bug Something isn't working label Sep 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants