Arrow-rs debugging (Error: Expected to read 2166784 metadata bytes, but only read 486.) [Solved] #19

Closed
kylebarron opened this issue Mar 10, 2022 · 1 comment


kylebarron commented Mar 10, 2022

For a while, before switching to arrow2/parquet2 (i.e. up until this commit), I was using the arrow and parquet crates from https://github.com/apache/arrow-rs. I repeatedly hit an issue with some files: the Parquet file would be readable in Rust, but the generated Arrow IPC data wouldn't be readable in JS. This caused a ton of frustration, and switching to arrow2/parquet2 seemed to solve it, but I didn't know why.

With more debugging (the crucial step was logging the vector in Rust right before returning it, and the Uint8Array received in JS), I realized that the data wasn't being transferred back to JS correctly! E.g. when testing at this commit with the test file 1-partition-snappy.parquet, the arrays on the JS and Rust sides had the same length but different contents.

It appears the entire issue was the reliance on unsafe { Uint8Array::view(&file) }. When I instead create a new Uint8Array and copy the file into it, the arrays in JS and in Rust match, and the file is read successfully by Arrow JS.

From the wasm-bindgen (js-sys) docs for Uint8Array::view:

Views into WebAssembly memory are only valid so long as the backing buffer isn’t resized in JS. Once this function is called any future calls to Box::new (or malloc of any form) may cause the returned value here to be invalidated. Use with caution!

Additionally the returned object can be safely mutated but the input slice isn’t guaranteed to be mutable.

Finally, the returned object is disconnected from the input slice’s lifetime, so there’s no guarantee that the data is read at the right time.
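The hazard described in the quoted docs can be modeled in plain Rust. The sketch below is only a toy: `LinearMemory`, `View`, and `view_vs_copy` are hypothetical names invented for illustration, not part of wasm-bindgen or js-sys, and the "allocator reuses the region" step is an assumption about what may happen once the Rust buffer behind a view is freed. It shows why a view can end up with the same length but different bytes, while a copy taken earlier stays intact.

```rust
/// Toy stand-in for WebAssembly linear memory: a single byte buffer.
/// (Hypothetical model for illustration; not the wasm-bindgen API.)
struct LinearMemory {
    bytes: Vec<u8>,
}

/// A "view" only records where the data lives; it owns nothing.
#[derive(Clone, Copy)]
struct View {
    offset: usize,
    len: usize,
}

impl LinearMemory {
    fn new(size: usize) -> Self {
        Self { bytes: vec![0; size] }
    }

    /// Write `data` at `offset`, as an allocator handing out that region would.
    fn write(&mut self, offset: usize, data: &[u8]) {
        self.bytes[offset..offset + data.len()].copy_from_slice(data);
    }

    /// Reading through a view sees whatever is in memory at read time.
    fn read_view(&self, v: View) -> &[u8] {
        &self.bytes[v.offset..v.offset + v.len]
    }

    /// Copying snapshots the bytes at the moment of the call.
    fn copy_out(&self, v: View) -> Vec<u8> {
        self.read_view(v).to_vec()
    }
}

/// Returns (bytes seen through the stale view, bytes held by the earlier copy).
fn view_vs_copy() -> (Vec<u8>, Vec<u8>) {
    let mut mem = LinearMemory::new(16);
    mem.write(0, b"ARROW1"); // sample payload standing in for the IPC bytes
    let view = View { offset: 0, len: 6 };
    let copy = mem.copy_out(view);

    // Assumption: the Rust buffer has been freed and the region gets reused
    // before the consumer reads through the view.
    mem.write(0, &[0xFF; 6]);

    (mem.read_view(view).to_vec(), copy)
}
```

Calling `view_vs_copy()` yields two buffers of equal length where only the copy still holds the original payload, which matches the symptom of arrays with the same length but different contents. In js-sys terms, the copying path corresponds to building the `Uint8Array` from the slice (for example via its `From<&[u8]>` implementation) rather than calling `Uint8Array::view`.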

To be honest, I'm not entirely sure where I was violating these principles (or maybe it was some internals of the arrow FileWriter). So it makes sense (at least for now) to remove the unsafe code and create a new Uint8Array buffer to solve this 🙂.

Note that creating another Uint8Array buffer would put more memory pressure on WebAssembly, which seems to run out of memory after using 1GB, but that's a problem for the future (ideally we'll be able to return a stream of record batches to JS).
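As a rough illustration of the memory trade-off: copying the whole output needs a second allocation as large as the data, whereas handing it over piece by piece only needs one piece-sized copy at a time. The sketch below uses plain byte chunks as a stand-in for record batches; `send_in_chunks` and `consumer` are hypothetical names, and this is not how the library currently returns data.

```rust
/// Hand `data` to `consumer` in chunks of at most `chunk_len` bytes, copying
/// one chunk at a time. Returns the size of the largest single copy made.
/// (`send_in_chunks` and `consumer` are hypothetical, for illustration only.)
fn send_in_chunks<F: FnMut(Vec<u8>)>(data: &[u8], chunk_len: usize, mut consumer: F) -> usize {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    let mut peak = 0;
    for chunk in data.chunks(chunk_len) {
        // The only extra allocation this function holds at this point.
        let owned = chunk.to_vec();
        peak = peak.max(owned.len());
        // The consumer takes ownership and decides when to drop the chunk.
        consumer(owned);
    }
    peak
}
```

With a 10-byte input and `chunk_len` of 4, the consumer receives chunks of 4, 4, and 2 bytes, so the largest single copy is 4 bytes instead of 10. The saving only holds if the consumer releases each chunk before asking for the next one.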

This was referenced Mar 11, 2022
@kylebarron

Going to close this now because I now exclusively copy the Vec<u8> into a new Uint8Array when returning data to the client, and don't return views, at least for now.
