Apache Arrow Support #202
Comments
Thanks for opening this! "Arrow support" could mean a lot of things. Can you provide a few specific, concrete tasks you want to be able to handle? What would you use this feature for?
I'm trying to hack together a human-and-model-in-the-loop dataset management and annotation tool (for computer vision and NLP). It includes:
Ideally I'd like to define the schema for all of these things in one place using msgspec structs, so the main thing is a mapping from a msgspec schema to an Arrow schema. An efficient way to convert between msgspec structs and Arrow batches/tables without going through Python dicts in the middle would be great too. A dream scenario would be a zero-copy view on top of Arrow tables using an immutable version of msgspec.Struct. TLDR for now: an efficient
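To make the request above more concrete, here is a minimal, library-free sketch of the two pieces it describes: deriving a column schema from an annotated record type, and transposing records into columns without building a dict per record. Everything in it is an assumption for illustration: a frozen dataclass stands in for an immutable msgspec.Struct, `Annotation`, `arrow_schema_for`, `to_columns`, and `ARROW_TYPE_NAMES` are hypothetical names, and Arrow types are represented only by their names as strings. It is not how msgspec or Arrow implement anything.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

# Hypothetical mapping from Python annotations to Arrow primitive type names.
# A real mapping would also need optionals, lists, nested records, and so on.
ARROW_TYPE_NAMES = {int: "int64", float: "float64", str: "utf8", bool: "bool"}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record type; a stand-in for an immutable msgspec.Struct."""
    image_id: int
    label: str
    score: float


def arrow_schema_for(record_type):
    """Return [(field_name, arrow_type_name), ...] in declaration order."""
    hints = get_type_hints(record_type)
    schema = []
    for f in fields(record_type):
        py_type = hints[f.name]
        if py_type not in ARROW_TYPE_NAMES:
            raise TypeError(f"no Arrow mapping for {f.name}: {py_type!r}")
        schema.append((f.name, ARROW_TYPE_NAMES[py_type]))
    return schema


def to_columns(records, record_type):
    """Transpose records into one list per field.

    Values are read straight off the record attributes, so no
    intermediate dict is created for each record.
    """
    names = [f.name for f in fields(record_type)]
    return {name: [getattr(r, name) for r in records] for name in names}


rows = [Annotation(1, "cat", 0.9), Annotation(2, "dog", 0.75)]
schema = arrow_schema_for(Annotation)
columns = to_columns(rows, Annotation)
```

With the real libraries, `msgspec.structs.fields` could play the role of `dataclasses.fields`, and a dict of columns like this one is the shape that `pyarrow.table` accepts. This sketch still copies every value into Python lists, so it does not address the zero-copy view mentioned above.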
Bumping this issue up a bit. I have a pretty narrow use case, where I'd like to dump Transcoding to Do you have thoughts on which project it might be more appropriate to work on?
Just to echo this. I use quite a bit of DuckDB and Arrow. This type of functionality would be very useful to me.
I can think of many binary serialization formats that are more efficient and have a richer type system than the ones provided here. In fact, I'm working on one for which I want to add msgspec support at the moment. IMO the preferred way should be to ship them as separate packages and enable msgspec to do this easily. BTW: right now, it seems like the framework is focused on large numbers of small data objects rather than big ones, which is typical for web-based applications. I believe there are currently some design limitations for handling larger objects, which I'm exploring right now. Out of curiosity: I thought that Arrow was for tabular data. How would it store arbitrary structured data? Wouldn't HDF5 be a more suitable candidate?
It would be great to have an efficient way to serialize msgspec structs to Apache Arrow, which would also open it up to using Parquet and other tools in the Arrow ecosystem like DuckDB.