Apache Arrow Support #202

Open
michalwols opened this issue Oct 21, 2022 · 5 comments

Comments

@michalwols

Would be great to have an efficient way to serialize msgspec structs to Apache Arrow, which would also open the door to Parquet and other tools in the Arrow ecosystem, such as DuckDB.

@jcrist
Owner

jcrist commented Oct 21, 2022

Thanks for opening this! "Arrow support" could mean a lot of things. Can you provide a few specific, concrete tasks you want to be able to handle? What would you use this feature for?

@michalwols
Author

michalwols commented Oct 21, 2022

I'm trying to hack together a human + model in the loop dataset management / annotation tool (for computer vision and NLP). It includes:

  1. JSON-based REST API (Django, with bounding boxes, masks and model predictions encoded as JSON using msgspec)
  2. online inference / training / background tasks with Ray actors, which includes fetching large embedding tables for nearest neighbor search, few-shot learning and ranking
  3. OLAP queries with DuckDB
  4. structured logs (currently JSON Lines, but I want to switch to msgpack using msgspec) for logging query results, model predictions and training metrics
  5. storing data snapshots / views in Parquet
  6. training models on top of the Parquet files using PyTorch. Right now I end up converting samples to dicts, but it would be nice to use the same msgspec.Struct definitions, with extra methods for encoding the annotations in different formats.

Ideally I'd like to define the schema for all of these things in one place using msgspec structs, so the main thing is a mapping from a msgspec schema to an Arrow schema. Having an efficient way to convert between msgspec structs and Arrow batches/tables without going through Python dicts in the middle would be great too. A dream scenario would be a zero-copy view on top of Arrow tables using an immutable version of msgspec.Struct.

TLDR for now: an efficient msgspec.arrow.encode and msgspec.arrow.decode, which would also make it easy to do the same for parquet.

@michaelbilow

Bumping this issue up a bit. I have a pretty narrow use case: I'd like to dump msgspec.Structs into Parquet for long-term storage.

Transcoding to Arrow through msgspec directly would be great, but it would also be fine if I could just get the schema out via https://github.com/koxudaxi/datamodel-code-generator; I saw @jcrist's comment on koxudaxi/datamodel-code-generator#1278.

Do you have thoughts on which project would be the more appropriate place to work on this?

@cofin

cofin commented Aug 29, 2023

> TLDR for now: an efficient msgspec.arrow.encode and msgspec.arrow.decode, which would also make it easy to do the same for parquet.

Just to echo this. I use quite a bit of DuckDB and Arrow. This type of functionality would be very useful to me.

@fungs

fungs commented Dec 14, 2023

I can think of many binary serialization formats that are more efficient and have a richer type system than the ones provided here. In fact, I'm currently working on one for which I want to add msgspec support. IMO the preferred way should be to ship them as separate packages and enable msgspec to do this easily.

BTW: Right now, it seems like the framework is focused on large numbers of small data objects rather than big ones, which is typical for web-based applications. I believe there are currently some design limitations for handling larger objects, which I'm exploring right now.

Out of curiosity: I thought that Arrow is for tabular data. How would it store arbitrary structured data? Wouldn't HDF5 be a more suitable candidate?


5 participants