Use arrow ipc file format #1933

Closed

@lhoestq
Member

lhoestq commented Feb 23, 2021

According to the Arrow documentation, the IPC file format is identical to the streaming format except that it also stores the memory offset and size of each record batch:

We define a “file format” supporting random access that is built with the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access to any record batch in the file. See File.fbs for the precise details of the file footer.

Because the footer stores the position of each record batch in the file, it should in principle allow faster example retrieval. However, the discussion in #1803 suggests that this is unfortunately not the case in practice. It may still allow speed gains in the future.

I think it's still a good idea to start using it anyway for these reasons:

  • it may bring speed gains in the future
  • it contains the Arrow streaming format data
  • it's compatible with the pyarrow Dataset implementation (which allows loading remote dataframes, for example) if we want to use it in the future
  • it's also the format used by Arrow Feather if we want to use it in the future
  • it's roughly the same size as the streaming format
  • it's easy to keep backward compatibility with the streaming format
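
On the backward compatibility point, one possible approach (a sketch, not necessarily what this PR does) is to rely on the ARROW1 magic string mentioned in the documentation above: a file that starts and ends with it is in the file format, anything else is treated as a stream. The helper name `is_arrow_ipc_file` is hypothetical.

```python
import os

ARROW_MAGIC = b"ARROW1"


def is_arrow_ipc_file(path):
    """Return True if `path` starts and ends with the ARROW1 magic string."""
    size = os.path.getsize(path)
    # Too short to hold both the leading and the trailing magic string.
    if size < 2 * len(ARROW_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(ARROW_MAGIC))
        f.seek(-len(ARROW_MAGIC), os.SEEK_END)
        tail = f.read(len(ARROW_MAGIC))
    return head == ARROW_MAGIC and tail == ARROW_MAGIC
```

A loader could then call `pa.ipc.open_file` when this returns True and fall back to `pa.ipc.open_stream` otherwise, so existing cache files written in the streaming format would keep working.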

@albertvillanova
Member

Should we close this PR?

@lhoestq
Member Author

lhoestq commented Sep 25, 2023

Yes, this one was mostly related to #4542, but I think the TF support is not needed at the moment.

@lhoestq lhoestq closed this Sep 25, 2023
@albertvillanova albertvillanova deleted the use-arrow-ipc-file-format branch September 25, 2023 10:29
@patrikkj

What about enabling the Arrow IPC file format through an environment variable or config option? It would be very helpful for better interop with other libraries (e.g. polars) and for supporting reads through pyarrow datasets.
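
To make the suggestion concrete, here is a rough sketch of what such an opt-in switch could look like. The variable name `DATASETS_ARROW_IPC_FORMAT` and the helper `get_arrow_ipc_format` are hypothetical and do not exist in the library; the default stays on the streaming format so current behavior would not change.

```python
import os

# Hypothetical variable name, used only for illustration.
ENV_VAR = "DATASETS_ARROW_IPC_FORMAT"
VALID_FORMATS = ("stream", "file")


def get_arrow_ipc_format(environ=os.environ):
    """Return "stream" or "file", defaulting to the streaming format."""
    value = environ.get(ENV_VAR, "stream").strip().lower()
    if value not in VALID_FORMATS:
        raise ValueError(
            f"{ENV_VAR} must be one of {VALID_FORMATS}, got {value!r}"
        )
    return value
```

A writer could then pick `pa.ipc.new_file` or `pa.ipc.new_stream` based on the returned value.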
