Use arrow ipc file format #1933

lhoestq · 2021-02-23T10:38:24Z

According to the documentation, the file format is identical to the streaming format except that it also contains the memory offsets of each data block (record batch):

We define a “file format” supporting random access that is built with the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access to any record batch in the file. See File.fbs for the precise details of the file footer.
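To make the layout described in the quote concrete, here is a minimal sketch that only inspects the framing: the leading magic string, the trailing magic string, and the footer size stored just before it. It assumes the footer size is a little-endian int32 placed immediately before the trailing ARROW1, which is my reading of the Arrow format specification. The helper name `inspect_ipc_file_framing` is hypothetical, and the demo bytes are synthetic placeholders, not a real Arrow file.

```python
import io
import struct

MAGIC = b"ARROW1"


def inspect_ipc_file_framing(f):
    """Return (has_magic, footer_size) for a seekable binary file object.

    Hypothetical helper: it checks the framing only and does not parse
    the footer or any record batch.
    """
    size = f.seek(0, io.SEEK_END)
    # Smallest possible framing: leading magic + int32 footer size + trailing magic.
    if size < 2 * len(MAGIC) + 4:
        return False, None
    f.seek(0)
    head = f.read(len(MAGIC))
    # Assumed tail layout: int32 footer size (little-endian), then the magic string.
    f.seek(size - 4 - len(MAGIC))
    tail = f.read(4 + len(MAGIC))
    if head != MAGIC or tail[4:] != MAGIC:
        return False, None
    return True, struct.unpack("<i", tail[:4])[0]


# Synthetic bytes that imitate the framing (not a valid Arrow file).
fake_footer = b"\x00" * 24
fake_file = io.BytesIO(
    MAGIC + b"\x00\x00"               # magic padded to 8 bytes
    + b"<stream format bytes>"        # stands in for the streaming-format body
    + fake_footer                     # stands in for the flatbuffer footer
    + struct.pack("<i", len(fake_footer))
    + MAGIC
)
framing = inspect_ipc_file_framing(fake_file)
```

A reader that wants random access would seek to the footer using that size and read the per-block offsets from it, which is the part the streaming format has no equivalent for.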

Since it stores more metadata regarding the positions of the examples in the file, it should enable better example retrieval performance. However, from the discussion in #1803, it unfortunately looks like this is not the case. Maybe this will allow speed gains in the future.

I think it's still a good idea to start using it anyway for these reasons:

in the future we may have speed gains
it contains the arrow streaming format data
it's compatible with the pyarrow Dataset implementation (which allows loading remote dataframes, for example) if we want to use it in the future
it's also the format used by Arrow Feather if we want to use it in the future
it's roughly the same size as the streaming format
it's easy to have backward compatibility with the streaming format

albertvillanova · 2023-09-24T09:52:38Z

Should we close this PR?

lhoestq · 2023-09-25T09:19:58Z

Yes, this one was mostly related to #4542 but now I think the TF support is not needed at the moment.

patrikkj · 2023-10-30T16:20:19Z

What about enabling the Arrow IPC file format through an environment variable or config option? It would be very helpful for better interoperability with other libraries (e.g. polars) and for supporting reads through pyarrow datasets.

use arrow ipc file format

1c00477

lhoestq mentioned this pull request Apr 21, 2021

Filtering/mapping on one column is very slow #2193

Closed

lhoestq mentioned this pull request May 26, 2021

Slow dataloading with big datasets issue persists #2252

Closed

lhoestq mentioned this pull request Jun 4, 2021

ArrowDataset.save_to_disk produces files that cannot be read using pyarrow.feather #2377

Open

lhoestq closed this Sep 25, 2023

albertvillanova deleted the use-arrow-ipc-file-format branch September 25, 2023 10:29
