This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Support loading Feather v2 (IPC) files with more than 1 million tables #231

Closed
ghuls opened this issue Jul 26, 2021 · 0 comments · Fixed by #240
Labels
enhancement An improvement to an existing feature

Comments


ghuls commented Jul 26, 2021

As can be seen in pola-rs/polars#1023, loading of Feather v2 (IPC) files with more than 1 million tables does not work.

(py)arrow had the same bug: https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10056

It boils down to the flatbuffer verification code, which has max_tables=1_000_000 by default.
Increasing this limit solves the problem.

In (py)arrow, the maximum table count is determined per dataset from the footer size, so that a specially crafted, very small IPC file cannot take an extraordinary amount of time to verify:

apache/arrow#9447

In [33]: %time df2 = pl.read_ipc('test.v2.feather', use_pyarrow=False)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

/software/polars/py-polars/polars/io.py in read_ipc(file, use_pyarrow, storage_options)
    415             tbl = pa.feather.read_table(data)
    416             return pl.DataFrame.from_arrow(tbl)
--> 417         return pl.DataFrame.read_ipc(data)
    418 
    419 

/software/polars/py-polars/polars/eager/frame.py in read_ipc(file)
    606         """
    607         self = DataFrame.__new__(DataFrame)
--> 608         self._df = PyDataFrame.read_ipc(file)
    609         return self
    610 

RuntimeError: Any(ArrowError(Ipc("Unable to get root as footer: TooManyTables")))

I think if you replace gen::File::root_as_footer with gen::File::root_as_footer_with_opts, you can set the max_tables option. At the current call site:

let footer = gen::File::root_as_footer(&footer_data[..])

https://docs.rs/flatbuffers/2.0.0/src/flatbuffers/get_root.rs.html#39-49
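
As a rough illustration of the footer-size heuristic described above, the limits could be derived as in the sketch below. This is not the code from the eventual fix: `FooterLimits` and `limits_for_footer` are hypothetical names, and the "one table per bit of footer" factor and the depth limit of 128 are assumptions based on my reading of the C++ change (ARROW-11559). With the flatbuffers crate, the two values would presumably go into the `max_depth` and `max_tables` fields of `VerifierOptions` before calling `root_as_footer_with_opts`.

```rust
/// Hypothetical stand-in for the two verifier limits discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FooterLimits {
    max_depth: usize,
    max_tables: usize,
}

/// Default table limit of the flatbuffer verifier, as reported in this issue.
const DEFAULT_MAX_TABLES: usize = 1_000_000;

/// Derive verification limits from the footer size in bytes.
///
/// Assumption: every table occupies at least one bit on average, so a footer
/// of `footer_len` bytes can hold at most `8 * footer_len` tables. A tiny
/// crafted file therefore gets a tiny budget, while a genuinely large footer
/// is allowed to exceed the fixed default.
fn limits_for_footer(footer_len: usize) -> FooterLimits {
    FooterLimits {
        // Assumed depth limit; not taken from this issue.
        max_depth: 128,
        // saturating_mul avoids overflow for absurdly large lengths.
        max_tables: footer_len.saturating_mul(8),
    }
}

fn main() {
    let small = limits_for_footer(1_024);
    let large = limits_for_footer(64 * 1024 * 1024);

    println!("1 KiB footer:  {:?}", small);
    println!("64 MiB footer: {:?}", large);
    println!(
        "large footer exceeds default limit: {}",
        large.max_tables > DEFAULT_MAX_TABLES
    );
}
```

The point of scaling with the footer size, rather than simply raising the constant, is that the verification cost stays proportional to the input size.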

ghuls added a commit to ghuls/arrow2 that referenced this issue Jul 30, 2021
…+ arrow implementation.

Set Flatbuffer verification parameters to the same settings as the
C++ arrow implementation (ARROW-11559). This change allows reading
IPC data with more than 1 million columns.

Closes: jorgecarleitao#231
@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Jul 30, 2021
jorgecarleitao pushed a commit that referenced this issue Jul 31, 2021
…+ arrow implementation. (#240)

Set Flatbuffer verification parameters to the same settings as the
C++ arrow implementation (ARROW-11559). This change allows reading
IPC data with more than 1 million columns.

Closes: #231