
Selecting thousands/2M of columns is slow #1023

Closed
ghuls opened this issue Jul 22, 2021 · 21 comments
@ghuls
Collaborator

ghuls commented Jul 22, 2021

Are you using Python or Rust?

Python

What version of polars are you using?

0.8.12 (arrow2 branch)

What operating system are you using polars on?

CentOS 7

Describe your bug.

Selecting a huge number of columns from an existing dataframe seems to take more time than it should:

# polars from arrow2 branch to be able to load Feather files properly.
In [1]: import polars as pl

In [2]: import numpy as np

In [3]: %time df = pl.read_ipc('test.feather')
CPU times: user 30.5 s, sys: 17 s, total: 47.5 s
Wall time: 5min 26s

In [12]: df.shape
Out[12]: (24453, 2208374)

In [4]: region_ids = np.random.permutation(df.columns)[0:2000]

In [5]: %time dfr = df.select(region_ids)
CPU times: user 1min 31s, sys: 5.69 s, total: 1min 37s
Wall time: 20.7 s

In [6]: region_ids = np.random.permutation(df.columns)[0:20000]

In [7]: %time dfr = df.select(region_ids)
CPU times: user 14min 48s, sys: 8.86 s, total: 14min 57s
Wall time: 3min 22s

In [8]: %time dfr_pd = dfr.to_pandas()
CPU times: user 3.09 s, sys: 16.5 s, total: 19.6 s
Wall time: 58.5 s

In [9]: dfr.shape
Out[9]: (24453, 20000)

In [10]: region_ids = np.random.permutation(df.columns)[0:200000]

In [11]: %time dfr = df.select(region_ids)
CPU times: user 2h 19min 23s, sys: 37.2 s, total: 2h 20min
Wall time: 33min 17s

In [15]: %time dfr_pd = dfr.to_pandas()
CPU times: user 27 s, sys: 2min 20s, total: 2min 47s
Wall time: 7min 14s
@ritchie46
Member

What are your timings if you run

df[region_ids]

That one should not be dispatched on the thread pool.

@ghuls
Collaborator Author

ghuls commented Jul 22, 2021

# For 20000 columns:
In [8]: %time dfr = df[list(region_ids)]
CPU times: user 5min 14s, sys: 7.56 s, total: 5min 22s
Wall time: 5min 21s

When giving a NumPy array to df (df[np_array]), polars returns None.

@ritchie46
Member

Oh, it hits the same branch.

if isinstance(item, Sequence):

I will have a look.

@ghuls
Collaborator Author

ghuls commented Jul 22, 2021

Also in the beginning of the function there is a:

        if isinstance(item, np.ndarray):
            item = pl.Series("", item)

So I assume it never hits the first if isinstance(item, np.ndarray): branch anymore:

# select rows by mask or index
# df[[1, 2, 3]]
# df[true, false, true]
if isinstance(item, np.ndarray):
    if item.dtype == int:
        return wrap_df(self._df.take(item))
    if isinstance(item[0], str):
        return wrap_df(self._df.select(item))
if isinstance(item, (pl.Series, Sequence)):
    if isinstance(item, Sequence):
        # only bool or integers allowed
        if type(item[0]) == bool:
            item = pl.Series("", item)
        else:
            return wrap_df(
                self._df.take([self._pos_idx(i, dim=0) for i in item])
            )
    dtype = item.dtype
    if dtype == Boolean:
        return wrap_df(self._df.filter(item.inner()))
    if dtype == UInt32:
        return wrap_df(self._df.take_with_series(item.inner()))

but hits the Series branch (which doesn't support strings as far as I can tell).
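A hypothetical sketch of the dispatch order being discussed (plain Python, not the actual polars code; `classify_getitem` is an invented helper): checking for a string ndarray before converting every ndarray to a Series would let df[np_array_of_names] reach the select-by-name path instead of the integer/boolean Series branch.

```python
import numpy as np

def classify_getitem(item):
    # Handle NumPy arrays first, before any ndarray -> Series conversion,
    # so string arrays can still dispatch to selection by column name.
    if isinstance(item, np.ndarray):
        if item.dtype.kind in ("U", "S", "O") and isinstance(item[0], str):
            return "select_by_name"   # df.select(item)
        if item.dtype.kind in ("i", "u"):
            return "take_by_index"    # df.take(item)
    return "other"

print(classify_getitem(np.array(["col_a", "col_b"])))  # select_by_name
print(classify_getitem(np.array([0, 5, 7])))           # take_by_index
```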

@jorgecarleitao
Collaborator

Uf, I am pleasantly surprised that the IPC reader can chew through 2M columns. We would certainly benefit from projection pushdown, though.

@ghuls
Collaborator Author

ghuls commented Jul 22, 2021

I am surprised it is able to load it relatively fast, but that subsetting afterwards is so slow (I assume due to Python function-call overhead).

Also, using a polars Series with df[series] returns None.

@ritchie46
Member

ritchie46 commented Jul 22, 2021

Now that I think of it. You have random permutations and every named lookup is a linear search through column names. Given that there are 2 million columns this gets slow.

Maybe we should hash above a certain threshold.

Another thing is that every column name lookup has 3 layers of indirection. So yeah.. that's slow.
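The linear-vs-hashed lookup cost can be illustrated with a small, self-contained sketch (plain Python, not polars internals; the column names are made up):

```python
# Looking up m column names in a list of n names is O(n * m) with linear
# search, while a prebuilt name -> index map makes each lookup O(1).
columns = [f"col_{i}" for i in range(200_000)]
wanted = ["col_199999", "col_0", "col_100000"]

# Linear search: each lookup scans the whole list in the worst case.
linear = [columns.index(name) for name in wanted]

# Hashed lookup: build the map once, then every lookup is constant time.
name_to_idx = {name: i for i, name in enumerate(columns)}
hashed = [name_to_idx[name] for name in wanted]

assert linear == hashed  # same indices, vastly different cost at 2M columns
```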

@ritchie46
Member

Uf, I am pleasantly surprised that the IPC reader can chew through 2M columns. We would certainly benefit from projection pushdown, though.

Yes.. that would be a valuable addition.

@ghuls
Collaborator Author

ghuls commented Jul 22, 2021

Indeed, hashing might help. I think an Arrow table can theoretically contain the same column name more than once (not that that is the case for me), so there should probably be a check for that.

A polars DataFrame does not allow that, so it would already have errored.
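A minimal sketch of such a duplicate-name check (plain Python; `check_unique_names` is a hypothetical helper, not a polars API):

```python
from collections import Counter

def check_unique_names(names):
    """Raise if any column name occurs more than once."""
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")

check_unique_names(["a", "b", "c"])    # passes silently
# check_unique_names(["a", "b", "a"])  # would raise ValueError
```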

@ritchie46 ritchie46 changed the title Selecting thousands of columns is slow Selecting thousands/2M of columns is slow Jul 22, 2021
@ghuls
Collaborator Author

ghuls commented Jul 22, 2021

I also managed to get a panic when trying to select columns via numerical indexes:

thread '<unnamed>' panicked at 'index out of bounds: the len is 24453 but the index is 262480', /software/polars/polars/polars-core/src/chunked_array/kernels/take.rs:123:20

(the same panic was raised by multiple threads at once, so the original output was interleaved)

@ritchie46
Member

I also managed to get a panic when trying to select columns via numerical indexes:

thread '<unnamed>' panicked at 'index out of bounds: the len is 24453 but the index is 262480', /software/polars/polars/polars-core/src/chunked_array/kernels/take.rs:123:20

Could you create an issue with an example? I will take a look.

And I will add a hashing algorithm to the column selection code.

@ghuls
Collaborator Author

ghuls commented Jul 25, 2021

Hashing for column selection improved performance enormously (arrow2 branch):

In [12]: region_ids = np.random.permutation(df.columns)[0:2000]

In [13]: %time dfr = df[list(region_ids)]
CPU times: user 384 ms, sys: 61.1 ms, total: 445 ms
Wall time: 444 ms

In [14]: dfr.shape
Out[14]: (24453, 2000)

In [15]: region_ids = np.random.permutation(df.columns)[0:20000]

In [16]: %time dfr = df[list(region_ids)]
CPU times: user 346 ms, sys: 157 ms, total: 503 ms
Wall time: 528 ms

In [17]: dfr.shape
Out[17]: (24453, 20000)

In [18]: region_ids = np.random.permutation(df.columns)[0:200000]

In [19]: %time dfr = df[list(region_ids)]
CPU times: user 575 ms, sys: 116 ms, total: 691 ms
Wall time: 708 ms

In [20]: dfr.shape
Out[20]: (24453, 200000)

@ghuls
Collaborator Author

ghuls commented Jul 26, 2021

Uf, I am pleasantly surprised that the IPC reader can chew through 2M columns. We would certainly benefit from projection pushdown, though.

@jorgecarleitao I will try to test the Rust IPC reader now. I just saw that in this case it was actually using pyarrow's IPC reader (this Feather file was in Feather v1 format, not v2 format (IPC on disk)).

pyarrow had problems reading this file in Feather v2 format in the past (it could write it) due to Flatbuffer verification problems, as it could only handle 1,000,000 columns (500,000 real data columns):
https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10056

Is there a way to get all column names from an IPC (Feather v2) file without reading the whole Feather file completely?

In pyarrow this is possible with the dataset API (at least for Feather v2 files):

import pyarrow.dataset as ds

feather_v2_dataset = ds.dataset(feather_file, format="feather")
column_names = feather_v2_dataset.schema.names

https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10344

@ghuls
Collaborator Author

ghuls commented Jul 26, 2021

@jorgecarleitao
I think we hit the same bug that (py)arrow had (the hardcoded Flatbuffers limit of 1 million tables).

In (py)arrow, it was solved with this commit: apache/arrow#9447
(max_tables was increased, but the maximum number of tables is now derived from the size of the footer, to prevent denial of service by crafted IPC files).

In [33]: %time df2 = pl.read_ipc('test.v2.feather', use_pyarrow=False)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

/software/polars/py-polars/polars/io.py in read_ipc(file, use_pyarrow, storage_options)
    415             tbl = pa.feather.read_table(data)
    416             return pl.DataFrame.from_arrow(tbl)
--> 417         return pl.DataFrame.read_ipc(data)
    418 
    419 

/software/polars/py-polars/polars/eager/frame.py in read_ipc(file)
    606         """
    607         self = DataFrame.__new__(DataFrame)
--> 608         self._df = PyDataFrame.read_ipc(file)
    609         return self
    610 

RuntimeError: Any(ArrowError(Ipc("Unable to get root as footer: TooManyTables")))

@ghuls
Collaborator Author

ghuls commented Jul 26, 2021

@jorgecarleitao
I think if you replace gen::File::root_as_footer with gen::File::root_as_footer_with_opts, you can set the max_tables option, at https://github.com/jorgecarleitao/arrow2/blob/16c089e0c4c2b079da10b6fb8fbec19cdabaeab3/src/io/ipc/read/reader.rs#L98

https://docs.rs/flatbuffers/2.0.0/src/flatbuffers/get_root.rs.html#39-49

@ritchie46
Member

I will close this because the issue was resolved, not to stop the discussion. Please carry on. :)

@ghuls
Collaborator Author

ghuls commented Jul 30, 2021

@jorgecarleitao I implemented it in the same way as the arrow C++ implementation solved it: jorgecarleitao/arrow2#240

In [1]: import polars as pl

# Feather v1 database "loaded" with pyarrow reader (memory mapped):
In [2]: %time df = pl.read_ipc('test.v1.feather', use_pyarrow=True)
CPU times: user 30.5 s, sys: 17 s, total: 47.5 s
Wall time: 5min 26s

# Feather v2 database (with lz4 compression) loaded with arrow2 reader.
In [3]: %time df2 = pl.read_ipc('test.v2.feather', use_pyarrow=False)
CPU times: user 1min 25s, sys: 3min 41s, total: 5min 7s
Wall time: 26min 20s

In [4]: df2.shape
Out[4]: (24453, 2208374)

@ritchie46 For reading Feather v2 files with compression polars/polars-io/Cargo.toml should enable features = ["io_ipc_compression"] for the arrow2 dependency.
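A sketch of what that Cargo.toml change might look like (the exact dependency version and path used by polars may differ):

```toml
# polars/polars-io/Cargo.toml (sketch; actual version/source may differ)
[dependencies.arrow2]
version = "*"
features = ["io_ipc_compression"]
```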

@ghuls
Collaborator Author

ghuls commented Aug 4, 2021

@ritchie46 My patch for reading Feather v2 files with arrow2 was merged. Can you add features = ["io_ipc_compression"] to polars/polars-io/Cargo.toml when you update the arrow2 branch of polars, and use this commit of arrow2 (or newer): jorgecarleitao/arrow2@8e146a7?

@ritchie46
Member

@ritchie46 My patch for reading Feather v2 files with arrow2 was merged. Can you add features = ["io_ipc_compression"] to polars/polars-io/Cargo.toml when you update the arrow2 branch of polars, and use this commit of arrow2 (or newer): jorgecarleitao/arrow2@8e146a7?

Nice! Could you make a PR for that?

@ghuls
Collaborator Author

ghuls commented Aug 4, 2021

Yes. PR in preparation.

@ghuls
Collaborator Author

ghuls commented Aug 5, 2021

See: #1096 (failing tests due to a pyarrow==4.0/5.0 problem in the arrow-cherry-pick branch).
