
Running queries on a dataframe with a lot of columns is exponentially slower for columns near the end of the dataframe. #2755

Open
ghuls opened this issue Feb 24, 2022 · 16 comments
Labels
performance Performance issues or improvements

Comments

ghuls commented Feb 24, 2022

What language are you using?

Python.

What version of polars are you using?

0.13.5

What operating system are you using polars on?

CentOS 7

What language version are you using

python 3.10.2.

Describe your bug.

Selecting columns, or running queries on them via expressions, gets dramatically slower for columns near the end of the dataframe when there are a lot of columns; the timings below are consistent with an O(n) name lookup per column expression.

What are the steps to reproduce the behavior?

# Load Feather file with 1.1M columns.
In [2]: %time df = pl.read_ipc("1110655_columns.zstd.v2.feather")
CPU times: user 2min 12s, sys: 1min 9s, total: 3min 21s
Wall time: 43 s

# Get all column names except "motifs" (every other column has a floating point dtype).
In [5]: region_names = [column_name for column_name in df.columns if column_name != "motifs"]

In [6]: len(region_names)
Out[6]: 1110655

# Run query on first hundred columns.
In [9]: %time df_sum = df.select([(pl.col(reg) > 0.0).sum().alias(reg + "_count") for reg in region_names[0:100]])
CPU times: user 124 ms, sys: 66.5 ms, total: 191 ms
Wall time: 176 ms

# Run query on last hundred columns.
In [10]: %time df_sum = df.select([(pl.col(reg) > 0.0).sum().alias(reg + "_count") for reg in region_names[-100:]])
CPU times: user 5.41 s, sys: 42.8 ms, total: 5.45 s
Wall time: 3.31 s

# Run query on first thousand columns.
In [11]: %time df_sum = df.select([(pl.col(reg) > 0.0).sum().alias(reg + "_count") for reg in region_names[:1000]])
CPU times: user 239 ms, sys: 242 ms, total: 482 ms
Wall time: 394 ms

# Run query on last thousand columns.
In [12]: %time df_sum = df.select([(pl.col(reg) > 0.0).sum().alias(reg + "_count") for reg in region_names[-1000:]])
CPU times: user 58.6 s, sys: 227 ms, total: 58.9 s
Wall time: 36.8 s

# Just selecting first thousand columns.
In [16]: %time df_select = df.select([pl.col(reg) for reg in region_names[:1000]])
CPU times: user 252 ms, sys: 103 ms, total: 355 ms
Wall time: 354 ms

# Just selecting last thousand columns.
In [17]: %time df_select = df.select([pl.col(reg) for reg in region_names[-1000:]])
CPU times: user 6.97 s, sys: 136 ms, total: 7.1 s
Wall time: 7.1 s

# Just selecting first thousand columns via indexing.
In [23]: %time df_select_via_indexing = df[region_names[:1000]]
CPU times: user 109 ms, sys: 461 µs, total: 109 ms
Wall time: 108 ms

# Just selecting last thousand columns via indexing.
In [24]: %time df_select_via_indexing = df[region_names[-1000:]]
CPU times: user 99.7 ms, sys: 412 µs, total: 100 ms
Wall time: 99.6 ms
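
For anyone without the original Feather file, a smaller synthetic dataframe shows the same pattern (a sketch: the sizes and column names here are made up, not taken from the original file):

import numpy as np
import polars as pl

# Synthetic stand-in for the 1.1M-column file: many float columns.
n_rows, n_cols = 1_000, 50_000
df = pl.DataFrame({f"region_{i}": np.random.rand(n_rows) for i in range(n_cols)})
region_names = df.columns

# Same query on the first vs. the last hundred columns; with an O(n)
# name lookup per expression, the second select is much slower.
df_first = df.select([(pl.col(c) > 0.0).sum().alias(c + "_count") for c in region_names[:100]])
df_last = df.select([(pl.col(c) > 0.0).sum().alias(c + "_count") for c in region_names[-100:]])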

What is the solution?

The location of the columns in the dataframe shouldn't matter.

When getting columns via expressions, the same check as in #1028 should be implemented, so that when there are a lot of expressions and a lot of columns, a hashmap is used to look up the correct columns.

As seen here, the speed difference is big (both produce the same dataframe):

# Just selecting last thousand columns.
In [17]: %time df_select = df.select([pl.col(reg) for reg in region_names[-1000:]])
CPU times: user 6.97 s, sys: 136 ms, total: 7.1 s
Wall time: 7.1 s

# Just selecting last thousand columns via indexing (using hashmap for column name lookup)
In [24]: %time df_select_via_indexing = df[region_names[-1000:]]
CPU times: user 99.7 ms, sys: 412 µs, total: 100 ms
Wall time: 99.6 ms
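
For intuition, here is a minimal plain-Python sketch of the two lookup strategies (illustrative only, not polars internals):

columns = [f"col_{i}" for i in range(1_110_655)]

# O(n) per lookup: scan the column list for every requested name, so
# columns near the end cost the most.
def find_linear(name: str) -> int:
    return columns.index(name)

# O(1) amortized per lookup: build a name -> index hashmap once; every
# lookup then costs the same regardless of the column's position.
name_to_index = {name: i for i, name in enumerate(columns)}

def find_hashed(name: str) -> int:
    return name_to_index[name]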
ghuls added the performance label Feb 24, 2022
ritchie46 commented:

I think we must refactor Schema to be backed by an insertion-ordered hashmap like https://github.com/bluss/indexmap instead of a Vec.
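
(For intuition: an ordered hashmap gives O(1) lookup by name while still preserving column order. A plain Python dict, insertion-ordered since Python 3.7, has the same two properties; a minimal sketch, not polars code:)

# A Python dict behaves like indexmap here: O(1) lookup by name, while
# iteration still yields the columns in insertion order.
schema = {"motifs": "str"}
schema.update({f"region_{i}": "f64" for i in range(5)})

assert list(schema)[0] == "motifs"   # column order preserved
assert schema["region_3"] == "f64"   # constant-time lookup by name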

ritchie46 commented:

See: #2788

ghuls commented Feb 27, 2022

The speed is already way better than before, but it is still way slower than df[list_of_columns].
Most of the time seems to be spent in system time.

In [9]: %time df_select = dfs.select([pl.col(reg) for reg in region_columns[-1000:]])
CPU times: user 373 ms, sys: 181 ms, total: 554 ms
Wall time: 630 ms

In [10]: %time df_select = dfs.select([pl.col(reg) for reg in region_columns[-10000:]])
CPU times: user 402 ms, sys: 1.12 s, total: 1.52 s
Wall time: 1.52 s

In [12]: %time df_select = dfs.select([pl.col(reg) for reg in region_columns[-100000:]])
CPU times: user 1 s, sys: 10.6 s, total: 11.6 s
Wall time: 11.6 s

In [13]: %time df_select = dfs.select([pl.col(reg) for reg in region_columns[:100000]])
CPU times: user 1.02 s, sys: 10.6 s, total: 11.6 s
Wall time: 11.6 s

In [15]: %time df_select = dfs.select([pl.col(reg) for reg in region_columns[:10000]])
CPU times: user 494 ms, sys: 1.15 s, total: 1.64 s
Wall time: 1.64 s

In [16]: %time df_select = dfs[region_columns[-100000:]]
CPU times: user 130 ms, sys: 22.9 ms, total: 153 ms
Wall time: 164 ms

In [17]: %time df_select = dfs.select([pl.col(reg) for reg in region_columns])
CPU times: user 7.6 s, sys: 1min 58s, total: 2min 5s
Wall time: 2min 6s

In [19]: %time df_select = dfs[region_columns[:]]
CPU times: user 413 ms, sys: 141 ms, total: 554 ms
Wall time: 551 ms


In [18]: dfs.shape
Out[18]: (14869, 1110656)

ritchie46 commented:

Thanks, I have found another place where we still do an O(n) search. I will follow up on this.

ghuls commented Feb 28, 2022

It also seems that df.schema got super slow after the change to the indexmap-backed schema (it does not even finish after several minutes).

ritchie46 commented Feb 28, 2022

Did you call that on a DataFrame or on a LazyFrame?

For a DataFrame I think it was already slow. It is quite naive Python code:

    @property
    def schema(self) -> Dict[str, Type[DataType]]:
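        # self[c] below is a by-name lookup, which at this point was a linear scan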
        return {c: self[c].dtype for c in self.columns}

And now that I look at it, the self[c] lookup is a linear scan.

So we can improve that as well.
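
One possible shape of that improvement, sketched as a standalone function (an illustration, not the actual patch; it assumes get_columns() returns the Series in column order):

from typing import Dict, Type

import polars as pl
from polars.datatypes import DataType

def fast_schema(df: pl.DataFrame) -> Dict[str, Type[DataType]]:
    # Single pass over the Series in column order: no per-name lookup,
    # so the cost is O(n) total instead of O(n^2).
    return {s.name: s.dtype for s in df.get_columns()}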

ghuls commented Feb 28, 2022

I called it on a DataFrame.

ghuls commented Mar 1, 2022

#2795 does not seem to make a big difference for the simple select query.

In [11]: %time df2 = df.select([pl.col(x) for x in df.columns[:1000]])
CPU times: user 541 ms, sys: 175 ms, total: 716 ms
Wall time: 715 ms

In [12]: %time df2 = df.select([pl.col(x) for x in df.columns[:10000]])
CPU times: user 622 ms, sys: 1.22 s, total: 1.84 s
Wall time: 1.84 s

In [13]: %time df2 = df.select([pl.col(x) for x in df.columns[:100000]])
CPU times: user 1.03 s, sys: 11.4 s, total: 12.4 s
Wall time: 12.4 s

In [14]: %time df2 = df.select([pl.col(x) for x in df.columns[-100000:]])
CPU times: user 1.04 s, sys: 11.3 s, total: 12.4 s
Wall time: 12.4 s

In [8]: %time df2 = df.select([pl.col(x) for x in df.columns])
CPU times: user 7.25 s, sys: 2min 4s, total: 2min 11s
Wall time: 2min 12s

ritchie46 commented:

That is strange. Then it is time for a flamegraph I think. :/

ghuls commented Mar 1, 2022

I am compiling polars now with the best optimisations to see if it makes a difference.

ritchie46 commented:

target-cpu=native plus a release build should be enough.

ritchie46 commented:

Could you do a cargo flamegraph though?

ghuls commented Mar 1, 2022

With maturin develop --rustc-extra-args="-C target-cpu=native" --release the speed is a bit better.

In [4]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns[-10000:]]).collect()
CPU times: user 814 ms, sys: 261 ms, total: 1.08 s
Wall time: 1.03 s

In [5]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns[-100000:]]).collect()
CPU times: user 1.51 s, sys: 970 ms, total: 2.48 s
Wall time: 2.29 s

In [6]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns]).collect()
CPU times: user 12.5 s, sys: 12 s, total: 24.5 s
Wall time: 21.9 s

ghuls commented Mar 1, 2022

Maybe the remaining problem is related to #1023, which was partially fixed by no longer running simple projections on the threadpool: 4a2c0b6

ritchie46 commented Mar 1, 2022

Yes, that will always be faster; we do a lot less work there. I think we must now ensure that querying columns at the end of the DataFrame is as fast as querying columns at the start.

ghuls commented Mar 1, 2022

With the latest code, querying the first X columns and the last X columns now takes the same amount of time.
How do I use cargo flamegraph with maturin?
