
Performance degradation caused by presence of optional numpy dependency #6052

Closed
jakob-keller opened this issue Jan 4, 2023 · 10 comments

@jakob-keller
Contributor

First of all, thanks for making polars available! It turned out to be a blazing fast alternative to pandas when doing pre-processing for real-time ML inference. True game changer!

We run an internal benchmark to flag possible performance regressions when testing new polars releases. Starting with 0.15.9, I noticed a minor performance regression affecting DataFrame initialization from data: dict[str, tuple[float | str | None, ...]]. After some digging, I believe it is caused by #5918, which dramatically increases the number of calls to _NUMPY_TYPE().

I was surprised by the performance impact of _NUMPY_TYPE():

  • without numpy being installed, the impact is minimal. Great!
  • with numpy merely installed, the check becomes ~27x slower - regardless of whether numpy is actually used. Not so great!
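For context, here is a stdlib-only sketch of the kind of lazy check involved (the real implementation lives in polars.dependencies; the function below is illustrative, not polars code):

```python
import sys

def lazy_isinstance(obj, module_name: str, attr: str) -> bool:
    # Check isinstance against module_name.attr, but only when the module
    # has already been imported; never trigger an import from here.
    mod = sys.modules.get(module_name)
    if mod is None:
        # module absent (or not yet imported): cheap early exit
        return False
    return isinstance(obj, getattr(mod, attr))
```

When numpy is importable, every call pays for the sys.modules lookup, the attribute access, and the isinstance check; run once per value during frame init, that overhead adds up.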

There is a trivial hack that speeds up DataFrame initialization and our benchmark by ~10%, well offsetting the initially discovered regression:

import polars.dependencies
polars.dependencies._NUMPY_AVAILABLE = False

Would you consider supporting a corresponding fast path for users that are also working with polars and numpy, just not in combination?

A few potential approaches come to mind:

  1. Refactor polars.dependencies._NUMPY_AVAILABLE and make it available as part of the API - for users to overwrite manually when required. Something like polars.NO_NUMPY = True. -> Still feels hacky
  2. Make polars.dependencies._NUMPY_AVAILABLE user configurable, e.g. by supporting optional environment variables or config files that may be used to trigger an override of polars.dependencies._NUMPY_AVAILABLE. -> I would be happy to use an environment variable, such as PY_POLARS_NO_NUMPY=1
  3. Understand and accept performance penalty of polars.dependencies._NUMPY_AVAILABLE and polars.dependencies._NUMPY_TYPE() and potentially reduce reliance on those throughout the code-base. -> high effort
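Option 2 could be as small as an import-time environment check. A sketch, assuming the variable name proposed above (PY_POLARS_NO_NUMPY is not an existing polars setting):

```python
import os

def numpy_enabled(env=None) -> bool:
    # Honor an opt-out flag: PY_POLARS_NO_NUMPY=1 disables the numpy
    # integration. The env argument exists only to ease testing.
    env = os.environ if env is None else env
    return env.get("PY_POLARS_NO_NUMPY") != "1"

# hypothetical use inside polars.dependencies at import time:
# _NUMPY_AVAILABLE = _NUMPY_AVAILABLE and numpy_enabled()
```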

BTW: The same issue applies to similar constants, such as polars.dependencies._PANDAS_AVAILABLE, which might affect other use cases, too.

@jakob-keller jakob-keller changed the title Address performance degradations due to presence of optional dependencies Performance degradation caused by presence of optional numpy dependency Jan 4, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

I can take a look at this; do you have a suggested/straightforward example showing something particularly awful? Or literally just init from a column of basic sequence types? I'll start experimenting, but if there's something especially egregious then let me know.

I've already got some ideas for speedups & better logic, so I'm quite sure we can address this without any hackery being required (I also believe this is the only place where a tight loop of such checks exists; everywhere else these lazy-typecheck functions are only called once or twice per method, so they have no measurable impact).
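One generic way to pull the repeated resolution out of a hot loop (a sketch only, not necessarily what the eventual fix does) is to cache the resolved class after the first successful lookup:

```python
import sys

_type_cache: dict = {}

def cached_isinstance(obj, module_name: str, attr: str) -> bool:
    # Resolve module_name.attr once, then reuse the class on later calls.
    key = (module_name, attr)
    cls = _type_cache.get(key)
    if cls is None:
        mod = sys.modules.get(module_name)
        if mod is None:
            # don't cache the miss: the module may be imported later
            return False
        cls = getattr(mod, attr)
        _type_cache[key] = cls
    return isinstance(obj, cls)
```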

@jakob-keller
Contributor Author

jakob-keller commented Jan 5, 2023

do you have a suggested/straightforward example showing something particularly awful?

Here is one trivial way to reproduce the ~10% performance penalty:

Base case:
python3 -m timeit -s "import polars.dependencies; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.22 msec per loop

Performance "hack":
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.1 msec per loop

By the way, the pandas equivalent is roughly 8x slower: 🐌
python3 -m timeit -s "import pandas; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "pandas.DataFrame(data)"
20 loops, best of 5: 9.67 msec per loop

Mac mini (M1, 2020), macOS Ventura 13.1, CPython 3.10.9, polars 0.15.11

@ritchie46
Member

Maybe it is also an idea to add specialized constructors to the public interface that run almost no Python bytecode?

This might be useful for users that need to create many very small DataFrames.
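As a pure-Python illustration of the idea (a stand-in, not the polars API): a constructor that trusts a caller-supplied schema can skip per-value type inference entirely.

```python
def infer_dtype(values):
    # naive general path: scan every value to guess a common type
    types = {type(v) for v in values if v is not None}
    return types.pop() if len(types) == 1 else object

def build_columns(data, schema=None):
    # fast path: an explicit schema makes the scan above unnecessary
    if schema is not None:
        return {name: (schema[name], vals) for name, vals in data.items()}
    return {name: (infer_dtype(vals), vals) for name, vals in data.items()}
```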

@jakob-keller
Contributor Author

On a related note: I would have expected that specifying column data types via the columns argument would speed up DataFrame initialization, since type inference could be avoided. In reality, it adds a significant penalty, ~10-15% in trivial cases.

This should probably be a separate issue, but it might influence potential work on improvements such as specialized constructors.

Let me know and I would be happy to elaborate and provide an example in a separate issue.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

I've got a small change almost ready to go that makes the _NUMPY_TYPE check ~8x faster (running in an isolated loop), but ideally I'm going to make it faster still. Then I'll think about some more structural improvements to our highly-overloaded frame-init, and also take a look at the apparent schema/type-inference slowdown (@jakob-keller; got a particular example of that one too?)

@ritchie46: a dedicated low-overhead init path for basic/key data types sounds great :)

@jakob-keller
Contributor Author

jakob-keller commented Jan 5, 2023

Wow, you guys are as fast as polars itself ;-)

(@jakob-keller; got a particular example of that one too?)

Sure thing:

Without columns argument:
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data)"
1000 loops, best of 5: 325 usec per loop

With columns argument:
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data, columns)"
1000 loops, best of 5: 364 usec per loop

I believe the issue might be related to the fact that the columns argument always requires column names to be provided. Maybe this triggers renaming of any Series produced (i.e. rename()), even if the name is already correctly set.

Please note that the results reverse without polars.dependencies._NUMPY_AVAILABLE = False. That setting appears to mask the underlying issue, if any, so I decided against opening a separate issue for now.
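The suspected rename overhead can be mimicked with a toy stand-in (none of this is polars code; it only illustrates the unconditional-rename pattern):

```python
class ToySeries:
    # minimal stand-in for a named column
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def rename(self, name):
        # returns a fresh object even when the name is unchanged
        return ToySeries(name, self.values)

def apply_names_always(series, names):
    # unconditional rename: pays for a copy even when names already match
    return [s.rename(n) for s, n in zip(series, names)]

def apply_names_lazy(series, names):
    # skip the copy when the name already matches
    return [s if s.name == n else s.rename(n) for s, n in zip(series, names)]
```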

@jakob-keller
Contributor Author

Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

Confirmed; the linked PR fixes the regression (which was worse than I realised) and may actually make things ~5% faster than before:

# "Timer" is a simple elapsed-time context manager; "pl" is polars
with Timer():
    for i in range(1000):
        _df = pl.DataFrame(
            {f"column_{c}": tuple(range(1000)) for c in range(100)}
        )

0.15.8: 2.4187 secs
0.15.9: 4.5614 secs << regression
0.15.13 (post-PR): 2.2779 secs

Now to take it further...

@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.

No problem; thanks very much for the clean/detailed repro.

@jakob-keller
Contributor Author

I can confirm that polars 0.15.13 addresses the regression as expected. Big thanks to everyone involved!
