
Performance degradation caused by presence of optional numpy dependency #6052

Closed
jakob-keller opened this issue Jan 4, 2023 · 10 comments

@jakob-keller
Contributor

First of all, thanks for making polars available! It turned out to be a blazing fast alternative to pandas when doing pre-processing for real-time ML inference. True game changer!

We run an internal benchmark to flag possible performance regressions when testing new polars releases. Starting with 0.15.9, I noticed a minor performance regression affecting DataFrame initialization from data: dict[str, tuple[float | str | None, ...]]. After some digging, I believe it is caused by #5918, which dramatically increases the number of calls to _NUMPY_TYPE().

I was surprised by the performance impact of _NUMPY_TYPE():

  • without numpy being installed, the impact is minimal. Great!
  • with numpy merely installed, the check becomes ~27x slower - regardless of whether numpy is actually used. Not so great!
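For context, here is a stdlib-only sketch of the kind of lazy check involved (the real implementation lives in polars.dependencies; the function below is illustrative, not polars code):

```python
import sys

def lazy_isinstance(obj, module_name: str, attr: str) -> bool:
    # Check isinstance against module_name.attr, but only when the module
    # has already been imported; never trigger an import from here.
    mod = sys.modules.get(module_name)
    if mod is None:
        # module absent (or not yet imported): cheap early exit
        return False
    return isinstance(obj, getattr(mod, attr))
```

When numpy is importable, every call pays for the sys.modules lookup, the attribute access, and the isinstance check; run once per value during frame init, that overhead adds up.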

There is a trivial hack that speeds up DataFrame initialization and our benchmark by ~10%, well offsetting the initially discovered regression:

import polars.dependencies
polars.dependencies._NUMPY_AVAILABLE = False

Would you consider supporting a corresponding fast path for users that are also working with polars and numpy, just not in combination?

A few potential approaches come to mind:

  1. Refactor polars.dependencies._NUMPY_AVAILABLE and make it available as part of the API - for users to overwrite manually when required. Something like polars.NO_NUMPY = True. -> Still feels hacky
  2. Make polars.dependencies._NUMPY_AVAILABLE user configurable, e.g. by supporting optional environment variables or config files that may be used to trigger an override of polars.dependencies._NUMPY_AVAILABLE. -> I would be happy to use an environment variable, such as PY_POLARS_NO_NUMPY=1
  3. Understand and accept performance penalty of polars.dependencies._NUMPY_AVAILABLE and polars.dependencies._NUMPY_TYPE() and potentially reduce reliance on those throughout the code-base. -> high effort
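Option 2 could be as small as an import-time environment check. A sketch, assuming the variable name proposed above (PY_POLARS_NO_NUMPY is not an existing polars setting):

```python
import os

def numpy_enabled(env=None) -> bool:
    # Honor an opt-out flag: PY_POLARS_NO_NUMPY=1 disables the numpy
    # integration. The env argument exists only to ease testing.
    env = os.environ if env is None else env
    return env.get("PY_POLARS_NO_NUMPY") != "1"

# hypothetical use inside polars.dependencies at import time:
# _NUMPY_AVAILABLE = _NUMPY_AVAILABLE and numpy_enabled()
```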

BTW: The same issue applies to similar constants, such as polars.dependencies._PANDAS_AVAILABLE, which might affect other use cases, too.

@jakob-keller jakob-keller changed the title Address performance degradations due to presence of optional dependencies Performance degradation caused by presence of optional numpy dependency Jan 4, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

I can take a look at this; do you have a suggested/straightforward example showing something particularly awful? Or literally just init from a column of basic sequence types? I'll start experimenting, but if there's something especially egregious then let me know.

I've already got some ideas for speedups & better logic, so I'm quite sure we can address this without any hackery being required (I also believe this is the only place where a tight loop of such checks exists; everywhere else these lazy-typecheck functions are only called once or twice per method, so they have no measurable impact).
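One generic way to pull the repeated resolution out of a hot loop (a sketch only, not necessarily what the eventual fix does) is to cache the resolved class after the first successful lookup:

```python
import sys

_type_cache: dict = {}

def cached_isinstance(obj, module_name: str, attr: str) -> bool:
    # Resolve module_name.attr once, then reuse the class on later calls.
    key = (module_name, attr)
    cls = _type_cache.get(key)
    if cls is None:
        mod = sys.modules.get(module_name)
        if mod is None:
            # don't cache the miss: the module may be imported later
            return False
        cls = getattr(mod, attr)
        _type_cache[key] = cls
    return isinstance(obj, cls)
```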

@jakob-keller
Contributor Author

jakob-keller commented Jan 5, 2023

do you have a suggested/straightforward example showing something particularly awful?

Here is one trivial way to reproduce the ~10% performance penalty:

Base case:
python3 -m timeit -s "import polars.dependencies; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.22 msec per loop

Performance "hack":
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.1 msec per loop

By the way, the pandas equivalent is roughly 8x slower: 🐌
python3 -m timeit -s "import pandas; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "pandas.DataFrame(data)"
20 loops, best of 5: 9.67 msec per loop

Mac mini (M1, 2020), macOS Ventura 13.1, CPython 3.10.9, polars 0.15.11

@ritchie46
Member

Maybe it is also an idea to add specialized constructors to the public interface that run almost no Python bytecode?

This might be useful for users that need to create many very small DataFrames.
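As a pure-Python illustration of the idea (a stand-in, not the polars API): a constructor that trusts a caller-supplied schema can skip per-value type inference entirely.

```python
def infer_dtype(values):
    # naive general path: scan every value to guess a common type
    types = {type(v) for v in values if v is not None}
    return types.pop() if len(types) == 1 else object

def build_columns(data, schema=None):
    # fast path: an explicit schema makes the scan above unnecessary
    if schema is not None:
        return {name: (schema[name], vals) for name, vals in data.items()}
    return {name: (infer_dtype(vals), vals) for name, vals in data.items()}
```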

@jakob-keller
Contributor Author

On a related note: I would have expected that specifying column data types via the columns argument would speed up DataFrame initialization, since type inference could be avoided. In reality, it adds a significant penalty, ~10-15% in trivial cases.

This should probably be a separate issue, but it might influence potential work on improvements such as specialized constructors.

Let me know and I would be happy to elaborate and provide an example in a separate issue.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

I've got a small change almost ready to go that makes the _NUMPY_TYPE check ~8x faster (running in an isolated loop), but ideally I'm going to make it faster still. Then I'll think about some more structural improvements to our highly-overloaded frame-init, and also take a look at the apparent schema/type-inference slowdown (@jakob-keller; got a particular example of that one too?)

@ritchie46: a dedicated low-overhead init path for basic/key data types sounds great :)

@jakob-keller
Contributor Author

jakob-keller commented Jan 5, 2023

Wow, you guys are as fast as polars itself ;-)

(@jakob-keller; got a particular example of that one too?)

Sure thing:

Without columns argument:
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data)"
1000 loops, best of 5: 325 usec per loop

With columns argument:
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data, columns)"
1000 loops, best of 5: 364 usec per loop

I believe the issue might be related to the fact that the columns argument always requires column names to be provided. Maybe this triggers renaming of any Series produced (i.e. rename()), even if the name is already correctly set.

Please note that the results reverse without polars.dependencies._NUMPY_AVAILABLE = False. That setting appears to mask the underlying issue, if any, so I decided against opening a separate issue for now.
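The suspected rename overhead can be mimicked with a toy stand-in (none of this is polars code; it only illustrates the unconditional-rename pattern):

```python
class ToySeries:
    # minimal stand-in for a named column
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def rename(self, name):
        # returns a fresh object even when the name is unchanged
        return ToySeries(name, self.values)

def apply_names_always(series, names):
    # unconditional rename: pays for a copy even when names already match
    return [s.rename(n) for s, n in zip(series, names)]

def apply_names_lazy(series, names):
    # skip the copy when the name already matches
    return [s if s.name == n else s.rename(n) for s, n in zip(series, names)]
```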

@jakob-keller
Contributor Author

Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

Confirmed; the linked PR fixes the regression (which was worse than I realised) and may actually make things ~5% faster than before:

# "Timer" is a simple elapsed-time context manager; "pl" is polars
with Timer():
    for i in range(1000):
        _df = pl.DataFrame(
            {f"column_{c}": tuple(range(1000)) for c in range(100)}
        )

0.15.8: 2.4187 secs
0.15.9: 4.5614 secs << regression
0.15.13 (post-PR): 2.2779 secs

Now to take it further...

@alexander-beedie
Collaborator

alexander-beedie commented Jan 5, 2023

Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.

No problem; thanks very much for the clean/detailed repro.

@jakob-keller
Contributor Author

I can confirm that polars 0.15.13 addresses the regression as expected. Big thanks to everyone involved!
