Performance degradation caused by presence of optional numpy dependency #6052
Comments
I can take a look at this; do you have a suggested/straightforward example showing something particularly awful? Or literally just init from a column of basic sequence types? I'll start experimenting, but if there's something especially egregious then let me know. I've already got some ideas for speedups & better logic, so I'm quite sure we can address this without any hackery being required (I also believe this is the only place where a tight loop of such checks exists; everywhere else these lazy-typecheck functions are only called once or twice per method, so they have no impact). |
Here is one trivial way to reproduce the ~10% performance penalty: Base case: Performance "hack": By the way, pandas equivalent is 10x slower: 🐌 Mac mini (M1, 2020), macOS Ventura 13.1, CPython 3.10.9, polars 0.15.11 |
Maybe it is also an idea to add specialized constructors to the public interface that run almost no Python bytecode? This might be useful for users that need to create many very small DataFrames. |
On a related note: I would have expected that specifying column data types via the This should be a separate issue, but might influence potential work on improvements such as specialized constructors. Let me know and I would be happy to elaborate and provide an example in a separate issue. |
I've got a small change almost ready to go that makes the @ritchie46: a dedicated low-overhead init path for basic/key data types sounds great :) |
Wow, you guys are as fast as polars itself ;-)
Sure thing: Without With I believe the issue might be related to the fact that the columns argument always requires column names to be provided. Maybe this triggers renaming of any Series produced (i.e. rename()), even if the name is already correctly set. Please note that results reverse without |
Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience. |
Confirmed; the linked PR fixes the regression (which was worse than I realised) and may actually make things ~5% faster than before:
with Timer():
    for i in range(1000):
        _df = pl.DataFrame(
            {f"column_{c}": tuple(range(1000)) for c in range(100)}
        )
Now to take it further... |
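The Timer helper used in the snippet above is not shown in the thread. A minimal sketch of what such a context manager might look like follows, assuming it simply measures wall-clock time; the class body and the elapsed attribute are hypothetical, and the workload is a stdlib placeholder rather than the polars call.

```python
import time


class Timer:
    """Hypothetical stand-in for the Timer helper; not the commenter's code."""

    def __enter__(self):
        # perf_counter is a monotonic, high-resolution clock suited to timing.
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self.start
        print(f"elapsed: {self.elapsed:.4f}s")
        return False  # do not swallow exceptions raised inside the block


# Placeholder workload; swap in the DataFrame construction loop to benchmark it.
with Timer() as t:
    total = sum(range(100_000))
```

For micro-benchmarks like this one, repeating the measurement several times and taking the minimum tends to give more stable numbers than a single run.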
No problem; thanks very much for the clean/detailed repro. |
I can confirm that polars 0.15.13 addresses the regression as expected. Big thanks to everyone involved! |
First of all, thanks for making polars available! It turned out to be a blazing fast alternative to pandas when doing pre-processing for real-time ML inference. True game changer!
We run an internal benchmark to flag possible performance regressions as part of testing new polars releases. Starting with 0.15.9, I noticed a minor performance regression that affects DataFrame initialization from data: dict[str, tuple[float | str | None, ...]]. After some digging, I believe this is caused by #5918, which dramatically increases calls to _NUMPY_TYPE().
I was surprised by the performance impact of _NUMPY_TYPE():
- Without numpy being installed, the impact is minimal. Great!
- With numpy being merely installed, the impact grows by a factor of 27x, regardless of whether numpy is actually used. Not so great!
There is a trivial hack that speeds up DataFrame initialization and our benchmark by ~10%, well offsetting the initially discovered regression:
import polars.dependencies
polars.dependencies._NUMPY_AVAILABLE = False
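To illustrate why flipping such a flag can matter in a tight loop, here is a rough sketch of a lazy type check guarded by an availability flag. It is a hypothetical stand-in, not polars' actual _NUMPY_TYPE() implementation: the names _DEP_AVAILABLE and looks_like_dep_type are invented for this example. When the flag is False the check returns immediately; when it is True, every call walks the object's class hierarchy.

```python
from typing import Any

# Hypothetical module-level flag, normally set once at import time.
_DEP_AVAILABLE = True


def looks_like_dep_type(obj: Any, prefix: str = "numpy.") -> bool:
    """Return True if any class in obj's MRO is defined in the given package."""
    # Short-circuit: with the flag off, the MRO is never inspected.
    if not _DEP_AVAILABLE:
        return False
    return any(
        (cls.__module__ + ".").startswith(prefix) for cls in type(obj).__mro__
    )


# Calling the check once per value is where the per-call cost adds up.
values = (1.0, "a", None) * 1000
hits = sum(looks_like_dep_type(v) for v in values)
```

With plain Python values the check never matches, so all of the work done per element is pure overhead in this scenario.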
Would you consider supporting a corresponding fast path for users that are also working with polars and numpy, just not in combination?
A few potential approaches come to mind:
- polars.dependencies._NUMPY_AVAILABLE and make it available as part of the API, for users to overwrite manually when required. Something like polars.NO_NUMPY = True. -> Still feels hacky
- Make polars.dependencies._NUMPY_AVAILABLE user configurable, e.g. by supporting optional environment variables or config files that may be used to trigger an override of polars.dependencies._NUMPY_AVAILABLE. -> I would be happy to use an environment variable, such as PY_POLARS_NO_NUMPY=1
- polars.dependencies._NUMPY_AVAILABLE and polars.dependencies._NUMPY_TYPE() and potentially reduce reliance on those throughout the code-base. -> high effort
BTW: The same issue applies to similar constants, such as polars.dependencies._PANDAS_AVAILABLE, which might affect other use cases, too.
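The environment-variable approach could be sketched roughly as follows. This is only an illustration of the idea: PY_POLARS_NO_NUMPY is the name proposed in this issue, not an existing polars setting, and dependency_available is a hypothetical helper.

```python
import importlib.util
import os


def dependency_available(module_name: str, env_var: str) -> bool:
    """Report whether an optional module should be treated as available."""
    # An explicit opt-out wins, even if the module is installed.
    if os.environ.get(env_var, "").strip().lower() in {"1", "true"}:
        return False
    # find_spec locates the module without importing it.
    return importlib.util.find_spec(module_name) is not None


# Evaluated once at import time, so later checks only read a constant.
NUMPY_AVAILABLE = dependency_available("numpy", "PY_POLARS_NO_NUMPY")
```

Running a script with PY_POLARS_NO_NUMPY=1 set would then make the flag False regardless of whether numpy is installed, which avoids patching a private attribute from user code.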