Series head / tail is about 60+ times slower than DataFrame ones
#12928
Comments
Something strange is definitely going on. Creating a dataframe from the series, doing the operation, and getting the series back is still 4 to 5 times faster than doing the operation on the series directly. I've updated the opening example.
This is unrelated to
def head(self, n: int = 10) -> Series:
    ...
    return self.to_frame().select(F.col(self.name).head(n)).to_series()
It's the
go("df.head(0)")
go("df.head(100)")
go("df.select(pl.col('x'))") # < -- this one takes a long time too
go("plain_series.to_frame().select(pl.col('x')).to_series()")
I don't think this overhead has been considered substantial until now, but I definitely see that it is when the same call is repeated many times. Did you just notice this when stress-testing the Series implementation, or did this come up in an actual use case? I'm unsure whether it's worth tackling, as there are a lot of Series operations that use this pattern. @ritchie46 what do you think?
Edit: looks like we just need to expose the method on the Rust side. Incoming PR.
Output after fix:
Thanks for the reply and the PR.
My actual use case was totally unrelated: I wanted to extract the numpy types from some column of a dataframe (to interface with existing code that saves data as numpy arrays, which I did not want to change), and wanted to check if doing
BTW, is there a more intelligent way of doing that?
Up to you, but even while keeping this pattern (to df -> op -> to series) it may make sense to change the implementation to be more optimized. An equivalent (on a different PC)
go("plain_series.head(0)")
go("plain_series.head(100)")
go("plain_series.tail(0)")
go("plain_series.tail(100)")
go("plain_series.to_frame().head(0).to_series()")
go("plain_series.to_frame().head(100).to_series()")
go("plain_series.to_frame().tail(0).to_series()")
go("plain_series.to_frame().tail(100).to_series()")
Just doing the above instead of a useless select (since there is always only a single column) may be sufficient without exposing new methods from Rust. (Note that I don't know how much effort it is to expose them.)
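The definition of the `go` helper used for the timings in this thread did not survive in the text. A minimal sketch of how such a comparison could be run is below. Both the `go` signature and the `compare_head_paths` function are hypothetical, built on the standard library's `timeit`; only the timed statements are taken from the thread.

```python
import timeit


def go(stmt: str, namespace: dict, number: int = 1_000) -> float:
    """Time `stmt` and return the mean cost per call in microseconds.

    Hypothetical stand-in for the `go` helper used in this thread.
    """
    total = timeit.timeit(stmt, globals=namespace, number=number)
    per_call_us = total / number * 1e6
    print(f"{stmt:<55} {per_call_us:9.2f} us")
    return per_call_us


def compare_head_paths(number: int = 1_000) -> dict[str, float]:
    """Compare Series.head with the to_frame -> head -> to_series round trip."""
    # Imported lazily so that `go` itself has no dependency on Polars.
    import polars as pl

    namespace = {"plain_series": pl.Series("x", list(range(1_000)))}
    statements = [
        "plain_series.head(100)",
        "plain_series.to_frame().head(100).to_series()",
    ]
    return {stmt: go(stmt, namespace, number) for stmt in statements}
```

With Polars installed, calling `compare_head_paths()` prints one line per statement. The absolute numbers depend on the machine and the Polars version, so only the ratio between the two lines is meaningful.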
@CaselIT that's true in this particular case, because
Regarding extracting numpy arrays, there is a
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}, schema={"a": pl.UInt8, "b": pl.Int32})
np_dtypes = [(c, d[0]) for c,d in df.filter(False).to_numpy(structured=True).dtype.fields.items()]
Well, of course, for the cases where a comparable operation is available. Thanks for the short turnaround.
Thanks, will give it a try. Thanks!
Alternatively, you could create a dict mapping of dtypes:
import numpy as np

mapping = {
    pl.UInt8: np.dtype("uint8"),
    pl.Int32: np.dtype("int32"),
}
[(c, mapping[d]) for c, d in df.schema.items()]
or if you're using simpler dtypes:
[(c, np.dtype(str(d).lower())) for c, d in df.schema.items()]
Sure, but since Polars already knows how to convert to numpy, I wanted to piggyback on it :)
Looking a bit, there are 3 other candidates:
diff --git a/py-polars/polars/series/series.py b/py-polars/polars/series/series.py
index e8ab8b396..0ea69fb58 100644
--- a/py-polars/polars/series/series.py
+++ b/py-polars/polars/series/series.py
@@ -1687,7 +1687,7 @@ class Series:
def product(self) -> int | float:
"""Reduce this Series to the product value."""
- return self.to_frame().select(F.col(self.name).product()).to_series().item()
+ return self.to_frame().product().item()
def pow(self, exponent: int | float | None | Series) -> Series:
"""
@@ -1786,7 +1786,7 @@ class Series:
"""
if not self.dtype.is_numeric():
return None
- return self.to_frame().select(F.col(self.name).std(ddof)).to_series().item()
+ return self.to_frame().std(ddof).item()
def var(self, ddof: int = 1) -> float | None:
"""
@@ -1808,7 +1808,7 @@ class Series:
"""
if not self.dtype.is_numeric():
return None
- return self.to_frame().select(F.col(self.name).var(ddof)).to_series().item()
+ return self.to_frame().var(ddof).item()
def median(self) -> float | None:
"""
From a quick test they should work. It's probably worth doing a fast path for select on a frame with one column though, as you suggested in the PR.
seems that
But the code you suggested with
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
Issue description
Doing head/tail on a Series is almost a couple of orders of magnitude slower than doing the same operation on a DataFrame.
Expected behavior
These operations should take a time comparable to the DataFrame ones. Since they are doing less work I would expect them to be faster, but a comparable time is probably good enough.
Installed versions