
Series head / tail is about 60+ times slower than DataFrame ones #12928

Closed
CaselIT opened this issue Dec 7, 2023 · 9 comments · Fixed by #12946
Labels
bug Something isn't working performance Performance issues or improvements python Related to Python Polars

Comments

@CaselIT
Contributor

CaselIT commented Dec 7, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from timeit import timeit

df = pl.DataFrame({"a": pl.arange(0, 10000, eager=True), "c": 1, "b": "a", "x": 42.42})
df_series = df["a"]
plain_series = pl.Series("x", pl.arange(0, 10000, eager=True))

def go(code):
    print(code, timeit(code, number=100_000, globals=globals()))

go("df.head(0)")
go("df.head(100)")
go("df.tail(0)")
go("df.tail(100)")
go("df_series.head(0)")
go("df_series.head(100)")
go("df_series.tail(0)")
go("df_series.tail(100)")
go("plain_series.head(0)")
go("plain_series.head(100)")
go("plain_series.tail(0)")
go("plain_series.tail(100)")
go("pl.DataFrame({'_': plain_series}).head(0)['_']")
go("pl.DataFrame({'_': plain_series}).head(100)['_']")
go("pl.DataFrame({'_': plain_series}).tail(0)['_']")
go("pl.DataFrame({'_': plain_series}).tail(100)['_']")

Log output

df.head(0) 0.18811989994719625
df.head(100) 0.1828387000132352
df.tail(0) 0.12071079993620515
df.tail(100) 0.10931550001259893
df_series.head(0) 7.409043599967845
df_series.head(100) 8.439290300011635
df_series.tail(0) 11.03056729992386
df_series.tail(100) 10.96223199996166
plain_series.head(0) 9.089440800016746
plain_series.head(100) 9.058857599971816
plain_series.tail(0) 10.974754099966958
plain_series.tail(100) 10.8858651999617
pl.DataFrame({'_': plain_series}).head(0)['_'] 1.592211200040765
pl.DataFrame({'_': plain_series}).head(100)['_'] 1.5547756999731064
pl.DataFrame({'_': plain_series}).tail(0)['_'] 1.5701891999924555
pl.DataFrame({'_': plain_series}).tail(100)['_'] 1.571642900002189

Issue description

Doing head/tail on a series is almost two orders of magnitude slower than doing the same operation on a dataframe.

Expected behavior

These operations should take a comparable time to the DataFrame ones. Since they do less work I would expect them to be faster, but comparable time is probably good enough.

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.8.0
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              14.0.1
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.1.9
@CaselIT CaselIT added bug Something isn't working python Related to Python Polars labels Dec 7, 2023
@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Something strange is definitely going on. Creating a dataframe from the series, doing the operation and getting the series back is still 4–5 times faster than doing the operation on the series directly.

I've updated the opening example.

@mcrumiller
Contributor

mcrumiller commented Dec 7, 2023

This is unrelated to head or tail (well, it sort of is). There is no direct Series implementation; the result is obtained by converting the Series to a frame, running the Expression implementation, and then retrieving the Series again, as can be seen here:

def head(self, n: int = 10) -> Series:
    ...
    return self.to_frame().select(F.col(self.name).head(n)).to_series()

It's the select overhead that's taking time, which you can see here:

go("df.head(0)")
go("df.head(100)")
go("df.select(pl.col('x'))")  # <-- this one takes a long time too
go("plain_series.to_frame().select(pl.col('x')).to_series()")
df.head(0) 0.6803621590006514
df.head(100) 0.7990960510014702
df.select(pl.col('x')) 10.904522539000027
plain_series.to_frame().select(pl.col('x')).to_series() 11.942456482000125

I don't think this overhead has been considered substantial until now, but I definitely see that it is in cases of repeating the same call many times. Did you just notice this when stress-testing the Series implementation, or did this come up in an actual use case? I'm unsure whether it's worth tackling, as there are a lot of series operations that use this pattern. @ritchie46 what do you think?

Edit: looks like we just need to expose the method on the rust side. Incoming PR.
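To make the overhead concrete, here is a stand-in sketch with NumPy (hypothetical, not the actual Polars internals): the slice itself is cheap, so the per-call wrapping machinery is what dominates the benchmark.

```python
import numpy as np

a = np.arange(10_000)

def direct_head(arr, n=10):
    # The underlying operation: a cheap slice.
    return arr[:n]

def round_trip_head(arr, n=10):
    # Stand-in for Series -> DataFrame -> select -> Series: wrap the
    # array in a one-column "frame", run the operation per column,
    # then unwrap the single column again.
    frame = {"x": arr}
    result = {name: col[:n] for name, col in frame.items()}
    return result["x"]

assert np.array_equal(direct_head(a, 100), round_trip_head(a, 100))
```

Both paths return the same result; the difference in the issue's timings comes from the extra per-call construction, which is why exposing the method directly on the Rust side removes the gap.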

@mcrumiller
Contributor

Output after fix:

df.head(0) 0.7016070260015113
df.head(100) 0.8119084070003737
df.tail(0) 0.6956114350014104
df.tail(100) 0.5694922879993101
df_series.head(0) 0.32251197599907755
df_series.head(100) 0.30024468200099363
df_series.tail(0) 0.35146700299992517
df_series.tail(100) 0.282969650999803
plain_series.head(0) 0.32037397700150905
plain_series.head(100) 0.2715963600003306
plain_series.tail(0) 0.3179022389995225
plain_series.tail(100) 0.27410357799999474
pl.DataFrame({'_': plain_series}).head(0)['_'] 3.9621210400000564
pl.DataFrame({'_': plain_series}).head(100)['_'] 3.679782751998573
pl.DataFrame({'_': plain_series}).tail(0)['_'] 4.025472122000792
pl.DataFrame({'_': plain_series}).tail(100)['_'] 3.76775115899909

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Thanks for the reply and the PR

Did you just notice this when stress-testing the Series implementation, or did this come up in an actual use case?

My actual use case was totally unrelated: I wanted to extract the numpy dtypes of some columns of a dataframe (to interface with existing code that saves data as numpy arrays, which I did not want to change), and wanted to check whether [(c, df[c].head(0).to_numpy().dtype) for c in cols] had an acceptable runtime for my use case. It seemed to take too much time, and that led me here.

BTW is there a more intelligent way of doing that?
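The probe pattern above can be sketched without Polars, using a dict of NumPy columns as a hypothetical stand-in for the DataFrame:

```python
import numpy as np

# Hypothetical stand-in for a DataFrame: a dict of NumPy columns.
df = {"a": np.arange(3, dtype=np.uint8), "b": np.arange(3, dtype=np.int32)}
cols = ["a", "b"]

# Mirrors [(c, df[c].head(0).to_numpy().dtype) for c in cols]: an empty
# slice exposes the column's NumPy dtype without materializing any rows.
dtypes = [(c, df[c][:0].dtype) for c in cols]
assert dtypes == [("a", np.dtype("uint8")), ("b", np.dtype("int32"))]
```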

I'm unsure whether it's worth tackling, as there are a lot of series operations that use this pattern.

Up to you, but even while keeping this pattern (to frame -> op -> to series) it may make sense to optimize the implementation. An equivalent plain_series.to_frame().head(10).to_series() is several times faster:

(on a different pc)

go("plain_series.head(0)")
go("plain_series.head(100)")
go("plain_series.tail(0)")
go("plain_series.tail(100)")
go("plain_series.to_frame().head(0).to_series()")
go("plain_series.to_frame().head(100).to_series()")
go("plain_series.to_frame().tail(0).to_series()")
go("plain_series.to_frame().tail(100).to_series()")
plain_series.head(0) 4.969168399926275
plain_series.head(100) 5.323634900152683
plain_series.tail(0) 5.622998500009999
plain_series.tail(100) 5.60176460002549
plain_series.to_frame().head(0).to_series() 0.14708300004713237
plain_series.to_frame().head(100).to_series() 0.15118449996225536
plain_series.to_frame().tail(0).to_series() 0.1474038001615554
plain_series.to_frame().tail(100).to_series() 0.14568820013664663

Just doing the above instead of a redundant select (since there is always only a single column) may be sufficient without exposing new methods from Rust (note that I don't know how much effort exposing them takes).

@mcrumiller
Contributor

@CaselIT that's true in this particular case, because head and tail are valid operations on DataFrames, whereas most Series operations are not.

Regarding extracting numpy arrays, there is a DataFrame.to_numpy() which may speed things up for you, perhaps by selecting 0 rows:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}, schema={"a": pl.UInt8, "b": pl.Int32})

np_dtypes = [(c, d[0]) for c, d in df.filter(False).to_numpy(structured=True).dtype.fields.items()]
# [('a', dtype('uint8')), ('b', dtype('int32'))]

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

@CaselIT that's true in this particular case, because head and tail are valid operations on DataFrames, whereas most Series operations are not.

Well, of course, for the cases where a comparable operation is available. Thanks for the short turnaround.

Regarding extracting numpy array

Will give it a try, thanks!

@mcrumiller
Contributor

Alternatively, you could create a dict mapping of dtypes:

import numpy as np
import polars as pl

mapping = {
    pl.UInt8: np.dtype("uint8"),
    pl.Int32: np.dtype("int32"),
}

[(c, mapping[d]) for c, d in df.schema.items()]

or if you're using simpler dtypes:

[(c, np.dtype(str(d).lower())) for c, d in df.schema.items()]
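The string-lowering trick works for the simple numeric dtypes because NumPy accepts the lower-cased type names directly; a quick sanity check (numeric names only, this does not hold for e.g. Utf8 or Date):

```python
import numpy as np

# Lower-cased Polars-style numeric dtype names map straight onto
# NumPy dtype strings (not valid for Utf8, Date, Categorical, etc.).
for name in ["Int8", "Int32", "Int64", "UInt8", "Float32", "Float64"]:
    assert np.dtype(name.lower()).name == name.lower()
```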

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Alternatively, you could create a dict mapping of dtypes:

Sure, but since polars already knows how to transform to numpy I wanted to piggyback on it :)
The current solution should be fast enough, btw.

@CaselIT that's true in this particular case, because head and tail are valid operations on DataFrames, whereas most Series operations are not.

Looking a bit, there are 3 other candidates:

diff --git a/py-polars/polars/series/series.py b/py-polars/polars/series/series.py
index e8ab8b396..0ea69fb58 100644
--- a/py-polars/polars/series/series.py
+++ b/py-polars/polars/series/series.py
@@ -1687,7 +1687,7 @@ class Series:

     def product(self) -> int | float:
         """Reduce this Series to the product value."""
-        return self.to_frame().select(F.col(self.name).product()).to_series().item()
+        return self.to_frame().product().item()

     def pow(self, exponent: int | float | None | Series) -> Series:
         """
@@ -1786,7 +1786,7 @@ class Series:
         """
         if not self.dtype.is_numeric():
             return None
-        return self.to_frame().select(F.col(self.name).std(ddof)).to_series().item()
+        return self.to_frame().std(ddof).item()

     def var(self, ddof: int = 1) -> float | None:
         """
@@ -1808,7 +1808,7 @@ class Series:
         """
         if not self.dtype.is_numeric():
             return None
-        return self.to_frame().select(F.col(self.name).var(ddof)).to_series().item()
+        return self.to_frame().var(ddof).item()

     def median(self) -> float | None:
         """

From a quick test they should work.

It's probably worth adding a fast path for select on a single-column frame though, as you suggested in the PR.
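Such a fast path could look something like this sketch (hypothetical, not the actual Polars code; a dict of NumPy arrays stands in for the frame): dispatch straight to the column when the frame has exactly one.

```python
import numpy as np

def select_head(frame: dict, n: int) -> dict:
    # Hypothetical fast path: a one-column "frame" can slice its only
    # column directly instead of going through the general select
    # machinery.
    if len(frame) == 1:
        (name, col), = frame.items()
        return {name: col[:n]}
    # General path: slice every column.
    return {name: col[:n] for name, col in frame.items()}

frame = {"x": np.arange(10_000)}
out = select_head(frame, 100)
assert list(out) == ["x"] and out["x"].shape == (100,)
```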

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Regarding extracting numpy arrays, there is a Dataframe.to_numpy() which may speed things up for you, perhaps by selecting 0 rows:

It seems that filter(False) is not special-cased, so on a largish df it takes time compared to head(0), which is almost instantaneous:

In [9]: %timeit df.filter(False)
51.2 µs ± 928 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [10]: %timeit df.head(0)
621 ns ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: df.shape
Out[11]: (10000, 4)

But the code you suggested with structured=True gives me the desired dtypes, so it may make sense to just do that. I'll check whether it works next time I'm on that system. Thanks for looking into it.
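The gap is consistent with head(0) being a metadata-only slice while filter(False) still builds and applies a full-length boolean mask; a NumPy analogue (illustrative only, not the Polars internals):

```python
import numpy as np

a = np.arange(10_000)

# head(0)-style: a zero-length slice touches no data.
empty_head = a[:0]

# filter(False)-style: allocate and apply a full boolean mask, which
# scans every row even though the result is empty.
mask = np.zeros(a.shape[0], dtype=bool)
empty_filter = a[mask]

assert empty_head.size == 0 and empty_filter.size == 0
assert empty_head.dtype == empty_filter.dtype == a.dtype
```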
