
Series head / tail is about 60+ times slower than DataFrame ones #12928

Closed
CaselIT opened this issue Dec 7, 2023 · 9 comments · Fixed by #12946
Labels
bug Something isn't working performance Performance issues or improvements python Related to Python Polars

Comments

@CaselIT
Contributor

CaselIT commented Dec 7, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from timeit import timeit

df = pl.DataFrame({"a": pl.arange(0, 10000, eager=True), "c": 1, "b": "a", "x": 42.42})
df_series = df["a"]
plain_series = pl.Series("x", pl.arange(0, 10000, eager=True))

def go(code):
    print(code, timeit(code, number=100_000, globals=globals()))

go("df.head(0)")
go("df.head(100)")
go("df.tail(0)")
go("df.tail(100)")
go("df_series.head(0)")
go("df_series.head(100)")
go("df_series.tail(0)")
go("df_series.tail(100)")
go("plain_series.head(0)")
go("plain_series.head(100)")
go("plain_series.tail(0)")
go("plain_series.tail(100)")
go("pl.DataFrame({'_': plain_series}).head(0)['_']")
go("pl.DataFrame({'_': plain_series}).head(100)['_']")
go("pl.DataFrame({'_': plain_series}).tail(0)['_']")
go("pl.DataFrame({'_': plain_series}).tail(100)['_']")

Log output

df.head(0) 0.18811989994719625
df.head(100) 0.1828387000132352
df.tail(0) 0.12071079993620515
df.tail(100) 0.10931550001259893
df_series.head(0) 7.409043599967845
df_series.head(100) 8.439290300011635
df_series.tail(0) 11.03056729992386
df_series.tail(100) 10.96223199996166
plain_series.head(0) 9.089440800016746
plain_series.head(100) 9.058857599971816
plain_series.tail(0) 10.974754099966958
plain_series.tail(100) 10.8858651999617
pl.DataFrame({'_': plain_series}).head(0)['_'] 1.592211200040765
pl.DataFrame({'_': plain_series}).head(100)['_'] 1.5547756999731064
pl.DataFrame({'_': plain_series}).tail(0)['_'] 1.5701891999924555
pl.DataFrame({'_': plain_series}).tail(100)['_'] 1.571642900002189

Issue description

Doing head/tail on a series is almost two orders of magnitude slower than doing the same operation on a dataframe.

Expected behavior

These operations should take a comparable time to the DataFrame ones. Since they do less work I would expect them to be faster, but comparable time is probably good enough.

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.8.0
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              14.0.1
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.1.9
@CaselIT CaselIT added bug Something isn't working python Related to Python Polars labels Dec 7, 2023
@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Something strange is definitely going on. Creating a dataframe from the series, doing the operation and getting the series back is still 4–5 times faster than doing the operation on the series directly.

I've updated the opening example.

@mcrumiller
Contributor

mcrumiller commented Dec 7, 2023

This is unrelated to head or tail (well, it sort of is). There is no direct Series implementation; the result is obtained by converting the Series to a frame, running the Expression implementation, and then retrieving the Series again, as can be seen here:

def head(self, n: int = 10) -> Series:
    ...
    return self.to_frame().select(F.col(self.name).head(n)).to_series()

It's the select overhead that's taking time, which you can see here:

go("df.head(0)")
go("df.head(100)")
go("df.select(pl.col('x'))")  # <-- this one takes a long time too
go("plain_series.to_frame().select(pl.col('x')).to_series()")
df.head(0) 0.6803621590006514
df.head(100) 0.7990960510014702
df.select(pl.col('x')) 10.904522539000027
plain_series.to_frame().select(pl.col('x')).to_series() 11.942456482000125

I don't think this overhead has been considered substantial until now, but I definitely see that it is in cases of repeating the same call many times. Did you just notice this when stress-testing the Series implementation, or did this come up in an actual use case? I'm unsure whether it's worth tackling, as there are a lot of series operations that use this pattern. @ritchie46 what do you think?

Edit: looks like we just need to expose the method on the rust side. Incoming PR.
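To make the overhead concrete, here is a stand-in sketch with NumPy (hypothetical, not the actual Polars internals): the slice itself is cheap, so the per-call wrapping machinery is what dominates the benchmark.

```python
import numpy as np

a = np.arange(10_000)

def direct_head(arr, n=10):
    # The underlying operation: a cheap slice.
    return arr[:n]

def round_trip_head(arr, n=10):
    # Stand-in for Series -> DataFrame -> select -> Series: wrap the
    # array in a one-column "frame", run the operation per column,
    # then unwrap the single column again.
    frame = {"x": arr}
    result = {name: col[:n] for name, col in frame.items()}
    return result["x"]

assert np.array_equal(direct_head(a, 100), round_trip_head(a, 100))
```

Both paths return the same result; the difference in the issue's timings comes from the extra per-call construction, which is why exposing the method directly on the Rust side removes the gap.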

@mcrumiller
Contributor

Output after fix:

df.head(0) 0.7016070260015113
df.head(100) 0.8119084070003737
df.tail(0) 0.6956114350014104
df.tail(100) 0.5694922879993101
df_series.head(0) 0.32251197599907755
df_series.head(100) 0.30024468200099363
df_series.tail(0) 0.35146700299992517
df_series.tail(100) 0.282969650999803
plain_series.head(0) 0.32037397700150905
plain_series.head(100) 0.2715963600003306
plain_series.tail(0) 0.3179022389995225
plain_series.tail(100) 0.27410357799999474
pl.DataFrame({'_': plain_series}).head(0)['_'] 3.9621210400000564
pl.DataFrame({'_': plain_series}).head(100)['_'] 3.679782751998573
pl.DataFrame({'_': plain_series}).tail(0)['_'] 4.025472122000792
pl.DataFrame({'_': plain_series}).tail(100)['_'] 3.76775115899909

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Thanks for the reply and the PR

Did you just notice this when stress-testing the Series implementation, or did this come up in an actual use case?

My actual use case was totally unrelated: I wanted to extract the numpy dtypes of some columns of a dataframe (to interface with existing code that saves data as numpy arrays, which I did not want to change), and wanted to check whether [(c, df[c].head(0).to_numpy().dtype) for c in cols] had an acceptable runtime for my use case. It seemed to take too much time, and that led me here.

BTW is there a more intelligent way of doing that?
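The probe pattern above can be sketched without Polars, using a dict of NumPy columns as a hypothetical stand-in for the DataFrame:

```python
import numpy as np

# Hypothetical stand-in for a DataFrame: a dict of NumPy columns.
df = {"a": np.arange(3, dtype=np.uint8), "b": np.arange(3, dtype=np.int32)}
cols = ["a", "b"]

# Mirrors [(c, df[c].head(0).to_numpy().dtype) for c in cols]: an empty
# slice exposes the column's NumPy dtype without materializing any rows.
dtypes = [(c, df[c][:0].dtype) for c in cols]
assert dtypes == [("a", np.dtype("uint8")), ("b", np.dtype("int32"))]
```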

I'm unsure whether it's worth tackling, as there are a lot of series operations that use this pattern.

Up to you, but even while keeping this pattern (to frame -> op -> to series) it may make sense to optimize the implementation. An equivalent plain_series.to_frame().head(10).to_series() is several times faster:

(on a different pc)

go("plain_series.head(0)")
go("plain_series.head(100)")
go("plain_series.tail(0)")
go("plain_series.tail(100)")
go("plain_series.to_frame().head(0).to_series()")
go("plain_series.to_frame().head(100).to_series()")
go("plain_series.to_frame().tail(0).to_series()")
go("plain_series.to_frame().tail(100).to_series()")
plain_series.head(0) 4.969168399926275
plain_series.head(100) 5.323634900152683
plain_series.tail(0) 5.622998500009999
plain_series.tail(100) 5.60176460002549
plain_series.to_frame().head(0).to_series() 0.14708300004713237
plain_series.to_frame().head(100).to_series() 0.15118449996225536
plain_series.to_frame().tail(0).to_series() 0.1474038001615554
plain_series.to_frame().tail(100).to_series() 0.14568820013664663

Just doing the above instead of a redundant select (since there is always only a single column) may be sufficient without exposing new methods from Rust (note that I don't know how much effort exposing them takes).

@mcrumiller
Contributor

@CaselIT that's true in this particular case, because head and tail are valid operations on DataFrames, whereas most Series operations are not.

Regarding extracting numpy arrays, there is a DataFrame.to_numpy() which may speed things up for you, perhaps by selecting 0 rows:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}, schema={"a": pl.UInt8, "b": pl.Int32})

np_dtypes = [(c, d[0]) for c, d in df.filter(False).to_numpy(structured=True).dtype.fields.items()]
# [('a', dtype('uint8')), ('b', dtype('int32'))]

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

@CaselIT that's true in this particular case, because head and tail are valid operations on DataFrames, whereas most Series operations are not.

Well, of course, for the cases where a comparable operation is available. Thanks for the short turnaround.

Regarding extracting numpy array

Will give it a try, thanks!

@mcrumiller
Contributor

Alternatively, you could create a dict mapping of dtypes:

import numpy as np
import polars as pl

mapping = {
    pl.UInt8: np.dtype("uint8"),
    pl.Int32: np.dtype("int32"),
}

[(c, mapping[d]) for c, d in df.schema.items()]

or if you're using simpler dtypes:

[(c, np.dtype(str(d).lower())) for c, d in df.schema.items()]
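The string-lowering trick works for the simple numeric dtypes because NumPy accepts the lower-cased type names directly; a quick sanity check (numeric names only, this does not hold for e.g. Utf8 or Date):

```python
import numpy as np

# Lower-cased Polars-style numeric dtype names map straight onto
# NumPy dtype strings (not valid for Utf8, Date, Categorical, etc.).
for name in ["Int8", "Int32", "Int64", "UInt8", "Float32", "Float64"]:
    assert np.dtype(name.lower()).name == name.lower()
```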

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Alternatively, you could create a dict mapping of dtypes:

Sure, but since polars already knows how to transform to numpy I wanted to piggyback on it :)
The current solution should be fast enough, btw.

@CaselIT that's true in this particular case, because head and tail are valid operations on DataFrames, whereas most Series operations are not.

Looking a bit, there are 3 other candidates:

diff --git a/py-polars/polars/series/series.py b/py-polars/polars/series/series.py
index e8ab8b396..0ea69fb58 100644
--- a/py-polars/polars/series/series.py
+++ b/py-polars/polars/series/series.py
@@ -1687,7 +1687,7 @@ class Series:

     def product(self) -> int | float:
         """Reduce this Series to the product value."""
-        return self.to_frame().select(F.col(self.name).product()).to_series().item()
+        return self.to_frame().product().item()

     def pow(self, exponent: int | float | None | Series) -> Series:
         """
@@ -1786,7 +1786,7 @@ class Series:
         """
         if not self.dtype.is_numeric():
             return None
-        return self.to_frame().select(F.col(self.name).std(ddof)).to_series().item()
+        return self.to_frame().std(ddof).item()

     def var(self, ddof: int = 1) -> float | None:
         """
@@ -1808,7 +1808,7 @@ class Series:
         """
         if not self.dtype.is_numeric():
             return None
-        return self.to_frame().select(F.col(self.name).var(ddof)).to_series().item()
+        return self.to_frame().var(ddof).item()

     def median(self) -> float | None:
         """

From a quick test they should work.

It's probably worth adding a fast path for select on a single-column frame though, as you suggested in the PR.
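Such a fast path could look something like this sketch (hypothetical, not the actual Polars code; a dict of NumPy arrays stands in for the frame): dispatch straight to the column when the frame has exactly one.

```python
import numpy as np

def select_head(frame: dict, n: int) -> dict:
    # Hypothetical fast path: a one-column "frame" can slice its only
    # column directly instead of going through the general select
    # machinery.
    if len(frame) == 1:
        (name, col), = frame.items()
        return {name: col[:n]}
    # General path: slice every column.
    return {name: col[:n] for name, col in frame.items()}

frame = {"x": np.arange(10_000)}
out = select_head(frame, 100)
assert list(out) == ["x"] and out["x"].shape == (100,)
```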

@CaselIT
Contributor Author

CaselIT commented Dec 7, 2023

Regarding extracting numpy arrays, there is a Dataframe.to_numpy() which may speed things up for you, perhaps by selecting 0 rows:

It seems that filter(False) is not special-cased, so on a largish df it takes time compared to head(0), which is almost instantaneous:

In [9]: %timeit df.filter(False)
51.2 µs ± 928 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [10]: %timeit df.head(0)
621 ns ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: df.shape
Out[11]: (10000, 4)

But the code you suggested with structured=True gives me the desired dtypes, so it may make sense to just do that. I'll check whether it works next time I'm on that system. Thanks for looking into it.
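The gap is consistent with head(0) being a metadata-only slice while filter(False) still builds and applies a full-length boolean mask; a NumPy analogue (illustrative only, not the Polars internals):

```python
import numpy as np

a = np.arange(10_000)

# head(0)-style: a zero-length slice touches no data.
empty_head = a[:0]

# filter(False)-style: allocate and apply a full boolean mask, which
# scans every row even though the result is empty.
mask = np.zeros(a.shape[0], dtype=bool)
empty_filter = a[mask]

assert empty_head.size == 0 and empty_filter.size == 0
assert empty_head.dtype == empty_filter.dtype == a.dtype
```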
