Forward_fill() and backward_fill() is about 25% slower in polars compared to pandas' counterparts #15480

Chuck321123 · 2024-04-04T16:22:44Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pandas as pd
import numpy as np
import polars as pl
from datetime import datetime, timedelta, date

# Define parameters
num_rows = 280000
num_groups = 200

# Generate random data
data = {
    'group': np.random.choice([f'group_{i}' for i in range(num_groups)], size=num_rows),
    'random_value': np.random.rand(num_rows)
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Randomly set one-third of values to NaN
random_indices = np.random.choice(df.index, size=num_rows // 3, replace=False)
df.loc[random_indices, 'random_value'] = np.nan

df2 = pl.DataFrame(df)

%timeit df["random_value"].bfill()

%timeit df2.select(pl.col("random_value").backward_fill())

Log output

No response

Issue description

Polars' backward_fill() and forward_fill() are about 25% slower than the bfill() and ffill() functions in pandas. It would be nice if someone could find a faster way to run these functions.

Expected behavior

The Polars equivalent should be as fast as or faster than its pandas counterpart.

Installed versions

--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

d-reynol · 2024-04-04T23:32:12Z

I'm not seeing the same behavior:

242 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
239 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Chuck321123 · 2024-04-24T18:30:25Z

Reopening this issue, as I still get faster results with the pandas counterpart.

2.21 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let me know if anyone gets something different.

deanm0000 · 2024-04-25T11:54:21Z

I don't think your test is an apples-to-apples comparison.

df["random_value"].bfill() doesn't return a DataFrame; it returns a Series.

A more apples-to-apples test would compare two function calls that each return a DataFrame, something like:

%%timeit
df2.with_columns(pl.col("random_value").backward_fill())

%%timeit
df.assign(random_value=df["random_value"].bfill())

When I run that comparison with 100M rows, 20% null, Polars takes 795 ms and pandas takes 1.44 s.

Chuck321123 · 2024-04-25T12:15:29Z

@deanm0000 I see. The whole idea is to create a new column in a DataFrame by backward filling. Using with_columns instead of select, I get the following results, where Polars is the second line:

2.33 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.25 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The usual pandas way of adding a new column or modifying an existing one is df["New_Col"] = ..., so it would be somewhat misleading to compare against assign, which "nobody" uses.

deanm0000 · 2024-04-25T19:53:43Z

I see your point, but it's not a bug that pandas is faster for this operation.

Someone should correct me if I have this wrong, but I think the difference is that NumPy arrays are mutable whereas Arrow arrays are immutable. That means when you only want to change a subset of values, pandas/NumPy can do that in place, whereas performing the same operation on Arrow arrays requires rewriting all the values.

orlp · 2024-05-24T16:00:29Z

As mentioned by others, this is not a fair comparison as the input/output formats are different - we don't do in-place manipulation but generate a copy. Also, Polars actually has proper nulls (which means it has to look in a different memory location that contains the nulls), whereas Pandas only has to look at the values themselves since it uses NaNs.
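To make the difference between the two null representations concrete, here is a minimal pure-Python sketch. It is not how pandas or Polars implement the fill; the function names `backward_fill_nan` and `backward_fill_masked` and the list-based buffers are assumptions made purely for illustration. The first function finds missing entries by inspecting the values themselves, while the second has to consult a separate validity buffer and also produce an updated one.

```python
import math


def backward_fill_nan(values):
    """Backward fill where a missing entry is a NaN stored among the values."""
    out = list(values)  # write into a copy; the input is left untouched
    next_valid = math.nan
    for i in range(len(out) - 1, -1, -1):
        if math.isnan(out[i]):
            out[i] = next_valid  # stays NaN if nothing valid follows
        else:
            next_valid = out[i]
    return out


def backward_fill_masked(values, validity):
    """Backward fill where missing entries are flagged in a separate buffer.

    validity[i] is True when values[i] holds real data; the value stored in a
    null slot is arbitrary and must not be interpreted.
    """
    out_values = list(values)
    out_validity = list(validity)
    have_next = False
    next_valid = 0.0
    for i in range(len(values) - 1, -1, -1):
        if validity[i]:  # second memory location consulted per element
            next_valid = values[i]
            have_next = True
        elif have_next:
            out_values[i] = next_valid
            out_validity[i] = True
    return out_values, out_validity
```

Trailing entries with no later valid value stay missing in both versions: as NaN in the first, and as a False validity flag in the second.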

Finally the original test of 280,000 rows is way too small - at that point you're almost benchmarking the Polars DSL parsing/optimizer more than the data manipulation itself.

Repeating the above experiment with 100M rows I get the following results on my Apple M2 machine:

378 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
742 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm currently finishing a PR that would reduce the gap to this:

379 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
544 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

More improvement with branchless filling is possible still but low priority at the moment, as it's rather labour-intensive to write.
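One common reading of "branchless" is that the per-element `if` is replaced by an unconditional select, which a compiled kernel can turn into a conditional move instead of a hard-to-predict jump. The sketch below shows that idea in pure Python for a forward fill. It is an illustration only, not the code from the PR mentioned above, and it will not be faster in Python itself; the function names and the values-plus-validity list layout are assumptions.

```python
def forward_fill_branchy(values, validity):
    """Forward fill using a data-dependent branch per element."""
    out, out_validity = [], []
    last, seen = 0.0, False
    for v, ok in zip(values, validity):
        if ok:  # branch taken or not depending on the data
            last, seen = v, True
        out.append(last)
        out_validity.append(seen)
    return out, out_validity


def forward_fill_select(values, validity):
    """Forward fill where the branch is replaced by an indexed select."""
    out, out_validity = [], []
    last, seen = 0.0, False
    for v, ok in zip(values, validity):
        last = (last, v)[ok]  # index 0 keeps the carried value, 1 takes v
        seen = seen | ok      # becomes True once any valid value was seen
        out.append(last)
        out_validity.append(seen)
    return out, out_validity
```

Both functions return the same filled values and validity flags; only the way the carried value is updated differs.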

Chuck321123 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 4, 2024

Chuck321123 changed the title ~~Forward_fill() and backward_fill() is twice as slow in polars than pandas~~ Forward_fill() and backward_fill() is 1.5-2 times slower in polars than pandas Apr 4, 2024

Chuck321123 changed the title ~~Forward_fill() and backward_fill() is 1.5-2 times slower in polars than pandas~~ Forward_fill() and backward_fill() is 1.5-2 times slower in polars compared to pandas' counterparts Apr 4, 2024

Chuck321123 closed this as completed Apr 5, 2024

avimallu mentioned this issue Apr 5, 2024

.over() performs quite slow in given sample #15492

Closed

2 tasks

Chuck321123 changed the title ~~Forward_fill() and backward_fill() is 1.5-2 times slower in polars compared to pandas' counterparts~~ Forward_fill() and backward_fill() is 30% slower in polars compared to pandas' counterparts Apr 24, 2024

Chuck321123 reopened this Apr 24, 2024

Chuck321123 changed the title ~~Forward_fill() and backward_fill() is 30% slower in polars compared to pandas' counterparts~~ Forward_fill() and backward_fill() is about 25% slower in polars compared to pandas' counterparts Apr 24, 2024

deanm0000 added performance Performance issues or improvements and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Apr 25, 2024

orlp mentioned this issue May 24, 2024

perf: improved numeric fill_(forward/backward) #16475

Merged

ritchie46 closed this as completed in #16475 May 25, 2024

c-peters added the accepted Ready for implementation label May 27, 2024

c-peters assigned orlp May 27, 2024

cmdlineluser mentioned this issue Jun 11, 2024

Feature request: Faster backward- and forward_fill() functions #16875

Open