Weird performance characteristics of resampled quantile() function (100 times slower) #26150

ghost · 2019-04-19T12:51:14Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from time import time

print("Pandas version:", pd.__version__)
print("Numpy version:", np.__version__)

index = pd.date_range("2016-01-01", periods=int(1e7), freq="ms")
df = pd.DataFrame(np.random.rand(len(index)), columns=["column"], index=index)

print()
print("Median")

start = time()
df.resample("1min").median()
print("df.resample.median", time() - start)

start = time()
df.column.resample("1min").median()
print("df.column.resample.median", time() - start)

start = time()
df.resample("1min").column.median()
print("df.resample.column.median", time() - start)

start = time()
df.resample("1min").apply(lambda x: np.median(x))
print("df.resample.apply np.median", time() - start)

start = time()
df.column.resample("1min").apply(lambda x: np.median(x))
print("df.column.resample.apply np.median", time() - start)

start = time()
df.column.resample("1min").apply(lambda x: x.median())
print("df.column.resample.apply pd.median", time() - start)

start = time()
df.resample("1min").column.apply(lambda x: np.median(x))
print("df.resample.column.apply np.median", time() - start)

start = time()
df.resample("1min").column.apply(lambda x: x.median())
print("df.resample.column.apply pd.median", time() - start)

print()
print("Quantile")

start = time()
df.resample("1min").quantile(0.25)
print("df.resample.quantile", time() - start)

start = time()
df.column.resample("1min").quantile(0.25)
print("df.column.resample.quantile", time() - start)

start = time()
df.resample("1min").column.quantile(0.25)
print("df.resample.column.quantile", time() - start)

start = time()
df.resample("1min").apply(lambda x: np.quantile(x, 0.25))
print("df.resample.apply np.quantile", time() - start)

start = time()
df.resample("1min").apply(lambda x: x.quantile(0.25))
print("df.resample.apply pd.quantile", time() - start)

start = time()
df.column.resample("1min").apply(lambda x: np.quantile(x, 0.25))
print("df.column.resample.apply np.quantile", time() - start)

start = time()
df.resample("1min").column.apply(lambda x: x.quantile(0.25))
print("df.resample.column.apply pd.quantile", time() - start)

Output:

Pandas version: 0.24.2
Numpy version: 1.16.2

Median
df.resample.median 0.5927023887634277
df.column.resample.median 0.5536832809448242
df.resample.column.median 0.5364699363708496
df.resample.apply np.median 0.1465294361114502
df.column.resample.apply np.median 0.14081549644470215
df.column.resample.apply pd.median 0.19739317893981934
df.resample.column.apply np.median 0.704085111618042
df.resample.column.apply pd.median 0.7553591728210449

Quantile
df.resample.quantile 0.8943967819213867
df.column.resample.quantile 16.76218605041504
df.resample.column.quantile 16.512025117874146
df.resample.apply np.quantile 0.15454792976379395
df.resample.apply pd.quantile 0.2752718925476074
df.column.resample.apply np.quantile 0.1515665054321289
df.resample.column.apply pd.quantile 0.8622317314147949
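
A single `time()` measurement per call can be noisy, so the numbers above are best read as rough magnitudes. As a minimal sketch (not part of the original report), the helper below repeats each call and keeps the best time; the name `bench`, the `repeat` count, and the smaller frame size are assumptions chosen so the example runs quickly.

```python
from time import perf_counter

import numpy as np
import pandas as pd


def bench(label, fn, repeat=3):
    """Run fn() `repeat` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = perf_counter()
        fn()
        best = min(best, perf_counter() - start)
    print(f"{label}: {best:.4f} s")
    return best


# Smaller frame than the 1e7 rows in the report, so the sketch finishes fast.
index = pd.date_range("2016-01-01", periods=600_000, freq="ms")
df = pd.DataFrame(np.random.rand(len(index)), columns=["column"], index=index)

t_frame = bench("df.resample.quantile",
                lambda: df.resample("1min").quantile(0.25))
t_series = bench("df.column.resample.quantile",
                 lambda: df.column.resample("1min").quantile(0.25))
```

Taking the minimum over a few runs reduces the influence of one-off effects such as cache warm-up; absolute timings will still differ by machine and pandas version.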

Problem description

I have noticed that quantile() is sometimes extremely slow even though median() (which should have a similar run time) is not. While debugging the behaviour, I have found the following two problems:

apply + numpy is significantly faster than the corresponding pandas functions. This is surprising since I would have expected that the pandas operations avoid the overhead of apply().
If you perform quantile() on a series instead of a dataframe, then the operation is much slower. In the example above, it is 100 times slower than the fastest equivalent.
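
Before treating `apply` + NumPy as a stand-in for the slow path, it is worth checking that both routes return the same values. The following is a minimal sketch on a small frame, assuming contiguous data with no missing values and no empty bins (NumPy's `np.quantile` propagates NaN, while the pandas method skips it, so the two can diverge otherwise).

```python
import numpy as np
import pandas as pd

# 120,000 ms = exactly two 1-minute bins, both fully populated.
index = pd.date_range("2016-01-01", periods=120_000, freq="ms")
df = pd.DataFrame(np.random.rand(len(index)), columns=["column"], index=index)

builtin = df.column.resample("1min").quantile(0.25)
via_apply = df.column.resample("1min").apply(lambda x: np.quantile(x, 0.25))

# Both default to linear interpolation, so the values should agree.
same = np.allclose(builtin.to_numpy(), via_apply.to_numpy())
print("results match:", same)
```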

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.0.2
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

jreback · 2019-04-19T12:57:06Z

try on master, as a patch was recently merged for this

ghost · 2019-04-19T13:31:50Z

The performance has improved a bit, but it is still 50 times slower. Moreover, the performance for the whole dataframe got much worse.

Pandas version: 0.25.0.dev0+429.gf53bb0619
Numpy version: 1.16.2

Median
df.resample.median 0.631497859954834
df.column.resample.median 0.6417803764343262
df.resample.column.median 0.6168115139007568
df.resample.apply np.median 0.5013644695281982
df.column.resample.apply np.median 0.1668860912322998
df.column.resample.apply pd.median 0.2318108081817627
df.resample.column.apply np.median 0.8123440742492676
df.resample.column.apply pd.median 0.856703519821167

Quantile
df.resample.quantile 9.104807615280151
df.column.resample.quantile 9.142380952835083
df.resample.column.quantile 8.587191104888916
df.resample.apply np.quantile 0.44725990295410156
df.resample.apply pd.quantile 0.6029741764068604
df.column.resample.apply np.quantile 0.16558241844177246
df.resample.column.apply pd.quantile 0.9647414684295654
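
The "100 times" and "50 times" figures can be reproduced from the reported timings. The sketch below only does arithmetic on numbers copied from the two outputs in this thread; the dictionary and function names are illustrative.

```python
# Timings in seconds, copied from the two benchmark outputs in this thread.
v0242 = {
    "df.resample.quantile": 0.8943967819213867,
    "df.column.resample.quantile": 16.76218605041504,
    "df.column.resample.apply np.quantile": 0.1515665054321289,
}
master = {
    "df.resample.quantile": 9.104807615280151,
    "df.column.resample.quantile": 9.142380952835083,
    "df.column.resample.apply np.quantile": 0.16558241844177246,
}


def series_slowdown(timings):
    """Ratio of Series resample quantile to the apply + np.quantile variant."""
    return (timings["df.column.resample.quantile"]
            / timings["df.column.resample.apply np.quantile"])


ratio_0242 = series_slowdown(v0242)
ratio_master = series_slowdown(master)
# How much slower the whole-DataFrame call became on master.
frame_regression = master["df.resample.quantile"] / v0242["df.resample.quantile"]

print(f"0.24.2 Series slowdown: {ratio_0242:.1f}x")
print(f"master Series slowdown: {ratio_master:.1f}x")
print(f"DataFrame quantile, master vs 0.24.2: {frame_regression:.1f}x")
```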

swyoon · 2019-05-26T10:03:48Z

Any updates on this? I want to work on this if it's not occupied.

TomAugspurger · 2019-05-26T11:39:01Z

Doesn't seem like anyone is working on it, feel free to take a look.


swyoon · 2019-06-29T08:36:11Z

Basically, the Cython quantile function of Pandas is way slower than that of Numpy. This can be confirmed by the following snippet.

Numpy Quantile Function

import numpy as np
from time import time

N = int(1e7)
data = np.arange(N)
time_s = time()
np_result = np.quantile(data, 0.5)

print('time', time() - time_s)
print('result', np_result)

which results in

time 0.03830075263977051
result 4999999.5

Pandas Cython Function

libgroupby.group_quantile is basically what is called when we use resample().quantile()

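import numpy as np
import pandas._libs.groupby as libgroupby
from time import time  # missing from the original snippet

N = int(1e7)
data = np.arange(N)
# Assumption: `a` was not defined in the original post. group_quantile is a
# private function that writes one value per group into `out`; with every
# label equal to 1, two slots are assumed (group 0 stays NaN, group 1 is set).
a = np.empty(2, dtype='float64')
time_s = time()
libgroupby.group_quantile(out=a, labels=np.ones((N,), dtype='int'),
                          values=data, mask=np.zeros(N, dtype='uint8'),
                          q=0.5, interpolation='linear')
print('time', time() - time_s)
cython_result = a[~np.isnan(a)][0]
print('result', cython_result)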

which results in

time 0.8336191177368164
result 4999999.5

@TomAugspurger @jreback What do you think of this issue? Updating libgroupby.group_quantile might be reinventing the wheel. Should we change it to use NumPy's quantile function instead?

jreback · 2019-06-29T11:24:34Z

@swyoon you are comparing apples and oranges
pandas is grouping, numpy is not

numpy also only handles a small set of dtypes and further does not handle all of the ties correctly

you are welcome to profile
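
One way to make the comparison closer to like-for-like is to give both libraries many groups rather than a single one. The sketch below is illustrative only and uses public APIs; it assumes equal-sized, contiguous groups so that NumPy can compute every group's quantile in one `axis=1` call, which is a simplification that general grouping does not allow. The group counts are arbitrary.

```python
from time import perf_counter

import numpy as np
import pandas as pd

n_groups, group_size = 1_000, 1_000
values = np.random.rand(n_groups * group_size)
labels = np.repeat(np.arange(n_groups), group_size)

# pandas: grouped quantile through the public groupby API.
start = perf_counter()
pd_result = pd.Series(values).groupby(labels).quantile(0.5)
pd_time = perf_counter() - start

# NumPy: equal-sized contiguous groups become rows of a 2-D array.
start = perf_counter()
np_result = np.quantile(values.reshape(n_groups, group_size), 0.5, axis=1)
np_time = perf_counter() - start

print("pandas groupby quantile:", pd_time)
print("numpy row-wise quantile:", np_time)
print("results match:", np.allclose(pd_result.to_numpy(), np_result))
```

For a breakdown of where the pandas time goes, the standard-library `cProfile` module can be run over the `groupby(...).quantile(...)` call.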

mroeschke · 2023-11-21T19:11:52Z

Seems like efforts here have stalled, and this may no longer be relevant with recent versions of pandas or numpy. Closing until we have more recent profiling results.

gfyoung added the Performance (memory or execution speed performance) and Resample (resample method) labels Apr 23, 2019

jbrockmendel added the quantile (quantile method) label Oct 22, 2019

mroeschke closed this as completed Nov 21, 2023
