Weird performance characteristics of resampled quantile() function (100 times slower) #26150

Closed
ghost opened this issue Apr 19, 2019 · 7 comments
Labels: Performance (memory or execution speed performance), quantile (quantile method), Resample (resample method)

ghost commented Apr 19, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from time import time

print("Pandas version:", pd.__version__)
print("Numpy version:", np.__version__)

index = pd.date_range("2016-01-01", periods=int(1e7), freq="ms")
df = pd.DataFrame(np.random.rand(len(index)), columns=["column"], index=index)

print()
print("Median")

start = time()
df.resample("1min").median()
print("df.resample.median", time() - start)

start = time()
df.column.resample("1min").median()
print("df.column.resample.median", time() - start)

start = time()
df.resample("1min").column.median()
print("df.resample.column.median", time() - start)

start = time()
df.resample("1min").apply(lambda x: np.median(x))
print("df.resample.apply np.median", time() - start)

start = time()
df.column.resample("1min").apply(lambda x: np.median(x))
print("df.column.resample.apply np.median", time() - start)

start = time()
df.column.resample("1min").apply(lambda x: x.median())
print("df.column.resample.apply pd.median", time() - start)

start = time()
df.resample("1min").column.apply(lambda x: np.median(x))
print("df.resample.column.apply np.median", time() - start)

start = time()
df.resample("1min").column.apply(lambda x: x.median())
print("df.resample.column.apply pd.median", time() - start)

print()
print("Quantile")

start = time()
df.resample("1min").quantile(0.25)
print("df.resample.quantile", time() - start)

start = time()
df.column.resample("1min").quantile(0.25)
print("df.column.resample.quantile", time() - start)

start = time()
df.resample("1min").column.quantile(0.25)
print("df.resample.column.quantile", time() - start)

start = time()
df.resample("1min").apply(lambda x: np.quantile(x, 0.25))
print("df.resample.apply np.quantile", time() - start)

start = time()
df.resample("1min").apply(lambda x: x.quantile(0.25))
print("df.resample.apply pd.quantile", time() - start)

start = time()
df.column.resample("1min").apply(lambda x: np.quantile(x, 0.25))
print("df.column.resample.apply np.quantile", time() - start)

start = time()
df.resample("1min").column.apply(lambda x: x.quantile(0.25))
print("df.resample.column.apply pd.quantile", time() - start)

Output:

Pandas version: 0.24.2
Numpy version: 1.16.2

Median
df.resample.median 0.5927023887634277
df.column.resample.median 0.5536832809448242
df.resample.column.median 0.5364699363708496
df.resample.apply np.median 0.1465294361114502
df.column.resample.apply np.median 0.14081549644470215
df.column.resample.apply pd.median 0.19739317893981934
df.resample.column.apply np.median 0.704085111618042
df.resample.column.apply pd.median 0.7553591728210449

Quantile
df.resample.quantile 0.8943967819213867
df.column.resample.quantile 16.76218605041504
df.resample.column.quantile 16.512025117874146
df.resample.apply np.quantile 0.15454792976379395
df.resample.apply pd.quantile 0.2752718925476074
df.column.resample.apply np.quantile 0.1515665054321289
df.resample.column.apply pd.quantile 0.8622317314147949

Problem description

I have noticed that quantile() is sometimes extremely slow even though median() (which should have a similar run time) is not. While debugging the behaviour, I have found the following two problems:

  1. apply + numpy is significantly faster than the corresponding pandas functions. This is surprising since I would have expected that the pandas operations avoid the overhead of apply().
  2. If you perform quantile() on a series instead of a dataframe, then the operation is much slower. In the example above, it is 100 times slower than the fastest equivalent.
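
For reference, a minimal sketch (not part of the original report) that checks the two Series variants from the script above return the same values on a much smaller input, and times each one with `time.perf_counter`. The input size, the seed and the helper name `timed` are illustrative assumptions; absolute timings will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Much smaller input than the report above (3 minutes of millisecond data)
# so the check runs quickly.
index = pd.date_range("2016-01-01", periods=180_000, freq="ms")
rng = np.random.default_rng(0)
series = pd.Series(rng.random(len(index)), index=index, name="column")


def timed(label, func):
    """Hypothetical helper: run func once, print the wall time, return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.4f} s")
    return result


builtin = timed("resample.quantile", lambda: series.resample("1min").quantile(0.25))
via_apply = timed(
    "resample.apply np.quantile",
    lambda: series.resample("1min").apply(lambda x: np.quantile(x, 0.25)),
)

# Both paths default to linear interpolation, so the values should agree.
same_values = np.allclose(builtin.to_numpy(), via_apply.to_numpy())
print("same values:", same_values)
```

Checking equality first makes it clearer that any timing gap comes from the code path rather than from a difference in what is computed.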

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.0.2
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Contributor

jreback commented Apr 19, 2019

try on master, as a patch was recently merged for this

Author

ghost commented Apr 19, 2019

The performance has improved a bit, but it is still 50 times slower. Moreover, the performance for the whole dataframe got much worse.

Pandas version: 0.25.0.dev0+429.gf53bb0619
Numpy version: 1.16.2

Median
df.resample.median 0.631497859954834
df.column.resample.median 0.6417803764343262
df.resample.column.median 0.6168115139007568
df.resample.apply np.median 0.5013644695281982
df.column.resample.apply np.median 0.1668860912322998
df.column.resample.apply pd.median 0.2318108081817627
df.resample.column.apply np.median 0.8123440742492676
df.resample.column.apply pd.median 0.856703519821167

Quantile
df.resample.quantile 9.104807615280151
df.column.resample.quantile 9.142380952835083
df.resample.column.quantile 8.587191104888916
df.resample.apply np.quantile 0.44725990295410156
df.resample.apply pd.quantile 0.6029741764068604
df.column.resample.apply np.quantile 0.16558241844177246
df.resample.column.apply pd.quantile 0.9647414684295654
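
The ratios quoted in this thread can be recomputed from the printed timings. The short sketch below copies the relevant numbers from the two outputs above; the dictionary layout and variable names are illustrative only.

```python
# Timings (seconds) copied from the two outputs above.
timings = {
    "0.24.2": {
        "df.resample.quantile": 0.8943967819213867,
        "df.column.resample.quantile": 16.76218605041504,
        "df.column.resample.apply np.quantile": 0.1515665054321289,
    },
    "0.25.0.dev0+429.gf53bb0619": {
        "df.resample.quantile": 9.104807615280151,
        "df.column.resample.quantile": 9.142380952835083,
        "df.column.resample.apply np.quantile": 0.16558241844177246,
    },
}

slowdowns = {}
for version, t in timings.items():
    # Built-in Series quantile relative to the fastest equivalent (apply + np.quantile).
    slowdowns[version] = (
        t["df.column.resample.quantile"] / t["df.column.resample.apply np.quantile"]
    )
    print(f"{version}: Series quantile is {slowdowns[version]:.0f}x slower than apply + np.quantile")

# Change in the whole-DataFrame call between the two versions.
df_regression = (
    timings["0.25.0.dev0+429.gf53bb0619"]["df.resample.quantile"]
    / timings["0.24.2"]["df.resample.quantile"]
)
print(f"df.resample.quantile: {df_regression:.1f}x slower on the dev build")
```

These are single-run wall-clock numbers, so the ratios are rough indications rather than precise measurements.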

gfyoung added the Performance and Resample labels Apr 23, 2019

Contributor

swyoon commented May 26, 2019

Any updates on this? I want to work on this if it's not occupied.

Contributor

TomAugspurger commented May 26, 2019 via email

Contributor

swyoon commented Jun 29, 2019

Basically, the Cython quantile function of Pandas is way slower than that of Numpy. This can be confirmed by the following snippet.

Numpy Quantile Function

import numpy as np
from time import time

N = int(1e7)
data = np.arange(N)
time_s = time()
np_result = np.quantile(data, 0.5)

print('time', time() - time_s)
print('result', np_result)

which results in

time 0.03830075263977051
result 4999999.5

Pandas Cython Function

libgroupby.group_quantile is basically what is called when we use resample().quantile()

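import numpy as np
import pandas._libs.groupby as libgroupby
from time import time  # missing from the original snippet

N = int(1e7)
data = np.arange(N)
# Output buffer with one slot per group label (0 and 1). The original snippet
# did not show how `a` was defined, so this definition is an assumption.
a = np.full(2, np.nan)
time_s = time()
# group_quantile is a private function; its signature may differ between pandas versions.
libgroupby.group_quantile(out=a, labels=np.ones((N,), dtype='int'),
                          values=data, mask=np.zeros(N, dtype='uint8'),
                          q=0.5, interpolation='linear')
print('time', time() - time_s)
cython_result = a[~np.isnan(a)][0]
print('result', cython_result)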

which results in

time 0.8336191177368164
result 4999999.5
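
As a sanity check on the shared result: both calls above use linear interpolation, which places the quantile at fractional position q * (n - 1) in the sorted data. The sketch below spells that rule out. `linear_quantile` is a hypothetical helper written for illustration, not the code that either library actually runs.

```python
import numpy as np


def linear_quantile(values, q):
    """Hypothetical helper: quantile by linear interpolation between order statistics."""
    ordered = np.sort(np.asarray(values, dtype="float64"))
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = int(np.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = np.arange(10)
by_hand = linear_quantile(data, 0.5)
reference = np.quantile(data, 0.5)
print(by_hand, reference)
```

For `np.arange(N)` with `N = int(1e7)` and `q = 0.5` the position is 0.5 * (N - 1) = 4999999.5, which matches the value both snippets print.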

@TomAugspurger @jreback What do you think of this issue? Updating libgroupby.group_quantile might be reinventing the wheel. Shall we make some changes to use the numpy quantile function instead?

Contributor

jreback commented Jun 29, 2019

@swyoon you are comparing apples and oranges
pandas is grouping, numpy is not

numpy also only handles a small set of dtypes and further does not handle all of the ties correctly

you are welcome to profile
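
One way to follow up on this, sketched under assumptions: compare grouped against grouped by using the public `groupby().quantile()` on one side and a hand-rolled sort, split and `np.quantile` loop on the other, then profile the pandas path with `cProfile`. The data size, the number of groups and the function names `pandas_grouped` and `numpy_grouped` are illustrative, and the sketch covers only float data without missing values, so it does not address the dtype or tie-handling points above.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.random(200_000))
labels = rng.integers(0, 100, size=len(values))  # 100 illustrative groups


def pandas_grouped():
    # Grouped quantile through the public pandas API.
    return values.groupby(labels).quantile(0.25)


def numpy_grouped():
    # Same grouping by hand: sort by label, split at label boundaries,
    # then call np.quantile once per group.
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_values = values.to_numpy()[order]
    boundaries = np.flatnonzero(np.diff(sorted_labels)) + 1
    groups = np.split(sorted_values, boundaries)
    return np.array([np.quantile(group, 0.25) for group in groups])


results_match = np.allclose(pandas_grouped().to_numpy(), numpy_grouped())
print("results match:", results_match)

# Profile the pandas path and list the most expensive calls by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
pandas_grouped()
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Time spent inside compiled extension code shows up in the report only as a single call, so a line-level view of the Cython routine would need a different tool.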

jbrockmendel added the quantile label Oct 22, 2019

@mroeschke
Member

Seems like efforts here have stalled, and the issue may no longer be relevant with recent versions of pandas or numpy. Closing until we have more recent profiling results.
