Performance issue with timeseries plotting on py3? #11831

Closed
jorisvandenbossche opened this issue Dec 12, 2015 · 14 comments

3 participants
@jorisvandenbossche
Member

commented Dec 12, 2015

I noticed a performance issue with plotting timeseries. After trying several environments (different pandas, matplotlib and Python versions), the problem appears to be specific to Python 3, where plotting is up to 10x slower than on Python 2.7:

Python 2 and pandas 0.16.2 and 0.17.1:

In [2]: sys.version
Out[2]: '2.7.10 |Anaconda 1.7.0 (64-bit)| (default, Oct 21 2015, 19:35:23) [MSC v.1500 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: '0.16.2'

In [4]: matplotlib.__version__
Out[4]: '1.4.3'

In [6]: N = 2000

In [7]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975',  periods=N))

In [8]: %timeit df.plot()
1 loops, best of 3: 228 ms per loop

In [10]: df = pd.DataFrame(np.random.randn(N, 5))

In [11]: %timeit df.plot()
10 loops, best of 3: 110 ms per loop

In [1]: import sys

In [2]: sys.version
Out[2]: '2.7.11 |Continuum Analytics, Inc.| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: u'0.17.1'

In [4]: matplotlib.__version__
Out[4]: '1.5.0'

In [5]: N = 2000

In [6]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975',  periods=N))

In [7]: %timeit df.plot()
1 loops, best of 3: 269 ms per loop

In [8]: df = pd.DataFrame(np.random.randn(N, 5))

In [9]: %timeit df.plot()
10 loops, best of 3: 139 ms per loop

With python 3, pandas 0.16.2 and 0.17.1:

In [2]: sys.version
Out[2]: '3.5.0 |Continuum Analytics, Inc.| (default, Dec  1 2015, 11:46:22) [MSC v.1900 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: '0.16.2'

In [4]: matplotlib.__version__
Out[4]: '1.5.0'

In [5]: N = 2000

In [6]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975',  periods=N))

In [7]: %timeit df.plot()
1 loops, best of 3: 1.02 s per loop

In [9]: df = pd.DataFrame(np.random.randn(N, 5))

In [10]: %timeit df.plot()
10 loops, best of 3: 143 ms per loop

In [1]: import sys

In [2]: sys.version
Out[2]: '3.5.0 |Anaconda 2.4.0 (64-bit)| (default, Nov  7 2015, 13:15:24) [MSC v.1900 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: '0.17.1'

In [4]: matplotlib.__version__
Out[4]: '1.5.0'

In [5]: N = 2000

In [6]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975', periods=N))

In [7]: %timeit df.plot()
1 loops, best of 3: 2.37 s per loop    <------------ !!!! 10x slower than on py2.7

In [8]: df = pd.DataFrame(np.random.randn(N, 5))

In [9]: %timeit df.plot()
10 loops, best of 3: 132 ms per loop
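
The timings above can also be collected outside IPython. The following is a minimal sketch, assuming pandas, NumPy and matplotlib are installed. The `time_plot` helper and the off-screen `Agg` backend are choices made for this sketch and are not part of the original sessions. The timed call also includes closing the figure, so absolute numbers will differ slightly from `%timeit df.plot()`.

```python
import timeit

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no GUI window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

N = 2000


def time_plot(df, repeat=3):
    """Return the best wall-clock time in seconds of `repeat` df.plot() calls."""
    def run():
        ax = df.plot()
        plt.close(ax.figure)  # avoid accumulating open figures between runs

    return min(timeit.repeat(run, number=1, repeat=repeat))


values = np.random.randn(N, 5)
df_dates = pd.DataFrame(values, index=pd.date_range("1/1/1975", periods=N))
df_ints = pd.DataFrame(values)

t_dates = time_plot(df_dates)
t_ints = time_plot(df_ints)
print("datetime index: %.0f ms" % (t_dates * 1e3))
print("integer index:  %.0f ms" % (t_ints * 1e3))
print("ratio: %.1fx" % (t_dates / t_ints))
```

Running the same script under each interpreter gives directly comparable numbers for the datetime-indexed and integer-indexed frames.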
@jreback

Contributor

commented Dec 12, 2015

this is the period index conversion, which IIRC you fixed?

@jorisvandenbossche

Member Author

commented Dec 12, 2015

You are referring to #11194 I think. This fixed a perf regression before 0.17.0 was released (introduced between 0.16.2 and 0.17.0). Do you see a possible reason that this fix would not work on py3?

@RolandRitt

commented Jan 7, 2016

Has this issue been fixed or are there workarounds?
I have the same problem (Python 3 is about 16 times slower).

Are things moving on this issue?

@jorisvandenbossche

Member Author

commented Jan 7, 2016

@rittmeister123 Not that I know of (I haven't had time myself to dive into it).

If you want to try to profile, to see where the slowdown is coming from, that would be very welcome!

@jorisvandenbossche jorisvandenbossche added this to the 0.18.0 milestone Jan 7, 2016

@RolandRitt

commented Jan 8, 2016

I'm new here (my first entry :) ), so please excuse any formatting issues or other mistakes.

@jorisvandenbossche I did some profiling with the example from above:

My Setup for Python 3:

In [1]: sys.version
Out[1]: '3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 15:00:12) [MSC v.1900 64 bit (AMD64)]'

In[2]: pd.__version__
Out[2]: '0.17.1'

In[3]: matplotlib.__version__
Out[3]: '1.5.0'

In[4]: np.__version__
Out[4]: '1.10.1'

My Setup for Python 2:

In [1]: sys.version
Out[1]: '2.7.11 |Continuum Analytics, Inc.| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]'

In[2]: pd.__version__
Out[2]: u'0.17.1'

In[3]: matplotlib.__version__
Out[3]: '1.5.0'

In[4]: np.__version__
Out[4]: '1.10.1'

The dataframe is generated with:

In [1]: N = 2000
        df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975', periods=N))

Timeit on python2:

In [1]: %timeit df.plot()
Out[1]: 1 loops, best of 3: 111 ms per loop

Timeit on python3:

In [1]: %timeit df.plot()
Out[1]: 1 loops, best of 3: 1.01 s per loop

Attached you can find the Profiling files generated with

In[1]: %prun -D python2_df_plot.prof df.plot()

and

In[1]: %prun -D python3_df_plot.prof df.plot()

profiling_pandas_plot.zip

Please have a look at them, since I have no experience with profiling.
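
For readers who have not opened a `.prof` file before: `%prun -D` saves the profile with `dump_stats`, and the standard library's `pstats` module can load and sort such a file. A minimal sketch follows, using only the standard library. The `workload` function is a hypothetical stand-in for `df.plot()`, and the file name is arbitrary.

```python
import cProfile
import os
import pstats
import tempfile


def workload():
    # Hypothetical stand-in for df.plot(); any callable is profiled the same way.
    return sum(len(str(i)) for i in range(20000))


prof_path = os.path.join(tempfile.mkdtemp(), "example.prof")

# Comparable in spirit to `%prun -D example.prof workload()`.
profiler = cProfile.Profile()
profiler.runcall(workload)
profiler.dump_stats(prof_path)

# Load the dump and print the ten entries with the largest cumulative time.
stats = pstats.Stats(prof_path)
stats.strip_dirs().sort_stats("cumulative").print_stats(10)
```

Sorting by cumulative time tends to surface the call chain that dominates the run, which is the kind of drill-down needed to compare the Python 2 and Python 3 dumps.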

@jreback

Contributor

commented Jan 30, 2016

@rittmeister123 any luck with this?

@RolandRitt

commented Jan 30, 2016

Unfortunately not!
I'm quite new to Python and have not had time to take a close look at what's going on or to read the profiling output.

@jreback

Contributor

commented Feb 10, 2016

@jorisvandenbossche any thoughts on this?

@RolandRitt

commented Feb 10, 2016

FYI:
A few days ago I tried plotting a dataframe with the use_index parameter set to False and noticed a speed-up, but I haven't had time to verify and confirm it properly.
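
The workaround mentioned here can be tried as follows. This is a minimal sketch, assuming pandas, NumPy and matplotlib are installed and using the off-screen `Agg` backend as an assumption of the sketch. With `use_index=False` the data is plotted against row positions rather than the dates, which presumably avoids the index conversion, at the cost of losing the date labels on the x axis.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no GUI window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

N = 2000
df = pd.DataFrame(np.random.randn(N, 5),
                  index=pd.date_range("1/1/1975", periods=N))

# Plot against positions 0..N-1 instead of the DatetimeIndex.
ax = df.plot(use_index=False)
lines = ax.get_lines()
print(len(lines), "lines,", len(lines[0].get_xdata()), "points each")
plt.close(ax.figure)
```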

@jorisvandenbossche

Member Author

commented Feb 11, 2016

Did some quick profiling. One contributing factor, in any case, is the difference in performance of PeriodIndex._mpl_repr(), which is just a call to PeriodIndex._get_object_array:

In [8]: sys.version
Out[8]: '3.5.0 |Anaconda 2.4.0 (64-bit)| (default, Nov  7 2015, 13:15:24) [MSC v.1900 64 bit (AMD64)]'

In [9]: pd.__version__
Out[9]: '0.17.1'

In [10]: pidx = pd.period_range('1975-01-01', periods=2000)

In [11]: %timeit pidx._mpl_repr()
1 loops, best of 3: 461 ms per loop

vs

In [6]: sys.version
Out[6]: '2.7.11 |Anaconda 1.7.0 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]'

In [7]: pd.__version__
Out[7]: '0.16.2'

In [8]: pidx = pd.period_range('1975-01-01', periods=2000)

In [9]: %timeit pidx._mpl_repr()
100 loops, best of 3: 5.25 ms per loop

In turn, this boils down to calls to Period._from_ordinal():

In [12]: sys.version
Out[12]: '3.5.0 |Anaconda 2.4.0 (64-bit)| (default, Nov  7 2015, 13:15:24) [MSC v.1900 64 bit (AMD64)]'

In [13]: pd.__version__
Out[13]: '0.17.1'

In [14]: %timeit pd.Period._from_ordinal(ordinal=1, freq='D')
1000 loops, best of 3: 476 µs per loop

vs

In [6]: sys.version
Out[6]: '2.7.11 |Anaconda 1.7.0 (64-bit)| (default, Jan 29 2016, 14:26:21) [MSC v.1500 64 bit (AMD64)]'

In [7]: pd.__version__
Out[7]: '0.17.1+315.g62363d2'

In [8]: %timeit pd.Period._from_ordinal(ordinal=1, freq='D')
10000 loops, best of 3: 42.1 µs per loop

What would cause such a dramatic difference in the performance of the _from_ordinal method between Python 2 and 3 is still a mystery to me (and I don't have time to look into it further).
@jreback any idea?

@blbradley you did some work on the Period cython code. Do you by any chance have an idea where this performance difference could be coming from?

@jreback

Contributor

commented Feb 11, 2016

I think https://github.com/pydata/pandas/blob/master/pandas/src/period.pyx#L658

needs this instead of doing the string interpretation each time

if isinstance(freq, offsets.DateOffset):
    return freq

also the import can be replaced by offsets.to_offset (below)

as when _mpl_repr is called an already constructed freq object is passed (and not the string as in the case above).

There may be something else going on in the actual to_offset to explain the py2/3 diff though (as string conversion should be similar perf)
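
The suggested check can be sketched in isolation. The classes and functions below are hypothetical stand-ins written for illustration and are not the pandas source: `DateOffset` imitates an offset object, and `parse_freq_string` imitates the string interpretation step. The point is only that an already constructed offset takes the early return and never reaches the parsing code.

```python
class DateOffset:
    """Hypothetical stand-in for pandas.tseries.offsets.DateOffset."""

    def __init__(self, name):
        self.name = name


parse_calls = 0  # counts how often the slow path runs


def parse_freq_string(freq):
    """Hypothetical stand-in for the string-to-offset interpretation."""
    global parse_calls
    parse_calls += 1
    return DateOffset(freq.upper())


def maybe_convert_freq(freq):
    # Fast path: an already constructed offset is returned untouched.
    if isinstance(freq, DateOffset):
        return freq
    # Slow path: only strings go through the parsing step.
    return parse_freq_string(freq)


day = DateOffset("D")
same = maybe_convert_freq(day)    # no parsing happens here
parsed = maybe_convert_freq("d")  # parsing happens exactly once
print(same is day, parsed.name, parse_calls)
```

In the plotting case the frequency is already an offset object, so with such a check every element of the index would take the fast path.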

@jorisvandenbossche

Member Author

commented Feb 11, 2016

I actually did a wrong timeit, as the freq is already an offset in the case of plotting:

# python 3
In [24]: %timeit pd.Period._from_ordinal(ordinal=1, freq=pidx.freq)
1000 loops, best of 3: 233 µs per loop

In [25]: type(pidx.freq)
Out[25]: pandas.tseries.offsets.Day

vs

In [11]: %timeit pd.Period._from_ordinal(ordinal=1, freq=pidx.freq)
The slowest run took 6.19 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 5.5 µs per loop

which makes the difference even larger

@jorisvandenbossche

Member Author

commented Feb 11, 2016

Strange, I don't see a difference in performance of to_offset, while there is in _maybe_convert_freq:

python 3:

In [43]: %timeit pd.Period._maybe_convert_freq(pidx.freq)
1000 loops, best of 3: 200 µs per loop

In [44]: %timeit to_offset(pidx.freq)
The slowest run took 10.28 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 479 ns per loop

vs python 2:

In [19]:  %timeit pd.Period._maybe_convert_freq(pidx.freq)
The slowest run took 6.87 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 4.42 µs per loop

In [20]: %timeit to_offset(pidx.freq)
The slowest run took 13.08 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 502 ns per loop
@jreback

Contributor

commented Feb 11, 2016

try removing the import in _maybe_convert_freq (use offsets.to_offset instead)
would still add the check for DateOffset, as it's doing extra work
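
One way to check whether an import inside a function matters is to time the same work with and without it. The sketch below uses only the standard library, with `decimal.Decimal` as an arbitrary stand-in for `to_offset`; both function names are hypothetical. Whether this effect accounts for the py2/py3 gap is not established in this thread, so the script only measures and draws no conclusion.

```python
import timeit
from decimal import Decimal


def convert_with_local_import(value):
    # The import statement is executed again on every call.
    from decimal import Decimal as _Decimal
    return _Decimal(value)


def convert_with_module_import(value):
    # Uses the name that was bound once when the module was loaded.
    return Decimal(value)


t_local = timeit.timeit(lambda: convert_with_local_import("1.5"), number=100000)
t_module = timeit.timeit(lambda: convert_with_module_import("1.5"), number=100000)
print("local import:  %.1f ms" % (t_local * 1e3))
print("module import: %.1f ms" % (t_module * 1e3))
```

Running it under both interpreters shows how much the per-call import costs in each.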

@jreback jreback modified the milestones: 0.18.1, 0.18.0 Feb 27, 2016

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 25, 2016

@jreback jreback closed this in 7fbc600 Apr 26, 2016

nps added a commit to nps/pandas that referenced this issue May 17, 2016

PERF: Fix performance issues when creating multiple instances of Period
closes pandas-dev#12903
closes pandas-dev#11831

Author: rs2 <rootsumsquared@gmail.com>

Closes pandas-dev#12909 from rs2/master and squashes the following commits:

0d9712d [rs2] Make RESO constants global in period.pyx and reduce the number of loops in asv_benchmarks/period.py
1c5a2ab [rs2] Added asv benchmark for Period, PeriodIndex
8bcfd57 [rs2] Reworded whatsnew
8f254e3 [rs2] Added a whatsnew entry + ensured constants are imported correctly by test_tslib.py
5b3e291 [rs2] Moved constants to frequencies.py
fec1b51 [rs2] Fix performance issues when creating multiple instances of Period (pandas-dev#12903, pandas-dev#11831)