df.plot() very slow compared to explicit matplotlib on large dataframes #18236

fredrik-1 · 2017-11-12T10:20:43Z

import time

import numpy as np
import pandas as pd
import matplotlib
# matplotlib.use('tkagg')
import matplotlib.pyplot as plt
# plt.ion()


class Timer:
    """Print the time elapsed since the previous call (or since creation)."""

    def __init__(self):
        self.time0 = time.time()

    def __call__(self, label):
        print(label + str(time.time() - self.time0))
        self.time0 = time.time()


df = pd.DataFrame({'a': np.random.randn(1000000)})
fig = plt.figure(1)
fig.clf()

timer = Timer()
ax0 = fig.add_subplot(2, 1, 1)
ax0.plot(df.index, df['a'])
timer('matplotlib took: ')

ax2 = fig.add_subplot(2, 1, 2)
df.plot(legend=None, ax=ax2)
timer('pandas took:     ')
fig.canvas.draw()

Result on my slow computer:
matplotlib took:0.48404979705810547
pandas took: 47.75511360168457

Why is df.plot() much slower than the explicit matplotlib plot when plotting the same dataframe?
The above is a simple example, but I saw the same difference on real-world data with times as the index, when run on a much faster computer with more memory (a recent Anaconda download).

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 20 Model 2 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger · 2017-11-12T12:46:33Z

> Why is df.plot() much slower than the explicit matplotlib plot when plotting the same dataframe?

Could you do some profiling to look where times differ?

fredrik-1 · 2017-11-12T14:03:15Z

I am new to profiling in Python, but I guess the (unnecessary) time is spent in:

        1    0.000    0.000   70.357   70.357 pandas\plotting\_core.py:2611(__call__)
        1    0.000    0.000   70.357   70.357 pandas\plotting\_core.py:1850(plot_frame)
        1    0.000    0.000   70.357   70.357 pandas\plotting\_core.py:1640(_plot)
        1    0.162    0.162   70.356   70.356 pandas\plotting\_core.py:241(generate)
        1    0.613    0.613   69.868   69.868 pandas\plotting\_core.py:369(_post_plot_logic_common)
        1    4.507    4.507   69.184   69.184 pandas\plotting\_core.py:371(<listcomp>)
  1000001    9.924    0.000   64.677    0.000 pandas\io\formats\printing.py:157(pprint_thing)
  1000001   14.808    0.000   38.582    0.000 pandas\io\formats\printing.py:186(as_escaped_unicode)
  1000000    3.648    0.000   22.603    0.000 numpy\core\numeric.py:1905(array_str)
  1000000    8.743    0.000   18.955    0.000 numpy\core\arrayprint.py:381(wrapper)
  1000000    6.561    0.000    7.512    0.000 numpy\core\arrayprint.py:399(array2string)
  2000696    7.280    0.000    7.280    0.000 {built-in method builtins.hasattr}
  1000001    4.571    0.000    6.814    0.000 pandas\core\dtypes\inference.py:396(is_sequence)
  3003869    3.259    0.000    3.270    0.000 {built-in method builtins.isinstance}
  1001993    2.247    0.000    2.247    0.000 {built-in method builtins.iter}
  1000000    0.951    0.000    0.951    0.000 {method 'item' of 'numpy.ndarray' objects}
  1000000    0.718    0.000    0.718    0.000 {method 'discard' of 'set' objects}
  1000000    0.674    0.000    0.674    0.000 {built-in method _thread.get_ident}
  1000002    0.656    0.000    0.656    0.000 {method 'add' of 'set' objects}
  1000686    0.652    0.000    0.652    0.000 {built-in method builtins.id}
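
For anyone who wants to collect a similar profile, here is a minimal sketch using the standard-library cProfile and pstats modules. The helper profile_call and the stand-in workload stringify_all are hypothetical names made up for this illustration; the workload only mimics the one-string-per-element pattern seen in the listing and is not the pandas code itself.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the outermost expensive calls come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def stringify_all(n):
    # Stand-in workload (assumption): one str() call per element.
    return [str(i) for i in range(n)]


labels, report = profile_call(stringify_all, 100_000)
print(report)
```

To profile the plot from the first post instead, something like profile_call(df.plot, legend=None, ax=ax2) should work, assuming df and ax2 are defined as in that script.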

fredrik-1 · 2017-11-12T14:25:06Z

So pandas runs a list comprehension over every element of the index, calling pprint_thing on each one to build a unicode string (or something similar). It is not surprising that this takes time with millions of rows in the data frame.

Is this necessary when you (at least in my case) can send the data directly to matplotlib?
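
To illustrate why stringifying the whole index dominates the run time, here is a rough sketch that compares building a label for every index entry with building labels only at the tick positions. It uses plain str() as a stand-in for pprint_thing, a range as a stand-in for data.index, and a made-up list of tick positions, so the timings are only indicative.

```python
import time

index = range(1_000_000)  # stand-in for data.index (assumption)
ticks = [0, 200_000, 400_000, 600_000, 800_000]  # hypothetical tick positions

# Pattern seen in the profile: one string per index entry, then a lookup.
start = time.perf_counter()
all_labels = dict(enumerate(str(key) for key in index))
labels_all = [all_labels.get(x, '') for x in ticks]
t_all = time.perf_counter() - start

# Alternative: build strings only for the positions that get a tick.
start = time.perf_counter()
labels_ticks = [str(index[x]) for x in ticks]
t_ticks = time.perf_counter() - start

print('stringify everything: ', t_all)
print('stringify ticks only: ', t_ticks)
print(labels_all == labels_ticks)
```

Both approaches give the same labels here because every tick is a valid integer position; the second one just does a handful of conversions instead of a million.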

fredrik-1 · 2017-11-12T15:42:10Z

The following change to _post_plot_logic_common seems to work in my example:

def _post_plot_logic_common(self, ax, data):
    """Common post process for each axes"""
    # labels = [pprint_thing(key) for key in data.index]
    # labels = dict(zip(range(len(data.index)), labels))

    if self.orientation == 'vertical' or self.orientation is None:
        if self._need_to_set_index:
            xticklabels = [pprint_thing(data.index[x]) for x in ax.get_xticks()]
            # xticklabels = [labels.get(x, '') for x in ax.get_xticks()]
            ax.set_xticklabels(xticklabels)
        self._apply_axis_properties(ax.xaxis, rot=self.rot,
                                    fontsize=self.fontsize)
        self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
    elif self.orientation == 'horizontal':
        if self._need_to_set_index:
            yticklabels = [pprint_thing(data.index[y]) for y in ax.get_yticks()]
            # yticklabels = [labels.get(y, '') for y in ax.get_yticks()]
            ax.set_yticklabels(yticklabels)
        self._apply_axis_properties(ax.yaxis, rot=self.rot,
                                    fontsize=self.fontsize)
        self._apply_axis_properties(ax.xaxis, fontsize=self.fontsize)
    else:  # pragma no cover
        raise ValueError

However, this version has a problem if get_xticks() doesn't return integer index positions. I don't really understand how the old code works well in that case either: returning '' is of course better than an IndexError, but I would think it would be good to have a tick there in that case too.
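
One possible way to keep the '' fallback without stringifying the whole index is sketched below. The helper tick_labels is a hypothetical name, str() stands in for pprint_thing, and this is not what pandas does internally; it only labels ticks that fall exactly on a valid integer position and returns '' for everything else.

```python
def tick_labels(index, ticks):
    """Label each tick with index[tick] if tick is a valid integer position, else ''."""
    labels = []
    for tick in ticks:
        pos = int(tick)
        # Accept only whole-number ticks that fall inside the index.
        if pos == tick and 0 <= pos < len(index):
            labels.append(str(index[pos]))
        else:
            labels.append('')
    return labels


index = ['a', 'b', 'c', 'd']
print(tick_labels(index, [-1.0, 0.0, 0.5, 1.0, 3.0, 4.0]))
# ['', 'a', '', 'b', 'd', '']
```

This gives the same result as the labels.get(x, '') lookup in the commented-out lines, since a float tick like 1.0 matches the integer key 1 in a dict, but it does so without building a label for every row.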

TomAugspurger · 2017-11-13T14:30:57Z

Not sure either. If your index is meaningless, we have the use_index keyword, which may be faster for you.
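
For reference, a minimal sketch of passing use_index=False to df.plot. The Agg backend and the small frame size are choices made for this illustration only; with use_index=False the x values are just the positions 0..n-1 rather than the index values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only for this example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(1000)})

fig, ax = plt.subplots()
# Ignore the index and plot against positions 0..n-1.
df.plot(use_index=False, legend=False, ax=ax)

n_points = len(ax.lines[0].get_xdata())
print(n_points)
plt.close(fig)
```

Whether this actually avoids the slow path depends on the pandas version; I have not timed it here.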

jorisvandenbossche · 2017-11-20T13:33:38Z

Yeah, it is indeed a bit strange, as in many cases get_xticks doesn't return a valid index. I think that is why the current code has the '' fallback, so that it does not error, and I suppose the empty strings are then overwritten with the correct labels somewhere else (but I didn't look into that further).

Anyhow, I have a PR that should fix this: #18373

TomAugspurger added the Performance (memory or execution speed performance) and Visualization (plotting) labels Nov 13, 2017

TomAugspurger added this to the Next Major Release milestone Nov 13, 2017

TomAugspurger mentioned this issue Nov 20, 2017

PERF: improve plotting performance by not stringifying all x data #18373

Merged

jorisvandenbossche modified the milestones: Next Major Release, 0.21.1 Nov 20, 2017

jorisvandenbossche closed this as completed in #18373 Nov 20, 2017
