df.plot() very slow compared to explicit matplotlib on large dataframes #18236

Closed
fredrik-1 opened this issue Nov 12, 2017 · 6 comments · Fixed by #18373
Labels: Performance (memory or execution speed performance), Visualization (plotting)

Comments

@fredrik-1

import numpy as np
import pandas as pd
import matplotlib
#matplotlib.use('tkagg')
import matplotlib.pyplot as plt
import time
#plt.ion()

class Timer():
    def __init__(self):
        self.time0=time.time()
    def __call__(self, str_):
        print(str_+str(time.time()-self.time0))
        self.time0=time.time()
        
df=pd.DataFrame({'a':np.random.randn(1000000)})
fig=plt.figure(1)
fig.clf()
    
timer=Timer()
ax0=fig.add_subplot(2,1,1)
ax0.plot(df.index,df['a'])
timer('matplotlib took: ')

ax2=fig.add_subplot(2,1,2)
df.plot(legend=None, ax=ax2)
timer('pandas took:     ')
fig.canvas.draw()

Result on my slow computer:
matplotlib took:0.48404979705810547
pandas took: 47.75511360168457

Why is df.plot() so much slower than the explicit matplotlib plot when plotting the same dataframe?
The above is a simple example, but I encountered the same difference on real-world data with times as the index when running on a much faster computer with more memory (a recent Anaconda download).

INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: AMD64 Family 20 Model 2 Stepping 0, AuthenticAMD byteorder: little LC_ALL: None LANG: en LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Why is df.plot() so much slower than the explicit matplotlib plot when plotting the same dataframe?

Could you do some profiling to look where times differ?
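One way to gather such a profile is with the standard library's cProfile and pstats modules. The sketch below is only an illustration: `slow_labels` and `profile_call` are hypothetical names, and `slow_labels` is a stand-in workload so the snippet runs without a plotting backend. In this thread the profiled call would be `df.plot(legend=None, ax=ax2)`.

```python
import cProfile
import io
import pstats


def slow_labels(n):
    """Hypothetical stand-in workload: format every element of a range."""
    return [str(i) for i in range(n)]


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the outermost expensive calls come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


labels, report = profile_call(slow_labels, 100_000)
print(report)
```

Sorting by cumulative time tends to surface the call chain that dominates the run, which is usually the quickest way to see where two code paths diverge.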

@fredrik-1
Author

I am new to profiling in Python, but I guess the (unnecessary) time is spent in:

        1    0.000    0.000   70.357   70.357  pandas\plotting\_core.py:2611(__call__)
        1    0.000    0.000   70.357   70.357  pandas\plotting\_core.py:1850(plot_frame)
        1    0.000    0.000   70.357   70.357  pandas\plotting\_core.py:1640(_plot)
        1    0.162    0.162   70.356   70.356  pandas\plotting\_core.py:241(generate)
        1    0.613    0.613   69.868   69.868  pandas\plotting\_core.py:369(_post_plot_logic_common)
        1    4.507    4.507   69.184   69.184  pandas\plotting\_core.py:371(<listcomp>)
  1000001    9.924    0.000   64.677    0.000  pandas\io\formats\printing.py:157(pprint_thing)
  1000001   14.808    0.000   38.582    0.000  pandas\io\formats\printing.py:186(as_escaped_unicode)
  1000000    3.648    0.000   22.603    0.000  numpy\core\numeric.py:1905(array_str)
  1000000    8.743    0.000   18.955    0.000  numpy\core\arrayprint.py:381(wrapper)
  1000000    6.561    0.000    7.512    0.000  numpy\core\arrayprint.py:399(array2string)
  2000696    7.280    0.000    7.280    0.000 {built-in method builtins.hasattr}
  1000001    4.571    0.000    6.814    0.000  pandas\core\dtypes\inference.py:396(is_sequence)
  3003869    3.259    0.000    3.270    0.000 {built-in method builtins.isinstance}
  1001993    2.247    0.000    2.247    0.000 {built-in method builtins.iter}
  1000000    0.951    0.000    0.951    0.000 {method 'item' of 'numpy.ndarray' objects}
  1000000    0.718    0.000    0.718    0.000 {method 'discard' of 'set' objects}
  1000000    0.674    0.000    0.674    0.000 {built-in method _thread.get_ident}
  1000002    0.656    0.000    0.656    0.000 {method 'add' of 'set' objects}
  1000686    0.652    0.000    0.652    0.000 {built-in method builtins.id}

@fredrik-1
Author

fredrik-1 commented Nov 12, 2017

So pandas runs a list comprehension over every element of the index, calling pprint_thing on each one to make unicode strings (or something similar). It is not that strange that this takes time with millions of rows in the data frame.

Is it necessary when you (at least in my case) can send the data directly to matplotlib?
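The cost difference described above can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the pandas code: `str` replaces `pprint_thing`, and the tick positions are made up, since matplotlib chooses its own. It compares building a label for every row with building labels only for the positions that are actually drawn.

```python
import time

import numpy as np

index = np.arange(1_000_000)
# Assumed tick positions; matplotlib typically picks only a handful.
ticks = [0, 200_000, 400_000, 600_000, 800_000]

start = time.perf_counter()
# Eager: build a label for every row, as the profiled list comprehension does.
all_labels = dict(zip(range(len(index)), (str(key) for key in index)))
eager_labels = [all_labels.get(x, "") for x in ticks]
eager_seconds = time.perf_counter() - start

start = time.perf_counter()
# Lazy: build labels only for the positions that will be drawn.
lazy_labels = [str(index[x]) for x in ticks]
lazy_seconds = time.perf_counter() - start

print(f"eager: {eager_seconds:.3f}s, lazy: {lazy_seconds:.6f}s")
```

Both approaches produce the same labels for these ticks; only the amount of work differs.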

@fredrik-1
Author

fredrik-1 commented Nov 12, 2017

The following change to _post_plot_logic_common seems to work in my example:

def _post_plot_logic_common(self, ax, data):
        """Common post process for each axes"""
#        labels = [pprint_thing(key) for key in data.index]
#        labels = dict(zip(range(len(data.index)), labels))

        if self.orientation == 'vertical' or self.orientation is None:
            if self._need_to_set_index:
                xticklabels = [pprint_thing(data.index[x]) for x in ax.get_xticks()]
#                xticklabels = [labels.get(x, '') for x in ax.get_xticks()]
                ax.set_xticklabels(xticklabels)
            self._apply_axis_properties(ax.xaxis, rot=self.rot,
                                        fontsize=self.fontsize)
            self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
        elif self.orientation == 'horizontal':
            if self._need_to_set_index:
                yticklabels = [pprint_thing(data.index[y]) for y in ax.get_yticks()]
#                yticklabels = [labels.get(y, '') for y in ax.get_yticks()]
                ax.set_yticklabels(yticklabels)
            self._apply_axis_properties(ax.yaxis, rot=self.rot,
                                        fontsize=self.fontsize)
            self._apply_axis_properties(ax.xaxis, fontsize=self.fontsize)
        else:  # pragma no cover
            raise ValueError

However, this version has a problem if get_xticks() doesn't return integer positions that are valid for the index. I don't really understand how the old code works well in that case either: returning '' is of course better than an IndexError, but I would think it would be good to have a tick label there in that case as well.
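The concern about ticks that are not valid integer positions can be handled with a guarded lookup. The helper below is a hypothetical sketch, not the pandas implementation and not necessarily what the eventual fix does: `tick_labels` and its `formatter` parameter are names invented here, and `str` again stands in for `pprint_thing`. It formats only the ticks that map to an existing position and keeps the '' fallback for everything else.

```python
def tick_labels(index, ticks, formatter=str):
    """Return one label per tick, '' when a tick is not a valid position."""
    labels = []
    for tick in ticks:
        position = int(tick)
        # Only exact, in-range integer positions map to an index entry.
        if position == tick and 0 <= position < len(index):
            labels.append(formatter(index[position]))
        else:
            labels.append("")
    return labels


index = ["a", "b", "c", "d"]
print(tick_labels(index, [-1.0, 0.0, 1.5, 3.0, 4.0]))
```

This keeps the work proportional to the number of ticks rather than the number of rows, while avoiding an IndexError for negative, fractional, or out-of-range ticks.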

@TomAugspurger added the Performance and Visualization labels Nov 13, 2017
@TomAugspurger added this to the Next Major Release milestone Nov 13, 2017
@TomAugspurger
Contributor

Not sure either. If your index is meaningless, we have the use_index keyword, which may be faster for you.
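A minimal sketch of the use_index keyword follows. It assumes a headless environment (hence the Agg backend) and uses a small frame so it runs quickly. Whether it actually speeds up the case reported here depends on the pandas version, so it is worth timing before relying on it.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(1000)})

fig, ax = plt.subplots()
# use_index=False plots against positions 0..n-1 instead of the index values.
df.plot(use_index=False, legend=False, ax=ax)

line = ax.get_lines()[0]
print(len(line.get_xdata()))
plt.close(fig)
```

With a default RangeIndex the plotted positions match the index values, so the resulting figure looks the same as the default plot.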

@jorisvandenbossche
Member

Yeah, indeed it is a bit strange, as in many cases get_xticks doesn't return a valid index. I think that is why the current code has the '' fallback, so that it does not error, and I suppose somewhere else those empty strings are then overwritten with the correct labels (but I didn't look into that further).

Anyhow, I have a PR that should fix this: #18373

@jorisvandenbossche modified the milestones: Next Major Release, 0.21.1 Nov 20, 2017