df.plot() very slow compared to explicit matplotlib on large dataframes #18236

Closed
fredrik-1 opened this Issue Nov 12, 2017 · 6 comments

fredrik-1 commented Nov 12, 2017

import time

import numpy as np
import pandas as pd
import matplotlib
# matplotlib.use('tkagg')
import matplotlib.pyplot as plt
# plt.ion()


class Timer:
    """Print the time elapsed since the previous call (or construction)."""

    def __init__(self):
        self.time0 = time.time()

    def __call__(self, label):
        print(label + str(time.time() - self.time0))
        self.time0 = time.time()


df = pd.DataFrame({'a': np.random.randn(1000000)})
fig = plt.figure(1)
fig.clf()

timer = Timer()

# Plot directly with matplotlib.
ax0 = fig.add_subplot(2, 1, 1)
ax0.plot(df.index, df['a'])
timer('matplotlib took: ')

# Plot the same data through pandas.
ax2 = fig.add_subplot(2, 1, 2)
df.plot(legend=None, ax=ax2)
timer('pandas took:     ')
fig.canvas.draw()

Result on my slow computer:
matplotlib took: 0.48404979705810547
pandas took:     47.75511360168457

Why is df.plot() so much slower than the explicit matplotlib plot when plotting the same dataframe?
The above is a simple example, but I encountered the same difference on real-world data with times as the index, when run on a much faster computer with more memory (a recent Anaconda download).

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 20 Model 2 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Contributor

TomAugspurger commented Nov 12, 2017

> Why is the df.plot() much slower than the explicit matplotlib plot when plotting the same dataframe?

Could you do some profiling to look where times differ?
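For anyone following along, here is a minimal sketch of one way to collect such a profile with the standard library's cProfile and pstats modules. The function `format_all_labels` is a hypothetical stand-in for the slow call; in this issue it would be replaced by the `df.plot(...)` call from the report.

```python
import cProfile
import io
import pstats


def format_all_labels(n):
    # Hypothetical stand-in for the call under investigation
    # (in the issue this would be df.plot(legend=None, ax=ax2)).
    return [str(i) for i in range(n)]


profiler = cProfile.Profile()
profiler.enable()
labels = format_all_labels(100_000)
profiler.disable()

# Sort by cumulative time so the call chain leading to the hot spot
# appears at the top of the report.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time tends to be the more useful view here, because it shows which high-level call contains the expensive work rather than only the innermost built-ins.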

fredrik-1 commented Nov 12, 2017

I am new to profiling in Python, but I guess the (unnecessary) time is spent in:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   70.357   70.357 pandas\plotting\_core.py:2611(__call__)
        1    0.000    0.000   70.357   70.357 pandas\plotting\_core.py:1850(plot_frame)
        1    0.000    0.000   70.357   70.357 pandas\plotting\_core.py:1640(_plot)
        1    0.162    0.162   70.356   70.356 pandas\plotting\_core.py:241(generate)
        1    0.613    0.613   69.868   69.868 pandas\plotting\_core.py:369(_post_plot_logic_common)
        1    4.507    4.507   69.184   69.184 pandas\plotting\_core.py:371(<listcomp>)
  1000001    9.924    0.000   64.677    0.000 pandas\io\formats\printing.py:157(pprint_thing)
  1000001   14.808    0.000   38.582    0.000 pandas\io\formats\printing.py:186(as_escaped_unicode)
  1000000    3.648    0.000   22.603    0.000 numpy\core\numeric.py:1905(array_str)
  1000000    8.743    0.000   18.955    0.000 numpy\core\arrayprint.py:381(wrapper)
  1000000    6.561    0.000    7.512    0.000 numpy\core\arrayprint.py:399(array2string)
  2000696    7.280    0.000    7.280    0.000 {built-in method builtins.hasattr}
  1000001    4.571    0.000    6.814    0.000 pandas\core\dtypes\inference.py:396(is_sequence)
  3003869    3.259    0.000    3.270    0.000 {built-in method builtins.isinstance}
  1001993    2.247    0.000    2.247    0.000 {built-in method builtins.iter}
  1000000    0.951    0.000    0.951    0.000 {method 'item' of 'numpy.ndarray' objects}
  1000000    0.718    0.000    0.718    0.000 {method 'discard' of 'set' objects}
  1000000    0.674    0.000    0.674    0.000 {built-in method _thread.get_ident}
  1000002    0.656    0.000    0.656    0.000 {method 'add' of 'set' objects}
  1000686    0.652    0.000    0.652    0.000 {built-in method builtins.id}

fredrik-1 commented Nov 12, 2017

So pandas runs a list comprehension over every element of the index, calling pprint_thing to build a unicode string for each one (or something similar). It is no surprise that this takes time with millions of rows in the dataframe.

Is that necessary when (at least in my case) the data can be sent directly to matplotlib?
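To illustrate the difference in scale, here is a rough sketch that compares formatting a label for every row against formatting labels only at the tick positions. Everything in it is an assumption made for illustration: `format_label` is a stand-in for pprint_thing (plain `str()` is far cheaper than the real function), and the tick positions are made up.

```python
import time


def format_label(value):
    # Hypothetical stand-in for pandas' pprint_thing.
    return str(value)


index = range(1_000_000)  # stand-in for data.index
tick_positions = [0, 200_000, 400_000, 600_000, 800_000]  # made-up ticks

# Eager: build one string per row, then pick out the few that are needed.
start = time.perf_counter()
all_labels = [format_label(key) for key in index]
eager = [all_labels[pos] for pos in tick_positions]
eager_seconds = time.perf_counter() - start

# Lazy: format only the rows that sit at a tick position.
start = time.perf_counter()
lazy = [format_label(index[pos]) for pos in tick_positions]
lazy_seconds = time.perf_counter() - start

print(f"eager: {eager_seconds:.4f}s, lazy: {lazy_seconds:.6f}s")
```

Both approaches produce the same five labels; only the amount of discarded work differs.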

fredrik-1 commented Nov 12, 2017

The following change to _post_plot_logic_common seems to work in my example:

def _post_plot_logic_common(self, ax, data):
    """Common post process for each axes"""
    # Old code: formatted a label for every element of the index.
    # labels = [pprint_thing(key) for key in data.index]
    # labels = dict(zip(range(len(data.index)), labels))

    if self.orientation == 'vertical' or self.orientation is None:
        if self._need_to_set_index:
            # New: format only the labels at the tick positions.
            xticklabels = [pprint_thing(data.index[x]) for x in ax.get_xticks()]
            # xticklabels = [labels.get(x, '') for x in ax.get_xticks()]
            ax.set_xticklabels(xticklabels)
        self._apply_axis_properties(ax.xaxis, rot=self.rot,
                                    fontsize=self.fontsize)
        self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
    elif self.orientation == 'horizontal':
        if self._need_to_set_index:
            # New: format only the labels at the tick positions.
            yticklabels = [pprint_thing(data.index[y]) for y in ax.get_yticks()]
            # yticklabels = [labels.get(y, '') for y in ax.get_yticks()]
            ax.set_yticklabels(yticklabels)
        self._apply_axis_properties(ax.yaxis, rot=self.rot,
                                    fontsize=self.fontsize)
        self._apply_axis_properties(ax.xaxis, fontsize=self.fontsize)
    else:  # pragma no cover
        raise ValueError

However, this version has a problem if get_xticks() does not return integer positions. I don't really understand how the old code works well in that case either: returning '' is of course better than an index error, but I would think it would be good to have a tick label there in that case as well.
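One way to keep the per-tick lookup while preserving the old '' fallback could look like the sketch below. `tick_labels` is a hypothetical helper written for this comment, not the pandas implementation and not necessarily what any eventual fix does; the tick values are made up in the style of what get_xticks() can return.

```python
def tick_labels(index, ticks, format_label=str):
    """Return one label per tick, or '' where the tick is not a valid position.

    Hypothetical helper for illustration only.
    """
    labels = []
    for tick in ticks:
        position = int(tick)
        # Tick locations are floats and may be fractional, negative,
        # or past the end of the data.
        if position == tick and 0 <= position < len(index):
            labels.append(format_label(index[position]))
        else:
            labels.append("")
    return labels


index = ["a", "b", "c", "d"]
ticks = [-1.0, 0.0, 1.5, 3.0, 4.0]  # made-up tick locations
print(tick_labels(index, ticks))  # → ['', 'a', '', 'd', '']
```

This does only as much formatting work as there are ticks, and it never raises on out-of-range or fractional tick locations.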

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 13, 2017

Contributor

TomAugspurger commented Nov 13, 2017

Not sure either. If your index is meaningless, we have the use_index keyword, which may be faster for you.
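For reference, a minimal sketch of passing that keyword. `use_index` is an existing parameter of `DataFrame.plot`; the small dataframe and the Agg backend are assumptions chosen so the example runs quickly and headless. Whether this actually avoids the slow label formatting on the reporter's pandas version is not confirmed in this thread.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small frame with a non-default index so the effect is visible.
df = pd.DataFrame({"a": np.arange(5.0)}, index=[10, 20, 30, 40, 50])

fig, ax = plt.subplots()
df.plot(use_index=False, legend=False, ax=ax)

# With use_index=False the x values are row positions, not index values.
x_values = list(ax.lines[0].get_xdata())
print(x_values)
```

Note that the index values no longer appear on the x axis, so this only suits data where the index carries no meaning.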

Member

jorisvandenbossche commented Nov 20, 2017

Yeah, it is indeed a bit strange, since in many cases get_xticks doesn't return a valid index. I think that is why the current code has the '' fallback, so that it does not error, and I suppose those empty strings are then overwritten with the correct labels somewhere else (but I didn't look into that further).

Anyhow, I have a PR that should fix this: #18373

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.21.1 Nov 20, 2017
