Slow printing of large data frames #2807

Closed
cpcloud opened this Issue Feb 7, 2013 · 27 comments

Member

cpcloud commented Feb 7, 2013

I have DataFrames with about 14 million rows and 16 columns, and I have to wait at least 3-4 seconds for one to repr in an IPython session. Is there anything that can be done about this?

Owner

wesm commented Feb 7, 2013

Set DataFrame._verbose_info to False. This should be made configurable, or maybe we should just disable the null counts for frames with over 1 million rows.

Member

cpcloud commented Feb 7, 2013

@wesm For the configurable, could this be done by registering an option such as:

import pandas as pd

def check_verbose_info(x):
    # validator: reject anything that can't be interpreted as a boolean
    try:
        bool(x)
    except (TypeError, ValueError):
        raise ValueError('invalid value for frame.verbose_info')

pd.config.register_option('frame.verbose_info', True, check_verbose_info)

If so, where would this need to go?

Contributor

y-p commented Feb 8, 2013

@cpcloud, have a look at pandas/core/config_init.py

Member

cpcloud commented Feb 13, 2013

With the latest release this is slow even after setting _verbose_info.

Owner

wesm commented Feb 13, 2013

That's a bug, then (probably it checks whether the DataFrame is "too wide" before reaching that code). If no one fixes it before I have a chance, I'll get to it in the next couple of weeks.

Contributor

y-p commented Feb 22, 2013

The delay comes from the dtype count, which _verbose_info doesn't disable.
Despite a lot of changes, this has been behaving this way at least since 0.9.1.

I intended to put up a fix for review but ended up pushing to master by mistake
(sorry about that), so review is welcome.
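The dtype count behind this delay doesn't need to materialize any data. In current pandas, get_dtype_counts() has since been removed, and df.dtypes.value_counts() is the equivalent, copy-free way to get the same information (a sketch, not the fix under discussion):

```python
import numpy as np
import pandas as pd

# Count columns per dtype without copying any data. get_dtype_counts() was
# later removed from pandas; dtypes.value_counts() is the modern equivalent.
df = pd.DataFrame({
    'a': np.arange(3),          # int64
    'b': np.random.rand(3),     # float64
    'c': [True, False, True],   # bool
})

counts = df.dtypes.value_counts()
print(int(counts.sum()))  # 3 (one column of each dtype)
```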

Contributor

jreback commented Feb 22, 2013

I changed the way get_dtype_counts works in 0.11; there's a simple fix for this.

Contributor

jreback commented Feb 22, 2013

@cpcloud do you have a test example?
What version/commit are you running?

Contributor

jreback commented Feb 22, 2013

@y-p, saw your fix, thanks. This obviously shouldn't have been using as_blocks.

Contributor

y-p commented Feb 22, 2013

df.blocks is quite slow for wide frames, and it can be a nasty surprise to take that hit:
being a @property, it looks like you're just getting a reference to an existing object.
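The surprise described here — attribute-like access hiding an expensive copy — is easy to reproduce with a toy class (hypothetical code, not pandas internals):

```python
import numpy as np

class Frame:
    """Toy illustration: a @property that rebuilds a dict of copies on every
    access, so what looks like cheap attribute access is O(data) each time."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def blocks(self):
        # copies all the data on every access
        return {str(self._data.dtype): self._data.copy()}

f = Frame(np.random.rand(1000, 10))
b1 = f.blocks
b2 = f.blocks
print(b1['float64'] is b2['float64'])  # False: each access is a fresh copy
```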

Contributor

y-p commented Feb 22, 2013

I'll call this closed.

y-p closed this Feb 22, 2013

Member

cpcloud commented Feb 22, 2013

I'm always running the latest version :). @jreback You could do something like

from numpy.random import rand
from pandas import DataFrame

df = DataFrame(rand(int(1e7), 16))  # int() needed: newer numpy rejects float shapes
%timeit repr(df)

Member

cpcloud commented Feb 22, 2013

Interestingly, DataFrames whose columns are all bool dtype are twice as slow to repr as those whose columns are all float dtype. I timed this using %timeit repr(df). That said, I didn't measure this across arbitrary shapes; the factor of two was on a DataFrame with about 14 million rows.
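For reference, here is a standalone way to reproduce that comparison without IPython's %timeit (a sketch with the row count shrunk so it runs quickly; the 14-million-row figure above is from the original measurement):

```python
import timeit
import numpy as np
import pandas as pd

n = 100_000
floats = pd.DataFrame(np.random.rand(n, 16))
bools = pd.DataFrame(np.random.rand(n, 16) > 0.5)

# time repr() of each frame a few times and compare
t_float = timeit.timeit(lambda: repr(floats), number=3)
t_bool = timeit.timeit(lambda: repr(bools), number=3)
print(f'float: {t_float:.4f}s  bool: {t_bool:.4f}s')
```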

Member

cpcloud commented Feb 22, 2013

Would you guys still take a pull request for the config option of omitting non null info for frames with > 1e6 rows?
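For readers following along in current pandas: a knob along these lines did eventually land as the public display.max_info_rows option, which caps the row count above which DataFrame.info() skips the null counts (a sketch of the eventual public option, not the pull request under discussion):

```python
import pandas as pd

# Above display.max_info_rows rows, DataFrame.info() skips the expensive
# per-column null counts by default. Set it low to skip, high to compute.
pd.set_option('display.max_info_rows', 1_000_000)
print(pd.get_option('display.max_info_rows'))  # 1000000
pd.reset_option('display.max_info_rows')
```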

y-p reopened this Feb 22, 2013

Contributor

y-p commented Feb 22, 2013

Looks like the fix was only partial.

Contributor

jreback commented Feb 22, 2013

@cpcloud FYI, @y-p's revert of get_dtype_counts() fixed this. It was calling the .blocks method, which returns a dict of dtype -> homogeneous frame (new in 0.11); this was unnecessary and copied the data.

Contributor

y-p commented Feb 22, 2013

I'm confused; maybe I'm just hitting memory pressure on my machine.
@cpcloud, please pull git master and confirm that the problem went away for you.

Contributor

y-p commented Feb 22, 2013

before fix:

In [1]: a=pd.DataFrame(rand(1e7, 10))

In [2]: %timeit repr(a)
1 loops, best of 3: 5.06 s per loop

In [3]: a._verbose_info=False

In [4]: %timeit repr(a)
1 loops, best of 3: 2.79 s per loop

after fix:


In [6]: a=pd.DataFrame(rand(1e7, 10))

In [8]: %timeit repr(a)
1 loops, best of 3: 3.3 s per loop

In [9]: a._verbose_info=False

In [10]: %timeit repr(a)
1000 loops, best of 3: 539 µs per loop

@cpcloud, go ahead and open a pull request for an option setting a threshold for _verbose_info.

Member

cpcloud commented Feb 23, 2013

@y-p I can confirm that I get similar results to yours.

Member

cpcloud commented Feb 23, 2013

I have a pull request ready, minus tests. I'm not sure how to test a display configuration option in any non-kludgy way. My first thought was to assert that the string 'null' is not in the repr, but that seems like a very fragile way to test it.

Contributor

y-p commented Feb 23, 2013

go for an assertion on line count
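A line-count assertion along those lines might look like the following sketch, using the public info(verbose=...) toggle as a stand-in for the option under discussion:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3))

verbose_buf, terse_buf = io.StringIO(), io.StringIO()
df.info(verbose=True, buf=verbose_buf)   # one line per column
df.info(verbose=False, buf=terse_buf)    # short summary only

n_verbose = len(verbose_buf.getvalue().splitlines())
n_terse = len(terse_buf.getvalue().splitlines())
assert n_verbose > n_terse  # suppressing per-column info shrinks the output
```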

Member

cpcloud commented Feb 23, 2013

Hmm. Does the thresholding option obviate the need for DataFrame._verbose_info? In addition to the threshold I added a display.verbose_info option that works in conjunction with the row threshold option (display.max_info_rows), but now I'm thinking that any verbose_info option should just be done away with.

Contributor

y-p commented Feb 23, 2013

Seems reasonable, just as long as the threshold logic allows an "infinite" value;
maxint on 32-bit platforms might not be enough, for example.
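The "infinite" case can be handled with a sentinel rather than maxint. A hypothetical helper (function and parameter names are illustrative, not pandas API):

```python
import sys

def null_counts_enabled(n_rows, max_info_rows=None):
    # Hypothetical threshold check: None means "no threshold", so we never
    # depend on sys.maxsize, which is only ~2.1e9 on 32-bit builds.
    if max_info_rows is None:
        return True
    return n_rows <= max_info_rows

print(null_counts_enabled(10**12))         # True: no threshold set
print(null_counts_enabled(10**12, 10**6))  # False: above the threshold
```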

Member

cpcloud commented Feb 23, 2013

BTW, is it okay to include pep8 cleanups in pull requests? I'm using flake8 + flymake to get the pep8 violations.

Contributor

y-p commented Feb 23, 2013

If it's just around the area you touched, that's fine. If it's the whole file and
the diff is large, try to split the pep8 and the enhancement into separate
commits for easier review.

Member

cpcloud commented Feb 23, 2013

Okay. Sounds good. Thanks.

Contributor

y-p commented Mar 12, 2013

Half the fix is in master, and there's a pending PR #2918 for the other half, controlling verbose_info
via an option. Closing.

y-p closed this Mar 12, 2013
