Slow printing of large data frames #2807

Closed
cpcloud opened this Issue Feb 7, 2013 · 27 comments

Member

cpcloud commented Feb 7, 2013

I have DataFrames with about 14 million rows and 16 columns, and I have to wait at least 3-4 seconds for one to repr in an IPython session. Is there anything that can be done about this?

Owner

wesm commented Feb 7, 2013

Set DataFrame._verbose_info to False. This should be made configurable, or maybe we should just disable the null counts for frames with over 1 million rows.

Member

cpcloud commented Feb 7, 2013

@wesm For the configurable, could this be done by registering an option such as:

import pandas as pd

def check_verbose_info(x):
    # validator: reject anything that can't be interpreted as a boolean
    try:
        bool(x)
    except (TypeError, ValueError):
        raise ValueError('invalid value for frame.verbose_info')

pd.config.register_option('frame.verbose_info', True, check_verbose_info)

If so, where would this need to go?

Contributor

y-p commented Feb 8, 2013

@cpcloud, have a look at pandas/core/config_init.py

Member

cpcloud commented Feb 13, 2013

With the latest release this is slow even after setting _verbose_info.

Owner

wesm commented Feb 13, 2013

That's a bug, then (probably it checks whether the DataFrame is "too wide" before reaching that code). If no one fixes it before I have a chance, I'll get to it in the next couple of weeks.

Contributor

y-p commented Feb 22, 2013

The delay comes from the dtype count, which _verbose_info doesn't disable.
Despite a lot of changes, this has been behaving this way at least since 0.9.1.

I intended to put up a fix for review but ended up pushing to master by mistake
(sorry about that), so review is welcome.
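The dtype count behind this delay doesn't need to materialize any data. In current pandas, get_dtype_counts() has since been removed, and df.dtypes.value_counts() is the equivalent, copy-free way to get the same information (a sketch, not the fix under discussion):

```python
import numpy as np
import pandas as pd

# Count columns per dtype without copying any data. get_dtype_counts() was
# later removed from pandas; dtypes.value_counts() is the modern equivalent.
df = pd.DataFrame({
    'a': np.arange(3),          # int64
    'b': np.random.rand(3),     # float64
    'c': [True, False, True],   # bool
})

counts = df.dtypes.value_counts()
print(int(counts.sum()))  # 3 (one column of each dtype)
```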

Contributor

jreback commented Feb 22, 2013

I changed the way get_dtype_counts works in 0.11; there's a simple fix for this.

Contributor

jreback commented Feb 22, 2013

@cpcloud do you have a test example?
What version/commit are you running?

Contributor

jreback commented Feb 22, 2013

@y-p, saw your fix, thanks. This obviously shouldn't have been using as_blocks.

Contributor

y-p commented Feb 22, 2013

df.blocks is quite slow for wide frames, and it can be a nasty surprise to take that hit:
being a @property, it looks like you're just getting a reference to an existing object.
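The surprise described here — attribute-like access hiding an expensive copy — is easy to reproduce with a toy class (hypothetical code, not pandas internals):

```python
import numpy as np

class Frame:
    """Toy illustration: a @property that rebuilds a dict of copies on every
    access, so what looks like cheap attribute access is O(data) each time."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def blocks(self):
        # copies all the data on every access
        return {str(self._data.dtype): self._data.copy()}

f = Frame(np.random.rand(1000, 10))
b1 = f.blocks
b2 = f.blocks
print(b1['float64'] is b2['float64'])  # False: each access is a fresh copy
```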

Contributor

y-p commented Feb 22, 2013

I'll call this closed.

y-p closed this Feb 22, 2013

Member

cpcloud commented Feb 22, 2013

I'm always running the latest version :). @jreback You could do something like

from numpy.random import rand
from pandas import DataFrame

df = DataFrame(rand(int(1e7), 16))  # int() needed: newer numpy rejects float shapes
%timeit repr(df)

Member

cpcloud commented Feb 22, 2013

Interestingly, DataFrames whose columns are all bool dtype are twice as slow to repr as those whose columns are all float dtype. I timed this using %timeit repr(df). That said, I didn't measure this across arbitrary shapes; the factor of two was on a DataFrame with about 14 million rows.
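For reference, here is a standalone way to reproduce that comparison without IPython's %timeit (a sketch with the row count shrunk so it runs quickly; the 14-million-row figure above is from the original measurement):

```python
import timeit
import numpy as np
import pandas as pd

n = 100_000
floats = pd.DataFrame(np.random.rand(n, 16))
bools = pd.DataFrame(np.random.rand(n, 16) > 0.5)

# time repr() of each frame a few times and compare
t_float = timeit.timeit(lambda: repr(floats), number=3)
t_bool = timeit.timeit(lambda: repr(bools), number=3)
print(f'float: {t_float:.4f}s  bool: {t_bool:.4f}s')
```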

Member

cpcloud commented Feb 22, 2013

Would you guys still take a pull request for the config option of omitting non null info for frames with > 1e6 rows?
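For readers following along in current pandas: a knob along these lines did eventually land as the public display.max_info_rows option, which caps the row count above which DataFrame.info() skips the null counts (a sketch of the eventual public option, not the pull request under discussion):

```python
import pandas as pd

# Above display.max_info_rows rows, DataFrame.info() skips the expensive
# per-column null counts by default. Set it low to skip, high to compute.
pd.set_option('display.max_info_rows', 1_000_000)
print(pd.get_option('display.max_info_rows'))  # 1000000
pd.reset_option('display.max_info_rows')
```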

y-p reopened this Feb 22, 2013

Contributor

y-p commented Feb 22, 2013

Looks like the fix was only partial.

Contributor

jreback commented Feb 22, 2013

@cpcloud FYI, @y-p's revert of get_dtype_counts() fixed this. It was calling the .blocks method, which returns a dict of dtype -> homogeneous frame (new in 0.11); this was unnecessary and copied the data.

Contributor

y-p commented Feb 22, 2013

I'm confused; maybe I'm just hitting memory pressure on my machine.
@cpcloud, please pull git master and confirm that the problem went away for you.

Contributor

y-p commented Feb 22, 2013

before fix:

In [1]: a=pd.DataFrame(rand(1e7, 10))

In [2]: %timeit repr(a)
1 loops, best of 3: 5.06 s per loop

In [3]: a._verbose_info=False

In [4]: %timeit repr(a)
1 loops, best of 3: 2.79 s per loop

after fix:


In [6]: a=pd.DataFrame(rand(1e7, 10))

In [8]: %timeit repr(a)
1 loops, best of 3: 3.3 s per loop

In [9]: a._verbose_info=False

In [10]: %timeit repr(a)
1000 loops, best of 3: 539 µs per loop

@cpcloud, go ahead and open a pull request for an option setting a threshold for _verbose_info.

Member

cpcloud commented Feb 23, 2013

@y-p I can confirm that I get similar results to yours.

Member

cpcloud commented Feb 23, 2013

I have a pull request ready, minus tests. I'm not sure how to test a display configuration option in any non-kludgy way. My first thought was to assert that the string 'null' is not in the repr, but that seems like a very fragile way to test it.

Contributor

y-p commented Feb 23, 2013

go for an assertion on line count
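A line-count assertion along those lines might look like the following sketch, using the public info(verbose=...) toggle as a stand-in for the option under discussion:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3))

verbose_buf, terse_buf = io.StringIO(), io.StringIO()
df.info(verbose=True, buf=verbose_buf)   # one line per column
df.info(verbose=False, buf=terse_buf)    # short summary only

n_verbose = len(verbose_buf.getvalue().splitlines())
n_terse = len(terse_buf.getvalue().splitlines())
assert n_verbose > n_terse  # suppressing per-column info shrinks the output
```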

Member

cpcloud commented Feb 23, 2013

Hmm. Does the thresholding option obviate the need for DataFrame._verbose_info? In addition to the threshold I added a display.verbose_info option that works in conjunction with the row threshold option (display.max_info_rows), but now I'm thinking that any verbose_info option should just be done away with.

Contributor

y-p commented Feb 23, 2013

Seems reasonable, just as long as the threshold logic allows an "infinite" value;
maxint on 32-bit platforms might not be enough, for example.
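The "infinite" case can be handled with a sentinel rather than maxint. A hypothetical helper (function and parameter names are illustrative, not pandas API):

```python
import sys

def null_counts_enabled(n_rows, max_info_rows=None):
    # Hypothetical threshold check: None means "no threshold", so we never
    # depend on sys.maxsize, which is only ~2.1e9 on 32-bit builds.
    if max_info_rows is None:
        return True
    return n_rows <= max_info_rows

print(null_counts_enabled(10**12))         # True: no threshold set
print(null_counts_enabled(10**12, 10**6))  # False: above the threshold
```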

Member

cpcloud commented Feb 23, 2013

BTW, is it okay to include pep8 cleanups in pull requests? I'm using flake8 + flymake to get the pep8 violations.

Contributor

y-p commented Feb 23, 2013

If it's just around the area you touched, that's fine. If it's the whole file and
the diff is large, try to split the pep8 and the enhancement into separate
commits for easier review.

Member

cpcloud commented Feb 23, 2013

Okay. Sounds good. Thanks.

Contributor

y-p commented Mar 12, 2013

Half the fix is in master, and there's a pending PR #2918 for the other half, controlling verbose_info
via an option. Closing.

y-p closed this Mar 12, 2013
