
Slow printing of large data frames #2807

Closed · cpcloud opened this issue Feb 7, 2013 · 27 comments

@cpcloud (Member) commented Feb 7, 2013

I have DataFrames with about 14 million rows and 16 columns, and I have to wait at least 3-4 seconds for one to repr in an IPython session. Is there anything that can be done about this?

@wesm (Member) commented Feb 7, 2013

Set DataFrame._verbose_info to False. This should be made configurable, or maybe we should even just disable the null counts for frames over 1 million rows.
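A minimal sketch of this workaround in IPython, assuming a pandas build of this era where the private _verbose_info attribute exists (it is private, so it may change between versions):

import numpy as np
import pandas as pd

# Suppress the per-column null counts in the large-frame info repr.
pd.DataFrame._verbose_info = False

df = pd.DataFrame(np.random.rand(int(1e7), 16))
repr(df)  # should now skip the expensive null-count pass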

@cpcloud (Member, Author) commented Feb 7, 2013

@wesm For the configurable option, could this be done by registering an option such as:

from pandas.core import config

def check_verbose_info(x):
    # reject anything that cannot be coerced to bool
    try:
        bool(x)
    except (TypeError, ValueError):
        raise ValueError('invalid value for frame.verbose_info')

config.register_option('frame.verbose_info', True, validator=check_verbose_info)

If so, where would this need to go?

@ghost commented Feb 8, 2013

@cpcloud, have a look at pandas/core/config_init.py

@cpcloud (Member, Author) commented Feb 13, 2013

With the latest release this is slow even after setting _verbose_info.

@wesm (Member) commented Feb 13, 2013

That's a bug then (probably it is checking whether the DataFrame is "too wide" before reaching that code). If no one fixes the bug before I have a chance, I will get to it in the next couple of weeks.

@ghost commented Feb 22, 2013

The delay comes from the dtype count, which verbose_info doesn't disable.
Despite a lot of changes, it has behaved like this at least since 0.9.1.

I intended to put up a fix for review but ended up pushing to master by mistake
(sorry about that). So I welcome review.

@jreback (Contributor) commented Feb 22, 2013

I changed the way get_dtype_counts works in 0.11. Simple fix for this.

@jreback (Contributor) commented Feb 22, 2013

@cpcloud do you have a test example?
What version/commit are you running?

@jreback (Contributor) commented Feb 22, 2013

@y-p saw your fix, thanks... this obviously shouldn't have been using as_blocks.

@ghost commented Feb 22, 2013

df.blocks is quite slow for wide frames, and it can be a nasty surprise to take that hit
since, being a @property, it just looks like you're getting a reference to some existing object.
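An illustrative sketch of that trap, assuming a pandas of this era where DataFrame.blocks exists; the shape here is arbitrary:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 500))  # a wide frame

# Reads like cheap attribute access, but .blocks is a property that
# rebuilds a dtype -> homogeneous-DataFrame dict, copying the data
# on every call.
blocks = df.blocks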

@ghost commented Feb 22, 2013

I'll call this closed.

ghost closed this Feb 22, 2013
@cpcloud (Member, Author) commented Feb 22, 2013

I'm always running the latest version :). @jreback You could do something like

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.rand(int(1e7), 16))
%timeit repr(df)

@cpcloud (Member, Author) commented Feb 22, 2013

Interestingly, DataFrames whose columns are all bool dtype are twice as slow to repr as those whose columns are all float dtype. I timed this using %timeit repr(df). This is not for an arbitrarily shaped DataFrame; it was only twice as slow on a DataFrame with about 14 million rows.
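A sketch of that comparison in IPython; the shapes and the way the all-bool frame is built are my assumptions, not the exact frames timed here:

import numpy as np
import pandas as pd

n = int(14e6)  # roughly the row count reported above
floats = pd.DataFrame(np.random.rand(n, 16))
bools = pd.DataFrame(np.random.rand(n, 16) > 0.5)  # all-bool columns

%timeit repr(floats)
%timeit repr(bools)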

@cpcloud (Member, Author) commented Feb 22, 2013

Would you guys still take a pull request for the config option of omitting non-null info for frames with > 1e6 rows?

ghost reopened this Feb 22, 2013
@ghost commented Feb 22, 2013

Looks like the fix was only partial.

@jreback (Contributor) commented Feb 22, 2013

@cpcloud FYI, @y-p's revert of get_dtype_counts() fixed this. It was calling the .blocks method, which returns a dict of dtype -> homogeneous frame (new in 0.11); this was unnecessary and copied the data.

@ghost commented Feb 22, 2013

I'm confused. Maybe I'm just hitting memory pressure on my machine.
@cpcloud, please pull git master and confirm that the problem went away for you.

@ghost commented Feb 22, 2013

before fix:

In [1]: a=pd.DataFrame(rand(1e7, 10))

In [2]: %timeit repr(a)
1 loops, best of 3: 5.06 s per loop

In [3]: a._verbose_info=False

In [4]: %timeit repr(a)
1 loops, best of 3: 2.79 s per loop

after fix:


In [6]: a=pd.DataFrame(rand(1e7, 10))

In [8]: %timeit repr(a)
1 loops, best of 3: 3.3 s per loop

In [9]: a._verbose_info=False

In [10]: %timeit repr(a)
1000 loops, best of 3: 539 µs per loop

@cpcloud, go ahead and open a pull request for an option setting a threshold for _verbose_info.

@cpcloud (Member, Author) commented Feb 23, 2013

@y-p I can confirm that I get similar results to yours.

@cpcloud (Member, Author) commented Feb 23, 2013

I have a pull request ready minus tests. I'm not sure how to go about testing a display configuration option in any kind of non-kludgy way. My first thought was to assert that the string 'null' is not in the repr, but that seems like a very fragile way to test it.

@ghost commented Feb 23, 2013

Go for an assertion on the line count.
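A rough sketch of such a test, with assumptions: the option name display.max_info_rows is taken from the comment below, pd.option_context is used for convenience, and the frame shape that triggers the info repr is a guess:

import numpy as np
import pandas as pd

def test_max_info_rows_trims_repr():
    # Big enough that repr() falls back to the summary/info view.
    df = pd.DataFrame(np.random.rand(10000, 50))
    # Null counts enabled: threshold above the frame's length.
    with pd.option_context('display.max_info_rows', len(df) + 1):
        verbose_lines = repr(df).count('\n')
    # Null counts suppressed: threshold below the frame's length.
    with pd.option_context('display.max_info_rows', 1):
        terse_lines = repr(df).count('\n')
    assert terse_lines < verbose_lines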

@cpcloud (Member, Author) commented Feb 23, 2013

Hmm. Does the thresholding option obviate the need for DataFrame._verbose_info? In addition to the threshold, I added a display.verbose_info option that works in conjunction with the row-threshold option (display.max_info_rows), but now I'm thinking that any verbose_info option should just be done away with.
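A hypothetical sketch of how the two options might interact; both option names come from this thread, but the combination logic is only my reading of the proposal:

import pandas as pd

def should_show_null_counts(df):
    # Verbose info must be enabled AND the frame must be at or under
    # the row threshold.
    return (pd.get_option('display.verbose_info')
            and len(df) <= pd.get_option('display.max_info_rows'))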

@ghost commented Feb 23, 2013

Seems reasonable, just as long as the threshold logic allows an "infinite" value.
maxint on 32-bit platforms might not be enough, for example.

@cpcloud (Member, Author) commented Feb 23, 2013

BTW, is it okay to include PEP 8 cleanups in pull requests? I'm using flake8 + flymake to find the PEP 8 violations.

@ghost commented Feb 23, 2013

If it's just around the area you touched, that's fine. If it's the whole file and
the diff is large, try to split the PEP 8 fixes and the enhancement into separate
commits for easier review.

@cpcloud (Member, Author) commented Feb 23, 2013

Okay. Sounds good. Thanks.

@ghost commented Mar 12, 2013

Half the fix is in master, and there's a pending PR #2918 for the other half, controlling verbose_info via an option. Closing.

This issue was closed.