
Slow printing of large data frames #2807

Closed
cpcloud opened this issue Feb 7, 2013 · 27 comments

@cpcloud (Member) commented Feb 7, 2013

I have DataFrames with about 14 million rows and 16 columns and I have to wait at least 3-4 seconds for it to repr in an IPython session. Is there anything that can be done about this?

@wesm (Member) commented Feb 7, 2013

Set DataFrame._verbose_info to False. This should be made configurable, or maybe we should just disable the null counts above 1 million rows.
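
[Editor's note: the private `_verbose_info` flag discussed here is long gone; in current pandas the equivalent knob is the `show_counts` argument to `DataFrame.info`. A sketch against the modern API, not the 2013 code:]

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000, 4))

buf = io.StringIO()
# show_counts=False skips the per-column non-null counts, which is the
# expensive part of the summary for frames with millions of rows.
df.info(buf=buf, show_counts=False)
print(buf.getvalue())
```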

@cpcloud (Member, Author) commented Feb 7, 2013

@wesm To make it configurable, could this be done by registering an option such as:

import pandas as pd

def check_verbose_info(x):
    # bool(x) almost never raises, so validate the type explicitly
    if not isinstance(x, bool):
        raise ValueError('invalid value for frame.verbose_info')

pd.config.register_option('frame.verbose_info', True, check_verbose_info)

If so, where would this need to go?
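
[Editor's note: the threshold option discussed in this thread survives in modern pandas as `display.max_info_rows`, settable through the public options API rather than a frame attribute; a sketch against current pandas:]

```python
import pandas as pd

# Above this many rows, df.info() skips the per-column null counts
# instead of scanning every column for missing values.
pd.set_option("display.max_info_rows", 1_000_000)
print(pd.get_option("display.max_info_rows"))

# Restore the default afterwards.
pd.reset_option("display.max_info_rows")
```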

@ghost commented Feb 8, 2013

@cpcloud, have a look at pandas/core/config_init.py

@cpcloud (Member, Author) commented Feb 13, 2013

With the latest release this is slow even after setting _verbose_info.

@wesm (Member) commented Feb 13, 2013

That's a bug then (it's probably checking whether the DataFrame is "too wide" before reaching that code). If no one fixes it before I have a chance, I'll get to it in the next couple of weeks.

@ghost referenced this issue Feb 22, 2013
@ghost commented Feb 22, 2013

The delay comes from the dtype count, which verbose_info doesn't disable.
Despite a lot of changes, it has behaved like this at least since 0.9.1.

I intended to put up a fix for review but ended up pushing to master by mistake
(sorry about that). So I welcome review.

@jreback (Contributor) commented Feb 22, 2013

I changed the way get_dtype_counts works in 0.11. Simple fix for this.

@jreback (Contributor) commented Feb 22, 2013

@cpcloud do you have a test example?
What version/commit are you running?

@jreback (Contributor) commented Feb 22, 2013

@y-p saw your fix, thanks. This obviously shouldn't have been using as_blocks.

@ghost commented Feb 22, 2013

df.blocks is quite slow for wide frames, and it can be a nasty surprise to take that hit
since, being a @property, it looks like you're just getting a reference to some existing object.

@ghost commented Feb 22, 2013

I'll call this closed.

@ghost closed this Feb 22, 2013

@cpcloud (Member, Author) commented Feb 22, 2013

I'm always running the latest version :). @jreback You could do something like

from numpy.random import rand
from pandas import DataFrame

df = DataFrame(rand(int(1e7), 16))  # rand needs integer shape arguments
%timeit repr(df)

@cpcloud (Member, Author) commented Feb 22, 2013

Interestingly, DataFrames whose columns are all bool dtype are twice as slow to repr as those whose columns are all float dtype. I timed this using %timeit repr(df). This doesn't hold for arbitrarily shaped DataFrames; it was only twice as slow on a DataFrame with about 14 M rows.

@cpcloud (Member, Author) commented Feb 22, 2013

Would you guys still take a pull request for the config option of omitting non-null info for frames with > 1e6 rows?

@ghost reopened this Feb 22, 2013

@ghost commented Feb 22, 2013

Looks like the fix was only partial.

@jreback (Contributor) commented Feb 22, 2013

@cpcloud FYI, @y-p's revert of get_dtype_counts() fixed this. It was calling the .blocks method, which returns a dict of dtype -> homogeneous frame (new in 0.11); this was unnecessary and copied the data.
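
[Editor's note: the cheap path only needs each column's dtype object; nothing has to be consolidated or copied. A sketch of the idea using the current public API:]

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(5),  # int64
    "b": np.ones(5),    # float64
    "c": np.ones(5),    # float64
})

# Counting dtypes from df.dtypes touches only per-column metadata; the
# old .blocks-based path built (and copied) a homogeneous frame per dtype.
counts = df.dtypes.value_counts()
print(counts)
```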

@ghost commented Feb 22, 2013

I'm confused; maybe I'm just hitting memory pressure on my machine.
@cpcloud, please pull git master and confirm that the problem went away for you.

@ghost commented Feb 22, 2013

before fix:

In [1]: a=pd.DataFrame(rand(1e7, 10))

In [2]: %timeit repr(a)
1 loops, best of 3: 5.06 s per loop

In [3]: a._verbose_info=False

In [4]: %timeit repr(a)
1 loops, best of 3: 2.79 s per loop

after fix:

In [6]: a=pd.DataFrame(rand(1e7, 10))

In [8]: %timeit repr(a)
1 loops, best of 3: 3.3 s per loop

In [9]: a._verbose_info=False

In [10]: %timeit repr(a)
1000 loops, best of 3: 539 µs per loop

@cpcloud, go ahead and open a pull request for an option setting a threshold for _verbose_info.

@cpcloud (Member, Author) commented Feb 23, 2013

@y-p I can confirm that I get similar results to yours.

@cpcloud (Member, Author) commented Feb 23, 2013

I have a pull request ready minus tests. I'm not sure how to go about testing a display configuration option in any kind of non-kludgy way. My first thought was to assert that the string 'null' is not in the repr, but that seems like a very fragile way to test it.

@ghost commented Feb 23, 2013

go for an assertion on line count
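
[Editor's note: that suggestion still works today: render `df.info()` into a buffer and compare line counts, since the non-verbose form collapses the per-column listing into a single summary line. A sketch using the current `DataFrame.info` signature, not the 2013 repr path:]

```python
import io

import numpy as np
import pandas as pd

def info_line_count(frame, **kwargs):
    # Render frame.info() into a buffer and count the lines produced.
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return len(buf.getvalue().splitlines())

df = pd.DataFrame(np.random.rand(20, 10))

# verbose=False replaces the per-column listing with one summary line,
# so its output must be strictly shorter than the verbose output.
assert info_line_count(df, verbose=False) < info_line_count(df, verbose=True)
```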

@cpcloud (Member, Author) commented Feb 23, 2013

Hmm. Does the thresholding option obviate the need for DataFrame._verbose_info? In addition to the threshold I added a display.verbose_info option that works in conjunction with the row threshold option (display.max_info_rows), but now I'm thinking that any verbose_info option should just be done away with.

@ghost commented Feb 23, 2013

Seems reasonable, just as long as the threshold logic allows an "infinite" value.
maxint on 32-bit platforms might not be enough, for example.

@cpcloud (Member, Author) commented Feb 23, 2013

BTW, is it okay to include pep8 cleanups in pull requests? I'm using flake8 + flymake to get the pep8 violations.

@ghost commented Feb 23, 2013

If it's just around the area you touched, that's fine. If it's the whole file and
the diff is large, try to split the pep8 and the enhancement into separate
commits for easier review.

@cpcloud (Member, Author) commented Feb 23, 2013

Okay. Sounds good. Thanks.

@ghost commented Mar 12, 2013

Half the fix is in master, and there's a pending PR #2918 for the other half, controlling verbose_info
via an option. Closing.

This issue was closed.
