Slow printing of large data frames #2807
Comments
|
Set DataFrame._verbose_info to be False. This should be made configurable, or maybe even just disable the null-counts over 1 million rows |
|
@wesm For the configurable, could this be done by registering an option such as:
    # necessary imports
    def check_verbose_info(x):
        try:
            bool(x)
        except:
            raise ValueError('invalid value for frame.verbose_info')

    pd.config.register_option('frame.verbose_info', True, check_verbose_info)
If so, where would this need to go? |
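The validator-plus-registration pattern asked about above can be sketched without pandas. This is a stand-alone illustration only, not pandas's actual config module: `register_option`, `set_option`, `get_option`, and the `_registry` dict are hypothetical names written for this example. It uses an `isinstance` check because `bool(x)` succeeds for nearly every Python object, so a `try: bool(x)` validator would rarely reject anything.

```python
# Hypothetical stand-alone option registry; not pandas's real config API.
_registry = {}


def check_verbose_info(value):
    """Reject anything that is not a real bool."""
    if not isinstance(value, bool):
        raise ValueError('invalid value for frame.verbose_info')


def register_option(key, default, validator=None):
    """Register an option with a default and an optional validator."""
    if key in _registry:
        raise KeyError('option %r is already registered' % key)
    if validator is not None:
        validator(default)  # the default must itself be valid
    _registry[key] = {'value': default, 'validator': validator}


def set_option(key, value):
    """Validate and store a new value for a registered option."""
    entry = _registry[key]
    if entry['validator'] is not None:
        entry['validator'](value)
    entry['value'] = value


def get_option(key):
    """Return the current value of a registered option."""
    return _registry[key]['value']


register_option('frame.verbose_info', True, check_verbose_info)
set_option('frame.verbose_info', False)
```

Validating the default at registration time means a bad default fails immediately rather than the first time the option is read.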
|
@cpcloud, have a look at |
|
With the latest release this is slow even after setting _verbose_info. |
|
that's a bug then (probably it is checking whether the dataframe is "too wide" before reaching that code). if no one fixes the bug before i have a chance i will get to it in the next couple of weeks |
|
The delay comes from the dtype count, which I intended to put up a fix for review but ended up pushing to master by mistake |
|
I changed the way get_dtype_counts works in 0.11. simple fix for this |
|
@cpcloud do u have a test example? |
|
@y-p saw your fix thanks.....this obviously shouldn't have been using as_blocks |
|
|
|
I'll call this closed. |
y-p
closed this
Feb 22, 2013
|
I'm always running the latest version :). @jreback You could do something like
    df = DataFrame(rand(10**7, 16))
    %timeit repr(df) |
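Outside IPython, the `%timeit` measurement can be approximated with the standard-library `timeit` module. The sketch below times `repr` on a plain list of rows as a stand-in for a frame; the data and sizes are assumptions chosen so it runs quickly, and they are not the numbers from this thread.

```python
import timeit

# Stand-in for a large frame: a list of rows. Sizes are kept small so
# the sketch runs quickly; they are not the thread's numbers.
data = [[float(i + j) for j in range(16)] for i in range(1000)]

# Three repeats of five calls each; report the best time per call.
calls = 5
timings = timeit.repeat(lambda: repr(data), repeat=3, number=calls)
best_per_call = min(timings) / calls
print('repr took %.6f s per call' % best_per_call)
```

Taking the minimum over the repeats reduces noise from other processes, which matters when checking whether a slowdown comes from the code or from memory pressure on the machine.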
|
Interestingly, |
|
Would you guys still take a pull request for the config option of omitting non null info for frames with > 1e6 rows? |
y-p
reopened this
Feb 22, 2013
|
Looks like the fix was only partial. |
|
I'm confused. maybe I'm just hitting memory pressure on my machine. |
|
before fix:
after fix:
@cpcloud , go ahead and open a pull request for an option setting a threshold for |
|
@y-p I can confirm that I get similar results to yours. |
|
I have a pull request ready minus tests. I'm not sure how to go about testing a display configuration option in any kind of non-kludgy way. My first thought was to assert that the string 'null' is not in the repr, but that seems like a very fragile way to test it. |
|
go for an assertion on line count |
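A line-count assertion might look like the sketch below. `render_info` is a toy stand-in written for this example, not pandas's real repr code: with counts enabled it emits one line per column, otherwise a single summary line. The test then compares line counts instead of searching the output for the word 'null'.

```python
def render_info(columns, n_rows, show_counts):
    """Toy stand-in for a frame summary; not pandas's real repr."""
    lines = ['%d entries' % n_rows]
    if show_counts:
        # One line per column, each carrying a non-null count.
        for name, non_null in columns:
            lines.append('%s    %d non-null values' % (name, non_null))
    else:
        # A single summary line in place of the per-column listing.
        lines.append('Columns: %d' % len(columns))
    return '\n'.join(lines)


columns = [('a', 10), ('b', 8), ('c', 10)]
verbose = render_info(columns, 10, show_counts=True)
terse = render_info(columns, 10, show_counts=False)

# Compare line counts rather than matching on wording.
assert len(verbose.splitlines()) == 1 + len(columns)
assert len(terse.splitlines()) == 2
```

A line-count check survives rewording of the summary text, though it would still need updating if the layout itself changed.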
|
Hmm. Does the thresholding option obviate the need for |
|
seems reasonable, just as long as the threshold logic allows an "infinite" value. |
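One way the threshold check with an "infinite" setting could be written is sketched here. The function name `show_null_counts` and its `threshold` parameter are assumptions for illustration, not names taken from the pull request; `None` and `math.inf` both mean "no limit".

```python
import math


def show_null_counts(n_rows, threshold=1000000):
    """Return True when per-column non-null counts should be computed.

    `threshold` is a hypothetical setting. None or math.inf disables
    the limit, so the counts are always shown.
    """
    if threshold is None:
        return True
    return n_rows <= threshold


print(show_null_counts(14000000))            # above the default limit
print(show_null_counts(14000000, math.inf))  # limit disabled
```

Treating `None` as "no limit" keeps the option usable from configuration formats that cannot express infinity, while `math.inf` works through the ordinary comparison.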
|
BTW, is it okay to include pep8 cleanups in pull requests? I'm using flake8 + flymake to get the pep8 violations. |
|
If it's just around the area you touched that's fine. If it's the whole file and |
|
Okay. Sounds good. Thanks. |
|
half the fix is in master, and there's a pending PR #2918 for the other half of controlling verbose_info |
cpcloud commented Feb 7, 2013
I have DataFrames with about 14 million rows and 16 columns and I have to wait at least 3-4 seconds for it to
repr in an IPython session. Is there anything that can be done about this?