df.dtypes.values is not O(1) and repr(df) is therefore slow for large frames #5968

Closed
y-p opened this Issue Jan 16, 2014 · 21 comments

y-p (Contributor) commented Jan 16, 2014

For the FEC dataset, it takes about 1.5 sec to get a repr, and %prun
attributes essentially all of it to infer_dtype.
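The slowdown is easy to reproduce. A rough sketch of the measurement, using a small stand-in frame rather than the actual FEC file (the column names below mimic that dataset but are otherwise arbitrary):

```python
import time

import pandas as pd

def time_repr(df):
    """Seconds spent building repr(df)."""
    start = time.perf_counter()
    repr(df)
    return time.perf_counter() - start

# Stand-in frame with an object (string) column; on the real ~1M-row
# FEC file this took ~1.5 s before the fix.
df = pd.DataFrame({"cand_nm": ["Obama, Barack"] * 10_000,
                   "contb_receipt_amt": [250.0] * 10_000})
elapsed = time_repr(df)
```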

jreback (Contributor) commented Jan 16, 2014

do you have a link to the dataset...can't seem to find mine

y-p (Contributor) commented Jan 16, 2014

ftp://ftp.fec.gov/FEC/Presidential_Map/2012/P00000001/P00000001-ALL.zip

jreback (Contributor) commented Jan 16, 2014

I believe that this is the problem.

It is trying to see if there are floats in an object array. I would simply not do this at all,
or short-circuit it.

Breakpoint 2 at /mnt/home/jreback/pandas/pandas/core/format.py:1663
(Pdb) c
> /mnt/home/jreback/pandas/pandas/core/format.py(1663)_format_strings()
-> is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
(Pdb) l
1658                    # object dtype
1659                    return '%s' % formatter(x)
1660 
1661            vals = self.values
1662 
1663B->         is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
1664            leading_space = is_float.any()
1665 
1666            fmt_values = []
1667            for i, v in enumerate(vals):
1668                if not is_float[i] and leading_space:
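The short-circuit idea can be sketched as follows. This is an illustration of the principle, not the actual pandas internals (`is_float_mask` is a made-up name): for any non-object dtype the answer is known from the dtype alone, so the per-value scan is only ever needed for object arrays.

```python
import numpy as np

def is_float_mask(vals):
    """Boolean mask of which elements are floats."""
    if vals.dtype != object:
        # The answer is uniform across the array: no per-value scan needed.
        return np.full(len(vals), np.issubdtype(vals.dtype, np.floating))
    # Object dtype: fall back to checking each element.
    return np.array([isinstance(v, float) for v in vals], dtype=bool)

mixed = np.array(["a", 1.5, None], dtype=object)
floats = np.array([1.0, 2.0])
```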

y-p (Contributor) commented Jan 16, 2014

It's probably there to support the float_format arg of to_string. I'll have to think about it.
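A small example of why that distinction matters: float_format applies a custom formatter only to float values, including floats sitting inside an object column, so the formatter first has to know which elements are floats.

```python
import pandas as pd

# float_format is only applied to the float elements; the string passes
# through untouched.
s = pd.Series([1.23456, "text", 7.0], dtype=object)
out = s.to_string(float_format=lambda x: f"{x:.2f}")
```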

jreback (Contributor) commented Jan 16, 2014

For an object dtype you could warn if it 'looks' like float, but otherwise skip it.
The problem is that it's checking strings that aren't numbers at all.

y-p (Contributor) commented Jan 16, 2014

Isn't looks_like_float() exactly what map_infer does? How could it be faster if
I have to check each value for "appearance"?

y-p (Contributor) commented Jan 16, 2014

It doesn't need to do this for values not displayed in the output. That's it.
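In sketch form: a truncated repr only shows the head and tail of a frame, so only those rows need type inference and formatting (`rows_to_format` here is a hypothetical helper, not pandas API):

```python
import numpy as np

def rows_to_format(n_rows, max_rows=60):
    """Indices of the rows a truncated repr would actually display."""
    if n_rows <= max_rows:
        return np.arange(n_rows)
    half = max_rows // 2
    # Head and tail only; the millions of rows in between are never shown,
    # so there is no reason to infer or format them.
    return np.concatenate([np.arange(half),
                           np.arange(n_rows - half, n_rows)])

idx = rows_to_format(1_000_000)
```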

jreback (Contributor) commented Jan 16, 2014

right!

y-p (Contributor) commented Jan 16, 2014

That's not where the bottleneck is.
What's this?

In [10]: %timeit df.dtypes.values
1 loops, best of 3: 178 ms per loop

aren't dtypes just a lookup?
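What "just a lookup" means: each column's dtype is stored metadata, so reading df.dtypes should cost O(number of columns) and be independent of row count. A quick check of that expectation:

```python
import numpy as np
import pandas as pd

# The dtype metadata is identical regardless of length; reading it should
# not require inferring anything from the values themselves.
small = pd.DataFrame({"a": np.zeros(10), "b": ["x"] * 10})
big = pd.DataFrame({"a": np.zeros(100_000), "b": ["x"] * 100_000})
```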

jreback (Contributor) commented Jan 16, 2014

This issue addresses this, but it needs reworking to make it more internal, as I have indicated: #5740

y-p (Contributor) commented Jan 16, 2014

Related (déjà vu): 3cb6961, #2807 (comment)

jreback (Contributor) commented Jan 16, 2014

I have got a PR...give me a few

y-p (Contributor) commented Jan 16, 2014

There's an off chance this might be the cause of a lot of the slowdowns we saw in 0.13
after the NDFrame refactor. Is the change in behaviour related? If yes, hurrah.

jreback (Contributor) commented Jan 16, 2014

Anything with df.apply in it is generally bad when used internally (as dsm fixed for str.extract).

y-p (Contributor) commented Jan 16, 2014

#5660

frame_get_dtype_counts | 0.1843 | 0.1113 | 1.6552 |

Less than what I expected, but it should have set bells ringing.

Unrelated in fact.

dsm054 (Contributor) commented Jan 16, 2014

@jreback: dsm->unutbu. Can't take credit for that one. :^)

jreback (Contributor) commented Jan 16, 2014

@dsm054 sorry....you are right!! morning confusion

y-p (Contributor) commented Jan 16, 2014

That only cuts it in half. Is this expected?

df=pd.read_csv('P00000001-ALL.csv',low_memory=False)
%timeit df.iloc[:100, 4]
10 loops, best of 3: 88.1 ms per loop

Isn't slicing supposed to be cheap?
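A quick way to sanity-check that expectation (the column and sizes below are stand-ins for the FEC frame): positional slicing should hand back the first 100 rows without scanning or re-inferring the dtype of the whole object column.

```python
import time

import numpy as np
import pandas as pd

# 500k-element object column; iloc on a slice should not touch most of it.
df = pd.DataFrame({"obj": np.array(["x"] * 500_000, dtype=object)})
start = time.perf_counter()
head = df.iloc[:100, 0]
elapsed = time.perf_counter() - start
```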

y-p reopened this Jan 16, 2014

jreback (Contributor) commented Jan 16, 2014

let me look

jreback (Contributor) commented Jan 16, 2014

Easy enough...
it was inferring the object dtypes internally when there was no need to do so

In [5]: %timeit df.iloc[:100, 4]
1000 loops, best of 3: 293 µs per loop

y-p (Contributor) commented Jan 16, 2014

2 secs -> 50 ms for repr(df). hellz yeah.
