DataFrame constructor speed #621

Closed
dieterv77 opened this issue Jan 12, 2012 · 2 comments

@dieterv77
Contributor

Hi, I was playing around with constructing DataFrames from nested dicts, and noticed that things have gotten a bit slower since v0.6.1.

Here's some sample code I was playing with:

import time
import pandas

print pandas.__version__

# 5000 columns, each a dict mapping 100 integer row labels to floats
data = dict((i, dict((j, float(j)) for j in xrange(100))) for i in xrange(5000))

t0 = time.time()
df = pandas.DataFrame(data)
t1 = time.time()
print t1 - t0

With version 0.6.1, the printed time is about 0.21s on my machine. With a little help from git bisect, I found that:

commit f3ca67d takes it from 0.21s to 0.44s
commit 9d65e8e takes it from 0.44s to 0.54s

It's possible that some of these slowdowns were unavoidable, since the commits may have been necessary bugfixes, but I wanted to see if anyone else is seeing this too.

Environment info: 64-bit Ubuntu 11.10, Python 2.7, NumPy 1.6.1, Cython 0.15.1
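
For anyone re-running this measurement on Python 3, here is a rough sketch of the same benchmark. It is not part of the original report: the helper names `make_nested_dict` and `best_of` are made up for illustration, and taking the best of several runs with `timeit` is just one reasonable way to reduce noise compared with a single `time.time()` delta.

```python
import timeit

try:
    import pandas  # optional: only needed for the actual measurement
except ImportError:
    pandas = None


def make_nested_dict(n_cols=5000, n_rows=100):
    """Build {column: {row: value}} with integer labels, as in the report."""
    return {i: {j: float(j) for j in range(n_rows)} for i in range(n_cols)}


def best_of(func, repeat=3):
    """Return the fastest wall-clock time (seconds) over `repeat` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


def main():
    data = make_nested_dict()
    if pandas is None:
        print("pandas is not installed; nothing to time")
        return
    print(pandas.__version__)
    print("%.3f s" % best_of(lambda: pandas.DataFrame(data)))


if __name__ == "__main__":
    main()
```

Running the same script at each commit of interest should give numbers that can be compared directly, though absolute timings will differ from the ones quoted here.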

@wesm
Member

wesm commented Jan 12, 2012

Well rats, and you see in those commits I totally thought I was making things faster! I see the issue and I'm going to address it now and add a vbenchmark (http://pandas.sourceforge.net/vbench.html) so we can track the performance more systematically going forward.

wesm added a commit that referenced this issue Jan 12, 2012
…ted dict with integer indexes, add vbench for it, speed up _stack_dict in internals, GH #621
@wesm
Member

wesm commented Jan 12, 2012

OK I fixed things up and even made things a little faster.

before (3ed22d7):

In [3]: timeit df = DataFrame(data)
1 loops, best of 3: 690 ms per loop

after (HEAD):

In [3]: timeit df = DataFrame(data)
10 loops, best of 3: 167 ms per loop

and 0.6.1:

In [3]: timeit df = DataFrame(data)
1 loops, best of 3: 273 ms per loop

Note that this problem only affected integer-indexed data. The issue, if you're interested, had to do with boxing of int64 scalars (from the Index when doing dict lookups).
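
To illustrate the general effect (this is not the actual pandas code or the fix in the commit), here is a small sketch under some assumptions: iterating a NumPy int64 array creates a fresh `numpy.int64` object for every element, and using those objects as dict keys tends to cost more than using plain Python ints. The helpers `lookup_all` and `time_lookups` are hypothetical names, and the script only prints timings rather than asserting which case wins.

```python
import timeit

try:
    import numpy as np  # optional: only needed for the comparison in main()
except ImportError:
    np = None


def lookup_all(mapping, keys):
    """Look up every key in `mapping`, returning the values in order."""
    return [mapping[k] for k in keys]


def time_lookups(mapping, keys, repeat=5):
    """Fastest time (seconds) for one full pass of lookups over `keys`."""
    return min(
        timeit.repeat(lambda: lookup_all(mapping, keys), number=1, repeat=repeat)
    )


def main(n=100000):
    mapping = {j: float(j) for j in range(n)}
    if np is None:
        print("NumPy is not installed; skipping the comparison")
        return
    labels = np.arange(n, dtype=np.int64)
    # Iterating the array boxes each element into a new numpy.int64 object.
    boxed = time_lookups(mapping, labels)
    # tolist() converts once, up front, to plain Python ints.
    plain = time_lookups(mapping, labels.tolist())
    print("numpy.int64 keys: %.4f s" % boxed)
    print("Python int keys:  %.4f s" % plain)


if __name__ == "__main__":
    main()
```

Both passes return the same values because `numpy.int64` scalars hash and compare equal to the matching Python ints; only the per-lookup cost differs.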

@wesm wesm closed this as completed Jan 12, 2012