PERF: speed-up DateFrame.itertuples() with namedtuples #11625

xflr6 · 2015-11-17T11:12:13Z

use namedtuple._make()

Also:

replace bare except: to avoid catching SystemExit and KeyboardInterrupt
remove the generator return from the try-clause to an else
more explicit fallback to regular tuples when name=None (docs?)

jreback · 2015-11-17T11:49:31Z

you would have to add a benchmark for this
though u r using a private function so u less this is a huge speed up prob not worth it
need a test for name=None
don't use an else clause - just put in the try - what u added is much less explicit

xflr6 · 2015-11-17T12:35:37Z

As fas as I can see, _make() is the preferred way of creating a namedtuple from an iterable (i.e. although it starts with an underscore, it is not a private method):

In addition to the methods inherited from tuples, named tuples support three additional methods
and one attribute. To prevent conflicts with field names, the method and attribute names start with
an underscore.

I disagree, to me this is exactly what the else-clause of try...except is for. Why have a bare try around the generator expression?

The use of the else clause is better than adding additional code to the try clause because it avoids
accidentally catching an exception that wasn’t raised by the code being protected by the
try ... except statement.

jreback · 2015-11-17T12:54:00Z

pandas/core/frame.py

                pass
+            else:


you are missing the point here; this should NOT create a named tuple if it fails the try, that is the point; and this code exists below, you are duplicating code.

Sorry for correcting you, but the code does exactly what you say is to be achieved.

The try ... except statement has an optional else clause, which, when present,
must follow all except clauses. It is useful for code that must be executed
if the try clause does not raise an exception.

you are duplicating code, see the return statement

Sorry, I do not see this. Can you be a little more explicit.
There are two return statements just like in the current version in master, right?

the statement you added duplicates the other return statement. you don't need the else clause at all. if you think that the _make actually speeds things, then put it there.

would need a benchmark as well

need a test for name=None if this actually changed anything (does it?)

Just to clarify: I moved the first return statement inside the else-clause and modified it to use _make (i.e. I did not add a statement and not introduce (new) duplication).
If you remove the else-block (with the contained return) then namedtuples will never be returned.

oh, this is just for name tuples, ok, then, but just put it where it was. I think else clauses make this very unreadable. No reason not to put it in the try

remove the else as discussed above and put it in the try

xflr6 · 2015-11-17T13:29:52Z

Here are some simple timings:

import collections

import pandas as pd
from pandas.compat import map, zip

class DataFrame(pd.DataFrame):

    def itertuples_new(self, index=True, name="Pandas"):
(...)
            else:
                return (itertuple(*row) for row in zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

    def itertuples_make(self, index=True, name="Pandas"):
(...)
            else:
                return map(itertuple._make, zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

df = DataFrame({'A': 'spam', 'B': range(1000), 'C': None,
   'D': range(1000), 'E': range(1000), 'F': range(1000)})

%timeit list(df.itertuples_new())
100 loops, best of 3: 3.04 ms per loop

%timeit list(df.itertuples_make())
100 loops, best of 3: 2.68 ms per loop

%timeit list(df.itertuples_make(name=None))
1000 loops, best of 3: 1.17 ms per loop

jreback · 2015-11-17T13:40:30Z

pls add a benchmark to the asv suite (make about 10x bigger).

xflr6 · 2015-11-17T15:14:35Z

$ asv continuous master HEAD -b
 frame_methods.frame_itertuples
· Creating environments
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-sci
py-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sq
lalchemy-xlrd-xlsxwriter-xlwt
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy
-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 2238f73e:
[  0.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   10.03ms
[ 50.00%] · For pandas commit hash e29bf614:
[ 50.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   12.24msB
ENCHMARKS NOT SIGNIFICANTLY CHANGED.

jreback · 2015-11-17T15:24:48Z

asv_bench/benchmarks/frame_methods.py

+class frame_itertuples(object):
+
+    def setup(self):
+        self.df = DataFrame(np.random.randn(5000, 10))


make this bigger (50k)

xflr6 · 2015-11-17T15:52:19Z

[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   77.15ms
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples  103.37ms

jreback · 2015-11-17T16:13:52Z

pandas/tests/test_frame.py

@@ -5545,6 +5545,8 @@ def test_itertuples(self):
        dfaa = df[['a', 'a']]
        self.assertEqual(list(dfaa.itertuples()), [(0, 1, 1), (1, 2, 2), (2, 3, 3)])

+        self.assertEqual(type(next(df.itertuples(name=None))), tuple)


this is not good enough as a NamedTuple will match this as well.. Construct an exact tuple like above.

Sorry for having to correct you again but this is not true:

>>> import collections >>> spam = collections.namedtuple('spam', 'spam eggs')(1, 2) >>> type(spam) == tuple False >>> spam == (1, 2) True

we don't use type(...) is not idiomatic. use isinstance and see why what you are doing is wrong.

I disagree: testing is exactly the context where type(spam) is eggs is what you do if you want to check for exact type and not for ancestry. I would use self.assertIs but AFAIR this was added in Python 2.7.

What do you propose to use as check here?

read the original PR we already had this discussion.

why test for type when you can explicitly test for object equality?

@MaximilianR: Thanks for the hint, pandas.common.core.is_named_tuple() uses hasattr(..., '_fields'). Is that to be prefered?

@kawochen: See above, the test (as I understand it) is to check the returned values are plain tuples, but namedtuple('spam', 'spam eggs')(1, 2) == (1, 2).

thanks. that is quite annoying. perhaps we need more utilities for dealing with these, and type does seem like the way to test for this. IMO you still need to test for object equality

If you want to test if an instance is a namedtuple use is_named_tuple

If you want to test if an instance is a specific namedtuple, I would check the repr matches the repr you expect. Also OK if you additionally check the instances' equality / if the expected instance is a namedtuple.

Type won't help us here @kawochen, because namedtuple's type is just a tuple - it's actually a factory function rather than a type (although there's much debate over whether this is good)

Switched to repr checking. Note that grepping tests for Equal(type( there are some more of such explicit type comparisons (which IMHO this is fine).

xflr6 · 2015-11-17T16:20:49Z

Okay, maybe one last try to reconsider your dislike of try...else-clauses.
Apart from the official docs, there also seems to be some others in favour.

jreback · 2015-11-17T16:35:22Z

@xflr6 pls just follow my directions. It has nothing to do with whether I like it or not. Its not consistent at all in the code base.

xflr6 · 2015-11-18T08:53:35Z

Performance comparison with the regular tuple returning branch:

[ 25.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   76.64ms
[ 50.00%] ··· Running ...ame_itertuples.time_frame_itertuples_regular   38.04ms

jreback · 2015-11-18T11:53:59Z

ok, add this issue number onto where #11269 is in whatsnew/v0.17.0
squash to a single commit. ping when green.

xflr6 · 2015-11-18T12:54:07Z

There is no issue for this (only the PR), should I open one?

jreback · 2015-11-18T13:08:28Z

no use the pr number

jreback · 2015-11-19T01:10:44Z

merged via 4ffc3ef

thanks!

jreback reviewed Nov 17, 2015
View reviewed changes

jreback added the Performance Memory or execution speed performance label Nov 17, 2015

jreback reviewed Nov 17, 2015
View reviewed changes

PERF: speed-up DateFrame.itertuples() with namedtuples

e40d9d6

jreback added this to the 0.17.1 milestone Nov 19, 2015

jreback closed this Nov 19, 2015

xflr6 deleted the enhance_pr11325 branch November 19, 2015 09:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: speed-up DateFrame.itertuples() with namedtuples #11625

PERF: speed-up DateFrame.itertuples() with namedtuples #11625

xflr6 commented Nov 17, 2015

jreback commented Nov 17, 2015

xflr6 commented Nov 17, 2015

jreback Nov 17, 2015

xflr6 Nov 17, 2015

jreback Nov 17, 2015

xflr6 Nov 17, 2015

jreback Nov 17, 2015

xflr6 Nov 17, 2015

jreback Nov 17, 2015

jreback Nov 17, 2015

xflr6 commented Nov 17, 2015

jreback commented Nov 17, 2015

xflr6 commented Nov 17, 2015

jreback Nov 17, 2015

xflr6 commented Nov 17, 2015

jreback Nov 17, 2015

xflr6 Nov 17, 2015

jreback Nov 17, 2015

xflr6 Nov 17, 2015

jreback Nov 17, 2015

kawochen Nov 17, 2015

xflr6 Nov 17, 2015

kawochen Nov 17, 2015

max-sixty Nov 17, 2015

xflr6 Nov 17, 2015

xflr6 commented Nov 17, 2015

jreback commented Nov 17, 2015

xflr6 commented Nov 18, 2015

jreback commented Nov 18, 2015

xflr6 commented Nov 18, 2015

jreback commented Nov 18, 2015

jreback commented Nov 19, 2015

PERF: speed-up DateFrame.itertuples() with namedtuples #11625

PERF: speed-up DateFrame.itertuples() with namedtuples #11625

Conversation

xflr6 commented Nov 17, 2015

jreback commented Nov 17, 2015

xflr6 commented Nov 17, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xflr6 commented Nov 17, 2015

jreback commented Nov 17, 2015

xflr6 commented Nov 17, 2015

Choose a reason for hiding this comment

xflr6 commented Nov 17, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xflr6 commented Nov 17, 2015

jreback commented Nov 17, 2015

xflr6 commented Nov 18, 2015

jreback commented Nov 18, 2015

xflr6 commented Nov 18, 2015

jreback commented Nov 18, 2015

jreback commented Nov 19, 2015