PERF: speed-up DateFrame.itertuples() with namedtuples #11625

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
4 participants
Contributor

xflr6 commented Nov 17, 2015

Also:

  • replace bare except: to avoid catching SystemExit and KeyboardInterrupt
  • remove the generator return from the try-clause to an else
  • more explicit fallback to regular tuples when name=None (docs?)
Contributor

jreback commented Nov 17, 2015

you would have to add a benchmark for this
though u r using a private function so u less this is a huge speed up prob not worth it
need a test for name=None
don't use an else clause - just put in the try - what u added is much less explicit

Contributor

xflr6 commented Nov 17, 2015

As fas as I can see, _make() is the preferred way of creating a namedtuple from an iterable (i.e. although it starts with an underscore, it is not a private method):

In addition to the methods inherited from tuples, named tuples support three additional methods
and one attribute. To prevent conflicts with field names, the method and attribute names start with
an underscore.

I disagree, to me this is exactly what the else-clause of try...except is for. Why have a bare try around the generator expression?

The use of the else clause is better than adding additional code to the try clause because it avoids
accidentally catching an exception that wasn’t raised by the code being protected by the
try ... except statement.

@jreback jreback and 1 other commented on an outdated diff Nov 17, 2015

pandas/core/frame.py
pass
+ else:
@jreback

jreback Nov 17, 2015

Contributor

you are missing the point here; this should NOT create a named tuple if it fails the try, that is the point; and this code exists below, you are duplicating code.

@xflr6

xflr6 Nov 17, 2015

Contributor

Sorry for correcting you, but the code does exactly what you say is to be achieved.

The try ... except statement has an optional else clause, which, when present,
must follow all except clauses. It is useful for code that must be executed
if the try clause does not raise an exception.

@jreback

jreback Nov 17, 2015

Contributor

you are duplicating code, see the return statement

@xflr6

xflr6 Nov 17, 2015

Contributor

Sorry, I do not see this. Can you be a little more explicit.
There are two return statements just like in the current version in master, right?

@jreback

jreback Nov 17, 2015

Contributor

the statement you added duplicates the other return statement. you don't need the else clause at all. if you think that the _make actually speeds things, then put it there.

  • would need a benchmark as well
  • need a test for name=None if this actually changed anything (does it?)
@xflr6

xflr6 Nov 17, 2015

Contributor

Just to clarify: I moved the first return statement inside the else-clause and modified it to use _make (i.e. I did not add a statement and not introduce (new) duplication).
If you remove the else-block (with the contained return) then namedtuples will never be returned.

@jreback

jreback Nov 17, 2015

Contributor

oh, this is just for name tuples, ok, then, but just put it where it was. I think else clauses make this very unreadable. No reason not to put it in the try

@jreback

jreback Nov 17, 2015

Contributor

remove the else as discussed above and put it in the try

Contributor

xflr6 commented Nov 17, 2015

Here are some simple timings:

import collections

import pandas as pd
from pandas.compat import map, zip

class DataFrame(pd.DataFrame):

    def itertuples_new(self, index=True, name="Pandas"):
(...)
            else:
                return (itertuple(*row) for row in zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

    def itertuples_make(self, index=True, name="Pandas"):
(...)
            else:
                return map(itertuple._make, zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

df = DataFrame({'A': 'spam', 'B': range(1000), 'C': None,
   'D': range(1000), 'E': range(1000), 'F': range(1000)})

%timeit list(df.itertuples_new())
100 loops, best of 3: 3.04 ms per loop

%timeit list(df.itertuples_make())
100 loops, best of 3: 2.68 ms per loop

%timeit list(df.itertuples_make(name=None))
1000 loops, best of 3: 1.17 ms per loop
Contributor

jreback commented Nov 17, 2015

pls add a benchmark to the asv suite (make about 10x bigger).

jreback added the Performance label Nov 17, 2015

Contributor

xflr6 commented Nov 17, 2015

$ asv continuous master HEAD -b
 frame_methods.frame_itertuples
· Creating environments
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-sci
py-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sq
lalchemy-xlrd-xlsxwriter-xlwt
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy
-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 2238f73e:
[  0.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   10.03ms
[ 50.00%] · For pandas commit hash e29bf614:
[ 50.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   12.24msB
ENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jreback jreback commented on an outdated diff Nov 17, 2015

asv_bench/benchmarks/frame_methods.py
@@ -653,6 +653,16 @@ def j(self):
self.df3[0]
+class frame_itertuples(object):
+
+ def setup(self):
+ self.df = DataFrame(np.random.randn(5000, 10))
@jreback

jreback Nov 17, 2015

Contributor

make this bigger (50k)

Contributor

xflr6 commented Nov 17, 2015

[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   77.15ms
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples  103.37ms

@jreback jreback and 3 others commented on an outdated diff Nov 17, 2015

pandas/tests/test_frame.py
@@ -5545,6 +5545,8 @@ def test_itertuples(self):
dfaa = df[['a', 'a']]
self.assertEqual(list(dfaa.itertuples()), [(0, 1, 1), (1, 2, 2), (2, 3, 3)])
+ self.assertEqual(type(next(df.itertuples(name=None))), tuple)
@jreback

jreback Nov 17, 2015

Contributor

this is not good enough as a NamedTuple will match this as well.. Construct an exact tuple like above.

@xflr6

xflr6 Nov 17, 2015

Contributor

Sorry for having to correct you again but this is not true:

>>> import collections
>>> spam = collections.namedtuple('spam', 'spam eggs')(1, 2)
>>> type(spam) == tuple
False
>>> spam == (1, 2)
True
@jreback

jreback Nov 17, 2015

Contributor

we don't use type(...) is not idiomatic. use isinstance and see why what you are doing is wrong.

@xflr6

xflr6 Nov 17, 2015

Contributor

I disagree: testing is exactly the context where type(spam) is eggs is what you do if you want to check for exact type and not for ancestry. I would use self.assertIs but AFAIR this was added in Python 2.7.

What do you propose to use as check here?

@jreback

jreback Nov 17, 2015

Contributor

read the original PR we already had this discussion.

@jreback

jreback Nov 17, 2015

Contributor

further, pls review the actual testing code. we have self.assertIsInstance is defined. pandas defines almost everything you need.

@xflr6

xflr6 Nov 17, 2015

Contributor

Okay, reading that discussion I suspect your proposal is something like:
self.assertNotIsInstance(<inst>, <namedtuple>)

Please note that is impossible to check that is not a namedtuple this way (which I assume is what is to be tested):

  • The caller of itertuple() has no access to the class created on-the-fly inside the method.
  • It can surely never access it with itertuple(name=None) because it s not created in the first place.
  • The only common base class of different kinds of namedtuples is tuple.

With the test infrastrucure, I propose to change the line to:
self.assertIs(type(<inst>), tuple)

The only other possibility I can imagine would be something like assert not hasattr(<inst>, '_fields').

@MaximilianR

MaximilianR Nov 17, 2015

Contributor

pandas has an is_namedtuple where this is encapsulated

@kawochen

kawochen Nov 17, 2015

Contributor

why test for type when you can explicitly test for object equality?

@xflr6

xflr6 Nov 17, 2015

Contributor

@MaximilianR: Thanks for the hint, pandas.common.core.is_named_tuple() uses hasattr(..., '_fields'). Is that to be prefered?

@kawochen: See above, the test (as I understand it) is to check the returned values are plain tuples, but namedtuple('spam', 'spam eggs')(1, 2) == (1, 2).

@kawochen

kawochen Nov 17, 2015

Contributor

thanks. that is quite annoying. perhaps we need more utilities for dealing with these, and type does seem like the way to test for this. IMO you still need to test for object equality

@MaximilianR

MaximilianR Nov 17, 2015

Contributor
  • If you want to test if an instance is a namedtuple use is_named_tuple
  • If you want to test if an instance is a specific namedtuple, I would check the repr matches the repr you expect. Also OK if you additionally check the instances' equality / if the expected instance is a namedtuple.

Type won't help us here @kawochen, because namedtuple's type is just a tuple - it's actually a factory function rather than a type (although there's much debate over whether this is good)

@xflr6

xflr6 Nov 17, 2015

Contributor

Switched to repr checking. Note that grepping tests for Equal(type( there are some more of such explicit type comparisons (which IMHO this is fine).

Contributor

xflr6 commented Nov 17, 2015

Okay, maybe one last try to reconsider your dislike of try...else-clauses.
Apart from the official docs, there also seems to be some others in favour.

Contributor

jreback commented Nov 17, 2015

@xflr6 pls just follow my directions. It has nothing to do with whether I like it or not. Its not consistent at all in the code base.

Contributor

xflr6 commented Nov 18, 2015

Performance comparison with the regular tuple returning branch:

[ 25.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   76.64ms
[ 50.00%] ··· Running ...ame_itertuples.time_frame_itertuples_regular   38.04ms
Contributor

jreback commented Nov 18, 2015

ok, add this issue number onto where #11269 is in whatsnew/v0.17.0
squash to a single commit. ping when green.

Contributor

xflr6 commented Nov 18, 2015

There is no issue for this (only the PR), should I open one?

Contributor

jreback commented Nov 18, 2015

no use the pr number

jreback added this to the 0.17.1 milestone Nov 19, 2015

Contributor

jreback commented Nov 19, 2015

merged via 4ffc3ef

thanks!

jreback closed this Nov 19, 2015

xflr6 deleted the unknown repository branch Nov 19, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment