Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: speed-up DateFrame.itertuples() with namedtuples #11625

Closed
wants to merge 1 commit into from
Closed

PERF: speed-up DateFrame.itertuples() with namedtuples #11625

wants to merge 1 commit into from

Conversation

xflr6
Copy link
Contributor

@xflr6 xflr6 commented Nov 17, 2015

Also:

  • replace bare except: to avoid catching SystemExit and KeyboardInterrupt
  • remove the generator return from the try-clause to an else
  • more explicit fallback to regular tuples when name=None (docs?)

@jreback
Copy link
Contributor

jreback commented Nov 17, 2015

you would have to add a benchmark for this
though u r using a private function so u less this is a huge speed up prob not worth it
need a test for name=None
don't use an else clause - just put in the try - what u added is much less explicit

@xflr6
Copy link
Contributor Author

xflr6 commented Nov 17, 2015

As fas as I can see, _make() is the preferred way of creating a namedtuple from an iterable (i.e. although it starts with an underscore, it is not a private method):

In addition to the methods inherited from tuples, named tuples support three additional methods
and one attribute. To prevent conflicts with field names, the method and attribute names start with
an underscore.

I disagree, to me this is exactly what the else-clause of try...except is for. Why have a bare try around the generator expression?

The use of the else clause is better than adding additional code to the try clause because it avoids
accidentally catching an exception that wasn’t raised by the code being protected by the
try ... except statement.

pass
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are missing the point here; this should NOT create a named tuple if it fails the try, that is the point; and this code exists below, you are duplicating code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for correcting you, but the code does exactly what you say is to be achieved.

The try ... except statement has an optional else clause, which, when present,
must follow all except clauses. It is useful for code that must be executed
if the try clause does not raise an exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are duplicating code, see the return statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I do not see this. Can you be a little more explicit.
There are two return statements just like in the current version in master, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the statement you added duplicates the other return statement. you don't need the else clause at all. if you think that the _make actually speeds things, then put it there.

  • would need a benchmark as well
  • need a test for name=None if this actually changed anything (does it?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: I moved the first return statement inside the else-clause and modified it to use _make (i.e. I did not add a statement and not introduce (new) duplication).
If you remove the else-block (with the contained return) then namedtuples will never be returned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, this is just for name tuples, ok, then, but just put it where it was. I think else clauses make this very unreadable. No reason not to put it in the try

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the else as discussed above and put it in the try

@xflr6
Copy link
Contributor Author

xflr6 commented Nov 17, 2015

Here are some simple timings:

import collections

import pandas as pd
from pandas.compat import map, zip

class DataFrame(pd.DataFrame):

    def itertuples_new(self, index=True, name="Pandas"):
(...)
            else:
                return (itertuple(*row) for row in zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

    def itertuples_make(self, index=True, name="Pandas"):
(...)
            else:
                return map(itertuple._make, zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

df = DataFrame({'A': 'spam', 'B': range(1000), 'C': None,
   'D': range(1000), 'E': range(1000), 'F': range(1000)})

%timeit list(df.itertuples_new())
100 loops, best of 3: 3.04 ms per loop

%timeit list(df.itertuples_make())
100 loops, best of 3: 2.68 ms per loop

%timeit list(df.itertuples_make(name=None))
1000 loops, best of 3: 1.17 ms per loop

@jreback
Copy link
Contributor

jreback commented Nov 17, 2015

pls add a benchmark to the asv suite (make about 10x bigger).

@jreback jreback added the Performance Memory or execution speed performance label Nov 17, 2015
@xflr6
Copy link
Contributor Author

xflr6 commented Nov 17, 2015

$ asv continuous master HEAD -b
 frame_methods.frame_itertuples
· Creating environments
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-sci
py-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sq
lalchemy-xlrd-xlsxwriter-xlwt
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy
-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 2238f73e:
[  0.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   10.03ms
[ 50.00%] · For pandas commit hash e29bf614:
[ 50.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   12.24msB
ENCHMARKS NOT SIGNIFICANTLY CHANGED.

class frame_itertuples(object):

def setup(self):
self.df = DataFrame(np.random.randn(5000, 10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this bigger (50k)

@xflr6
Copy link
Contributor Author

xflr6 commented Nov 17, 2015

[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   77.15ms
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples  103.37ms

@@ -5545,6 +5545,8 @@ def test_itertuples(self):
dfaa = df[['a', 'a']]
self.assertEqual(list(dfaa.itertuples()), [(0, 1, 1), (1, 2, 2), (2, 3, 3)])

self.assertEqual(type(next(df.itertuples(name=None))), tuple)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not good enough as a NamedTuple will match this as well.. Construct an exact tuple like above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for having to correct you again but this is not true:

>>> import collections
>>> spam = collections.namedtuple('spam', 'spam eggs')(1, 2)
>>> type(spam) == tuple
False
>>> spam == (1, 2)
True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't use type(...) is not idiomatic. use isinstance and see why what you are doing is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree: testing is exactly the context where type(spam) is eggs is what you do if you want to check for exact type and not for ancestry. I would use self.assertIs but AFAIR this was added in Python 2.7.

What do you propose to use as check here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read the original PR we already had this discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why test for type when you can explicitly test for object equality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaximilianR: Thanks for the hint, pandas.common.core.is_named_tuple() uses hasattr(..., '_fields'). Is that to be prefered?

@kawochen: See above, the test (as I understand it) is to check the returned values are plain tuples, but namedtuple('spam', 'spam eggs')(1, 2) == (1, 2).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. that is quite annoying. perhaps we need more utilities for dealing with these, and type does seem like the way to test for this. IMO you still need to test for object equality

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If you want to test if an instance is a namedtuple use is_named_tuple
  • If you want to test if an instance is a specific namedtuple, I would check the repr matches the repr you expect. Also OK if you additionally check the instances' equality / if the expected instance is a namedtuple.

Type won't help us here @kawochen, because namedtuple's type is just a tuple - it's actually a factory function rather than a type (although there's much debate over whether this is good)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to repr checking. Note that grepping tests for Equal(type( there are some more of such explicit type comparisons (which IMHO this is fine).

@xflr6
Copy link
Contributor Author

xflr6 commented Nov 17, 2015

Okay, maybe one last try to reconsider your dislike of try...else-clauses.
Apart from the official docs, there also seems to be some others in favour.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2015

@xflr6 pls just follow my directions. It has nothing to do with whether I like it or not. Its not consistent at all in the code base.

@xflr6
Copy link
Contributor Author

xflr6 commented Nov 18, 2015

Performance comparison with the regular tuple returning branch:

[ 25.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   76.64ms
[ 50.00%] ··· Running ...ame_itertuples.time_frame_itertuples_regular   38.04ms

@jreback
Copy link
Contributor

jreback commented Nov 18, 2015

ok, add this issue number onto where #11269 is in whatsnew/v0.17.0
squash to a single commit. ping when green.

@xflr6
Copy link
Contributor Author

xflr6 commented Nov 18, 2015

There is no issue for this (only the PR), should I open one?

@jreback
Copy link
Contributor

jreback commented Nov 18, 2015

no use the pr number

@jreback jreback added this to the 0.17.1 milestone Nov 19, 2015
@jreback
Copy link
Contributor

jreback commented Nov 19, 2015

merged via 4ffc3ef

thanks!

@jreback jreback closed this Nov 19, 2015
@xflr6 xflr6 deleted the enhance_pr11325 branch November 19, 2015 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants