PERF: speed-up DataFrame.itertuples() with namedtuples #11625

Closed
@xflr6
Contributor

xflr6 commented Nov 17, 2015

Also:

  • replace the bare except: to avoid catching SystemExit and KeyboardInterrupt
  • move the generator return from the try-clause to an else clause
  • more explicit fallback to regular tuples when name=None (docs?)
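
To illustrate the first bullet, here is a minimal sketch (not the PR's code; all function names are made up for the example) of how a bare except: differs from except Exception:. It relies only on the fact that SystemExit and KeyboardInterrupt derive from BaseException rather than Exception.

```python
def swallow_everything(func):
    """Bare except: also intercepts SystemExit and KeyboardInterrupt."""
    try:
        return func()
    except:  # noqa: E722 (deliberately bare, for illustration only)
        return "swallowed"


def swallow_errors_only(func):
    """except Exception: lets SystemExit and KeyboardInterrupt propagate."""
    try:
        return func()
    except Exception:
        return "swallowed"


def interrupt():
    # Stands in for the user pressing Ctrl-C inside the protected code.
    raise KeyboardInterrupt


def bad_value():
    raise ValueError("ordinary error")


print(swallow_everything(interrupt))   # the interrupt is silently eaten
print(swallow_errors_only(bad_value))  # ordinary errors are still handled

try:
    swallow_errors_only(interrupt)
except KeyboardInterrupt:
    print("KeyboardInterrupt propagated")
```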

Contributor

jreback commented Nov 17, 2015

you would have to add a benchmark for this
though you are using a private function, so unless this is a huge speed-up it's probably not worth it
need a test for name=None
don't use an else clause - just put it in the try - what you added is much less explicit

Contributor

xflr6 commented Nov 17, 2015

As far as I can see, _make() is the preferred way of creating a namedtuple from an iterable (i.e. although it starts with an underscore, it is not a private method):

In addition to the methods inherited from tuples, named tuples support three additional methods
and one attribute. To prevent conflicts with field names, the method and attribute names start with
an underscore.
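
A small sketch of the point about _make(), using a hypothetical Row type that is not part of pandas: the classmethod builds an instance from an existing iterable, which is equivalent to unpacking the iterable into the constructor.

```python
from collections import namedtuple

# Hypothetical row type, for illustration only.
Row = namedtuple("Row", ["Index", "A", "B"])

values = (0, "spam", 42)

by_unpacking = Row(*values)   # constructor call with argument unpacking
by_make = Row._make(values)   # documented classmethod taking an iterable

assert by_unpacking == by_make
print(by_make)          # Row(Index=0, A='spam', B=42)
print(by_make._fields)  # ('Index', 'A', 'B')
```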

I disagree, to me this is exactly what the else-clause of try...except is for. Why have a bare try around the generator expression?

The use of the else clause is better than adding additional code to the try clause because it avoids
accidentally catching an exception that wasn’t raised by the code being protected by the
try ... except statement.
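
The else-clause variant being argued for could look roughly like the sketch below. It is a stdlib-only stand-in, not the pandas implementation: iter_rows, fields and arrays are assumed names, and only the namedtuple construction sits inside the try.

```python
from collections import namedtuple


def iter_rows(fields, arrays, name="Row"):
    """Yield one tuple per row; namedtuples when a usable name is given."""
    if name is not None:
        try:
            # Only the call that is expected to fail is protected.
            row_type = namedtuple(name, fields, rename=True)
        except Exception:
            pass  # e.g. name is not a valid identifier
        else:
            # Runs only if the try block raised nothing.
            return map(row_type._make, zip(*arrays))

    # Fallback to regular tuples.
    return zip(*arrays)


arrays = [[0, 1], ["x", "y"]]
print(list(iter_rows(["Index", "A"], arrays)))
print(list(iter_rows(["Index", "A"], arrays, name=None)))
print(list(iter_rows(["Index", "A"], arrays, name="not an identifier")))
```

Note that the returned map object is lazy, so errors raised while iterating would not be caught by the try in either layout; the two variants differ only in how much code runs under the except.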

Contributor

xflr6 commented Nov 17, 2015

Here are some simple timings:

import collections

import pandas as pd
from pandas.compat import map, zip

class DataFrame(pd.DataFrame):

    def itertuples_new(self, index=True, name="Pandas"):
(...)
            else:
                return (itertuple(*row) for row in zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

    def itertuples_make(self, index=True, name="Pandas"):
(...)
            else:
                return map(itertuple._make, zip(*arrays))

        # fallback to regular tuples
        return zip(*arrays)

df = DataFrame({'A': 'spam', 'B': range(1000), 'C': None,
   'D': range(1000), 'E': range(1000), 'F': range(1000)})

%timeit list(df.itertuples_new())
100 loops, best of 3: 3.04 ms per loop

%timeit list(df.itertuples_make())
100 loops, best of 3: 2.68 ms per loop

%timeit list(df.itertuples_make(name=None))
1000 loops, best of 3: 1.17 ms per loop
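
For readers without IPython or a pandas checkout, the same three-way comparison can be approximated with the standard library alone. This is a rough sketch under assumptions: it uses Python 3, a hypothetical Row type and plain lists and ranges instead of DataFrame columns, so the absolute numbers will not match the timings above.

```python
import timeit
from collections import namedtuple

# Hypothetical row type and column data, standing in for a DataFrame.
Row = namedtuple("Row", ["Index", "A", "B", "C"])
n = 1000
arrays = [range(n), ["spam"] * n, range(n), [None] * n]


def with_unpacking():
    # Corresponds to itertuple(*row) in a generator expression.
    return [Row(*row) for row in zip(*arrays)]


def with_make():
    # Corresponds to map(itertuple._make, zip(*arrays)).
    return list(map(Row._make, zip(*arrays)))


def plain_tuples():
    # Corresponds to the name=None fallback.
    return list(zip(*arrays))


# All three produce the same values; namedtuples compare equal to tuples.
assert with_unpacking() == with_make() == plain_tuples()

for func in (with_unpacking, with_make, plain_tuples):
    best = min(timeit.repeat(func, number=100, repeat=3))
    print(f"{func.__name__}: {best / 100 * 1e3:.3f} ms per call")
```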

Contributor

jreback commented Nov 17, 2015

pls add a benchmark to the asv suite (make about 10x bigger).

@jreback jreback added the Performance label Nov 17, 2015

Contributor

xflr6 commented Nov 17, 2015

$ asv continuous master HEAD -b frame_methods.frame_itertuples
· Creating environments
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 2238f73e:
[  0.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   10.03ms
[ 50.00%] · For pandas commit hash e29bf614:
[ 50.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   12.24ms
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
Contributor

xflr6 commented Nov 17, 2015

[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   77.15ms
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples  103.37ms

Contributor

xflr6 commented Nov 17, 2015

Okay, maybe one last try to get you to reconsider your dislike of try...else clauses.
Apart from the official docs, there also seem to be others in favour.

Contributor

jreback commented Nov 17, 2015

@xflr6 pls just follow my directions. It has nothing to do with whether I like it or not. It's not consistent at all in the code base.

Contributor

xflr6 commented Nov 18, 2015

Performance comparison with the regular tuple returning branch:

[ 25.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples   76.64ms
[ 50.00%] ··· Running ...ame_itertuples.time_frame_itertuples_regular   38.04ms

Contributor

jreback commented Nov 18, 2015

ok, add this issue number onto where #11269 is in whatsnew/v0.17.0
squash to a single commit. ping when green.

Contributor

xflr6 commented Nov 18, 2015

There is no issue for this (only the PR), should I open one?

Contributor

jreback commented Nov 18, 2015

no, use the PR number

@jreback jreback added this to the 0.17.1 milestone Nov 19, 2015

Contributor

jreback commented Nov 19, 2015

merged via 4ffc3ef

thanks!

@jreback jreback closed this Nov 19, 2015

@xflr6 xflr6 deleted the xflr6:enhance_pr11325 branch Nov 19, 2015
