PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

qwhelan · 2018-10-29T04:23:12Z

Removes an unnecessary DataFrame creation when dealing solely with Series objects, which reduces runtime of concat.

xref PERF: concat perf #23362
tests passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Performance Comparison

import pandas as pd
import numpy as np
a = np.random.randint(2000, 2100, size=100000)  # larger n than #23362 
b = np.random.randint(2000, 2100, size=100000)

x = pd.core.arrays.period_array(a, freq='B')
y = pd.core.arrays.period_array(b, freq='B')

s = pd.Series(x)
t = pd.Series(y)

Baseline

%timeit x._concat_same_type([x, y])
202 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.concat([s, t], ignore_index=True)
839 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

%timeit pd.concat([s, t], ignore_index=True)
396 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And the excised code itself:

In [13]: pd.DataFrame()._get_axis_number(0)
Out[13]: 0

In [14]: %timeit pd.DataFrame()._get_axis_number(0)
Out[14]: 312 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 40% of the runtime was being spent mapping axis=0 -> axis=0

pep8speaks · 2018-10-29T04:23:14Z

Hello @qwhelan! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/generic.py !
There are no PEP8 issues in the file pandas/core/reshape/concat.py !
There are no PEP8 issues in the file pandas/tests/generic/test_generic.py !

Comment last updated on November 02, 2018 at 00:32 Hours UTC

codecov · 2018-10-29T05:02:26Z

Codecov Report

Merging #23404 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #23404   +/-   ##
=======================================
  Coverage   92.22%   92.22%           
=======================================
  Files         161      161           
  Lines       51191    51191           
=======================================
  Hits        47210    47210           
  Misses       3981     3981

Flag	Coverage Δ
#multiple	`90.6% <100%> (ø)`	⬆️
#single	`42.26% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/reshape/concat.py	`97.6% <ø> (ø)`	⬆️
pandas/core/generic.py	`96.81% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93aba79...486d3b8. Read the comment docs.

jreback

so this actually affects perf? pls show a before and after

qwhelan · 2018-10-29T19:04:13Z

@jreback I was waiting for asv to finish before posting benchmarks, but I've updated the initial post with some basic benchmarks as that's taking longer than anticipated.

pandas/core/reshape/concat.py

jreback

this change just obfuscates things for very very little gain here. This is a constant time operation that we use all over the codebase.

You can try to make this a @staticmethod instead if you really want to get this in.

TomAugspurger · 2018-10-31T14:20:48Z

@qwhelan doing this just on Py3 also feels hacky. I think everything in _get_axis_number is static, so you can do

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index db10494f0..14edc1d7f 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -358,18 +358,19 @@ class NDFrame(PandasObject, SelectionMixin):
             d.update(kwargs)
             return cls(data, **d)
 
-    def _get_axis_number(self, axis):
-        axis = self._AXIS_ALIASES.get(axis, axis)
+    @classmethod
+    def _get_axis_number(cls, axis):
+        axis = cls._AXIS_ALIASES.get(axis, axis)
         if is_integer(axis):
-            if axis in self._AXIS_NAMES:
+            if axis in cls._AXIS_NAMES:
                 return axis
         else:
             try:
-                return self._AXIS_NUMBERS[axis]
+                return cls._AXIS_NUMBERS[axis]
             except KeyError:
                 pass
         raise ValueError('No axis named {0} for object type {1}'
-                         .format(axis, type(self)))
+                         .format(axis, cls))
 
     def _get_axis_name(self, axis):
         axis = self._AXIS_ALIASES.get(axis, axis)

Will you see if the same change can be made to

_get_axis_name
_get_axis
_get_block_manager_axis

Then, will you ensure that we have basic tests for using all those methods on classes (as well as instances).

qwhelan · 2018-10-31T18:36:47Z

@jreback Yeah, I'm aware this change is trivial but just using it as an excuse to get my dev setup fully working for some more substantial perf patches that will be coming soon.

@TomAugspurger Thanks, I'll check into it.

qwhelan · 2018-10-31T21:45:03Z

@TomAugspurger I implemented the @classmethod change for all but _get_axis() (which is used to get an axis instance, so shouldn't be a @classmethod).

There's decent existing test coverage but I added an explicit test of class vs instance invocations for all valid possible calls for good measure.

TomAugspurger · 2018-10-31T21:50:44Z

Thanks, LGTM, though you have a linting error:
#23404 (comment)

qwhelan · 2018-10-31T22:05:55Z

@TomAugspurger Thanks, resolved.

jreback

small comment, ping on green.

pandas/tests/generic/test_generic.py

jreback · 2018-11-01T17:08:07Z

lgtm. ping on green.

qwhelan · 2018-11-01T19:54:17Z

@jreback All green, thanks!

pandas/tests/generic/test_generic.py

TomAugspurger · 2018-11-01T22:38:34Z

In [4]: %timeit x._concat_same_type([x, y])
   ...:
99.8 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.concat([s, t], ignore_index=True, copy=False)
   ...:
   ...:
194 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

so we're down to 2x. Let me know if you're interested in pushing this further @qwhelan (in a separate PR).

…ethod

jreback · 2018-11-02T13:46:28Z

thanks @qwhelan

qwhelan · 2018-11-02T17:51:11Z

@TomAugspurger Yeah, I'm looking at some related issues if you'd like to keep that issue open for now.

…xamples * repo_org/master: (66 commits) CLN: doc string (pandas-dev#23469) DOC: Add cookbook entry for triangular correlation matrix (GH22840) (pandas-dev#23032) add number of Errors, Warnings to scripts/validate_docstrings.py (pandas-dev#23150) BUG: Allow freq conversion from dt64 to period (pandas-dev#23460) ENH: Add FrozenList.union and .difference (pandas-dev#23394) REF: cython cleanup, typing, optimizations (pandas-dev#23464) strictness and checks for Timedelta _simple_new (pandas-dev#23433) Fixing flake8 problems new to flake8 3.6.0 (pandas-dev#23472) DOC: Updating the docstring of Series.dot (pandas-dev#22890) TST: Fixturize series/test_analytics.py (pandas-dev#22755) BUG/ENH: Handle NonexistentTimeError in date rounding (pandas-dev#23406) PERF: speed up concat on Series by making _get_axis_number() a classmethod (pandas-dev#23404) REF: Remove DatetimelikeArrayMixin._shallow_copy (pandas-dev#23430) REF: strictness/simplification in DatetimeArray/Index _simple_new (pandas-dev#23431) REF: cython cleanup, typing, optimizations (pandas-dev#23456) TST: tweak Hypothesis configuration and idioms (pandas-dev#23441) BUG: fix HDFStore.append with all empty strings error (GH12242) (pandas-dev#23435) TST: Skip 32bit failing IntervalTree tests (pandas-dev#23442) BUG: Deprecate nthreads argument (pandas-dev#23112) style: fix import format at pandas/core/reshape (pandas-dev#23387) ...

…ethod (pandas-dev#23404)

jreback requested changes Oct 29, 2018

View reviewed changes

gfyoung added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 29, 2018

topper-123 requested changes Oct 30, 2018

View reviewed changes

pandas/core/reshape/concat.py Outdated Show resolved Hide resolved

qwhelan force-pushed the master branch 2 times, most recently from 8816e4b to b464b65 Compare October 30, 2018 20:09

jreback requested changes Oct 31, 2018

View reviewed changes

qwhelan force-pushed the master branch from b464b65 to 80dfcb8 Compare October 31, 2018 21:36

qwhelan force-pushed the master branch from 80dfcb8 to 87066cd Compare October 31, 2018 21:57

jreback requested changes Nov 1, 2018

View reviewed changes

pandas/tests/generic/test_generic.py Outdated Show resolved Hide resolved

jreback added this to the 0.24.0 milestone Nov 1, 2018

qwhelan force-pushed the master branch from 87066cd to aee8a67 Compare November 1, 2018 17:07

jreback approved these changes Nov 1, 2018

View reviewed changes

qwhelan force-pushed the master branch from aee8a67 to c2b6cc1 Compare November 1, 2018 18:37

TomAugspurger reviewed Nov 1, 2018

View reviewed changes

pandas/tests/generic/test_generic.py Outdated Show resolved Hide resolved

qwhelan force-pushed the master branch from c2b6cc1 to ec04672 Compare November 2, 2018 00:26

PERF: speed up concat on Series by making _get_axis_number() a classm…

486d3b8

…ethod

qwhelan force-pushed the master branch from ec04672 to 486d3b8 Compare November 2, 2018 00:32

jreback merged commit c30c07f into pandas-dev:master Nov 2, 2018

TomAugspurger mentioned this pull request Nov 2, 2018

PERF: concat perf #23362

Closed

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

PERF: speed up concat on Series by making _get_axis_number() a classm…

68b8fdc

…ethod (pandas-dev#23404)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

PERF: speed up concat on Series by making _get_axis_number() a classm…

b314df4

…ethod (pandas-dev#23404)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

PERF: speed up concat on Series by making _get_axis_number() a classm…

0c6811c

…ethod (pandas-dev#23404)

simonjayhawkins mentioned this pull request Mar 5, 2019

BUG: in error message raised when invalid axis parameter #25555

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

qwhelan commented Oct 29, 2018 •

edited by jreback

Loading

pep8speaks commented Oct 29, 2018 •

edited

Loading

codecov bot commented Oct 29, 2018 •

edited

Loading

jreback left a comment

qwhelan commented Oct 29, 2018

jreback left a comment

TomAugspurger commented Oct 31, 2018

qwhelan commented Oct 31, 2018

qwhelan commented Oct 31, 2018

TomAugspurger commented Oct 31, 2018

qwhelan commented Oct 31, 2018

jreback left a comment

jreback commented Nov 1, 2018

qwhelan commented Nov 1, 2018

TomAugspurger commented Nov 1, 2018

jreback commented Nov 2, 2018

qwhelan commented Nov 2, 2018

PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

Conversation

qwhelan commented Oct 29, 2018 • edited by jreback Loading

Performance Comparison

Baseline

After

pep8speaks commented Oct 29, 2018 • edited Loading

Comment last updated on November 02, 2018 at 00:32 Hours UTC

codecov bot commented Oct 29, 2018 • edited Loading

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

qwhelan commented Oct 29, 2018

jreback left a comment

Choose a reason for hiding this comment

TomAugspurger commented Oct 31, 2018

qwhelan commented Oct 31, 2018

qwhelan commented Oct 31, 2018

TomAugspurger commented Oct 31, 2018

qwhelan commented Oct 31, 2018

jreback left a comment

Choose a reason for hiding this comment

jreback commented Nov 1, 2018

qwhelan commented Nov 1, 2018

TomAugspurger commented Nov 1, 2018

jreback commented Nov 2, 2018

qwhelan commented Nov 2, 2018

qwhelan commented Oct 29, 2018 •

edited by jreback

Loading

pep8speaks commented Oct 29, 2018 •

edited

Loading

codecov bot commented Oct 29, 2018 •

edited

Loading