Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

Merged
merged 1 commit into from Nov 2, 2018

Conversation

Projects
None yet
6 participants
@qwhelan
Copy link
Contributor

commented Oct 29, 2018

Removes an unnecessary DataFrame creation when dealing solely with Series objects, which reduces runtime of concat.

  • xref #23362
  • tests passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Performance Comparison

import pandas as pd
import numpy as np
a = np.random.randint(2000, 2100, size=100000)  # larger n than #23362 
b = np.random.randint(2000, 2100, size=100000)

x = pd.core.arrays.period_array(a, freq='B')
y = pd.core.arrays.period_array(b, freq='B')

s = pd.Series(x)
t = pd.Series(y)

Baseline

%timeit x._concat_same_type([x, y])
202 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.concat([s, t], ignore_index=True)
839 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

%timeit pd.concat([s, t], ignore_index=True)
396 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And the excised code itself:

In [13]: pd.DataFrame()._get_axis_number(0)
Out[13]: 0

In [14]: %timeit pd.DataFrame()._get_axis_number(0)
Out[14]: 312 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 40% of the runtime was being spent mapping axis=0 -> axis=0

@pep8speaks

This comment has been minimized.

Copy link

commented Oct 29, 2018

Hello @qwhelan! Thanks for updating the PR.

Comment last updated on November 02, 2018 at 00:32 Hours UTC
@codecov

This comment has been minimized.

Copy link

commented Oct 29, 2018

Codecov Report

Merging #23404 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23404   +/-   ##
=======================================
  Coverage   92.22%   92.22%           
=======================================
  Files         161      161           
  Lines       51191    51191           
=======================================
  Hits        47210    47210           
  Misses       3981     3981
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.26% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/concat.py 97.6% <ø> (ø) ⬆️
pandas/core/generic.py 96.81% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93aba79...486d3b8. Read the comment docs.

@jreback
Copy link
Contributor

left a comment

so this actually affects perf? pls show a before and after

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Oct 29, 2018

@jreback I was waiting for asv to finish before posting benchmarks, but I've updated the initial post with some basic benchmarks as that's taking longer than anticipated.

Show resolved Hide resolved pandas/core/reshape/concat.py Outdated

@qwhelan qwhelan force-pushed the qwhelan:master branch 2 times, most recently from 8816e4b to b464b65 Oct 30, 2018

@jreback
Copy link
Contributor

left a comment

this change just obfuscates things for very very little gain here. This is a constant time operation that we use all over the codebase.

You can try to make this a @staticmethod instead if you really want to get this in.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2018

@qwhelan doing this just on Py3 also feels hacky. I think everything in _get_axis_number is static, so you can do

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index db10494f0..14edc1d7f 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -358,18 +358,19 @@ class NDFrame(PandasObject, SelectionMixin):
             d.update(kwargs)
             return cls(data, **d)
 
-    def _get_axis_number(self, axis):
-        axis = self._AXIS_ALIASES.get(axis, axis)
+    @classmethod
+    def _get_axis_number(cls, axis):
+        axis = cls._AXIS_ALIASES.get(axis, axis)
         if is_integer(axis):
-            if axis in self._AXIS_NAMES:
+            if axis in cls._AXIS_NAMES:
                 return axis
         else:
             try:
-                return self._AXIS_NUMBERS[axis]
+                return cls._AXIS_NUMBERS[axis]
             except KeyError:
                 pass
         raise ValueError('No axis named {0} for object type {1}'
-                         .format(axis, type(self)))
+                         .format(axis, cls))
 
     def _get_axis_name(self, axis):
         axis = self._AXIS_ALIASES.get(axis, axis)

Will you see if the same change can be made to

  • _get_axis_name
  • _get_axis
  • _get_block_manager_axis

Then, will you ensure that we have basic tests for using all those methods on classes (as well as instances).

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2018

@jreback Yeah, I'm aware this change is trivial but just using it as an excuse to get my dev setup fully working for some more substantial perf patches that will be coming soon.

@TomAugspurger Thanks, I'll check into it.

@qwhelan qwhelan force-pushed the qwhelan:master branch from b464b65 to 80dfcb8 Oct 31, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2018

@TomAugspurger I implemented the @classmethod change for all but _get_axis() (which is used to get an axis instance, so shouldn't be a @classmethod).

There's decent existing test coverage but I added an explicit test of class vs instance invocations for all valid possible calls for good measure.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2018

Thanks, LGTM, though you have a linting error:
#23404 (comment)

@qwhelan qwhelan force-pushed the qwhelan:master branch from 80dfcb8 to 87066cd Oct 31, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2018

@TomAugspurger Thanks, resolved.

@jreback
Copy link
Contributor

left a comment

small comment, ping on green.

Show resolved Hide resolved pandas/tests/generic/test_generic.py Outdated

@jreback jreback added this to the 0.24.0 milestone Nov 1, 2018

@qwhelan qwhelan force-pushed the qwhelan:master branch from 87066cd to aee8a67 Nov 1, 2018

@jreback

jreback approved these changes Nov 1, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 1, 2018

lgtm. ping on green.

@qwhelan qwhelan force-pushed the qwhelan:master branch from aee8a67 to c2b6cc1 Nov 1, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 1, 2018

@jreback All green, thanks!

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 1, 2018

In [4]: %timeit x._concat_same_type([x, y])
   ...:
99.8 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.concat([s, t], ignore_index=True, copy=False)
   ...:
   ...:
194 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

so we're down to 2x. Let me know if you're interested in pushing this further @qwhelan (in a separate PR).

@qwhelan qwhelan force-pushed the qwhelan:master branch from c2b6cc1 to ec04672 Nov 2, 2018

@qwhelan qwhelan force-pushed the qwhelan:master branch from ec04672 to 486d3b8 Nov 2, 2018

@jreback jreback merged commit c30c07f into pandas-dev:master Nov 2, 2018

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
pandas-dev.pandas Build #20181102.4 has test failures
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2018

thanks @qwhelan

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2018

@TomAugspurger Yeah, I'm looking at some related issues if you'd like to keep that issue open for now.

thoo added a commit to thoo/pandas that referenced this pull request Nov 3, 2018

Merge remote-tracking branch 'repo_org/master' into check_np_pd_fromE…
…xamples

* repo_org/master: (66 commits)
  CLN: doc string (pandas-dev#23469)
  DOC: Add cookbook entry for triangular correlation matrix (GH22840) (pandas-dev#23032)
  add number of Errors, Warnings to scripts/validate_docstrings.py (pandas-dev#23150)
  BUG: Allow freq conversion from dt64 to period (pandas-dev#23460)
  ENH: Add FrozenList.union and .difference (pandas-dev#23394)
  REF: cython cleanup, typing, optimizations (pandas-dev#23464)
  strictness and checks for Timedelta _simple_new (pandas-dev#23433)
  Fixing flake8 problems new to flake8 3.6.0 (pandas-dev#23472)
  DOC: Updating the docstring of Series.dot  (pandas-dev#22890)
  TST: Fixturize series/test_analytics.py (pandas-dev#22755)
  BUG/ENH: Handle NonexistentTimeError in date rounding (pandas-dev#23406)
  PERF: speed up concat on Series by making _get_axis_number() a classmethod (pandas-dev#23404)
  REF: Remove DatetimelikeArrayMixin._shallow_copy (pandas-dev#23430)
  REF: strictness/simplification in DatetimeArray/Index _simple_new (pandas-dev#23431)
  REF: cython cleanup, typing, optimizations (pandas-dev#23456)
  TST: tweak Hypothesis configuration and idioms (pandas-dev#23441)
  BUG: fix HDFStore.append with all empty strings error (GH12242) (pandas-dev#23435)
  TST: Skip 32bit failing IntervalTree tests (pandas-dev#23442)
  BUG: Deprecate nthreads argument (pandas-dev#23112)
  style: fix import format at pandas/core/reshape (pandas-dev#23387)
  ...

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.