Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404

Merged
merged 1 commit into from
Nov 2, 2018

Conversation

qwhelan
Copy link
Contributor

@qwhelan qwhelan commented Oct 29, 2018

Removes an unnecessary DataFrame creation when dealing solely with Series objects, which reduces runtime of concat.

Performance Comparison

import pandas as pd
import numpy as np
a = np.random.randint(2000, 2100, size=100000)  # larger n than #23362 
b = np.random.randint(2000, 2100, size=100000)

x = pd.core.arrays.period_array(a, freq='B')
y = pd.core.arrays.period_array(b, freq='B')

s = pd.Series(x)
t = pd.Series(y)

Baseline

%timeit x._concat_same_type([x, y])
202 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.concat([s, t], ignore_index=True)
839 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

%timeit pd.concat([s, t], ignore_index=True)
396 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And the excised code itself:

In [13]: pd.DataFrame()._get_axis_number(0)
Out[13]: 0

In [14]: %timeit pd.DataFrame()._get_axis_number(0)
Out[14]: 312 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 40% of the runtime was being spent mapping axis=0 -> axis=0

@pep8speaks
Copy link

pep8speaks commented Oct 29, 2018

Hello @qwhelan! Thanks for updating the PR.

Comment last updated on November 02, 2018 at 00:32 Hours UTC

@codecov
Copy link

codecov bot commented Oct 29, 2018

Codecov Report

Merging #23404 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23404   +/-   ##
=======================================
  Coverage   92.22%   92.22%           
=======================================
  Files         161      161           
  Lines       51191    51191           
=======================================
  Hits        47210    47210           
  Misses       3981     3981
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.26% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/concat.py 97.6% <ø> (ø) ⬆️
pandas/core/generic.py 96.81% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93aba79...486d3b8. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this actually affects perf? pls show a before and after

@gfyoung gfyoung added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 29, 2018
@qwhelan
Copy link
Contributor Author

qwhelan commented Oct 29, 2018

@jreback I was waiting for asv to finish before posting benchmarks, but I've updated the initial post with some basic benchmarks as that's taking longer than anticipated.

pandas/core/reshape/concat.py Outdated Show resolved Hide resolved
@qwhelan qwhelan force-pushed the master branch 2 times, most recently from 8816e4b to b464b65 Compare October 30, 2018 20:09
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change just obfuscates things for very very little gain here. This is a constant time operation that we use all over the codebase.

You can try to make this a @staticmethod instead if you really want to get this in.

@TomAugspurger
Copy link
Contributor

@qwhelan doing this just on Py3 also feels hacky. I think everything in _get_axis_number is static, so you can do

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index db10494f0..14edc1d7f 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -358,18 +358,19 @@ class NDFrame(PandasObject, SelectionMixin):
             d.update(kwargs)
             return cls(data, **d)
 
-    def _get_axis_number(self, axis):
-        axis = self._AXIS_ALIASES.get(axis, axis)
+    @classmethod
+    def _get_axis_number(cls, axis):
+        axis = cls._AXIS_ALIASES.get(axis, axis)
         if is_integer(axis):
-            if axis in self._AXIS_NAMES:
+            if axis in cls._AXIS_NAMES:
                 return axis
         else:
             try:
-                return self._AXIS_NUMBERS[axis]
+                return cls._AXIS_NUMBERS[axis]
             except KeyError:
                 pass
         raise ValueError('No axis named {0} for object type {1}'
-                         .format(axis, type(self)))
+                         .format(axis, cls))
 
     def _get_axis_name(self, axis):
         axis = self._AXIS_ALIASES.get(axis, axis)

Will you see if the same change can be made to

  • _get_axis_name
  • _get_axis
  • _get_block_manager_axis

Then, will you ensure that we have basic tests for using all those methods on classes (as well as instances).

@qwhelan
Copy link
Contributor Author

qwhelan commented Oct 31, 2018

@jreback Yeah, I'm aware this change is trivial but just using it as an excuse to get my dev setup fully working for some more substantial perf patches that will be coming soon.

@TomAugspurger Thanks, I'll check into it.

@qwhelan
Copy link
Contributor Author

qwhelan commented Oct 31, 2018

@TomAugspurger I implemented the @classmethod change for all but _get_axis() (which is used to get an axis instance, so shouldn't be a @classmethod).

There's decent existing test coverage but I added an explicit test of class vs instance invocations for all valid possible calls for good measure.

@TomAugspurger
Copy link
Contributor

Thanks, LGTM, though you have a linting error:
#23404 (comment)

@qwhelan
Copy link
Contributor Author

qwhelan commented Oct 31, 2018

@TomAugspurger Thanks, resolved.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, ping on green.

pandas/tests/generic/test_generic.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

lgtm. ping on green.

@qwhelan
Copy link
Contributor Author

qwhelan commented Nov 1, 2018

@jreback All green, thanks!

@TomAugspurger
Copy link
Contributor

In [4]: %timeit x._concat_same_type([x, y])
   ...:
99.8 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.concat([s, t], ignore_index=True, copy=False)
   ...:
   ...:
194 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

so we're down to 2x. Let me know if you're interested in pushing this further @qwhelan (in a separate PR).

@jreback jreback merged commit c30c07f into pandas-dev:master Nov 2, 2018
@jreback
Copy link
Contributor

jreback commented Nov 2, 2018

thanks @qwhelan

@qwhelan
Copy link
Contributor Author

qwhelan commented Nov 2, 2018

@TomAugspurger Yeah, I'm looking at some related issues if you'd like to keep that issue open for now.

@TomAugspurger TomAugspurger mentioned this pull request Nov 2, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 3, 2018
…xamples

* repo_org/master: (66 commits)
  CLN: doc string (pandas-dev#23469)
  DOC: Add cookbook entry for triangular correlation matrix (GH22840) (pandas-dev#23032)
  add number of Errors, Warnings to scripts/validate_docstrings.py (pandas-dev#23150)
  BUG: Allow freq conversion from dt64 to period (pandas-dev#23460)
  ENH: Add FrozenList.union and .difference (pandas-dev#23394)
  REF: cython cleanup, typing, optimizations (pandas-dev#23464)
  strictness and checks for Timedelta _simple_new (pandas-dev#23433)
  Fixing flake8 problems new to flake8 3.6.0 (pandas-dev#23472)
  DOC: Updating the docstring of Series.dot  (pandas-dev#22890)
  TST: Fixturize series/test_analytics.py (pandas-dev#22755)
  BUG/ENH: Handle NonexistentTimeError in date rounding (pandas-dev#23406)
  PERF: speed up concat on Series by making _get_axis_number() a classmethod (pandas-dev#23404)
  REF: Remove DatetimelikeArrayMixin._shallow_copy (pandas-dev#23430)
  REF: strictness/simplification in DatetimeArray/Index _simple_new (pandas-dev#23431)
  REF: cython cleanup, typing, optimizations (pandas-dev#23456)
  TST: tweak Hypothesis configuration and idioms (pandas-dev#23441)
  BUG: fix HDFStore.append with all empty strings error (GH12242) (pandas-dev#23435)
  TST: Skip 32bit failing IntervalTree tests (pandas-dev#23442)
  BUG: Deprecate nthreads argument (pandas-dev#23112)
  style: fix import format at pandas/core/reshape (pandas-dev#23387)
  ...
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants