Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF, BENCH: Fix performance issue when indexing into non-unique DatetimeIndex/PeriodIndex. #27136

Merged
merged 2 commits into from Jul 6, 2019

Conversation

@qwhelan
Copy link
Contributor

commented Jun 30, 2019

This PR fixes a number of issues involving non-unique non-numeric indexing:

  • The NonNumericSeriesIndexing benchmark incorrectly assumes tm.makeStringIndex() returns a monotonic Index. It does not and unfortunately invalidates the existing results here.
    • While modifying this, figured should add PeriodIndex and non_monotonic cases
  • The setup() function creates an index but does not trigger the engine mapping to be populated. This leads to noisy and erroneous results.
  • A PeriodIndex or DatetimeIndex with non-unique, monotonic data creates an entirely new Index when being sliced. This means the engine mapping gets recomputed on every list-like query.
  • A PeriodIndex is actually a PeriodEngine plus a separate Int64Index (with associated Int64Engine). The former contains a mapping of {Period: value} while being a subclass of Int64Engine. This means calls to _bin_search() fail as the underlying array consists of Period objects. This only occurs when the index is non-unique, monotonic, and >= 1M elements long.
    • We add an explicit test for this failure mode
  • A number of index calls pass value.ordinal to the PeriodEngine; as that is a {Period: value} mapping, all such calls fail.
    • We instead pass value.ordinal queries to the Int64Engine, where they succeed.
  • Finally, PeriodIndex.get_memory() does not account for the memory of its associated Int64Index. This is fixed by simply adding it to the total.
       before           after         ratio
     [2811464a]       [a68caa2a]
     <import_bench~1>       <period_fix>
           failed         62.4±2μs      n/a  indexing.NonNumericSeriesIndexing.time_get_value('period', 'unique_monotonic_inc')
           failed         216±20μs      n/a  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('period', 'unique_monotonic_inc')
           failed      2.74±0.05ms      n/a  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'unique_monotonic_inc')
           failed        160±0.5μs      n/a  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('period', 'unique_monotonic_inc')
           failed       43.0±0.2μs      n/a  indexing.NonNumericSeriesIndexing.time_getitem_scalar('period', 'unique_monotonic_inc')
-        718±10μs         207±10μs     0.29  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('period', 'non_monotonic')
-     2.53±0.05ms         701±10μs     0.28  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'non_monotonic')
-      33.8±0.9ms      2.32±0.04ms     0.07  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'nonunique_monotonic_inc')
-      62.7±0.5ms          532±3μs     0.01  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')

       before           after         ratio
     [2811464a]       [a68caa2a]
     <import_bench~1>       <period_fix>
+      40.7±0.7μs       46.8±0.8μs     1.15  indexing.NonNumericSeriesIndexing.time_get_value('period', 'non_monotonic')
+        27.0±2μs       30.4±0.6μs     1.12  indexing.NonNumericSeriesIndexing.time_getitem_scalar('period', 'non_monotonic')
  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@pep8speaks

This comment has been minimized.

Copy link

commented Jun 30, 2019

Hello @qwhelan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-06 21:25:48 UTC

@qwhelan qwhelan force-pushed the qwhelan:period_fix branch 2 times, most recently from a857b09 to 1881cb5 Jun 30, 2019

@@ -906,6 +923,10 @@ def base(self):
FutureWarning, stacklevel=2)
return np.asarray(self._data)

def memory_usage(self, deep=False):
result = super(PeriodIndex, self).memory_usage(deep=deep)
result += self._int64index.memory_usage(deep=deep)

This comment has been minimized.

Copy link
@jreback

jreback Jun 30, 2019

Contributor

this only should add the int64index if it’s actually created

This comment has been minimized.

Copy link
@qwhelan

qwhelan Jun 30, 2019

Author Contributor

Checking for '_int64index' in self._cache now, which avoids triggering creation.

@qwhelan qwhelan force-pushed the qwhelan:period_fix branch 5 times, most recently from ea5d08d to e03dc41 Jun 30, 2019

@jreback
Copy link
Contributor

left a comment

can you also rebase on master; note that you will need to update with black

black . will pretty much fix things
(pip install black)

@@ -576,6 +576,30 @@ def test_indexing_over_size_cutoff():
_index._SIZE_CUTOFF = old_cutoff


def test_indexing_over_size_cutoff_period_index():

This comment has been minimized.

Copy link
@jreback

jreback Jul 4, 2019

Contributor

can you add a comment here (the title is pretty good though) with the issue number as a comment as well

@qwhelan qwhelan force-pushed the qwhelan:period_fix branch 3 times, most recently from 26b2c1e to 3809224 Jul 5, 2019

qwhelan added some commits Jun 29, 2019

PERF, BENCH: Fix performance issue when indexing into non-unique Date…
…timeIndex/PeriodIndex.

Additionally, fix asv benchmark bugs that hid the issue and correct PeriodIndex.memory_usage().

@qwhelan qwhelan force-pushed the qwhelan:period_fix branch from 3809224 to d2de56c Jul 6, 2019

@jreback jreback added this to the 0.25.0 milestone Jul 6, 2019

@jreback

jreback approved these changes Jul 6, 2019

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 6, 2019

lgtm. ping on green.

@jreback jreback merged commit af5b2a2 into pandas-dev:master Jul 6, 2019

14 checks passed

codecov/patch 86.66% of diff hit (target 50%)
Details
codecov/project 92.78% (target 82%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20190706.28 succeeded
Details
pandas-dev.pandas (Checks) Checks succeeded
Details
pandas-dev.pandas (Docs) Docs succeeded
Details
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
Details
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
Details
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
Details
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
Details
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
Details
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 6, 2019

thanks @qwhelan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.