PERF: improved performance of CategoricalIndex.is_monotonic* #21025

Merged
merged 1 commit into from May 17, 2018

4 participants
@topper-123
Contributor

topper-123 commented May 14, 2018

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
>>> n = 1000000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.is_monotonic_increasing
22 ms # v0.22 and master
227 ns  # this commit

There seem to be a few more like this, where CategoricalIndex should use self._engine but doesn't.

@TomAugspurger?

@jreback

Contributor

jreback commented May 14, 2018

this hit the same code path; so check this

@topper-123

Contributor

topper-123 commented May 14, 2018

Not sure I follow, but these two versions do not follow the same code path: the old version required creating a new Int64Index, which is expensive.
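To make the difference concrete, here is a minimal sketch, assuming a pandas version in which `CategoricalIndex.codes` is available. It contrasts the approach described in the comment above (wrapping the integer codes in a brand-new index on every call) with asking the `CategoricalIndex` directly. The helper name `monotonic_via_new_index` is hypothetical and not part of pandas, and `n` is kept small here rather than the 1,000,000 used in the original timing.

```python
import pandas as pd

n = 1000  # kept small; the original snippet used 1000000
ci = pd.CategoricalIndex(list("a" * n + "b" * n + "c" * n))


def monotonic_via_new_index(index):
    """Hypothetical helper mimicking the old approach: build a new
    integer index from the codes on every call, then query it."""
    return pd.Index(index.codes).is_monotonic_increasing


# Both routes give the same answer; they differ only in how much
# work is repeated each time the property is accessed.
old_style = monotonic_via_new_index(ci)
direct = ci.is_monotonic_increasing
```

The extra index construction is the repeated cost the comment refers to; the result itself should not change.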

CategoricalIndex.is_monotonic is already tested in indexes/test_category.py::TestCategoricalIndex::test_is_monotonic.

@codecov

codecov bot commented May 14, 2018

Codecov Report

Merging #21025 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21025   +/-   ##
=======================================
  Coverage   91.83%   91.83%           
=======================================
  Files         153      153           
  Lines       49495    49495           
=======================================
  Hits        45454    45454           
  Misses       4041     4041
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.23% <100%> (ø) ⬆️ |
| #single | 41.88% <0%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/category.py | 97.03% <100%> (ø) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 501f041...c815d62.

@jreback

Contributor

jreback commented May 14, 2018

can you add additional tests using strings (and not just integers) in that same test. otherwise lgtm.
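A sketch of what such a test could look like, run against both integer and string data. This is an illustration only, not the test that was merged; the helper name `check_is_monotonic` is hypothetical. It relies on the behavior that monotonicity of a `CategoricalIndex` follows the order of its categories, not the lexical order of its values.

```python
import pandas as pd


def check_is_monotonic(data):
    """Hypothetical check; `data` must be three sorted, unique values."""
    # Default categories are sorted, so sorted data is increasing.
    c = pd.CategoricalIndex(data)
    assert c.is_monotonic_increasing
    assert not c.is_monotonic_decreasing

    # Reversing the category order flips the result for the same data.
    reversed_cats = list(reversed(data))
    c = pd.CategoricalIndex(data, categories=reversed_cats)
    assert not c.is_monotonic_increasing
    assert c.is_monotonic_decreasing

    # Data that is neither increasing nor decreasing.
    shuffled = [data[0], data[2], data[1]]
    c = pd.CategoricalIndex(shuffled, categories=reversed_cats)
    assert not c.is_monotonic_increasing
    assert not c.is_monotonic_decreasing
    return True


# Same checks for integers and for strings.
results = [check_is_monotonic([1, 2, 3]), check_is_monotonic(list("abc"))]
```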

@jreback

Contributor

jreback commented May 14, 2018

do we have sufficient asv's for this?

@topper-123

Contributor

topper-123 commented May 14, 2018

There were no asv's for this. However, if you run my code snippet above on the old version, there is a huge spike in RAM usage. I've even gotten a few MemoryErrors.

So my asv benchmark uses only N = 1000 to limit memory usage. The result is 60 microseconds (old version) vs. 260 ns (new version).

Also, Series.is_monotonic* wasn't added until 0.19. Should that be put in a try/except clause to avoid failing on older versions of pandas?

@jreback

minor comment on the asv. it's ok if it fails under 0.19; that's pretty far back now

        self.c = pd.CategoricalIndex(list('a'*N + 'b'*N + 'c'*N))
        self.s = pd.Series(self.c)
    def time_categorical_index_is_monotonic(self):

@jreback

jreback May 14, 2018

Contributor

these shouldn't be in the same asv, you can do this with params I think
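One way the `params` suggestion could be read is sketched below, using asv's convention that `setup` and each `time_*` method are called once per entry in `params`. This is an assumption about what was meant, not the benchmark that was merged; the class name `IsMonotonic`, the parameter name `container`, and the attribute `obj` are all illustrative.

```python
import pandas as pd


class IsMonotonic:
    # asv calls setup() and every time_* method once per params entry,
    # so the index and the Series are timed as separate benchmarks.
    params = ["index", "series"]
    param_names = ["container"]

    def setup(self, container):
        N = 1000  # small N, per the memory concern raised in this thread
        ci = pd.CategoricalIndex(list("a" * N + "b" * N + "c" * N))
        self.obj = ci if container == "index" else pd.Series(ci)

    def time_is_monotonic_increasing(self, container):
        self.obj.is_monotonic_increasing

    def time_is_monotonic_decreasing(self, container):
        self.obj.is_monotonic_decreasing
```

With this layout the two containers show up as separate rows in the asv results instead of being folded into one timing.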

@pep8speaks

pep8speaks commented May 15, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 16, 2018 at 19:04 Hours UTC
@@ -1079,6 +1079,7 @@ Performance Improvements
- Improved performance of :func:`pandas.core.groupby.GroupBy.pct_change` (:issue:`19165`)
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
- Improved performance of ``getattr(Series, attr)`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)
- Improved performance of :meth:`CategoricalIndex.is_monotonic_increasing`, :meth:`CategoricalIndex.is_monotonic_decreasing` and :meth:`CategoricalIndex.is_monotonic` (:issue:`21025`)

@jreback

jreback May 15, 2018

Contributor

will need to be in 0.23.1 (not yet in repo, soon)

@topper-123

topper-123 May 16, 2018

Contributor

Moved to 0.23.1.

@jreback jreback added this to the 0.23.1 milestone May 15, 2018

@jreback jreback merged commit 1ee5ecf into pandas-dev:master May 17, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@jreback

Contributor

jreback commented May 17, 2018

thanks @topper-123

@topper-123 topper-123 deleted the topper-123:is_monotonic_perf branch May 21, 2018

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018

@fjetter fjetter referenced this pull request Jun 9, 2018

Closed

PERF: __contains__ method for Categorical #21022

4 of 4 tasks complete

jorisvandenbossche added a commit that referenced this pull request Jun 9, 2018
