PERF: improved performance of CategoricalIndex.is_monotonic* #21025

Merged
merged 1 commit into from May 17, 2018

Conversation

topper-123
Contributor

@topper-123 topper-123 commented May 14, 2018

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

>>> n = 1000000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.is_monotonic_increasing
22 ms   # v0.22 and master
227 ns  # this commit

There seem to be a few more like this, where CategoricalIndex should use self._engine but doesn't.

@TomAugspurger?

@jreback
Contributor

jreback commented May 14, 2018

this hits the same code path, so check this

@topper-123
Contributor Author

Not sure I follow, but these two versions do not follow the same code path, as the old version required creating a new Int64Index which is expensive.

CategoricalIndex.is_monotonic is already tested in indexes/test_category.py::TestCategoricalIndex::test_is_monotonic.
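
To make the difference between the two code paths concrete, here is a toy sketch that does not use pandas. `ToyEngine`, `ToyCategoricalIndex` and `slow_is_monotonic_increasing` are hypothetical stand-ins, not pandas internals. The only assumption is the one stated above: the old path built a fresh index-like object from the codes on every access, while the new path reuses a cached engine.

```python
from functools import cached_property


class ToyEngine:
    """Hypothetical stand-in for an index engine: caches the monotonic check."""

    def __init__(self, codes):
        self._codes = codes

    @cached_property
    def is_monotonic_increasing(self):
        # One O(n) pass over the integer codes; the result is then cached.
        return all(a <= b for a, b in zip(self._codes, self._codes[1:]))


class ToyCategoricalIndex:
    def __init__(self, values, categories):
        lookup = {cat: i for i, cat in enumerate(categories)}
        self.codes = [lookup[v] for v in values]

    @cached_property
    def _engine(self):
        # Built once per index, then reused by every later property access.
        return ToyEngine(self.codes)

    @property
    def slow_is_monotonic_increasing(self):
        # Assumed shape of the old path: build a fresh object on every access.
        return ToyEngine(list(self.codes)).is_monotonic_increasing

    @property
    def is_monotonic_increasing(self):
        # Assumed shape of the new path: delegate to the cached engine.
        return self._engine.is_monotonic_increasing


ci = ToyCategoricalIndex(list("aabbcc"), categories=list("abc"))
fast = ci.is_monotonic_increasing
slow = ci.slow_is_monotonic_increasing
```

Both properties return the same answer; only the repeated-access cost differs, since the slow variant copies the codes and rescans them each time.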

@codecov

codecov bot commented May 14, 2018

Codecov Report

Merging #21025 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21025   +/-   ##
=======================================
  Coverage   91.83%   91.83%           
=======================================
  Files         153      153           
  Lines       49495    49495           
=======================================
  Hits        45454    45454           
  Misses       4041     4041

Flag        Coverage Δ
#multiple   90.23% <100%> (ø) ⬆️
#single     41.88% <0%> (ø) ⬆️

Impacted Files                    Coverage Δ
pandas/core/indexes/category.py   97.03% <100%> (ø) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 501f041...c815d62.

@jreback
Contributor

jreback commented May 14, 2018

can you add additional tests using strings (and not just integers) in that same test? otherwise lgtm.
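
As a rough illustration of what string-based cases could cover, the sketch below checks monotonicity on category codes rather than on the lexical order of the labels. That is an assumption based on the discussion of codes and `self._engine` above. The helper names are hypothetical, and this is not the test that was added to the PR.

```python
def codes_for(values, categories):
    """Map each value to the position of its category (its integer code)."""
    lookup = {cat: i for i, cat in enumerate(categories)}
    return [lookup[v] for v in values]


def is_monotonic_increasing(codes):
    return all(a <= b for a, b in zip(codes, codes[1:]))


def is_monotonic_decreasing(codes):
    return all(a >= b for a, b in zip(codes, codes[1:]))


# (values, categories, expected increasing, expected decreasing)
cases = [
    (list("abc"), list("abc"), True, False),   # codes 0, 1, 2
    (list("abc"), list("cba"), False, True),   # codes 2, 1, 0
    (list("bac"), list("abc"), False, False),  # codes 1, 0, 2
]

results = []
for values, categories, _, _ in cases:
    codes = codes_for(values, categories)
    results.append((is_monotonic_increasing(codes), is_monotonic_decreasing(codes)))
```

The second case is the interesting one for strings: the labels are in alphabetical order, but the reversed category order makes the codes decreasing.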

@jreback
Contributor

jreback commented May 14, 2018

do we have sufficient asv's for this?

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 14, 2018
@topper-123 topper-123 force-pushed the is_monotonic_perf branch 2 times, most recently from 1ee1d93 to 6bdbb5d Compare May 14, 2018 17:06
@topper-123
Contributor Author

There were no asv's for this. However, running my code snippet above on the old version causes a huge spike in RAM usage; I've even gotten a few MemoryErrors.

So my ASV uses only N = 1000 to limit memory usage. The result is 60 microseconds (old version) vs 260 ns (new version).

Also, Series.is_monotonic* wasn't added until 0.19. Should that be put in a try/except clause to avoid failing on older versions of pandas?
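
One way such a guard could look is sketched below. It relies on the asv convention that a benchmark whose `setup` raises `NotImplementedError` is reported as skipped rather than failed. `require_attr`, `OldSeries`, `NewSeries` and `try_setup` are hypothetical names used only to make the sketch runnable without pandas.

```python
def require_attr(obj, name):
    """Raise NotImplementedError if obj lacks the attribute being benchmarked."""
    if not hasattr(obj, name):
        raise NotImplementedError(f"{type(obj).__name__} has no {name!r}")


class OldSeries:
    """Hypothetical stand-in for a Series from a release without is_monotonic*."""


class NewSeries:
    """Hypothetical stand-in for a Series that has the attribute."""

    is_monotonic_increasing = True


def try_setup(obj):
    # Mimics what a benchmark runner would observe when calling setup().
    try:
        require_attr(obj, "is_monotonic_increasing")
    except NotImplementedError:
        return "skipped"
    return "ran"


outcomes = [try_setup(OldSeries()), try_setup(NewSeries())]
```

In a real benchmark class the `require_attr` call would sit at the end of `setup`, after the Series has been built.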

Contributor

@jreback jreback left a comment

minor comment on the asv. it's ok if it fails under 0.19, that's pretty far back now

self.c = pd.CategoricalIndex(list('a'*N + 'b'*N + 'c'*N))
self.s = pd.Series(self.c)

def time_categorical_index_is_monotonic(self):
Contributor

these shouldn't be in the same asv, you can do this with params I think
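
A minimal sketch of how the suggestion might be read, using asv's `params` and `param_names` class attributes so that the index and the Series each get their own timing. The class name `IsMonotonic` and the parameter values `"index"` and `"series"` are assumptions made for illustration; the benchmark actually merged may be organized differently.

```python
class IsMonotonic:
    # asv runs setup and every time_* method once per value in params.
    params = ["index", "series"]   # hypothetical parameter values
    param_names = ["container"]    # hypothetical parameter name

    def setup(self, container):
        # Imported lazily so the class can be defined without pandas installed.
        import pandas as pd

        N = 1000
        ci = pd.CategoricalIndex(list("a" * N + "b" * N + "c" * N))
        self.obj = ci if container == "index" else pd.Series(ci)

    def time_is_monotonic_increasing(self, container):
        self.obj.is_monotonic_increasing
```

With this layout the two containers appear as separate rows in the asv results, so a regression in one does not hide behind the other.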

@pep8speaks

pep8speaks commented May 15, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 16, 2018 at 19:04 Hours UTC

@@ -1079,6 +1079,7 @@ Performance Improvements
- Improved performance of :func:`pandas.core.groupby.GroupBy.pct_change` (:issue:`19165`)
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
- Improved performance of ``getattr(Series, attr)`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)
- Improved performance of :meth:`CategoricalIndex.is_monotonic_increasing`, :meth:`CategoricalIndex.is_monotonic_decreasing` and :meth:`CategoricalIndex.is_monotonic` (:issue:`21025`)
Contributor

will need to be in 0.23.1 (not yet in repo, soon)

Contributor Author

Moved to 0.23.1.

@jreback jreback added this to the 0.23.1 milestone May 15, 2018
@jreback jreback merged commit 1ee5ecf into pandas-dev:master May 17, 2018
@jreback
Contributor

jreback commented May 17, 2018

thanks @topper-123

@topper-123 topper-123 deleted the is_monotonic_perf branch May 21, 2018 21:00
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jun 9, 2018
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance

4 participants