PERF: add method Categorical.__contains__ #21508

topper-123 · 2018-06-16T14:48:04Z

closes PERF: __contains__ method for Categorical #21022
xref PERF: Add __contains__ to CategoricalIndex #21369
tests added / passed
benchmark added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Currently, membership checks in Categorical are very slow, as explained by @fjetter in #21022. This PR fixes the issue. See also #21369, which fixed the similar issue for CategoricalIndex.
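For context, one plausible reason for the slowness (an assumption here; the actual analysis is in #21022): a Python class that defines `__iter__` but not `__contains__` answers `x in obj` by walking its elements one by one. The toy classes below are hypothetical and only illustrate that fallback; they are not pandas code.

```python
class IterOnly:
    """Toy container with __iter__ but no __contains__ (hypothetical)."""

    def __init__(self, values):
        self._values = list(values)
        self.steps = 0  # counts how many elements iteration has visited

    def __iter__(self):
        for value in self._values:
            self.steps += 1
            yield value


class WithContains(IterOnly):
    """Same container, but membership is answered by a set lookup."""

    def __init__(self, values):
        super().__init__(values)
        self._unique = set(self._values)

    def __contains__(self, key):
        # `in` calls this directly, so __iter__ is never used.
        return key in self._unique


slow = IterOnly("a" * 1000 + "b")
fast = WithContains("a" * 1000 + "b")

found_slow = "b" in slow  # falls back to __iter__ and scans until it finds "b"
found_fast = "b" in fast  # a single hash lookup
```

After running this, `slow.steps` shows how many elements the fallback had to visit, while `fast.steps` stays at zero.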

Tests didn't exist beforehand and have been added.

ASV:

      before           after         ratio
     [9e982e18]       [28461f0c]
-      4.26±0.7ms         134±20μs     0.03  categoricals.Contains.time_categorical_contains

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

pep8speaks · 2018-06-16T14:48:07Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 20, 2018 at 10:29 Hours UTC

jreback · 2018-06-16T15:10:31Z

pandas/core/arrays/categorical.py

@@ -1847,6 +1847,31 @@ def __iter__(self):
        """Returns an Iterator over the values of this Categorical."""
        return iter(self.get_values().tolist())

+    def __contains__(self, key):
+        """Returns True if `key` is in this Categorical."""
+        hash(key)


Isn't this very similar to what you did in CI (CategoricalIndex)?
Can we unify?

I've made a unified version.

So it looks very similar. Why can't you just use key in self._data for the CI?

codecov · 2018-06-16T18:04:48Z

Codecov Report

Merging #21508 into master will increase coverage by <.01%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master   #21508      +/-   ##
==========================================
+ Coverage   91.91%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49572      +26     
==========================================
+ Hits        45542    45568      +26     
  Misses       4004     4004

Flag	Coverage Δ
#multiple	`90.32% <93.75%> (ø)`	⬆️
#single	`41.8% <18.75%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`97.28% <100%> (+0.18%)`	⬆️
pandas/core/arrays/categorical.py	`95.63% <92.3%> (-0.06%)`	⬇️
pandas/core/frame.py	`97.23% <0%> (ø)`	⬆️
pandas/core/generic.py	`96.12% <0%> (ø)`	⬆️
pandas/core/indexing.py	`93.55% <0%> (+0.14%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2625759...fd89a8a.

gfyoung · 2018-06-17T06:01:36Z

pandas/tests/categorical/test_algos.py

@@ -71,6 +71,23 @@ def test_isin_empty(empty):
    tm.assert_numpy_array_equal(expected, result)


+def test_contains():
+
+    c = pd.Categorical(list('aabbca'), categories=list('cab'))


I know this is not a bug, but can we reference the original issue anyway?

gfyoung · 2018-06-17T06:01:48Z

pandas/tests/categorical/test_algos.py

+    assert 1 not in c
+
+    c = pd.Categorical(list('aabbca') + [np.nan], categories=list('cab'))
+


Nit: unnecessary newline IMO

gfyoung · 2018-06-17T06:03:27Z

pandas/core/arrays/categorical.py

+
+        This is a helper method used in :method:`Categorical.__contains__`
+        and in :class:`CategoricalIndex.__contains__`.
+        """


As this is not a one-line docstring:

The docstring should start on the line underneath the triple quotes.

The first line is a summary and should be only one line.

Add a list of parameters and their dtypes.

Add a return description / type.

gfyoung · 2018-06-17T06:03:52Z

@topper-123 : A bunch of nit-picking things, but overall, this looks pretty good!

topper-123 · 2018-06-17T09:05:57Z

Comments addressed.

gfyoung · 2018-06-17T09:10:12Z

pandas/core/arrays/categorical.py

+        Notes
+        -----
+        This method does not check for Nan values. Do that separately
+        before calling this method.


Nice! Couple of things:

Notes section goes at the bottom

The first line ("Helper for membership check...") should be underneath the triple quotes.

Ok, both done.

But I've never noticed "the first line should be underneath the triple quotes"-rule and it's not widely applied in pandas. Are you sure that's a style rule?

We generally try to follow the numpy-style docstrings, which have a newline as I have requested. You can have a look at their docstrings to get a sense.
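To make the requested layout concrete, here is a sketch of a numpy-style docstring: summary on the line underneath the opening triple quotes, then Parameters, Returns, and Notes at the bottom. The function name, parameters and body are illustrative assumptions, not the code from this PR.

```python
def contains(categories, key, container):
    """
    Check whether `key` is one of the values encoded in `container`.

    Parameters
    ----------
    categories : list
        Unique category labels.
    key : hashable
        The label to look for.
    container : collection of int
        Codes, i.e. positions into `categories`.

    Returns
    -------
    bool
        True if `key` is a category whose code occurs in `container`.

    Notes
    -----
    This function does not check for NaN values. Do that separately
    before calling it.
    """
    hash(key)  # fail early with TypeError on unhashable keys
    return key in categories and categories.index(key) in container
```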

gfyoung

LGTM!

cc @jreback

gfyoung · 2018-06-17T09:25:43Z

doc/source/whatsnew/v0.23.2.txt

@@ -27,6 +27,9 @@ Performance Improvements
 - Improved performance of membership checks in :class:`CategoricalIndex`
  (i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains`
  is likewise much faster (:issue:`21369`)
+- Improved performance of membership checks in :class:`Categorical`
+  (i.e. ``x in categorical``-style checks are much faster) (:issue:`21369`)
+


How come there is a newline here?

Sorry, I LGTM'ed and then realized I wanted to make one more comment 😂

Premature approval

topper-123 · 2018-06-17T09:36:41Z

Ok, won't hold you up to that approval, fixed :-)

gfyoung

Appreciate it 😄 . LGTM!

cc @jreback

jreback · 2018-06-18T10:30:47Z

doc/source/whatsnew/v0.23.2.txt

@@ -27,6 +27,8 @@ Performance Improvements
 - Improved performance of membership checks in :class:`CategoricalIndex`
  (i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains`
  is likewise much faster (:issue:`21369`)
+- Improved performance of membership checks in :class:`Categorical`


can you make this reference this PR number (or just combine the what's new entries into one line, either way)

jreback · 2018-06-18T10:33:33Z

pandas/core/indexes/category.py

@@ -328,23 +327,8 @@ def __contains__(self, key):
        if isna(key):  # if key is a NaN, check if any NaN is in self.
            return self.hasnans

-        # is key in self.categories? Then get its location.


why can't you just do `return key in self._data`?

That would give the same result, but _engine gives cached results, so that makes it much faster. Indexes should return cached results when possible, while categorical is mutable, so can't return cached values, requiring a different implementation.

jreback · 2018-06-18T10:34:11Z

pandas/core/arrays/categorical.py

@@ -1847,6 +1847,67 @@ def __iter__(self):
        """Returns an Iterator over the values of this Categorical."""
        return iter(self.get_values().tolist())

+    @staticmethod
+    def _contains(key, categories, container):


this seems pretty convoluted, see my comment below

hmm, I find it difficult to maintain commonality between the two __contains__ methods without something similar to this... Maybe cut down on comments, now that there is a doc string? Do you have a suggestion?

I've made a new proposal. This makes contains a common function. Perhaps clearer as a function rather than a method that this is shared functionality?

It's not convoluted, in the sense that the index version can look in _engine for greater speed, while the Categorical can't do that and so must look in _codes; otherwise it's the same.
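The shape of that proposal can be sketched in plain Python. This is a toy model under stated assumptions, not the pandas implementation: `ToyCategorical`, `ToyCategoricalIndex`, `is_nan` and the list-based `contains` are hypothetical names, categories are a plain list, codes are integers with -1 standing for NaN, and a `frozenset` plays the role of the cached `_engine`.

```python
import math


def is_nan(key):
    """Return True only for float NaN (a simplified stand-in for isna)."""
    return isinstance(key, float) and math.isnan(key)


def contains(categories, key, container):
    """Shared check: is `key` a category whose code occurs in `container`?"""
    hash(key)  # raise TypeError early for unhashable keys
    try:
        loc = categories.index(key)
    except ValueError:
        # key is not a category at all, so it cannot be a value.
        return False
    # key is a category; it is a member only if its code is actually used.
    return loc in container


class ToyCategorical:
    """Mutable array: must look at the current codes on every check."""

    def __init__(self, values, categories):
        self.categories = list(categories)
        self.codes = [-1 if is_nan(v) else self.categories.index(v)
                      for v in values]

    def __contains__(self, key):
        # NaN is handled here, before the shared helper is called.
        if is_nan(key):
            return -1 in self.codes
        return contains(self.categories, key, container=self.codes)


class ToyCategoricalIndex(ToyCategorical):
    """Immutable index: may cache a hashed view of its codes."""

    def __init__(self, values, categories):
        super().__init__(values, categories)
        self._engine = frozenset(self.codes)  # stand-in for the cached engine

    def __contains__(self, key):
        if is_nan(key):
            return -1 in self._engine
        return contains(self.categories, key, container=self._engine)


c = ToyCategorical(list("aabbca"), categories=list("cab"))
ci = ToyCategoricalIndex(list("aabbca") + [float("nan")], categories=list("cab"))
```

Both classes share the lookup logic and differ only in the container they pass in, which is the point of making the helper a standalone function.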

jreback · 2018-06-18T21:46:56Z

pandas/core/arrays/categorical.py

@@ -1847,6 +1847,31 @@ def __iter__(self):
        """Returns an Iterator over the values of this Categorical."""
        return iter(self.get_values().tolist())

+    def __contains__(self, key):
+        """Returns True if `key` is in this Categorical."""
+        hash(key)


So it looks very similar. Why can't you just use key in self._data for the CI?

topper-123 · 2018-06-19T08:38:44Z

@jreback , using ._data would be slower, as it doesn't use the hashed value in ._engine:

>>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a'*n +'a' + 'b'*n + 'c'*n))
>>> %timeit ci.categories.get_loc('b') in ci._engine
1.6 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit 'b' in ci._data
183 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So there needs to be different implementations.

jreback

this ok. some comments. ping when green.

jreback · 2018-06-19T10:56:24Z

pandas/core/arrays/categorical.py

+
+    Notes
+    -----
+    This method does not check for Nan values. Do that separately


Nan -> NaN.

jreback · 2018-06-19T10:57:49Z

pandas/core/arrays/categorical.py

@@ -1847,6 +1896,15 @@ def __iter__(self):
        """Returns an Iterator over the values of this Categorical."""
        return iter(self.get_values().tolist())

+    def __contains__(self, key):
+        """Returns True if `key` is in this Categorical."""
+        hash(key)


I think your hash check needs to go into contains itself (null check is ok where it is).

can you add a test on something that isn't hashable as well
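A test for the unhashable case could look roughly like the sketch below. It uses a hypothetical stand-in `contains` over plain lists and a small try/except helper instead of pytest, so it does not reflect the actual test added in this PR.

```python
def contains(categories, key, codes):
    """Toy membership check with the hash guard inside the helper itself."""
    hash(key)  # raises TypeError for unhashable keys such as lists
    if key not in categories:
        return False
    return categories.index(key) in codes


def raises_type_error(func, *args):
    """Return True if calling func(*args) raises TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


categories = ["c", "a", "b"]
codes = [1, 1, 2, 2, 0, 1]

unhashable_rejected = raises_type_error(contains, categories, ["a"], codes)
hashable_ok = contains(categories, "a", codes)
```

Without the `hash(key)` line, the list key would simply compare unequal to every category and the check would quietly return False, which is why the guard belongs in the shared helper.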

jreback · 2018-06-19T10:58:27Z

pandas/core/arrays/categorical.py

+        """Returns True if `key` is in this Categorical."""
+        hash(key)
+
+        if isna(key):  # if key is a NaN, check if any NaN is in self.


can you put comments on the line above

jreback · 2018-06-19T11:00:00Z

pandas/tests/categorical/test_algos.py

@@ -71,6 +71,22 @@ def test_isin_empty(empty):
    tm.assert_numpy_array_equal(expected, result)


+def test_contains():


move to test_indexing, should be other checks already there, de-duplicate with those.

I looked at test_indexing, didn't find anything very similar. I moved it to test_operators, which is more similar, IMO. Is that ok?

jorisvandenbossche

What is the reason you changed the private ._contains to public contains? (it's not needed for the implementation?)
It's just that I am not sure it is worth adding a new public method.

topper-123 · 2018-06-19T16:42:29Z

@jorisvandenbossche, to my understanding these modules are themselves private and it’s not necessary for functions in private modules to be marked as private - that would be superfluous.

When it was a method on Categorical, it was needed to mark it private, as Categorical is a public class.

topper-123 · 2018-06-19T18:55:39Z

The Travis error is an HTTP error:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/linux-64/qt-5.9.5-h7e424d6_0.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
real	3m14.046s
user	1m7.620s
sys	0m3.084s
The command "ci/install_travis.sh" failed and exited with 1 during .
Your build has been stopped.

so the Travis failure has nothing to do with this PR.

jorisvandenbossche · 2018-06-19T21:58:00Z

> @jorisvandenbossche, to my understanding these modules are themselves private and it’s not necessary for functions in private modules to be marked as private - that would be superfluous.

@topper-123 whoops, sorry, only looked very briefly at the incremental diff and missed that it was a function and not a method.

jreback · 2018-06-20T10:29:52Z

thanks @topper-123

topper-123 force-pushed the Categorical.__contains__ branch from 1ecce70 to 05e3680 Compare June 16, 2018 14:54

jreback requested changes Jun 16, 2018

topper-123 force-pushed the Categorical.__contains__ branch 3 times, most recently from b7f49c2 to 3eb49bb Compare June 16, 2018 17:10

gfyoung added Performance Memory or execution speed performance Categorical Categorical Data Type labels Jun 17, 2018

gfyoung reviewed Jun 17, 2018

topper-123 force-pushed the Categorical.__contains__ branch from 7533c6f to a03c92f Compare June 17, 2018 09:15

gfyoung previously approved these changes Jun 17, 2018

gfyoung reviewed Jun 17, 2018

topper-123 force-pushed the Categorical.__contains__ branch from a03c92f to b685279 Compare June 17, 2018 09:34

gfyoung approved these changes Jun 17, 2018

topper-123 force-pushed the Categorical.__contains__ branch from b685279 to 07dd41c Compare June 17, 2018 10:46

jreback requested changes Jun 18, 2018

jorisvandenbossche added this to the 0.24.0 milestone Jun 18, 2018

topper-123 force-pushed the Categorical.__contains__ branch 3 times, most recently from c3ff8d7 to 77386f1 Compare June 18, 2018 14:25

jreback requested changes Jun 18, 2018

jreback mentioned this pull request Jun 19, 2018

PERF: __contains__ method for Categorical #21022

Closed

4 tasks

jreback requested changes Jun 19, 2018

jreback modified the milestones: 0.24.0, 0.23.2 Jun 19, 2018

jreback added the Needs Backport label Jun 19, 2018

tp added 4 commits June 19, 2018 16:56

add Categorical.__contains__

b44614b

Improve doc string of Categorical._contains

f5fd77c

Reimplenent Categorical._contains

913790d

changed according to comments

b0f12ec

topper-123 force-pushed the Categorical.__contains__ branch from 89e92da to b0f12ec Compare June 19, 2018 16:00

jorisvandenbossche reviewed Jun 19, 2018

jorisvandenbossche modified the milestones: 0.23.2, 0.24.0 Jun 19, 2018

jreback modified the milestones: 0.24.0, 0.23.2 Jun 20, 2018

jreback added 2 commits June 20, 2018 06:28

Merge branch 'master' into PR_TOOL_MERGE_PR_21508

6d3de32

doc

fd89a8a

jreback approved these changes Jun 20, 2018

jreback merged commit 89874d3 into pandas-dev:master Jun 20, 2018

topper-123 deleted the Categorical.__contains__ branch June 20, 2018 11:31

This was referenced Jun 24, 2018

CLN: make CategoricalIndex._create_categorical a classmethod #21618

Merged

DOC: minor correction to v0.23.2.txt #21644

Merged

jorisvandenbossche removed the Needs Backport label Jul 2, 2018

jorisvandenbossche modified the milestones: 0.23.2, 0.24.0 Jul 2, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

PERF: add method Categorical.__contains__ (pandas-dev#21508)

cc51575
