Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: add method Categorical.__contains__ #21508

Merged
merged 6 commits into from
Jun 20, 2018

Conversation

topper-123
Copy link
Contributor

Currently, membership checks in Categorical is very slow as explained by @fjetter in #21022. This PR fixes the issue. See also #21369 which fixed the similar issue for CategoricalIndex.

Tests didn't exist beforehand and have been added.

ASV:

      before           after         ratio
     [9e982e18]       [28461f0c]
-      4.26±0.7ms         134±20μs     0.03  categoricals.Contains.time_categorical_contains

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@pep8speaks
Copy link

pep8speaks commented Jun 16, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 20, 2018 at 10:29 Hours UTC

@@ -1847,6 +1847,31 @@ def __iter__(self):
"""Returns an Iterator over the values of this Categorical."""
return iter(self.get_values().tolist())

def __contains__(self, key):
"""Returns True if `key` is in this Categorical."""
hash(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn’t this very similar to what u did in CI ?
can we unify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a unified version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it looks very similiar. why cannot you just use key in self._data for the CI

@topper-123 topper-123 force-pushed the Categorical.__contains__ branch 3 times, most recently from b7f49c2 to 3eb49bb Compare June 16, 2018 17:10
@codecov
Copy link

codecov bot commented Jun 16, 2018

Codecov Report

Merging #21508 into master will increase coverage by <.01%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21508      +/-   ##
==========================================
+ Coverage   91.91%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49572      +26     
==========================================
+ Hits        45542    45568      +26     
  Misses       4004     4004
Flag Coverage Δ
#multiple 90.32% <93.75%> (ø) ⬆️
#single 41.8% <18.75%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.28% <100%> (+0.18%) ⬆️
pandas/core/arrays/categorical.py 95.63% <92.3%> (-0.06%) ⬇️
pandas/core/frame.py 97.23% <0%> (ø) ⬆️
pandas/core/generic.py 96.12% <0%> (ø) ⬆️
pandas/core/indexing.py 93.55% <0%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2625759...fd89a8a. Read the comment docs.

@gfyoung gfyoung added Performance Memory or execution speed performance Categorical Categorical Data Type labels Jun 17, 2018
@@ -71,6 +71,23 @@ def test_isin_empty(empty):
tm.assert_numpy_array_equal(expected, result)


def test_contains():

c = pd.Categorical(list('aabbca'), categories=list('cab'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not a bug, but can we reference the original issue anyway?

assert 1 not in c

c = pd.Categorical(list('aabbca') + [np.nan], categories=list('cab'))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary newline IMO


This is a helper method used in :method:`Categorical.__contains__`
and in :class:`CategoricalIndex.__contains__`.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is not a one-liner doc-string:

  • The docstring should start underneath the triple parentheses.
  • The first line is a summary should only be one line
  • List of parameters and their dtypes
  • Return description / type

@gfyoung
Copy link
Member

gfyoung commented Jun 17, 2018

@topper-123 : A bunch of nit-picking things, but overall, this looks pretty good!

@topper-123
Copy link
Contributor Author

Comments addressed.

Notes
-----
This method does not check for Nan values. Do that separately
before calling this method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Couple of things:

  • Notes section goes at the bottom
  • The first line ("Helper for membership check...") should be underneath the triple quotes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, both done.

But I've never noticed "the first line should be underneath the triple quotes"-rule and it's not widely applied in pandas. Are you sure that's a style rule?

Copy link
Member

@gfyoung gfyoung Jun 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally try to follow the numpy style-docstrings, which have a newline as I have requested. You can have a look at their docstrings to a get a sense.

gfyoung
gfyoung previously approved these changes Jun 17, 2018
Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

cc @jreback

@@ -27,6 +27,9 @@ Performance Improvements
- Improved performance of membership checks in :class:`CategoricalIndex`
(i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains`
is likewise much faster (:issue:`21369`)
- Improved performance of membership checks in :class:`Categorical`
(i.e. ``x in categorical``-style checks are much faster) (:issue:`21369`)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come there is a newline here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I LGTM'ed and then realized I wanted to make one more comment 😂

@gfyoung gfyoung dismissed their stale review June 17, 2018 09:26

Premature approval

@topper-123
Copy link
Contributor Author

Ok, won't hold you up to that approval, fixed :-)

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate it 😄 . LGTM!

cc @jreback

@@ -27,6 +27,8 @@ Performance Improvements
- Improved performance of membership checks in :class:`CategoricalIndex`
(i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains`
is likewise much faster (:issue:`21369`)
- Improved performance of membership checks in :class:`Categorical`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this refernce this PR number (or just coming the what's new into 1 line either way)

@@ -328,23 +327,8 @@ def __contains__(self, key):
if isna(key): # if key is a NaN, check if any NaN is in self.
return self.hasnans

# is key in self.categories? Then get its location.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you just do
```return key in self._data``?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would give the same result, but _engine gives cached results, so that makes it much faster. Indexes should return cached results when possible, while categorical is mutable, so can't return cached values, requiring a different implementation.

@@ -1847,6 +1847,67 @@ def __iter__(self):
"""Returns an Iterator over the values of this Categorical."""
return iter(self.get_values().tolist())

@staticmethod
def _contains(key, categories, container):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems pretty convoluted, see my comment below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I find it difficult to maintain commonality between the two __contains__ methods without something similar to this... Maybe cut down on comments, now that there is a doc string? Do you have a suggestion?

Copy link
Contributor Author

@topper-123 topper-123 Jun 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a new proposal. This makes contains a common function. Perhaps clearer as a function rather than a method that this is shared functionality?

It's not convoluted in the sense, that the index version can look in _engine for greater speed, while the Categorical can't do that and so must look in _codes, otherwise it's the same.

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jun 18, 2018
@topper-123 topper-123 force-pushed the Categorical.__contains__ branch 3 times, most recently from c3ff8d7 to 77386f1 Compare June 18, 2018 14:25
@@ -1847,6 +1847,31 @@ def __iter__(self):
"""Returns an Iterator over the values of this Categorical."""
return iter(self.get_values().tolist())

def __contains__(self, key):
"""Returns True if `key` is in this Categorical."""
hash(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it looks very similiar. why cannot you just use key in self._data for the CI

@topper-123
Copy link
Contributor Author

topper-123 commented Jun 19, 2018

@jreback , using ._data would be slower, as it doesn't use the hashed value in ._engine:

>>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a'*n +'a' + 'b'*n + 'c'*n))
>>> %timeit ci.categories.get_loc('b') in ci._engine
1.6 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit 'b' in ci._data
183 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So there needs to be different implementations.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ok. some comments. ping when green.


Notes
-----
This method does not check for Nan values. Do that separately
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nan -> NaN.

@@ -1847,6 +1896,15 @@ def __iter__(self):
"""Returns an Iterator over the values of this Categorical."""
return iter(self.get_values().tolist())

def __contains__(self, key):
"""Returns True if `key` is in this Categorical."""
hash(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your hash check needs to go into contains itself (null check is ok where it is).

can you add a test on something that isn't hashable as well

"""Returns True if `key` is in this Categorical."""
hash(key)

if isna(key): # if key is a NaN, check if any NaN is in self.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put comments on the line above

@@ -71,6 +71,22 @@ def test_isin_empty(empty):
tm.assert_numpy_array_equal(expected, result)


def test_contains():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to test_indexing, should be other checks already there, de-duplicate with those.

Copy link
Contributor Author

@topper-123 topper-123 Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at test_indexing, didn't find anything very similar. I moved it to test_operators, which is more similar, IMO. Is that ok?

@jreback jreback modified the milestones: 0.24.0, 0.23.2 Jun 19, 2018
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason you changed the private ._contains to public contains? (it's not needed for the implementation?)
It's just that I am not sure it is worth adding a new public method.

@topper-123
Copy link
Contributor Author

@jorisvandenbossche, to my understanding these modules are themselves private and it’s not necessary for functions in private modules to be marked as private - that would be superfluous.

When it was a method on Categorical, it was needed to mark it private, as Categorical is a public class.

@topper-123
Copy link
Contributor Author

The Travis error is an HTTP error:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/linux-64/qt-5.9.5-h7e424d6_0.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
real	3m14.046s
user	1m7.620s
sys	0m3.084s
The command "ci/install_travis.sh" failed and exited with 1 during .
Your build has been stopped.

so the Travis failure has nothing to do with this PR.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 19, 2018

@jorisvandenbossche, to my understanding these modules are themselves private and it’s not necessary for functions in private modules to be marked as private - that would be superfluous.

@topper-123 whoops, sorry, only looked very briefly at the incremental diff and missed that it was a function and not a method.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.2, 0.24.0 Jun 19, 2018
@jreback jreback modified the milestones: 0.24.0, 0.23.2 Jun 20, 2018
@jreback jreback merged commit 89874d3 into pandas-dev:master Jun 20, 2018
@jreback
Copy link
Contributor

jreback commented Jun 20, 2018

thanks @topper-123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants