Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fixed .str.contains(..., na=False) for categorical series #22170

Merged
merged 36 commits into from Nov 20, 2018

Conversation

pulkitmaloo
Copy link
Contributor

@pulkitmaloo pulkitmaloo commented Aug 2, 2018

@datapythonista datapythonista added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Categorical Categorical Data Type labels Aug 2, 2018
@TomAugspurger
Copy link
Contributor

Failure is at https://travis-ci.org/pandas-dev/pandas/jobs/411345500#L2287-L2289

Basically, you're fix only works if there are missing values.

This may fix it, but may break other things

diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 8097ed196..06fd1391e 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -1855,7 +1855,12 @@ class StringMethods(NoNewAttributesMixin):
         # before the transformation...
         if use_codes and self._is_categorical:
             result = take_1d(result, self._orig.cat.codes)
-            result[isna(result)] = na
+            missing = isna(result)
+
+            if missing.any():
+                result_type = np.result_type(result, na)
+                result = result.astype(result_type, copy=False)
+                result[isna(result)] = na
 
         if not hasattr(result, 'ndim') or not hasattr(result, 'dtype'):
             return result

@pep8speaks
Copy link

pep8speaks commented Aug 17, 2018

Hello @pulkitmaloo! Thanks for updating the PR.

Comment last updated on September 24, 2018 at 01:15 Hours UTC

@TomAugspurger
Copy link
Contributor

Looks like a maybe unrelated CI failure. Can you merge master and push again?

@TomAugspurger
Copy link
Contributor

And can you ensure (and test) that na is handled correctly for all the string functions? It looks like at least match isn't handled correctly.

@codecov
Copy link

codecov bot commented Aug 24, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@960a73f). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #22170   +/-   ##
=========================================
  Coverage          ?   92.19%           
=========================================
  Files             ?      169           
  Lines             ?    50950           
  Branches          ?        0           
=========================================
  Hits              ?    46975           
  Misses            ?     3975           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.62% <100%> (?)
#single 42.27% <0%> (?)
Impacted Files Coverage Δ
pandas/core/strings.py 98.58% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 960a73f...44b36a4. Read the comment docs.

@@ -7,7 +7,7 @@ dependencies:
- cython=0.28.2
- jinja2=2.8
- numexpr=2.4.4 # we test that we correctly don't use an unsupported numexpr
- numpy=1.9.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you revert this (and the rest of the .yaml)

pandas/core/strings.py Outdated Show resolved Hide resolved
pandas/tests/test_strings.py Outdated Show resolved Hide resolved
@pulkitmaloo
Copy link
Contributor Author

@jreback Thanks for the feedback, I have updated the PR as you requested. However, the changes in yaml files leads to circleci failure.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

@pulkitmaloo you need to rebase on master

@jreback jreback added this to the 0.24.0 milestone Sep 25, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, can you add a whatsnew note in bug fixes

@@ -704,7 +704,7 @@ Numeric
Strings
^^^^^^^

-
- Bug in :class:`StringMethods` where `str_contains()` was not filling missing values with given argument for na for a categorical Series (:issue:`22158`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use double backticks on Series, can you use the actual reference here, e.g. :func:`Series.str.contains`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Thanks, I've updated it. Please let me know if there's anything else you'd like to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Can you please review and merge this PR.

Co-Authored-By: pulkitmaloo <maloo.pulkit@gmail.com>
values = Series(["a", "b", "c", "a", np.nan], dtype="category")
result = values.str.contains('a', na=True).astype(object)
expected = Series([True, False, False, True, True], dtype=np.object_)
assert isinstance(result, Series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed. The assert_series_equal checks that.

assert res.loc[2] == "foo"
# na for category
values = Series(["a", "b", "c", "a", np.nan], dtype="category")
result = values.str.contains('a', na=True).astype(object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the .astype(object). What's the dtype when you don't do that? Categorical? If so, I'd rather we document that as part of the API and test it.


# na for objects
values = Series(["a", "b", "c", "a", np.nan])
result = values.str.contains('a', na=True).astype(object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for the astype object.

@TomAugspurger
Copy link
Contributor

@pulkitmaloo are you able to update?

@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

@TomAugspurger should be fixed up.

@jreback jreback merged commit 2af56d4 into pandas-dev:master Nov 20, 2018
@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

thanks @pulkitmaloo

thoo added a commit to thoo/pandas that referenced this pull request Nov 20, 2018
…fixed

* upstream/master:
  DOC: more consistent flake8-commands in contributing.rst (pandas-dev#23724)
  DOC: Fixed the doctsring for _set_axis_name (GH 22895) (pandas-dev#22969)
  DOC: Improve GL03 message re: blank lines at end of docstrings. (pandas-dev#23649)
  TST: add tests for keeping dtype in Series.update (pandas-dev#23604)
  TST: For GH4861, Period and datetime in multiindex (pandas-dev#23776)
  TST: move .str-test to strings.py & parametrize it; precursor to pandas-dev#23582 (pandas-dev#23777)
  STY: isort tests/scalar, tests/tslibs, import libwindow instead of _window (pandas-dev#23787)
  BUG: fixed .str.contains(..., na=False) for categorical series (pandas-dev#22170)
  BUG: Maintain column order with groupby.nth (pandas-dev#22811)
  API/DEPR: replace kwarg "pat" with "sep" in str.[r]partition (pandas-dev#23767)
  CLN: Finish isort core (pandas-dev#23765)
  TST: Mark test_pct_max_many_rows with pytest.mark.single (pandas-dev#23799)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.str.contains(..., na=False) consistency between categorical and object series
5 participants