New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Accept dict or Series in fillna for categorical Series #18293

Merged
merged 11 commits into from Nov 22, 2017

Conversation

Projects
None yet
4 participants
@reidy-p
Contributor

reidy-p commented Nov 14, 2017

  • closes #17033
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@pep8speaks

This comment has been minimized.

pep8speaks commented Nov 14, 2017

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 20, 2017 at 20:06 Hours UTC
# GH 17033
# Test fillna for a Categorical series
data = ['a', np.nan, 'b', np.nan, np.nan]
s = pd.Series(pd.Categorical(data, categories=['a', 'b']))

This comment has been minimized.

@jschendel

jschendel Nov 14, 2017

Member

Series and Categorical have already been imported so you can remove the pd. throughout the tests you added. Instances of this in existing tests should already be covered by #18277.

This comment has been minimized.

@reidy-p

reidy-p Nov 15, 2017

Contributor

Thanks, I'll change that.

@codecov

This comment has been minimized.

codecov bot commented Nov 15, 2017

Codecov Report

Merging #18293 into master will decrease coverage by 0.01%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18293      +/-   ##
==========================================
- Coverage   91.36%   91.34%   -0.02%     
==========================================
  Files         164      164              
  Lines       49721    49729       +8     
==========================================
- Hits        45429    45427       -2     
- Misses       4292     4302      +10
Flag Coverage Δ
#multiple 89.14% <94.11%> (-0.01%) ⬇️
#single 39.6% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.73% <100%> (ø) ⬆️
pandas/core/categorical.py 95.66% <93.75%> (-0.09%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d04daf...c484f49. Read the comment docs.

@jreback

some comments. whatsnew for 0.22 (you can put in other enhancements)

@@ -1660,16 +1660,26 @@ def fillna(self, value=None, method=None, limit=None):
else:
if not isna(value) and value not in self.categories:
raise ValueError("fill value must be in categories")
if isinstance(value, ABCSeries):

This comment has been minimized.

@jreback

jreback Nov 15, 2017

Contributor

this could be a dict as well (which you can simply convert to a series)

This comment has been minimized.

@reidy-p

reidy-p Nov 15, 2017

Contributor

A dict has already been converted to a Series in the fillna function in pandas/core/generic.py.

So, as far as I can tell, by the time we reach the fillna function in pandas/core/categorical.py the value is either:

  1. A Series (if the user passed a Series or dict)
  2. A scalar (if the user passed a scalar)

This comment has been minimized.

@jreback

jreback Nov 16, 2017

Contributor

ok, can you indicate that in a comment

This comment has been minimized.

@jreback

jreback Nov 17, 2017

Contributor

can you add a comment here? otherwise lgtm.

if not isna(value) and value not in self.categories:
raise ValueError("fill value must be in categories")
mask = values == -1

This comment has been minimized.

@jreback

jreback Nov 15, 2017

Contributor

you might be able to share code with the array case (basically convert the scalar to an array)

This comment has been minimized.

@reidy-p

reidy-p Nov 15, 2017

Contributor

Yeah, I was thinking about that but couldn't get it to work properly. The problem I was having was that if the user passes a scalar all the NaNs will be filled with that single scalar. But if the user passes a Series (or a dict which is then converted to a Series in the function) the NaNs will be filled with different values according to the index values:

In [1]: data = ['a', np.nan, 'b', np.nan, np.nan]
In [2]: s = pd.Series(pd.Categorical(data, categories=['a', 'b']))

In [3]: s.fillna('a')
Out[3]: 
0    a
1    a
2    b
3    a
4    a
dtype: category
Categories (2, object): [a, b]

In [4]: s.fillna(pd.Series(['a', 'b', 'b'], index=[1, 3, 4])
Out[4]:
0    a
1    a
2    b
3    b
4    b
dtype: category
Categories (2, object): [a, b]

And I was finding it difficult to deal with both cases with the same code. But I can keep thinking about it.

def test_fillna_series_categorical_errormsg(self):
data = ['a', np.nan, 'b', np.nan, np.nan]
s = pd.Series(pd.Categorical(data, categories=['a', 'b']))

This comment has been minimized.

@jreback

jreback Nov 15, 2017

Contributor

these tests can all go in pandas/tests/series/test_missing.py. also pls move any related fillna tests that are directly on Series/DataFrame as well (to tests/dataframe/test_missing.py), if they are only on Categorical itself then leave here.

This comment has been minimized.

@reidy-p

reidy-p Nov 15, 2017

Contributor

Moved the new tests relating to Series to pandas/tests/series/test_missing.py and moved the existing test_fillna from pandas/tests/test_categorical.py to pandas/tests/frame/test_missing.py

@@ -24,7 +24,7 @@ Other Enhancements
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`)
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
-
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` (:issue:`17033`)

This comment has been minimized.

@jreback

jreback Nov 16, 2017

Contributor

for a categorical dtype

mask = values == -1
if mask.any():
values = values.copy()
if isna(value):

This comment has been minimized.

@jreback

jreback Nov 16, 2017

Contributor

you can simplify this to

values[mask] = self.categories.get_indexer([value])[0]

This comment has been minimized.

@reidy-p

reidy-p Nov 16, 2017

Contributor

get_indexer seems to cause problems with PeriodIndex. When I replace the code with your suggestion:

In [1]: idx = pd.PeriodIndex(['2011-01', '2011-01', pd.NaT], freq='M')
In [2]: df = pd.DataFrame({'a': pd.Categorical(idx)})

In [3]: df.fillna(value=pd.NaT))
Out[3]:  
pandas._libs.period.IncompatibleFrequency: Input has different freq=None from PeriodIndex(freq=M)

This comment has been minimized.

@jreback

jreback Nov 17, 2017

Contributor

hmm, this my be a known bug

@@ -270,6 +270,80 @@ def test_fillna(self):
pd.Timestamp('2012-11-11 00:00:00+01:00')]})
assert_frame_equal(df.fillna(method='bfill'), exp)
def test_na_actions(self):

This comment has been minimized.

@jreback

jreback Nov 16, 2017

Contributor

can you add the issue number as a comment

def f():
df.fillna(value={"cats": 4, "vals": "c"})

This comment has been minimized.

@jreback

jreback Nov 16, 2017

Contributor

this test is way too long. can you parametrize it? would make it simpler. could also break it into a couple of tests, just use descriptive names.

This comment has been minimized.

@jreback

jreback Nov 16, 2017

Contributor

haha, realized you are just moving this test! ok see what you can do here.

# If value is not a dict or Series it should be a scalar
else:
if not isna(value) and value not in self.categories:

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

you need to test if its a is_scalar before you use isna, because if its not then this will raise a different error (add a test for that as well). (you can make this an elif which is_scalar is True and raise in an else

This comment has been minimized.

@reidy-p

reidy-p Nov 19, 2017

Contributor

Thanks, that's a good idea.

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 19, 2017

pls rebase as well (generally anytime you are pushing you should)

else:
values[mask] = self.categories.get_loc(value)
else:
raise TypeError('"value" parameter must be a scalar, dict '

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

do we have testing to hit this?

This comment has been minimized.

@reidy-p

reidy-p Nov 19, 2017

Contributor

The problem is that I can't think of an example where we would actually hit this specific line. If the user passed a list, tuple, or DataFrame as a value in fillna this will be caught by fillna in generic.py rather than in categorical.py with an identical error message to the one here. So in some ways this exception is kind of redundant because all of the work should be done in generic.py. But it might be useful to keep it anyway.

I have included tests in test_fillna_categorical_raise() in tests/series/test_missing.py to check for the TypeError when the user passes a list, tuple, or DataFrame fill value. But the tests aren't parametrized.

raise ValueError("invalid fill value with a %s" %
type(value))
raise TypeError('"value" parameter must be a scalar, dict '
'or Series, but you passed a '

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

ideally we would have a parametriezed test that hits this (with multiple invalid things that should raise)

This comment has been minimized.

@reidy-p

reidy-p Nov 19, 2017

Contributor

Yeah, as I said above, I have included tests in test_fillna_categorical_raise() in tests/series/test_missing.py to check for the TypeError when the user passes a list, tuple, or DataFrame fill value. But the tests aren't parametrized.

@@ -26,6 +26,7 @@ Other Enhancements
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`)
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`)
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` (:issue:`17033`)

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

can you add another note in api breaking changes. note that the previous exception was a ValueError, now its a TypeError (good change). use this PR number for that note.

@jreback

minor change, rebase

.. _whatsnew_0220.api_breaking:
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- :func:`Series.fillna` now raises a ``TypeError`` instead of a ``ValueError`` when passed a list, tuple or DataFrame as a ``value`` (`PR18293 <https://github.com/pandas-dev/pandas/pull/18293>`__)

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

just use the PR number and use issue (it works0

@jreback jreback added this to the 0.22.0 milestone Nov 22, 2017

@jreback jreback merged commit 103ea6f into pandas-dev:master Nov 22, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Contributor

jreback commented Nov 22, 2017

thanks @reidy-p

@reidy-p reidy-p deleted the reidy-p:fillna_cat_series branch Nov 27, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment