Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix categorical from codes nan 21767 #21775

Merged

Conversation

miker985
Copy link
Contributor

@miker985 miker985 commented Jul 6, 2018

@@ -628,6 +628,8 @@ def from_codes(cls, codes, categories, ordered=False):
categorical. If not given, the resulting categorical will be
unordered.
"""
if isna(codes).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the array type, rather than this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_integer_dtype

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Categorical Categorical Data Type labels Jul 6, 2018
@jreback jreback added this to the 0.24.0 milestone Jul 6, 2018
@jreback
Copy link
Contributor

jreback commented Jul 6, 2018

lgtm. ping on green.

@TomAugspurger
Copy link
Contributor

I would slightly prefer to warn for a version when people provide floats. The current changes breaks Categorical.from_codes([1.0, 0.0], ['a', 'b']), which is incorrect, but previously worked.

We can consider checking whether np.all(codes == codes.astype(int)), to detect cases like np.nan.

@codecov
Copy link

codecov bot commented Jul 7, 2018

Codecov Report

Merging #21775 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21775      +/-   ##
==========================================
- Coverage   92.07%   92.06%   -0.01%     
==========================================
  Files         170      170              
  Lines       50688    50700      +12     
==========================================
+ Hits        46671    46678       +7     
- Misses       4017     4022       +5
Flag Coverage Δ
#multiple 90.47% <100%> (-0.01%) ⬇️
#single 42.28% <25%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.75% <100%> (-0.21%) ⬇️
pandas/core/dtypes/common.py 94.87% <0%> (-0.34%) ⬇️
pandas/util/testing.py 85.69% <0%> (-0.21%) ⬇️
pandas/core/arrays/datetimes.py 95.44% <0%> (-0.03%) ⬇️
pandas/core/dtypes/dtypes.py 96.05% <0%> (+0.02%) ⬆️
pandas/core/indexes/period.py 93.5% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26b3e7d...b8d4b83. Read the comment docs.

@@ -628,8 +629,11 @@ def from_codes(cls, codes, categories, ordered=False):
categorical. If not given, the resulting categorical will be
unordered.
"""
codes = np.asarray(codes)
if not is_integer_dtype(codes):
raise ValueError("codes need to be array-like integers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so can you check for float dtype here and issue a warning, something like

if is_float_dtype(codes):
    icodes = codes.astype('i8')
    if (icodes != codes).all():
       raise ValueError("not integer-like float codes are invalid")
    warn.warning("float codes will be disallowed in the future", FutureWarning)
    codes = icodes

and tests for same.

@pep8speaks
Copy link

pep8speaks commented Jul 9, 2018

Hello @miker985! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 31, 2018 at 14:49 Hours UTC

@miker985 miker985 force-pushed the fix_categorical_from_codes_nan_21767 branch 2 times, most recently from be7b765 to 7b42fa4 Compare July 9, 2018 15:05
FutureWarning)
else:
err = True
if err:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of using err, you can create the message say around line 635 like
msg = ......

and then just raise ValueError(msg) where needed, much easier to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        if not is_integer_dtype(codes):
            msg = "codes need to be array-like integers"
            if is_float_dtype(codes):
                icodes = codes.astype('i8')
                if (icodes == codes).all():
                    err = False
                    codes = icodes
                    warn("float codes will be disallowed in the future",
                         FutureWarning)
                else:
                    raise ValueError(msg)
            else:
                raise ValueError(msg)

is that what you want? That doesn't read easier to me, but I can change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the msg way to the err way as well.

# GH21767
codes = [1.0, 2.0, 0]
categories = ['a', 'b', 'c']
with warnings.catch_warnings(record=True) as w:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_produces_warning here.

you may also have to change some tests which inadvertantly are testing using float codes, so pls change those (or you can assert_produces_warning there too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to change to assert_produces_warning

No tests display as having warnings when I run pytest pandas/tests/categorical/test_constructors.py

Please tell me what you want me to fix, because I see nothing obvious

-
-
-
- Bug in :meth:`Categorical.from_codes` where `NaN` values in `codes` were silently converted to `0` (:issue:`21767`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double-backticks around NaN & 0.

-
-
-
- Bug in :meth:`Categorical.from_codes` where `NaN` values in `codes` were silently converted to `0` (:issue:`21767`)

Datetimelike
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a note in deprection section about from_codes accepting floats.

codes = [1.0, 2.0, 0] # integer, but in float dtype
categories = ['a', 'b', 'c']

with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to check the stacklevel here. You may need to adjust it in from_codes so that the correct line of user-code is reported.

@@ -240,7 +240,7 @@ Deprecations
- :meth:`DataFrame.to_stata`, :meth:`read_stata`, :class:`StataReader` and :class:`StataWriter` have deprecated the ``encoding`` argument. The encoding of a Stata dta file is determined by the file type and cannot be changed (:issue:`21244`).
- :meth:`MultiIndex.to_hierarchical` is deprecated and will be removed in a future version (:issue:`21613`)
- :meth:`Series.ptp` is deprecated. Use ``numpy.ptp`` instead (:issue:`21614`)
-
- :meth:`Categorical.from_codes` has deprecated providing float values for the ``codes`` argument.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference this PR number.

Note the future behavior (ValueError)

-
-
-
- Bug in :meth:`Categorical.from_codes` where ``NaN`` values in `codes` were silently converted to ``0`` (:issue:`21767`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're also changing the behavior of .from_codes([1.1, 2.0]). Make a note of that here.

Also note that both of these cases now raise a ValueError.

if (icodes == codes).all():
err = False
codes = icodes
warn("float codes will be disallowed in the future",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say that in the future this will raise a ValueError.

FutureWarning)
else:
err = True
if err:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the msg way to the err way as well.

@TomAugspurger
Copy link
Contributor

@miker985 can you merge in master and fix the conflict?

@miker985 miker985 force-pushed the fix_categorical_from_codes_nan_21767 branch from 5bcf6f6 to 990240e Compare July 24, 2018 16:30
@miker985
Copy link
Contributor Author

@jreback It says you have requested changes but I think I have addressed them. Can you comment?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny comment

pls rebase in master and be sure no warning are generated by the current code (is why u r failing i think)

pls fix any occurrences of this (as u r already testing the warning)

@@ -629,8 +631,21 @@ def from_codes(cls, codes, categories, ordered=False):
categorical. If not given, the resulting categorical will be
unordered.
"""
codes = np.asarray(codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add a comment with the issue reference here

@miker985 miker985 force-pushed the fix_categorical_from_codes_nan_21767 branch from 990240e to 3d955d1 Compare July 30, 2018 16:00
@miker985
Copy link
Contributor Author

@jreback pinging again - I'm not clear if I've fixed the requested changes and the tool is broken, or if I've missed something (it shows changes requested, but then says the area you requested them is outdated)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miker985 just a small addition to the test. ping no green.

categories = ['a', 'b', 'c']

with tm.assert_produces_warning(FutureWarning):
Categorical.from_codes(codes, categories)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, can you assert the results of the float converted to integer here (as it just warns but doesn't raise).

@miker985
Copy link
Contributor Author

@jreback ping

@jreback jreback merged commit 647f3f0 into pandas-dev:master Aug 1, 2018
@jreback
Copy link
Contributor

jreback commented Aug 1, 2018

thanks @miker985 nice patch!

dberenbaum pushed a commit to dberenbaum/pandas that referenced this pull request Aug 3, 2018
minggli added a commit to minggli/pandas that referenced this pull request Aug 5, 2018
* master: (47 commits)
  Run tests in conda build [ci skip] (pandas-dev#22190)
  TST: Check DatetimeIndex.drop on DST boundary (pandas-dev#22165)
  CI: Fix Travis failures due to lint.sh on pandas/core/strings.py (pandas-dev#22184)
  Documentation: typo fixes in MultiIndex / Advanced Indexing (pandas-dev#22179)
  DOC: added .join to 'see also' in Series.str.cat (pandas-dev#22175)
  DOC: updated Series.str.contains see also section (pandas-dev#22176)
  0.23.4 whatsnew (pandas-dev#22177)
  fix: scalar timestamp assignment (pandas-dev#19843) (pandas-dev#19973)
  BUG: Fix get dummies unicode error (pandas-dev#22131)
  Fixed py36-only syntax [ci skip] (pandas-dev#22167)
  DEPR: pd.read_table (pandas-dev#21954)
  DEPR: Removing previously deprecated datetools module (pandas-dev#6581) (pandas-dev#19119)
  BUG: Matplotlib scatter datetime (pandas-dev#22039)
  CLN: Use public method to capture UTC offsets (pandas-dev#22164)
  implement tslibs/src to make tslibs self-contained (pandas-dev#22152)
  Fix categorical from codes nan 21767 (pandas-dev#21775)
  BUG: Better handling of invalid na_option argument for groupby.rank(pandas-dev#22124) (pandas-dev#22125)
  use memoryviews instead of ndarrays (pandas-dev#22147)
  Remove depr. warning in SeriesGroupBy.count (pandas-dev#22155)
  API: Default to_* methods to compression='infer' (pandas-dev#22011)
  ...
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas.Categorical.from_codes incorrectly converts NaN codes to 0.
4 participants