Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/DOC: Deprecate and Advise against having np.nan in Categoricals #10748

Closed
TomAugspurger opened this issue Aug 4, 2015 · 10 comments
Closed
Labels
API Design Categorical Categorical Data Type Deprecate Functionality to remove in pandas Docs
Milestone

Comments

@TomAugspurger
Copy link
Contributor

This came out of work on #10729

In the documentation, we mention that

There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.

In the first case, NaN is not in .categories, and in the second case it is. I think we should only
recommend the first.

The option of NaNs in the categories makes the code in #10729 less pleasant that it would be otherwise. I don't think we should error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.

@JanSchulz do you remember the initial reason for allowing either representation?

Some bad things that come out of NaN in .categories:

  • Can't rely on a value of nan mapping to a code of -1:
In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

In [3]: s
Out[3]:
[a, b, a, NaN]
Categories (3, object): [a, b, NaN]

In [4]: s.categories
Out[4]: Index(['a', 'b', nan], dtype='object')

In [5]: s.codes
Out[5]: array([0, 1, 0, 2], dtype=int8)
  • potentially have to upcast the index type or mix strings and floats (nan) in the .categories Index.
  • extra code if you want to generically handle Categoricals that may or may not have NaN in categories.
@TomAugspurger TomAugspurger added Docs Categorical Categorical Data Type labels Aug 4, 2015
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Aug 4, 2015
@jreback
Copy link
Contributor

jreback commented Aug 4, 2015

hmm, we don't have an explicity prohibition in _validate_categories, but if you pass in your own categories (not as an Index) a null would be removed.

I think if you added a test and checked in validate_categories (maybe exempting fastpath=True) would be ok. (and I dont think anything would fail).

@jorisvandenbossche
Copy link
Member

Unless someone comes up with a good use case of NaNs in the categories, I think we should just disallow it. This will make it a lot easier I think. Because even if we advise against it in the docs, if it is possible, people will do it and we will have to deal with the bug reports.

@TomAugspurger TomAugspurger changed the title DOC: Advise against having np.nan in Categoricals API/DOC: Deprecate and Advise against having np.nan in Categoricals Aug 4, 2015
@TomAugspurger
Copy link
Contributor Author

Ok, I've updated the title to recommend deprecation (unless there's a good reason otherwise). Later tonight I'll try out a check disallowing it and see what happens.

@jreback jreback added the Deprecate Functionality to remove in pandas label Aug 4, 2015
@jankatins
Copy link
Contributor

The original reason, AFAIR, was that R handled that corner case (NA as factor and NA as missing -> two different NA in a vector). See here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

The codes of a factor may contain NA. For a numeric x, set exclude = NULL to make NA an extra level (prints as <NA>); by default, this is the last level.

If NA is a level, the way to set a code to be missing (as opposed to the code of the missing level) is to use is.na on the left-hand-side of an assignment (as in is.na(f)[i] <- TRUE; indexing inside is.na does not work). Under those circumstances missing values are currently printed as <NA>, i.e., identical to entries of level NA.

@TomAugspurger
Copy link
Contributor Author

Thanks.

I think this is an area where with the pandas way of handling missing data (np.nan for everything) means it makes sense to differ from R. Unless the potential of adding integer NA with dynd means we would want to allow NA in the .categories. I still don't really see the benefit of that though.

@shoyer
Copy link
Member

shoyer commented Aug 4, 2015

I agree, I don't think there is anything to be gained from copying this R behavior in pandas.

@jreback
Copy link
Contributor

jreback commented Aug 4, 2015

@TomAugspurger let's just change this. I don't even thing its worth deprecating.

@jorisvandenbossche
Copy link
Member

@TomAugspurger any chance you would have some time for this to get it into 0.17?

@TomAugspurger
Copy link
Contributor Author

I'll give it a shot today. I think FutureWarning though right, not a straight removal? People could be relying on this.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2015

yes deprecation warning is fine

@jreback jreback modified the milestones: 0.17.0, Next Major Release Aug 29, 2015
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 26, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 26, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 26, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 26, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 26, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 26, 2017
jreback added a commit that referenced this issue Mar 27, 2017
Deprecated in 0.17.0.
xref #10748
xref #13648

Author: Jeff Reback <jeff@reback.net>
Author: gfyoung <gfyoung17@gmail.com>

Closes #15806 from gfyoung/categories-nan-drop and squashes the following commits:

318175b [Jeff Reback] TST: test pd.NaT with correct dtype
4dce349 [gfyoung] Drop support for NaN categories in Categorical
mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017
Deprecated in 0.17.0.
xref pandas-dev#10748
xref pandas-dev#13648

Author: Jeff Reback <jeff@reback.net>
Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#15806 from gfyoung/categories-nan-drop and squashes the following commits:

318175b [Jeff Reback] TST: test pd.NaT with correct dtype
4dce349 [gfyoung] Drop support for NaN categories in Categorical
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Deprecate Functionality to remove in pandas Docs
Projects
None yet
Development

No branches or pull requests

5 participants