API/BUG: awkward syntax to add categories to a Categorical #9927

Closed
jreback opened this issue Apr 18, 2015 · 7 comments · Fixed by #9929
Labels
API Design Categorical Categorical Data Type
Milestone

Comments

@jreback
Contributor

jreback commented Apr 18, 2015

from SO

  • .add_categories should be able to take an Index/ndarray; ATM the argument must be converted to a list
  • there should be a keyword for something like: add any additional categories that I am passing to you (even if some of them duplicate the current ones, just add the new ones to the end), e.g.
In [147]: s = pd.Categorical(list('aabbcd'))

In [148]: s2 = list('aabbcdefg')

In [149]: s
Out[149]: 
[a, a, b, b, c, d]
Categories (4, object): [a, b, c, d]

In [150]: s.add_categories(s.categories.sym_diff(Index(s2)).tolist())
Out[150]: 
[a, a, b, b, c, d]
Categories (7, object): [a, b, c, d, e, f, g]

I would ideally just like to say:

s.add_categories(s2, take_new=True)

(maybe not the best keyword, but something like this)
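For reference, the behaviour asked for here can be approximated with existing calls. The following is only a minimal sketch, assuming a reasonably recent pandas; `add_missing_categories` is a hypothetical helper name used for illustration, not a pandas function, and `take_new` is just the keyword proposed in this issue.

```python
import pandas as pd

def add_missing_categories(cat, candidates):
    """Hypothetical helper: append only the candidates that are not categories yet."""
    existing = set(cat.categories)
    # dict.fromkeys drops duplicates while keeping first-seen order
    new = [c for c in dict.fromkeys(candidates) if c not in existing]
    return cat.add_categories(new)

s = pd.Categorical(list("aabbcd"))
s2 = list("aabbcdefg")
result = add_missing_categories(s, s2)
print(list(result.categories))
```

The existing categories keep their positions and the unseen ones are appended in the order they first appear in `s2`.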

@jreback jreback added API Design Categorical Categorical Data Type labels Apr 18, 2015
@jreback jreback added this to the 0.17.0 milestone Apr 18, 2015
@jreback
Contributor Author

jreback commented Apr 18, 2015

cc @JanSchulz

@jankatins
Contributor

I'm against s.add_categories(s2, take_new=True): how should the whole thing be ordered? If it is an ordered cat, the order has meaning and a user will run into surprises, no matter how well we document that :-(

A bad example for "append the difference": take two series, one ["a","b"] and one ["a","c"], and then use the above with ["a","b","c"] -> you get two different category orders, one ["a","b","c"] and one ["a","c","b"].
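The a-b-c case can be traced without pandas. This is a rough sketch that models a category list as a plain Python list; `append_difference` is a hypothetical name for the "keep what is there, append what is new" rule, not an existing API.

```python
def append_difference(categories, wanted):
    """Model of "append the difference": keep existing order, add unseen items at the end."""
    return list(categories) + [c for c in wanted if c not in categories]

wanted = ["a", "b", "c"]
first = append_difference(["a", "b"], wanted)
second = append_difference(["a", "c"], wanted)
print(first, second)
```

Both results hold the same three categories, but in different positions, which is the surprise described above for ordered categoricals.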

We do have set_categories(new), which is more well defined (i.e. take the order from new, even if they are currently unused).

I think accepting ndarray/Index would be good, but with the same constraints as the current list.
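By contrast, the set_categories route mentioned above can be sketched as follows, assuming a reasonably recent pandas. The variable names are illustrative only.

```python
import pandas as pd

wanted = ["a", "b", "c"]
first = pd.Categorical(["a", "b"]).set_categories(wanted)
second = pd.Categorical(["a", "c"]).set_categories(wanted)

# Both categoricals now share the order given by `wanted`
print(list(first.categories), list(second.categories))
# The values of `second` are re-coded against the new order
print(list(second.codes))
```

Here the order comes from the list that is passed in, so the two categoricals end up with identical categories regardless of what they contained before.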

@jreback
Contributor Author

jreback commented Apr 18, 2015

cc @JanSchulz I am not actually suggesting anything new here, just appending, where the order is not defined except that the new categories come AFTER the existing ones (as .add_categories does now), see #9929

I think this is a very natural thing to do, and it is pretty awkward to force the user to make sure that they are not adding duplicate categories. Of course they can use .set_categories, but then what is the point of .add_categories?

@jankatins
Contributor

jankatins commented Apr 18, 2015

regarding "what's the point of add_cats": for the use cases which categorical should IMO be optimized for ("Likert scales" or "American states"), this method is mostly useless: I can't really think of a use case where set_categories isn't better or at least "good enough" (but then, I have only used questionnaires). As far as I remember, it was added in the discussion to round the API off (in addition to add_categories).

IMO this issue/SO question is another "workaround"/indicator for the "categoricals as memory efficient strings" use: if you need to set arbitrary-length categories (e.g. s.unique()) without bothering about the order or what you actually add as categories, then a memory efficient string type would be better. This is also nicely visible in the SO question: "I'm trying to reduce the size of ~300 csv files (about a billion rows) by replacing lengthy fields with shorter, categorical, values[...]"
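The "memory efficient strings" use of categoricals can be illustrated with a small measurement. This is only a sketch with made-up data, assuming a reasonably recent pandas; the exact byte counts depend on the pandas version and string storage, so none are quoted here.

```python
import pandas as pd

# Ten distinct, fairly long values repeated many times (made-up data)
long_values = ["a fairly long repeated field value number %d" % i for i in range(10)]
as_strings = pd.Series(long_values * 10_000)
as_category = as_strings.astype("category")

string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)
```

The categorical stores each distinct value once plus a small integer code per row, which is why it is attractive for this purpose even when the order of the categories carries no meaning.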

@jreback jreback modified the milestones: 0.16.1, 0.17.0 Apr 20, 2015
@jreback
Contributor Author

jreback commented Apr 20, 2015

see #9929

seems reasonable to me. @shoyer @jorisvandenbossche

@JanSchulz I'm not sure why you think this doesn't apply to Categoricals. I want to add a bunch of categories. Not saying they are in any particular order, just that we don't rewrite the existing category mappings. I think that is a reasonable API guarantee.

This is actually what we do in reality now; this just codifies it and makes it convenient to do.
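The guarantee described here, that existing category mappings are not rewritten, can be checked on the integer codes. A minimal sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

before = pd.Categorical(list("aabbcd"))
after = before.add_categories(["e", "f", "g"])

# Codes index into .categories; appending at the end leaves them untouched
print(list(before.codes))
print(list(after.codes))
```

Because the new categories are appended after the existing ones, every existing value keeps the code it had before the call.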

@jankatins
Contributor

I just think that a report starting with "I'm trying to reduce the size..." is not something which should be used to influence the API design of Categorical but should be used to model a new MemoryEfficientString (which maybe has a common superclass with Categorical...).

Regarding the use cases: I think that add_categories and remove_categories are not really needed if you only work with survey data (which is what I did; I have no experience with any other categorical data apart from the "I want more memory efficient strings" cases -> if there are more and they need the above methods, then my point is moot here), where you have something like 7-item Likert scales or 200 names of states: for them it makes as much (or more) sense to simply use set_categories, as you start with the complete list of states or even want to have a defined order (Likert).

If you have a cat with ordered==True, then s.add_categories(s2, take_new=True) is IMO plain madness, as you can't be sure what the final order of the categories is (the a-b-c example above).

Even if you want "append at the end", a simple s.cat.add_categories(set(all_cats)-set(s.cat.categories)) is IMO not so terribly problematic as to warrant a new kwarg (and I still think that s.cat.set_categories(all_cats) is better/easier/less error prone and IMO the more common case if you need a special order, where you need to re-code/factorize anyway).
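The two alternatives from this comment can be put side by side. This is a hedged sketch, assuming a reasonably recent pandas; `all_cats` is made-up data, and `sorted()` is added here only because a bare set has no guaranteed iteration order.

```python
import pandas as pd

s = pd.Series(list("aabbcd"), dtype="category")
all_cats = list("aabbcdefg")

# Option 1: append only the categories that are missing
missing = sorted(set(all_cats) - set(s.cat.categories))
appended = s.cat.add_categories(missing)

# Option 2: state the complete category list, and its order, in one go
reset = s.cat.set_categories(sorted(set(all_cats)))

print(list(appended.cat.categories))
print(list(reset.cat.categories))
```

With this data both options give the same categories; they differ once the existing order and the wanted order disagree, which is the case the comment argues set_categories handles more predictably.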

@jreback
Contributor Author

jreback commented Apr 21, 2015

@JanSchulz ok, closing for now (I did merge the bug fix though, but separate issue)
