
Zero counts in Series.value_counts for categoricals #8559

Closed
fkaufer opened this issue Oct 15, 2014 · 32 comments

@fkaufer

fkaufer commented Oct 15, 2014

Series.value_counts() also shows categories with count 0.

I thought this would be a bug, but according to the docs it is intentional.

This makes the output of value_counts inconsistent when switching between category and non-category dtype. Apart from that, it blows up the value_counts output for series with many categories.

I would prefer to hide the (zero) counts for non-occurring categories by default and instead consider a parameter dropzero=True, similar to dropna (see also #5569).
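
A minimal illustration of the behaviour and the proposal (dropzero here is the hypothetical parameter suggested above, not an existing pandas keyword):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')
counts = s[s == 'a'].value_counts()   # 'b' and 'c' still show up with count 0
counts[counts != 0]                   # manual workaround: hide the zero counts
# s[s == 'a'].value_counts(dropzero=True)   # hypothetical proposed API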

@jreback
Contributor

jreback commented Oct 15, 2014

can u show a specific example (of what u think it should do)? Is this with 0.15.0rc1?

@jreback
Contributor

jreback commented Oct 15, 2014

cc @JanSchulz

@fkaufer
Author

fkaufer commented Oct 15, 2014

s = pd.Series(['a','b','a','c','d','c'])
count_str = s[s.isin(['a','u'])].value_counts()
count_cat = s.astype('category')[s.isin(['a','u'])].value_counts()
count_str
a    2
count_cat
a    2
d    0
c    0
b    0
assert count_str==count_cat
...
ValueError: Series lengths must match to compare

@fkaufer
Author

fkaufer commented Oct 15, 2014

... and yes, the version is 0.15.0rc1-24-g56dbb8c

@jankatins
Contributor

I think the current behaviour is correct: a categorical is not a more memory-efficient string dtype but a dtype with a fixed set of values. One of the main points of categoricals is that "unused" categories show up in all kinds of operations, e.g. during groupby and during value_counts. This will come in handy in ggplot, where plot axes should be the same for all facets and unused cats should show up as zero-length bars.

If you want to have the same output, you need to do the "isin" with the result of value_counts() (untested, I don't have a recent pandas env right now):

count_str = s[s.isin(['a','u'])].value_counts()
temp = s.astype('category').value_counts()    # counts over all categories, zeros included
count_cat = temp[temp.index.isin(["a","u"])]  # untested, I hope Index has that method

@jankatins
Contributor

IMO, it's also consistent, as value_counts counts every value it knows about, and in the case of categoricals it knows that there are more than only the "used" categories.

@jreback
Contributor

jreback commented Oct 15, 2014

what about adding the dropna=False arg?
since the default is different, would this be confusing?

@jankatins
Contributor

dropzero=False would be ok, but on the other hand you can do that afterwards as well in a similar manner:

ret = ...  # the full value_counts result, zero counts included
if dropzero:
    return ret[ret != 0]  # probably needs a copy to get around the 'setting_with_copy' thingy...
else:
    return ret

@jankatins
Contributor

@fkaufer What is actually the use case here, i.e., why do you need a Categorical and zero-cats removed?

@jreback jreback added the Categorical and API Design labels Oct 15, 2014
@fkaufer
Author

fkaufer commented Oct 16, 2014

  • The Series example with isin was only a minimal example; it does not reflect my use cases.
  • A typical application is complex filtering on many df columns, then using value_counts as a convenient tool to find out for which categories the filter holds true. Typical use case: interactive data cleansing/exploration with a negative filter that is supposed to return only those - few - categories for which idiosyncrasies (in the other columns) are found (see the sketch after this list). For instance:
    • df_quotes[outlier_filter].symbol.value_counts()
    • df_shipment[outlier_filter].airport_dep.value_counts()
  • This issue seems to be deeper and not restricted to value_counts: Series.unique() also returns all categories, which I consider even more problematic. So df_quotes[outlier_filter].symbol.unique() is equivalent to df_quotes.symbol.cat.categories. Only df_quotes[outlier_filter].symbol.astype(str).unique() does what I'd expect, but I hope I don't have to do that. Gotcha alert!
  • Regarding the plotting argument: I guess there are situations where one or the other (keep zero-cats or not) comes in handy (similar to dropna). And speaking of facet plotting, I'm having trouble with seaborn.FacetGrid right now due to keep-all-categoricals behaviour (many empty facet subplots), so I have to convert back to string before using FacetGrid. FacetGrid relies on Series.unique, see https://github.com/mwaskom/seaborn/blob/master/seaborn/axisgrid.py#L205. That's exactly what I meant regarding the inconsistency between explicit categoricals and implicit string categoricals as used so far. So this is really a matter of being backward-compatible in a way.
  • Conceptual argument: IMO a categorical is a separation of internal and external representation. The external representation, the label, is only meta-data, not data, and meta-data should not be present if the underlying data is - virtually - not existent. But admittedly there is no consensus here: R's table keeps zero frequencies for factor variables, Stata's tabulate doesn't for encoded variables.
  • Re "A categorical is not a more memory efficient string dtype": I'd say "not only", but for me memory efficiency is a very important - and for now the most important - reason to use categoricals. Thanks to that (and thanks to you!) I'm currently working with a dataset on my laptop which before I could only handle on a server.
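
A minimal sketch of that filtering workflow (toy data standing in for df_quotes; the column names and values here are made up):

import pandas as pd

df = pd.DataFrame({'symbol': pd.Categorical(['AAA', 'BBB', 'CCC', 'AAA']),
                   'price': [1.0, 2.0, 300.0, 1.1]})
outlier_filter = df.price > 100

counts = df[outlier_filter].symbol.value_counts()   # CCC: 1, plus AAA and BBB with count 0
counts[counts > 0]                                   # what I would like to see by default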

@jankatins
Contributor

Re plotting: having all values preserved in all facets (so not cat x cat facets, but a cat variable as the x axis in each facet -> zero values turn up as zero-length bars) is actually the use case for ggplot where cats are wanted (in ggplot this relies on value_counts()). I'm not sure what to make of the "cat as facet variable" case. R's ggplot2 removes empty categories in that case:

library(reshape2)
library(ggplot2)
levels(tips$sex) <- c("f", "m", "-")
sp <- ggplot(tips, aes(x=total_bill, y=tip/total_bill)) + geom_point(shape=1)
sp + facet_grid(sex ~ .)

Interestingly unique returns a factor (with all levels, but only the "used" levels as values) when the input is a factor:

> unique(tips$sex)
[1] f m
Levels: f m -
> unique(as.character(tips$sex))
[1] "f" "m"

This is IMO an argument to drop unused categories in unique().

As a workaround for your seaborn problem, you can use df.variable.cat.remove_unused_categories(inplace=True) before faceting
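
A minimal sketch of that workaround on a toy series, using the Series.cat accessor:

import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')
filtered = s[s != 'b']
filtered.cat.categories                       # Index(['a', 'b'], ...) - 'b' is still listed
filtered = filtered.cat.remove_unused_categories()
filtered.cat.categories                       # Index(['a'], ...) - ready for faceting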

Re value_counts: value_counts() is equivalent to:

> table(tips$sex)
  f   m   - 
 87 157   0 

I think a dropzero=False default argument can be done for value_counts(), but that essentially means that this is either ported to all value_counts() methods (where it makes no sense) or you have to test for categorical series (in which case you could also simply remove all zero-count values from the returned Series).

Or you have a "remove unused categories" step in between...

What will happen in your app when you reorder the categories (e.g., "one" < "two" < "three")?

Re metadata: I see it as the levels being part of every item of the categorical data (which they are in R, but right now not in pandas: getting a single item will return a single-item factor in R but an int/string/... in pandas):

> tips$sex[1]
[1] f
Levels: f m -

From your metadata comment and the last bullet, I think what you want is a memory-efficient string dtype. This could actually be done by subclassing Categorical, "hiding" the categorical thingies, and adding categories automatically on assignment. Should actually be almost trivial... This was actually one argument for implementing such a data type in numpy, so that they have a proper variable-length string dtype :-)

=> I see the problem with unique(), but not with value_counts() or with seeing categories as "to be hidden" metadata.

@jankatins
Contributor

Oh my:

> library(reshape2)
> library(dplyr)
> levels(tips$sex) <- c("f", "m", "-")
> gb = group_by(tips, sex)
> summarise(gb,count = n())
Source: local data frame [2 x 2]

  sex count
1   f    87
2   m   157

-> dplyr omits unused levels in group_by

Following this would mean that pandas groupby should also not return empty (unused) categories...

@hadley is that intentional?

@jorisvandenbossche jorisvandenbossche added this to the 0.15.0 milestone Oct 16, 2014
@jorisvandenbossche
Member

For the unique case I also think we should only return the categories that occur in the series (or return a Categorical)

@jankatins
Contributor

Ok, I will prepare a PR for the unique case.

What about the rest? Removing empty groups from groupby will be a deeper change than the unique one...

@jreback
Contributor

jreback commented Oct 16, 2014

@JanSchulz IIRC we specifically made the groupby return ALL of the categorical groups (I like this and think this makes sense). Unique I suppose is a different issue, though (and I agree with the above).
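
(For context on newer pandas: groupby later gained an observed keyword that controls exactly this; it did not exist when this thread was written. A minimal sketch on a current version:)

import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
                   'x': [1, 2, 3]})
df.groupby('cat', observed=False).size()   # a: 2, b: 1, c: 0 - all categories kept
df.groupby('cat', observed=True).size()    # a: 2, b: 1      - unused 'c' dropped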

@jreback
Contributor

jreback commented Oct 16, 2014

so I'd like to ask if we think there are actually 2 'categorical' types:

  1. a memory-saving object representation, which is implemented by a categorical but in reality is object-like (e.g. imagine making all string-like series actually use this as an implementation).
  2. a real categorical which, for example, would show all of the groups in a groupby (whereas an object dtype would not).

These seem really close (and in fact we don't distinguish them) - should we?

@jankatins
Contributor

The biggest difference: how "not in categories" values are handled, e.g. when using concat or when setting new values.

@fkaufer
Author

fkaufer commented Oct 17, 2014

Just to clarify: I don't only (mis)use categoricals for memory efficiency of string variables. But this is - along with the custom ordering - something I get right out of the box now, whereas for the other benefits (signalling for stats/ML, plotting) it will take some time until the respective libs directly support pandas categoricals.

That said, I don't think there should be two different categorical types. I guess the difference in our views on categoricals is rather a matter of the size (cardinality) of the categorical. To me it seems the current design is for categoricals of small cardinality, rather coming from boolean vars. In those cases I can understand your take on value_counts and plotting. I have such categoricals as well, but I also have many categoricals with cardinality on the order of tens, hundreds and even thousands. To me that makes perfect sense, and I consider them "real categoricals" (IMO the main criterion to qualify as a "real categorical" is the fixed range of values). Plotting diagrams with these large-cardinality categoricals typically means you have applied some filtering before, which has virtually decreased the cardinality, hence plotting zero-length bars and showing zero frequencies is really not what you want. Probably I would even use dropna=False more often than dropzero=False. Personally, I would rather suggest adding a separate new method ("levels", "tabulate" or "cat_freq") and keeping the existing methods (unique, value_counts, groupby, ...) consistent with other data types. Such a new method could then also be applied to all dtypes.

Similar to the meta-data perspective, I also like to consider categoricals as separate dimension/lookup tables as in databases, with the built-in feature of being auto-joined whenever I use the categorical for projection (i.e. SELECT in SQL) or certain selections (WHERE cat=scalar). For a database query you would then apply a left or inner/natural join (for real categoricals with fixed values, left and inner join are the same), which is also the default behaviour of pandas' merge (default: inner) and join (default: left). The current behaviour of value_counts et al. corresponds to a full/right outer join (analogously, full and right outer join are equivalent for real categoricals), which feels about as unnatural as if pandas' join/merge default were set to a full/right outer join.

@jreback
Contributor

jreback commented Oct 17, 2014

@fkaufer the join behaviour I will have to think about, but I'm all for consistency.

@JanSchulz can u prepare a PR for reverting unique, value_counts, groupby to not return nan/0-count categories by default?

I think this just means honoring dropna in unique, value_counts.
We can add an option (dropna) for groupby in the next version to optionally include all groups - pls create an issue for that.

@hadley

hadley commented Oct 17, 2014

@JanSchulz it's on the long-term to-do list.

@jankatins
Contributor

@hadley Just so I understand it correctly: you plan to change group_by to include empty levels?

@jankatins
Contributor

@fkaufer
yep, currently Categorical is optimized for use cases like Likert scales (~7 items) or names of states (10-200), and the behaviour should make sense in that context (like counting persons for each state -> zero persons in one state makes IMO sense). In the end it comes down to what is more useful and needs less user code: including empty categories or excluding them. E.g. how often would I need to add empty categories to the result of a groupby (probably a reindex with the original categories) vs how often do I need to remove empty groups (res = res[pd.notnull(res.whatever)]), or how many extra lines do I need to come up with an apply function (if there is an error for empty dataframes) vs how many lines to add empty groups afterwards. I'm not sure if anybody has an idea what cases there are.
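
A rough sketch of those two directions (reindexing to add empty categories back vs. dropping empty groups), on a toy categorical:

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']))

counts = s.astype(object).value_counts()         # only 'a' and 'b' are counted
counts.reindex(s.cat.categories, fill_value=0)   # add the empty 'c' back with count 0

counts_all = s.value_counts()                    # categorical counting, 'c' included as 0
counts_all[counts_all != 0]                      # drop the empty categories again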

Re your app: If you use custom ordering and therefore special-case categorical data, then it wouldn't matter to use remove_unused_categories afterwards or to filter nan/zero in any results of groupby-based functions.

Re categorical and "metadata": I don't see them as a "join operation between codes and categories", but as a new data type which can only take a few values (like you can't put a value larger than max-int into an int array). As such, each individual entry consists of "value and metadata", the same as an int is "value and metadata"; only in the int case the metadata is encoded in the length of the memory block used to store the int. They are "just" implemented like a database join...

@jankatins
Contributor

@jreback
Removing empty groups from groupby will need more than a few code changes, as most of the rest of the code right now also depends on that behaviour (e.g. value_counts), and I'm not so sure what needs changing in groupby.py to make that change, so I think this is post-0.15 work. I would also find it good to model our behaviour after the one in dplyr, as I think that will be the "expected" behaviour for people coming from R.

The unique case is IMO smaller, as it only takes a few lines in unique() and the tests. I can do that for 0.15 or 0.15.1

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Oct 18, 2014
@jreback
Contributor

jreback commented Oct 18, 2014

going to move this to 0.15.1

@jreback
Contributor

jreback commented Nov 24, 2014

@JanSchulz

ok, so to summarise:

  • groupby/value_counts WILL return all of the categories. I think this is correct/intuitive; leave as is
  • (maybe implement a dropna=False argument to optionally not do this)
  • unique, on the other hand, should only return the 'used' categories

@JanSchulz you are doing a PR for the unique case?

@jankatins
Contributor

Unique should be easy: just do the unique on the codes and then take the corresponding categories. Will do...
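
Roughly the idea, sketched with the public API (only an illustration of the approach, not the actual patch):

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c']))
codes = s.cat.codes.unique()     # unique integer codes; -1 would mark NaN
codes = codes[codes != -1]
s.cat.categories[codes]          # take only the used categories: Index(['a', 'c'], ...)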

@jreback
Contributor

jreback commented Feb 10, 2015

@JanSchulz can you revisit and see what we need from this issue?

@quantumds

I believe that the output of value_counts, when applied to categorical variables, shouldn't show values that are nonexistent/not assigned for that variable in the current dataframe. Fixing this would benefit a lot of pandas users doing data analysis. Do we have an estimate of when this improvement will be included? Thanks!

@wesm
Member

wesm commented Jul 6, 2018

In the absence of clear guidance about whether to change anything, I'm closing this as Won't Fix.

Note: R maintains the empty categories when tabulating factor counts

> values <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
> values
 [1] 1 1 1 2 2 2 3 3 3 4 4 4
> values <- factor(values)
> values
 [1] 1 1 1 2 2 2 3 3 3 4 4 4
Levels: 1 2 3 4
> values == 2
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> (values == 2) | (values == 4)
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> values[(values == 2) | (values == 4)]
[1] 2 2 2 4 4 4
Levels: 1 2 3 4
> values2 <- values[(values == 2) | (values == 4)]
> values2
[1] 2 2 2 4 4 4
Levels: 1 2 3 4
> table(values2)
values2
1 2 3 4 
0 3 0 3 

@wesm wesm closed this as completed Jul 6, 2018
@wesm wesm added the Won't Fix label Jul 6, 2018
@codenigma1

s = pd.Series(['a','b','a','c','d','c'])
count_str = s[s.isin(['a','u'])].value_counts()
count_cat = s.astype('category')[s.isin(['a','u'])].value_counts()
count_str
a    2
count_cat
a    2
d    0
c    0
b    0
assert count_str==count_cat
...
ValueError: Series lengths must match to compare

we can avoid the zero counts simply with: series[series != 0] or series[series > 0]

@NumberPiOso
Contributor

In case anyone is reading this issue 8 years later...

You should follow #14942 and #20583

The default behavior of value_counts and groupby has changed.

@crypdick

also necroposting... if you need to drop categories with zero counts, do

df.col_name = df.col_name.cat.remove_unused_categories()
