
Zero counts in Series.value_counts for categoricals #8559

Closed
fkaufer opened this issue Oct 15, 2014 · 32 comments

@fkaufer

fkaufer commented Oct 15, 2014

Series.value_counts() also shows categories with count 0.

I thought this would be a bug, but according to the docs it is intentional.

This makes the output of value_counts inconsistent when switching between category and non-category dtype. Apart from that, it blows up the value_counts output for series with many categories.

I would prefer to hide the (zero) counts for non-occurring categories by default and instead consider a parameter dropzero=True, similar to dropna (see also #5569).
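
A minimal illustration of the behaviour and the proposal (dropzero here is the hypothetical parameter suggested above, not an existing pandas keyword):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')
counts = s[s == 'a'].value_counts()   # 'b' and 'c' still show up with count 0
counts[counts != 0]                   # manual workaround: hide the zero counts
# s[s == 'a'].value_counts(dropzero=True)   # hypothetical proposed API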

@jreback
Contributor

jreback commented Oct 15, 2014

can u show a specific example (of what u think it should do)? Is this with 0.15.0rc1?

@jreback
Contributor

jreback commented Oct 15, 2014

cc @JanSchulz

@fkaufer
Author

fkaufer commented Oct 15, 2014

s = pd.Series(['a','b','a','c','d','c'])
count_str = s[s.isin(['a','u'])].value_counts()
count_cat = s.astype('category')[s.isin(['a','u'])].value_counts()
count_str
a    2
count_cat
a    2
d    0
c    0
b    0
assert count_str==count_cat
...
ValueError: Series lengths must match to compare

@fkaufer
Author

fkaufer commented Oct 15, 2014

... and yes, the version is 0.15.0rc1-24-g56dbb8c

@jankatins
Contributor

I think the current behaviour is correct: a categorical is not a more memory-efficient string dtype but a dtype with a fixed set of values. One of the main points of categoricals is that "unused" categories show up in all kinds of operations, e.g. during groupby and during value_counts. This will come in handy in ggplot, where plot axes should be the same for all facets and unused cats should show up as zero-length bars.

If you want to have the same output, you need to do the "isin" with the result of value_counts() (untested, I don't have a recent pandas env right now):

count_str = s[s.isin(['a','u'])].value_counts()
temp = s.astype('category').value_counts()    # counts over all categories, zeros included
count_cat = temp[temp.index.isin(["a","u"])]  # untested, I hope Index has that method

@jankatins
Contributor

IMO, it's also consistent, as value_counts counts every value it knows about, and in the case of categoricals it knows that there are more than only the "used" categories.

@jreback
Contributor

jreback commented Oct 15, 2014

what about adding the dropna=False arg?
since the default is different, would this be confusing?

@jankatins
Contributor

dropzero=False would be ok, but on the other hand you can do that afterwards as well in a similar manner:

ret = ...  # the full value_counts result, zero counts included
if dropzero:
    return ret[ret != 0]  # probably needs a copy to get around the 'setting_with_copy' thingy...
else:
    return ret

@jankatins
Contributor

@fkaufer What is actually the use case here, i.e., why do you need a Categorical and zero-cats removed?

@jreback jreback added the Categorical and API Design labels Oct 15, 2014
@fkaufer
Author

fkaufer commented Oct 16, 2014

  • The Series example with isin was only a minimal example; it does not reflect my use cases.
  • A typical application is complex filtering on many df columns, then using value_counts as a convenient tool to find out for which categories the filter holds true. Typical use case: interactive data cleansing/exploration with a negative filter that is supposed to return only those - few - categories for which idiosyncrasies (in the other columns) are found (see the sketch after this list). For instance:
    • df_quotes[outlier_filter].symbol.value_counts()
    • df_shipment[outlier_filter].airport_dep.value_counts()
  • This issue seems to be deeper and not restricted to value_counts: Series.unique() also returns all categories, which I consider even more problematic. So df_quotes[outlier_filter].symbol.unique() is equivalent to df_quotes.symbol.cat.categories. Only df_quotes[outlier_filter].symbol.astype(str).unique() does what I'd expect, but I hope I don't have to do that. Gotcha alert!
  • Regarding the plotting argument: I guess there are situations where one or the other (keep zero-cats or not) comes in handy (similar to dropna). And speaking of facet plotting, I'm having trouble with seaborn.FacetGrid right now due to keep-all-categoricals behaviour (many empty facet subplots), so I have to convert back to string before using FacetGrid. FacetGrid relies on Series.unique, see https://github.com/mwaskom/seaborn/blob/master/seaborn/axisgrid.py#L205. That's exactly what I meant regarding the inconsistency between explicit categoricals and implicit string categoricals as used so far. So this is really a matter of being backward-compatible in a way.
  • Conceptual argument: IMO a categorical is a separation of internal and external representation. The external representation, the label, is only meta-data, not data, and meta-data should not be present if the underlying data is - virtually - not existent. But admittedly there is no consensus here: R's table keeps zero frequencies for factor variables, Stata's tabulate doesn't for encoded variables.
  • Re "A categorical is not a more memory efficient string dtype": I'd say "not only", but for me memory efficiency is a very important - and for now the most important - reason to use categoricals. Thanks to that (and thanks to you!) I'm currently working with a dataset on my laptop which before I could only handle on a server.
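
A minimal sketch of that filtering workflow (toy data standing in for df_quotes; the column names and values here are made up):

import pandas as pd

df = pd.DataFrame({'symbol': pd.Categorical(['AAA', 'BBB', 'CCC', 'AAA']),
                   'price': [1.0, 2.0, 300.0, 1.1]})
outlier_filter = df.price > 100

counts = df[outlier_filter].symbol.value_counts()   # CCC: 1, plus AAA and BBB with count 0
counts[counts > 0]                                   # what I would like to see by default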

@jankatins
Contributor

Re plotting: having all values preserved in all facets (so not cat x cat facets, but a cat variable as the x axis in each facet -> zero values turn up as zero-length bars) is actually the use case for ggplot where cats are wanted (in ggplot this relies on value_counts()). I'm not sure what to make of the "cat as facet variable" case. R's ggplot2 removes empty categories in that case:

library(reshape2)
library(ggplot2)
levels(tips$sex) <- c("f", "m", "-")
sp <- ggplot(tips, aes(x=total_bill, y=tip/total_bill)) + geom_point(shape=1)
sp + facet_grid(sex ~ .)

Interestingly unique returns a factor (with all levels, but only the "used" levels as values) when the input is a factor:

> unique(tips$sex)
[1] f m
Levels: f m -
> unique(as.character(tips$sex))
[1] "f" "m"

This is IMO an argument to drop unused categories in unique().

As a workaround for your seaborn problem, you can use df.variable.cat.remove_unused_categories(inplace=True) before faceting
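
A minimal sketch of that workaround on a toy series, using the Series.cat accessor:

import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')
filtered = s[s != 'b']
filtered.cat.categories                       # Index(['a', 'b'], ...) - 'b' is still listed
filtered = filtered.cat.remove_unused_categories()
filtered.cat.categories                       # Index(['a'], ...) - ready for faceting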

Re value_counts: value_counts() is equivalent to:

> table(tips$sex)
  f   m   - 
 87 157   0 

I think a dropzero=False default argument can be done for value_counts(), but that essentially means that this is either ported to all value_counts() methods (where it makes no sense) or you have to test for categorical series (in which case you could also simply remove all zero-count values from the returned Series).

Or you have a "remove unused categories" step in between...

What will happen in your app when you reorder the categories (e.g., "one" < "two" < "three")?

Re metadata: I see it as the levels being part of every item of the categorical data (which they are in R, but right now not in pandas: getting a single item will return a single-item factor in R but an int/string/... in pandas):

> tips$sex[1]
[1] f
Levels: f m -

From your metadata comment and the last bullet, I think what you want is a memory-efficient string dtype. This could actually be done by subclassing Categorical, "hiding" the categorical thingies, and adding categories automatically on assignment. Should actually be almost trivial... This was actually one argument for implementing such a data type in numpy, so that they have a proper variable-length string dtype :-)

=> I see the problem with unique(), but not with value_counts() or with seeing categories as "to be hidden" metadata.

@jankatins
Contributor

Oh my:

> library(reshape2)
> library(dplyr)
> levels(tips$sex) <- c("f", "m", "-")
> gb = group_by(tips, sex)
> summarise(gb,count = n())
Source: local data frame [2 x 2]

  sex count
1   f    87
2   m   157

-> dplyr omits unused levels in group_by

Following this would mean that pandas groupby should also not return empty (unused) categories...

@hadley is that intentional?

@jorisvandenbossche jorisvandenbossche added this to the 0.15.0 milestone Oct 16, 2014
@jorisvandenbossche
Member

For the unique case I also think we should only return the categories that occur in the series (or return a Categorical)

@jankatins
Contributor

Ok, I will prepare a PR for the unique case.

What about the rest? Removing empty groups from groupby will be a deeper change than the unique one...

@jreback
Contributor

jreback commented Oct 16, 2014

@JanSchulz IIRC we specifically made the groupby return ALL of the categorical groups (I like this and think this makes sense). Unique I suppose is a different issue, though (and I agree with the above).
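
(For context on newer pandas: groupby later gained an observed keyword that controls exactly this; it did not exist when this thread was written. A minimal sketch on a current version:)

import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
                   'x': [1, 2, 3]})
df.groupby('cat', observed=False).size()   # a: 2, b: 1, c: 0 - all categories kept
df.groupby('cat', observed=True).size()    # a: 2, b: 1      - unused 'c' dropped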

@jreback
Contributor

jreback commented Oct 16, 2014

so I'd like to ask if we think there are actually 2 'categorical' types:

  1. a memory-saving object representation, which is implemented by a categorical but in reality is object-like (e.g. imagine making all string-like series actually use this as an implementation).
  2. a real categorical which, for example, would show all of the groups in a groupby (whereas an object dtype would not).

These seem really close (and in fact we don't distinguish them) - should we?

@jankatins
Contributor

The biggest difference: how "not in categories" values are handled, e.g. when using concat or when setting new values.

@fkaufer
Author

fkaufer commented Oct 17, 2014

Just to clarify: I don't only (mis)use categoricals for memory efficiency of string variables. But this is - along with the custom ordering - something I get right out of the box now, whereas for the other benefits (signalling for stats/ML, plotting) it will take some time until the respective libs directly support pandas categoricals.

That said, I don't think there should be two different categorical types. I guess the difference in our views on categoricals is rather a matter of the size (cardinality) of the categorical. To me it seems the current design is for categoricals of small cardinality, rather coming from boolean vars. In those cases I can understand your take on value_counts and plotting. I have such categoricals as well, but I also have many categoricals with cardinality on the order of tens, hundreds and even thousands. To me that makes perfect sense, and I consider them "real categoricals" (IMO the main criterion to qualify as a "real categorical" is the fixed range of values). Plotting diagrams with these large-cardinality categoricals typically means you have applied some filtering before, which has virtually decreased the cardinality, hence plotting zero-length bars and showing zero frequencies is really not what you want. Probably I would even use dropna=False more often than dropzero=False. Personally, I would rather suggest adding a separate new method ("levels", "tabulate" or "cat_freq") and keeping the existing methods (unique, value_counts, groupby, ...) consistent with other data types. Such a new method could then also be applied to all dtypes.

Similar to the meta-data perspective, I also like to consider categoricals as separate dimension/lookup tables as in databases, with the built-in feature of being auto-joined whenever I use the categorical for projection (i.e. SELECT in SQL) or certain selections (WHERE cat=scalar). For a database query you would then apply a left or inner/natural join (for real categoricals with fixed values, left and inner join are the same), which is also the default behaviour of pandas' merge (default: inner) and join (default: left). The current behaviour of value_counts et al. corresponds to a full/right outer join (analogously, full and right outer join are equivalent for real categoricals), which feels about as unnatural as if pandas' join/merge default were set to a full/right outer join.

@jreback
Contributor

jreback commented Oct 17, 2014

@fkaufer the join behaviour I will have to think about, but I'm all for consistency.

@JanSchulz can u prepare a PR for reverting unique, value_counts, groupby to not return nan/0-count categories by default?

I think this just means honoring dropna in unique, value_counts.
We can add an option (dropna) for groupby in the next version to optionally include all groups - pls create an issue for that.

@hadley

hadley commented Oct 17, 2014

@JanSchulz it's on the long-term to-do list.

@jankatins
Contributor

@hadley Just so I understand it correctly: you plan to change group_by to include empty levels?

@jankatins
Contributor

@fkaufer
yep, currently Categorical is optimized for use cases like Likert scales (~7 items) or names of states (10-200), and the behaviour should make sense in that context (like counting persons for each state -> zero persons in one state makes IMO sense). In the end it comes down to what is more useful and needs less user code: including empty categories or excluding them. E.g. how often would I need to add empty categories to the result of a groupby (probably a reindex with the original categories) vs how often do I need to remove empty groups (res = res[pd.notnull(res.whatever)]), or how many extra lines do I need to come up with an apply function (if there is an error for empty dataframes) vs how many lines to add empty groups afterwards. I'm not sure if anybody has an idea what cases there are.
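
A rough sketch of those two directions (reindexing to add empty categories back vs. dropping empty groups), on a toy categorical:

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']))

counts = s.astype(object).value_counts()         # only 'a' and 'b' are counted
counts.reindex(s.cat.categories, fill_value=0)   # add the empty 'c' back with count 0

counts_all = s.value_counts()                    # categorical counting, 'c' included as 0
counts_all[counts_all != 0]                      # drop the empty categories again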

Re your app: If you use custom ordering and therefore special-case categorical data, then it wouldn't matter to use remove_unused_categories afterwards or to filter nan/zero in any results of groupby-based functions.

Re categorical and "metadata": I don't see them as a "join operation between codes and categories", but as a new data type which can only take a few values (like you can't put a value larger than max-int into an int array). As such, each individual entry consists of "value and metadata", the same as an int is "value and metadata"; only in the int case the metadata is encoded in the length of the memory block used to store the int. They are "just" implemented like a database join...

@jankatins
Contributor

@jreback
Removing empty groups from groupby will need more than a few code changes, as most of the rest of the code right now also depends on that behaviour (e.g. value_counts), and I'm not so sure what needs changing in groupby.py to make that change, so I think this is post-0.15 work. I would also find it good to model our behaviour after the one in dplyr, as I think that will be the "expected" behaviour for people coming from R.

The unique case is IMO smaller, as it only takes a few lines in unique() and the tests. I can do that for 0.15 or 0.15.1

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Oct 18, 2014
@jreback
Contributor

jreback commented Oct 18, 2014

going to move this to 0.15.1

@jreback
Contributor

jreback commented Nov 24, 2014

@JanSchulz

ok, so to summarise:

  • groupby/value_counts WILL return all of the categories. I think this is correct/intuitive; leave as is
  • (maybe implement a dropna=False argument to optionally not do this)
  • unique, on the other hand, should only return the 'used' categories

@JanSchulz you are doing a PR for the unique case?

@jankatins
Contributor

Unique should be easy: just do the unique on the codes and then take the corresponding categories. Will do...
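
Roughly the idea, sketched with the public API (only an illustration of the approach, not the actual patch):

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c']))
codes = s.cat.codes.unique()     # unique integer codes; -1 would mark NaN
codes = codes[codes != -1]
s.cat.categories[codes]          # take only the used categories: Index(['a', 'c'], ...)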

@jreback
Contributor

jreback commented Feb 10, 2015

@JanSchulz can you revisit and see what we need from this issue?

@quantumds

I believe that the output of value_counts, when applied to categorical variables, shouldn't show values that are nonexistent/not assigned for that variable in the current dataframe. Fixing this would benefit a lot of pandas users doing data analysis. Do we have an estimate of when this improvement will be included? Thanks!

@wesm
Member

wesm commented Jul 6, 2018

In the absence of clear guidance about whether to change anything, I'm closing this as Won't Fix.

Note: R maintains the empty categories when tabulating factor counts

> values <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
> values
 [1] 1 1 1 2 2 2 3 3 3 4 4 4
> values <- factor(values)
> values
 [1] 1 1 1 2 2 2 3 3 3 4 4 4
Levels: 1 2 3 4
> values == 2
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> (values == 2) | (values == 4)
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> values[(values == 2) | (values == 4)]
[1] 2 2 2 4 4 4
Levels: 1 2 3 4
> values2 <- values[(values == 2) | (values == 4)]
> values2
[1] 2 2 2 4 4 4
Levels: 1 2 3 4
> table(values2)
values2
1 2 3 4 
0 3 0 3 

@wesm wesm closed this as completed Jul 6, 2018
@wesm wesm added the Won't Fix label Jul 6, 2018
@codenigma1

s = pd.Series(['a','b','a','c','d','c'])
count_str = s[s.isin(['a','u'])].value_counts()
count_cat = s.astype('category')[s.isin(['a','u'])].value_counts()
count_str
a    2
count_cat
a    2
d    0
c    0
b    0
assert count_str==count_cat
...
ValueError: Series lengths must match to compare

we can avoid the zero counts simply with: series[series != 0] or series[series > 0]

@NumberPiOso
Contributor

In case anyone is reading this issue 8 years later...

You should follow #14942 and #20583

The default behavior of value_counts and groupby has changed.

@crypdick

also necroposting... if you need to drop categories with zero counts, do

df.col_name = df.col_name.cat.remove_unused_categories()
