API: categorical grouping will no longer return the cartesian product #20583

jreback · 2018-04-02T14:06:17Z

closes #14942
closes #15217
closes #17594
closes #8869

jreback · 2018-04-02T14:08:00Z

this makes categorical groupers work like other groupers and should make things more performant and intuitive. It is somewhat walking back a change from when we first introduced categorical groupers. But the information is preserved (meaning the categories are still there), just removes the automatic re-indexing (which was causing memory to blow up).

still need some increased test coverage. note the first commit actually has almost all of the changes, the next are just cleaning up tests.
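For readers who want to see the difference concretely, here is a minimal sketch. The frame is made up for illustration, and it uses the `observed` keyword that this PR ends up adding to `groupby`; it is passed explicitly because the default has been under discussion.

```python
import pandas as pd

# Illustrative frame: both keys are categorical and each declares one unused category.
df = pd.DataFrame({
    "A": pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"]),
    "B": pd.Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"]),
    "values": [1, 2, 3, 4],
})

# observed=False: one row per combination of categories (3 x 3 = 9).
full = df.groupby(["A", "B"], observed=False)["values"].sum()

# observed=True: one row per combination that actually occurs in the data (4).
seen = df.groupby(["A", "B"], observed=True)["values"].sum()

print(len(full), len(seen))
```

The gap between the two lengths grows multiplicatively with every extra categorical key, which is the memory problem described above.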

TomAugspurger · 2018-04-02T14:50:13Z

Just making sure, this affects grouping by a single categorical as well?

jreback · 2018-04-02T19:51:43Z

yes this affects just a single categorical column also

codecov · 2018-04-02T22:36:05Z

Codecov Report

Merging #20583 into master will increase coverage by <.01%.
The diff coverage is 98.07%.

@@            Coverage Diff             @@
##           master   #20583      +/-   ##
==========================================
+ Coverage   91.78%   91.79%   +<.01%     
==========================================
  Files         153      153              
  Lines       49341    49371      +30     
==========================================
+ Hits        45287    45319      +32     
+ Misses       4054     4052       -2

Flag	Coverage Δ
#multiple	`90.18% <98.07%> (+0.01%)`	⬆️
#single	`41.92% <5.76%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/generic.py	`95.94% <ø> (ø)`	⬆️
pandas/core/arrays/categorical.py	`95.67% <100%> (+0.05%)`	⬆️
pandas/core/groupby/groupby.py	`92.62% <100%> (+0.07%)`	⬆️
pandas/core/indexes/category.py	`97.03% <100%> (ø)`	⬆️
pandas/core/reshape/pivot.py	`96.97% <87.5%> (ø)`	⬆️
pandas/util/testing.py	`84.59% <0%> (+0.2%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28edd06...bdf7525. Read the comment docs.

jreback · 2018-04-09T16:36:11Z

so this is all ready to go if any comments. @TomAugspurger @jorisvandenbossche

jorisvandenbossche · 2018-04-09T21:48:03Z

yes this affects just a single categorical column also

That doesn't seem to be the case?

With this branch, and the example of the whatsnew docs:

In [14]: df.groupby(['A', 'B']).sum()
Out[14]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4

In [15]: df.groupby('A').sum()
Out[15]: 
   values
A        
a       3
b       7
z       0               <---------- now unobserved category is included

Further, a general comment (will try to do a more detailed review later this week): I am not sure we can just change this. First, it is an API-breaking change in several places, e.g. also pivot_table*. And second, I think the current behaviour can actually be useful in certain cases and it would be nice to have a way to keep this behaviour.
I know this is very ugly, but would it be worth having a keyword for this in groupby?

* but I agree we should look into it and try to make this more consistent. As, for example, pivot does not seem to include unobserved categories (already currently on master), while pivot_table does include them but apparently only for the index and not columns? (try df.pivot_table('values', 'A', 'B') with the example of whatsnew). On the other hand, value_counts does include them (and I think rightly so), but that also introduces an inconsistency with groupby.
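A small sketch of the `value_counts` behaviour mentioned in the footnote, using a made-up Series with one declared but unused category:

```python
import pandas as pd

# "z" is a declared category that never occurs in the data.
s = pd.Series(pd.Categorical(["a", "a", "b"], categories=["a", "b", "z"]))

# Series.value_counts reports every category, with a zero for the unused one.
counts = s.value_counts()
print(counts)
```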

jreback · 2018-04-12T10:39:08Z

So I could also change this for a single grouper. That breaks a couple of tests. I am inclined to do this actually as then it makes the multi and single case consistent.

We cannot support a full cartesian product for multi-groupers as this will blow up memory and kill performance. So either we go for:

an option to turn this on/off.
differing behavior for single vs multi
change single grouping behavior as well
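To put rough numbers on the memory concern, a back-of-the-envelope sketch. The category counts and row count are hypothetical, and `product_size` is an illustrative helper, not a pandas function.

```python
from math import prod


def product_size(category_counts):
    """Number of rows in the full cartesian product of the given category counts."""
    return prod(category_counts)


# Hypothetical data: three categorical keys and 100,000 rows.
category_counts = [1_000, 1_000, 50]
n_rows = 100_000

full = product_size(category_counts)
# Observed combinations can never exceed the number of rows in the data.
observed_upper_bound = min(n_rows, full)

print(full, observed_upper_bound)
```

With these numbers the full product has 50 million rows, while the observed result has at most 100,000.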

TomAugspurger · 2018-04-12T10:49:03Z

I’d favor an option. I think the default should eventually be to exclude unobserved categories (after a deprecation cycle)


jreback · 2018-04-20T00:23:50Z

pandas/tests/groupby/test_categorical.py

+
+
+@pytest.mark.xfail(reason="failing with observed")
+def test_observed_failing(observed):


if anyone wants to take a crack at this test. It's the only one that uses an IntervalIndex as its category, though this might be a red herring.

cc @WillAyd @toobaz @jschendel

Haven't been able to follow all the way through yet but I think this is a regression with the below method:

pandas/pandas/core/sorting.py

Line 127 in 41db527

def decons_group_index(comp_labels, shape):

Using the failing test example this returns [array([1, 1, 0, 0]), ... but on master that same object would return [array([1, 1, 2, 2]), .... As a result, I think the reconstructed group labels are getting swapped in _wrap_agged_blocks and causing the failure.

Will try to find time next day or so to walk through in more detail but sharing in case it helps anyone else reviewing the issue
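For anyone following along, a plain-Python sketch of the general idea of splitting a flat group id back into per-level labels. This assumes row-major flat ids and is only an illustration; `decompose_flat_ids` is a hypothetical name and not the pandas implementation of `decons_group_index`.

```python
def decompose_flat_ids(flat_ids, shape):
    """Split row-major flat group ids into one list of labels per level."""
    labels = []
    remaining = list(flat_ids)
    # Peel off the fastest-varying (last) level first.
    for size in reversed(shape):
        labels.append([i % size for i in remaining])
        remaining = [i // size for i in remaining]
    # Levels were collected last-to-first, so flip them back.
    return labels[::-1]


shape = (2, 3)
# Flat id = level0_label * 3 + level1_label for this shape.
flat = [l0 * 3 + l1 for l0, l1 in [(1, 2), (1, 2), (0, 0)]]
decoded = decompose_flat_ids(flat, shape)
print(decoded)
```

If the `shape` passed in does not match the one used to build the flat ids, the decoded labels come out wrong, which is the kind of mismatch the comment above suspects.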

jreback · 2018-04-26T01:37:55Z

this is ready for a look

@TomAugspurger @jorisvandenbossche

TomAugspurger

Looks good. Main thing is the change to pivot_table.

It'd be good to document whether the unobserved categories are present in the resulting index's type.

i.e. is

In [7]: pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])).count().index.dtype

Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)

or

Out[7]: CategoricalDtype(categories=['a'], ordered=False)

This could go in the whatsnew and either groupby.rst or categorical.rst probably. I think either behavior is fine.

Still going through the test changes.
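A quick way to check this on a given build, as a sketch reusing the example above. The docs text under review in this PR states that the returned dtype always keeps all of the categories.

```python
import pandas as pd

s = pd.Series([1, 1, 1])
cat = pd.Categorical(["a", "a", "a"], categories=["a", "b"])

idx = s.groupby(cat, observed=True).count().index

# Labels that appear as rows in the result.
print(list(idx))
# Categories carried on the index dtype.
print(list(idx.categories))
```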

TomAugspurger · 2018-04-26T01:42:33Z

doc/source/whatsnew/v0.23.0.txt

+
+.. ipython:: python
+
+.. code-block:: python


Maybe have an example showing the future warning? So just this without observed=False, and an :okwarning: directive. Then you can say "use observed=False to retain the previous behavior and silence the warning."

TomAugspurger · 2018-04-26T01:42:40Z

doc/source/whatsnew/v0.23.0.txt

+   df.groupby(['A', 'B', 'C'], observed=False).count()
+
+
+New Behavior (show only observed values):


New -> Future?

TomAugspurger · 2018-04-26T01:55:48Z

pandas/core/groupby/groupby.py

+                    msg = ("pass observed=True to ensure that a "
+                           "categorical grouper only returns the "
+                           "observed groupers, or\n"
+                           "observed=False to return NA for non-observed"


Some things like 'count' return 0 for unobserved. You could rephrase as "observed=False to include unobserved categories."
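A minimal sketch of that point, with made-up data: what an unobserved category gets depends on the aggregation.

```python
import math

import pandas as pd

s = pd.Series([1, 1, 1])
cat = pd.Categorical(["a", "a", "a"], categories=["a", "b"])
grouped = s.groupby(cat, observed=False)

counts = grouped.count()  # the unobserved "b" gets a count of 0
totals = grouped.sum()    # the sum over an empty group is 0
means = grouped.mean()    # the mean of an empty group is NaN

print(counts["b"], totals["b"], means["b"])
```

So "return NA for non-observed" is accurate for `mean` but not for `count` or `sum`, which supports the suggested rewording.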

TomAugspurger · 2018-04-26T02:01:24Z

pandas/core/reshape/pivot.py

@@ -79,7 +79,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
                pass
        values = list(values)

-    grouped = data.groupby(keys)
+    grouped = data.groupby(keys, observed=dropna)


Need to think about this a bit. Are we all OK with "overloading" dropna to serve two purposes? I think it's ok...

yes, but that's the meaning of the dropna now here anyhow

TomAugspurger · 2018-04-26T02:02:50Z

pandas/core/reshape/pivot.py

@@ -241,10 +241,13 @@ def _all_key(key):
            return (key, margins_name) + ('',) * (len(cols) - 1)

        if len(rows) > 0:
-            margin = data[rows + values].groupby(rows).agg(aggfunc)
+            margin = data[rows + values].groupby(


observed=True for these changes are all backwards compatible?

TomAugspurger · 2018-04-26T02:05:30Z

pandas/tests/groupby/aggregate/test_other.py

@@ -488,12 +488,12 @@ def test_agg_structs_series(structure, expected):


 @pytest.mark.xfail(reason="GH-18869: agg func not called on empty groups.")
-def test_agg_category_nansum():
+def test_agg_category_nansum(observed):


Does this need to be xfailed for observed=True? I think it may be the right answer (not sure though).

fixed so that observed=True is XPASS (not sure how to xfail on a particular fixture value)
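On xfailing a single fixture value: one option, as far as I know, is to wrap that value in `pytest.param` with a mark. A sketch follows; `OBSERVED_PARAMS` and `observed_with_xfail` are hypothetical names, and the reason string is copied from the test above.

```python
import pytest

# Hypothetical parameter list: only observed=False carries the xfail mark.
OBSERVED_PARAMS = [
    pytest.param(True, id="observed"),
    pytest.param(
        False,
        id="unobserved",
        marks=pytest.mark.xfail(
            reason="GH-18869: agg func not called on empty groups."
        ),
    ),
]


@pytest.fixture(params=OBSERVED_PARAMS)
def observed_with_xfail(request):
    """Yield each ``observed`` value; the False case is expected to fail."""
    return request.param
```

A test that takes `observed_with_xfail` would then run normally for True and be reported as xfail or xpass for False.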

TomAugspurger · 2018-04-26T02:05:47Z

pandas/tests/groupby/conftest.py

@@ -4,6 +4,11 @@
 from pandas.util import testing as tm


+@pytest.fixture(params=[True, False])
+def observed(request):


Docstring would be nice.

TomAugspurger · 2018-04-26T02:12:18Z

pandas/tests/reshape/test_pivot.py

            names=['A', 'B'])
        expected = DataFrame(
-            {'values': [1, 2, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan]},


Ah, so those changes in pivot are API breaking? I'd prefer that this goes through the same deprecation cycle. On the other hand, this would mean an additional keyword, so that we can do the deprecation cycle properly...

Do we think pivot(..., dropna=True, observed=False) (i.e. the current default) is a useful combination? I could see it being desired.

observed works like dropna, so I am not sure we need this. I can make the default work like the existing I think (which is effectively dropna=False)

TomAugspurger · 2018-04-26T02:15:55Z

pandas/tests/groupby/test_categorical.py

-def test_groupby_sort_categorical():
+def test_sort():
+
+    # http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby


Is this line too long? https://stackoverflow.com/q/23814368/1889400 is a short link to the same q.

TomAugspurger · 2018-04-26T13:23:35Z

http://pandas-docs.github.io/pandas-docs-travis/categorical.html#operations would be a good place to show observed=False.

About pivot_table, I always forget what dropna does... Does it control dropping columns / rows that are all NA before or after aggregating?

jorisvandenbossche · 2018-04-26T21:43:18Z

I will try to give this a more detailed look tomorrow.

But general comment: now that we have the keyword to specify the behaviour (for now with observed=False for back compat), I am not really sure that we should change the default in the future.

It's difficult to judge, as I personally don't run into such situations much, but my gut feeling says that the original issue that motivated the change (the combinatorial explosion with multiple categorical groupers) is not the majority usage pattern of categoricals. And if you don't want to include unobserved categories in your analyses in general (groupby, pivot, value_counts, plotting, ..), you always have the easy functionality of remove_unused_categories.
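A sketch of the `remove_unused_categories` route on made-up data: once the unused category is dropped, it no longer shows up even when unobserved categories are requested.

```python
import pandas as pd

key = pd.Series(pd.Categorical(["a", "a", "b"], categories=["a", "b", "z"]))
values = pd.Series([1, 2, 3])

# Drop categories that never occur; "z" disappears from the dtype.
trimmed = key.cat.remove_unused_categories()

counts = trimmed.value_counts()
sums = values.groupby(trimmed, observed=False).sum()

print(list(trimmed.cat.categories), len(counts), len(sums))
```

Unlike the keyword, this also removes the category from the dtype of the result.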

cc @jankatins

jreback · 2018-04-26T22:08:09Z

@TomAugspurger the pivot was a red herring, was not passing things thru. and dropna (in pivot) is exactly equivalent of the observed kwarg (to groupby). So maybe should just rename observed -> dropna, and would be consistent across other functions (value_counts) as well.

@jorisvandenbossche I disagree. You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default. (it's also easy to create if you need). Note this has nothing to do with the actual categories that are returned, they are in BOTH cases indicated on the dtype of the level of that index, (observed or not), it's the groupers that are the issue.
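One way to build the full product on demand, sketched on made-up data: aggregate only the observed groups, then reindex against the product of the categories. The re-keying step is a defensive choice for this sketch, not something the PR prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"]),
    "B": pd.Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"]),
    "values": [1, 2, 3, 4],
})
keys = ["A", "B"]

# Aggregate only the combinations that occur.
seen = df.groupby(keys, observed=True)["values"].sum()

# Re-key on plain labels so the reindex does not depend on categorical level dtypes.
seen.index = pd.MultiIndex.from_tuples(seen.index.tolist(), names=keys)

# Every combination of declared categories, including unobserved ones.
full_index = pd.MultiIndex.from_product(
    [list(df[k].cat.categories) for k in keys], names=keys
)
full = seen.reindex(full_index, fill_value=0)

print(len(seen), len(full))
```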

toobaz · 2018-04-26T22:18:14Z

@jorisvandenbossche I also think the default should change. Categoricals are mostly an implementation detail, and should behave as similarly as possible as ordinary Series.

jorisvandenbossche · 2018-04-26T22:28:02Z

You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default.

But in most cases you have no cartesian product, you only have a single categorical key

It can easily blow you up and shouldn't be the default. (it's also easy to create if you need).

It's also easy to get the version with only the observed ones (certainly now there is the keyword), so that is not really an argument IMO

Note this has nothing to do with the actual categories that are returned, they are in BOTH cases indicated on the dtype of the level of that index

Yes, and that is certainly a good thing. But for me it is still the question what we want the visible output to be.
For example a pivot table is often used as a kind of a summary table. In such a case, I often do want to know that a certain category I care about (otherwise it would not be in the categories) has no values.
And it is also about consistency: value_counts does include the unobserved one (and rightly so, IMO).

jankatins · 2018-04-27T06:50:30Z

My original motivation to work on categoricals was stuff like Likert scales ("completely agree ... completely disagree", 5 to 7 values). For that, seeing unobserved categories in group bys is a good thing.

Since then a lot of changes have been made to make categoricals useful in other situations (like as a memory efficient string replacement or to analyse genes?).

This change feels like it is a change to make working with the latter easier but will make the former harder.

toobaz · 2018-04-27T07:15:41Z

And it is also about consistency: value_counts does include the unobserved one (and rightly so, IMO).

I took it for granted that if we change groupby we will also change value_counts - otherwise I agree it doesn't make any sense.

This change feels like it is a change to make working with the latter easier but will make the former harder.

True... I just think your use case is more rare (but I have no data to support this statement, other than personal experience). Anyway, with the new argument both things will be pretty easy anyway, it is mainly a matter of which use case should require awareness from the user.

jreback · 2018-05-01T10:08:37Z

I changed the default back to observed=None to preserve options. I think we can discuss whether to change this in the future. But for now this allows folks to deal with grouping in a better way.

jankatins · 2018-05-01T10:55:44Z

I don't see why it should be good that my code breaks when appending the new data to the old data.

If you have a defined order, where would you put the new categorical into the orders? E.g. good - middle - bad, where would you put 'extreme' -> if you care about order, then you also care about not adding new stuff

I mostly work on the 5-10 most frequent types, and in those cases I will always use observed=False

This usecase might also be satisfied by a collapse or lump method (e.g. see here: http://r4ds.had.co.nz/factors.html#modifying-factor-levels)

TomAugspurger · 2018-05-01T11:08:28Z

FYI, discussion for a new dict-encoded / interned values type can go at #20899 so we don't lose it.

Reviewing this PR one more time, but I think people are all on board with having the default be observed=False (back compat) for now?

TomAugspurger · 2018-05-01T11:11:21Z

doc/source/whatsnew/v0.23.0.txt

@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our

 .. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/

+.. _whatsnew_0230.enhancements.categorical_grouping:
+
+Categorical Groupers has gained an observed keyword


has -> have? Because "categorical Groupers" is plural right?

TomAugspurger · 2018-05-01T11:14:19Z

pandas/core/arrays/categorical.py

@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
            categories in the original order.
        """

+        # we only care about observed values
+        if observed:
+            unique_codes = unique1d(self.codes)


Haven't thought this through, but can this if block be replaced with self.remove_unused_categories()._codes_for_groupby(sort=sort, observed=False)?

no, you actually need the uniques
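A sketch of what working from the unique codes gives you, on made-up data. This is not the code in the PR (which uses an internal `unique1d` helper), and the order-of-appearance point is only one plausible reason the uniques are needed.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "a", "b", None], categories=["a", "b", "z"])

# Sorted unique codes; -1 marks missing values and is dropped.
sorted_codes = np.unique(cat.codes)
sorted_codes = sorted_codes[sorted_codes != -1]
observed_sorted = cat.categories.take(sorted_codes)

# Unique codes in order of first appearance, which matters when sort=False.
seen_codes = pd.unique(cat.codes)
seen_codes = seen_codes[seen_codes != -1]
observed_in_order = cat.categories.take(seen_codes)

print(list(observed_sorted), list(observed_in_order))
```

`remove_unused_categories` keeps the declared category order, so it matches the sorted variant but cannot reproduce the order of appearance.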

jorisvandenbossche

@jreback I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?

I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.

jorisvandenbossche · 2018-05-01T11:58:32Z

doc/source/whatsnew/v0.23.0.txt

+Categorical Groupers has gained an observed keyword
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for


To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include unobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).

I didn't change this on purpose, this is more correct.

"Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as

"Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."

Yep, the explanation of Tom is exactly what I meant.

@jreback I have no problem at all with that you don't agree with a comment (it would be strange otherwise :-)) and thus not update for it, but can you then answer to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))

jorisvandenbossche · 2018-05-01T12:02:20Z

doc/source/groupby.rst

+Handling of (un)observed Categorical values
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword


we don't use "grouper" as terminology in our documentation (except for the pd.Grouper object), so I would write "groupby key" or "to group by"

also "multipler" -> "multiple"

jorisvandenbossche · 2018-05-01T12:02:31Z

doc/source/groupby.rst

+
+When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword
+controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those
+that are observed groupers (``observed=True``).


"or only those that are observed groupers" -> "or only the observed categories"

jorisvandenbossche · 2018-05-01T12:03:10Z

doc/source/groupby.rst

+
+   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
+
+The returned dtype of the grouped will *always* include *all* of the catergories that were grouped.


catergories -> categories

jorisvandenbossche · 2018-05-01T12:03:38Z

doc/source/groupby.rst

+
+.. ipython:: python
+
+   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()


I would maybe just create s and cat to avoid repeating this a few times

jorisvandenbossche · 2018-05-01T12:11:25Z

pandas/core/generic.py

@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
        squeeze : boolean, default False
            reduce the dimensionality of the return type if possible,
            otherwise return a consistent type
+        observed : boolean, default None
+            if True: only show observed values for categorical groupers.


capital If (below as well)

Also, can you start this explanation with noting this keyword is only when grouping by categorical values?

jorisvandenbossche · 2018-05-01T12:11:27Z

pandas/core/generic.py

+            if True: only show observed values for categorical groupers.
+            if False: show all values for categorical groupers.
+            if None: if any categorical groupers, show a FutureWarning,
+                default to False.


no indentation for rst formatting

jorisvandenbossche · 2018-05-01T12:13:11Z

pandas/core/groupby/groupby.py

@@ -2898,14 +2907,16 @@ class Grouping(object):
    """

    def __init__(self, index, grouper=None, obj=None, name=None, level=None,
-                 sort=True, in_axis=False):
+                 sort=True, observed=None, in_axis=False):


Why not have it as False default? (if we want to deprecate in the future, we can then just use None ?)
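For context on the `None` default, a plain-Python sketch of the sentinel pattern described in the docstring under review: `None` means "not specified", which lets the code warn before falling back to `False`. `resolve_observed` is a hypothetical helper for illustration, not a function in pandas, and the message is shortened.

```python
import warnings


def resolve_observed(observed, has_categorical_key):
    """Return the effective ``observed`` value, warning if it was left unset.

    Hypothetical helper: None means the caller did not choose a behaviour.
    """
    if observed is None:
        if has_categorical_key:
            warnings.warn(
                "pass observed=True to return only the observed categories, "
                "or observed=False to include unobserved categories",
                FutureWarning,
                stacklevel=2,
            )
        # Fall back to the backwards-compatible behaviour.
        return False
    return bool(observed)


# An explicit choice never warns.
effective = resolve_observed(True, has_categorical_key=True)
```

With a plain `False` default there is no way to tell "the user asked for False" apart from "the user said nothing", which is why a sentinel is useful if a deprecation is planned.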

jorisvandenbossche · 2018-05-01T12:14:58Z

pandas/core/groupby/groupby.py

+
+        # TODO(jreback): remove completely
+        # when observed parameter is defaulted to True
+        # gh-20583


should this comment be removed for now?

jorisvandenbossche · 2018-05-01T12:28:19Z

My remaining comments are mainly doc related, so I am fine with merging this now for 0.23rc, if @jreback does a follow-up PR.

jreback · 2018-05-01T12:34:31Z

@jorisvandenbossche

I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?

I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.

and I did push new ones.

jorisvandenbossche · 2018-05-01T12:43:22Z

and I did push new ones.

Well, if I look at the diff for only the commits you added the last day (https://github.com/pandas-dev/pandas/pull/20583/files/19c9cf7871847de8f0a8504e9f121ad1460512d0..bdf7525812ca670f9406ab8df333030d36d30947), there is no change in the _codes_for_groupby function, while you did update it according to my comments.

TomAugspurger · 2018-05-01T14:17:51Z

@jreback opened #20902 for the followup.

Will merge in ~1 hour.

jreback · 2018-05-01T14:48:48Z

@TomAugspurger if you want to do the RC now, then i'll have to followup on comments later.

TomAugspurger · 2018-05-01T15:00:48Z

If we're doing this for 0.23 then it should go in the RC I think. I can wait a bit longer if you plan to push more changes.

jreback · 2018-05-01T15:06:48Z

changes will only be cosmetic so u can merge if u want now

TomAugspurger · 2018-05-01T15:09:02Z

That's what I thought, thanks.

dragoljub · 2018-05-02T00:40:53Z

Great work! Looking forward to trying this out. :+1:

closes pandas-dev#20902

ammar-nizami · 2019-01-30T17:41:58Z

I think there is a bug with observed=True. The below statement removed the unobserved categorical values, but the dataframe returned wrong counts.
The categorical values were not sorted and the counts were mismatched with the categories.

df.groupby('categorical_column', observed=True, as_index=False).count()

TomAugspurger · 2019-01-30T17:44:14Z

Try with pandas 0.24.0 if you're not already on it. There were a couple bug fixes in this area. If it still persists, search for open issues / open a new one with a minimal example.


jreback added Groupby API Design Categorical Categorical Data Type labels Apr 2, 2018

jreback added this to the 0.23.0 milestone Apr 2, 2018

jreback force-pushed the cats branch from 32ee855 to c819578 Compare April 2, 2018 22:35

jreback force-pushed the cats branch 2 times, most recently from d53e6f6 to 582da12 Compare April 9, 2018 15:05

jreback changed the title ~~WIP/API: categorical grouping will no longer return the cartesian product~~ API: categorical grouping will no longer return the cartesian product Apr 9, 2018

jreback force-pushed the cats branch from 582da12 to df3533a Compare April 20, 2018 00:21

jreback commented Apr 20, 2018

View reviewed changes

jreback force-pushed the cats branch 2 times, most recently from 422a6be to 7cd56cd Compare April 26, 2018 00:33

TomAugspurger reviewed Apr 26, 2018

View reviewed changes

jreback force-pushed the cats branch from 7cd56cd to 602cba4 Compare April 26, 2018 22:04

jreback added 2 commits May 1, 2018 06:07

make observed=False the default, remove deprecation warning

bdb7ad3

more tests & change observed=None

bdf7525

jreback force-pushed the cats branch from fc39e64 to bdf7525 Compare May 1, 2018 10:07

TomAugspurger reviewed May 1, 2018

View reviewed changes

TomAugspurger approved these changes May 1, 2018

View reviewed changes

toobaz mentioned this pull request May 1, 2018

API: Add Dictionary-encoded Extension Type #20899

Open

jorisvandenbossche reviewed May 1, 2018

View reviewed changes

TomAugspurger mentioned this pull request May 1, 2018

Followup to #20583 (observed keyword for Groupby) #20902

Closed

TomAugspurger merged commit b020891 into pandas-dev:master May 1, 2018

jreback added a commit to jreback/pandas that referenced this pull request May 3, 2018

DOC: followup to pandas-dev#20583, observed kwarg for .groupby

243d087

closes pandas-dev#20902

jreback added a commit to jreback/pandas that referenced this pull request May 4, 2018

DOC: followup to pandas-dev#20583, observed kwarg for .groupby

4d39dc7

closes pandas-dev#20902

jreback added a commit that referenced this pull request May 5, 2018

DOC: followup to #20583, observed kwarg for .groupby (#20941)

c94a68c

kokes mentioned this pull request Aug 28, 2018

crosstab in 0.23.4 does not respect categorical variables #22453

Closed

TheoSimier mentioned this pull request Oct 17, 2019

BUG: inconsistent behaviour of Groupby (probably a regression) #29051

Closed

This was referenced Jan 29, 2022

Zero counts in Series.value_counts for categoricals #8559

Closed

BUG: Inconsistent behaviour for DataFrame.value_counts and Series.value_counts on categoricals #44001

Open

This pull request was closed.


