
API: categorical grouping will no longer return the cartesian product #20583

Merged
merged 6 commits into pandas-dev:master on May 1, 2018

Conversation

@jreback
Contributor

jreback commented Apr 2, 2018

closes #14942
closes #15217
closes #17594
closes #8869

xref #8138

@jreback jreback added this to the 0.23.0 milestone Apr 2, 2018

@jreback


Contributor

jreback commented Apr 2, 2018

this makes categorical groupers work like other groupers and should make things more performant and intuitive. It is somewhat walking back a change from when we first introduced categorical groupers. But the information is preserved (meaning the categories are still there); this just removes the automatic re-indexing (which was causing memory to blow up).

still needs some increased test coverage. note the first commit actually has almost all of the changes; the rest are just cleaning up tests.
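To make the change concrete, here is a minimal sketch of the behavior being described, assuming the observed keyword this PR adds to groupby (the frame and category names are made up for illustration):

import pandas as pd

# Hypothetical frame: category 'z' is declared but never observed.
df = pd.DataFrame({
    'A': pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'z']),
    'values': [1, 2, 3, 4],
})

# observed=True drops the unobserved 'z' group from the result;
# observed=False keeps the old re-indexed output (a row for 'z' as well).
df.groupby('A', observed=True).sum()    # index: ['a', 'b']
df.groupby('A', observed=False).sum()   # index: ['a', 'b', 'z']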

@TomAugspurger


Contributor

TomAugspurger commented Apr 2, 2018

Just making sure, this affects grouping by a single categorical as well?

@jreback


Contributor

jreback commented Apr 2, 2018

yes this affects just a single categorical column also

@codecov


codecov bot commented Apr 2, 2018

Codecov Report

Merging #20583 into master will increase coverage by <.01%.
The diff coverage is 98.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20583      +/-   ##
==========================================
+ Coverage   91.78%   91.79%   +<.01%     
==========================================
  Files         153      153              
  Lines       49341    49371      +30     
==========================================
+ Hits        45287    45319      +32     
+ Misses       4054     4052       -2
Flag Coverage Δ
#multiple 90.18% <98.07%> (+0.01%) ⬆️
#single 41.92% <5.76%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.67% <100%> (+0.05%) ⬆️
pandas/core/groupby/groupby.py 92.62% <100%> (+0.07%) ⬆️
pandas/core/indexes/category.py 97.03% <100%> (ø) ⬆️
pandas/core/reshape/pivot.py 96.97% <87.5%> (ø) ⬆️
pandas/util/testing.py 84.59% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28edd06...bdf7525. Read the comment docs.

@jreback jreback changed the title from WIP/API: categorical grouping will no longer return the cartesian product to API: categorical grouping will no longer return the cartesian product Apr 9, 2018

@jreback


Contributor

jreback commented Apr 9, 2018

so this is all ready to go if there are any comments. @TomAugspurger @jorisvandenbossche

@jorisvandenbossche


Member

jorisvandenbossche commented Apr 9, 2018

yes this affects just a single categorical column also

That doesn't seem to be the case?

With this branch, and the example of the whatsnew docs:

In [14]: df.groupby(['A', 'B']).sum()
Out[14]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4

In [15]: df.groupby('A').sum()
Out[15]: 
   values
A        
a       3
b       7
z       0               <---------- now unobserved category is included

Further a general comment (will try to do a more detailed review later this week): I am not sure we can just change this. First, it is an API-breaking change in several places, e.g. also pivot_table*. And second, I think the current behaviour can actually be useful in certain cases and it would be nice to have a way to keep this behaviour.
I know this is very ugly, but would it be worth having a keyword for this in groupby?

* but I agree we should look into it and try to make this more consistent. As, for example, pivot does not seem to include unobserved categories (already currently on master), while pivot_table does include them but apparently only for the index and not columns? (try df.pivot_table('values', 'A', 'B') with the example from the whatsnew). On the other hand, value_counts does include them (and I think rightly so), but that also introduces an inconsistency with groupby.
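For reference, a rough reconstruction of the whatsnew example frame referred to above, so the consistency comparison is easy to reproduce. The values and category lists are inferred from the output shown earlier in this thread, so they may not match the docs exactly:

import pandas as pd

df = pd.DataFrame({
    'A': pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'z']),
    # 'y' is a placeholder for an unobserved column category
    'B': pd.Categorical(['c', 'd', 'c', 'd'], categories=['c', 'd', 'y']),
    'values': [1, 2, 3, 4],
})

# The calls whose handling of unobserved categories is being compared:
df.groupby(['A', 'B']).sum()
df.pivot_table('values', 'A', 'B')
df['A'].value_counts()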

@jreback


Contributor

jreback commented Apr 12, 2018

So I could also change this for a single grouper. That breaks a couple of tests. I am inclined to do this actually as then it makes the multi and single case consistent.

We cannot support a full cartesian product for multi-groupers as this will blow up memory and kill performance. So either we go for:

  • an option to turn this on/off.
  • differing behavior for single vs multi
  • change single grouping behavior as well
@TomAugspurger


Contributor

TomAugspurger commented Apr 12, 2018

@pytest.mark.xfail(reason="failing with observed")
def test_observed_failing(observed):


@jreback

jreback Apr 20, 2018

Contributor

if anyone wants to take a crack at this test. It's the only one that uses an IntervalIndex as its category, though this might be a red herring.

cc @WillAyd @toobaz @jschendel


@WillAyd

WillAyd Apr 24, 2018

Member

Haven't been able to follow all the way through yet but I think this is a regression with the below method:

def decons_group_index(comp_labels, shape):

Using the failing test example this returns [array([1, 1, 0, 0], ... but on master that same object would return [array([1, 1, 2, 2], .... As a result, I think the reconstructed group labels are getting swapped in _wrap_agged_blocks and causing the failure.

Will try to find time in the next day or so to walk through in more detail, but sharing in case it helps anyone else reviewing the issue

@jreback


Contributor

jreback commented Apr 26, 2018

this is ready for a look

@TomAugspurger @jorisvandenbossche

@TomAugspurger

Looks good. Main thing is the change to pivot_table.

It'd be good to document whether the unobserved categories are present in the resulting index's type.

i.e. is

In [7]: pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])).count().index.dtype
Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)

or

Out[7]: CategoricalDtype(categories=['a'], ordered=False)

This could go in the whatsnew and either groupby.rst or categorical.rst probably. I think either behavior is fine.

Still going through the test changes.

.. ipython:: python
.. code-block:: python


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Maybe have an example showing the future warning? So just this without observed=False, and an :okwarning: directive. Then you can say "use observed=False to retain the previous behavior and silence the warning".

df.groupby(['A', 'B', 'C'], observed=False).count()
New Behavior (show only observed values):


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

New -> Future?

msg = ("pass observed=True to ensure that a "
"categorical grouper only returns the "
"observed groupers, or\n"
"observed=False to return NA for non-observed"


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Some things like 'count' return 0 for unobserved. You could rephrase as "observed=False to include unobserved categories."
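A small illustration of that point (a sketch, not test code from this PR): with observed=False, an aggregation like count fills the unobserved group with 0, whereas something like mean yields NaN, so "return NA for non-observed" is not quite accurate.

import pandas as pd

s = pd.Series([1, 1, 1])
key = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])

s.groupby(key, observed=False).count()   # a -> 3, b -> 0
s.groupby(key, observed=False).mean()    # a -> 1.0, b -> NaN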

@@ -79,7 +79,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
pass
values = list(values)
grouped = data.groupby(keys)
grouped = data.groupby(keys, observed=dropna)


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Need to think about this a bit. Are we all OK with "overloading" dropna to serve two purposes? I think it's ok...


@jreback

jreback Apr 26, 2018

Contributor

yes, but that's the meaning of the dropna now here anyhow

@@ -241,10 +241,13 @@ def _all_key(key):
return (key, margins_name) + ('',) * (len(cols) - 1)
if len(rows) > 0:
margin = data[rows + values].groupby(rows).agg(aggfunc)
margin = data[rows + values].groupby(


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Are these changes (passing observed=True) all backwards compatible?

@@ -488,12 +488,12 @@ def test_agg_structs_series(structure, expected):
@pytest.mark.xfail(reason="GH-18869: agg func not called on empty groups.")
def test_agg_category_nansum():
def test_agg_category_nansum(observed):


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Does this need to be xfailed for observed=True? I think it may be the right answer (not sure though).


@jreback

jreback Apr 26, 2018

Contributor

fixed so that observed=True is XPASS (not sure how to xfail on a particular fixture value)
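For what it's worth, one way to xfail only a particular fixture value (a sketch, not what this PR does): either parametrize the test directly and mark a single parameter, or xfail imperatively inside the test body.

import pytest

# Option 1: parametrize the test itself and mark only one value as xfail.
@pytest.mark.parametrize("observed", [
    True,
    pytest.param(False, marks=pytest.mark.xfail(
        reason="GH-18869: agg func not called on empty groups.")),
])
def test_agg_category_nansum(observed):
    ...  # test body

# Option 2: keep the shared fixture and xfail imperatively
# (this stops the test immediately, so there is no XPASS reporting).
def test_agg_category_nansum_alt(observed):
    if not observed:
        pytest.xfail("GH-18869: agg func not called on empty groups.")
    ...  # test body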

@@ -4,6 +4,11 @@
from pandas.util import testing as tm
@pytest.fixture(params=[True, False])
def observed(request):


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Docstring would be nice.
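For example, something along these lines (a sketch of a possible docstring, based on the keyword description in this PR, not the final wording):

import pytest

@pytest.fixture(params=[True, False])
def observed(request):
    """Pass the `observed` keyword to groupby in tests.

    True: only return observed values for categorical groupers.
    False: return all values (the re-indexed, backwards-compatible behavior).
    """
    return request.param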

names=['A', 'B'])
expected = DataFrame(
{'values': [1, 2, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan]},


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Ah, so those changes in pivot are API breaking? I'd prefer that this goes through the same deprecation cycle. On the other hand, this would mean an additional keyword, so that we can do the deprecation cycle properly...

Do we think pivot(..., dropna=True, observed=False) (i.e. the current default) is a useful combination? I could see it being desired.


@jreback

jreback Apr 26, 2018

Contributor

observed works like dropna, so I am not sure we need this. I can make the default work like the existing behavior, I think (which is effectively dropna=False)

def test_groupby_sort_categorical():
def test_sort():
# http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby


@TomAugspurger

TomAugspurger Apr 26, 2018

Contributor

Is this line too long? https://stackoverflow.com/q/23814368/1889400 is a short link to the same q.

@TomAugspurger


Contributor

TomAugspurger commented Apr 26, 2018

http://pandas-docs.github.io/pandas-docs-travis/categorical.html#operations would be a good place to show observed=False.

About pivot_table, I always forget what dropna does... Does it control dropping columns / rows that are all NA before or after aggregating?

@jorisvandenbossche


Member

jorisvandenbossche commented Apr 26, 2018

I will try to give this a more detailed look tomorrow.

But general comment: now that we have the keyword to specify the behaviour (for now with observed=False for back compat), I am not really sure that we should change the default in the future.

It's difficult to judge, as I personally don't run into such situations much, but my gut feeling says that the original issue that motivated the change (the combinatorial explosion with multiple categorical groupers) is not the majority usage pattern of categoricals. And if you don't want to include unobserved categories in your analyses in general (groupby, pivot, value_counts, plotting, ..), you always have the easy functionality of remove_unused_categories.
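A minimal sketch of that remove_unused_categories workflow (illustrative only; the data is made up):

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'z']))

s.value_counts()                                  # includes unobserved 'z' with count 0
s.cat.remove_unused_categories().value_counts()   # drops 'z' before the analysis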

cc @jankatins

@jreback


Contributor

jreback commented Apr 26, 2018

@TomAugspurger the pivot was a red herring, it was not passing things through. And dropna (in pivot) is exactly equivalent to the observed kwarg (to groupby). So maybe we should just rename observed -> dropna, which would be consistent across other functions (value_counts) as well.

@jorisvandenbossche I disagree. You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default (it's also easy to create if you need it). Note this has nothing to do with the actual categories that are returned; they are in BOTH cases indicated on the dtype of the level of that index (observed or not). It's the groupers that are the issue.

@toobaz


Member

toobaz commented Apr 26, 2018

@jorisvandenbossche I also think the default should change. Categoricals are mostly an implementation detail, and should behave as similarly as possible to ordinary Series.

@jorisvandenbossche


Member

jorisvandenbossche commented Apr 26, 2018

You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default.

But in most cases you have no cartesian product, you only have a single categorical key

It can easily blow you up and shouldn't be the default. (its also easy to create if you need).

It's also easy to get the version with only the observed ones (certainly now there is the keyword), so that is not really an argument IMO

Note this has nothing to do with the actual categories that are returned, they are in BOTH cases indicated on the dtype of the level of that index

Yes, and that is certainly a good thing. But for me it is still the question what we want the visible output to be.
For example a pivot table is often used as a kind of a summary table. In such a case, I often do want to know that a certain category I care about (otherwise it would not be in the categories) has no values.
And it is also about consistency: value_counts does include the unobserved one (and rightly so, IMO).

@jankatins


Contributor

jankatins commented Apr 27, 2018

My original motivation to work on categoricals was stuff like Likert scales ("completely agree ... completely disagree", 5 to 7 values). For that, seeing unobserved categories in groupbys is a good thing.

Since then a lot of changes have been made to make categoricals useful in other situations (like as a memory-efficient string replacement or to analyse genes?).

This change feels like it is a change to make working with the latter easier but will make the former harder.

@toobaz


Member

toobaz commented Apr 27, 2018

And it is also about consistency: value_counts does include the unobserved one (and rightly so, IMO).

I took it for granted that if we change groupby we will also change value_counts - otherwise I agree it doesn't make any sense.

This change feels like it is a change to make working with the latter easier but will make the former harder.

True... I just think your use case is rarer (but I have no data to support this statement, other than personal experience). Anyway, with the new argument both things will be pretty easy; it is mainly a matter of which use case should require awareness from the user.

@jorisvandenbossche


Member

jorisvandenbossche commented Apr 27, 2018

You can indeed use Categorical as memory-efficient string storage if you have some repetition, but I don't think we should take that use case into account for designing the default API (note I say default; we should certainly make that use case possible, which it is with the keyword).

Of course, you can still have "real" categorical data with many categories, from which you typically only observe a subset. And then you might want to have the dropping behaviour as default.
My feeling says that the use case with fewer categories of which you typically observe the majority is more common, but I also have absolutely no objective idea about this :-)

@jreback


Contributor

jreback commented Apr 27, 2018

agree, for consistency this should change:

In [1]:  pd.Series(list('aabc')).astype(pd.api.types.CategoricalDtype(list('abcd'))).value_counts(dropna=True)
Out[1]: 
a    2
c    1
b    1
d    0
dtype: int64

In [2]:  pd.Series(list('aabc')).astype(pd.api.types.CategoricalDtype(list('abcd'))).value_counts(dropna=False)
Out[2]: 
a    2
c    1
b    1
d    0
dtype: int64

so that True would remove the unobserved.

@TomAugspurger


Contributor

TomAugspurger commented Apr 27, 2018

I think in the long run I'd prefer unobserved categories to be excluded from groupby, primarily because the cartesian product issue seems like such a difficult gotcha. Though if / when we have an "interned string" array type (categorical, but with different semantics), would we want to go back on that change?


value_counts highlights my discomfort with overloading dropna. It will require an observed keyword to handle cases like

In [21]: c1 = pd.Categorical(['a', 'a', None], categories=['a'])

In [22]: c2 = pd.Categorical(['a', 'a', None], categories=['a', 'b'])

I don't think c1.value_counts(dropna=True) should be the same as c2.value_counts(dropna=True). c2 has the additional category. So either recommend c2.remove_unused_categories().value_counts(dropna=True) or c2.value_counts(dropna=True, observed=True).

@toobaz


Member

toobaz commented Apr 28, 2018

Might be a crazy idea... but just in case it's not, I'd rather propose it now than later: what if showing empty categories (or not) was a property of the dtype, preserved across transformations, just like the (possible) ordering?

That way, @jankatins would need to specify only once, when creating the categorical for the first time, that his categories (the Likert scale values) are "all important". Otherwise, the categories would be considered as merely "functional".

We could even think that a categorical created by passing the categories has "important" categories by default, while a categorical which creates its own categories automatically has "functional" categories by default (clearly with a parameter allowing to change this). (But this might be too much magic.)

@jorisvandenbossche


Member

jorisvandenbossche commented Apr 28, 2018

I would rather make a "dict-encoded / interned string" type than overload Categorical with this. Because in such a case, you would typically also not care about introducing new categories (e.g. when concatting etc.), which now raises errors for Categorical. We have had discussion about exactly that before: #8640 (but you will see the discussion is also a bit dated :-))

@TomAugspurger


Contributor

TomAugspurger commented Apr 28, 2018

In a pandas with an interned string array, would we want the observed keyword? I think so, but the default would stay False, as it is now.

@jorisvandenbossche

General question (in case we keep the deprecation warning): do we want to show it always? Or only in the case where you actually have unobserved values?

Didn't look at the tests yet

*before* applying the aggregation function.
.. _groupby.observed:
observed hanlding


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

Typos: Observed handling


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

Or maybe a bit longer "Handling of (un)observed Categorical values"

Categorical Groupers will now require passing the observed keyword
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

Can you not use the "cartesian product" to introduce this? The actual change is about whether to include unobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).

@@ -659,6 +661,22 @@ def _codes_for_groupby(self, sort):
categories in the original order.


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

Can you update this docstring?

take_codes = np.sort(take_codes)
# we recode according to the uniques
cat._categories = self.categories.take(take_codes)


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

What is this _categories attribute? I don't think that actually exists, so can you pass the new categories directly to the function below?

I am also a bit confused by what is going on in the code here, so some additional comments might be useful.


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

To explain my confusion: what this returns is "strange", as it returns a modified categorical that does not match the data any longer. E.g. for

cat = pd.Categorical(['a', 'c', 'a'], categories=['a', 'b', 'c'])

this returns a categorical with values ['a', 'b', 'a'], because you recode the codes (based on observed categories) without changing the categories.

This means that Grouping._group_index is "incorrect", but apparently this is fine for the actual groupby implementation because the groupby results are correct (though it makes the above function somewhat confusing). But while playing around with it, I found a place where it is actually propagated:

In [52]: cat = pd.Categorical(['a', 'c', 'a'], categories=['a', 'b', 'c'])

In [53]: df = pd.DataFrame({'cat': cat, 'vals': [1, 2, 3]})

In [54]: g = df.groupby('cat', observed=True)

In [55]: g.sum()   <--- correct
Out[55]: 
     vals
cat      
a       4
c       2

In [56]: g.groups    <--- incorrect
Out[56]: 
{'a': Int64Index([0, 2], dtype='int64'),
 'b': Int64Index([1], dtype='int64'),
 'c': Int64Index([], dtype='int64')}

(same for g.indices)

@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default None
if True: only show observed values for categorical groupers


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

Can you format this a bit? (either with capitals and punctuation so it reads correctly in continuous text, or with bullet points)

if observed is None:
msg = ("pass observed=True to ensure that a "
"categorical grouper only returns the "
"observed groupers, or\n"


@jorisvandenbossche

jorisvandenbossche Apr 28, 2018

Member

"groupers" -> "categories" or "values"

@toobaz


Member

toobaz commented May 1, 2018

Grouper/valuecount/etc show only observed (string) or also not observed categoricals (Lickert like Categorical) -> taken care of here if observed is a property of the dtype -> you would add a StringDtype which always observed == True
Sorting is done by defined categorical order (cat) or value (string)

I see these two points as orthogonal. I'm sure I will mostly use observed=True, and this doesn't mean I attribute any meaning to the "natural" (alphabetic) order of categories (which, by the way, won't necessarily be strings).

I am currently working on administrative data in which a variable can take ~30 values (types), but most of them are extremely rare, or anyway uninteresting for the analysis. I mostly work on the 5-10 most frequent types, and in those cases I will always use observed=False (I currently often drop zeros manually). The types have their own ordering, while the alphabetic one is meaningless.

You can insert new values / merge with other values / compare with unknown values (string) or not (cat)

I know I'm repeating myself, but I was never able to consider this limitation as a feature, regardless of order, dtype, and observed=. If my data provider introduces a new category, I don't see why it should be good that my code breaks when appending the new data to the old data. I totally accept it as a limitation (with annoying consequences e.g. for roundtrips, and compatibility with other types), just not as a feature.

@jreback


Contributor

jreback commented May 1, 2018

I changed the default back to observed=None to preserve options. I think we can discuss whether / if to change this in the future. But for now this allows folks to deal with grouping in a better way.

@jankatins


Contributor

jankatins commented May 1, 2018

I don't see why it should be good that my code breaks when appending the new data to the old data.

If you have a defined order, where would you put the new category into the order? E.g. good - middle - bad, where would you put 'extreme'? -> if you care about order, then you also care about not adding new stuff

I mostly work on the 5-10 most frequent types, and in those cases I will always use observed=False

This use case might also be satisfied by a collapse or lump method (e.g. see here: http://r4ds.had.co.nz/factors.html#modifying-factor-levels)
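There is no built-in lump in pandas, but a hand-rolled stand-in for forcats::fct_lump is short; a rough sketch (the lump helper name and defaults are made up for illustration):

import pandas as pd

def lump(s, n=5, other='Other'):
    """Keep the n most frequent values, collapse everything else into `other`.

    Hand-rolled stand-in for R's forcats::fct_lump; not a pandas API.
    """
    keep = s.value_counts().index[:n]
    lumped = s.astype(object).where(s.isin(keep), other)
    return lumped.astype('category')

# usage sketch: focus the analysis on the dominant types
# df['type_lumped'] = lump(df['type'], n=10)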

@TomAugspurger


Contributor

TomAugspurger commented May 1, 2018

FYI, discussion for a new dict-encoded / interned values type can go at #20899 so we don't lose it.

Reviewing this PR one more time, but I think people are all on board with having the default be observed=False (back compat) for now?

@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our
.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/
.. _whatsnew_0230.enhancements.categorical_grouping:
Categorical Groupers has gained an observed keyword


@TomAugspurger

TomAugspurger May 1, 2018

Contributor

has -> have? Because "categorical Groupers" is plural right?

@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
categories in the original order.
"""
# we only care about observed values
if observed:
unique_codes = unique1d(self.codes)


@TomAugspurger

TomAugspurger May 1, 2018

Contributor

Haven't thought this through, but can this if block be replaced with self.remove_unused_categories()._codes_for_groupby(sort=sort, observed=False)?


@jreback

jreback May 3, 2018

Contributor

no, you actually need the uniques

@jorisvandenbossche

@jreback I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?

I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.

Categorical Groupers has gained an observed keyword
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include unobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).


@jreback

jreback May 1, 2018

Contributor

I didn't change this on purpose, this is more correct.


@TomAugspurger

TomAugspurger May 1, 2018

Contributor

"Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as

"Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

Yep, the explanation of Tom is exactly what I meant.

@jreback I have no problem at all with you not agreeing with a comment (it would be strange otherwise :-)) and thus not updating for it, but can you then reply to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

we don't use "grouper" as terminology in our documentation (except for the pd.Grouper object), so I would write "groupby key" or "to group by"

also "multipler" -> "multiple"

When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword
controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those
that are observed groupers (``observed=True``).


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

"or only those that are observed groupers" -> "or only the observed categories"

pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
The returned dtype of the grouped will *always* include *all* of the catergories that were grouped.


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

catergories -> categories

.. ipython:: python
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

I would maybe just create s and cat to avoid repeating this a few times
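I.e. something like this in the docs (a sketch of the suggested refactor, not the final wording):

import pandas as pd

# define the inputs once ...
s = pd.Series([1, 1, 1])
cat = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])

# ... and reuse them for both examples
s.groupby(cat, observed=True).count()
s.groupby(cat, observed=False).count()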

@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default None
if True: only show observed values for categorical groupers.


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

capital If (below as well)


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

Also, can you start this explanation by noting that this keyword only applies when grouping by categorical values?

if True: only show observed values for categorical groupers.
if False: show all values for categorical groupers.
if None: if any categorical groupers, show a FutureWarning,
default to False.


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

no indentation for rst formatting

@@ -2898,14 +2907,16 @@ class Grouping(object):
"""
def __init__(self, index, grouper=None, obj=None, name=None, level=None,
sort=True, in_axis=False):
sort=True, observed=None, in_axis=False):


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

Why not have False as the default? (if we want to deprecate in the future, we can then just use None?)

# TODO(jreback): remove completely
# when observed parameter is defaulted to True
# gh-20583


@jorisvandenbossche

jorisvandenbossche May 1, 2018

Member

should this comment be removed for now?

@jorisvandenbossche


Member

jorisvandenbossche commented May 1, 2018

My remaining comments are mainly doc related, so I am fine with merging this now for 0.23rc, if @jreback does a follow-up PR.

@jreback


Contributor

jreback commented May 1, 2018

@jorisvandenbossche

I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?

I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.

and I did push new ones.

@jorisvandenbossche


Member

jorisvandenbossche commented May 1, 2018

and I did push new ones.

Well, if I look at the diff for only the commits you added the last day (https://github.com/pandas-dev/pandas/pull/20583/files/19c9cf7871847de8f0a8504e9f121ad1460512d0..bdf7525812ca670f9406ab8df333030d36d30947), there is no change in the _codes_for_groupby function, while you did update it according to my comments.

@TomAugspurger


Contributor

TomAugspurger commented May 1, 2018

@jreback opened #20902 for the followup.

Will merge in ~1 hour.

@jreback


Contributor

jreback commented May 1, 2018

@TomAugspurger if you want to do the RC now, then I'll have to follow up on comments later.

@TomAugspurger


Contributor

TomAugspurger commented May 1, 2018

If we're doing this for 0.23 then it should go in the RC I think. I can wait a bit longer if you plan to push more changes.

@jreback


Contributor

jreback commented May 1, 2018

changes will only be cosmetic so you can merge now if you want

@TomAugspurger


Contributor

TomAugspurger commented May 1, 2018

That's what I thought, thanks.

@TomAugspurger TomAugspurger merged commit b020891 into pandas-dev:master May 1, 2018

3 checks passed:

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@dragoljub


dragoljub commented May 2, 2018

Great work! Looking forward to trying this out. :+1:

jreback added a commit to jreback/pandas that referenced this pull request May 3, 2018

jreback added a commit to jreback/pandas that referenced this pull request May 4, 2018

jreback added a commit that referenced this pull request May 5, 2018
