BUG: Wrong grouping of categoricals when observed=True #21151

topper-123 · 2018-05-21T15:46:14Z

Setup:

>>> s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'], categories=['a', 'b'])
>>> s2 = pd.Series([1, 2, 3, 4])
>>> df = pd.DataFrame({'s1': s1, 's2': s2})

Comparing results with observed=False and observed=True:

>>> df.groupby('s1').sum()  # ok
    s2
s1
a    6
b    0
>>> df.groupby('s1', observed=True).sum()
     s2
s1
NaN   6  # should not be shown
a     0  # should be 6

Notice that the values are assigned to the wrong labels.

Also, notice that NaN is now a possible label.

If the first value is not NaN, the assignment works fine (but NaN is still a possible label):

>>> df[1:].groupby('s1', observed=True).sum()
     s2
s1
a     6
NaN   0

Problem description

The problem occurs when there are unobserved labels and the first value is NaN. If there are no unobserved labels, everything seems alright from my checks.

NaN should probably not be a possible label.
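As background for why NaN is not expected to be a label, here is a minimal sketch of how the categorical from the setup stores missing values. It only inspects the public `codes` and `categories` attributes; it is not a diagnosis of where the grouping goes wrong.

```python
import numpy as np
import pandas as pd

# Same categorical as in the setup above.
s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'], categories=['a', 'b'])

# Missing values are not a category; they are stored as the sentinel code -1.
codes = s1.codes.tolist()
categories = list(s1.categories)

print(codes)       # [-1, 0, -1, 0]
print(categories)  # ['a', 'b']
```

Since NaN never appears in `categories`, a group labelled NaN in the result suggests the -1 sentinel is being treated as a regular position somewhere, though that is an assumption on my part rather than something confirmed in this thread.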

Expected Output

I would assume the same output as when observed=False, but without unobserved labels:

>>> df.groupby('s1', observed=True).sum()
    s2
s1
a    6
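The expectation above can be written as a small check. This is a sketch, assuming a pandas release that includes the fix for this issue; on 0.23.0 the comparison is expected to fail. The names `full`, `seen` and `expected` are only illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'], categories=['a', 'b'])
s2 = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({'s1': s1, 's2': s2})

# Pass observed explicitly so the result does not depend on the default.
full = df.groupby('s1', observed=False).sum()
seen = df.groupby('s1', observed=True).sum()

# Expected: observed=True equals observed=False restricted to the
# categories that actually occur in the data (NaN excluded).
observed_labels = list(df['s1'].dropna().unique())
expected = full.loc[observed_labels]

labels_match = list(seen.index) == list(expected.index)
values_match = seen['s2'].tolist() == expected['s2'].tolist()
print(labels_match, values_match)
```

Both flags being True means the only group is `a`, with the same sum as in the observed=False result.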

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.4.0
pip: 10.0.1
setuptools: 38.4.1
Cython: 0.26.1
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.7.4
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


diegogarcilazo · 2018-05-23T16:55:00Z

it seems to be related to issue #21133

cwkwong · 2018-05-24T13:32:50Z

I've also tested the above groupby() followed by first() and it seems to produce some unexpected results.

Setup:

>>> s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'], categories=['a', 'b'])
>>> s2 = pd.Series([1,2,3,4])
>>> 
>>> df = pd.DataFrame({'s1':s1, 's2':s2})
>>> df
    s1  s2
0  NaN   1
1    a   2
2  NaN   3
3    a   4

Comparing results with observed=False and observed=True:

>>> df.groupby('s1').first()
     s2
s1     
a   2.0
b   NaN

>>> df.groupby('s1', observed=True).first()
      s2
s1      
NaN  2.0
a    NaN

When observed=True, the result is somewhat strange:

Incorrect s2=NaN for s1=a. I think it should be s2=2.0?
NaN appears under s1 in the groupby(...).first() result. I would expect no NaN under s1.
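The two points above can be turned into a quick check. A minimal sketch, again assuming a pandas release that includes the fix; `has_nan_label` and `value_for_a` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'], categories=['a', 'b'])
s2 = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({'s1': s1, 's2': s2})

result = df.groupby('s1', observed=True).first()

# Point 2: no NaN should appear among the group labels.
has_nan_label = bool(result.index.isna().any())

# Point 1: the first s2 value in group 'a' is the one from row 1.
value_for_a = result.loc['a', 's2']

print(has_nan_label, value_for_a)
```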

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.3.1
pip: 9.0.1
setuptools: 18.2
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


cgangwar11 · 2019-01-21T17:46:38Z

In [9]: import numpy as np
   ...: import pandas as pd
   ...: s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'], categories=['a', 'b', 'c'])
   ...: s2 = pd.Series([1, 2, 3, 4])
   ...: df = pd.DataFrame({'s1': s1, 's2': s2})

In [10]: df                                                                                                      
Out[10]: 
    s1  s2
0  NaN   1
1    a   2
2  NaN   3
3    a   4

In [11]: df.groupby('s1').first()                                                                                
Out[11]: 
     s2
s1     
a   2.0
b   NaN
c   NaN

In [12]: df.groupby('s1',observed=True).first()                                                                  
Out[12]: 
    s2
s1    
a    2


jschendel added Bug Groupby Categorical Categorical Data Type labels May 21, 2018

jorisvandenbossche added this to the 0.23.1 milestone May 24, 2018

topper-123 mentioned this issue May 30, 2018

BUG: dropna incorrect with categoricals in pivot_table #21252

Merged

jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018

jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018

jreback added a commit to jreback/pandas that referenced this issue Jul 5, 2018

CLN: clean up groupby / categorical

f8b02d6

xref pandas-dev#21151

jreback mentioned this issue Jul 5, 2018

CLN: clean up groupby / categorical #21753

Merged

jreback added a commit to jreback/pandas that referenced this issue Jul 6, 2018

CLN: clean up groupby / categorical

da877df

xref pandas-dev#21151

jreback added a commit that referenced this issue Jul 6, 2018

CLN: clean up groupby / categorical (#21753)

e2f2e21

xref #21151

jreback modified the milestones: 0.23.4, 0.23.5 Aug 2, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

CLN: clean up groupby / categorical (pandas-dev#21753)

10abc40

xref pandas-dev#21151

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

topper-123 mentioned this issue Dec 28, 2018

BUG: Fix groupby observed=True when aggregating a column #24412

Merged


jorisvandenbossche mentioned this issue Jan 12, 2019

Exception when grouping on a modified category #24740

Closed

cgangwar11 mentioned this issue Jan 21, 2019

BUG : ValueError in case on NaN value in groupby columns #24850

Merged


jreback modified the milestones: Contributions Welcome, 0.24.0 Jan 22, 2019

jreback closed this as completed in #24850 Jan 22, 2019
