# GroupBy Not Throwing KeyError When Names Exist in MultiIndex #25704

vss888 opened this issue Mar 13, 2019 · 6 comments

### vss888 commented Mar 13, 2019

Here is a link to the discussion on Stack Overflow: "pandas Series groupby with one group"

#### Code Sample, a copy-pastable example if possible

```python
# from the stackoverflow link above
import pandas as pd

data = pd.DataFrame(data={
    'date': [pd.Timestamp('2016-02-15')]*3,
    'time': [pd.Timedelta(x) for x in ('07:30:00','10:10:00','11:10:00')],
    'name': ['A']*3,
    'N': [1,2,3],
}).set_index(['date','time','name']).sort_index()
# keep only the rows at or after 09:30:00, which leaves two rows
data = data[ data.index.get_level_values('time')>=pd.to_timedelta('09:30:00') ]
dataGB = data['N'].groupby(['date','name'])
print(data)
print('Number of groups:',len(dataGB))
print(dataGB.sum())
print(pd.__version__)
```

#### Problem description

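1. The code produces 2 groups, although all rows share the same `date` and `name`, so there should be only one.
2. The result of `dataGB.sum()` is incorrect: the key names `date` and `name` appear as group labels, with the values 2 and 3 left unsummed.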

#### Real Output

```
>>> print(data)
                          N
date       time     name
2016-02-15 10:10:00 A     2
           11:10:00 A     3
>>> print('Number of groups:',len(dataGB))
Number of groups: 2
>>> print(dataGB.sum())
date    2
name    3
Name: N, dtype: int64
>>> print(pd.__version__)
0.24.1
```

#### Expected Output

```
>>> print(data)
                          N
date       time     name
2016-02-15 10:10:00 A     2
           11:10:00 A     3
>>> print('Number of groups:',len(dataGB))
Number of groups: 1
>>> dataGB.sum()
date        name
2016-02-15  A       5
Name: N, dtype: int64
>>> print(pd.__version__)
0.24.1
```

#### Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.11.6.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf-8
LANG: en_US.utf-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 3.3.2
pip: 19.0.3
setuptools: 39.0.1
Cython: 0.27.3
numpy: 1.16.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
gcsfs: None

### WillAyd commented Mar 13, 2019 • edited

You should get the desired behavior by selecting levels:

```python
In [11]: data['N'].groupby(level=['date', 'name']).sum()
Out[11]:
date        name
2016-02-15  A       5
Name: N, dtype: int64
```

This should be raising a `KeyError`, as `date` and `name` aren't actually column labels. PRs to make that happen would certainly be welcome.
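
The workaround above can be tried without the rest of the report. The following is a minimal sketch that rebuilds the two-row `Series` directly (the variable names `index`, `series`, `grouped` and `totals` are illustrative, not from the issue) and groups it with `level=`, which states explicitly that the keys are index level names.

```python
import pandas as pd

# Rebuild the two remaining rows from the report as a Series with a
# three-level MultiIndex (values copied from the issue).
index = pd.MultiIndex.from_arrays(
    [
        [pd.Timestamp("2016-02-15")] * 2,
        pd.to_timedelta(["10:10:00", "11:10:00"]),
        ["A", "A"],
    ],
    names=["date", "time", "name"],
)
series = pd.Series([2, 3], index=index, name="N")

# level= tells groupby that 'date' and 'name' are index level names.
grouped = series.groupby(level=["date", "name"])
totals = grouped.sum()

print("Number of groups:", len(grouped))
print(totals)
```

Both rows share the same `date` and `name`, so this path yields a single group whose sum is 5, matching the expected output in the report.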

### vss888 commented Mar 13, 2019 • edited

@WillAyd According to the "Grouping DataFrame with Index Levels and Columns" section of the documentation: "Index level names may be specified as keys directly to groupby" (starting with version 0.20, see In/Out[51] on the page). So what I did should be correct. Am I misunderstanding anything?

### vss888 commented Mar 13, 2019 • edited

I think it has something to do with the following selection line in my example: `data = data[ data.index.get_level_values('time')>=pd.to_timedelta('09:30:00') ]`. Without it, the code works correctly.

### vss888 commented Mar 13, 2019

Or it might have something to do with the `Series` input to `groupby` having only two rows, since the following example (without any selection) also produces an incorrect result:

```python
import pandas as pd

data = pd.DataFrame(data={
    'date': [pd.Timestamp('2016-02-15')]*2,
    'time': [pd.Timedelta(x) for x in ('10:10:00','11:10:00')],
    'name': ['A']*2,
    'N': [2,3],
}).set_index(['date','time','name']).sort_index()
dataGB = data['N'].groupby(['date','name'])
print(data)
print('Number of groups:',len(dataGB))
print(dataGB.sum())
print(pd.__version__)
```

### WillAyd commented Mar 13, 2019

Hmm, OK, thanks for sharing that. I see this was actually implemented in #14432. Looking at the test coverage there, I don't see anything that has multiple index levels without a column selection, so that may be the culprit here. Investigation and PRs would be welcome.
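
For contrast, here is a minimal sketch of the documented `DataFrame` case, where index level names are passed as keys and a column is selected afterwards. The names `frame` and `frame_totals` are illustrative. The frame has two rows and two keys, the same shape that goes wrong for the `Series` in the report.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "date": [pd.Timestamp("2016-02-15")] * 2,
        "time": pd.to_timedelta(["10:10:00", "11:10:00"]),
        "name": ["A", "A"],
        "N": [2, 3],
    }
).set_index(["date", "time", "name"])

# On a DataFrame, 'date' and 'name' are resolved as index level names.
frame_totals = frame.groupby(["date", "name"])["N"].sum()
print(frame_totals)
```

Here the two rows collapse into one group with sum 5, which is the result the report expects from the `Series` call as well.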

### ArtificialQualia commented Mar 15, 2019

I can take this one, I see where the problem is. It appears to be in `core/groupby/grouper.py:_get_grouper`, where `all_in_columns_index` isn't properly checking for `Series` like it does for `DataFrame`. And since `len(keys) == len(group_axis)` in this specific case, it isn't grouping properly.
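
The effect of `len(keys) == len(group_axis)` can be seen in isolation. This is a sketch of the general mechanism, not the code in `grouper.py`: when a list of keys has exactly one entry per row and the entries are not recognised as names, pandas treats the list as the group labels themselves. The labels `'x'` and `'y'` and the variable names are hypothetical, chosen so that they cannot match any index level name.

```python
import pandas as pd

values = pd.Series([2, 3], name="N")

# 'x' and 'y' are hypothetical labels, not index level names. The list has
# one entry per row, so pandas uses it as an array of group labels:
# row 0 goes to group 'x', row 1 goes to group 'y'.
by_labels = values.groupby(["x", "y"])

print("Number of groups:", len(by_labels))
print(by_labels.sum())
```

This gives two groups holding 2 and 3, the same shape as the "Real Output" in the report, where `date` and `name` ended up as group labels instead of being looked up as index levels.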

