Groupby on categoricals includes unused categories

#### In a nutshell:
* `groupby` on a single categorical column with prescribed categories incorrectly returns results for **all** categories, even those that are not actually present in the DataFrame.
* In addition, when aggregating after grouping on categoricals (`groupby.sum` and the likes), with both prescribed and non-prescribed categories, we get values for **all possible combinations** of categories, including those not present in the DataFrame.

#### Problem description and code samples

##### Case 1: group by a single column
Consider the following code where we define a DataFrame with a categorical column with prescribed categories.
```python
import pandas as pd

labels = pd.Series(list('abcbabcab'))
df = pd.DataFrame({'label1': labels,
                   'x': [0, 1, 1, 1, 1, 0, 1, 1, 1]})
df.label1 = labels.astype('category', categories=list('abcdef'))
```
When we group by the categorical `label1` column and aggregate, we incorrectly get results for **all** prescribed categories, including those that are not present in the DataFrame (see rows below where the value of x is `NaN`):
```python
In[6]: df.groupby('label1').sum()
Out[6]:
          x
label1     
a       2.0
b       3.0
c       2.0
d       NaN
e       NaN
f       NaN
```
The elements in excess are already present in the `groupby` object:
```python
In[7]: df.groupby('label1').groups
Out[7]:
{'a': Int64Index([0, 4, 7], dtype='int64'),
 'b': Int64Index([1, 3, 5, 8], dtype='int64'),
 'c': Int64Index([2, 6], dtype='int64'),
 'd': Int64Index([], dtype='int64'),
 'e': Int64Index([], dtype='int64'),
 'f': Int64Index([], dtype='int64')}
```
The above doesn't happen if the categories for the `label1` column are not prescribed or if the column is not converted to a categorical at all.

##### Case 2: group by multiple columns
Consider now the case where we have two categorical columns:
```python
df = pd.DataFrame({'label1': labels,
                   'label2': labels,
                   'x': [0, 1, 1, 1, 1, 0, 1, 1, 1]})
df.label1 = labels.astype('category', categories=list('abcdef'))
df.label2 = labels.astype('category', categories=list('abcdef'))
```
Contrary to the single column case above, we do get the correct group labels when we group by both categorical columns `label1` and `label2`:
```python
In[11]: df.groupby(['label1', 'label2']).groups
Out[11]:
{('a', 'a'): Int64Index([0, 4, 7], dtype='int64'),
 ('b', 'b'): Int64Index([1, 3, 5, 8], dtype='int64'),
 ('c', 'c'): Int64Index([2, 6], dtype='int64')}
```

But we still get incorrect results if we aggregate:
```python
In[12]: df.groupby(['label1', 'label2']).sum()
Out[12]:
                 x
label1 label2     
a      a       2.0
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
b      a       NaN
       b       3.0
       c       NaN
       d       NaN
       e       NaN
       f       NaN
c      a       NaN
       b       NaN
       c       2.0
       d       NaN
       e       NaN
       f       NaN
d      a       NaN
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
e      a       NaN
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
f      a       NaN
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
```

**Note**: The aggregation shows the same inccorect behaviour also when we *don't* prescribe the categories:
```python
In[13]: df.label1 = labels.astype('category')
   ...: df.label2 = labels.astype('category')
In[14]: df.groupby(['label1', 'label2']).sum()
Out[14]:
                 x
label1 label2     
a      a       2.0
       b       NaN
       c       NaN
b      a       NaN
       b       3.0
       c       NaN
c      a       NaN
       b       NaN
       c       2.0

```




#### Expected Output
For **Case 1**:
```python
In[6]: df.groupby('label1').sum()
Out[6]:
        x
label1   
a       2
b       3
c       2
```

```python
In[7]: df.groupby('label1').groups
Out[7]:
{'a': Int64Index([0, 4, 7], dtype='int64'),
 'b': Int64Index([1, 3, 5, 8], dtype='int64'),
 'c': Int64Index([2, 6], dtype='int64')}
```

For **Case 2**:
```python
In[19]: df.groupby(['label1', 'label2']).sum()
Out[19]:
               x
label1 label2   
a      a       2
b      b       3
c      c       2
```

#### Output of ``pd.show_versions()``

<details>

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
</details>


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Groupby on categoricals includes unused categories #17631

In a nutshell:

Problem description and code samples

Case 1: group by a single column

Case 2: group by multiple columns

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Groupby on categoricals includes unused categories #17631

Description

In a nutshell:

Problem description and code samples

Case 1: group by a single column

Case 2: group by multiple columns

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Output of `pd.show_versions()`