Description
Whilst teaching "more efficient Pandas" I dug into memory usage in a groupby with memory_profiler and was surprised by the output (below). For a 1.6GB DataFrame of random data with an indicator column, a groupby on the indicator doubles the RAM usage while it runs. I was surprised that the groupby takes a further 1.6GB during the operation when the result is a tiny DataFrame.
In this case there are 20 columns by 10 million rows of random floats, with a 21st column holding an int indicator in the range [0, 9]; a groupby on this creates 10 groups, giving a mean result of 10 rows by 20 columns. This works as expected.
The groupby operation, shown further below with memory_profiler, seems to make a copy of each group before computing the mean, so in total the groupby costs a further 1.6GB. I'd have expected a lightweight reference to be taken to the underlying data rather than (apparently, though maybe I'm reading this incorrectly?) a copy. In further lines of code I've also pulled out a single group, and each group costs roughly 1/10th of the RAM (160-230MB), which is further evidence that a copy is being taken.
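To sanity-check the copy hypothesis outside the profiler run, here is a small sketch (mine, not part of the profiled script): if get_group() handed back a view, its column values would share memory with the parent frame's values.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1_000, 20)))
df['indicator'] = np.random.randint(0, 10, 1_000)

gp0 = df.groupby('indicator').get_group(0)
# False here is consistent with get_group materialising a copy of the rows
print(np.shares_memory(gp0[0].to_numpy(), df[0].to_numpy()))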
Is my mental model wrong? Is it expected that a copy is taken of each group? Is there a way to run this code with a smaller total RAM envelope?
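For reference, one lower-RAM route I can sketch (assuming the group labels are small non-negative integers, as here) is to accumulate per-group sums column by column with np.bincount, which never materialises a per-group copy of the data:

import numpy as np
import pandas as pd

SIZE = 10_000_000
df = pd.DataFrame(np.random.random((SIZE, 20)))
df['indicator'] = np.random.randint(0, 10, SIZE)

codes = df['indicator'].to_numpy()
counts = np.bincount(codes, minlength=10)  # rows per group
sums = np.stack(
    [np.bincount(codes, weights=df[c].to_numpy(), minlength=10)
     for c in range(20)],
    axis=1,
)  # (10, 20) per-group sums
means = sums / counts[:, None]  # matches df.groupby('indicator').mean().to_numpy()

Each bincount pass only reads a column view, so the extra footprint stays at the size of the (10, 20) result rather than another 1.6GB.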
$ python -m memory_profiler dataframe_example2.py
(10000000, 20) shape for our array
df.head():
0 1 2 3 4 5 6 ... 14 15 16 17 18 19 indicator
0 0.236551 0.588519 0.629860 0.736470 0.064391 0.755922 0.693302 ... 0.031612 0.593927 0.523154 0.704383 0.800547 0.730927 1
1 0.044981 0.305559 0.156594 0.014646 0.339585 0.177476 0.242033 ... 0.428930 0.099833 0.256720 0.326671 0.037584 0.435411 8
2 0.837702 0.246343 0.380937 0.990791 0.586211 0.155818 0.990258 ... 0.453055 0.363815 0.979012 0.220975 0.650783 0.338048 7
3 0.721275 0.327818 0.689749 0.715901 0.617750 0.550584 0.686884 ... 0.172825 0.083338 0.474990 0.213201 0.236640 0.962145 5
4 0.698709 0.998042 0.805397 0.971646 0.260935 0.602839 0.012762 ... 0.458625 0.248945 0.114550 0.212636 0.019970 0.159915 4
[5 rows x 21 columns]
# row by row memory usage clipped and shown further below as a detail
Mean shape: (10, 20)
(20,)
Filename: dataframe_example2.py
Line # Mem usage Increment Line Contents
================================================
9 76.633 MiB 76.633 MiB @profile
10 def run():
11 # make a big dataframe with an indicator column and lots of random data
12 # use 20 columns to make it clear we have "a chunk of data"
13                                           # float64 * 10M is 80MB per column; for 20 columns this is 80MB*20, circa 1,600MB
14 1602.570 MiB 1525.938 MiB arr = np.random.random((SIZE, 20))
15 1602.570 MiB 0.000 MiB print(f"{arr.shape} shape for our array")
16 1602.570 MiB 0.000 MiB df = pd.DataFrame(arr)
17 1602.570 MiB 0.000 MiB cols_to_agg = list(df.columns) # get [0, 1, 2,...]
18
19                                           # [0, 10) range for 10 indicator ints for grouping
20 1679.258 MiB 76.688 MiB df['indicator'] = np.random.randint(0, 10, SIZE)
21 1679.258 MiB 0.000 MiB print("df.head():")
22 1680.133 MiB 0.875 MiB print(df.head())
23 1680.133 MiB 0.000 MiB print("Memory usage:\n", df.memory_usage())
24
25 # calculate summary statistic across grouped rows by all columns
26 1680.133 MiB 0.000 MiB gpby = df.groupby('indicator')
27 3292.250 MiB 1612.117 MiB means = gpby.mean()
28 3292.250 MiB 0.000 MiB print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
29
30 3454.648 MiB 162.398 MiB gp0_indexes = gpby.groups[0]
31 3470.438 MiB 15.789 MiB manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
32 3470.438 MiB 0.000 MiB print(manual_lookup_mean.shape)
33 3470.820 MiB 0.383 MiB np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
34
35 3699.656 MiB 228.836 MiB gp0 = gpby.get_group(0)
36 3699.750 MiB 0.094 MiB manual_lookup_mean2 = gp0[cols_to_agg].mean()
37 3699.750 MiB 0.000 MiB np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
38 #breakpoint()
39 3699.750 MiB 0.000 MiB return df, gpby, means
# row by row memory usage clipped from above and pasted here as it is less relevant detail:
Memory usage:
Index 128
0 80000000
1 80000000
2 80000000
3 80000000
4 80000000
5 80000000
6 80000000
7 80000000
8 80000000
9 80000000
10 80000000
11 80000000
12 80000000
13 80000000
14 80000000
15 80000000
16 80000000
17 80000000
18 80000000
19 80000000
indicator 80000000
dtype: int64
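A single line's cost can also be spot-checked without the per-line table, using memory_profiler's memory_usage helper; a sketch reusing the same setup (note that older memory_profiler versions return the peak as a one-element list rather than a float):

import numpy as np
import pandas as pd
from memory_profiler import memory_usage

SIZE = 10_000_000
df = pd.DataFrame(np.random.random((SIZE, 20)))
df['indicator'] = np.random.randint(0, 10, SIZE)
gpby = df.groupby('indicator')

# max_usage=True asks for the peak RSS in MiB while gpby.mean() runs
peak = memory_usage((gpby.mean, (), {}), interval=0.1, max_usage=True)
print("peak RSS during gpby.mean():", peak, "MiB")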
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
# run with: python -m memory_profiler dataframe_example2.py
# (memory_profiler injects @profile; the fallback below lets the script
# also run standalone, just without the per-line table)
try:
    profile
except NameError:
    def profile(func):
        return func
#SIZE = 100_000 # quick run
SIZE = 10_000_000 # takes 20s or so & 4GB RAM
@profile
def run():
# make a big dataframe with an indicator column and lots of random data
# use 20 columns to make it clear we have "a chunk of data"
    # float64 * 10M is 80MB per column; for 20 columns this is 80MB*20, circa 1,600MB
arr = np.random.random((SIZE, 20))
print(f"{arr.shape} shape for our array")
df = pd.DataFrame(arr)
cols_to_agg = list(df.columns) # get [0, 1, 2,...]
    # [0, 10) range for 10 indicator ints for grouping
df['indicator'] = np.random.randint(0, 10, SIZE)
print("df.head():")
print(df.head())
print("Memory usage:\n", df.memory_usage())
# calculate summary statistic across grouped rows by all columns
gpby = df.groupby('indicator')
means = gpby.mean()
print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
gp0_indexes = gpby.groups[0]
manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
print(manual_lookup_mean.shape)
np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
gp0 = gpby.get_group(0)
manual_lookup_mean2 = gp0[cols_to_agg].mean()
np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
#breakpoint()
return df, gpby, means
if __name__ == "__main__":
df, gpby, means = run()
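As noted in the comments, the per-line table only appears when the script is run via python -m memory_profiler; under the plain interpreter the no-op profile fallback just runs the code unprofiled.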
Output of pd.show_versions()
In [3]: pd.show_versions()
INSTALLED VERSIONS
commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.1-050801-generic
Version : #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.1.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200917
Cython : None
pytest : 6.1.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.2
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2