
BUG: Surprisingly large memory usage in groupby (but maybe I have the wrong mental model?) #37139

@ianozsvald

Description

Whilst teaching "more efficient Pandas" I dug into memory usage in a groupby with memory_profiler and was surprised by the output (below). For a 1.6GB DataFrame of random data with an indicator column, a groupby on the indicator generates a result that doubles the RAM usage. I was surprised that the groupby would take a further 1.6GB during the operation when the result is a tiny DataFrame.

In this case there are 20 columns by 10 million rows of random floats, with a 21st column as an int indicator in the range [0, 9]. A groupby on this creates 10 groups, and taking the mean gives a result of 10 rows by 20 columns. This works as expected.

The groupby operation, shown further below with memory_profiler, seems to make a copy of each group before performing a mean, so the groupby costs a further 1.6GB in total. I'd have expected a light reference to be taken to the underlying data rather than (apparently, but maybe I'm reading this incorrectly?) a copy. I've also pulled out a single group in the later lines of code, and each group costs roughly 1/10th of the RAM (160-200MB), which is further evidence that a copy is being taken.
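As a quick check of whether get_group really hands back a copy (this snippet isn't part of the profiled script, it's just a sketch using a smaller SIZE), the group's buffer can be compared with the original column's buffer:

import numpy as np
import pandas as pd

SIZE = 1_000_000  # smaller than the profiled run, enough to show the behaviour
df = pd.DataFrame(np.random.random((SIZE, 20)))
df['indicator'] = np.random.randint(0, 10, SIZE)

gpby = df.groupby('indicator')
gp0 = gpby.get_group(0)

# True would mean the group is a view onto the original block;
# as far as I can tell this prints False, i.e. the group's data lives in a new buffer
print(np.shares_memory(df[0].to_numpy(), gp0[0].to_numpy()))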

Is my mental model wrong? Is it expected that a copy is taken of each group? Is there a way to run this code with a smaller total RAM envelope?
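One workaround I've wondered about (just a sketch, I haven't profiled it, and it assumes the per-group copies are the dominant cost) is to aggregate one column at a time, reusing the df and cols_to_agg from the script below, so that any intermediate copies cover a single 80MB column rather than the whole 1.6GB block:

# reuse the same grouper and take the mean one column at a time;
# each step should only materialise data for a single 80MB column
gpby = df.groupby('indicator')
means = pd.concat({col: gpby[col].mean() for col in cols_to_agg}, axis=1)

Whether this actually shrinks the peak RAM envelope would need to be confirmed with memory_profiler.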

$ python -m memory_profiler dataframe_example2.py 
(10000000, 20) shape for our array
df.head():
          0         1         2         3         4         5         6  ...        14        15        16        17        18        19  indicator
0  0.236551  0.588519  0.629860  0.736470  0.064391  0.755922  0.693302  ...  0.031612  0.593927  0.523154  0.704383  0.800547  0.730927          1
1  0.044981  0.305559  0.156594  0.014646  0.339585  0.177476  0.242033  ...  0.428930  0.099833  0.256720  0.326671  0.037584  0.435411          8
2  0.837702  0.246343  0.380937  0.990791  0.586211  0.155818  0.990258  ...  0.453055  0.363815  0.979012  0.220975  0.650783  0.338048          7
3  0.721275  0.327818  0.689749  0.715901  0.617750  0.550584  0.686884  ...  0.172825  0.083338  0.474990  0.213201  0.236640  0.962145          5
4  0.698709  0.998042  0.805397  0.971646  0.260935  0.602839  0.012762  ...  0.458625  0.248945  0.114550  0.212636  0.019970  0.159915          4

[5 rows x 21 columns]
# row by row memory usage clipped and shown further below as a detail

Mean shape: (10, 20)

(20,)
Filename: dataframe_example2.py

Line #    Mem usage    Increment   Line Contents
================================================
     9   76.633 MiB   76.633 MiB   @profile
    10                             def run():
    11                                 # make a big dataframe with an indicator column and lots of random data
    12                                 # use 20 columns to make it clear we have "a chunk of data"
     13                                 # float64 * 10M rows is 80MB per column, so 20 columns is circa 1,600MB
    14 1602.570 MiB 1525.938 MiB       arr = np.random.random((SIZE, 20))
    15 1602.570 MiB    0.000 MiB       print(f"{arr.shape} shape for our array")
    16 1602.570 MiB    0.000 MiB       df = pd.DataFrame(arr)
    17 1602.570 MiB    0.000 MiB       cols_to_agg = list(df.columns) # get [0, 1, 2,...]
    18                             
    19                                 # [0, 10) range gives 10 indicator ints for grouping
    20 1679.258 MiB   76.688 MiB       df['indicator'] = np.random.randint(0, 10, SIZE)
    21 1679.258 MiB    0.000 MiB       print("df.head():")
    22 1680.133 MiB    0.875 MiB       print(df.head())
    23 1680.133 MiB    0.000 MiB       print("Memory usage:\n", df.memory_usage())
    24                                 
    25                                 # calculate summary statistic across grouped rows by all columns
    26 1680.133 MiB    0.000 MiB       gpby = df.groupby('indicator')
    27 3292.250 MiB 1612.117 MiB       means = gpby.mean()
    28 3292.250 MiB    0.000 MiB       print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
    29                                 
    30 3454.648 MiB  162.398 MiB       gp0_indexes = gpby.groups[0]
    31 3470.438 MiB   15.789 MiB       manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
    32 3470.438 MiB    0.000 MiB       print(manual_lookup_mean.shape)
    33 3470.820 MiB    0.383 MiB       np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
    34                             
    35 3699.656 MiB  228.836 MiB       gp0 = gpby.get_group(0)
    36 3699.750 MiB    0.094 MiB       manual_lookup_mean2 = gp0[cols_to_agg].mean()
    37 3699.750 MiB    0.000 MiB       np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
    38                                 #breakpoint()
    39 3699.750 MiB    0.000 MiB       return df, gpby, means


# row by row memory usage clipped from above and pasted here as it is less relevant detail:
Memory usage:
 Index             128
0            80000000
1            80000000
2            80000000
3            80000000
4            80000000
5            80000000
6            80000000
7            80000000
8            80000000
9            80000000
10           80000000
11           80000000
12           80000000
13           80000000
14           80000000
15           80000000
16           80000000
17           80000000
18           80000000
19           80000000
indicator    80000000
dtype: int64

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
        
# requires memory_profiler; run with: python -m memory_profiler dataframe_example2.py
# (memory_profiler injects the @profile decorator used below)
#SIZE = 100_000 # quick run
SIZE = 10_000_000 # takes 20s or so & 4GB RAM

@profile
def run():
    # make a big dataframe with an indicator column and lots of random data
    # use 20 columns to make it clear we have "a chunk of data"
    # float64 * 10M rows is 80MB per column, so 20 columns is circa 1,600MB
    arr = np.random.random((SIZE, 20))
    print(f"{arr.shape} shape for our array")
    df = pd.DataFrame(arr)
    cols_to_agg = list(df.columns) # get [0, 1, 2,...]

    # [0, 10) range gives 10 indicator ints for grouping
    df['indicator'] = np.random.randint(0, 10, SIZE)
    print("df.head():")
    print(df.head())
    print("Memory usage:\n", df.memory_usage())
    
    # calculate summary statistic across grouped rows by all columns
    gpby = df.groupby('indicator')
    means = gpby.mean()
    print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
    
    gp0_indexes = gpby.groups[0]
    manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
    print(manual_lookup_mean.shape)
    np.testing.assert_allclose(manual_lookup_mean, means.loc[0])

    gp0 = gpby.get_group(0)
    manual_lookup_mean2 = gp0[cols_to_agg].mean()
    np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
    #breakpoint()
    return df, gpby, means
    

if __name__ == "__main__":
    df, gpby, means = run()

Output of pd.show_versions()

In [3]: pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.1-050801-generic
Version : #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200917
Cython : None
pytest : 6.1.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.2
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

Labels: Closing Candidate, Groupby, Performance
