Description
Whilst teaching "more efficient Pandas" I dug into memory usage in a groupby with memory_profiler and was surprised by the output (below). For a 1.6GB DataFrame of random data with an indicator column, a groupby on the indicator doubles the RAM usage while it runs. I was surprised that the groupby takes a further 1.6GB during the operation when the result is a tiny DataFrame.
In this case there are 20 columns by 10 million rows of random floats, with a 21st column holding an int indicator in the range [0, 9]; a groupby on this creates 10 groups, giving a mean result of 10 rows by 20 columns. This works as expected.
The groupby operation, shown further below with memory_profiler, seems to make a copy of each group before computing the mean, so in total the groupby costs a further 1.6GB. I'd have expected a lightweight reference to be taken to the underlying data rather than (apparently, though maybe I'm reading this incorrectly?) a copy. In further lines of code I've also pulled out a single group, and each group costs roughly 1/10th of the RAM (160-230MB), which is further evidence that a copy is being taken.
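To sanity-check the copy hypothesis outside the profiler run, here is a small sketch (mine, not part of the profiled script): if get_group() handed back a view, its column values would share memory with the parent frame's values.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1_000, 20)))
df['indicator'] = np.random.randint(0, 10, 1_000)

gp0 = df.groupby('indicator').get_group(0)
# False here is consistent with get_group materialising a copy of the rows
print(np.shares_memory(gp0[0].to_numpy(), df[0].to_numpy()))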
Is my mental model wrong? Is it expected that a copy is taken of each group? Is there a way to run this code with a smaller total RAM envelope?
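For reference, one lower-RAM route I can sketch (assuming the group labels are small non-negative integers, as here) is to accumulate per-group sums column by column with np.bincount, which never materialises a per-group copy of the data:

import numpy as np
import pandas as pd

SIZE = 10_000_000
df = pd.DataFrame(np.random.random((SIZE, 20)))
df['indicator'] = np.random.randint(0, 10, SIZE)

codes = df['indicator'].to_numpy()
counts = np.bincount(codes, minlength=10)  # rows per group
sums = np.stack(
    [np.bincount(codes, weights=df[c].to_numpy(), minlength=10)
     for c in range(20)],
    axis=1,
)  # (10, 20) per-group sums
means = sums / counts[:, None]  # matches df.groupby('indicator').mean().to_numpy()

Each bincount pass only reads a column view, so the extra footprint stays at the size of the (10, 20) result rather than another 1.6GB.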
$ python -m memory_profiler dataframe_example2.py
(10000000, 20) shape for our array
df.head():
0 1 2 3 4 5 6 ... 14 15 16 17 18 19 indicator
0 0.236551 0.588519 0.629860 0.736470 0.064391 0.755922 0.693302 ... 0.031612 0.593927 0.523154 0.704383 0.800547 0.730927 1
1 0.044981 0.305559 0.156594 0.014646 0.339585 0.177476 0.242033 ... 0.428930 0.099833 0.256720 0.326671 0.037584 0.435411 8
2 0.837702 0.246343 0.380937 0.990791 0.586211 0.155818 0.990258 ... 0.453055 0.363815 0.979012 0.220975 0.650783 0.338048 7
3 0.721275 0.327818 0.689749 0.715901 0.617750 0.550584 0.686884 ... 0.172825 0.083338 0.474990 0.213201 0.236640 0.962145 5
4 0.698709 0.998042 0.805397 0.971646 0.260935 0.602839 0.012762 ... 0.458625 0.248945 0.114550 0.212636 0.019970 0.159915 4
[5 rows x 21 columns]
# row by row memory usage clipped and shown further below as a detail
Mean shape: (10, 20)
(20,)
Filename: dataframe_example2.py
Line # Mem usage Increment Line Contents
================================================
9 76.633 MiB 76.633 MiB @profile
10 def run():
11 # make a big dataframe with an indicator column and lots of random data
12 # use 20 columns to make it clear we have "a chunk of data"
13                                           # float64 * 10M is 80MB per column; for 20 columns this is 80MB*20, circa 1,600MB
14 1602.570 MiB 1525.938 MiB arr = np.random.random((SIZE, 20))
15 1602.570 MiB 0.000 MiB print(f"{arr.shape} shape for our array")
16 1602.570 MiB 0.000 MiB df = pd.DataFrame(arr)
17 1602.570 MiB 0.000 MiB cols_to_agg = list(df.columns) # get [0, 1, 2,...]
18
19                                           # [0, 10) range for 10 indicator ints for grouping
20 1679.258 MiB 76.688 MiB df['indicator'] = np.random.randint(0, 10, SIZE)
21 1679.258 MiB 0.000 MiB print("df.head():")
22 1680.133 MiB 0.875 MiB print(df.head())
23 1680.133 MiB 0.000 MiB print("Memory usage:\n", df.memory_usage())
24
25 # calculate summary statistic across grouped rows by all columns
26 1680.133 MiB 0.000 MiB gpby = df.groupby('indicator')
27 3292.250 MiB 1612.117 MiB means = gpby.mean()
28 3292.250 MiB 0.000 MiB print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
29
30 3454.648 MiB 162.398 MiB gp0_indexes = gpby.groups[0]
31 3470.438 MiB 15.789 MiB manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
32 3470.438 MiB 0.000 MiB print(manual_lookup_mean.shape)
33 3470.820 MiB 0.383 MiB np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
34
35 3699.656 MiB 228.836 MiB gp0 = gpby.get_group(0)
36 3699.750 MiB 0.094 MiB manual_lookup_mean2 = gp0[cols_to_agg].mean()
37 3699.750 MiB 0.000 MiB np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
38 #breakpoint()
39 3699.750 MiB 0.000 MiB return df, gpby, means
# row by row memory usage clipped from above and pasted here as it is less relevant detail:
Memory usage:
Index 128
0 80000000
1 80000000
2 80000000
3 80000000
4 80000000
5 80000000
6 80000000
7 80000000
8 80000000
9 80000000
10 80000000
11 80000000
12 80000000
13 80000000
14 80000000
15 80000000
16 80000000
17 80000000
18 80000000
19 80000000
indicator 80000000
dtype: int64
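A single line's cost can also be spot-checked without the per-line table, using memory_profiler's memory_usage helper; a sketch reusing the same setup (note that older memory_profiler versions return the peak as a one-element list rather than a float):

import numpy as np
import pandas as pd
from memory_profiler import memory_usage

SIZE = 10_000_000
df = pd.DataFrame(np.random.random((SIZE, 20)))
df['indicator'] = np.random.randint(0, 10, SIZE)
gpby = df.groupby('indicator')

# max_usage=True asks for the peak RSS in MiB while gpby.mean() runs
peak = memory_usage((gpby.mean, (), {}), interval=0.1, max_usage=True)
print("peak RSS during gpby.mean():", peak, "MiB")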
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
# run with: python -m memory_profiler dataframe_example2.py
# (memory_profiler injects @profile; the fallback below lets the script
# also run standalone, just without the per-line table)
try:
    profile
except NameError:
    def profile(func):
        return func
#SIZE = 100_000 # quick run
SIZE = 10_000_000 # takes 20s or so & 4GB RAM
@profile
def run():
# make a big dataframe with an indicator column and lots of random data
# use 20 columns to make it clear we have "a chunk of data"
    # float64 * 10M is 80MB per column; for 20 columns this is 80MB*20, circa 1,600MB
arr = np.random.random((SIZE, 20))
print(f"{arr.shape} shape for our array")
df = pd.DataFrame(arr)
cols_to_agg = list(df.columns) # get [0, 1, 2,...]
    # [0, 10) range for 10 indicator ints for grouping
df['indicator'] = np.random.randint(0, 10, SIZE)
print("df.head():")
print(df.head())
print("Memory usage:\n", df.memory_usage())
# calculate summary statistic across grouped rows by all columns
gpby = df.groupby('indicator')
means = gpby.mean()
print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
gp0_indexes = gpby.groups[0]
manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
print(manual_lookup_mean.shape)
np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
gp0 = gpby.get_group(0)
manual_lookup_mean2 = gp0[cols_to_agg].mean()
np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
#breakpoint()
return df, gpby, means
if __name__ == "__main__":
df, gpby, means = run()
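As noted in the comments, the per-line table only appears when the script is run via python -m memory_profiler; under the plain interpreter the no-op profile fallback just runs the code unprofiled.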
Output of pd.show_versions()
In [3]: pd.show_versions()
INSTALLED VERSIONS
commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.1-050801-generic
Version : #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.1.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200917
Cython : None
pytest : 6.1.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.2
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2