
BUG: Surprisingly large memory usage in groupby (but maybe I have the wrong mental model?) #37139

Closed
ianozsvald opened this issue Oct 15, 2020 · 5 comments

@ianozsvald (Contributor)

Whilst teaching "more efficient Pandas" I dug into memory usage in a groupby with memory_profiler and was surprised by the output (below). For a 1.6GB DataFrame of random data with an indicator column, a groupby on the indicator doubles the RAM usage. I was surprised that the groupby would take a further 1.6GB during the operation when the result is a tiny DataFrame.

In this case there are 20 columns by 10 million rows of random floats, with a 21st column as an int indicator in the range [0, 9]. A groupby on this column creates 10 groups, so the mean groupby result is 10 rows by 20 columns. That part works as expected.

The groupby operation, shown further below with memory_profiler, seems to make a copy of each group before computing the mean, so the groupby as a whole costs a further 1.6GB. I'd have expected a light reference to be taken to the underlying data rather than (apparently, though maybe I'm reading this incorrectly?) a copy. I've also extracted a single group in the later lines of code, and each group costs roughly 1/10th of the RAM (160-200MB), which is further evidence that a copy is being taken.

Is my mental model wrong? Is it expected that a copy is taken of each group? Is there a way to run this code with a smaller total RAM envelope?
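
For context on the mental model I'm comparing against, here's a minimal NumPy-level sketch (not a claim about pandas internals): a contiguous slice of an array can be a zero-copy view, but selecting an arbitrary subset of rows, which is what pulling one group's rows together amounts to, is fancy indexing and has to allocate fresh memory.

import numpy as np

arr = np.random.random((1_000_000, 20))        # ~160MB of float64
contiguous = arr[:100_000]                     # basic slice -> view, ~0MB extra
scattered = arr[np.arange(0, 1_000_000, 10)]   # fancy indexing -> copy, ~16MB extra

print(np.shares_memory(arr, contiguous))       # True  (no copy made)
print(np.shares_memory(arr, scattered))        # False (a copy was made)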

$ python -m memory_profiler dataframe_example2.py 
(10000000, 20) shape for our array
df.head():
          0         1         2         3         4         5         6  ...        14        15        16        17        18        19  indicator
0  0.236551  0.588519  0.629860  0.736470  0.064391  0.755922  0.693302  ...  0.031612  0.593927  0.523154  0.704383  0.800547  0.730927          1
1  0.044981  0.305559  0.156594  0.014646  0.339585  0.177476  0.242033  ...  0.428930  0.099833  0.256720  0.326671  0.037584  0.435411          8
2  0.837702  0.246343  0.380937  0.990791  0.586211  0.155818  0.990258  ...  0.453055  0.363815  0.979012  0.220975  0.650783  0.338048          7
3  0.721275  0.327818  0.689749  0.715901  0.617750  0.550584  0.686884  ...  0.172825  0.083338  0.474990  0.213201  0.236640  0.962145          5
4  0.698709  0.998042  0.805397  0.971646  0.260935  0.602839  0.012762  ...  0.458625  0.248945  0.114550  0.212636  0.019970  0.159915          4

[5 rows x 21 columns]
# row by row memory usage clipped and shown further below as a detail

Mean shape: (10, 20)

(20,)
Filename: dataframe_example2.py

Line #    Mem usage    Increment   Line Contents
================================================
     9   76.633 MiB   76.633 MiB   @profile
    10                             def run():
    11                                 # make a big dataframe with an indicator column and lots of random data
    12                                 # use 20 columns to make it clear we have "a chunk of data"
    13                                 # float64 * 10M is 80M, for 20 rows this is 80M*20 circa 1,600MB
    14 1602.570 MiB 1525.938 MiB       arr = np.random.random((SIZE, 20))
    15 1602.570 MiB    0.000 MiB       print(f"{arr.shape} shape for our array")
    16 1602.570 MiB    0.000 MiB       df = pd.DataFrame(arr)
    17 1602.570 MiB    0.000 MiB       cols_to_agg = list(df.columns) # get [0, 1, 2,...]
    18                             
    19                                 # (0, 10] range for 10 indicator ints for grouping
    20 1679.258 MiB   76.688 MiB       df['indicator'] = np.random.randint(0, 10, SIZE)
    21 1679.258 MiB    0.000 MiB       print("df.head():")
    22 1680.133 MiB    0.875 MiB       print(df.head())
    23 1680.133 MiB    0.000 MiB       print("Memory usage:\n", df.memory_usage())
    24                                 
    25                                 # calculate summary statistic across grouped rows by all columns
    26 1680.133 MiB    0.000 MiB       gpby = df.groupby('indicator')
    27 3292.250 MiB 1612.117 MiB       means = gpby.mean()
    28 3292.250 MiB    0.000 MiB       print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
    29                                 
    30 3454.648 MiB  162.398 MiB       gp0_indexes = gpby.groups[0]
    31 3470.438 MiB   15.789 MiB       manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
    32 3470.438 MiB    0.000 MiB       print(manual_lookup_mean.shape)
    33 3470.820 MiB    0.383 MiB       np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
    34                             
    35 3699.656 MiB  228.836 MiB       gp0 = gpby.get_group(0)
    36 3699.750 MiB    0.094 MiB       manual_lookup_mean2 = gp0[cols_to_agg].mean()
    37 3699.750 MiB    0.000 MiB       np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
    38                                 #breakpoint()
    39 3699.750 MiB    0.000 MiB       return df, gpby, means


# row by row memory usage clipped from above and pasted here as it is less relevant detail:
Memory usage:
 Index             128
0            80000000
1            80000000
2            80000000
3            80000000
4            80000000
5            80000000
6            80000000
7            80000000
8            80000000
9            80000000
10           80000000
11           80000000
12           80000000
13           80000000
14           80000000
15           80000000
16           80000000
17           80000000
18           80000000
19           80000000
indicator    80000000
dtype: int64

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
        
# requires memory_profiler; importing profile explicitly also lets the script
# run without `python -m memory_profiler`
from memory_profiler import profile
#SIZE = 100_000 # quick run
SIZE = 10_000_000 # takes 20s or so & 4GB RAM

@profile
def run():
    # make a big dataframe with an indicator column and lots of random data
    # use 20 columns to make it clear we have "a chunk of data"
    # float64 * 10M rows is 80MB per column; for 20 columns that's circa 1,600MB
    arr = np.random.random((SIZE, 20))
    print(f"{arr.shape} shape for our array")
    df = pd.DataFrame(arr)
    cols_to_agg = list(df.columns) # get [0, 1, 2,...]

    # [0, 10) range gives 10 indicator ints for grouping
    df['indicator'] = np.random.randint(0, 10, SIZE)
    print("df.head():")
    print(df.head())
    print("Memory usage:\n", df.memory_usage())
    
    # calculate summary statistic across grouped rows by all columns
    gpby = df.groupby('indicator')
    means = gpby.mean()
    print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
    
    gp0_indexes = gpby.groups[0]
    manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
    print(manual_lookup_mean.shape)
    np.testing.assert_allclose(manual_lookup_mean, means.loc[0])

    gp0 = gpby.get_group(0)
    manual_lookup_mean2 = gp0[cols_to_agg].mean()
    np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
    #breakpoint()
    return df, gpby, means
    

if __name__ == "__main__":
    df, gpby, means = run()
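
As a point of comparison for the "smaller RAM envelope" question above, here's a hand-rolled sketch of the same aggregation using np.bincount, which only needs O(n_groups) extra memory per column (illustrative only - it covers just this mean-by-int-indicator case and is not a suggestion for how pandas should work internally):

import numpy as np
import pandas as pd

SIZE = 1_000_000  # smaller than the repro above, purely illustrative
arr = np.random.random((SIZE, 20))
indicator = np.random.randint(0, 10, SIZE)

counts = np.bincount(indicator, minlength=10)  # rows per group
sums = np.column_stack([
    np.bincount(indicator, weights=arr[:, j], minlength=10)  # per-group sums, one column at a time
    for j in range(arr.shape[1])
])
manual_means = sums / counts[:, None]          # (10, 20) table of group means

# cross-check against the pandas result
df = pd.DataFrame(arr)
df['indicator'] = indicator
np.testing.assert_allclose(manual_means, df.groupby('indicator').mean().to_numpy())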

Output of pd.show_versions()

In [3]: pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.1-050801-generic
Version : #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200917
Cython : None
pytest : 6.1.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.2
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

@ianozsvald added the Bug and Needs Triage labels on Oct 15, 2020
@ianozsvald (Contributor, Author)

Here's a slightly more detailed view: taking a view is cheap (circa 0MB, as expected), but using the index from the groupby still seems pretty heavyweight (though I might just misunderstand what's going on behind the scenes):
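
Back-of-envelope (my estimate, not measured internals): gathering one group's ~1 million scattered rows across 20 float64 columns has to materialise roughly

    10_000_000 / 10 rows * 20 columns * 8 bytes ≈ 160MB

which lines up with the ~168MB increment on the df.loc[index_values, cols_to_agg] line in the profile below, whereas the contiguous df_subselect_10pct slice costs ~0MB.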

In [31]: !python -m memory_profiler dataframe_example2.py
(10000000, 20) shape for our array
df.head():
          0         1         2         3         4         5         6  ...        14        15        16        17        18        19  indicator
0  0.437260  0.113020  0.468664  0.031016  0.056863  0.298907  0.143580  ...  0.195952  0.657012  0.520817  0.338082  0.640738  0.906689          5
1  0.786875  0.954152  0.030518  0.161326  0.677397  0.555207  0.870358  ...  0.747006  0.518000  0.204037  0.316731  0.409375  0.445969          9
2  0.393063  0.517714  0.463096  0.951375  0.167449  0.892361  0.301859  ...  0.492387  0.787819  0.933163  0.548621  0.392948  0.453509          0
3  0.664898  0.265355  0.672949  0.444550  0.400675  0.052995  0.157249  ...  0.077314  0.193061  0.218931  0.098530  0.268247  0.657000          6
4  0.467095  0.191086  0.672740  0.222110  0.794777  0.167955  0.945457  ...  0.845072  0.292147  0.269523  0.325818  0.880962  0.124021          7

[5 rows x 21 columns]
Memory usage:
 Index             128
0            80000000
1            80000000
2            80000000
3            80000000
4            80000000
5            80000000
6            80000000
7            80000000
8            80000000
9            80000000
10           80000000
11           80000000
12           80000000
13           80000000
14           80000000
15           80000000
16           80000000
17           80000000
18           80000000
19           80000000
indicator    80000000
dtype: int64
Mean shape: (10, 20)
7,998,560 bytes used in the underlying index
df_subselected_via_index is view True
(20,)
Filename: dataframe_example2.py

Line #    Mem usage    Increment   Line Contents
================================================
     9   76.738 MiB   76.738 MiB   @profile
    10                             def run():
    11                                 # make a big dataframe with an indicator column and lots of random data
    12                                 # use 20 columns to make it clear we have "a chunk of data"
    13                                 # float64 * 10M is 80M, for 20 rows this is 80M*20 circa 1,600MB
    14 1602.730 MiB 1525.992 MiB       arr = np.random.random((SIZE, 20))
    15 1602.730 MiB    0.000 MiB       print(f"{arr.shape} shape for our array")
    16 1602.730 MiB    0.000 MiB       df = pd.DataFrame(arr)
    17 1602.730 MiB    0.000 MiB       cols_to_agg = list(df.columns) # get [0, 1, 2,...]
    18                             
    19                                 # (0, 10] range for 10 indicator ints for grouping
    20 1679.453 MiB   76.723 MiB       df['indicator'] = np.random.randint(0, 10, SIZE)
    21 1679.453 MiB    0.000 MiB       print("df.head():")
    22 1680.141 MiB    0.688 MiB       print(df.head())
    23 1680.141 MiB    0.000 MiB       print("Memory usage:\n", df.memory_usage())
    24                                 
    25                                 # calculate summary statistic across grouped rows by all columns
    26 1680.141 MiB    0.000 MiB       gpby = df.groupby('indicator')
    27 3292.289 MiB 1612.148 MiB       means = gpby.mean()
    28 3292.289 MiB    0.000 MiB       print(f"Mean shape: {means.shape}") # (10, 20) for 10 indicators and 20 columns
    29                                
    30                                 # try for a light reference using a view
    31                                 # note we can't test this as it ignores the groupby, this is to
    32                                 # check if taking roughly the same number of rows as a reference
    33                                 # is cheaper (hopefully!) than working with a grouped set of rows
    34 3292.289 MiB    0.000 MiB       df_subselect_10pct = df[:int(df.shape[0]/10)] # cost is 0MB
    35 3292.289 MiB    0.000 MiB       _ = df_subselect_10pct.mean() # calculated to check that this is effectively free
    36                             
    37                                 # try to use the groupby index - the following lines are expensive
    38 3454.668 MiB  162.379 MiB       gp0_indexes = gpby.groups[0]
    39 3454.668 MiB    0.000 MiB       print(f"{gpby.groups[0]._data.nbytes:,} bytes used in the underlying index")
    40 3454.668 MiB    0.000 MiB       index_values = gp0_indexes.values
    41                                 # indexing here feels very expensive, given that the df_subselect_10pct view above costs 0MB
    42 3622.969 MiB  168.301 MiB       df_subselected_via_index = df.loc[index_values, cols_to_agg]
    43 3622.969 MiB    0.000 MiB       print("df_subselected_via_index is view", df_subselected_via_index._data.is_view)
    44 3623.383 MiB    0.414 MiB       np.testing.assert_allclose(df_subselected_via_index.mean(), means.loc[0])
    45                             
    46                                 # since we now have the indexes, this is cheap (in the earlier example this was expensive)
    47 3623.402 MiB    0.020 MiB       manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
    48 3623.402 MiB    0.000 MiB       print(manual_lookup_mean.shape)
    49 3623.402 MiB    0.000 MiB       np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
    50                             
    51                                 # fetching the group seems to be quite expensive again
    52 3852.152 MiB  228.750 MiB       gp0 = gpby.get_group(0)
    53 3852.281 MiB    0.129 MiB       manual_lookup_mean2 = gp0[cols_to_agg].mean()
    54 3852.281 MiB    0.000 MiB       np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
    55                             
    56                                 #breakpoint()
    57 3852.281 MiB    0.000 MiB       return df, gpby, means


@emeryberger

I was able to verify these results with Scalene, a Python profiler that simultaneously tracks CPU execution time (split into Python & native), as well as memory consumption and (crucially here) copy volume. The copy volume column makes it clear that lines 30, 34, and 35 are doing a lot of copying, confirming @ianozsvald's hypothesis.

$ scalene dataframe_example2.py
                                           Memory usage: ▅███▆▆▆▆▆▆▆▆▆ (max: 5706.79MB)                                            
                                     dataframe_example.py: % of time = 100.00% out of   9.81s.                                     
                                                                                                                                   
 Line | Time % Python | Time % native | Sys % | Mem % Python | Net (MB) | Copy (MB/s) | dataframe_example.py
------|---------------|---------------|-------|--------------|----------|-------------|----------------------------------------------------------------
   13 |               |           29% |       |              |     1526 |             | arr = np.random.random((SIZE, 20))
   19 |               |            3% |    0% |              |      154 |           8 | df['indicator'] = np.random.randint(0, 10, SIZE)
   21 |            1% |               |    0% |          85% |      -73 |             | print(df.head())
   22 |               |               |       |         100% |        3 |             | print("Memory usage:\n", df.memory_usage())
   26 |               |           19% |       |              |     1714 |             | means = gpby.mean()
   29 |            1% |            3% |       |              |      244 |             | gp0_indexes = gpby.groups[0]
   30 |            2% |           20% |       |              |     1670 |          33 | manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
   32 |               |               |       |         100% |        1 |             | np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
   34 |            1% |            8% |       |              |    -1625 |          16 | gp0 = gpby.get_group(0)
   35 |            1% |            3% |    0% |              |      324 |          16 | manual_lookup_mean2 = gp0[cols_to_agg].mean()

(lines with no measurements and the memory-usage-over-time sparkline column are omitted)

@mzeitlin11 (Member)

Thanks for investigating this @ianozsvald and @emeryberger! Contributions to avoid any unnecessary copying here would certainly be welcome.

@mzeitlin11 added the Groupby and Performance labels and removed the Bug and Needs Triage labels on Jul 1, 2021
@lithomas1 (Member)

This seems to be fixed by the new Copy-on-Write optimizations on main.
More info on how to enable them can be found in the 2.0 whatsnew:
https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#copy-on-write-improvements
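
For reference, opting in (as also shown in the comment below) looks like this; either spelling sets the same option:

import pandas as pd

# enable Copy-on-Write before creating any DataFrames
pd.set_option("mode.copy_on_write", True)
# or, equivalently:
pd.options.mode.copy_on_write = True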

@lithomas1 added the Closing Candidate (may be closeable, needs more eyeballs) label on Jan 11, 2023
@emeryberger

Confirming that enabling Copy-on-Write indeed has a substantial impact on reducing memory consumption. Nicely done! If this bug report helped motivate this optimization in any way, please let us know!

Before: without copy-on-write, peak memory consumption was 3.773GB, and line 34 consumed 1.671GB.

[Screenshot: Scalene profile without copy-on-write, 2023-01-28]

After: Following the recommendations of the above-linked web site, I added the following copy-on-write directives before running the run() function:

pd.set_option("mode.copy_on_write", True)                                                                
pd.options.mode.copy_on_write = True                                                                     

Peak memory consumption drops to 2.63GB, and line 34 now consumes just 186MB (roughly 10% of what it consumed previously). The results of the computation appear to be identical.

[Screenshot: Scalene profile with copy-on-write enabled, 2023-01-28]
