Memory usage increasing with df.dropna(inplace=True) and df.head() #11050

Closed
markns opened this issue Sep 10, 2015 · 2 comments
Labels: Performance (memory or execution speed performance)

Comments

markns commented Sep 10, 2015

There appears to be a memory leak in DataFrame.dropna(inplace=True). Please see the IPython session below:

In [1]:
import ipython_memory_usage.ipython_memory_usage as imu
imu.start_watching_memory()
import pandas as pd
import numpy as np
pd.__version__

Out[1]: '0.16.2'
In [1] used 23.6875 MiB RAM in 0.70s, peaked 0.00 MiB above current, total RAM usage 59.76 MiB

In [2]:
df = pd.DataFrame(np.ones(int(1e7)))
df.info(memory_usage=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
0    float64
dtypes: float64(1)
memory usage: 152.6 MB
In [2] used 153.3477 MiB RAM in 0.14s, peaked 0.00 MiB above current, total RAM usage 213.11 MiB

In [3]:
df.dropna(inplace=True)
df.head(2)

Out[3]:
0
0   1
1   1
In [3] used 0.1758 MiB RAM in 0.60s, peaked 385.92 MiB above current, total RAM usage 213.29 MiB

In [4]:
df.dropna(inplace=True)
df.head(2)

Out[4]:
0
0   1
1   1
In [4] used 152.9375 MiB RAM in 0.61s, peaked 182.86 MiB above current, total RAM usage 366.22 MiB

In [5]:
df.dropna(inplace=True)
df.head(2)

Out[5]:
0
0   1
1   1
In [5] used 152.9297 MiB RAM in 0.58s, peaked 272.79 MiB above current, total RAM usage 519.15 MiB

Re-running cells 3, 4 and 5 adds around 150 MB to the total memory usage each time. The behaviour only appears when df.head() is run after the in-place dropna.
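The step of roughly 150 MB matches the frame size that df.info() reports in cell 2. A back-of-the-envelope check, assuming that figure is made up of one float64 column plus a fully materialised 64-bit integer index (the Int64Index shown in the output); the variable names are only illustrative:

```python
import numpy as np

n_rows = 10_000_000

# Size of one value in the data column and of one index label.
bytes_per_value = np.dtype("float64").itemsize
# Assumption: the index stores one int64 label per row (Int64Index).
bytes_per_label = np.dtype("int64").itemsize

column_mib = n_rows * bytes_per_value / 2**20
index_mib = n_rows * bytes_per_label / 2**20

print(f"column: {column_mib:.2f} MiB")
print(f"index:  {index_mib:.2f} MiB")
print(f"total:  {column_mib + index_mib:.1f} MiB")
```

Under that assumption, each extra step corresponds to about one full copy of the frame (data plus index) staying alive per cell.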

The behaviour can also be seen in a cell that does not drop NaNs in place:

df = df.dropna()
df.head(2)

but this only happens if df.dropna(inplace=True) has already been run within the same cell.

Contributor

jreback commented Sep 10, 2015

I don't think so. You need to garbage collect before comparing the numbers.

Further, don't be fooled by inplace=True: it doesn't actually do anything but reassign the reference internally, so it doesn't save any memory.
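A minimal sketch of both points, assuming a reasonably recent pandas and NumPy. The helper name `drop_both_ways` and the small frame size are hypothetical, chosen only for illustration. It runs dropna with and without inplace=True on identical data, calls gc.collect() before anything would be measured, and leaves the two results to be compared:

```python
import gc

import numpy as np
import pandas as pd


def drop_both_ways(n_rows=100_000):
    """Run dropna with and without inplace=True on identical frames."""
    values = np.ones(n_rows)
    values[::10] = np.nan  # every tenth row is missing

    # Variant 1: rebind the name to the new frame returned by dropna().
    reassigned = pd.DataFrame(values.copy())
    reassigned = reassigned.dropna()

    # Variant 2: ask pandas to update the existing object.
    in_place = pd.DataFrame(values.copy())
    in_place.dropna(inplace=True)

    # Free unreachable intermediates before taking any memory reading.
    gc.collect()
    return reassigned, in_place


reassigned, in_place = drop_both_ways()
print(reassigned.equals(in_place))
print(len(in_place))
```

Both variants end up with the same rows. The sketch shows equivalence of the results only; it does not measure memory on its own.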

In [13]: def f(df):
   ....:     x = df.copy()
   ....:     x = x.dropna()
   ....:     x.head()
   ....:     

In [14]: %memit f(df)
peak memory: 631.70 MiB, increment: 0.02 MiB

In [15]: %memit f(df)
peak memory: 631.70 MiB, increment: 0.00 MiB

In [16]: %memit f(df)
peak memory: 631.70 MiB, increment: 0.00 MiB

In [17]: %memit f(df)
peak memory: 631.70 MiB, increment: 0.00 MiB

In [8]: def g(df):
   ...:     x = df.copy()
   ...:     x.dropna(inplace=True)
   ...:     x.head()
   ...:     

In [9]: %memit g(df)
peak memory: 631.64 MiB, increment: 76.30 MiB

In [10]: %memit g(df)
peak memory: 631.65 MiB, increment: 0.00 MiB

In [11]: %memit g(df)
peak memory: 631.65 MiB, increment: 0.00 MiB

In [12]: %memit g(df)
peak memory: 631.67 MiB, increment: 0.00 MiB
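%memit is an IPython magic, so the check above cannot be pasted into a plain script. A rough stand-in can be sketched with the standard-library tracemalloc module. Note the assumptions: tracemalloc counts allocations made through Python's allocators (recent NumPy versions report array buffers to it) rather than process memory, so the numbers are not comparable to the %memit output; the helper `retained_bytes` and the frame size are hypothetical.

```python
import gc
import tracemalloc

import numpy as np
import pandas as pd


def g(df):
    # Same body as g in the session: copy, drop NaNs in place, take the head.
    x = df.copy()
    x.dropna(inplace=True)
    x.head()


def retained_bytes(func, *args, repeats=4):
    """Return the traced bytes still allocated after each call to func."""
    tracemalloc.start()
    readings = []
    for _ in range(repeats):
        func(*args)
        gc.collect()  # collect garbage before reading, as advised above
        current, _peak = tracemalloc.get_traced_memory()
        readings.append(current)
    tracemalloc.stop()
    return readings


df = pd.DataFrame(np.ones(1_000_000))
for i, current in enumerate(retained_bytes(g, df), start=1):
    print(f"call {i}: {current / 2**20:.2f} MiB still allocated")
```

If the readings stay roughly flat after the first call, nothing is being retained between calls, which is what the repeated 0.00 MiB increments in the session indicate.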

jreback closed this as completed Sep 10, 2015
jreback added the Performance label Sep 10, 2015

K11K11 commented Oct 8, 2015

Looks like dropna() is quite memory inefficient for my case too.

This issue was closed.