Memory leak in pd.read_csv or DataFrame #21353

Closed
kuraga opened this Issue Jun 7, 2018 · 9 comments

kuraga commented Jun 7, 2018

Code Sample, a copy-pastable example if possible

import sys

# Generate an m-row by n-column CSV of ones, with header c0,...,c<n-1>.
m = int(sys.argv[1])  # number of rows
n = int(sys.argv[2])  # number of columns

with open('df.csv', 'wt') as f:
    for i in range(n - 1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n - 1) + '\n')
    for j in range(m):
        for i in range(n - 1):
            f.write('1,')
        f.write('1\n')


import psutil

# RSS in MB before pandas is imported.
print(psutil.Process().memory_info().rss / 1024**2)

import pandas as pd
df = pd.read_csv('df.csv')

# RSS in MB after reading the CSV.
print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)

import gc
del df
gc.collect()

# RSS in MB after deleting the DataFrame and forcing a collection.
print(psutil.Process().memory_info().rss / 1024**2)

Problem description

$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375

$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25

$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!

$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375

$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!

Two issues:

  1. There is a "standard" leak of about 53 MB after reading any CSV, or even after just creating a DataFrame with pd.DataFrame() (see the sketch below).
  2. In some other cases there is a much larger leak.
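A minimal sketch isolating issue 1 (my own reduction of the script above; no CSV is involved, and the ~53 MB figure comes from the runs above and will vary by machine):

import gc
import psutil

rss_mb = lambda: psutil.Process().memory_info().rss / 1024**2

print(rss_mb())       # baseline, before pandas is imported
import pandas as pd
df = pd.DataFrame()   # no file is read at all
del df
gc.collect()
print(rss_mb())       # stays roughly 53 MB above the baseline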

cc @gfyoung

Output of pd.show_versions()

(same for 0.21, 0.22, 0.23)

pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung (Member) commented Jun 7, 2018

@kuraga: Thanks for the updated issue!

cc @jreback @jorisvandenbossche

kuraga commented Jun 13, 2018

It seems this is not only a pd.read_csv issue...

[attached screenshot: memory_leak_2]

nynorbert commented Jun 13, 2018

I have a similar issue. I tried to debug it with memory_profiler, but I don't see the source of the leak.
The output of the profiler:

 Line #    Mem usage    Increment   Line Contents
 ================================================
    187    261.3 MiB      0.0 MiB        if "history" in self.watch_list:
    188    491.9 MiB    230.6 MiB            self.history = pd.read_csv(self.path + '/' + self.history_files[self.current][1], delimiter=';', header=None)
    189    491.9 MiB      0.0 MiB            self.history_group = self.history.groupby([0])

This snippet of code is inside a loop, and the memory usage grows on every iteration. I also tried deleting the history and history_group objects and calling gc.collect() manually, but nothing seems to work.
Could this be a cyclic reference between history and history_group? And if it is, why does deleting both history_group and history not solve the problem?

p.s: My pandas version is 0.23.1

nynorbert commented Jun 13, 2018

Sorry, I was wrong. It is not read_csv that consumes the memory, but rather a drop:

Line #    Mem usage    Increment   Line Contents
 ================================================
   265   1425.1 MiB      9.6 MiB                        self.history.drop(self.history_group.get_group(self.current_timestamp).index)

And I think I found out that malloc_trim solves the problem, similar to this: #2659

@kuraga Maybe you should try it.
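
For reference, a minimal sketch of that workaround (Linux/glibc only; trim_memory is a hypothetical helper name, and malloc_trim is a libc call reached via ctypes, not a pandas API):

import ctypes
import gc

def trim_memory():
    gc.collect()                     # release Python-level garbage first
    libc = ctypes.CDLL("libc.so.6")  # assumes glibc is available
    return libc.malloc_trim(0)       # returns 1 if memory was given back

If the growth is allocator retention rather than a true leak, calling trim_memory() after del df and gc.collect() should bring RSS back toward the baseline.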

zhezherun (Contributor) commented Oct 8, 2018

I also noticed a memory leak in read_csv and ran it through valgrind, which said that the result of the kset_from_list function was never freed. I was able to fix this leak locally by patching parsers.pyx and rebuilding pandas.

@gfyoung, could you please review the patch below? It might also help with the leak discussed here, although I am not sure whether it is the same leak. The patch:

  • Moves the allocation of na_hashset further down, closer to where it is used; otherwise it is not freed when continue is executed.
  • Makes sure that na_hashset is deleted if there is an exception.
  • Cleans up the allocation inside kset_from_list before raising an exception.

(A schematic sketch of this cleanup pattern follows the diff.)
--- parsers.pyx	2018-08-01 19:57:16.000000000 +0100
+++ parsers.pyx	2018-10-08 15:25:32.124526087 +0100
@@ -1054,18 +1054,6 @@
 
             conv = self._get_converter(i, name)
 
-            # XXX
-            na_flist = set()
-            if self.na_filter:
-                na_list, na_flist = self._get_na_list(i, name)
-                if na_list is None:
-                    na_filter = 0
-                else:
-                    na_filter = 1
-                    na_hashset = kset_from_list(na_list)
-            else:
-                na_filter = 0
-
             col_dtype = None
             if self.dtype is not None:
                 if isinstance(self.dtype, dict):
@@ -1090,13 +1078,26 @@
                                               self.c_encoding)
                 continue
 
-            # Should return as the desired dtype (inferred or specified)
-            col_res, na_count = self._convert_tokens(
-                i, start, end, name, na_filter, na_hashset,
-                na_flist, col_dtype)
+            # XXX
+            na_flist = set()
+            if self.na_filter:
+                na_list, na_flist = self._get_na_list(i, name)
+                if na_list is None:
+                    na_filter = 0
+                else:
+                    na_filter = 1
+                    na_hashset = kset_from_list(na_list)
+            else:
+                na_filter = 0
 
-            if na_filter:
-                self._free_na_set(na_hashset)
+            try:
+                # Should return as the desired dtype (inferred or specified)
+                col_res, na_count = self._convert_tokens(
+                    i, start, end, name, na_filter, na_hashset,
+                    na_flist, col_dtype)
+            finally:
+                if na_filter:
+                    self._free_na_set(na_hashset)
 
             if upcast_na and na_count > 0:
                 col_res = _maybe_upcast(col_res)
@@ -2043,6 +2044,7 @@
 
         # None creeps in sometimes, which isn't possible here
         if not PyBytes_Check(val):
+            kh_destroy_str(table)
             raise ValueError('Must be all encoded bytes')
 
         k = kh_put_str(table, PyBytes_AsString(val), &ret)
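
For readers who do not follow Cython, the patch boils down to the standard allocate-late / free-in-finally pattern; a schematic Python sketch (all names hypothetical, not pandas code):

def convert_column(allocate, convert, free):
    resource = None
    try:
        resource = allocate()     # allocate just before use, so an
                                  # earlier continue cannot skip the free
        return convert(resource)  # may raise
    finally:
        if resource is not None:
            free(resource)        # runs on success and on exception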

gfyoung (Member) commented Oct 8, 2018

@zhezherun : That's a good catch! Create a PR, and we can review.

jreback added this to the 0.24.0 milestone Oct 10, 2018

kuraga commented Oct 23, 2018

Trying a patch is cool, but I fear the behavior described in #2659 (comment) still applies...

gfyoung added a commit to zhezherun/pandas that referenced this issue Nov 19, 2018

BUG: Fixing memory leaks in read_csv
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.

TomAugspurger added a commit that referenced this issue Nov 19, 2018

BUG: Fixing memory leaks in read_csv (#23072)
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.

kuraga commented Nov 19, 2018

@zhezherun, @TomAugspurger, thanks very much!

But could you please describe the connection to @nynorbert's observation:

And I think I found out that malloc_trim solves the problem, similar to this: #2659

So we had a memory leak in pandas in addition to glibc's behavior of not trimming memory after free?

Thanks.

TomAugspurger (Contributor) commented Nov 19, 2018

I don't know C, so I can't say. Perhaps @nynorbert can clarify.
